Lanciaux Maxime | BI | DWH | Hadoop | DevOps | Google Cloud | DataOps

Showing posts with label Stinger.

Monday, December 29, 2014

Hadoop & Late binding !

Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly, at read time.
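To make the idea concrete, here is a minimal sketch in plain Python (not Hive or a real SerDe). The schema, the column names and the sample bytes are all hypothetical; the point is only that the stored bytes carry no structure, and the structure is chosen by the reader.

```python
import csv
import gzip
import io

# Hypothetical schema, applied only at read time: (column name, converter).
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]

def read_with_schema(raw_bytes, schema=SCHEMA):
    """Decompress raw gzip bytes and yield one dict per CSV record."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    for fields in csv.reader(io.StringIO(text)):
        # The structure is bound here, while reading, not when the file was written.
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

# The "stored" file is just opaque compressed bytes.
raw = gzip.compress(b"1,FR,9.50\n2,CA,12.00\n")
rows = list(read_with_schema(raw))
print(rows)
```

Another reader could apply a different schema to the same bytes without rewriting them, which is the flexibility late binding gives you.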

Unfortunately, late binding can expose data quality problems. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or see the entire job fail (which is annoying when working with 4TB of data ;-))

So my advice is to run at least minimal checks on the edge node before data ingestion.
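As an illustration, such a check could look like the sketch below. It assumes delimited text with a known column count; the function name `check_rows`, the rules (column count, no empty field) and the sample data are hypothetical and should be adapted to your own files.

```python
import csv
import io

def check_rows(lines, expected_columns, delimiter=","):
    """Return (good_count, bad), where bad lists (record_number, reason)."""
    good = 0
    bad = []
    reader = csv.reader(lines, delimiter=delimiter)
    for record_number, fields in enumerate(reader, start=1):
        if len(fields) != expected_columns:
            bad.append((record_number,
                        f"expected {expected_columns} columns, got {len(fields)}"))
        elif any(field.strip() == "" for field in fields):
            bad.append((record_number, "empty field"))
        else:
            good += 1
    return good, bad

# Hypothetical sample: record 2 is short, record 3 has an empty field.
sample = io.StringIO("1,FR,9.50\n2,CA\n3,,7.25\n4,US,3.00\n")
good, bad = check_rows(sample, expected_columns=3)
print(good, bad)
```

Rejecting or quarantining the bad records at this stage is much cheaper than discovering them when a long-running job fails.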

Friday, October 17, 2014

Hive development !

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which enable :

UPDATE, DELETE
BEGIN, COMMIT, ROLLBACK (planned for a later release; in 0.14 every statement auto-commits)
INSERT ... VALUES

Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and others) and Apache Optiq is bringing cost-based optimization to improve performance.

This is really impressive !

Monday, February 10, 2014

HDP >= 2.1 : natively available applications !

Stack components :

MapReduce (API v1 & v2) : software framework for processing vast amounts of data
Tez : more powerful framework for executing a DAG (directed acyclic graph) of tasks
HOYA, HBase on YARN : distributed, column-oriented database
Accumulo : (Linux only) sorted, distributed key / value store
Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
HDFS : Hadoop Distributed File System
WebHDFS : interact with HDFS over HTTP (no client library needed)
WebHCat : interact with HCatalog over HTTP (no client library needed)
YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
Oozie : workflow / coordination system
Mahout : Machine-Learning libraries which use MapReduce for computing
Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Flume : data ingestion and streaming tool
Sqoop : transfer data between Hadoop and relational databases (import and export)
Pig : scripting platform for analyzing large data sets
Hive : tool to query the data using a SQL-like language
Solr : platform for indexing and search
HCatalog : meta-data management service
Ambari : set up, monitor and configure your Hadoop cluster
Phoenix : SQL layer over HBase
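
To show what "over HTTP" means for WebHDFS in the list above, here is a small sketch that only builds the REST URL, using the Python standard library. The host name and path are hypothetical, and 50070 is assumed as the NameNode HTTP port (the usual default on Hadoop 2, but check your cluster).

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Hypothetical NameNode and directory; LISTSTATUS lists a directory.
url = webhdfs_url("namenode.example.com", 50070, "/data/raw", "LISTSTATUS")
print(url)
```

On a reachable, unsecured cluster, fetching this URL with any HTTP client (for example `urllib.request.urlopen(url)`) should return a JSON description of the directory content; secured clusters need authentication on top of this.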

Components being developed / integrated :

Spark : in memory engine for large-scale data processing
Falcon : data management framework
Knox : single point of secure access for Apache Hadoop clusters (uses WebHDFS)
Storm : distributed realtime computation system
Kafka : publish-subscribe messaging system
Giraph : iterative graph processing system
OpenMPI : high performance message passing library
S4 : stream computing platform
Samza : distributed stream processing framework
R : software programming language for statistical computing and graphics

What else ;-) ?

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 was released a few days ago ! Hadoop is no longer only a MapReduce container but a container for multiple data frameworks, and it provides High Availability, HDFS Federation, NFS access and snapshots !

Monday, February 25, 2013

Stinger : Apache Hive 100 Times Faster !

It is awesome ! Read more here !
