Showing posts with label Stinger. Show all posts

Monday, December 29, 2014

Hadoop & Late binding !

Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly.

Unfortunately, this can expose data quality problems. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or see the entire job fail (which is annoying when working with 4TB of data ;-))

So my advice is to run a minimal check [on the edge node] before data ingestion.
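As an illustration of such a check, here is a minimal sketch in Python that flags rows of a delimited file whose column count is not the expected one. The function name `check_delimited`, the sample data and the expected width are assumptions made for this example; a real check would depend on your data and on the SerDe you plan to use.

```python
import csv
import io


def check_delimited(lines, expected_columns, delimiter=","):
    """Return (line_number, column_count) for every row with the wrong width."""
    bad_rows = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row in reader:
        if len(row) != expected_columns:
            # reader.line_num is the number of source lines read so far
            bad_rows.append((reader.line_num, len(row)))
    return bad_rows


# Hypothetical sample: the third line is missing a field.
sample = io.StringIO("id,name,amount\n1,alice,10.5\n2,bob\n3,carol,7.0\n")
problems = check_delimited(sample, expected_columns=3)
print(problems)
```

The same function accepts an open file object, so it can be run on a sample of each file on the edge node before the data is pushed to HDFS.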

Friday, October 17, 2014

Hive development !

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which brings :
  • UPDATE, DELETE
  • BEGIN, COMMIT, ROLLBACK (announced for a later release; in 0.14 every statement is auto-committed)
  • INSERT ... VALUES
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and others) and Apache Optiq is bringing cost-based optimization to improve performance.
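To show what the statement families listed above do, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. This is not Hive and does not use a Hive client; the `accounts` table and its values are made up for the example, and Hive's own syntax and transaction behaviour may differ.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN / COMMIT / ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")

# INSERT ... VALUES
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")

# UPDATE and DELETE inside a transaction that is rolled back: nothing changes.
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 2")
conn.execute("ROLLBACK")
after_rollback = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()

# The same UPDATE inside a committed transaction: the change is kept.
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
conn.execute("COMMIT")
after_commit = conn.execute(
    "SELECT id, balance FROM accounts ORDER BY id"
).fetchall()

print(after_rollback)
print(after_commit)
```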

This is really impressive !

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAGs (directed acyclic graphs) of tasks
  • HOYA, HBase on YARN : distributed, column-oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop Distributed File System
  • WebHDFS : interact with HDFS over HTTP (no client library needed)
  • WebHCat : interact with HCatalog over HTTP (no client library needed)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : Machine-Learning libraries which use MapReduce for computing
  • ZooKeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : transfer data between Hadoop and relational databases (import and export)
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • Solr : platform for indexing and search
  • HCatalog : meta-data management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
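To make the "HTTP only" point about WebHDFS concrete, here is a minimal sketch that builds a WebHDFS REST URL with nothing but the Python standard library. The helper name `webhdfs_url`, the host name and the HDFS path are hypothetical; port 50070 is assumed to be the NameNode HTTP port, which is the Hadoop 2 default but may differ on your cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a URL of the form http://host:port/webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    # quote() escapes spaces and special characters but keeps the slashes
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode host and HDFS directory.
url = webhdfs_url("namenode.example.com", "/user/demo/data", "LISTSTATUS")
print(url)
```

Any HTTP client (curl, a browser, `urllib.request`) can then send a GET request to that URL to list the directory, provided the cluster allows it.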
Components being developed / integrated :
  • Spark : in-memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (uses WebHDFS)
  • Storm : distributed real-time computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : software programming language for statistical computing and graphics
What else ;-) ?

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 was released a few days ago ! Hadoop is no longer only a MapReduce container but a container for multiple data frameworks, and it provides High Availability, HDFS Federation, NFS and snapshots !

Monday, February 25, 2013

Stinger : Apache Hive 100 Times Faster !

It is awesome ! Read more here !