Showing posts with label SolR. Show all posts
Showing posts with label SolR. Show all posts

Monday, March 16, 2015

There are bugs but it is normal life !

Hadoop is evolving very fast and sometimes you can find bugs. Be sure to check for your version / component what are the bugs :

Tuesday, December 30, 2014

Collaborative datalake !

It is holidays now so let's relax a little and imagine some funny things ! 

Why not a collaborative datalake based on Hadoop and web technology which allows users to share both their dataset and the code story to create it ? I would add a vote system too !

Let's see ;-)

Wednesday, July 23, 2014

My Hadoop is not working, what can I do ?

Keep calm and ;-)
  • First check your logs
  • Is the service is running ? (netstat -nat | grep ...)
  • Is it possible to access it ? (telnet ip port)
  • Is there a problem linked with path, java libraries, environment variable or exec ?
  • Am I using the correct user ? 
  • What is the security system in place ?
  • Are nodes well synchronized ?
  • What about memory issue ? (swap should be desactivated also)

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAG (directed acyclic graph) of tasks
  • HOYA, HBase on YARN : distributed, column oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : hadoop distributed file system
  • WebHDFS : interact to HDFS using HTTP (no need for library)
  • WebHCat : interact to HCatalog using HTTP (no need for library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : Machine-Learning libraries which use MapReduce for computing
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : extract and push down data to databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • SolR : plateform for indexing and search
  • HCatalog : meta-data management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : sql layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (use WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : software programming language for statistical computing and graphics
What else ;-) ?

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 has just been released some days ago ! Hadoop is no longer only a MapReduce container but a multi data-framework container and provides High Availability, HDFS Federation, NFS and snapshot !

Wednesday, May 8, 2013

SolR & ElasticSearch !

SolR and ElasticSearch are both great way to add search capability (and more) to your projects. And behind that, there is Lucene !