Lanciaux Maxime | BI | DWH | Hadoop | DevOps | Google Cloud | DataOps

Monday, March 16, 2015

There are bugs but it is normal life !

Hadoop is evolving very fast, so you will sometimes run into bugs. Be sure to check the known bugs for your version and component :

Tuesday, December 30, 2014

Collaborative datalake !

It is the holidays now, so let's relax a little and imagine some fun things !

Why not a collaborative datalake, based on Hadoop and web technologies, which allows users to share both their datasets and the history of the code used to create them ? I would add a voting system too !

Let's see ;-)

Wednesday, July 23, 2014

My Hadoop is not working, what can I do ?

Keep calm and ;-)

First check your logs
Is the service running ? (netstat -nat | grep ...)
Is it possible to access it ? (telnet ip port)
Is there a problem linked to paths, Java libraries, environment variables or exec ?
Am I using the correct user ?
What is the security system in place ?
Are the nodes properly synchronized ?
What about memory issues ? (swap should also be disabled)
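
Two of the checks above can be scripted. The following is only a minimal sketch in Python, not an official Hadoop tool: the function names port_is_open and swap_used_kb are hypothetical helpers made up for this illustration. The first one does the same job as "telnet ip port", the second one reads the text of /proc/meminfo (Linux only) to see whether swap is in use.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (same idea as `telnet ip port`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unknown host, ...
        return False


def swap_used_kb(meminfo_text):
    """Return the swap currently in use, in kB, from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    # SwapTotal - SwapFree is the amount of swap in use; 0 means no swap is used.
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)
```

For example, port_is_open("127.0.0.1", 8020) tests a local service, where the address and port are placeholders to replace with your own, and swap_used_kb(open("/proc/meminfo").read()) should return 0 on a node where swap is disabled.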

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :

MapReduce (API v1 & v2) : software framework for processing vast amounts of data
Tez : more powerful framework for executing a DAG (directed acyclic graph) of tasks
HOYA, HBase on YARN : distributed, column-oriented database
Accumulo : (Linux only) sorted, distributed key / value store
Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
HDFS : Hadoop Distributed File System
WebHDFS : interact with HDFS using HTTP (no need for a library)
WebHCat : interact with HCatalog using HTTP (no need for a library)
YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
Oozie : workflow / coordination system
Mahout : Machine-Learning libraries which use MapReduce for computing
Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Flume : data ingestion and streaming tool
Sqoop : extract data from databases and push data to them
Pig : scripting platform for analyzing large data sets
Hive : tool to query the data using a SQL-like language
SolR : platform for indexing and search
HCatalog : meta-data management service
Ambari : set up, monitor and configure your Hadoop cluster
Phoenix : SQL layer over HBase
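
To illustrate the "no need for a library" point about WebHDFS, here is a minimal sketch in Python using only the standard library. The helper names webhdfs_url and list_status are hypothetical; the URL layout (/webhdfs/v1/<path>?op=...) and the LISTSTATUS operation come from the WebHDFS REST API. The host, the port and the user name are assumptions to adapt to your cluster, and the sketch assumes a cluster without Kerberos.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation name."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def list_status(host, port, path, user=None):
    """List a directory through WebHDFS; needs a reachable NameNode HTTP endpoint."""
    params = {"user.name": user} if user else None
    url = webhdfs_url(host, port, path, "LISTSTATUS", params)
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["FileStatuses"]["FileStatus"]
```

For example, webhdfs_url("namenode.example.com", 50070, "/user/alice", "LISTSTATUS") builds the URL that list_status would call; namenode.example.com and 50070 are placeholders.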

Components being developed / integrated :

Spark : in memory engine for large-scale data processing
Falcon : data management framework
Knox : single point of secure access for Apache Hadoop clusters (uses WebHDFS)
Storm : distributed realtime computation system
Kafka : publish-subscribe messaging system
Giraph : iterative graph processing system
OpenMPI : high performance message passing library
S4 : stream computing platform
Samza : distributed stream processing framework
R : programming language and software environment for statistical computing and graphics

What else ;-) ?

Friday, October 18, 2013

Apache Hadoop 2.0 was released a few days ago ! Hadoop is no longer only a MapReduce container but a container for multiple data frameworks, and it provides High Availability, HDFS Federation, NFS and snapshots !

Wednesday, May 8, 2013

SolR & ElasticSearch !

SolR and ElasticSearch are both great ways to add search capability (and more) to your projects. And behind both of them, there is Lucene !
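
The core idea behind Lucene is the inverted index : a map from each term to the documents that contain it. The following is only a toy sketch in Python to show the principle, not Lucene's actual implementation (which adds analyzers, scoring, segments and much more). The function names tokenize, build_index and search are made up for this illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Calling build_index on a small dictionary of {id: text} and then search(index, "some words") returns the ids of the documents that contain all the words of the query.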
