Monday, November 16, 2020

Data Mesh on Google Cloud Platform (and this is excellent !)

Hello from Home 🏡,

Quick update : I am leading the creation of a new deck explaining the Data Mesh architecture and why Google Cloud Platform is an amazing fit for this approach :
  • GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
  • Serverless
  • Pay as you go
  • Google APIs
  • Looker (esp. Looker API and the semantic layer)
  • Scalability
  • BigQuery / Cloud Storage / Cloud Pub/Sub
  • Ephemeral Hadoop cluster (Dataproc)
  • IAM
  • Cloud Source Repositories / Cloud Build
This [new] architecture is not a huge revolution (and this is great) : it builds on 40+ years of data platform innovation and follows the same approach as microservices / Kubernetes.
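To give a taste of the "GCP project per data domain" mapping, here is a minimal sketch of how a producing domain could expose a BigQuery dataset (its data product) to a consuming domain, using the google-cloud-bigquery client ; the project, dataset and group names are hypothetical :

# pip install google-cloud-bigquery
from google.cloud import bigquery

# Hypothetical names : one GCP project per data domain.
PRODUCER_PROJECT = "sales-domain-prj"
CONSUMER_GROUP = "marketing-team@example.com"

client = bigquery.Client(project=PRODUCER_PROJECT)
dataset = client.get_dataset(f"{PRODUCER_PROJECT}.sales_products")

# Grant the consuming domain read-only access to the data product.
entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(role="READER",
                                    entity_type="groupByEmail",
                                    entity_id=CONSUMER_GROUP))
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])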



Stay Data Mesh tuned !

Sunday, March 29, 2020

Stay safe !

Good luck to you and your family during this strange period.

Sunday, March 15, 2020

How to DataOps with Google Cloud Platform !

What do we want to achieve ?

Use DataOps to monitor information from Twitter about Google.
  • Without doing IaaS (infrastructure), so using Google Cloud managed services or serverless technologies
  • Making sure all assets are stored in a repository with dev and master branches
  • No manual steps to test or push content to our Google Cloud project
  • Ensure I can adapt to data structure changes and so replay all data processing from scratch
  • Keep all data and compress it

What do we need :

Let's do it !

  1. Schedule a task every minute to gather tweets from the Twitter API, then store them in GCS
  2. Schedule a task every day to compress all the previous data into a tar.gz file
  3. Read the compressed archive and load it into BigQuery with adaptive schema capabilities
  4. Build the corresponding reporting
More information and code soon !
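In the meantime, a minimal sketch of step 1 as a Cloud Function body, assuming a hypothetical bucket name and Twitter bearer token (the Cloud Scheduler trigger is left out) :

import datetime
import json

import requests
from google.cloud import storage  # pip install google-cloud-storage

BEARER_TOKEN = "..."            # hypothetical Twitter API credential
BUCKET_NAME = "my-dataops-raw"  # hypothetical GCS bucket

def gather_tweets(event=None, context=None):
    """Fetch recent tweets about Google and store the raw JSON in GCS."""
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": "Google", "result_type": "recent", "count": 100},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    resp.raise_for_status()
    # One object per run, partitioned by timestamp : raw, immutable, replayable.
    path = datetime.datetime.utcnow().strftime("tweets/%Y/%m/%d/%H%M.json")
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(path).upload_from_string(json.dumps(resp.json()),
                                         content_type="application/json")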

Saturday, September 21, 2019

How to DataOps with Google Cloud Platform !

Hello from Budapest,

It's been a long time since I last had the chance to look at my blog, so quick news : I will now restart sharing here, and the first topic is going to be DataOps on Google Cloud Platform using BigQuery, Data Studio, Jenkins, Stackdriver, Google Cloud Storage and more !



Stay Data tuned !

Wednesday, May 30, 2018

Getting Things Done

Hello from Langkawi,

A quick one before boarding. I have had the chance to work with several customers, using several technologies, on several environments and several versions... Things keep changing / evolving, especially when the customer and / or management changes priorities, or bugs occur in production (it can happen with Hadoop).

How to keep track of your tasks (from customer questions, to admin / expense tasks, to private items to achieve) ? I tried different ways, from emails, post-its, Wunderlist, Trello and Google Keep, and the only one that worked for me is Todoist.

Why ?
  • Easy to set up recurring tasks
  • Karma / graph view to see the number of tasks achieved per day / week
  • You can assign a color to each project and also create hierarchies of projects
  • Possible to share a project with another user
Hope it will help ;-)
Cheers


Tuesday, July 18, 2017

Wednesday, April 5, 2017

Monday, March 20, 2017

Chief DevOps Officer ! #automation

This is the new trendy job ; here is what I think the mission should be :

  • Automation using DevOps
  • Improving metrics gathering & reporting
  • Quality improvement driven by Pareto analysis

Sunday, March 5, 2017

Friday, March 3, 2017

NetData !

Very cool tool to monitor your lovely servers : NetData ! #Enjoy #Thanks

Monday, January 9, 2017

Open source

A fantastic open source, Tableau-like reporting and data exploration tool to look at : Superset, from the fantastic ladies and gentlemen at Airbnb.

Wednesday, October 12, 2016

How to install Ansible on Ubuntu !

sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible

Monday, October 3, 2016

A normal process while doing automation #enterpr1se 3.0 !

Every time you want to automate an IT process :
  • Look for Pareto
  • Create the according test
  • Set up quality control / performance metrics and dashboards
  • Gather the logs
  • Configure backups
  • Ensure the code / configuration is pushed to the configuration management tool
  • Be sure to be compliant with your security policy

What is DevOps ?


Friday, August 12, 2016

Tuesday, July 26, 2016

Minimum set of tools to do DevOps with Hadoop !

DevOps is a way of working / a set of frameworks to ease the life of IT teams, from developers to admins / production, in a complex, multi-parameter, collaborative environment.
  • Continuous integration tool : Jenkins
  • Build automation tool : Maven
  • Team code collaboration tool : Gerrit Code Review
  • Database : PostgreSQL
    • Admin and application monitoring
  • Visualisation tool : Zeppelin
    • CEO / admin / dev dashboards
  • A versioning tool : Git
    • code
    • configuration
    • template

Thursday, July 21, 2016

How to install Zeppelin in a few lines !

wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvfz apache-maven-3.3.9-bin.tar.gz
sudo apt-get install -y r-base
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libcurl4-gnutls-dev
sudo R -e "install.packages('evaluate')"
sudo R -e "install.packages('devtools', dependencies = TRUE)"
export JAVA_HOME=/usr/jdk/jdk1.8.0_60
git clone https://github.com/apache/zeppelin.git
cd zeppelin
../apache-maven-3.3.9/bin/mvn clean package -DskipTests -Pspark-1.6 -Pr -Psparkr
./bin/zeppelin-daemon.sh start

So cool to finally have an easy Open Source tool to do reporting and visualisation !

Wednesday, July 6, 2016

List of Jenkins plugins and configuration for Hadoop automatic deployment !

Configure :
  • JDK
  • Maven
  • Security
  • Share the SSH public key from the Jenkins hosts
Plugins : 
Currently in testing :

Sunday, June 26, 2016

Don't forget the basics !

Working as a Hadoop DBA, I have noticed several times that the previous admin forgot to :

Sunday, June 19, 2016

Saturday, June 18, 2016

Friday, June 10, 2016

My Hadoop is not efficient enough, what can I do ?

1. Review your memory configuration to maximize CPU utilisation
2. Review your YARN settings especially the Capacity Scheduler
3. Review your application design, parameter used, join strategy, file format

Of course, while checking your Ganglia / Ambari Metrics, voilà !

PS : For those who don't trust multi-tenant Hadoop clusters, please call me ;-)

Saturday, May 28, 2016

How to automate Data Analysis ? #part2

Here we go : I coded a prototype to

  • help parse CSV files (database and JSON support will be added later)
  • load the data into Hadoop
  • create the corresponding Hive ORC table
  • run simple queries to extract information
    • MIN, MAX, AVG
    • Top 10
    • COUNT(DISTINCT ...), COUNT(*) (if timestamp, by YEAR and YEAR / MONTH) and NULL values
    • Regexes matching the records
You can find the code here !

The next step will probably be to add Spark code generation.
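To illustrate the idea (this is not the actual prototype, just a minimal sketch assuming a hypothetical CSV file with a header row, every column typed as STRING to start with) :

import csv

def hive_orc_ddl(csv_path, table_name):
    """Generate a CREATE TABLE ... STORED AS ORC statement from a CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    cols = ",\n".join(f"  `{c.strip().lower()}` STRING" for c in header)
    return f"CREATE TABLE {table_name} (\n{cols}\n) STORED AS ORC;"

def profile_query(table_name, column):
    """Generate a simple profiling query (MIN / MAX / distinct / NULL) for one column."""
    return (f"SELECT MIN({column}), MAX({column}), COUNT(DISTINCT {column}), "
            f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count "
            f"FROM {table_name};")

print(hive_orc_ddl("export.csv", "t_export"))       # hypothetical file / table names
print(profile_query("t_export", "customer_id"))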

Tuesday, April 19, 2016

My Hadoop application gets stuck !

If you are running a multi-application environment, you can reach the point where you can't allocate any more mappers / reducers / containers, and so some of your applications are waiting for resources and get stuck.

In that case, review your Capacity Scheduler queue settings (capacity and elasticity), check mapreduce.job.reduce.slowstart.completedmaps and enable preemption !

Monday, March 7, 2016

A list of useful R packages !

Here you go :
  1. sqldf (for selecting from data frames using SQL)
  2. forecast (for easy forecasting of time series)
  3. plyr (data aggregation)
  4. stringr (string manipulation)
  5. Database connection packages : RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite
  6. lubridate (time and date manipulation)
  7. ggplot2 (data visualization)
  8. qcc (statistical quality control and QC charts)
  9. reshape2 (data restructuring)
  10. randomForest (random forest predictive models)
  11. xgboost (Extreme Gradient Boosting)
  12. RHadoop (Connect R with Hadoop)
And don't forget http://statmethods.net/ !

Saturday, February 27, 2016

Postgres !

PostgreSQL is an excellent Open Source database for small and medium projects. You will find a lot of amazing features like HA, statistics functions, Row Level Security, JSON support and UPSERT.

Tuesday, February 23, 2016

Tableau !

For those who still don't have an amazing reporting tool ;-) : Tableau.

Thursday, January 21, 2016

How to automate Data Analysis ? #part1

When I first started this project, I was wondering how to speed up the work done by analysts. I figured out there is a lot to do here :
  • What can be pre-processed (scripts to load a file and create the corresponding table, so that as soon as a new file / table is created we get the statistics for every column, the regexes, the NULL values ; see the sketch after this list)
  • What can be automatically discovered (this column can be used to join with this table, you have one element more though ; users in the group [female] have an average .... compared to the group [male])
  • Things that can be generated on the fly (mainly code : R code, SAS code, SQL code, mostly based on templates)
  • Things that can be parameterized
    • How often have I heard "I will try that later with this assumption" ! What if we could easily parameterize every step of the data workflow ? Useful as well for multi-dimension, matrix-based test cases
  • Things that can be automated / triggered
    • I just created a new variable : what does it bring to my workflow ?
    • Variable reduction / transformation
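As a taste of the pre-processing bullet, a minimal sketch of a column profiler in pure Python, assuming a CSV file with a header row :

import csv

def profile_csv(path):
    """Compute per-column NULL counts and distinct counts from a CSV file."""
    nulls, distincts, rows = {}, {}, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for col, val in row.items():
                if not val:  # empty string or missing field counts as NULL
                    nulls[col] = nulls.get(col, 0) + 1
                distincts.setdefault(col, set()).add(val)
    for col in distincts:
        print(f"{col} : {nulls.get(col, 0)}/{rows} NULL, {len(distincts[col])} distinct")

profile_csv("export.csv")  # hypothetical input file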

Wednesday, January 13, 2016

The HadoopAutomator !

2016 will be the year of automation. I am currently working on several projects to automate almost everything (from installation to automatic data analysis and reporting), mainly using :

Friday, December 18, 2015

Photo #20 !


Merry Christmas and Happy New Year to all of you !

Saturday, November 14, 2015

Wednesday, November 11, 2015

Jenkins, Maven, SVN and Hortonworks HDP2.3 sandbox !

If you are also an automation and Open Source fan and you are (or not) in the process of building Hadoop applications, I strongly suggest using (at minimum) :
  • Continuous integration tool (Jenkins, TeamCity, Travis CI)
  • Build tool (Maven, Ant, Gradle)
  • Provisioning tool (Chef, Ansible, shell script, Puppet)
  • Versioning system (Git, SVN, CVS)
In order to improve overall project quality / stop losing time / ease Hadoop migration and testing / be more efficient (yes, a lot of good reasons).

I have the pleasure of using SVN + Jenkins + Maven + a few shell scripts + the HDP sandbox on my laptop, and this is really awesome.

Thanks ;-)

Friday, November 6, 2015

Hadoop Mini Clusters !

A nice project here to do some local test / development ! You can find other interesting projects in the Hortonworks gallery.

Sunday, October 25, 2015

Wednesday, October 21, 2015

Hadoop and version control software !

For those who don't use Ambari, or for those [edge] nodes which are not synced, please be sure to use version control software, so your team / admins will know which libraries / configuration files / links have been modified, by whom / when / why [and life will be easier].

Monday, October 19, 2015

Sunday, October 11, 2015

Meta-development and [automatic] analytics !

It can sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least make more suggestions.

For most of our BI-like / analytics projects we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest / transform it to get a structure, and how to enrich it to get the semantics / standards for our business users / reporting.

What if I dedicated 20% of the space / CPU of my [Hadoop] platform, what if my platform knew some heuristics and made some assumptions ? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like :

  • Those files seem to be received as a daily full dump, do you agree ?
  • This dataset can be mapped to this schema <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <columnname> = "October" ; do you want to create a rule ?
  • This file contains on average 45000 lines +/- 5%, except for 3 days
  • <column name> can be used to join these two tables, the matching will be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy, the best model is <model name>, the top variables are <list of variable name> ; can you think of a new variable ?

Wednesday, August 26, 2015

Apache Zeppelin !

Apache Zeppelin is a really cool open-source web-based notebook which supports collaborative editing, basic reporting capabilities, data discovery and multiple language backends !

Saturday, May 23, 2015

Sunday, March 29, 2015

Photo #17 !


How to structure my datalake ?

These days I am working on an interesting topic : how to build a datalake. Actually it can be straightforward ; this is one possible way to do it :
  • Start by creating a directory "datalake" at the root
  • Depending on your company / team / project, add sub-folders to represent your organisation
  • Create technical users (for example etl...)
  • Use your HDFS security system to configure permission
  • Set up quota 
  • Add sub-directories for every
    • source
    • version of the source
    • data type
    • version of the data type
    • data quality
    • partition [Year / Month / Day | Category | Type | Geography]
  • Configure workload management for each [technical] user / group
  • For every folder, create the metadata [once] to allow users to query it [many times] from Hive / HCatalog / Pig / others
  • Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
    • in order to avoid the small file problem
  • Use a naming standard at the file level to allow data lineage and guarantee one-time processing
And so the datalake becomes an organized and smart raw dump.
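A minimal sketch of such a layout in Python, with hypothetical team / source names :

import datetime

def datalake_path(team, source, source_version, data_type, type_version, day):
    """Build a partitioned HDFS path following the layout above."""
    return ("/datalake/{team}/{source}/v{sv}/{dtype}/v{tv}/"
            "year={y}/month={m:02d}/day={d:02d}").format(
                team=team, source=source, sv=source_version,
                dtype=data_type, tv=type_version,
                y=day.year, m=day.month, d=day.day)

print(datalake_path("marketing", "crm_export", 1, "customer", 2,
                    datetime.date(2015, 3, 29)))
# -> /datalake/marketing/crm_export/v1/customer/v2/year=2015/month=03/day=29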

Monday, March 16, 2015

There are bugs but it is normal life !

Hadoop is evolving very fast and sometimes you can find bugs. Be sure to check which bugs are known for your version / component :

Sunday, February 8, 2015

Meta-development !

It has been more than five years now that I have been working on BI / IT projects. There are a lot of things to cover :
  • Data [capture / quality / preparation / management / usage / monetization]
  • Security
  • Workload management, scheduling
  • Backups strategy
  • Optimisation, design patterns, good practices
  • SLAs, service & access
  • Analytics
  • Migration
  • Team, role, organization, responsibilities
And technology is evolving very quickly (especially Hadoop). I have seen a lot of smart people working hard to make it happen.

(Yes, so far you haven't learned anything new) ; what I am wondering is why the technology we are using is not smarter.

I think we should be more meta-development oriented : our technology should be able to understand more of the data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by indicating how to parse a date... The same goes for projects : some tasks should be created automatically. Last but not least, it should be the same for analytics : instead of writing some great SQL-like code, I am more looking for a kind of "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset".

It is quite slippery and cold in Stockholm, but have a good week-end ;-) !

Friday, January 23, 2015

What makes you a better man / woman ?

Last year I took some time to think about my condition and life. I am a quite motivated human and enjoy living and working hard (and there is a lot to do / to learn).

I am not going to debate the possible answers because it really depends on people / culture / [basic / complex] needs, but I invite you, my dear reader, to think about it.

For me : I would like one day to help mankind, and I like to feed my brain daily.

All the best ;-)

Tuesday, December 30, 2014

Collaborative datalake !

It is holidays now so let's relax a little and imagine some funny things ! 

Why not a collaborative datalake based on Hadoop and web technology which allows users to share both their datasets and the code story that created them ? I would add a voting system too !

Let's see ;-)

Monday, December 29, 2014

Photo #15 !


Hadoop & Late binding !

Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly.

Unfortunately, there may be some problems linked with data quality. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or the entire job can fail (which is annoying when working with 4TB of data ;-)).

So my advice is to do a minimum of checking [on the edge node] before data ingestion.
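For example, a minimal sketch of such a pre-ingestion check, assuming a delimited file and a hypothetical expected column count :

import gzip
import sys

def check_file(path, expected_columns, sep=";"):
    """Reject a file before ingestion if any line has the wrong column count."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for lineno, line in enumerate(f, start=1):
            n = len(line.rstrip("\n").split(sep))
            if n != expected_columns:
                print(f"line {lineno} : {n} columns instead of {expected_columns}")
                return False
    return True

if not check_file(sys.argv[1], expected_columns=12):  # 12 is hypothetical
    sys.exit(1)  # do not ingest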

Friday, December 19, 2014

I want the latest Hadoop improvement / new tool !

For those who are really excited by the latest improvements of Hadoop, I invite you to try them first on a development environment. Sometimes Hadoop is a little too fresh and there are some bugs.

That is also why having Hadoop, most of the time, implies doing a migration every 6 months to get the latest patches.

Thursday, December 18, 2014

WebHCat !

WebHCat is the REST API for HCatalog, so it is for those who want to use REST and submit Hive queries ;-) !

Wednesday, December 17, 2014

Hadoop Deprecated Properties !

Sometimes it is important to know if your property is deprecated and then which other one to use : Hadoop Deprecated Properties.

Friday, December 12, 2014

PIG lipstick !

This is a great project : a GUI for Pig !

Thursday, December 11, 2014

Saturday, December 6, 2014

Using Open Data & Machine Learning !

Before, I was not convinced that Open Data could bring more value to my projects. Lately, just using Open Data, I was able to build an efficient model to predict the dengue rate in Brazil with the Least Angle Regression algorithm. To do so, we used weather data (wind, temperature, precipitation, thunder / rain rates, ...), altitude, localisation, urbanization, Twitter / Wikipedia frequency and custom variables (mostly lags).

Monday, November 17, 2014

IPython Notebook !

After RStudio Server, there is also a way to use Python from a website : IPython Notebook.

Friday, October 17, 2014

Hive development !

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which leads to :
  • UPDATE, DELETE
  • BEGIN, COMMIT, ROLLBACK
  • INSERT ... VALUES
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and others) and Apache Optiq is bringing cost-based optimization to improve performance.

This is really impressive !

Friday, October 10, 2014

Meta-development with R !

I am now going deeper into R, which is great ! I recommend looking at these functions when you want to write code which generates and evaluates code directly :
  • assign()
  • eval()
  • parse()
  • quote()
  • new.env()
It is not the best in terms of performance but it can be really useful for dynamic coding. Have a great week-end ;-) !

Monday, September 22, 2014

Sunday, September 21, 2014

Summingbird !

Last week, I went to a meetup about streaming platforms and there was a great guy who presented Summingbird : a library that lets you write MapReduce programs that look like native Scala or Java collection transformations, and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Friday, August 22, 2014

Jazz !

Let's relax with a little Jazz !

Tuesday, August 19, 2014

Unlockyourbrain !

This application is quite nice if you want to improve the way you unlock your phone !

Wednesday, August 6, 2014

How to be more productive !

Take 10 minutes and read the Aaron Swartz post !

Tuesday, August 5, 2014

Scale Open Source R with AsterR or Teradata 15 !

I recently contributed to a great project which deals with using R in a distributed way within Aster and Teradata. I rediscovered that R is really permissive, flexible and powerful.

Monday, August 4, 2014

RStudio Server

You perhaps know the RStudio IDE, which is really nice. But if you want to use the RAM and the CPU of another server, you can also install RStudio Server and access your R environment using a browser-based interface, and it rocks !

Saturday, August 2, 2014

Mankind is quite stupid...

I mean, we have done a lot of amazing studies and innovations in the fields of Science, Technology and Health. But the fact that we have created this garbage continent, the way we run most of our businesses by enriching shareholders, or all the political corruption you can discover in daily life... It deserves a very big WTF !

Have a good week-end ;-)

Thursday, July 31, 2014

Partial Redistribution Partial Duplication

PRPD is a new feature in Teradata since 14.10 which improves joins on skewed tables (it relies on statistics to identify skewed values). This is a smart way to avoid the DBA having to create surrogate keys !

Wednesday, July 30, 2014

NVD3 : Re-usable charts for d3.js

If you don't want to start from scratch with D3.js, have a look at NVD3.js ;-)

Tuesday, July 29, 2014

Dataiku !

Dataiku is a French startup which provides a great web-based platform to accelerate data science projects, and there is an open-source version !

Thursday, July 24, 2014

Wednesday, July 23, 2014

My Hadoop is not working, what can I do ?

Keep calm and ;-)
  • First check your logs
  • Is the service running ? (netstat -nat | grep ...)
  • Is it possible to access it ? (telnet ip port)
  • Is there a problem linked with paths, Java libraries, environment variables or exec rights ?
  • Am I using the correct user ? 
  • What is the security system in place ?
  • Are nodes well synchronized ?
  • What about memory issues ? (swap should also be deactivated)

Monday, July 14, 2014

Virtual Desktop on Windows !

For those who come from Linux or macOS and would like virtual desktops on Windows :-)

Wednesday, May 14, 2014

Chief Data Officer

I would like to meet people who are working as CDO : Chief Data Officer. It looks like a very interesting job (data quality, data management, data everything) and it should be very helpful for the data preparation I need before running analytics workflows / discovery processes.

Monday, May 5, 2014

Hive development !

A lot of improvements in this new release of Hive !
  • [NOT] EXISTS and [NOT] IN are available
  • WITH t_table_name AS ..., well known as Common Table Expressions
  • SELECT ... WHERE (SELECT c_column1 FROM ...), i.e. Correlated Subqueries
  • The SQL authorization system (GRANT, REVOKE) is now working
  • The Tez engine, which can be enabled thanks to set hive.execution.engine=tez;

Thursday, April 24, 2014

Python !

Python is already almost everywhere and is used in production at Google. It is a very powerful programming language to turn your wishes (from Web to GUI) into a script !
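A tiny taste of that reach, using only the standard library (the URL is just an example) :

import urllib.request
from collections import Counter

# Fetch a page from the web and print its most common words.
html = urllib.request.urlopen("https://www.python.org").read().decode("utf-8", "replace")
words = [w.lower() for w in html.split() if w.isalpha()]
print(Counter(words).most_common(5))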

Wednesday, April 23, 2014

HBase coprocessor !

If you need to execute some custom code in your HBase cluster, you can use HBase coprocessor :
  • Observers : like triggers in RDBMS
    • RegionObserver : To pick up every DML statement : get, scan, put, delete, flush, split
    • WALObserver : To intercept WAL writing and reconstruction events
    • MasterObserver : To detect every DDL operation
  • EndPoints : kind of stored procedure

Wednesday, April 2, 2014

Scikit-learn !

Scikit-learn is an open-source machine learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible !
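A minimal taste, assuming scikit-learn is installed (pip install scikit-learn) ; training and scoring on the same data is just for illustration :

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the bundled iris dataset.
iris = load_iris()
clf = RandomForestClassifier(n_estimators=50)
clf.fit(iris.data, iris.target)

print(clf.score(iris.data, iris.target))   # training accuracy
print(clf.predict(iris.data[:3]))          # predicted classes for 3 flowers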

Monday, March 31, 2014

Teradata & Hadoop !

Teradata and Hadoop interact well together, especially inside the UDA with an InfiniBand interconnect. To know which platform to use when, you should look at your needs, where the largest volume is, and each platform's capabilities.

If you want to transfer data, you can consider :

Friday, March 28, 2014

Machine Learning with Aster !

I am now working with Aster to do Machine Learning and statistics. Here are the functions you can use :
  • Approximate Distinct Count : to quickly estimate the number of distinct values
  • Approximate Percentile : to compute approximate percentiles
  • Correlation : to determine if one variable is useful for predicting another
  • Generalized Linear Regression & Prediction : to perform linear regression analysis
  • Principal Component Analysis : for dimensionality reduction
  • Simple | Weighted | Exponential Moving Average : to compute averages with a special algorithm
  • K-Nearest Neighbor : classification algorithm based on proximity
  • Support Vector Machines : build an SVM model and do prediction
  • Confusion Matrix [Plot] : visualize ML algorithm performance
  • Kmeans : the famous clustering algorithm
  • Minhash : another clustering technique which depends on the set of products bought by users
  • Naïve Bayes : useful classification method, especially for documents
  • Random Forest Functions : predictive modelling approaches broadly used for supervised classification learning

Tuesday, March 11, 2014

Teradata’s SNAP Framework !

Teradata’s Seamless Network Analytic Processing Framework is one of the great ideas inside the Aster 6 database. It allows users to query different analytical engines and multiple types of storage using a SQL-like programming interface. It is composed of a query optimizer, a layer that integrates and manages resources, an execution engine and the unified SQL interface. These are the main components and their goals :
  • SQL-GR & Graph Engine : provides functions to work with edges, vertices, [un|bi|]directed or cyclic graphs
  • SQL-MR : library (Machine Learning, Statistics, Search behaviour, Pattern matching, Time series, Text analysis, Geo-spatial, Parsing) to process data using MapReduce framework
  • SQL-H : easy to use connection to HDFS for loading data from Hadoop
  • SQL : join, filter, aggregation, OLAP, insert, update, delete, CASE WHEN, table
  • AFS connector : SQL-MR function to map AFS file to table
  • Teradata connector : SQL-MR function to load data from / to Teradata RDBMS
  • Stream API : plug in your Python, Ruby, Perl, C[|++|#] scripts and use the Aster worker nodes' CPUs to process them

Tuesday, February 25, 2014

Sunday, February 23, 2014

Scam #1

I rented a car using locationdevoiture.fr ; they called pretending there was a problem during the website registration and they changed the date, so when I arrived to take the car : no voucher, no reservation... #becareful #scam

Friday, February 21, 2014

Taxi !

I just took a taxi this morning and, because I am a foreigner, the guy took the wrong direction... I like to travel, but not before going to work ;-). But thanks to Google Maps, and because I remembered the price I paid the first day, everything went well. Here is the advice I would like to share :

  • Take your phone, use Google Maps and show the driver where you want to go
  • Don't take a taxi near the tourist spots, nor your hotel's taxi
  • Ask for the (approximate) price before leaving

Saturday, February 15, 2014

My Happiness recipe !

  • Travelling every 4 months
  • Save money by using smart websites
  • Drink Pret A Manger soup, hot hazelnut chocolate at Starbucks, or tea (especially green tea)
  • Ride a motorbike when it is sunny
  • Enjoy your family
  • Share knowledge & keep discovering (not only IT)
  • Try new food or new restaurants
  • Gather with your friends and have some drinks ;-)
  • Eat healthy (yes, you are what you do but also what you eat !)
  • Take photos using Instagram and share them with your friends !
  • Write down what is important for you
  • Read, read, read, especially before going to sleep
  • Wake up at 07h07, go to sleep at 22h22 (at least try)
  • Close your computer now and do some sports ;)

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAG (directed acyclic graph) of tasks
  • HOYA, HBase on YARN : distributed, column oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop distributed file system
  • WebHDFS : interact to HDFS using HTTP (no need for library)
  • WebHCat : interact to HCatalog using HTTP (no need for library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : Machine-Learning libraries which use MapReduce for computing
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : extract and push down data to databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • SolR : platform for indexing and search
  • HCatalog : meta-data management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (use WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : software programming language for statistical computing and graphics
What else ;-) ?

Tuesday, February 4, 2014

Basic statistics with R !

I am quite sure you already know the basic functions (mean, median, sd, var, quantile, summary), but they are really useful, especially with na.rm=TRUE !
And don't forget t.test and prop.test !

Saturday, February 1, 2014

Lego & Chrome !

For now, there are not a lot of pieces, but it can give you great moments with your child / nephew ;-)

Monday, January 27, 2014

Main Hadoop 2.0 daemons !

  • NameNode : one per namespace, stores & handles HDFS metadata
  • Secondary NameNode : (for now) still in use if no HA
  • Checkpoint node : (later) multiple checkpoint nodes are possible, performs periodic checkpoints
  • Backup node / Standby node : allows high availability, keeps an updated copy of the namespace in its memory (if one is used, no Checkpoint node is allowed)
  • DataNode : stores HDFS data

  • ResourceManager : a global pure scheduler
  • ApplicationMaster : one per application, negotiates resources with the RM, monitors and requests task execution from the NodeManagers
  • NodeManager : one per slave server ; an application container launcher and reporting agent
  • Application Container : a separate processing unit ; it can be a Map, a Reduce or a Storm Bolt, etc.

Thursday, January 16, 2014

S.A.R.A.H

I will set up S.A.R.A.H soon, let's have an intelligent house and enjoy IoT ;-)

Monday, January 13, 2014

Google Keep !

In 2013, I tried some software to improve my organisation and I found one quite smart & useful : Google Keep. You can create tasks or task lists, add colors, pictures and reminders (by date or location), and it is synchronised with your Android device !

Tuesday, January 7, 2014

Hadoop & Java !

Thanks to UD[AT]Fs or MapReduce, you can work directly in Java and use your Hadoop resources. Because of the huge number of Java libraries, you can imagine extracting data directly from HTML / XML files, mixing it with reference / parameter data (JDBC loading), and transforming it into Excel files in one job !

Friday, January 3, 2014

Can we eat to starve cancer ?

William Li: http://on.ted.com/tbWi

Never give up !

Diana Nyad : http://on.ted.com/hyPR

Sunday, December 29, 2013

What are your passions ?

I always like to hear about other people's passions.

My passions/interests are :
  • IT
  • Solving problem
  • Discovering & Travelling
  • Human & Sharing & Realisation
  • Health
  • Build things to make this world smarter and safer
  • Running (marathon for 2014) & sports in general !
  • Aviation & aeromodelling & flight simulation
  • English
  • Motorbiking
  • Eating & discovering food
  • Movies
  • Automatic watches
  • Having fun ;-)
So if we meet, tell me yours !

Saturday, December 28, 2013

Thursday, December 26, 2013

Thursday, December 5, 2013

Decision trees & R | Mahout !

Yesterday, I was asked "how can we visualise what leads to problems ?". To me, one of the best ways is using decision trees with R or Mahout !

And you can do predictions & draw them nicely !

Sunday, December 1, 2013

Orion

Sometimes I use Eclipse Orion ; it's really useful to have a cloud-based IDE !

AngularJS

Try AngularJS, a practical JavaScript framework !

Saturday, November 30, 2013

Lambda architecture

In this post I would like to present one possible software stack for the lambda architecture :

Speed layer : Storm, HBase

Storm is the real-time ETL, and HBase, because of its random, realtime read / write capability, is the storage !

Batch layer : Hadoop, HBase, Hive / Pig / [your datawarehouse]

To allow recomputation, just copy your data, har / compress it and plug in a partitioned Hive external table. Then you can create complex Hive workflows and, why not, push some data (statistics, machine learning) back to HBase !

Serving layer : HBase, JEE & JS web application

JEE is convenient because of the HBase Java API, and JDBC if you need to cache some reference data. And you can use some JavaScript chart library.

Stay KISS ;-)

Photo #11


Nagios script & Hadoop !

A useful link to help you monitor your Hadoop cluster !

LAMP became MEAN !

MEAN is the new JavaScript-powered way to develop web applications !

Sunday, November 3, 2013

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 was released just a few days ago ! Hadoop is no longer only a MapReduce container but a multi data-framework container, and it provides High Availability, HDFS Federation, NFS support and snapshots !

Big-LambData Architecture !

Nathan Marz proposed applying the lambda philosophy to big data architectures, and it can help when you have to solve use cases using both batch and real-time processing systems.

The lambda architecture is based on three main design principles :
  • human fault-tolerance
  • data immutability
  • recomputation

Quality Function Deployment !

I like to use Japanese methods ; QFD is one of my favourites for improving / solving complex IT issues !

Sunday, October 13, 2013

Kafka !

Kafka is a good solution for high-throughput, large-scale message processing applications !

Hive development !

A lateral view simplifies the use of UDTFs :

SELECT column1, column_udtf1
FROM t_table
LATERAL VIEW explode(array_column) ssreq_lv1 AS column_udtf1
;

And with Hive 0.11, you now have ORC Files and windowing functions :
  • LEAD / LAG
  • FIRST_VALUE / LAST_VALUE
  • RANK / ROW_NUMBER / DENSE_RANK
  • CUME_DIST / PERCENT_RANK / NTILE
which is convenient for our BI needs !

Sunday, September 29, 2013

Friday, September 27, 2013

Business Intelligence & Hadoop

Most of the time, BI means a snowflake or star schema (or hybrid or complex ER). But with Hadoop you should rather think about denormalization, a big ODS, a powerful ETL, a great place for your fact data and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not !

Music #2

Stromae - Papaoutai | Rammstein - Ohne Dich | Bruno Mars - Locked Out Of Heaven | Selah Sue - Raggamuffin | c2c - Down The Road | Martin Garrix - Animals | Macklemore & Ryan Lewis - Can't Hold Us | Avicii - Wake Me Up | Kavinsky - Roadgame | Imelda May - Tainted Love

Wednesday, September 25, 2013

Hadoop & compression !

Compression with Hadoop is great ! You can reduce IO and network exchanges and store more data, and most of the time your Hive / Pig / MapReduce jobs will even be a little faster.

Depending on what your needs are, you should think about Snappy, LZO, LZ4, bzip2 or gzip.
 
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Flume daemons !

  • Source (consumes events delivered to it by an external source)
  • Channel (temporarily stores the event data and helps provide end-to-end reliability of the flow)
  • Sink (removes the event from the channel and transfers / writes it)
They all run asynchronously, with the events staged in the channel.

Thursday, September 12, 2013

Node.js

Node.js is a great way to write web applications in JavaScript ! It is very useful, especially with Express and Socket.IO !

Wednesday, September 11, 2013

CouchSurfing !

CouchSurfing is a super cool & free way to travel and meet new friends around the world. I like it and I joined the community !

Monday, September 9, 2013

VirtualBox !

VirtualBox is a virtualization software package ; it is very convenient when you want to test an OS, do web development or build your first Hadoop cluster :-) !

Sunday, September 8, 2013

Rattle !

I was wondering if there was a GUI for doing data mining / machine learning tasks, and I found Rattle.

If you want to install and try :
install.packages("rattle", dependencies=TRUE)
library(rattle)
rattle()

And you can take a coffee during the first step ^^ !

How to make stress your friend !


Friday, August 30, 2013

MRUnit !

MRUnit (Blog) is the Java library for testing MapReduce jobs !

Thursday, August 29, 2013

Speculative execution && Hadoop !

I usually disable speculative execution for MapReduce tasks when I write to an RDBMS in a Hive user-defined table function.

set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;


And if you tune mapred.reduce.tasks, you can control the number of RDBMS sessions running.

It is also good to use batch mode and to control the commits !

Photo #9


Monday, August 26, 2013

Recursive SQL !

A quite powerful way to handle hierarchical data : recursive SQL !

WITH RECURSIVE tmp_table AS (
 SELECT column1, column2, ...
 FROM src_table
 WHERE src_table.hierarch_column_id IS NULL
 UNION ALL
 SELECT src_table.column1, src_table.column2, ...
 FROM src_table
 INNER JOIN tmp_table
 ON src_table.hierarch_column_id = tmp_table.column_id
)
SELECT *
FROM tmp_table;

You can also add meta-information, like 1 AS n in the first SELECT and n + 1 in the second one, which lets you filter on the level.

R !

R is a free software for doing statistics, analytics, machine learning and data visualization.

If you want to start learning R, watch the Google Developers videos and read about machine learning or statistical models. You can find an IDE and an easy way to create web reporting using R and Shiny.

And don't forget library(rmr2) & library(rhdfs) to plug it into Hadoop !

Sunday, August 25, 2013

Guava !

Guava is an open-source Java multi-purpose library, very useful and time-saving, especially when you want to work with collections !

Wednesday, August 21, 2013

Principal components analysis with R !

If you want to reduce the dimensionality of your n-variable problem and get the main uncorrelated axes, try PCA and start with the generic function princomp !
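And if you prefer Python, the same idea in a minimal sketch with scikit-learn (mentioned elsewhere on this blog), using its bundled iris dataset :

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)              # keep the two main uncorrelated axes
scores = pca.fit_transform(iris.data)

print(pca.explained_variance_ratio_)   # variance carried by each axis
print(scores[:3])                      # the first three rows, projected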

Sunday, August 4, 2013

Trello !

A great, free, web-based way to organize your projects with your colleagues / friends : Trello !

Thursday, July 18, 2013

Photo #8


Monday, July 15, 2013

Sunday, May 26, 2013

RunKeeper !

I like running ! To record my runs and share them with my friends, I use RunKeeper !

Saturday, May 25, 2013

Photo #7


Wednesday, May 8, 2013

SolR & ElasticSearch !

SolR and ElasticSearch are both great ways to add search capabilities (and more) to your projects. And behind both of them, there is Lucene !

Thursday, May 2, 2013

Coursera !

A very cool and free way to learn : Coursera !

Monday, April 29, 2013

What can I do with Mahout ?

  • Clustering
    • Canopy
    • K-Means
    • Fuzzy K-Means
    • Dirichlet Process
    • Latent Dirichlet Allocation
    • Mean-shift
    • Expectation Maximization
    • Spectral
    • Minhash
    • Top Down
  • Classification
    • Logistic Regression
    • Bayesian
    • Support Vector Machines  
    • Random Forests
  • Decision forest
  • Machine learning
  • Recommendation
  • Dimension reduction
  • Your own business ! (If you understand how MapReduce and the Mahout classes work together, you can code your own logic)