Monday, November 16, 2020

Data Mesh on Google Cloud Platform (and this is excellent !)

Hello from Home 🏡,

Quick update : I am leading the creation of a new deck explaining the Data Mesh architecture and why Google Cloud Platform is an amazing fit for this approach :
  • GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
  • Serverless
  • Pay as you go
  • Google APIs
  • Looker (esp. Looker API and the semantic layer)
  • Scalability
  • BigQuery / Cloud Storage / Cloud Pub/Sub
  • Ephemeral Hadoop cluster (Dataproc)
  • IAM
  • Cloud Source Repositories / Cloud Build
This [new] architecture is not a huge revolution (and this is great) : it builds on 40+ years of data platform innovation and follows the same approach as microservices / Kubernetes.
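To give a taste of the "GCP project per data domain" mapping, here is a minimal sketch of how a producing domain could expose a BigQuery dataset (its data product) to a consuming domain, using the google-cloud-bigquery client ; the project, dataset and group names are hypothetical :

# pip install google-cloud-bigquery
from google.cloud import bigquery

# Hypothetical names : one GCP project per data domain.
PRODUCER_PROJECT = "sales-domain-prj"
CONSUMER_GROUP = "marketing-team@example.com"

client = bigquery.Client(project=PRODUCER_PROJECT)
dataset = client.get_dataset(f"{PRODUCER_PROJECT}.sales_products")

# Grant the consuming domain read-only access to the data product.
entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(role="READER",
                                    entity_type="groupByEmail",
                                    entity_id=CONSUMER_GROUP))
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])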



Stay Data Mesh tuned !

Sunday, March 29, 2020

Stay safe !

Good luck to you and your family during this strange period.

Sunday, March 15, 2020

How to DataOps with Google Cloud Platform !

What do we want to achieve ?

Use DataOps to monitor information from Twitter about Google.
  • Without doing IaaS (infrastructure), so using Google Cloud managed services or serverless technologies
  • Making sure all assets are stored in a repository with dev and master branches
  • No manual steps to test or push content to our Google Cloud project
  • Ensure I can adapt to data structure changes and so replay all data processing from scratch
  • Keep all data and compress it

What do we need :

Let's do it !

  1. Schedule a task every minute to gather tweets from the Twitter API, then store them in GCS
  2. Schedule a task every day to compress all the previous data into a tar.gz file
  3. Read the compressed archive and load it into BigQuery with adaptive schema capabilities
  4. Build the corresponding reporting
More information and code soon !
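In the meantime, a minimal sketch of step 1 as a Cloud Function body, assuming a hypothetical bucket name and Twitter bearer token (the Cloud Scheduler trigger is left out) :

import datetime
import json

import requests
from google.cloud import storage  # pip install google-cloud-storage

BEARER_TOKEN = "..."            # hypothetical Twitter API credential
BUCKET_NAME = "my-dataops-raw"  # hypothetical GCS bucket

def gather_tweets(event=None, context=None):
    """Fetch recent tweets about Google and store the raw JSON in GCS."""
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": "Google", "result_type": "recent", "count": 100},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    resp.raise_for_status()
    # One object per run, partitioned by timestamp : raw, immutable, replayable.
    path = datetime.datetime.utcnow().strftime("tweets/%Y/%m/%d/%H%M.json")
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(path).upload_from_string(json.dumps(resp.json()),
                                         content_type="application/json")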

Saturday, September 21, 2019

How to DataOps with Google Cloud Platform !

Hello from Budapest,

It's been a long time since I last had the chance to look at my blog, so quick news : I will now restart sharing here, and the first topic is going to be DataOps on Google Cloud Platform using BigQuery, Data Studio, Jenkins, Stackdriver, Google Cloud Storage and more !



Stay Data tuned !

Wednesday, May 30, 2018

Getting Things Done

Hello from Langkawi,

A quick one before boarding. I have had the chance to work with several customers, using several technologies, on several environments and several versions... Things keep changing / evolving, especially when the customer and / or management changes priorities, or bugs occur in production (it can happen with Hadoop).

How to keep track of your tasks (from customer questions, to admin / expense tasks, to private items to achieve) ? I tried different ways, from emails, post-its, Wunderlist, Trello and Google Keep, and the only one that worked for me is Todoist.

Why ?
  • Easy to set up recurring tasks
  • Karma / graph view to see the number of tasks achieved per day / week
  • You can assign a color to each project and also create hierarchies of projects
  • Possible to share a project with another user
Hope it will help ;-)
Cheers


Tuesday, July 18, 2017

Wednesday, April 5, 2017

Monday, March 20, 2017

Chief DevOps Officer ! #automation

This is the new trendy job ; here is what I think the mission should be :

  • Automation using DevOps
  • Improving metrics gathering & reporting
  • Quality improvement driven by Pareto analysis

Sunday, March 5, 2017

Friday, March 3, 2017

NetData !

Very cool tool to monitor your lovely servers : NetData ! #Enjoy #Thanks

Monday, January 9, 2017

Open source

A fantastic open source, Tableau-like reporting and data exploration tool to look at : Superset, from the fantastic ladies and gentlemen at Airbnb.

Wednesday, October 12, 2016

How to install Ansible on Ubuntu !

sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible

Monday, October 3, 2016

A normal process while doing automation #enterpr1se 3.0 !

Every time you want to automate an IT process :
  • Look for Pareto
  • Create the according test
  • Set up quality control / performance metrics and dashboards
  • Gather the logs
  • Configure backups
  • Ensure the code / configuration is pushed to the configuration management tool
  • Be sure to be compliant with your security policy

What is DevOps ?


Friday, August 12, 2016

Tuesday, July 26, 2016

Minimum set of tools to do DevOps with Hadoop !

DevOps is a way of working / a set of frameworks to ease the life of IT teams, from developers to admins / production, in a complex, multi-parameter, collaborative environment.
  • Continuous integration tool : Jenkins
  • Build automation tool : Maven
  • Team code collaboration tool : Gerrit Code Review
  • Database : PostgreSQL
    • Admin and application monitoring
  • Visualisation tool : Zeppelin
    • CEO / admin / dev dashboards
  • A versioning tool : Git
    • code
    • configuration
    • template

Thursday, July 21, 2016

How to install Zeppelin in a few lines !

wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvfz apache-maven-3.3.9-bin.tar.gz
sudo apt-get install -y r-base
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libcurl4-gnutls-dev
sudo R -e "install.packages('evaluate')"
sudo R -e "install.packages('devtools', dependencies = TRUE)"
export JAVA_HOME=/usr/jdk/jdk1.8.0_60
git clone https://github.com/apache/zeppelin.git
cd zeppelin
../apache-maven-3.3.9/bin/mvn clean package -DskipTests -Pspark-1.6 -Pr -Psparkr
./bin/zeppelin-daemon.sh start

So cool to finally have an easy Open Source tool to do reporting and visualisation !

Wednesday, July 6, 2016

List of Jenkins plugins and configuration for Hadoop automatic deployment !

Configure :
  • JDK
  • Maven
  • Security
  • Share the SSH public key from the Jenkins hosts
Plugins : 
Currently in testing :

Sunday, June 26, 2016

Don't forget the basics !

Working as a Hadoop DBA, I have noticed several times that the previous admin forgot to :

Sunday, June 19, 2016

Saturday, June 18, 2016

Friday, June 10, 2016

My Hadoop is not efficient enough, what can I do ?

1. Review your memory configuration to maximize CPU utilisation
2. Review your YARN settings especially the Capacity Scheduler
3. Review your application design, parameter used, join strategy, file format

Of course, while checking your Ganglia / Ambari Metrics, voilà !

PS : For those who don't trust multi-tenant Hadoop clusters, please call me ;-)

Saturday, May 28, 2016

How to automate Data Analysis ? #part2

Here we go : I coded a prototype to

  • help parse CSV files (database and JSON support will be added later)
  • load the data into Hadoop
  • create the corresponding Hive ORC table
  • run simple queries to extract information
    • MIN, MAX, AVG
    • Top 10
    • COUNT(DISTINCT ...), COUNT(*) (if timestamp, by YEAR and YEAR / MONTH) and NULL values
    • Regexes matching the records
You can find the code here !

The next step will probably be to add Spark code generation.
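To illustrate the idea (this is not the actual prototype, just a minimal sketch assuming a hypothetical CSV file with a header row, every column typed as STRING to start with) :

import csv

def hive_orc_ddl(csv_path, table_name):
    """Generate a CREATE TABLE ... STORED AS ORC statement from a CSV header."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    cols = ",\n".join(f"  `{c.strip().lower()}` STRING" for c in header)
    return f"CREATE TABLE {table_name} (\n{cols}\n) STORED AS ORC;"

def profile_query(table_name, column):
    """Generate a simple profiling query (MIN / MAX / distinct / NULL) for one column."""
    return (f"SELECT MIN({column}), MAX({column}), COUNT(DISTINCT {column}), "
            f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count "
            f"FROM {table_name};")

print(hive_orc_ddl("export.csv", "t_export"))       # hypothetical file / table names
print(profile_query("t_export", "customer_id"))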

Tuesday, April 19, 2016

My Hadoop application gets stuck !

If you are running a multi-application environment, you can reach the point where you can't allocate any more mappers / reducers / containers, and so some of your applications are waiting for resources and get stuck.

In that case, review your Capacity Scheduler queue settings (capacity and elasticity), check mapreduce.job.reduce.slowstart.completedmaps and enable preemption !

Monday, March 7, 2016

A list of useful R packages !

Here you go :
  1. sqldf (for selecting from data frames using SQL)
  2. forecast (for easy forecasting of time series)
  3. plyr (data aggregation)
  4. stringr (string manipulation)
  5. Database connection packages : RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite
  6. lubridate (time and date manipulation)
  7. ggplot2 (data visualization)
  8. qcc (statistical quality control and QC charts)
  9. reshape2 (data restructuring)
  10. randomForest (random forest predictive models)
  11. xgboost (Extreme Gradient Boosting)
  12. RHadoop (Connect R with Hadoop)
And don't forget http://statmethods.net/ !

Saturday, February 27, 2016

Postgres !

PostgreSQL is an excellent Open Source database for small and medium projects. You will find a lot of amazing features like HA, statistics functions, Row Level Security, JSON support and UPSERT.

Tuesday, February 23, 2016

Tableau !

For those who still don't have an amazing reporting tool ;-) : Tableau.

Thursday, January 21, 2016

How to automate Data Analysis ? #part1

When I first started this project, I was wondering how to speed up the work done by analysts. I figured out there is a lot to do here :
  • What can be pre-processed (scripts to load a file and create the corresponding table, so that as soon as a new file / table is created we get the statistics for every column, the regexes, the NULL values ; see the sketch after this list)
  • What can be automatically discovered (this column can be used to join with this table, you have one element more though ; users in the group [female] have an average .... compared to the group [male])
  • Things that can be generated on the fly (mainly code : R code, SAS code, SQL code, mostly based on templates)
  • Things that can be parameterized
    • How often have I heard "I will try that later with this assumption" ! What if we could easily parameterize every step of the data workflow ? Useful as well for multi-dimension, matrix-based test cases
  • Things that can be automated / triggered
    • I just created a new variable : what does it bring to my workflow ?
    • Variable reduction / transformation
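As a taste of the pre-processing bullet, a minimal sketch of a column profiler in pure Python, assuming a CSV file with a header row :

import csv

def profile_csv(path):
    """Compute per-column NULL counts and distinct counts from a CSV file."""
    nulls, distincts, rows = {}, {}, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for col, val in row.items():
                if not val:  # empty string or missing field counts as NULL
                    nulls[col] = nulls.get(col, 0) + 1
                distincts.setdefault(col, set()).add(val)
    for col in distincts:
        print(f"{col} : {nulls.get(col, 0)}/{rows} NULL, {len(distincts[col])} distinct")

profile_csv("export.csv")  # hypothetical input file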

Wednesday, January 13, 2016

The HadoopAutomator !

2016 will be the year of automation. I am currently working on several projects to automate almost everything (from installation to automatic data analysis and reporting), mainly using :

Friday, December 18, 2015

Photo #20 !


Merry Christmas and Happy New Year to all of you !

Saturday, November 14, 2015

Wednesday, November 11, 2015

Jenkins, Maven, SVN and Hortonworks HDP2.3 sandbox !

If you are also an automation and Open Source fan and you are (or not) in the process of building Hadoop applications, I strongly suggest using (at minimum) :
  • Continuous integration tool (Jenkins, TeamCity, Travis CI)
  • Build tool (Maven, Ant, Gradle)
  • Provisioning tool (Chef, Ansible, shell script, Puppet)
  • Versioning system (Git, SVN, CVS)
In order to improve overall project quality / stop losing time / ease Hadoop migration and testing / be more efficient (yes, a lot of good reasons).

I have the pleasure of using SVN + Jenkins + Maven + a few shell scripts + the HDP sandbox on my laptop, and this is really awesome.

Thanks ;-)

Friday, November 6, 2015

Hadoop Mini Clusters !

A nice project here to do some local test / development ! You can find other interesting projects in the Hortonworks gallery.

Sunday, October 25, 2015

Wednesday, October 21, 2015

Hadoop and version control software !

For those who don't use Ambari, or for those [edge] nodes which are not synced, please be sure to use version control software, so your team / admins will know which libraries / configuration files / links have been modified, by whom / when / why [and life will be easier].

Monday, October 19, 2015

Sunday, October 11, 2015

Meta-development and [automatic] analytics !

It can sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least make more suggestions.

For most of our BI-like / analytics projects we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest / transform it to get a structure, and how to enrich it to get the semantics / standards for our business users / reporting.

What if I dedicated 20% of the space / CPU of my [Hadoop] platform, what if my platform knew some heuristics and made some assumptions ? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like :

  • Those files seem to be received as a daily full dump, do you agree ?
  • This dataset can be mapped to this schema <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <columnname> = "October" ; do you want to create a rule ?
  • This file contains on average 45000 lines +/- 5%, except for 3 days
  • <column name> can be used to join these two tables, the matching will be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy, the best model is <model name>, the top variables are <list of variable name> ; can you think of a new variable ?

Wednesday, August 26, 2015

Apache Zeppelin !

Apache Zeppelin is a really cool open-source web-based notebook which supports collaborative editing, basic reporting capabilities, data discovery and multiple language backends !

Saturday, May 23, 2015

Sunday, March 29, 2015

Photo #17 !


How to structure my datalake ?

These days I am working on an interesting topic : how to build a datalake. Actually it can be straightforward ; this is one possible way to do it :
  • Start by creating a directory "datalake" at the root
  • Depending on your company / team / project, add sub-folders to represent your organisation
  • Create technical users (for example etl...)
  • Use your HDFS security system to configure permission
  • Set up quota 
  • Add sub-directories for every
    • source
    • version of the source
    • data type
    • version of the data type
    • data quality
    • partition [Year / Month / Day | Category | Type | Geography]
  • Configure workload management for each [technical] user / group
  • For every folder, create the metadata [once] to allow users to query it [many times] from Hive / HCatalog / Pig / others
  • Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
    • in order to avoid the small file problem
  • Use a naming standard at the file level to allow data lineage and guarantee one-time processing
And so the datalake becomes an organized and smart raw dump.
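A minimal sketch of such a layout in Python, with hypothetical team / source names :

import datetime

def datalake_path(team, source, source_version, data_type, type_version, day):
    """Build a partitioned HDFS path following the layout above."""
    return ("/datalake/{team}/{source}/v{sv}/{dtype}/v{tv}/"
            "year={y}/month={m:02d}/day={d:02d}").format(
                team=team, source=source, sv=source_version,
                dtype=data_type, tv=type_version,
                y=day.year, m=day.month, d=day.day)

print(datalake_path("marketing", "crm_export", 1, "customer", 2,
                    datetime.date(2015, 3, 29)))
# -> /datalake/marketing/crm_export/v1/customer/v2/year=2015/month=03/day=29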

Monday, March 16, 2015

There are bugs but it is normal life !

Hadoop is evolving very fast and sometimes you can find bugs. Be sure to check which bugs are known for your version / component :

Sunday, February 8, 2015

Meta-development !

It has been more than five years now that I have been working on BI / IT projects. There are a lot of things to cover :
  • Data [capture / quality / preparation / management / usage / monetization]
  • Security
  • Workload management, scheduling
  • Backups strategy
  • Optimisation, design patterns, good practices
  • SLAs, service & access
  • Analytics
  • Migration
  • Team, role, organization, responsibilities
And technology is evolving very quickly (especially Hadoop). I have seen a lot of smart people working hard to make it happen.

(Yes, so far you haven't learned anything new) ; what I am wondering is why the technology we are using is not smarter.

I think we should be more meta-development oriented : our technology should be able to understand more of the data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by indicating how to parse a date... The same goes for projects : some tasks should be created automatically. Last but not least, it should be the same for analytics : instead of writing some great SQL-like code, I am more looking for a kind of "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset".

It is quite slippery and cold in Stockholm, but have a good week-end ;-) !

Friday, January 23, 2015

What makes you a better man / woman ?

Last year I took some time to think about my condition and life. I am a quite motivated human and enjoy living and working hard (and there is a lot to do / to learn).

I am not going to debate the possible answers because it really depends on people / culture / [basic / complex] needs, but I invite you, my dear reader, to think about it.

For me : I would like one day to help mankind, and I like to feed my brain daily.

All the best ;-)

Tuesday, December 30, 2014

Collaborative datalake !

It is holidays now so let's relax a little and imagine some funny things ! 

Why not a collaborative datalake based on Hadoop and web technology which allows users to share both their datasets and the code story that created them ? I would add a voting system too !

Let's see ;-)

Monday, December 29, 2014

Photo #15 !


Hadoop & Late binding !

Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly.

Unfortunately, there may be some problems linked with data quality. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or the entire job can fail (which is annoying when working with 4TB of data ;-)).

So my advice is to do a minimum of checking [on the edge node] before data ingestion.
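For example, a minimal sketch of such a pre-ingestion check, assuming a delimited file and a hypothetical expected column count :

import gzip
import sys

def check_file(path, expected_columns, sep=";"):
    """Reject a file before ingestion if any line has the wrong column count."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for lineno, line in enumerate(f, start=1):
            n = len(line.rstrip("\n").split(sep))
            if n != expected_columns:
                print(f"line {lineno} : {n} columns instead of {expected_columns}")
                return False
    return True

if not check_file(sys.argv[1], expected_columns=12):  # 12 is hypothetical
    sys.exit(1)  # do not ingest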

Friday, December 19, 2014

I want the latest Hadoop improvement / new tool !

For those who are really excited by the latest improvements of Hadoop, I invite you to try them first on a development environment. Sometimes Hadoop is a little too fresh and there are some bugs.

That is also why having Hadoop, most of the time, implies doing a migration every 6 months to get the latest patches.

Thursday, December 18, 2014

WebHCat !

WebHCat is the REST API for HCatalog, so it is for those who want to use REST and submit Hive queries ;-) !

Wednesday, December 17, 2014

Hadoop Deprecated Properties !

Sometimes it is important to know if your property is deprecated and then which other one to use : Hadoop Deprecated Properties.

Friday, December 12, 2014

PIG lipstick !

This is a great project : a GUI for Pig !

Thursday, December 11, 2014

Saturday, December 6, 2014

Using Open Data & Machine Learning !

Before, I was not convinced that Open Data could bring more value to my projects. Lately, just using Open Data, I was able to build an efficient model to predict the dengue rate in Brazil with the Least Angle Regression algorithm. To do so, we used weather data (wind, temperature, precipitation, thunder / rain rates, ...), altitude, localisation, urbanization, Twitter / Wikipedia frequency and custom variables (mostly lags).

Monday, November 17, 2014

IPython Notebook !

After RStudio Server, there is also a way to use Python from a website : IPython Notebook.

Friday, October 17, 2014

Hive development !

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which leads to :
  • UPDATE, DELETE
  • BEGIN, COMMIT, ROLLBACK
  • INSERT ... VALUES
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and others) and Apache Optiq is bringing cost-based optimization to improve performance.

This is really impressive !

Friday, October 10, 2014

Meta-development with R !

I am now going deeper into R, which is great ! I recommend looking at these functions when you want to write code which generates and evaluates code directly :
  • assign()
  • eval()
  • parse()
  • quote()
  • new.env()
It is not the best in terms of performance but it can be really useful for dynamic coding. Have a great week-end ;-) !

Monday, September 22, 2014

Sunday, September 21, 2014

Summingbird !

Last week, I went to a meetup about streaming platforms and there was a great guy who presented Summingbird : a library that lets you write MapReduce programs that look like native Scala or Java collection transformations, and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Friday, August 22, 2014

Jazz !

Let's relax with a little Jazz !

Tuesday, August 19, 2014

Unlockyourbrain !

This application is quite nice if you want to improve the way you unlock your phone !

Wednesday, August 6, 2014

How to be more productive !

Take 10 minutes and read the Aaron Swartz post !

Tuesday, August 5, 2014

Scale Open Source R with AsterR or Teradata 15 !

I recently contributed to a great project which deals with using R in a distributed way within Aster and Teradata. I rediscovered that R is really permissive, flexible and powerful.

Monday, August 4, 2014

RStudio Server

You perhaps know the RStudio IDE, which is really nice. But if you want to use the RAM and the CPU of another server, you can also install RStudio Server and access your R environment using a browser-based interface, and it rocks !

Saturday, August 2, 2014

Mankind is quite stupid...

I mean, we have done a lot of amazing studies and innovations in the fields of Science, Technology and Health. But the fact that we have created this garbage continent, the way we run most of our businesses by enriching shareholders, or all the political corruption you can discover in daily life... It deserves a very big WTF !

Have a good week-end ;-)

Thursday, July 31, 2014

Partial Redistribution Partial Duplication

PRPD is a new feature in Teradata since 14.10 which improves joins on skewed tables (it relies on statistics to identify skewed values). This is a smart way to avoid the DBA having to create surrogate keys !

Wednesday, July 30, 2014

NVD3 : Re-usable charts for d3.js

If you don't want to start from scratch with D3.js, have a look at NVD3.js ;-)

Tuesday, July 29, 2014

Dataiku !

Dataiku is a French startup which provides a great web-based platform to accelerate data science projects, and there is an open-source version !

Thursday, July 24, 2014

Wednesday, July 23, 2014

My Hadoop is not working, what can I do ?

Keep calm and ;-)
  • First check your logs
  • Is the service running ? (netstat -nat | grep ...)
  • Is it possible to access it ? (telnet ip port)
  • Is there a problem linked with paths, Java libraries, environment variables or exec rights ?
  • Am I using the correct user ? 
  • What is the security system in place ?
  • Are nodes well synchronized ?
  • What about memory issues ? (swap should also be deactivated)

Monday, July 14, 2014

Virtual Desktop on Windows !

For those who come from Linux or macOS and would like virtual desktops on Windows :-)

Wednesday, May 14, 2014

Chief Data Officer

I would like to meet people who are working as CDO : Chief Data Officer. It looks like a very interesting job (data quality, data management, data everything) and it should be very helpful for the data preparation I need before running analytics workflows / discovery processes.

Monday, May 5, 2014

Hive development !

A lot of improvements in this new release of Hive !
  • [NOT] EXISTS and [NOT] IN are available
  • WITH t_table_name AS ..., well known as Common Table Expressions
  • SELECT ... WHERE (SELECT c_column1 FROM ...), i.e. Correlated Subqueries
  • The SQL authorization system (GRANT, REVOKE) is now working
  • The Tez engine, which can be enabled thanks to set hive.execution.engine=tez;

Thursday, April 24, 2014

Python !

Python is already almost everywhere and is used in production at Google. It is a very powerful programming language to turn your wishes (from Web to GUI) into a script !
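A tiny taste of that reach, using only the standard library (the URL is just an example) :

import urllib.request
from collections import Counter

# Fetch a page from the web and print its most common words.
html = urllib.request.urlopen("https://www.python.org").read().decode("utf-8", "replace")
words = [w.lower() for w in html.split() if w.isalpha()]
print(Counter(words).most_common(5))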

Wednesday, April 23, 2014

HBase coprocessor !

If you need to execute some custom code in your HBase cluster, you can use HBase coprocessor :
  • Observers : like triggers in RDBMS
    • RegionObserver : To pick up every DML statement : get, scan, put, delete, flush, split
    • WALObserver : To intercept WAL writing and reconstruction events
    • MasterObserver : To detect every DDL operation
  • EndPoints : kind of stored procedure

Wednesday, April 2, 2014

Scikit-learn !

Scikit-learn is an open-source machine learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible !
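A minimal taste, assuming scikit-learn is installed (pip install scikit-learn) ; training and scoring on the same data is just for illustration :

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the bundled iris dataset.
iris = load_iris()
clf = RandomForestClassifier(n_estimators=50)
clf.fit(iris.data, iris.target)

print(clf.score(iris.data, iris.target))   # training accuracy
print(clf.predict(iris.data[:3]))          # predicted classes for 3 flowers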

Monday, March 31, 2014

Teradata & Hadoop !

Teradata and Hadoop interact well together, especially inside the UDA with an InfiniBand interconnect. To know which platform to use when, you should look at your needs, where the largest volume is, and each platform's capabilities.

If you want to transfer data, you can consider :

Friday, March 28, 2014

Machine Learning with Aster !

I am now working with Aster to do Machine Learning and statistics. Here are the functions you can use :
  • Approximate Distinct Count : to quickly estimate the number of distinct values
  • Approximate Percentile : to compute approximate percentiles
  • Correlation : to determine if one variable is useful for predicting another
  • Generalized Linear Regression & Prediction : to perform linear regression analysis
  • Principal Component Analysis : for dimensionality reduction
  • Simple | Weighted | Exponential Moving Average : to compute averages with a special algorithm
  • K-Nearest Neighbor : classification algorithm based on proximity
  • Support Vector Machines : build an SVM model and do prediction
  • Confusion Matrix [Plot] : visualize ML algorithm performance
  • Kmeans : the famous clustering algorithm
  • Minhash : another clustering technique which depends on the set of products bought by users
  • Naïve Bayes : useful classification method, especially for documents
  • Random Forest Functions : predictive modelling approaches broadly used for supervised classification learning

Tuesday, March 11, 2014

Teradata’s SNAP Framework !

Teradata’s Seamless Network Analytic Processing Framework is one of the great ideas inside the Aster 6 database. It allows users to query different analytical engines and multiple types of storage using a SQL-like programming interface. It is composed of a query optimizer, a layer that integrates and manages resources, an execution engine and the unified SQL interface. These are the main components and their goals :
  • SQL-GR & Graph Engine : provides functions to work with edges, vertices, [un|bi|]directed or cyclic graphs
  • SQL-MR : library (Machine Learning, Statistics, Search behaviour, Pattern matching, Time series, Text analysis, Geo-spatial, Parsing) to process data using MapReduce framework
  • SQL-H : easy to use connection to HDFS for loading data from Hadoop
  • SQL : join, filter, aggregation, OLAP, insert, update, delete, CASE WHEN, table
  • AFS connector : SQL-MR function to map AFS file to table
  • Teradata connector : SQL-MR function to load data from / to Teradata RDBMS
  • Stream API : plug in your Python, Ruby, Perl, C[|++|#] scripts and use the Aster worker nodes' CPUs to process them

Tuesday, February 25, 2014

Sunday, February 23, 2014

Scam #1

I rented a car using locationdevoiture.fr ; they called pretending there was a problem during the website registration and they changed the date, so when I arrived to take the car : no voucher, no reservation... #becareful #scam

Friday, February 21, 2014

Taxi !

I just took a taxi this morning and, because I am a foreigner, the guy took the wrong direction... I like to travel, but not before going to work ;-). But thanks to Google Maps, and because I remembered the price I paid the first day, everything went well. Here is the advice I would like to share :

  • Take your phone, use Google Maps and show the driver where you want to go
  • Don't take a taxi near the tourist spots, nor your hotel's taxi
  • Ask for the (approximate) price before leaving

Saturday, February 15, 2014

My Happiness recipe !

  • Travelling every 4 months
  • Save money by using smart websites
  • Drink Pret A Manger soup, hot hazelnut chocolate at Starbucks, or tea (especially green tea)
  • Ride a motorbike when it is sunny
  • Enjoy your family
  • Share knowledge & keep discovering (not only IT)
  • Try new food or new restaurants
  • Gather with your friends and have some drinks ;-)
  • Eat healthy (yes, you are what you do but also what you eat !)
  • Take photos using Instagram and share them with your friends !
  • Write down what is important for you
  • Read, read, read, especially before going to sleep
  • Wake up at 07h07, go to sleep at 22h22 (at least try)
  • Close your computer now and do some sports ;)

Monday, February 10, 2014

HDP [>] 2.1 natively available applications !

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAG (directed acyclic graph) of tasks
  • HOYA, HBase on YARN : distributed, column oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop distributed file system
  • WebHDFS : interact to HDFS using HTTP (no need for library)
  • WebHCat : interact to HCatalog using HTTP (no need for library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : Machine-Learning libraries which use MapReduce for computing
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : extract and push down data to databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • SolR : platform for indexing and search
  • HCatalog : meta-data management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (use WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : software programming language for statistical computing and graphics
What else ;-) ?

Tuesday, February 4, 2014

Basic statistics with R !

I am quite sure you already know the basic functions (mean, median, sd, var, quantile, summary), but they are really useful, especially with na.rm=TRUE !
And don't forget t.test and prop.test !

Saturday, February 1, 2014

Lego & Chrome !

For now, there are not a lot of pieces, but it can give you great moments with your child / nephew ;-)

Monday, January 27, 2014

Main Hadoop 2.0 daemons !

  • NameNode : one per namespace, stores & handles HDFS metadata
  • Secondary NameNode : (for now) still in use if no HA
  • Checkpoint node : (later) multiple checkpoint nodes are possible, performs periodic checkpoints
  • Backup node / Standby node : allows high availability, keeps an updated copy of the namespace in its memory (if one is used, no Checkpoint node is allowed)
  • DataNode : stores HDFS data

  • ResourceManager : a global pure scheduler
  • ApplicationMaster : one per application, negotiates resources with the RM, monitors and requests task execution from the NodeManagers
  • NodeManager : one per slave server ; an application container launcher and reporting agent
  • Application Container : a separate processing unit ; it can be a Map, a Reduce or a Storm Bolt, etc.

Thursday, January 16, 2014

S.A.R.A.H

I will set up S.A.R.A.H soon, let's have an intelligent house and enjoy IoT ;-)

Monday, January 13, 2014

Google Keep !

In 2013, I tried some software to improve my organisation and I found one quite smart & useful : Google Keep. You can create tasks or task lists, add colors, pictures and reminders (by date or location), and it is synchronised with your Android device !

Tuesday, January 7, 2014

Hadoop & Java !

Thanks to UD[AT]Fs or MapReduce, you can work directly in Java and use your Hadoop resources. Because of the huge number of Java libraries, you can imagine extracting data directly from HTML / XML files, mixing it with reference / parameter data (JDBC loading), and transforming it into Excel files in one job !

Friday, January 3, 2014

Can we eat to starve cancer ?

William Li: http://on.ted.com/tbWi

Never give up !

Diana Nyad : http://on.ted.com/hyPR

Sunday, December 29, 2013

What are your passions ?

I always like to hear about other people's passions.

My passions/interests are :
  • IT
  • Solving problem
  • Discovering & Travelling
  • Human & Sharing & Realisation
  • Health
  • Build things to make this world smarter and safer
  • Running (marathon for 2014) & sports in general !
  • Aviation & aeromodelling & flight simulation
  • English
  • Motorbiking
  • Eating & discovering food
  • Movies
  • Automatic watches
  • Having fun ;-)
So if we meet, tell me yours !

Saturday, December 28, 2013

Thursday, December 26, 2013

Thursday, December 5, 2013

Decision trees & R | Mahout !

Yesterday, I was asked "how can we visualise what leads to problems ?". To me, one of the best ways is using decision trees with R or Mahout !

And you can do predictions & draw them nicely !

Sunday, December 1, 2013

Orion

Sometimes I use Eclipse Orion ; it's really useful to have a cloud-based IDE !

AngularJS

Try AngularJS, a practical JavaScript framework !

Saturday, November 30, 2013

Lambda architecture

In this post I would like to present one possible software stack for the lambda architecture :

Speed layer : Storm, HBase

Storm is the real-time ETL, and HBase, because of its random, realtime read / write capability, is the storage !

Batch layer : Hadoop, HBase, Hive / Pig / [your datawarehouse]

To allow recomputation, just copy your data, har / compress it and plug in a partitioned Hive external table. Then you can create complex Hive workflows and, why not, push some data (statistics, machine learning) back to HBase !

Serving layer : HBase, JEE & JS web application

JEE is convenient because of the HBase Java API, and JDBC if you need to cache some reference data. And you can use some JavaScript chart library.

Stay KISS ;-)

Photo #11


Nagios script & Hadoop !

A useful link to help you monitor your Hadoop cluster !

LAMP became MEAN !

MEAN is the new JavaScript-powered way to develop web applications !

Sunday, November 3, 2013

Friday, October 18, 2013

Hadoop 2.0 !

Apache Hadoop 2.0 was released just a few days ago ! Hadoop is no longer only a MapReduce container but a multi data-framework container, and it provides High Availability, HDFS Federation, NFS support and snapshots !

Big-LambData Architecture !

Nathan Marz proposed applying the lambda philosophy to big data architectures, and it can help when you have to solve use cases using both batch and real-time processing systems.

The lambda architecture is based on three main design principles :
  • human fault-tolerance
  • data immutability
  • recomputation

Quality Function Deployment !

I like to use Japanese methods ; QFD is one of my favourites for improving / solving complex IT issues !

Sunday, October 13, 2013

Kafka !

Kafka is a good solution for high-throughput, large-scale message processing applications !

Hive development !

A lateral view simplifies the use of UDTFs :

SELECT column1, column_udtf1
FROM t_table
LATERAL VIEW explode(array_column) ssreq_lv1 AS column_udtf1
;

And with Hive 0.11, you now have ORC Files and windowing functions :
  • LEAD / LAG
  • FIRST_VALUE / LAST_VALUE
  • RANK / ROW_NUMBER / DENSE_RANK
  • CUME_DIST / PERCENT_RANK / NTILE
which is convenient for our BI needs !

Sunday, September 29, 2013

Friday, September 27, 2013

Business Intelligence & Hadoop

Most of the time, BI means a snowflake or star schema (or hybrid or complex ER). But with Hadoop you should rather think about denormalization, a big ODS, a powerful ETL, a great place for your fact data and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not !

Music #2

Stromae - Papaoutai | Rammstein - Ohne Dich | Bruno Mars - Locked Out Of Heaven | Selah Sue - Raggamuffin | c2c - Down The Road | Martin Garrix - Animals | Macklemore & Ryan Lewis - Can't Hold Us | Avicii - Wake Me Up | Kavinsky - Roadgame | Imelda May - Tainted Love

Wednesday, September 25, 2013

Hadoop & compression !

Compression with Hadoop is great ! You can reduce IO and network exchanges and store more data, and most of the time your Hive / Pig / MapReduce jobs will even be a little faster.

Depending on what your needs are, you should think about Snappy, LZO, LZ4, bzip2 or gzip.
 
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Flume daemons !

  • Source (consumes events delivered to it by an external source)
  • Channel (temporarily stores the event data and helps provide end-to-end reliability of the flow)
  • Sink (removes the event from the channel and transfers / writes it)
They all run asynchronously, with the events staged in the channel.

Thursday, September 12, 2013

Node.js

Node.js is a great way to write web applications in JavaScript ! It is very useful, especially with Express and Socket.IO !

Wednesday, September 11, 2013

CouchSurfing !

CouchSurfing is a super cool & free way to travel and meet new friends around the world. I like it and I joined the community !

Monday, September 9, 2013

VirtualBox !

VirtualBox is a virtualization software package ; it is very convenient when you want to test an OS, do web development or build your first Hadoop cluster :-) !

Sunday, September 8, 2013

Rattle !

I was wondering if there was a GUI for doing data mining / machine learning tasks, and I found Rattle.

If you want to install and try :
install.packages("rattle", dependencies=TRUE)
library(rattle)
rattle()

And you can take a coffee during the first step ^^ !

How to make stress your friend !


Friday, August 30, 2013

MRUnit !

MRUnit (Blog) is the Java library for testing MapReduce jobs !

Thursday, August 29, 2013

Speculative execution && Hadoop !

I usually disable speculative execution for MapReduce tasks when I write to an RDBMS in a Hive user-defined table function.

set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;


And if you tune mapred.reduce.tasks, you can control the number of RDBMS sessions running.

It is also good to use batch mode and to control the commits !

Photo #9


Monday, August 26, 2013

Recursive SQL !

A quite powerful way to handle hierarchical data : recursive SQL !

WITH RECURSIVE tmp_table AS (
 SELECT column1, column2, ...
 FROM src_table
 WHERE src_table.hierarch_column_id IS NULL
 UNION ALL
 SELECT src_table.column1, src_table.column2, ...
 FROM src_table
 INNER JOIN tmp_table
 ON src_table.hierarch_column_id = tmp_table.column_id
)
SELECT *
FROM tmp_table;

You can also add meta-information, like 1 AS n in the first SELECT and n + 1 in the second one, which lets you filter on the level.

R !

R is a free software for doing statistics, analytics, machine learning and data visualization.

If you want to start learning R, watch the Google Developers videos and read about machine learning or statistical models. You can find an IDE and an easy way to create web reporting using R and Shiny.

And don't forget library(rmr2) & library(rhdfs) to plug it into Hadoop !

Sunday, August 25, 2013

Guava !

Guava is an open-source Java multi-purpose library, very useful and time-saving, especially when you want to work with collections !

Wednesday, August 21, 2013

Principal components analysis with R !

If you want to reduce the dimensionality of your n-variable problem and get the main uncorrelated axes, try PCA and start with the generic function princomp !
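And if you prefer Python, the same idea in a minimal sketch with scikit-learn (mentioned elsewhere on this blog), using its bundled iris dataset :

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)              # keep the two main uncorrelated axes
scores = pca.fit_transform(iris.data)

print(pca.explained_variance_ratio_)   # variance carried by each axis
print(scores[:3])                      # the first three rows, projected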

Sunday, August 4, 2013

Trello !

A great, free, web-based way to organize your projects with your colleagues / friends : Trello !

Thursday, July 18, 2013

Photo #8


Monday, July 15, 2013

Sunday, May 26, 2013

RunKeeper !

I like running ! To record my runs and share them with my friends, I use RunKeeper !

Saturday, May 25, 2013

Photo #7


Wednesday, May 8, 2013

SolR & ElasticSearch !

SolR and ElasticSearch are both great ways to add search capabilities (and more) to your projects. And behind both of them, there is Lucene !

Thursday, May 2, 2013

Coursera !

A very cool and free way to learn : Coursera !

Monday, April 29, 2013

What can I do with Mahout ?

  • Clustering
    • Canopy
    • K-Means
    • Fuzzy K-Means
    • Dirichlet Process
    • Latent Dirichlet Allocation
    • Mean-shift
    • Expectation Maximization
    • Spectral
    • Minhash
    • Top Down
  • Classification
    • Logistic Regression
    • Bayesian
    • Support Vector Machines  
    • Random Forests
  • Decision forest
  • Machine learning
  • Recommendation
  • Dimension reduction
  • Your own business ! (If you understand how MapReduce and the Mahout classes work together, you can code your own logic)