
Sunday, October 11, 2015

Meta-development and [automatic] analytics!

It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management/data preparation/analytics by themselves, or at least to make more suggestions.

For most of our BI-like/analytics projects we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest/transform it to get a structure, and how to enrich it to get the semantics/standards our business users/reporting need.

What if I dedicated 20% of the space/CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like the ones below (a small profiling sketch follows the list)?

  • Those files seem to be received as a daily full dump, do you agree?
  • This dataset can be mapped to this schema: <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <columnname> = "October"; do you want to create a rule?
  • This file contains on average 45,000 lines +/- 5%, except for 3 days
  • <column name> can be used to join these two tables; the match rate would be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name>, the top variables are <list of variable name>. Can you think of a new variable?
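
Nothing like this exists out of the box on my platform, but a first version of such heuristics is not hard to prototype. Here is a minimal sketch (my illustration, not an existing tool): it assumes daily full dumps arrive as CSV files under a hypothetical /data/landing/sales/ directory, and it raises two of the questions above, the line-count outliers and the NULL-heavy columns. On a real [Hadoop] platform the same logic would run as a scheduled Spark or MapReduce job over HDFS paths instead of a local directory.

# Minimal profiling sketch: flag line-count outliers and NULL-heavy columns
# in a set of daily CSV dumps. Paths and thresholds are hypothetical examples.
import csv
import glob
import statistics
from collections import Counter

def profile_daily_dumps(pattern="/data/landing/sales/dump_*.csv"):
    line_counts = {}          # file path -> number of data lines
    null_counts = Counter()   # column name -> number of empty/NULL values
    total_rows = 0

    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = 0
            for row in csv.DictReader(f):
                rows += 1
                for col, value in row.items():
                    if value is None or value.strip() in ("", "NULL"):
                        null_counts[col] += 1
            line_counts[path] = rows
            total_rows += rows

    # Heuristic 1: "this file contains on average N lines +/- 5%", flag the exceptions
    counts = list(line_counts.values())
    if counts:
        mean = statistics.mean(counts)
        for path, n in line_counts.items():
            if mean and abs(n - mean) / mean > 0.05:
                print(f"Outlier: {path} has {n} lines (average is {mean:.0f})")

    # Heuristic 2: "this column contains X% of NULL, do you want to create a rule?"
    for col, nulls in null_counts.items():
        pct = 100.0 * nulls / total_rows if total_rows else 0.0
        if pct > 10:
            print(f"Column '{col}' is NULL in {pct:.0f}% of rows; create a rule?")

if __name__ == "__main__":
    profile_daily_dumps()

The interesting part is not the code itself but the fact that the platform could run it continuously on 20% of its spare capacity and surface the findings as suggestions in the portal.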

Sunday, March 29, 2015

How to structure my datalake?

At the moment I am working on an interesting topic: how to build a datalake. It can actually be straightforward; here is one possible way to do it:
  • Start by creating a directory "datalake" at the root
  • Depending on your company / team / project, add sub-folders to represent your organisation
  • Create technical users (for example etl...)
  • Use your HDFS security system to configure permissions
  • Set up quotas
  • Add sub-directories for every:
    • source
    • version of the source
    • data type
    • version of the data type
    • data quality
    • partition [Year / Month / Day | Category | Type | Geography]
  • Configure workload management for each [technical] user / group
  • For every folder, create the metadata [once] to allow users to query it [many times] from Hive / HCatalog / Pig / others
  • Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
    • in order to avoid the small-files problem
  • Use naming standards at the file level to enable data lineage and to guarantee that each file is processed exactly once
And so the datalake becomes an organized and smart raw dump.
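
To make the folder convention above concrete, here is a minimal sketch (a hypothetical helper, not part of the original recipe) that encodes the hierarchy once so that every ingestion job builds exactly the same paths; all the names used in the example (finance, sales_erp, orders...) are made up.

# Build an HDFS path following the convention:
# /datalake/<org>/<source>/<source version>/<data type>/<type version>/
#   <data quality>/year=YYYY/month=MM/day=DD
from datetime import date

def datalake_path(org, source, source_version, data_type, type_version,
                  quality, partition_date):
    return (
        f"/datalake/{org}/{source}/v{source_version}/"
        f"{data_type}/v{type_version}/{quality}/"
        f"year={partition_date:%Y}/month={partition_date:%m}/day={partition_date:%d}"
    )

# Example: where the "etl" technical user would land raw sales data
print(datalake_path("finance", "sales_erp", 1, "orders", 2, "raw", date(2015, 3, 29)))
# -> /datalake/finance/sales_erp/v1/orders/v2/raw/year=2015/month=03/day=29

Keeping the convention in one shared helper like this is what makes the permissions, quotas and partitioning rules above enforceable over time.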