
Sunday, October 11, 2015

Meta-development and [automatic] analytics!

It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management/data preparation/analytics by themselves, or at least to make more suggestions.

For most of our BI-like/analytics projects we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest/transform it to get a structure, and how to enrich it to get the semantics/standards our business users/reporting need.

What if I dedicated 20% of the space/CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like the ones below (a small profiling sketch follows the list)?

  • Those files seem to be received as a daily full dump, do you agree?
  • This dataset can be mapped to this schema: <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <columnname> = "October"; do you want to create a rule?
  • This file contains on average 45,000 lines +/- 5%, except for 3 days
  • <column name> can be used to join these two tables; the match rate would be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name>, the top variables are <list of variable name>. Can you think of a new variable?
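
Nothing like this exists out of the box on my platform, but a first version of such heuristics is not hard to prototype. Here is a minimal sketch (my illustration, not an existing tool): it assumes daily full dumps arrive as CSV files under a hypothetical /data/landing/sales/ directory, and it raises two of the questions above, the line-count outliers and the NULL-heavy columns. On a real [Hadoop] platform the same logic would run as a scheduled Spark or MapReduce job over HDFS paths instead of a local directory.

# Minimal profiling sketch: flag line-count outliers and NULL-heavy columns
# in a set of daily CSV dumps. Paths and thresholds are hypothetical examples.
import csv
import glob
import statistics
from collections import Counter

def profile_daily_dumps(pattern="/data/landing/sales/dump_*.csv"):
    line_counts = {}          # file path -> number of data lines
    null_counts = Counter()   # column name -> number of empty/NULL values
    total_rows = 0

    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = 0
            for row in csv.DictReader(f):
                rows += 1
                for col, value in row.items():
                    if value is None or value.strip() in ("", "NULL"):
                        null_counts[col] += 1
            line_counts[path] = rows
            total_rows += rows

    # Heuristic 1: "this file contains on average N lines +/- 5%", flag the exceptions
    counts = list(line_counts.values())
    if counts:
        mean = statistics.mean(counts)
        for path, n in line_counts.items():
            if mean and abs(n - mean) / mean > 0.05:
                print(f"Outlier: {path} has {n} lines (average is {mean:.0f})")

    # Heuristic 2: "this column contains X% of NULL, do you want to create a rule?"
    for col, nulls in null_counts.items():
        pct = 100.0 * nulls / total_rows if total_rows else 0.0
        if pct > 10:
            print(f"Column '{col}' is NULL in {pct:.0f}% of rows; create a rule?")

if __name__ == "__main__":
    profile_daily_dumps()

The interesting part is not the code itself but the fact that the platform could run it continuously on 20% of its spare capacity and surface the findings as suggestions in the portal.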

Sunday, March 29, 2015

How to structure my datalake?

At the moment I am working on an interesting topic: how to build a datalake. It can actually be straightforward; here is one possible way to do it:
  • Start by creating a directory "datalake" at the root
  • Depending on your company / team / project, add sub-folders to represent your organisation
  • Create technical users (for example etl...)
  • Use your HDFS security system to configure permissions
  • Set up quotas
  • Add sub-directories for every:
    • source
    • version of the source
    • data type
    • version of the data type
    • data quality
    • partition [Year / Month / Day | Category | Type | Geography]
  • Configure workload management for each [technical] user / group
  • For every folder, create the metadata [once] to allow users to query it [many times] from Hive / HCatalog / Pig / others
  • Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
    • in order to avoid the small-files problem
  • Use naming standards at the file level to enable data lineage and to guarantee that each file is processed exactly once
And so the datalake becomes an organized and smart raw dump.
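
To make the folder convention above concrete, here is a minimal sketch (a hypothetical helper, not part of the original recipe) that encodes the hierarchy once so that every ingestion job builds exactly the same paths; all the names used in the example (finance, sales_erp, orders...) are made up.

# Build an HDFS path following the convention:
# /datalake/<org>/<source>/<source version>/<data type>/<type version>/
#   <data quality>/year=YYYY/month=MM/day=DD
from datetime import date

def datalake_path(org, source, source_version, data_type, type_version,
                  quality, partition_date):
    return (
        f"/datalake/{org}/{source}/v{source_version}/"
        f"{data_type}/v{type_version}/{quality}/"
        f"year={partition_date:%Y}/month={partition_date:%m}/day={partition_date:%d}"
    )

# Example: where the "etl" technical user would land raw sales data
print(datalake_path("finance", "sales_erp", 1, "orders", 2, "raw", date(2015, 3, 29)))
# -> /datalake/finance/sales_erp/v1/orders/v2/raw/year=2015/month=03/day=29

Keeping the convention in one shared helper like this is what makes the permissions, quotas and partitioning rules above enforceable over time.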