Sunday, October 11, 2015

Meta-development and [automatic] analytics!

It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management, data preparation and analytics by themselves, or at least to make more suggestions.

In most of our BI-like/analytics projects we spend around 70% of the time on data preparation. We humans need time to understand the data and the way it is produced, to work out how to ingest and transform it into a structure, and to enrich it with the semantics and standards our business users and reports rely on.

What if I dedicated 20% of the space/CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [an Ambari view] that showed metrics about the data and asked questions?

  • These files seem to be received as daily full dumps, do you agree?
  • This dataset can be mapped to this schema: <CREATE TABLE .... schema>
  • This column contains 20% NULLs, especially when <columnname> = "October", do you want to create a rule?
  • This file contains 45000 lines on average, +/- 5%, except on 3 days
  • <column name> can be used to join these two tables, with a 74% match rate
  • This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name> and the top variables are <list of variable names>. Can you think of a new variable?
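
To make this less abstract, here is a minimal Python sketch of three of the checks above: the NULL rate per group, the days with an unusual line count, and the match rate of a candidate join column. It is only an illustration under my own assumptions. The function names, the 5% default tolerance and the list-of-dicts representation of rows are all hypothetical, and everything runs in memory rather than on the platform itself.

```python
from statistics import median


def null_rate(rows, column):
    """Fraction of rows in which `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def null_rate_by(rows, column, group_column):
    """NULL rate of `column`, computed separately per value of `group_column`."""
    totals, nulls = {}, {}
    for r in rows:
        key = r.get(group_column)
        totals[key] = totals.get(key, 0) + 1
        if r.get(column) is None:
            nulls[key] = nulls.get(key, 0) + 1
    return {key: nulls.get(key, 0) / n for key, n in totals.items()}


def unusual_days(line_counts, tolerance=0.05):
    """Days whose line count is further than `tolerance` from the median count.

    `line_counts` maps a day label to the number of lines received that day.
    """
    typical = median(line_counts.values())
    return sorted(
        day for day, n in line_counts.items()
        if abs(n - typical) > tolerance * typical
    )


def join_match_rate(left_rows, right_rows, column):
    """Fraction of left rows whose `column` value also exists on the right."""
    if not left_rows:
        return 0.0
    right_keys = {r[column] for r in right_rows if r.get(column) is not None}
    matched = sum(1 for r in left_rows if r.get(column) in right_keys)
    return matched / len(left_rows)
```

A portal could run checks like these on every new dataset and turn any result above a threshold into one of the questions listed above, leaving the final decision to a human.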
