It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management, data preparation, and analytics by themselves, or at least to make more suggestions.
On most of our BI/analytics projects we spend around 70% of the time on data preparation. We humans need time to understand the data, the way it is produced, how to ingest and transform it into a structure, and how to enrich it with the semantics and standards our business users and reporting need.
What if I dedicated 20% of the space/CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like:
- These files seem to be received as a daily full dump, do you agree?
- This dataset can be mapped to this schema: <CREATE TABLE .... schema> (a schema-suggestion sketch follows this list)
- This column contains 20% NULLs, especially when <column name> = "October"; do you want to create a rule? (see the profiling sketch below)
- This file contains 45,000 lines on average, +/- 5%, except for 3 days (also covered in the profiling sketch)
- <column name> can be used to join these two tables; the match rate will be 74% (see the join-discovery sketch below)
- This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name> and the top variables are <list of variable names>. Can you think of a new variable? (see the prediction sketch below)
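To make the schema suggestion concrete, here is a minimal sketch in Python with pandas, assuming a hypothetical CSV landing file and Hive as the target. The file path, table name, and type mapping are all my own illustrative assumptions:

```python
# A minimal sketch of schema suggestion: profile a sample of a landing file
# and propose a Hive CREATE TABLE. Path and table name are hypothetical.
import pandas as pd

# Deliberately small mapping from pandas dtypes to Hive column types.
HIVE_TYPES = {"int64": "BIGINT", "float64": "DOUBLE",
              "bool": "BOOLEAN", "datetime64[ns]": "TIMESTAMP"}

def suggest_create_table(path: str, table: str) -> str:
    sample = pd.read_csv(path, nrows=10_000)  # profile a sample, not the whole file
    cols = [f"  `{name}` {HIVE_TYPES.get(str(dtype), 'STRING')}"
            for name, dtype in sample.dtypes.items()]
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

print(suggest_create_table("/landing/sales/2016-10-01.csv", "staging.sales"))
```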
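The NULL-rate and line-count questions boil down to simple profiling statistics. A sketch of both checks, again with pandas; the column names, the 20% NULL threshold, and the 5% tolerance are illustrative assumptions, not anything the platform prescribes:

```python
# Two profiling heuristics from the list above: the conditional NULL-rate
# check and the daily line-count check.
import pandas as pd

def null_rate_by_group(df: pd.DataFrame, target: str, by: str) -> pd.Series:
    """NULL rate of `target` for each value of `by`, highest first."""
    return (df[target].isna()
              .groupby(df[by])
              .mean()
              .sort_values(ascending=False))

def flag_unusual_days(daily_counts: pd.Series, tolerance: float = 0.05) -> pd.Series:
    """Days whose line count deviates more than `tolerance` from the mean."""
    mean = daily_counts.mean()
    return daily_counts[(daily_counts - mean).abs() > tolerance * mean]

df = pd.read_csv("/landing/sales/2016-10.csv")       # hypothetical dataset
rates = null_rate_by_group(df, target="amount", by="month")
if rates.iloc[0] > 0.20:
    print(f"'amount' contains {rates.iloc[0]:.0%} NULLs when month = {rates.index[0]!r},"
          " do you want to create a rule?")

daily = df.groupby("load_date").size()               # lines received per day
print(flag_unusual_days(daily))                      # e.g. "except for 3 days"
```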
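Join discovery can start from a brute-force scan of column pairs: for every pair with the same inferred type, measure how many values of one table find a match in the other. A sketch, with hypothetical table and column names:

```python
# Join-suggestion heuristic: report column pairs whose match rate clears
# a threshold. Inputs and the 0.5 threshold are illustrative.
import pandas as pd

def match_rate(left: pd.Series, right: pd.Series) -> float:
    """Fraction of non-null values in `left` that also appear in `right`."""
    left = left.dropna()
    return left.isin(set(right.dropna())).mean() if len(left) else 0.0

def suggest_joins(a: pd.DataFrame, b: pd.DataFrame, threshold: float = 0.5):
    """Yield (col_a, col_b, rate) for same-typed pairs above `threshold`."""
    for ca in a.columns:
        for cb in b.columns:
            if a[ca].dtype == b[cb].dtype:
                rate = match_rate(a[ca], b[cb])
                if rate >= threshold:
                    yield ca, cb, rate

orders = pd.read_csv("/landing/orders.csv")          # hypothetical inputs
clients = pd.read_csv("/landing/clients.csv")
for ca, cb, rate in suggest_joins(orders, clients):
    print(f"{ca} <-> {cb}: the match rate will be {rate:.0%}")
```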
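And the last suggestion is a small auto-modelling loop: try to predict one column from the others, then report the accuracy and the top variables. A sketch using scikit-learn (my choice of library; the target column and encoding are also assumptions):

```python
# Predict one column from the others and surface accuracy + top variables.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("/landing/sales/2016-10.csv")       # hypothetical dataset
target = "payment_type"                              # column to predict

# Naive one-hot encoding; good enough for a suggestion, not for production.
X = pd.get_dummies(df.drop(columns=[target])).fillna(0)
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
top = (pd.Series(model.feature_importances_, index=X.columns)
         .nlargest(3).index.tolist())
print(f"'{target}' can be predicted with {accuracy:.0%} accuracy; "
      f"top variables are {top}. Can you think of a new variable?")
```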