Saturday, May 28, 2016

How to automate Data Analysis ? #part2

Here we go: I coded a prototype to

  • parse CSV files (database and JSON support will be added later)
  • load the data into Hadoop
  • create the corresponding Hive ORC table
  • run simple queries to extract information
    • MIN, MAX, AVG
    • Top 10
    • COUNT(DISTINCT ), COUNT(*) (grouped by YEAR or YEAR / MONTH for timestamp columns) and NULL value counts
    • Regex matching the record
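
To give an idea of the steps above, here is a minimal Python sketch, not the actual prototype code. It guesses column types from a CSV sample, then generates a Hive ORC table definition and a few of the profiling queries. The function names (infer_type, infer_schema, create_table_ddl, profile_queries) and the type heuristic are assumptions made for illustration; loading into Hadoop and the per-YEAR / regex queries are left out.

```python
import csv
import io


def infer_type(values):
    """Guess a Hive type from sample strings (rough heuristic, an assumption)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for hive_type, cast in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for v in non_empty:
                cast(v)
            return hive_type
        except ValueError:
            continue
    return "STRING"


def infer_schema(csv_text):
    """Return [(column, hive_type)] from CSV text with a header row.

    Assumes every row has as many fields as the header.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return [(name, infer_type([r[i] for r in data]))
            for i, name in enumerate(header)]


def create_table_ddl(table, schema):
    """Build the CREATE TABLE statement for a Hive ORC table."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in schema)
    return f"CREATE TABLE `{table}` (\n  {cols}\n) STORED AS ORC"


def profile_queries(table, schema):
    """Build simple profiling queries for each column."""
    queries = []
    for name, hive_type in schema:
        # Row count, distinct count and NULL count
        queries.append(
            f"SELECT COUNT(*), COUNT(DISTINCT `{name}`), "
            f"SUM(CASE WHEN `{name}` IS NULL THEN 1 ELSE 0 END) "
            f"FROM `{table}`"
        )
        # MIN / MAX / AVG only make sense for numeric columns
        if hive_type in ("BIGINT", "DOUBLE"):
            queries.append(
                f"SELECT MIN(`{name}`), MAX(`{name}`), AVG(`{name}`) "
                f"FROM `{table}`"
            )
        # Top 10 most frequent values
        queries.append(
            f"SELECT `{name}`, COUNT(*) AS cnt FROM `{table}` "
            f"GROUP BY `{name}` ORDER BY cnt DESC LIMIT 10"
        )
    return queries
```

For example, calling infer_schema on a small CSV sample and passing the result to create_table_ddl("sales", schema) and profile_queries("sales", schema) gives statements that could then be submitted to Hive ("sales" is a made-up table name).
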
You can find the code here!

The next step will probably be to add Spark code generation.
