Sunday, March 29, 2015

How to structure my datalake?

These days I am working on an interesting topic: how to build a datalake. It can actually be fairly straightforward; here is one possible way to do it:
  • Start by creating a "datalake" directory at the root of HDFS
  • Depending on your company / team / project, add sub-folders to represent your organisation
  • Create technical users (for example etl...)
  • Use your HDFS security mechanisms to configure permissions
  • Set up quotas
  • Add a sub-directory for every (see the first sketch after this list):
    • source
    • version of the source
    • data type
    • version of the data type
    • data quality
    • partition [Year / Month / Day | Category | Type | Geography]
  • Configure workload management for each [technical] user / group
  • For every folder, create the metadata [once] so that users can query it [many times] from Hive / HCatalog / Pig / other tools (see the Hive DDL sketch after this list)
  • Keep in mind your HDFS block size [nowadays between 128 MB and 1 GB]
    • in order to avoid the small-files problem (see the small-file check after this list)
  • Use a naming standard at the file level to enable data lineage and to guarantee that each file is processed only once (see the last sketch after this list)
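
To make the first steps concrete, here is a minimal sketch in Python that builds a folder path following this convention and applies ownership, permissions and a space quota through the standard HDFS CLI. The names used ("datalake", the "crm" source, the "etl" user, the 1 TB quota) and the Hive-style year=/month=/day= partition folders are placeholder assumptions, not a recommendation.

    # Minimal sketch: lay out one datalake folder per source / source version /
    # data type / type version / quality / date partition, then apply ownership,
    # permissions and a space quota. "datalake", "crm", "etl" and the 1 TB quota
    # are placeholders.
    import subprocess

    def dataset_path(source, source_version, data_type, type_version,
                     quality, year, month, day):
        """Build the folder path following the convention above."""
        return ("/datalake/{source}/{source_version}/{data_type}/{type_version}/"
                "{quality}/year={year}/month={month}/day={day}").format(
                    source=source, source_version=source_version,
                    data_type=data_type, type_version=type_version,
                    quality=quality, year=year, month=month, day=day)

    def hdfs(*args):
        """Run an HDFS CLI command and fail loudly if it does not succeed."""
        subprocess.check_call(list(args))

    if __name__ == "__main__":
        path = dataset_path("crm", "v1", "customer", "v2", "raw", 2015, 3, 29)
        hdfs("hdfs", "dfs", "-mkdir", "-p", path)                          # folders
        hdfs("hdfs", "dfs", "-chown", "-R", "etl:etl", "/datalake/crm")    # technical user
        hdfs("hdfs", "dfs", "-chmod", "-R", "750", "/datalake/crm")        # permissions
        hdfs("hdfs", "dfsadmin", "-setSpaceQuota", "1t", "/datalake/crm")  # quota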
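
For the metadata step, here is a sketch of the Hive DDL that registers such a folder once, so users can then query it many times from Hive / HCatalog / Pig. The database, table, columns and location are hypothetical examples; only the shape of the DDL matters.

    # Minimal sketch: print the Hive DDL that registers a datalake folder once.
    # Database, table, columns and location below are hypothetical examples.
    DDL_TEMPLATE = """
    CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table} (
      customer_id STRING,
      event_time  STRING,
      payload     STRING
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS TEXTFILE
    LOCATION '{location}';
    """

    print(DDL_TEMPLATE.format(
        db="raw_crm",
        table="customer_v2",
        location="/datalake/crm/v1/customer/v2/raw"))

    # Each new date partition still has to be declared, for example:
    # ALTER TABLE raw_crm.customer_v2 ADD PARTITION (year=2015, month=3, day=29);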
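
To keep an eye on the block-size point, a small sketch that lists a folder recursively with the HDFS CLI and flags files far smaller than a block, the usual symptom of the small-files problem. The 128 MB block size and the 10% threshold are assumptions to adjust to your cluster.

    # Minimal sketch: flag files that are much smaller than the HDFS block size.
    # The block size and the 10% threshold are assumptions, not recommendations.
    import subprocess

    BLOCK_SIZE = 128 * 1024 * 1024   # assumed 128 MB block size
    THRESHOLD = 0.10                 # flag files below 10% of a block

    def small_files(folder):
        """Yield (path, size) for files well below the block size."""
        listing = subprocess.check_output(
            ["hdfs", "dfs", "-ls", "-R", folder]).decode("utf-8")
        for line in listing.splitlines():
            parts = line.split()
            if len(parts) < 8 or line.startswith("d"):
                continue                      # skip headers and directories
            size, path = int(parts[4]), parts[7]
            if size < BLOCK_SIZE * THRESHOLD:
                yield path, size

    if __name__ == "__main__":
        for path, size in small_files("/datalake/crm"):
            print("small file: {} ({} bytes)".format(path, size))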
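
Finally, a sketch of a possible file naming standard: the name carries the source, dataset, extraction timestamp and a batch id, so any file can be traced back to its origin and a batch already seen can be skipped instead of being processed twice. The exact fields and the CSV extension are an example convention, not a fixed rule.

    # Minimal sketch of a file naming standard: the name carries source, dataset,
    # extraction timestamp and batch id, supporting lineage and single processing.
    # The fields below are an example convention.
    import re
    from datetime import datetime

    NAME_PATTERN = re.compile(
        r"(?P<source>[a-z0-9]+)_(?P<dataset>[a-z0-9]+)_"
        r"(?P<extracted_at>\d{8}T\d{6})_(?P<batch_id>[0-9a-f]+)\.csv$")

    def build_name(source, dataset, extracted_at, batch_id):
        return "{}_{}_{}_{}.csv".format(
            source, dataset, extracted_at.strftime("%Y%m%dT%H%M%S"), batch_id)

    def parse_name(name):
        match = NAME_PATTERN.match(name)
        return match.groupdict() if match else None

    if __name__ == "__main__":
        name = build_name("crm", "customer", datetime(2015, 3, 29, 8, 0, 0), "a1b2c3")
        print(name)              # crm_customer_20150329T080000_a1b2c3.csv
        print(parse_name(name))  # lineage fields recovered from the name alone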
And so the datalake becomes an organized and smart raw dump.
