- Start by creating a "datalake" directory at the root of HDFS (the first sketch after this list shows the commands)
- Depending on your company / team / project, add sub-folders to represent your organisation
- Create technical users (for example etl, ...)
- Use your HDFS security mechanisms to configure permissions
- Set up quotas (name and space quotas)
- Add a sub-directory level for every
  - source
  - version of the source
  - data type
  - version of the data type
  - data quality
  - partition [Year / Month / Day | Category | Type | Geography]
- Configure workload management (for example YARN queues) for each [technical] user / group
- For every folder, create the metadata [once] so that users can query it [many times] from Hive / HCatalog / Pig / others (see the Hive sketch after this list)
- Keep in mind your HDFS block size [nowadays between 128 MB and 1 GB]
  - in order to avoid the small-files problem
- Use a naming standard at the file level to allow data lineage and to guarantee that each file is processed exactly once (see the naming sketch after this list)
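
Here is a minimal sketch of the setup commands, assuming a node with the hdfs CLI on the PATH and purely hypothetical names (an "acme" organisation, a "crm" source, an "orders" data type, an "etl" technical user and a "datalake" group); adapt the paths, owners and quota values to your own organisation.

```python
import subprocess

def hdfs(*args):
    """Run one hdfs CLI command and fail loudly if it returns a non-zero code."""
    subprocess.run(["hdfs", *args], check=True)

# Root of the lake plus one organisational level (company / team / project).
base = "/datalake/acme"

# One directory level per source / source version / data type / data type
# version / data quality / partition (here Year / Month / Day).
path = f"{base}/crm/v1/orders/v2/raw/2016/01/31"
hdfs("dfs", "-mkdir", "-p", path)

# The technical user "etl" owns the tree; its group can read, others cannot.
hdfs("dfs", "-chown", "-R", "etl:datalake", base)
hdfs("dfs", "-chmod", "-R", "750", base)

# Quotas on the organisation folder: at most 1,000,000 names (files and
# directories) and 10 TB of disk space.
hdfs("dfsadmin", "-setQuota", "1000000", base)
hdfs("dfsadmin", "-setSpaceQuota", "10t", base)
```

Note that the space quota counts replicated bytes, so a 10 TB quota with a replication factor of 3 leaves roughly 3.3 TB for user data.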
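
Registering the metadata is then a one-off Hive DDL statement per folder. The sketch below reuses the hypothetical crm/orders layout from the previous example and assumes a delimited text format and the hive CLI; the database, table and column names are placeholders, only the pattern (an external table over the existing folder, one partition per Year / Month / Day) is the point.

```python
import subprocess

ddl = """
CREATE DATABASE IF NOT EXISTS datalake_crm;

CREATE EXTERNAL TABLE IF NOT EXISTS datalake_crm.orders_v2_raw (
  order_id STRING,
  amount   DOUBLE
)
PARTITIONED BY (yr INT, mth INT, dy INT)   -- short names avoid Hive keywords
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/datalake/acme/crm/v1/orders/v2/raw';

ALTER TABLE datalake_crm.orders_v2_raw
  ADD IF NOT EXISTS PARTITION (yr=2016, mth=1, dy=31)
  LOCATION '/datalake/acme/crm/v1/orders/v2/raw/2016/01/31';
"""

# Register the metadata once; afterwards Hive, HCatalog and Pig can all read it.
subprocess.run(["hive", "-e", ddl], check=True)
```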
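
Finally, a sketch of what a file-level naming standard can buy you. The pattern below (source, dataset, version, date, batch id) is only one hypothetical convention, and the in-memory set stands in for a persistent store (a Hive table, HBase, ZooKeeper, ...); the idea is that the name carries enough lineage to trace a file back to its producer, and the batch id acts as an idempotency key so each delivery is processed exactly once.

```python
import re

# Hypothetical naming standard: <source>_<dataset>_v<version>_<YYYYMMDD>_<batch>.<ext>
NAME_RE = re.compile(
    r"^(?P<source>[a-z0-9]+)_(?P<dataset>[a-z0-9]+)"
    r"_v(?P<version>\d+)_(?P<date>\d{8})_(?P<batch>[a-z0-9-]+)\.(?P<ext>\w+)$"
)

processed_batches = set()  # placeholder for a persistent ledger

def should_process(filename: str) -> bool:
    """Accept only well-named files whose batch id has not been seen before."""
    m = NAME_RE.match(filename)
    if not m:
        raise ValueError(f"File name violates the naming standard: {filename}")
    key = (m["source"], m["dataset"], m["version"], m["batch"])
    if key in processed_batches:
        return False   # already ingested: skip to guarantee one processing
    processed_batches.add(key)
    return True

print(should_process("crm_orders_v2_20160131_b001.csv"))  # True
print(should_process("crm_orders_v2_20160131_b001.csv"))  # False (duplicate batch)
```

On the write side, aim for files whose size is close to (or a multiple of) the block size, and compact or concatenate small files before landing them in the lake.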
And so the data lake becomes an organized and smart raw dump.