- Start by creating a "datalake" directory at the root of HDFS (the first sketch after this list shows the commands)
- Depending on your company / team / project, add sub-folders to represent your organisation
- Create technical users (for example etl, ...)
- Use your HDFS security mechanisms to configure permissions
- Set up quotas (name and space quotas)
- Add a sub-directory level for every
  - source
  - version of the source
  - data type
  - version of the data type
  - data quality
  - partition [Year / Month / Day | Category | Type | Geography]
- Configure workload management (for example YARN queues) for each [technical] user / group
- For every folder, create the metadata [once] so that users can query it [many times] from Hive / HCatalog / Pig / others (see the Hive sketch after this list)
- Keep in mind your HDFS block size [nowadays between 128 MB and 1 GB]
  - in order to avoid the small-files problem
- Use a naming standard at the file level to allow data lineage and to guarantee that each file is processed exactly once (see the naming sketch after this list)
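
Here is a minimal sketch of the setup commands, assuming a node with the hdfs CLI on the PATH and purely hypothetical names (an "acme" organisation, a "crm" source, an "orders" data type, an "etl" technical user and a "datalake" group); adapt the paths, owners and quota values to your own organisation.

```python
import subprocess

def hdfs(*args):
    """Run one hdfs CLI command and fail loudly if it returns a non-zero code."""
    subprocess.run(["hdfs", *args], check=True)

# Root of the lake plus one organisational level (company / team / project).
base = "/datalake/acme"

# One directory level per source / source version / data type / data type
# version / data quality / partition (here Year / Month / Day).
path = f"{base}/crm/v1/orders/v2/raw/2016/01/31"
hdfs("dfs", "-mkdir", "-p", path)

# The technical user "etl" owns the tree; its group can read, others cannot.
hdfs("dfs", "-chown", "-R", "etl:datalake", base)
hdfs("dfs", "-chmod", "-R", "750", base)

# Quotas on the organisation folder: at most 1,000,000 names (files and
# directories) and 10 TB of disk space.
hdfs("dfsadmin", "-setQuota", "1000000", base)
hdfs("dfsadmin", "-setSpaceQuota", "10t", base)
```

Note that the space quota counts replicated bytes, so a 10 TB quota with a replication factor of 3 leaves roughly 3.3 TB for user data.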
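
Registering the metadata is then a one-off Hive DDL statement per folder. The sketch below reuses the hypothetical crm/orders layout from the previous example and assumes a delimited text format and the hive CLI; the database, table and column names are placeholders, only the pattern (an external table over the existing folder, one partition per Year / Month / Day) is the point.

```python
import subprocess

ddl = """
CREATE DATABASE IF NOT EXISTS datalake_crm;

CREATE EXTERNAL TABLE IF NOT EXISTS datalake_crm.orders_v2_raw (
  order_id STRING,
  amount   DOUBLE
)
PARTITIONED BY (yr INT, mth INT, dy INT)   -- short names avoid Hive keywords
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/datalake/acme/crm/v1/orders/v2/raw';

ALTER TABLE datalake_crm.orders_v2_raw
  ADD IF NOT EXISTS PARTITION (yr=2016, mth=1, dy=31)
  LOCATION '/datalake/acme/crm/v1/orders/v2/raw/2016/01/31';
"""

# Register the metadata once; afterwards Hive, HCatalog and Pig can all read it.
subprocess.run(["hive", "-e", ddl], check=True)
```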
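
Finally, a sketch of what a file-level naming standard can buy you. The pattern below (source, dataset, version, date, batch id) is only one hypothetical convention, and the in-memory set stands in for a persistent store (a Hive table, HBase, ZooKeeper, ...); the idea is that the name carries enough lineage to trace a file back to its producer, and the batch id acts as an idempotency key so each delivery is processed exactly once.

```python
import re

# Hypothetical naming standard: <source>_<dataset>_v<version>_<YYYYMMDD>_<batch>.<ext>
NAME_RE = re.compile(
    r"^(?P<source>[a-z0-9]+)_(?P<dataset>[a-z0-9]+)"
    r"_v(?P<version>\d+)_(?P<date>\d{8})_(?P<batch>[a-z0-9-]+)\.(?P<ext>\w+)$"
)

processed_batches = set()  # placeholder for a persistent ledger

def should_process(filename: str) -> bool:
    """Accept only well-named files whose batch id has not been seen before."""
    m = NAME_RE.match(filename)
    if not m:
        raise ValueError(f"File name violates the naming standard: {filename}")
    key = (m["source"], m["dataset"], m["version"], m["batch"])
    if key in processed_batches:
        return False   # already ingested: skip to guarantee one processing
    processed_batches.add(key)
    return True

print(should_process("crm_orders_v2_20160131_b001.csv"))  # True
print(should_process("crm_orders_v2_20160131_b001.csv"))  # False (duplicate batch)
```

On the write side, aim for files whose size is close to (or a multiple of) the block size, and compact or concatenate small files before landing them in the lake.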
And so the data lake becomes an organized and smart raw dump.