PostgreSQL, BI, DWH, Hadoop, DevOps, DataOps, Machine Learning, Cloud, and other topics!
Labels
Administration
Analytics
Architecture
Aster
Automation
Best practice
BI
Bitcoin
Bug
Business Intelligence
CDO
Data visualization
Databases
DataFlow
DataLake
DataMesh
DataOps
Datawarehouse
Detente
development
DevOps
ElasticSearch
enterpr1se 3.0
ETL
Flume
Fun
Games
Git
Google Cloud Platform
Graph Database
Hadoop
Hadoop 2.0
Hbase
Hive
Impala
Informatica
IoT
Java
Javascript
Jazz
Jenkins
Kafka
linux
Machine Learning
Mahout
MapReduce
Meta Development
Monitoring
Mood
Music
Oozie
Optimisation
performance
Pig
Python
Quality
R
Real Time
Scala
scam
Shark
SolR
Spark
SQL
Standards
Statistics
Stinger
Storm
SVN
Talend
Task
TED
Teradata
Thinking
Ubuntu
Useful
Web development
WTF
Yarn
Zeppelin
Zookeeper
Showing posts with label DataLake.
Sunday, October 11, 2015
Meta-development and [automatic] analytics!
It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management/data preparation/analytics by themselves, or at least to make more suggestions.
For most of our BI-like/analytics projects we spend around 70% of the time on data preparation. We humans need time to understand the data, the way it is produced, how to ingest/transform it to get a structure, and how to enrich it to get the semantics/standards for our business users/reporting.
What if I dedicated 20% of the space/CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [an Ambari view] which showed metrics about the data and asked questions such as:
- These files seem to be received as a daily full dump, do you agree?
- This dataset can be mapped to this schema <CREATE TABLE .... schema>
- This column contains 20% NULLs, especially when <column name> = "October"; do you want to create a rule?
- This file contains on average 45,000 lines +/- 5%, except for 3 days
- <column name> can be used to join these two tables; the match rate would be 74%
- This column can be predicted from <analytics cube name> with 70% accuracy; the best model is <model name> and the top variables are <list of variable names>. Can you think of a new variable?
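As a taste of what one such heuristic could look like, here is a minimal sketch in pandas. Everything in it is an illustrative assumption, not an existing product: the function name, the 20% NULL threshold, and the cardinality cut-off used to pick candidate "explaining" columns.

```python
import pandas as pd

def suggest_null_rules(df: pd.DataFrame, threshold: float = 0.20) -> list[str]:
    """Suggest data-quality rules for columns with a high NULL rate,
    overall or conditioned on another low-cardinality column."""
    suggestions = []
    for col in df.columns:
        overall = df[col].isna().mean()
        if overall >= threshold:
            suggestions.append(
                f"'{col}' is {overall:.0%} NULL overall; do you want to create a rule?")
        for other in df.columns:
            # Only condition on category-like columns (illustrative cut-off).
            if other == col or df[other].nunique() > 20:
                continue
            null_rate = df.groupby(other)[col].apply(lambda s: s.isna().mean())
            if null_rate.empty:
                continue
            worst = null_rate.idxmax()
            # Flag values where NULLs are both frequent and clearly above average.
            if null_rate[worst] >= threshold and null_rate[worst] >= 2 * overall:
                suggestions.append(
                    f"'{col}' is {null_rate[worst]:.0%} NULL when {other} = {worst!r}; "
                    "do you want to create a rule?")
    return suggestions
```

On a real platform this kind of job would of course run in the background, over samples of each ingested dataset, and push its questions to the portal rather than return them to a caller.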
Sunday, March 29, 2015
How to structure my datalake?
These days I am working on an interesting topic: how to build a datalake. It can actually be straightforward; this is one possible way to do it:
- Start by creating a "datalake" directory at the root
- Depending on your company / team / project, add sub-folders to represent your organisation
- Create technical users (for example etl...)
- Use your HDFS security system to configure permissions
- Set up quotas
- Add sub-directories for every:
  - source
  - version of the source
  - data type
  - version of the data type
  - data quality
  - partition [Year / Month / Day | Category | Type | Geography] (a sketch of the resulting paths follows this list)
- Configure workload management for each [technical] user / group
- For every folder, create the metadata [once] to allow users to query it [many times] from Hive / HCatalog / Pig / others (a second sketch below shows this step)
- Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
  - in order to avoid the small files problem
- Use naming standards at the file level to allow data lineage and to guarantee each file is processed exactly once
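Here is a minimal sketch of such a layout, assuming an edge node with the hdfs client on the PATH and HDFS superuser rights for the chown and quota calls. The folder names, the etl owner, and the 1t quota are illustrative choices, not a standard:

```python
import subprocess
from datetime import date

def landing_path(org: str, source: str, source_ver: str, dtype: str,
                 dtype_ver: str, quality: str, day: date) -> str:
    """Standardized path: one level per source, source version, data type,
    data-type version and quality, then a Hive-style Year/Month/Day partition."""
    return (f"/datalake/{org}/{source}/v{source_ver}/{dtype}/v{dtype_ver}/"
            f"{quality}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}")

def provision(path: str, owner: str = "etl", quota: str = "1t") -> None:
    # Standard hdfs CLI calls: create the folder, hand it to the technical
    # user, and cap its size with a space quota (needs HDFS superuser rights).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", path], check=True)
    subprocess.run(["hdfs", "dfs", "-chown", f"{owner}:{owner}", path], check=True)
    subprocess.run(["hdfs", "dfsadmin", "-setSpaceQuota", quota, path], check=True)

# Example: the raw landing folder of a hypothetical "sap" source, version 2.
provision(landing_path("finance", "sap", "2", "invoices", "1", "raw",
                       date(2015, 3, 29)))
```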
And so the datalake becomes an organized and smart raw dump.
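To make the metadata step concrete, here is a similarly hedged sketch, assuming beeline is available and a HiveServer2 is reachable at the placeholder URL below; the database, table, columns, and delimiter are all invented for the example:

```python
import subprocess

def register_in_hive(db: str, table: str, path: str, columns: str) -> None:
    """Create the Hive metadata once, so the files can be queried many times
    from Hive / HCatalog / Pig without being moved or copied."""
    ddl = (f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table} ({columns}) "
           f"PARTITIONED BY (year INT, month INT, day INT) "
           f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
           f"LOCATION '{path}'")
    # beeline's -u (JDBC URL) and -e (statement) flags are standard; the URL
    # is a placeholder. New daily folders become visible to Hive only after
    # an ALTER TABLE ... ADD PARTITION or an MSCK REPAIR TABLE.
    subprocess.run(["beeline", "-u", "jdbc:hive2://hiveserver:10000/default",
                    "-e", ddl], check=True)

# Example registration of the table root (the partition folders live below it).
register_in_hive(db="finance", table="sap_invoices_raw",
                 path="/datalake/finance/sap/v2/invoices/v1/raw",
                 columns="invoice_id STRING, amount DOUBLE, booked_on STRING")
```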
Labels:
Architecture,
DataLake,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Standards,
Useful
Location:
Saudi Arabia