PostgreSQL, BI, DWH, Hadoop, DevOps, DataOps, Machine Learning, Cloud and other topics!
Labels
Administration
Analytics
Architecture
Aster
Automation
Best practice
BI
Bitcoin
Bug
Business Intelligence
CDO
Data visualization
Databases
DataFlow
DataLake
DataMesh
DataOps
Datawarehouse
Detente
development
DevOps
ElasticSearch
enterpr1se 3.0
ETL
Flume
Fun
Games
Git
Google Cloud Platform
Graph Database
Hadoop
Hadoop 2.0
Hbase
Hive
Impala
Informatica
IoT
Java
Javascript
Jazz
Jenkins
Kafka
linux
Machine Learning
Mahout
MapReduce
Meta Development
Monitoring
Mood
Music
Oozie
Optimisation
performance
Pig
Python
Quality
R
Real Time
Scala
scam
Shark
SolR
Spark
SQL
Standards
Statistics
Stinger
Storm
SVN
Talend
Task
TED
Teradata
Thinking
Ubuntu
Useful
Web development
WTF
Yarn
Zeppelin
Zookeeper
Showing posts with label Hadoop.
Monday, November 16, 2020
Data Mesh on Google Cloud Platform (and this is excellent!)
Hello from Home 🏡,
Quick update: I am leading the creation of a new deck explaining the Data Mesh architecture and why it is an amazing opportunity to adopt Google Cloud Platform for this approach (a small BigQuery sketch follows the list):
- GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
- Serverless
- Pay as you go
- Google APIs
- Looker (esp. Looker API and the semantic layer)
- Scalability
- BigQuery / Cloud Storage / Cloud Pub/Sub
- Ephemeral Hadoop cluster (Dataproc)
- IAM
- Cloud Source Repositories / Cloud Build
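To illustrate the first point, here is a minimal sketch (my own, with hypothetical project IDs and table names, not from this deck) of a consumer querying two domain-owned BigQuery projects in a single SQL statement, assuming the right IAM grants are in place:

# Minimal sketch: one GCP project per data domain; names are hypothetical.
from google.cloud import bigquery

# Each data domain owns its own project; BigQuery lets a consumer join
# across them with fully qualified table names, subject to IAM grants.
DOMAIN_PROJECTS = {
    "sales": "acme-sales-domain",        # hypothetical project IDs
    "logistics": "acme-logistics-domain",
}

client = bigquery.Client(project="acme-analytics")  # consumer-side project

sql = f"""
SELECT o.order_id, s.shipped_at
FROM `{DOMAIN_PROJECTS['sales']}.curated.orders` AS o
JOIN `{DOMAIN_PROJECTS['logistics']}.curated.shipments` AS s
  ON o.order_id = s.order_id
"""
for row in client.query(sql).result():
    print(row.order_id, row.shipped_at)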
This [new] architecture is not a huge revolution (and this is great): it builds on 40+ years of data platform innovation and follows the same approach as microservices / Kubernetes.
Stay Data Mesh tuned!
Labels:
Architecture,
Best practice,
BI,
Business Intelligence,
DataMesh,
DataOps,
Datawarehouse,
Google Cloud Platform,
Hadoop
Location:
Vanves, France
Wednesday, July 6, 2016
List of Jenkins plugins and configuration for automatic Hadoop deployment!
Configure:
- JDK
- Maven
- Security
- Share the SSH public key from the Jenkins host
Plugins (a remote job-trigger sketch follows this list):
- Locale Plugin (set en_GB, ignore browser preferences and force this language for all users)
- GitHub Plugin (for Git interaction)
- Nested View (to allow grouping job views into multiple levels)
- SafeRestart (This plugin allows you to restart Jenkins safely)
- Conditional BuildStep (lets you define a condition controlling the execution of the step(s))
- Maven Integration plugin (Jenkins plugin for building Maven 2/3 jobs via a special project type)
- JobConfigHistory (Saves copies of all job and system configurations)
- Email-ext plugin (email notification functionality)
- PostgreSQL+Database+Plugin (This is a driver plugin for Database Plugin)
- thinBackup (simply backs up the global and job specific configurations)
- Dynamic Parameter Plug-in (dynamic generation of default build parameter values)
- Plot Plugin (This plugin provides generic plotting (or graphing) capabilities in Jenkins)
- Build Pipeline (Build Pipeline View of upstream and downstream connected jobs)
- View Job Filters (Create smart views with exactly the jobs you want)
- Folder Plugin
- xUnit Plugin
- jUnit Plugin
- R Plugin
- Ansible plugin
- Python Plugin
- Vagrant Plugin
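Once these deployment jobs exist, they can also be triggered from scripts. A minimal sketch using the python-jenkins library (the URL, credentials and job name are placeholders, not from this setup):

# Minimal sketch: trigger a hypothetical parameterised Hadoop deployment job.
import jenkins

server = jenkins.Jenkins(
    "http://jenkins.example.com:8080",  # placeholder URL
    username="deploy-bot",              # placeholder credentials
    password="api-token",
)

# Start the (hypothetical) "hadoop-deploy" job with build parameters.
server.build_job("hadoop-deploy", {"CLUSTER": "dev", "COMPONENT": "hive"})

# Inspect the job afterwards, e.g. its last build metadata.
print(server.get_job_info("hadoop-deploy")["lastBuild"])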
Currently in testing:
Labels:
Administration,
Automation,
Best practice,
Git,
Hadoop,
Hadoop 2.0,
Jenkins,
Meta Development,
Optimisation,
Standards,
Useful
Location:
London, United Kingdom
Sunday, June 26, 2016
Don't forget the basics!
Working as a Hadoop DBA, I have noticed several times that the previous admin forgot to (a quick scripted example follows this list):
- Configure HDFS quotas
- Set up the YARN Capacity Scheduler
- Check that the Configuration Groups are well defined
- Enforce security policies
- Use Jenkins ;-)
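As promised above, a minimal sketch (my own; paths and quota sizes are purely illustrative) of how the quota item can be scripted with the standard hdfs CLI:

# Minimal sketch: set and verify HDFS space quotas on a few directories.
import subprocess

DATALAKE_DIRS = {"/data/sales": "10t", "/data/logs": "50t"}  # hypothetical

for path, quota in DATALAKE_DIRS.items():
    # Set an HDFS space quota on the directory.
    subprocess.run(["hdfs", "dfsadmin", "-setSpaceQuota", quota, path], check=True)
    # Show quota and usage so the change can be verified.
    subprocess.run(["hdfs", "dfs", "-count", "-q", "-h", path], check=True)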
Labels:
Administration,
Architecture,
Best practice,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Standards
Friday, June 10, 2016
My Hadoop is not efficient enough, what can I do?
1. Review your memory configuration to maximize CPU utilisation (see the sizing sketch below)
2. Review your YARN settings, especially the Capacity Scheduler
3. Review your application design, parameters used, join strategy and file format
Of course, do all of this while checking your Ganglia / Ambari Metrics, voilà!
PS: For those who don't trust multi-tenant Hadoop clusters, please call me ;-)
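For point 1, a small back-of-the-envelope helper (illustrative numbers, not a recommendation) to check that the container memory settings actually let YARN use all the cores of a worker node:

# Minimal sketch: how many containers fit on one node with these settings?
node_memory_mb = 96 * 1024      # yarn.nodemanager.resource.memory-mb
node_vcores = 32                # yarn.nodemanager.resource.cpu-vcores
container_memory_mb = 4 * 1024  # e.g. mapreduce.map.memory.mb

containers_by_memory = node_memory_mb // container_memory_mb
containers = min(containers_by_memory, node_vcores)
print(f"{containers} concurrent containers per node "
      f"({containers / node_vcores:.0%} of the vcores used)")
# If this percentage is well below 100%, smaller containers (or more node
# memory given to YARN) will raise CPU utilisation.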
Labels:
Administration,
Architecture,
Best practice,
Hadoop,
Hadoop 2.0,
Hive,
MapReduce,
Optimisation,
performance,
Pig,
Quality,
Spark,
SQL,
Standards,
Useful
Tuesday, April 19, 2016
My Hadoop application gets stuck!
If you are running multiple applications on the same cluster, you can reach a point where you can't allocate any more mappers / reducers / containers, and so some of your applications are waiting for resources and get stuck.
In that case, review your Capacity Scheduler queue settings (capacity and elasticity), check mapreduce.job.reduce.slowstart.completedmaps and enable preemption!
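For illustration, these are the kinds of properties involved, shown here as a plain dict (the queue name and values are examples only, not a recommendation):

# Minimal sketch of the settings I usually review in this situation.
capacity_scheduler_tuning = {
    # Guaranteed vs. maximum (elastic) share of a hypothetical "etl" queue.
    "yarn.scheduler.capacity.root.etl.capacity": "40",
    "yarn.scheduler.capacity.root.etl.maximum-capacity": "70",
    # Don't start reducers before 80% of the mappers have finished, so
    # reducers don't hold containers while waiting for map output.
    "mapreduce.job.reduce.slowstart.completedmaps": "0.8",
    # Enable Capacity Scheduler preemption so over-served queues give
    # containers back to starved ones.
    "yarn.resourcemanager.scheduler.monitor.enable": "true",
}
for prop, value in sorted(capacity_scheduler_tuning.items()):
    print(f"{prop} = {value}")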
Labels:
Best practice,
Hadoop,
Hadoop 2.0,
Optimisation,
Useful
Location:
Old Broad St, London, United Kingdom
Friday, November 6, 2015
Hadoop Mini Clusters!
A nice project here for doing some local testing/development! You can find other interesting projects in the Hortonworks gallery.
Labels:
development,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Kafka,
MapReduce,
Oozie,
Optimisation,
Storm,
Useful,
Yarn,
Zookeeper
Location:
Cork, Ireland
Wednesday, October 21, 2015
Hadoop and version control software!
For those who don't use Ambari, or for those [edge] nodes which are not synced, please be sure to use version control software, so your team / admin will know which libraries / configuration files / links have been modified, by whom, when and why [and life will be easier].
Labels:
Architecture,
development,
Hadoop,
Hadoop 2.0,
Quality,
Standards,
Useful
Location:
Santa Clara, California, United States
Sunday, October 11, 2015
Meta-development and [automatic] analytics!
It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least to make more suggestions.
For most of our BI-like / analytics projects we spend around 70% of the time on data preparation. We humans need time to understand the data, the way it is produced, how to ingest/transform it to get a structure, and how to enrich it to get the semantics/standards for our business users/reporting.
What if I dedicated 20% of the space/CPU of my [Hadoop] platform, and what if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [Ambari view] which shows metrics about the data and asks questions like these (a couple of the heuristics are sketched after the list)?
- These files seem to be received as a daily full dump, do you agree?
- This dataset can be mapped to this schema <CREATE TABLE .... schema>
- This column contains 20% NULLs, especially when <columnname> = "October", do you want to create a rule?
- This file contains on average 45000 lines +/- 5%, except for 3 days
- <column name> can be used to join these two tables, the match rate would be 74%
- This column can be predicted using <analytics cube name> with 70% accuracy, the best model is <model name>, the top variables are <list of variable names>, can you think of a new variable?
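As promised, a toy sketch (pandas, with a hypothetical file name) of the first of these heuristics: per-column NULL rate and daily volume.

# Minimal sketch: profile one daily dump and surface candidate rules.
import pandas as pd

df = pd.read_csv("daily_dump_2015-10-11.csv")  # hypothetical daily file

# Column-level NULL rate: candidates for "do you want to create a rule?"
null_rate = df.isna().mean().sort_values(ascending=False)
print(null_rate[null_rate > 0.2])

# Row-count stability across daily files would flag the "+/- 5%" exceptions.
print(f"{len(df)} lines received today")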
Wednesday, August 26, 2015
Apache Zeppelin!
Apache Zeppelin is a really cool open-source web-based notebook which supports collaborative editing, basic reporting capabilities, data discovery and multiple language backends!
Sunday, March 29, 2015
How to structure my datalake?
Nowadays I am working on an interesting topic: how to build a datalake. Actually it can be straightforward; this is one possible way to do it (a small path-building helper follows the list):
- Start by creating a directory "datalake" at the root
- Depending on your company / team / project, add sub-folders to represent your organisation
- Create technical users (for example etl...)
- Use your HDFS security system to configure permission
- Set up quota
- Add sub-directories for each of the following:
- source
- version of the source
- data type
- version of the data type
- data quality
- partition [Year / Month / Day | Category | Type | Geography]
- Configure workload management for each [technical] user / group
- For every folder, create the metadata once to allow users to query it many times from Hive / HCatalog / Pig / others
- Keep in mind your HDFS block size [nowadays between 128 MB and 1 GB]
- in order to avoid the small files problem
- Use naming standards at the file level to enable data lineage and guarantee that each file is processed only once
And so the datalake becomes an organized, smart raw dump.
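As mentioned above, a minimal path-building helper (all names are made up) following this convention: source / source version / data type / type version / quality zone / date partition.

# Minimal sketch: build a partitioned datalake path from the convention above.
from datetime import date

def datalake_path(source, src_version, data_type, type_version,
                  quality, day: date) -> str:
    return ("/datalake/{src}/{srcv}/{dt}/{dtv}/{q}/"
            "year={y}/month={m:02d}/day={d:02d}").format(
        src=source, srcv=src_version, dt=data_type, dtv=type_version,
        q=quality, y=day.year, m=day.month, d=day.day)

print(datalake_path("crm", "v2", "contacts", "v1", "raw", date(2015, 3, 29)))
# /datalake/crm/v2/contacts/v1/raw/year=2015/month=03/day=29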
Labels:
Architecture,
DataLake,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Standards,
Useful
Location:
Saudi Arabia
Monday, March 16, 2015
There are bugs, but that's normal life!
Hadoop is evolving very fast and sometimes you can find bugs. Be sure to check which bugs are known for your version / component:
Labels:
Architecture,
Bug,
Flume,
Hadoop,
Hadoop 2.0,
Hive,
Kafka,
Mahout,
Pig,
SolR,
Spark,
Task,
Useful
Location:
Antony, France
Sunday, February 8, 2015
Meta-development!
I have been working on BI / IT projects for more than five years now. There are a lot of things to cover:
- Data [capture / quality / preparation / management / usage / monetization]
- Security
- Workload management, scheduling
- Backups strategy
- Optimisation, design pattern, good practices
- SLAs, service & access
- Analytics
- Migration
- Team, role, organization, responsibilities
(Yes, so far you haven't learned anything new.) What I am wondering is why the technology we are using is not smarter.
I think we should be more meta-development oriented: our technology should be able to understand more of the data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by spelling out how to parse a date... The same goes for projects: some tasks should be created automatically. Last but not least, it should be the same for analytics: instead of writing some great SQL-like code, I am looking more for something like "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset" (a trivial sketch follows).
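For instance, a trivial pandas sketch (with a hypothetical dataset name) of what a "Find correlations" primitive could look like:

# Minimal sketch: rank numeric column pairs by absolute correlation.
import pandas as pd

df = pd.read_csv("sales_dataset.csv")  # hypothetical dataset

corr = df.corr(numeric_only=True).abs().unstack().sort_values(ascending=False)
print(corr[corr < 1.0].head(10))  # drop the trivial self-correlations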
It is quite slippery and cold in Stockholm, but have a good weekend ;-)!
Labels:
Analytics,
Architecture,
Business Intelligence,
development,
Hadoop,
Hadoop 2.0,
Machine Learning,
Task,
Thinking
Location:
Stockholm, Sweden
Tuesday, December 30, 2014
Collaborative datalake!
It's the holidays now, so let's relax a little and imagine some fun things!
Why not a collaborative datalake based on Hadoop and web technology, which allows users to share both their datasets and the code used to create them? I would add a vote system too!
Let's see ;-)
Labels:
Architecture,
BI,
Business Intelligence,
DataFlow,
development,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
IoT,
Java,
Pig,
SolR,
Spark,
Standards
Location:
Tainan, Taiwan
Monday, December 29, 2014
Hadoop & Late binding!
Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, CSV, XML, JSON, PDF, JPEG, doc, others) stored in HDFS and to apply a structure on the fly.
Unfortunately, there can be problems linked to data quality. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or the entire job can fail (which is annoying when working with 4 TB of data ;-)).
So my advice is to do a minimum of checking [on the edge node] before data ingestion.
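For example, a minimal pre-ingestion check (my own sketch; the delimiter and expected column count are assumptions) that can run on the edge node before pushing a file to HDFS:

# Minimal sketch: reject files whose rows don't match the expected width.
import csv
import sys

EXPECTED_COLUMNS = 12   # hypothetical target schema width
DELIMITER = ";"

def check_file(path: str) -> bool:
    bad = 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter=DELIMITER), 1):
            if len(row) != EXPECTED_COLUMNS:
                bad += 1
                print(f"line {lineno}: {len(row)} columns", file=sys.stderr)
    return bad == 0

if __name__ == "__main__":
    sys.exit(0 if check_file(sys.argv[1]) else 1)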
Labels:
Architecture,
development,
ETL,
Hadoop,
Hadoop 2.0,
Hive,
Java,
Optimisation,
Pig,
Quality,
Standards,
Stinger,
Useful
Location:
Tainan, Taiwan