
Saturday, September 21, 2019

How to DataOps with Google Cloud Platform !

Hello from Budapest,

It's been a long time since I last had a chance to look after my blog, so quick news: I will now start sharing here again, and the first topic is going to be DataOps on Google Cloud Platform using BigQuery, DataStudio, Jenkins, StackDriver, Google Cloud Storage and more !



Stay Data tuned !

Wednesday, May 30, 2018

Getting Things Done

Hello from Langkawi,

A quick one before boarding. I have had the chance to work with several customers, using several technologies, on several environments and several versions... Things keep changing / evolving, especially when the customer and / or management changes priorities, or when bugs occur in Production (it can happen with Hadoop).

How to keep track of your tasks (from customer questions, to admin / expense tasks, to private items to achieve) ? I tried different ways, from emails, post-its, Wunderlist, Trello and Google Keep, and the only one that worked for me is Todoist.

Why ?
  • Easy to setup recurring tasks
  • Karma / Graph view to see number of tasks achieved per day / per week
  • You can assign a color to each project and also create a hierarchy of projects
  • You can share a project with another user
Hope it will help ;-)
Cheers


Monday, March 20, 2017

Chief DevOps Officer ! #automation

This is the new trendy job; here is what I think his/her mission should be :

  • Automation using DevOps
  • Improving metrics gathering & reporting
  • Quality improvement by Pareto analysis (a quick sketch below)
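
On the Pareto point, a minimal sketch of what I mean (the incident categories and counts are invented for illustration): rank the causes of your production incidents and focus on the few that explain ~80% of them.

    # Toy incident counts per root cause, invented for this sketch.
    incidents = {
        "configuration drift": 120,
        "disk full": 85,
        "bad deployment": 60,
        "network": 20,
        "hardware": 10,
        "other": 5,
    }

    total = sum(incidents.values())
    cumulative = 0
    for category, count in sorted(incidents.items(), key=lambda kv: -kv[1]):
        cumulative += count
        print(f"{category}: {count} incidents ({cumulative / total:.0%} cumulative)")
        if cumulative / total >= 0.80:
            break  # the Pareto cut: a few causes explain ~80% of the incidents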

Tuesday, July 26, 2016

Minimum set of tools to do DevOps with Hadoop !

DevOps is a way of working / a set of frameworks that eases the life of IT teams, from developers to admins / production, in a complex, multi-parameter, collaborative environment. A minimal sketch of how these tools fit together follows the list.
  • Continuous integration tool : Jenkins
  • Build automation tool : Maven
  • Team code collaboration tool : Gerrit Code Review
  • Database : PostgreSQL
    • Admin and application monitoring
  • Visualisation tool : Zeppelin
    • CEO / admin / dev dashboards
  • Versioning tool : Git
    • code
    • configuration
    • template
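
As promised, a minimal sketch in Python of how the pieces fit together: a check script that a Jenkins job could run on every Gerrit change to validate the Hadoop configuration templates versioned in Git, before they are deployed. The file names and required properties are assumptions for illustration.

    import sys
    import xml.etree.ElementTree as ET

    # Hadoop-style configuration files to validate, and the properties each
    # one must define (both are assumptions for this sketch).
    REQUIRED = {
        "core-site.xml": ["fs.defaultFS"],
        "yarn-site.xml": ["yarn.resourcemanager.hostname"],
    }

    def properties(path):
        """Parse a Hadoop <configuration> file into a {name: value} dict."""
        tree = ET.parse(path)
        return {p.findtext("name"): p.findtext("value")
                for p in tree.getroot().iter("property")}

    def main(config_dir):
        failures = []
        for filename, keys in REQUIRED.items():
            props = properties(f"{config_dir}/{filename}")
            failures += [f"{filename}: missing {k}" for k in keys if not props.get(k)]
        if failures:
            print("\n".join(failures))
            sys.exit(1)  # a non-zero exit code fails the Jenkins build

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else ".")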

Wednesday, July 6, 2016

List of Jenkins plugins and configuration for Hadoop automatic deployment !

Configure :
  • JDK
  • Maven
  • Security
  • Share the ssh public key from the Jenkins host (see the sketch after this list)
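
For the ssh key item, a minimal sketch of one way to do it (the hostnames and key path are placeholders): push the Jenkins host's public key to every cluster node so deployment jobs can connect without passwords.

    import subprocess

    # Placeholder cluster nodes; in practice, read these from your inventory.
    NODES = ["master1.example.com", "worker1.example.com", "worker2.example.com"]

    for node in NODES:
        # ssh-copy-id appends the public key to the node's authorized_keys.
        subprocess.run(
            ["ssh-copy-id", "-i", "/var/lib/jenkins/.ssh/id_rsa.pub",
             f"jenkins@{node}"],
            check=True,
        )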
Plugins : 
Currently in testing :

Friday, June 10, 2016

My Hadoop is not efficient enough, what can I do ?

1. Review your memory configuration to maximize CPU utilisation
2. Review your YARN settings especially the Capacity Scheduler
3. Review your application design, parameters used, join strategy, file format

Of course, all of this while checking your Ganglia / Ambari Metrics, voilà !
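
On point 1, here is a minimal sketch of the container-sizing heuristic from the Hortonworks tuning guides (the reserved memory and minimum container size below are simplifying assumptions; the guides give tables per node size):

    def yarn_container_settings(ram_gb, cores, disks, reserved_gb=8,
                                min_container_mb=2048):
        """Suggest YARN / MapReduce memory settings for one worker node."""
        available_mb = (ram_gb - reserved_gb) * 1024
        # containers = min(2 * cores, 1.8 * disks, available RAM / min size)
        containers = int(min(2 * cores, 1.8 * disks,
                             available_mb / min_container_mb))
        ram_per_container = max(min_container_mb, available_mb // containers)
        return {
            "yarn.nodemanager.resource.memory-mb": containers * ram_per_container,
            "yarn.scheduler.minimum-allocation-mb": ram_per_container,
            "mapreduce.map.memory.mb": ram_per_container,
            "mapreduce.reduce.memory.mb": 2 * ram_per_container,
        }

    # Example: a worker node with 128 GB RAM, 16 cores and 8 data disks.
    print(yarn_container_settings(ram_gb=128, cores=16, disks=8))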

PS : For those who don't trust multi-tenant Hadoop clusters, please call me ;-)

Tuesday, April 19, 2016

My Hadoop application gets stuck !

If you are using a multi-application environment, you can reach the point where you can't allocate any more mappers / reducers / containers, and so some of your applications are waiting for resources and get stuck.

In that case, review your Capacity Scheduler queue settings (capacity and elasticity), check mapreduce.job.reduce.slowstart.completedmaps and enable preemption !
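
To see which queues are saturated, a minimal sketch assuming the ResourceManager REST API is reachable on its usual port 8088 (the hostname is a placeholder): walk the Capacity Scheduler queue tree and print how much of each queue's capacity is used.

    import requests

    RM = "http://resourcemanager.example.com:8088"  # placeholder hostname

    info = requests.get(f"{RM}/ws/v1/cluster/scheduler").json()
    root = info["scheduler"]["schedulerInfo"]

    def walk(queue, depth=0):
        """Recursively print usedCapacity for each Capacity Scheduler queue."""
        name = queue.get("queueName", "root")
        used = queue.get("usedCapacity", 0.0)
        print("  " * depth + f"{name}: {used:.0f}% of its capacity used")
        for child in (queue.get("queues") or {}).get("queue", []):
            walk(child, depth + 1)

    walk(root)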

Friday, November 6, 2015

Hadoop Mini Clusters !

A nice project here for doing some local testing / development ! You can find other interesting projects in the Hortonworks gallery.

Sunday, October 11, 2015

Meta-development and [automatic] analytics !

It can sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least to make more suggestions.

For most of our BI-like / analytics projects, we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest / transform it to get a structure, and how to enrich it to get the semantics / standards for our business users / reporting.

What if I dedicated 20% of the space / CPU of my [Hadoop] platform to this ? What if my platform knew some heuristics and made some assumptions ? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions like :

  • Those files seem to be received as a daily full dump, do you agree ?
  • This dataset can be mapped to this schema <CREATE TABLE .... schema>
  • This column contains 20% of NULLs, especially when <columnname> = "October", do you want to create a rule ?
  • This file contains on average 45000 lines +/- 5%, except for 3 days
  • <column name> can be used to join these two tables, the matching will be 74%
  • This column can be predicted using <analytics cube name> with a 70% accuracy, the best model is <model name>, the top variables are <list of variable name>, can you think of a new variable ?
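
To show the kind of heuristic I have in mind for the NULL suggestion above, a minimal sketch with pandas (the file name and the thresholds are invented for illustration): flag columns whose NULL rate is concentrated in a few values of another column, and turn that into a question for the user.

    import pandas as pd

    df = pd.read_csv("daily_dump.csv")  # placeholder dataset

    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate < 0.05:
            continue  # not enough NULLs to be worth a suggestion
        for other in df.columns:
            if other == col or df[other].nunique() > 20:
                continue  # only low-cardinality columns "explain" NULLs well
            by_value = df.groupby(other)[col].apply(lambda s: s.isna().mean())
            for value, rate in by_value[by_value > 2 * null_rate].items():
                print(f"{col} is NULL {null_rate:.0%} overall but {rate:.0%} "
                      f"when {other} = {value!r}, do you want to create a rule ?")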


Monday, December 29, 2014

Hadoop & Late binding !

Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly.

Unfortunately, there may be some problems linked to data quality. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or the entire job can fail (which is annoying when working with 4TB of data ;-))

So my advice is to do a minimum of checking [on the edge node] before data ingestion.
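
A minimal sketch of such a check for a CSV feed (the delimiter and the expected column count are assumptions): reject files with ragged rows on the edge node, instead of discovering the problem 4TB later in a failed job.

    import csv
    import gzip
    import sys

    EXPECTED_COLUMNS = 12  # assumption for illustration

    def check(path, delimiter=";"):
        """Return True if every row has the expected number of columns."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", newline="") as f:
            for lineno, row in enumerate(csv.reader(f, delimiter=delimiter), 1):
                if len(row) != EXPECTED_COLUMNS:
                    print(f"{path}:{lineno}: {len(row)} fields, "
                          f"expected {EXPECTED_COLUMNS}")
                    return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if check(sys.argv[1]) else 1)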

Thursday, July 31, 2014

Partial Redistribution Partial Duplication

PRPD (Partial Redistribution Partial Duplication) is a new feature in Teradata since 14.10 that improves joins on skewed tables (it relies on statistics to identify the skewed values). Instead of hash-redistributing every row, which would overload the AMP that receives a skewed value, only the non-skewed rows are redistributed; the skewed ones are handled by duplicating the matching rows of the other table. This is a smart way to avoid DBAs having to create surrogate keys !
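
Not Teradata's actual implementation, just a toy Python illustration of the idea (the skewed key and AMP count are invented):

    NUM_AMPS = 4
    SKEWED_KEYS = {42}  # identified from column statistics

    def route_large_table_row(key, current_amp):
        """Rows with a skewed key stay local (partial redistribution)."""
        if key in SKEWED_KEYS:
            return [current_amp]
        return [hash(key) % NUM_AMPS]  # normal hash redistribution

    def route_small_table_row(key):
        """Rows matching a skewed key go everywhere (partial duplication)."""
        if key in SKEWED_KEYS:
            return list(range(NUM_AMPS))
        return [hash(key) % NUM_AMPS]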