Lanciaux Maxime | BI | DWH | Hadoop | DevOps | Google Cloud | DataOps | PostgreSQL
PostgreSQL, BI, DWH, Hadoop, DevOps, DataOps, Machine Learning, Cloud and other topics !
Labels
Administration
Analytics
Architecture
Aster
Automation
Best practice
BI
Bitcoin
Bug
Business Intelligence
CDO
Data visualization
Databases
DataFlow
DataLake
DataMesh
DataOps
Datawarehouse
Detente
development
DevOps
ElasticSearch
enterpr1se 3.0
ETL
Flume
Fun
Games
Git
Google Cloud Platform
Graph Database
Hadoop
Hadoop 2.0
Hbase
Hive
Impala
Informatica
IoT
Java
Javascript
Jazz
Jenkins
Kafka
linux
Machine Learning
Mahout
MapReduce
Meta Development
Monitoring
Mood
Music
Oozie
Optimisation
performance
Pig
Python
Quality
R
Real Time
Scala
scam
Shark
SolR
Spark
SQL
Standards
Statistics
Stinger
Storm
SVN
Talend
Task
TED
Teradata
Thinking
Ubuntu
Useful
Web development
WTF
Yarn
Zeppelin
Zookeeper
Monday, November 16, 2020
Data Mesh on Google Cloud Platform (and this is excellent !)
Hello from Home 🏡,
Quick update, as I am leading the creation of a new deck explaining the Data Mesh architecture and why this is an amazing opportunity to adopt Google Cloud Platform for this approach :
- GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
- Serverless
- Pay as you go
- Google APIs
- Looker (esp. Looker API and the semantic layer)
- Scalability
- BigQuery / Cloud Storage / Cloud Pub/Sub
- Ephemeral Hadoop cluster (Dataproc)
- IAM
- Cloud Source Repositories / Cloud Build
This [new] architecture is not a huge revolution (and this is great) : it comes from 40+ years of data platform innovation and it follows the same approach as microservices / Kubernetes.
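To make the project-per-domain mapping concrete, here is a minimal, hypothetical sketch (project, dataset and table names are invented) of a consumer in one data domain querying a dataset published by another domain's GCP project through the BigQuery client library :
# pip install google-cloud-bigquery
from google.cloud import bigquery

# Hypothetical names : each data domain owns its own GCP project and publishes curated datasets there.
consumer_project = "marketing-domain-prj"                   # project running (and paying for) the query
producer_table = "sales-domain-prj.published_sales.orders"  # dataset exposed by another domain

client = bigquery.Client(project=consumer_project)
query = f"SELECT order_date, SUM(amount) AS revenue FROM `{producer_table}` GROUP BY order_date"
for row in client.query(query).result():
    print(row.order_date, row.revenue)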
Stay Data Mesh tuned !
Labels:
Architecture,
Best practice,
BI,
Business Intelligence,
DataMesh,
DataOps,
Datawarehouse,
Google Cloud Platform,
Hadoop
Location:
Vanves, France
Sunday, March 15, 2020
How to DataOps with Google Cloud Platform !
What do we want to achieve ?
Use DataOps to monitor information from Twitter about Google.
- Without doing IaaS (Infrastructure), so using Google Cloud managed services or Serverless technologies
- Making sure all assets are stored in a repository with dev and master branches
- No manual step to test or push content to our Google Cloud Project
- Ensure I can adapt to data structure changes and so replay all data processing from scratch
- Keep all data and compress it
What do we need :
- Use a repository : let's try Cloud Source Repositories (private Git hosted by Google Cloud)
- Schedule basic tasks : Cloud Scheduler (for more complex pipelines we have another option : Cloud Composer)
- Act on commit to the [dev] branch : Cloud Build
- Send messages between tasks : Cloud Pub/Sub
- React on event or notification : Cloud Function
- Store information between tasks : Cloud Datastore
- Read / Write data to multiple sources & targets : Dataflow
- Use the best fully Serverless Datawarehouse : BigQuery
- Monitor technical / business / FinOps related information : Data Studio, Stackdriver
Let's do it !
- Schedule a task every minute to gather tweets from the Twitter API, then store the information in GCS
- Schedule a task every day to compress all previous data in a tar.gz file
- Read the compressed archive and load it into BigQuery with adaptive schema capabilities
- Build the according reporting
More information and code soon !
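In the meantime, a minimal, hypothetical sketch of the first step (bucket name, query and secret handling are placeholders, not the final code) : a Cloud Function, triggered every minute by Cloud Scheduler through Pub/Sub, which calls the Twitter search API and stores the raw JSON in Cloud Storage :
# pip install google-cloud-storage requests
import os, time
import requests
from google.cloud import storage

BUCKET = os.environ.get("RAW_BUCKET", "my-dataops-raw")   # hypothetical bucket name
QUERY = "Google"

def gather_tweets(event, context):
    # Background Cloud Function entry point (Cloud Scheduler -> Pub/Sub -> this function).
    headers = {"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"}
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": QUERY, "result_type": "recent", "count": 100},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    # One object per minute, partitioned by date so the daily compaction job can find them.
    path = time.strftime("tweets/%Y/%m/%d/%H%M.json", time.gmtime())
    storage.Client().bucket(BUCKET).blob(path).upload_from_string(
        resp.text, content_type="application/json")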
Saturday, September 21, 2019
How to DataOps with Google Cloud Platform !
Hello from Budapest,
It has been a long time since I had the chance to look at my blog, so a little news : I will now restart sharing here, and the first topic is going to be DataOps on Google Cloud Platform using BigQuery, Data Studio, Jenkins, Stackdriver, Google Cloud Storage and more !
Stay Data tuned !
Labels:
Analytics,
Architecture,
Automation,
Best practice,
DataFlow,
DevOps,
Meta Development,
Monitoring,
Optimisation,
performance,
Quality,
Standards,
Statistics
Location:
Budapest, Hongrie
Wednesday, May 30, 2018
Getting Things Done
Hello from Langkawi,
A quick one before boarding. I had the chance to work at several customers using several technologies, on several environments and several versions... Things keep changing / evolving, especially when the customer and / or management changes priorities or bugs occur in Production (it can happen with Hadoop).
How to keep track of your tasks (from customer questions, to admin / expense tasks, to private items to achieve) ? I tried different ways, from emails, post-its, Wunderlist, Trello and Google Keep, and the only one that worked for me is Todoist.
Why ?
- Easy to setup recurring tasks
- Karma / Graph view to see the number of tasks achieved per day / per week
- You can assign a color to each project and also create a hierarchy of projects
- Possible to share a project with another user
Hope it will help ;-)
Labels:
Optimisation,
performance,
Quality,
Thinking,
Useful
Monday, May 7, 2018
Tuesday, December 12, 2017
Tuesday, July 18, 2017
Wednesday, April 5, 2017
Monday, March 20, 2017
Chief DevOps Officer ! #automation
This is the new trendy job, and this is what I think his / her mission should be :
- Automation using DevOps
- Improving metrics gathering & reporting
- Quality improvement by Pareto
Labels:
Analytics,
Architecture,
Automation,
Best practice,
Data visualization,
Databases,
DevOps,
enterpr1se 3.0,
Monitoring,
Mood,
Optimisation,
performance,
Quality,
Standards,
Thinking
Location:
La Défense, France
Sunday, March 5, 2017
Friday, March 3, 2017
Wednesday, March 1, 2017
Monday, February 27, 2017
Friday, February 10, 2017
Monday, January 9, 2017
Open source
Labels:
Analytics,
Architecture,
Data visualization,
Machine Learning
Location:
San Diego, Californie, États-Unis
Wednesday, October 12, 2016
How to install Ansible on Ubuntu !
sudo apt-get install software-properties-common
sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible
Monday, October 3, 2016
A normal process while doing automation #enterpr1se 3.0 !
Every time you want to automate an IT process :
- Look for the Pareto (see the sketch after this list)
- Create the according test
- Set up quality control / performance metrics and dashboards
- Gather the logs
- Configure backups
- Ensure the code / configuration is pushed to the configuration management tool
- Be sure to be compliant with your security policies
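To make the "Look for the Pareto" step concrete, a tiny sketch (the incident causes below are invented) that ranks failure causes and stops at the "vital few" covering roughly 80% of incidents :
from collections import Counter

# Hypothetical incident causes extracted from logs / tickets.
causes = ["disk full", "bad config", "disk full", "network", "disk full",
          "bad config", "timeout", "disk full", "network", "disk full"]

counts = Counter(causes).most_common()
total = sum(n for _, n in counts)
cumulative = 0
for cause, n in counts:
    cumulative += n
    print(f"{cause:12s} {n:3d}  {100 * cumulative / total:5.1f}% cumulative")
    if cumulative / total >= 0.8:   # the "vital few" to automate / fix first
        break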
Tuesday, July 26, 2016
Minimum set of tools to do DevOps with Hadoop !
DevOps is a way of working / a set of frameworks to use to ease the life of IT teams, from developers to admins / production, in a complex, multi-parameter, collaborative environment.
- Continuous integration tool : Jenkins
- Build automation tool : Maven
- Team code collaboration tool : Gerrit Code Review
- Database : PostgreSQL
- Admin and application monitoring
- Visualisation tool : Zeppelin
- CEO / admin / dev dashboards
- A versioning tool : Git
- code
- configuration
- template
Labels:
Administration,
Architecture,
Automation,
Best practice,
CDO,
development,
Git,
Jenkins,
Meta Development,
Optimisation,
performance,
Quality,
Standards,
Useful
Location:
Park St, London SE1, Royaume-Uni
Thursday, July 21, 2016
How to install Zeppelin in a few lines !
wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvfz apache-maven-3.3.9-bin.tar.gz
sudo apt-get install -y r-base
sudo apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libcurl4-gnutls-dev
sudo R -e "install.packages('evaluate')"
sudo R -e "install.packages('devtools', dependencies = TRUE)"
export JAVA_HOME=/usr/jdk/jdk1.8.0_60
git clone https://github.com/apache/zeppelin.git
cd zeppelin
../apache-maven-3.3.9/bin/mvn clean package -DskipTests -Pspark-1.6 -Pr -Psparkr
./bin/zeppelin-daemon.sh start
So cool to finally have an easy Open Source tool to do reporting and visualisation !
Wednesday, July 6, 2016
List of Jenkins plugins and configuration for Hadoop automatic deployment !
Configure :
- JDK
- Maven
- Security
- Share the SSH public key from the Jenkins host
Plugins :
- Locale Plugin (en_GB and Ignore browser preference and force this language to all users)
- GitHub Plugin (for Git interaction)
- Nested View (to allow grouping job views into multiple levels)
- SafeRestart (This plugin allows you to restart Jenkins safely)
- Conditional BuildStep (It will allow you to define a condition controlling the execution of the step(s))
- Maven Integration plugin (Jenkins plugin for building Maven 2/3 jobs via a special project type)
- JobConfigHistory (Saves copies of all job and system configurations)
- Email-ext plugin (email notification functionality)
- PostgreSQL Database Plugin (This is a driver plugin for Database Plugin)
- thinBackup (simply backs up the global and job specific configurations)
- Dynamic Parameter Plug-in (dynamic generation of default build parameter values)
- Plot Plugin (This plugin provides generic plotting (or graphing) capabilities in Jenkins)
- Build Pipeline (Build Pipeline View of upstream and downstream connected jobs)
- View Job Filters (Create smart views with exactly the jobs you want)
- Folder Plugin
- xUnit Plugin
- jUnit Plugin
- R Plugin
- Ansible plugin
- Python Plugin
- Vagrant Plugin
Currently in testing :
Labels:
Administration,
Automation,
Best practice,
Git,
Hadoop,
Hadoop 2.0,
Jenkins,
Meta Development,
Optimisation,
Standards,
Useful
Location:
Londres, Royaume-Uni
Sunday, June 26, 2016
Don't forget the basics !
Working as a Hadoop DBA, I have noticed several times that the previous admin forgot to :
- Configure HDFS quotas
- Set up the YARN Capacity Scheduler
- Check that Configuration Groups are well defined
- Enforce security policies
- Use Jenkins ;-)
Labels:
Administration,
Architecture,
Best practice,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Standards
Thursday, June 16, 2016
Simple example of Jenkins-HDP integration !
I just created a how-to about Jenkins-HDP on Hortonworks Community Connection, please vote for it ;-)
Cheers
Labels:
Administration,
Automation,
Best practice,
Meta Development,
Optimisation,
performance,
Standards,
Useful
Location:
Park St, London SE1, Royaume-Uni
Friday, June 10, 2016
My Hadoop is not efficient enough, what can I do ?
1. Review your memory configuration to maximize CPU utilisation
2. Review your YARN settings especially the Capacity Scheduler
3. Review your application design, parameter used, join strategy, file format
Of course, while checking your Ganglia / Ambari Metrics, voilà !
PS : For those who don't trust Multi-tenant Hadoop cluster, please call me ;-)
Labels:
Administration,
Architecture,
Best practice,
Hadoop,
Hadoop 2.0,
Hive,
MapReduce,
Optimisation,
performance,
Pig,
Quality,
Spark,
SQL,
Standards,
Useful
Saturday, May 28, 2016
How to automate Data Analysis ? #part2
Here we go, so I coded a prototype to :
- help parse CSV files (database and JSON support will be added later)
- load the data into Hadoop
- create the corresponding Hive ORC table
- run simple query to extract information
- MIN, MAX, AVG
- Top 10
- COUNT(DISTINCT ), COUNT(*) (if timestamp, by YEAR and YEAR / MONTH) and NULL values
- Regex matching the record
The next step will probably be to add Spark code generation.
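Not the prototype itself, but a hypothetical sketch of the "create the corresponding Hive ORC table" step : derive a CREATE TABLE statement from a CSV header (file and table names are invented, every column typed as STRING for the first load) :
import csv

def hive_orc_ddl(csv_path, table_name):
    # Read only the header line to get the column names.
    with open(csv_path, newline="") as f:
        columns = next(csv.reader(f))
    cols = ",\n  ".join(f"`{c.strip().lower().replace(' ', '_')}` STRING" for c in columns)
    return f"CREATE TABLE {table_name} (\n  {cols}\n)\nSTORED AS ORC;"

print(hive_orc_ddl("events.csv", "events_orc"))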
Tuesday, April 19, 2016
My Hadoop application gets stuck !
If you are using a multi-application environment, you can reach the point where you can't allocate any more mappers / reducers / containers, and so some of your applications are waiting for resources and get stuck.
In that case, review your Capacity Scheduler queue settings (capacity and elasticity), check mapreduce.job.reduce.slowstart.completedmaps and enable preemption !
Labels:
Best practice,
Hadoop,
Hadoop 2.0,
Optimisation,
Useful
Location:
Old Broad St, London, Royaume-Uni
Monday, March 7, 2016
A list of useful R packages !
Here you go :
- sqldf (for selecting from data frames using SQL)
- forecast (for easy forecasting of time series)
- plyr (data aggregation)
- stringr (string manipulation)
- Database connection packages RPostgreSQL, RMySQL, RMongo, RODBC, RSQLite
- lubridate (time and date manipulation)
- ggplot2 (data visualization)
- qcc (statistical quality control and QC charts)
- reshape2 (data restructuring)
- randomForest (random forest predictive models)
- xgboost (Extreme Gradient Boosting)
- RHadoop (Connect R with Hadoop)
And don't forget http://statmethods.net/ !
Saturday, February 27, 2016
Postgres !
PostgreSQL is an excellent Open Source database for small and medium projects. You will find a lot of amazing features like HA, statistics functions, Row Level Security, JSON support and UPSERT.
Labels:
Analytics,
BI,
Business Intelligence,
SQL,
Useful
Location:
Londres, Royaume-Uni
Friday, February 26, 2016
eXtreme Gradient Boosting, XGBoost !
One of the new algorithms is XGBoost, which brings excellent results.
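A minimal sketch using the scikit-learn API of the xgboost Python package (the dataset is just a built-in toy example) to give it a try :
# pip install xgboost scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("accuracy :", accuracy_score(y_test, model.predict(X_test)))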
Labels:
Analytics,
Best practice,
development,
Machine Learning,
Optimisation,
Python,
R,
Standards,
Statistics
Location:
Leeds, Yorkshire de l'Ouest, Royaume-Uni
Tuesday, February 23, 2016
Tableau !
For those who still don't have an amazing reporting tool ;-) : Tableau.
Labels:
Analytics,
BI,
Business Intelligence,
Data visualization,
Useful
Location:
Antony, France
Thursday, January 21, 2016
How to automate Data Analysis ? #part1
When I first started this project, I was wondering how to speed up the work done by analysts. I figured out there is a lot to do here :
- What can be pre-processed (a script to load a file and create the corresponding table, so that as soon as a new file / table is created we get the statistics for every column, the regex, the NULL values)
- What can be automatically discovered (this column can be used to join with this table, you have one more element though ; users in the group [female] have an average .... compared to the group [male])
- Things that can be generated on the fly (mainly code : R code, SAS code, SQL code, mostly based on templates)
- Things that can be parameterized
- How often have I heard "I will try that later with this assumption" ; what if we could easily parameterize every step of the data workflow ? Useful as well for multi-dimension matrix-based test cases
- Things that can be automated / triggered
- I just created a new variable, what does that bring to my workflow ?
- Variable reduction / transformation
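As a small illustration of the "generated on the fly" idea, a hypothetical sketch that reads a CSV header and emits a basic profiling query (distinct count, NULL count, min / max for every column) ; file and table names are invented :
import csv

def profiling_sql(csv_path, table_name):
    # Read only the header line to get the column names.
    with open(csv_path, newline="") as f:
        columns = next(csv.reader(f))
    exprs = []
    for col in columns:
        exprs.append(f"COUNT(DISTINCT {col}) AS {col}_distinct")
        exprs.append(f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS {col}_nulls")
        exprs.append(f"MIN({col}) AS {col}_min")
        exprs.append(f"MAX({col}) AS {col}_max")
    return "SELECT COUNT(*) AS row_count,\n       " + ",\n       ".join(exprs) + f"\nFROM {table_name};"

print(profiling_sql("customers.csv", "stg_customers"))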
Labels:
Analytics,
Architecture,
Automation,
Best practice,
DataFlow,
development
Location:
Leeds, Yorkshire de l'Ouest, Royaume-Uni
Wednesday, January 13, 2016
The HadoopAutomator !
2016 will be the year of automation ; I am currently working on several projects to automate almost everything (from installation to automatic data analysis and reporting), mainly using :
- An orchestration tool : Ansible
- A powerful big data platform : Hortonworks Data Platform
- A database : I like elephants so I will probably go for PostgreSQL
- A programming language : Python (some Java too because of Ambari views)
- Several REST API like Ambari blueprints, WebHCat, Ambari metrics
- And of course, the basic stack : Jenkins, SVN, Git, Maven, SSH, Shell scripts and some web technologies
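For the REST API part, a minimal sketch (host, credentials and the chosen endpoint are placeholders) calling the Ambari API with plain requests to list the hosts of the cluster :
# pip install requests
import requests

AMBARI = "http://ambari-host:8080"        # placeholder host
AUTH = ("admin", "admin")                 # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}    # required by Ambari on write calls, harmless on reads

resp = requests.get(f"{AMBARI}/api/v1/hosts", auth=AUTH, headers=HEADERS, timeout=30)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["Hosts"]["host_name"])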
Labels:
Architecture,
Automation,
Best practice,
development,
Git,
Hadoop 2.0,
Meta Development,
Standards,
SVN,
Thinking,
Useful,
Web development
Location:
Leeds, Yorkshire de l'Ouest, Royaume-Uni
Wednesday, November 11, 2015
Jenkins, Maven, SVN and Hortonworks HDP2.3 sandbox !
If you are also an automation and Open Source fan and you are (or are not yet) in the process of building a Hadoop application, I strongly suggest using (at minimum) :
- Continuous integration tool (Jenkins, TeamCity, Travis CI)
- Build tool (Maven, Ant, Gradle)
- Provisioning tool (Chef, Ansible, shell script, Puppet)
- Versioning system (Git, SVN, CVS)
I have the pleasure to use SVN + Jenkins + Maven + a few shell scripts + the HDP sandbox on my laptop and it is really awesome.
Thanks ;-)
Labels:
Automation,
Best practice,
development,
Git,
Jenkins,
Quality,
Standards,
SVN,
Useful
Location:
Toulouse, France
Friday, November 6, 2015
Hadoop Mini Clusters !
A nice project here to do some local testing / development ! You can find other interesting projects in the Hortonworks gallery.
Labels:
development,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Kafka,
MapReduce,
Oozie,
Optimisation,
Storm,
Useful,
Yarn,
Zookeeper
Location:
Cork, Irlande
Wednesday, October 21, 2015
Hadoop and version control software !
For those who don't use Ambari, or for those [edge] nodes which are not synced, please be sure to use version control software so your team / admin will know which libraries / configuration files / links have been modified, by whom, when and why [and life will be easier].
Labels:
Architecture,
development,
Hadoop,
Hadoop 2.0,
Quality,
Standards,
Useful
Location:
Santa Clara, Californie, États-Unis
Monday, October 19, 2015
Linux | Hadoop is only free if your time has no value !
This famous quote can also apply to Hadoop !
Sunday, October 11, 2015
Meta-development and [automatic] analytics !
It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least to make more suggestions.
For most of our BI-like / analytics projects we spend around 70% of the time on data preparation. We, humans, need time to understand the data, the way it is produced, how to ingest / transform it to get a structure, how to enrich it to get the semantics / standards for our business users / reporting.
What if I dedicated 20% of the space / CPU of my [Hadoop] platform, what if my platform knew some heuristics and made some assumptions ? What if I had an embedded portal [Ambari view] which showed metrics about the data and asked questions ?
- Those files seem to be received as a daily full dump, do you agree ?
- This dataset can be mapped to this schema <CREATE TABLE .... schema>
- This column contains 20% NULLs, especially when this <columnname> = "October" ; do you want to create a rule ?
- This file contains on average 45000 lines +/- 5%, except for 3 days
- <column name> can be used to join these two tables, the matching will be 74%
- This column can be predicted using <analytics cube name> with a 70% accuracy, the best model is <model name>, the top variables are <list of variable names> ; can you think of a new variable ?
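A tiny, hypothetical pandas sketch (file and column names are invented) of two of these heuristics, the NULL-rate warning and the join-matching score :
# pip install pandas
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical daily file
customers = pd.read_csv("customers.csv")

# Heuristic 1 : columns with a high NULL rate deserve a rule / an alert.
null_rate = orders.isna().mean()
print(null_rate[null_rate > 0.2])

# Heuristic 2 : how well would customer_id join the two datasets ?
overlap = orders["customer_id"].isin(customers["customer_id"]).mean()
print(f"join matching on customer_id : {overlap:.0%}")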
Wednesday, August 26, 2015
Apache Zeppelin !
Apache Zeppelin is a really cool open-source web-based notebook which supports collaborative editing, basic reporting capabilities, data discovery and multiple language backends !
Saturday, May 23, 2015
Automatic Statistician !
I am looking forward to seeing the outcome of this project : The Automatic Statistician
Labels:
Analytics,
Machine Learning,
Meta Development,
Optimisation
Location:
Londres, Royaume-Uni
Sunday, March 29, 2015
How to structure my datalake ?
Nowadays I am working on an interesting topic : how to build a datalake. Actually it can be straightforward ; this is one possible way to do it :
- Start by creating a directory "datalake" at the root
- Depending on your company / team / project add sub folder to represent your organisation
- Create technical users (for example etl...)
- Use your HDFS security system to configure permission
- Set up quota
- Add sub-directory for every
- source
- version of the source
- data type
- version of the data type
- data quality
- partition [Year / Month / Day | Category | Type | Geography]
- Configure workload management for each [technical] user / group
- For every folder create [once] the metadata to allow users to [many times] query it from Hive / HCatalog / Pig / others
- Keep in mind your HDFS block size [nowadays between 128MB and 1GB]
- in order to avoid the small file problem
- Use naming standards at the file level to allow data lineage and guarantee single processing
And so the datalake becomes an organized and smart raw dump.
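A small, hypothetical sketch of the folder convention (organisation, source, version and partition names are placeholders), creating the path with the standard hdfs CLI :
import subprocess

def make_datalake_path(org, source, src_version, data_type, year, month, day):
    # /datalake/<organisation>/<source>/<source version>/<data type>/<partitions>
    return (f"/datalake/{org}/{source}/v{src_version}/{data_type}/"
            f"year={year}/month={month:02d}/day={day:02d}")

path = make_datalake_path("retail", "pos_sales", 1, "raw", 2015, 3, 29)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", path], check=True)
print("created", path)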
Labels:
Architecture,
DataLake,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Standards,
Useful
Location:
Arabie saoudite
Monday, March 16, 2015
There are bugs but it is normal life !
Hadoop is evolving very fast and sometimes you can find bugs. Be sure to check which bugs affect your version / component :
Labels:
Architecture,
Bug,
Flume,
Hadoop,
Hadoop 2.0,
Hive,
Kafka,
Mahout,
Pig,
SolR,
Spark,
Task,
Useful
Location:
Antony, France
Saturday, February 14, 2015
Music #3
Sam Smith - Stay With Me | Lorde - Royals | London Grammar - Live Montreux Jazz Festival 2014 | David Guetta - She Wolf | John Newman - Love Me Again | Imagine Dragons - Radioactive | Tom Odell - Another Love | Low Roar - Easy Way Out | Taylor Swift covers Vance Joy's Riptide in the Live Lounge | Passenger - Let Her Go | Ed Sheeran - Thinking Out Loud | Johnny Cash - Hurt
And happy valentine's day ;-) !
Sunday, February 8, 2015
Meta-development !
It has been more than five years that I have been working on BI / IT projects. There are a lot of things to cover :
- Data [capture / quality / preparation / management / usage / monetization]
- Security
- Workload management, scheduling
- Backups strategy
- Optimisation, design pattern, good practices
- SLAs, service & access
- Analytics
- Migration
- Team, role, organization, responsibilities
(Yes, so far you haven't learned anything new.) What I am wondering is why the technology we are using is not smarter.
I think we should be more meta-development oriented : our technology should be able to understand more data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by indicating the way to parse a date... The same goes for projects : some tasks should be created automatically. Last but not least, it should be the same for analytics ; instead of writing some great SQL-like code, I am more looking for a kind of "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset".
It is quite slippery and cold in Stockholm but have a good week-end ;-) !
Labels:
Analytics,
Architecture,
Business Intelligence,
development,
Hadoop,
Hadoop 2.0,
Machine Learning,
Task,
Thinking
Location:
Stockholm, Suède
Friday, January 23, 2015
What makes you a better man / woman ?
Last year I took some time to think about my condition and life. I am quite a motivated human and enjoy living and working hard (and there is a lot to do / to learn).
I am not going to debate the possible answers because it really depends on people / culture / [basic / complex] needs, but I invite you, my dear reader, to think about that.
For me, I would like one day to help mankind and I daily like to feed my brain.
All the best ;-)
Tuesday, December 30, 2014
Collaborative datalake !
It is holidays now so let's relax a little and imagine some funny things !
Why not a collaborative datalake based on Hadoop and web technology which allows users to share both their datasets and the code story to create them ? I would add a vote system too !
Let's see ;-)
Labels:
Architecture,
BI,
Business Intelligence,
DataFlow,
development,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
IoT,
Java,
Pig,
SolR,
Spark,
Standards
Location:
Tainan, Taïwan
Monday, December 29, 2014
Hadoop & Late binding !
Late binding is one of the key capabilities of Hadoop. It allows users to parse raw data (gzip, snappy, bzip2, csv, xml, json, pdf, jpeg, doc, others) stored in HDFS and to apply a structure on the fly.
Unfortunately, there may be some problems linked with data quality. Depending on the data and the SerDe used, you can lose some data, find it in the wrong column, or the entire job can fail (which is annoying when working with 4TB of data ;-))
So my advice is to do a minimum check [on the edge node] before data ingestion.
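For example, a minimal pre-ingestion check (the delimiter, expected field count and date column are hypothetical) to run on the edge node before pushing the file to HDFS :
import csv
from datetime import datetime

EXPECTED_FIELDS = 12          # hypothetical schema width
DATE_COLUMN = 3               # hypothetical position of the event date

def check_file(path, max_errors=100):
    errors = 0
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=";"):
            if len(row) != EXPECTED_FIELDS:
                errors += 1
            else:
                try:
                    datetime.strptime(row[DATE_COLUMN], "%Y-%m-%d")
                except ValueError:
                    errors += 1
            if errors > max_errors:
                return False
    return errors == 0

print("OK to ingest" if check_file("extract.csv") else "reject / quarantine the file")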
Labels:
Architecture,
development,
ETL,
Hadoop,
Hadoop 2.0,
Hive,
Java,
Optimisation,
Pig,
Quality,
Standards,
Stinger,
Useful
Location:
Tainan, Taïwan
Friday, December 19, 2014
I want the last Hadoop improvement / new tool !
For those who are really excited by the latest improvements of Hadoop, I invite them to try them first on a development environment. Sometimes Hadoop is a little too fresh and there are some bugs.
That is also why having Hadoop, most of the time, implies doing a migration every 6 months to get the latest patches.
Labels:
Architecture,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Pig,
Shark,
Spark,
Standards,
Useful
Thursday, December 18, 2014
WebHCat !
WebHCat is the REST API for HCatalog, so for those who want to use REST and submit Hive queries ;-) !
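A quick, hypothetical example (host, user and table are placeholders) submitting a Hive query through WebHCat with plain Python requests ; the call returns a job id you can then poll :
# pip install requests
import requests

WEBHCAT = "http://webhcat-host:50111/templeton/v1"   # default WebHCat port
USER = "hdfs"                                        # placeholder user

resp = requests.post(
    f"{WEBHCAT}/hive",
    params={"user.name": USER},
    data={"execute": "SELECT COUNT(*) FROM my_table;",
          "statusdir": "/tmp/webhcat_out"},
    timeout=30,
)
resp.raise_for_status()
print("submitted job :", resp.json()["id"])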
Wednesday, December 17, 2014
Hadoop Deprecated Properties !
Sometimes it is important to know if your property is deprecated and then which other one to use : Hadoop Deprecated Properties.
Labels:
Architecture,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Pig,
Standards,
Useful
Location:
Prague, République tchèque
Saturday, December 6, 2014
Using Open Data & Machine Learning !
Before, I was not convinced that Open Data brings more value to my projects. Lately, just using Open Data, I was able to build an efficient model to predict the dengue rate in Brazil with the Least Angle Regression algorithm. To do so, we used weather data (wind, temperature, precipitation, thunder / rain rates, ...), altitude, localisation, urbanization, Twitter / Wikipedia frequency and custom variables (mostly lags).
Labels:
Analytics,
Aster,
BI,
Business Intelligence,
Hadoop,
IoT,
Machine Learning,
Useful
Location:
Singapour
Monday, November 17, 2014
IPython Notebook !
After RStudio server, there is also a way to use Python from a website : IPython Notebook.
Friday, October 17, 2014
Hive development !
Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which leads to :
- UPDATE, DELETE
- BEGIN, COMMIT, ROLLBACK
- INSERT ... VALUES
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and others) and Apache Optiq is bringing cost-based optimization to improve performance.
This is really impressive !
Labels:
Analytics,
BI,
Business Intelligence,
Databases,
Datawarehouse,
ETL,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
SQL,
Standards,
Stinger,
Useful
Location:
San Diego, Californie, États-Unis
Friday, October 10, 2014
Meta-development with R !
I am now going deeper into R, which is great ! I recommend looking at these functions when you want to write code which generates and evaluates code directly :
- assign()
- eval()
- parse()
- quote()
- new.env()
It is not the best in terms of performance but it can be really useful for dynamic coding. Have a great week-end ;-) !
Sunday, September 21, 2014
Summingbird !
Last week, I went to a meetup about streaming platforms and there was a great guy who presented Summingbird : a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.
Labels:
Architecture,
development,
Hadoop,
Scala,
Shark,
Storm,
Useful
Location:
San Diego, Californie, États-Unis
Tuesday, August 19, 2014
Unlockyourbrain !
This application is quite nice if you want to improve the way you unlock your phone !
Tuesday, August 5, 2014
Scale Open Source R with AsterR or Teradata 15 !
I recently contributed to a great project which deals with using R in a distributed way within Aster and Teradata. I rediscovered that R is really permissive, flexible and powerful.
Labels:
Analytics,
Architecture,
Aster,
BI,
Business Intelligence,
Data visualization,
development,
ETL,
Machine Learning,
Python,
R,
SQL,
Standards,
Statistics,
Teradata,
Useful
Monday, August 4, 2014
RStudio Server
You perhaps know the RStudio IDE, which is really nice. But if you want to use the RAM and the CPU of another server, you can also install RStudio Server and access your R environment through a browser-based interface, and it rocks !
Labels:
Analytics,
Architecture,
Data visualization,
development,
R,
Useful,
Web development
Saturday, August 2, 2014
Mankind is quite stupid...
I mean, we have done a lot of amazing studies and innovations in the fields of Science, Technology and Health. But the fact that we have created this garbage continent, the way we run most of our businesses by enriching shareholders, or all the political corruption you can discover in daily life... It deserves a very big WTF !
Have a good week-end ;-)
Thursday, July 31, 2014
Partial Redistribution Partial Duplication
PRPD is a new feature in Teradata since 14.10 and it improves joins with skewed tables (it relies on statistics to identify skewed values). This is a smart way to avoid the DBA having to create surrogate keys !
Wednesday, July 30, 2014
NVD3 : Re-usable charts for d3.js
If you don't want to start from scratch with D3.js, have a look at NVD3.js ;-)
Labels:
Analytics,
Data visualization,
Fun,
Javascript,
Statistics,
Useful,
Web development
Tuesday, July 29, 2014
Dataiku !
Dataiku is a French startup which provides a great web-based platform to accelerate data-science projects, and there is an open-source version !
Labels:
Analytics,
Architecture,
Data visualization,
Databases,
development,
ETL,
Hadoop,
Hadoop 2.0,
Hive,
Machine Learning,
Mahout,
Pig,
Python,
Quality,
R,
SQL,
Standards,
Statistics,
Useful
Thursday, July 24, 2014
Is it possible to install R and Python on Android (without rooting) ?
Yes and it is very easy thanks to GNURoot !
Wednesday, July 23, 2014
My Hadoop is not working, what can I do ?
Keep calm and ;-)
- First check your logs
- Is the service running ? (netstat -nat | grep ...)
- Is it possible to access it ? (telnet ip port)
- Is there a problem linked with path, java libraries, environment variable or exec ?
- Am I using the correct user ?
- What is the security system in place ?
- Are nodes well synchronized ?
- What about memory issues ? (swap should also be deactivated)
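The "telnet ip port" check can also be scripted, for example with this small sketch (host and port are placeholders) :
import socket

def port_open(host, port, timeout=3):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. is the NameNode RPC port reachable from this node ?
print(port_open("namenode-host", 8020))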
Monday, July 14, 2014
Virtual Desktop on Windows !
For those who come from Linux or MacOs and would like virtual desktop on Windows :-)
Wednesday, May 14, 2014
Chief Data Officer
I would like to meet people who are working as CDOs : Chief Data Officers. It looks like a very interesting job (data quality, data management, data wwwww) and it should be very helpful for the data preparation I need before running analytics workflows / discovery processes.
Monday, May 5, 2014
Hive development !
A lot of improvements in this new release of Hive !
- [NOT] EXISTS and [NOT] IN are available
- WITH t_table_name AS ... also well known as Common Table Expressions
- SELECT ... WHERE (SELECT c_column1 FROM ...), known as Correlated Subqueries
- The SQL authorization system (GRANT, REVOKE) is now working
- The Tez engine, which can be enabled thanks to set hive.execution.engine=tez;
Thursday, April 24, 2014
Python !
Python is already almost everywhere and used in production at Google. It is a very powerful programming language to map your wish (from Web to GUI) into a script !
Labels:
Analytics,
Aster,
development,
ETL,
Hadoop,
Machine Learning,
Python,
Useful
Location:
Londres, Royaume-Uni
Wednesday, April 23, 2014
HBase coprocessor !
If you need to execute some custom code in your HBase cluster, you can use HBase coprocessor :
- Observers : like triggers in RDBMS
- RegionObserver : To pick up every DML statement : get, scan, put, delete, flush, split
- WALObserver : To intercept WAL writing and reconstruction events
- MasterObserver : To detect every DDL operation
- EndPoints : kind of stored procedure
Labels:
Analytics,
Architecture,
BI,
Business Intelligence,
ETL,
Hadoop,
Hbase,
Useful
Location:
Antony, France
Wednesday, April 2, 2014
Scikit-learn !
Scikit-learn is an open-source machine-learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible !
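A tiny example on the classic iris toy dataset to get a feel for the API :
# pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated accuracy :", cross_val_score(clf, X, y, cv=5).mean())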
Labels:
Analytics,
Business Intelligence,
Machine Learning,
Python,
Statistics,
Useful
Location:
La Défense, France
Monday, March 31, 2014
Teradata & Hadoop !
Teradata and Hadoop interact well together, especially inside the UDA with InfiniBand interconnect. To know which platform to use when, you should look at your needs, where the largest volume is and each platform's capabilities.
If you want to transfer data, you can consider :
- Between Hadoop and Teradata : Teradata Connector for Hadoop and SQL-H
- Between Aster and Hadoop / Teradata : SQL-H
Labels:
Architecture,
BI,
Business Intelligence,
Databases,
Datawarehouse,
ETL,
Hadoop,
Hadoop 2.0,
Standards,
Useful
Location:
Issy-les-Moulineaux, France
Friday, March 28, 2014
Machine Learning with Aster !
I am now working with Aster to do Machine Learning and statistics. Here are the functions you can use :
- Approximate Distinct Count : to quickly estimates the number of distinct values
- Approximate Percentile : to computes approximate percentiles
- Correlation : to determine if one variable is useful for predicting another
- Generalized Linear Regression & Prediction : to perform linear regression analysis
- Principal Component Analysis : for dimensionality reduction
- Simple | Weighted | Exponential Moving Average : compute averages with special algorithms
- K-Nearest Neighbor : classification algorithm based on proximity
- Support Vector Machines : build a SVM model and do prediction
- Confusion Matrix [Plot] : visualize ML algorithm performance
- Kmeans : famous clustering algorithm
- Minhash : another clustering technique which depends on the set of products bought by users
- Naïve Bayes : useful classification method especially for documents
- Random Forest Functions : predictive modelling approaches broadly used for supervised classification learning
Labels:
Analytics,
Aster,
Business Intelligence,
Machine Learning,
Statistics
Location:
Antony, France
Tuesday, March 11, 2014
Teradata’s SNAP Framework !
Teradata's Seamless Network Analytic Processing Framework is one of the great ideas inside the Aster 6 database. It allows users to query different analytical engines and multiple types of storage using a SQL-like programming interface. It is composed of a query optimizer, a layer that integrates and manages resources, an execution engine and the unified SQL interface. These are the main components and their goals :
- SQL-GR & Graph Engine : provide functions to work with edge, vertex, [un|bi|]directed or cyclic graph
- SQL-MR : library (Machine Learning, Statistics, Search behaviour, Pattern matching, Time series, Text analysis, Geo-spatial, Parsing) to process data using MapReduce framework
- SQL-H : easy to use connection to HDFS for loading data from Hadoop
- SQL : join, filter, aggregation, OLAP, insert, update, delete, CASE WHEN, table
- AFS connector : SQL-MR function to map AFS file to table
- Teradata connector : SQL-MR function to load data from / to Teradata RDBMS
- Stream API : plug in your Python, Ruby, Perl, C[|++|#] scripts and use the Aster CPU worker nodes to process them
Labels:
Analytics,
Architecture,
Aster,
BI,
Data visualization,
Databases,
development,
Graph Database,
SQL,
Standards,
Teradata,
Useful
Location:
Antony, France
Sunday, February 23, 2014
Scam #1
I rented a car using locationdevoiture.fr ; they called pretending there was a problem during the website registration and they changed the date, so when I arrived to take the car : no voucher, no reservation... #becareful #scam
Friday, February 21, 2014
Taxi !
I just took a taxi this morning and, because I am a foreigner, the guy took the wrong direction... I like to travel but not before going to work ;-). But thanks to Google Maps and because I remember the price I paid the first day, everything went well. Here is some advice I would like to share :
- Take your phone, use GMap and show the driver where you want to go to
- Don't take a taxi near the tourist spots or your hotel's taxi
- Ask for the (approximate) price before leaving
Saturday, February 15, 2014
My Happiness recipe !
- Travelling every 4 months
- Save money by using smart websites
- Drink Prêt à Manger soup, hot hazelnut chocolate at Starbucks or tea (especially green tea)
- Drive motorbike when it is sunny
- Enjoy your family
- Share knowledge & keep discovering (not only IT)
- Try new food or new restaurants
- Gather with your friends and have some drink ;-)
- Eat healthy (yes you are what you do but also what you eat !)
- Take photos using Instagram and share them with your friends !
- Write down what is important for you
- Read, read, read especially before going to sleep
- Wake up at 07h07, go to sleep at 22h22 (at least try)
- Close your computer now and do some sports ;)
Monday, February 10, 2014
HDP [>] 2.1 natively available applications !
Stack components :
- MapReduce (API v1 & v2) : software framework for processing vast amounts of data
- Tez : more powerful framework for executing DAG (directed acyclic graph) of tasks
- HOYA, HBase on YARN : distributed, column oriented database
- Accumulo : (Linux only) sorted, distributed key / value store
- Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
- HDFS : hadoop distributed file system
- WebHDFS : interact to HDFS using HTTP (no need for library)
- WebHCat : interact to HCatalog using HTTP (no need for library)
- YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
- Oozie : workflow / coordination system
- Mahout : Machine-Learning libraries which use MapReduce for computing
- Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
- Flume : data ingestion and streaming tool
- Sqoop : extract and push down data to databases
- Pig : scripting platform for analyzing large data sets
- Hive : tool to query the data using a SQL-like language
- SolR : platform for indexing and search
- HCatalog : meta-data management service
- Ambari : set up, monitor and configure your Hadoop cluster
- Phoenix : SQL layer over HBase
Components being developed / integrated :
- Spark : in memory engine for large-scale data processing
- Falcon : data management framework
- Knox : single point of secure access for Apache Hadoop clusters (use WebHDFS)
- Storm : distributed realtime computation system
- Kafka : publish-subscribe messaging system
- Giraph : iterative graph processing system
- OpenMPI : high performance message passing library
- S4 : stream computing platform
- Samza : distributed stream processing framework
- R : software programming language for statistical computing and graphics
Labels:
Architecture,
BI,
Business Intelligence,
development,
Hadoop,
Hadoop 2.0,
Hbase,
Hive,
Java,
Kafka,
linux,
Mahout,
Pig,
Real Time,
SolR,
Spark,
Stinger,
Useful
Location:
Istambul, Turquie
Tuesday, February 4, 2014
Basic statistics with R !
Saturday, February 1, 2014
Lego & Chrome !
For now, there are not a lot of pieces, but it can give you great moments with your child / nephew ;-)
Monday, January 27, 2014
Main Hadoop 2.0 daemons !
- NameNode : one per Namespace, stores & handles HDFS metadata
- Secondary NameNode : (for now) still in use if no HA
- Checkpoint node : (later) multiple checkpoint nodes are possible, performs periodic checkpoints
- Backup node / Standby node : allows high availability, keeps an updated copy of the namespace in its memory (if one is used, no checkpoint node is allowed)
- DataNode : stores HDFS data
- ResourceManager : a global pure scheduler
- ApplicationMaster : one per application, negotiates resources with the RM, monitors and requests task execution from the NodeManager
- NodeManager : one per slave server, a task application container launcher and reporting agent
- Application Container : a separate processing unit, it can be a Map, a Reduce or a Storm Bolt, etc
Monday, January 13, 2014
Google Keep !
In 2013, I tried several tools to improve my organisation and found one quite smart & useful : Google Keep. You can create tasks or task lists, add colors, pictures and reminders (date or location), and it is synchronised with your Android device !
Tuesday, January 7, 2014
Hadoop & Java !
Thanks to UD[AT]Fs or MapReduce you can work directly with Java and use your Hadoop resources. Because of the huge number of Java libraries, you can imagine extracting data directly from HTML / XML files, mixing it with reference / parameter data (JDBC loading), and transforming it into Excel files in one job !
Sunday, December 29, 2013
What are your passions ?
I always like to hear about others' passions.
My passions/interests are :
- IT
- Solving problem
- Discovering & Travelling
- Human & Sharing & Realisation
- Health
- Build things to make this world smarter and safer
- Running (marathon for 2014) & sports in general !
- Aviation & aeromodelling & flight simulation
- English
- Motorbiking
- Eating & discovering food
- Movies
- Automatic watches
- Having fun ;-)
So if we meet, tell me yours !
Thursday, December 5, 2013
Decision trees & R | Mahout !
Yesterday, I was asked "how can we visualise what leads to problems ?". To me, one of the best ways is using decision trees with R or Mahout !
And you can do prediction & draw nicely !
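As an aside, the same idea in Python / scikit-learn (not R or Mahout, which this post is about) : fit a small decision tree and print its rules, often enough to "see" what leads to problems :
# pip install scikit-learn
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_wine()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))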
Labels:
Analytics,
BI,
Business Intelligence,
Data visualization,
Mahout,
R,
Statistics,
Useful
Location:
Issy-les-Moulineaux, France
Saturday, November 30, 2013
Lambda architecture
In this post I would like to present one possible software lambda architecture :
Speed layer : Storm, HBase
Storm is the real-time ETL, and HBase, because of its random, real-time read/write capability, is the storage !
Batch layer : Hadoop, HBase, Hive / Pig / [your datawarehouse]
To allow recomputation, just copy your data, har / compress it and plug in a partitioned Hive external table. So you can create complex Hive workflows and why not push some data (statistics, machine learning) to HBase again !
Serving layer : HBase, JEE & JS web application
JEE is convenient because of the HBase Java API and JDBC if you need to cache some ref data. And you can use some JavaScript chart library.
Stay KISS ;-)
Labels:
Architecture,
BI,
Business Intelligence,
ETL,
Hadoop,
Hbase,
Hive,
Java,
Real Time,
Useful
Location:
Antony, France
Friday, October 18, 2013
Hadoop 2.0 !
Apache Hadoop 2.0 was released just a few days ago ! Hadoop is no longer only a MapReduce container but a multi data-framework container, and it provides High Availability, HDFS Federation, NFS and snapshots !
Big-LambData Architecture !
Nathan Marz proposed applying the lambda philosophy to big data architecture, and it can help when you have to solve use cases using batch and real-time processing systems.
Lambda architecture is based on three main design principle :
- human fault-tolerance
- data immutability
- recomputation
Quality Function Deployment !
I like to use Japanese methods ; QFD is one of my favourites for improving / solving complex IT issues !
Sunday, October 13, 2013
Hive development !
Lateral view simplifies the use of UDTF :
SELECT column1, column_udtf1
FROM t_table
LATERAL VIEW explode(array_column) ssreq_lv1 AS column_udtf1
;
And with Hive 0.11, you now have ORC Files and windowing functions :
- LEAD / LAG
- FIRST_VALUE / LAST_VALUE
- RANK / ROW_NUMBER / DENSE_RANK
- CUME_DIST / PERCENT_RANK / NTILE
Friday, September 27, 2013
Business Intelligence & Hadoop
Most of the time BI means a snowflake or star schema (or hybrid or complex ER). But with Hadoop you should rather think about denormalization, a big ODS, powerful ETL, a great place for your fact data and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not !
Location:
Paris, France
Wednesday, September 25, 2013
Hadoop & compression !
Compression with Hadoop is great ! You can reduce IO, network exchange and store more data, and most of the time your Hive/Pig/MapReduce jobs will be a little faster.
Depending on what your needs are, you should think about Snappy, lzo, lz4, bzip or gzip.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Flume daemons !
Both of them (the source and the sink) run asynchronously, with the events staged in the channel.
Labels:
BI,
Business Intelligence,
Data visualization,
ETL,
Flume,
Hadoop,
Java,
Useful
Location:
Dayton, Ohio, États-Unis
Wednesday, September 11, 2013
CouchSurfing !
CouchSurfing is a super cool & free way to travel and meet new friends around the world. I like it and I joined the community !
Monday, September 9, 2013
VirtualBox !
VirtualBox is a virtualization software package, it is very convenient when you want to test OS, to do web development or build your first Hadoop cluster :-) !
Thursday, August 29, 2013
Speculative execution && Hadoop !
I usually disable speculative execution for MapReduce tasks when I write to an RDBMS from a Hive user-defined table function.
set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.reduce.tasks.speculative.execution=false;
And if you tune mapred.reduce.tasks, you can control the number of RDBMS sessions running.
It is also good to use batch mode and to control the commit !
Monday, August 26, 2013
Recursive SQL !
A quite powerful way to handle hierarchical data : recursive SQL !
WITH RECURSIVE tmp_table AS
(
  SELECT column1, column2, ...
  FROM src_table
  WHERE src_table.hierarch_column_id IS NULL
  UNION ALL
  SELECT column1, column2, ...
  FROM src_table
  INNER JOIN tmp_table
  ON src_table.hierarch_column_id = tmp_table.column_id
)
SELECT *
FROM tmp_table
You can also add meta-information like 1 AS n in the first SELECT on src_table and n + 1 in the second one, which lets you filter by level.
R !
R is free software for doing statistics, analytics, machine learning and data visualization.
If you want to start learning R, watch the Google Developers videos and read about machine learning or statistical models. You can find an IDE and an easy way to create web reporting using R and Shiny.
And don't forget library(rmr2) & library(rhdfs) to plug it into Hadoop !
Labels:
Analytics,
BI,
Business Intelligence,
Data visualization,
Hadoop,
R,
Web development
Wednesday, August 21, 2013
Principal components analysis with R !
Monday, July 15, 2013
Hadoop Summit 2013 videos !
I wish I had time to watch them all soon !
Labels:
Business Intelligence,
Data visualization,
ETL,
Hadoop,
Useful
Location:
San José, Californie, États-Unis
Wednesday, May 8, 2013
SolR & ElasticSearch !
SolR and ElasticSearch are both great ways to add search capability (and more) to your projects. And behind both of them, there is Lucene !
Monday, April 29, 2013
What can I do with Mahout ?
- Clustering
- Canopy
- K-Means
- Fuzzy K-Means
- Dirichlet Process
- Latent Dirichlet Allocation
- Mean-shift
- Expectation Maximization
- Spectral
- Minhash
- Top Down
- Classification
- Logistic Regression
- Bayesian
- Support Vector Machines
- Random Forests
- Decision forest
- Machine learning
- Recommendation
- Dimension reduction
- Your own business ! (If you understand how MapReduce and Mahout class work together, you can code your own logic)