
Monday, November 16, 2020

Data Mesh on Google Cloud Platform (and this is excellent!)

Hello from Home 🏡,

Quick update, as I am leading the creation of a new deck explaining the Data Mesh architecture and why Google Cloud Platform is an amazing fit for this approach:
  • GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
  • Serverless
  • Pay as you go
  • Google APIs
  • Looker (esp. Looker API and the semantic layer)
  • Scalability
  • BigQuery / Cloud Storage / Cloud Pub/Sub
  • Ephemeral Hadoop cluster (Dataproc)
  • IAM
  • Cloud Source Repositories / Cloud Build
This [new] architecture is not a huge revolution (and that is great): it comes from 40+ years of data platform innovation and follows the same approach as microservices / Kubernetes.
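As a small illustration of the "GCP project per data domain" idea, here is a minimal Python sketch of a domain team publishing a BigQuery dataset as a data product and granting read access to a consumer team (project, dataset and group names are hypothetical):

    # Publish a "data product": a BigQuery dataset owned by the sales domain,
    # readable by the marketing domain. All names are made up for the example.
    from google.cloud import bigquery

    client = bigquery.Client(project="sales-domain-prj")

    dataset = bigquery.Dataset("sales-domain-prj.orders_data_product")
    dataset.location = "EU"
    dataset = client.create_dataset(dataset, exists_ok=True)

    # Grant read access to the consumer team on the dataset itself.
    entries = list(dataset.access_entries)
    entries.append(bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-team@example.com",
    ))
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])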



Stay Data Mesh tuned!

Saturday, February 27, 2016

Postgres!

PostgreSQL is an excellent open-source database for small and medium projects. You will find a lot of amazing features like high availability, statistical functions, Row Level Security, JSON support and UPSERT.
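For instance, here is a minimal sketch of UPSERT (available since PostgreSQL 9.5) through the psycopg2 driver; table and connection details are hypothetical:

    # Insert a row, or update the quantity if the SKU already exists.
    import psycopg2

    conn = psycopg2.connect("dbname=demo user=demo")
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO inventory (sku, qty)
            VALUES (%s, %s)
            ON CONFLICT (sku) DO UPDATE SET qty = inventory.qty + EXCLUDED.qty
        """, ("ABC-123", 5))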

Tuesday, February 23, 2016

Tableau!

For those who still don't have a great reporting tool ;-): Tableau.

Sunday, October 11, 2015

Meta-development and [automatic] analytics!

It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least to make more suggestions.

On most of our BI-like / analytics projects, we spend around 70% of the time on data preparation. We humans need time to understand the data, the way it is produced, how to ingest / transform it into a structure, and how to enrich it with the semantics / standards our business users and reports need.

What if I dedicated 20% of the space / CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [an Ambari view] showing metrics about the data and asking questions like these (a toy sketch of one such heuristic follows the list):

  • Those files seem to be received as a daily full dump, do you agree?
  • This dataset can be mapped to this schema <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <column name> = "October"; do you want to create a rule?
  • This file contains 45,000 lines on average, +/- 5%, except for 3 days
  • <column name> can be used to join these two tables; the match rate would be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name>, the top variables are <list of variable names>. Can you think of a new variable?
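As a toy illustration of the NULL-rate suggestion above, here is a minimal pandas sketch; the file name and the 20% threshold are made up:

    # Profile the NULL rate of each column and suggest a rule above a threshold.
    import pandas as pd

    def suggest_null_rules(df, threshold=0.2):
        suggestions = []
        for col in df.columns:
            null_rate = df[col].isna().mean()
            if null_rate >= threshold:
                suggestions.append(
                    f"Column '{col}' contains {null_rate:.0%} NULLs, "
                    "do you want to create a rule?"
                )
        return suggestions

    df = pd.read_csv("daily_dump.csv")  # hypothetical daily full dump
    for suggestion in suggest_null_rules(df):
        print(suggestion)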

Sunday, February 8, 2015

Meta-development!

I have been working on BI / IT projects for more than five years now. There are a lot of things to cover:
  • Data [capture / quality / preparation / management / usage / monetization]
  • Security
  • Workload management, scheduling
  • Backups strategy
  • Optimisation, design patterns, good practices
  • SLAs, service & access
  • Analytics
  • Migration
  • Team, role, organization, responsibilities
And technology is evolving very quickly (especially Hadoop). I have seen a lot of smart people working effectively to get it done.

(Yes, so far you haven't learned anything new.) What I am wondering is why the technology we are using is not smarter.

I think we should be more meta-development oriented: our technology should be able to understand more of the data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by indicating how to parse a date... The same goes for projects: some tasks should be created automatically. Last but not least, it should be the same for analytics: instead of writing some great SQL-like code, I am rather looking for a kind of "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset".
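To make it concrete, here is a minimal sketch of what a "Find correlations" primitive could look like, simulated with pandas on a hypothetical dataset:

    # Report every pair of numeric columns whose correlation is strong.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input
    corr = df.corr(numeric_only=True)

    # Keep the upper triangle only, so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    print(pairs[pairs.abs() > 0.8])  # 0.8 is an arbitrary threshold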

It is quite slippery and cold in Stockholm, but have a good weekend ;-)!

Tuesday, December 30, 2014

Collaborative datalake!

It is the holidays now, so let's relax a little and imagine some fun things!

Why not a collaborative data lake, based on Hadoop and web technologies, which would allow users to share both their datasets and the code story behind them? I would add a voting system too!

Let's see ;-)

Saturday, December 6, 2014

Using Open Data & Machine Learning!

I used to be unconvinced that Open Data would bring more value to my projects. Lately, using only Open Data, I was able to build an efficient model to predict the dengue rate in Brazil with the Least Angle Regression algorithm. To do so, we used weather data (wind, temperature, precipitation, thunder / rain rates, ...), altitude, location, urbanization, Twitter / Wikipedia frequency and custom variables (mostly lags).
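Here is a minimal sketch of the modelling step with scikit-learn, assuming the Open Data features have already been joined into one table (file and column names are hypothetical):

    import pandas as pd
    from sklearn.linear_model import Lars
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dengue_features.csv")  # weather, altitude, lags, ...
    X, y = df.drop(columns=["dengue_rate"]), df["dengue_rate"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = Lars().fit(X_train, y_train)  # Least Angle Regression
    print("R^2 on held-out data:", model.score(X_test, y_test))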

Friday, October 17, 2014

Hive development!

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which brings (a quick sketch follows the list):
  • UPDATE, DELETE
  • BEGIN, COMMIT, ROLLBACK
  • INSERT ... VALUES
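As a minimal sketch (PyHive assumed as the client; host and table are hypothetical), note that an ACID table must be bucketed, stored as ORC and flagged as transactional before UPDATE / DELETE work:

    # Assumes the server is configured for transactions
    # (e.g. hive.txn.manager set to the DbTxnManager).
    from pyhive import hive

    conn = hive.connect(host="hive-server", port=10000)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE events (id INT, status STRING)
        CLUSTERED BY (id) INTO 4 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true')
    """)
    cur.execute("INSERT INTO events VALUES (1, 'new')")
    cur.execute("UPDATE events SET status = 'done' WHERE id = 1")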
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and more), and Apache Optiq (since renamed Apache Calcite) is bringing cost-based optimization to improve performance.

This is really impressive!

Tuesday, August 5, 2014

Scale Open Source R with AsterR or Teradata 15!

I recently contributed to a great project about using R in a distributed way within Aster and Teradata. I rediscovered that R is really permissive, flexible and powerful.

Wednesday, April 23, 2014

HBase coprocessor!

If you need to execute some custom code in your HBase cluster, you can use HBase coprocessors:
  • Observers : like triggers in an RDBMS
    • RegionObserver : to intercept every data operation : get, scan, put, delete, flush, split
    • WALObserver : to intercept WAL writing and reconstruction events
    • MasterObserver : to detect every DDL operation
  • Endpoints : a kind of stored procedure

Wednesday, April 2, 2014

Scikit-learn!

Scikit-learn is an open-source machine-learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible!
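A minimal, self-contained taste of the API, fitting and scoring a random forest on the bundled iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print("Accuracy:", clf.score(X_test, y_test))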

Monday, March 31, 2014

Teradata & Hadoop!

Teradata and Hadoop interact well together, especially inside the UDA (Unified Data Architecture) with its InfiniBand interconnect. To decide which platform to use when, you should look at your needs, where the largest volume sits, and each platform's capabilities.

If you want to transfer data, you can consider:

Friday, March 28, 2014

Machine Learning with Aster!

I am now working with Aster to do machine learning and statistics. Here are the functions you can use:
  • Approximate Distinct Count : to quickly estimate the number of distinct values
  • Approximate Percentile : to compute approximate percentiles
  • Correlation : to determine whether one variable is useful for predicting another
  • Generalized Linear Regression & Prediction : to perform linear regression analysis
  • Principal Component Analysis : for dimensionality reduction
  • Simple | Weighted | Exponential Moving Average : to compute averages with a specific algorithm
  • K-Nearest Neighbor : classification algorithm based on proximity
  • Support Vector Machines : to build an SVM model and do predictions
  • Confusion Matrix [Plot] : to visualize ML algorithm performance
  • Kmeans : the famous clustering algorithm
  • Minhash : another clustering technique based on the set of products bought by users
  • Naïve Bayes : a useful classification method, especially for documents
  • Random Forest Functions : a predictive modelling approach broadly used for supervised classification

Monday, February 10, 2014

HDP [>] 2.1 natively available applications!

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAGs (directed acyclic graphs) of tasks
  • HOYA, HBase on YARN : distributed, column-oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop Distributed File System
  • WebHDFS : interact with HDFS over HTTP (no need for a library)
  • WebHCat : interact with HCatalog over HTTP (no need for a library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : machine-learning libraries which use MapReduce for computation
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : transfer data between Hadoop and relational databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • Solr : platform for indexing and search
  • HCatalog : metadata management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (uses WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : programming language for statistical computing and graphics
What else ;-)?

Thursday, December 5, 2013

Decision trees & R | Mahout!

Yesterday I was asked: "how can we visualize what leads to problems?" To me, one of the best ways is to use a decision tree with R or Mahout!

And you can make predictions & draw it nicely!
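If you prefer Python, scikit-learn can fit and draw a decision tree in a few lines (a quick sketch, with iris as a stand-in dataset):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    plot_tree(tree, filled=True)  # visualize what leads to each class
    plt.show()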

Saturday, November 30, 2013

Lambda architecture

In this post I would like to present one possible software stack for a lambda architecture:

Speed layer :  Storm, HBase

Storm is the real-time ETL, and HBase, thanks to its random, real-time read / write capability, is the storage!

Batch layer : Hadoop, HBase, Hive / Pig / [your data warehouse]

To allow recomputation, just copy your data, HAR / compress it and plug in a partitioned Hive external table. Then you can create complex Hive workflows and, why not, push some data (statistics, machine learning) back to HBase!

Serving layer : HBase, JEE & JS web application

JEE is convenient because of the HBase Java API, and because of JDBC if you need to cache some reference data. And you can use a JavaScript chart library.
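To give an idea, here is a minimal Python sketch of the serving-layer read, merging the precomputed batch view with the fresh speed-layer increments at query time (happybase as the HBase client; host, tables and columns are hypothetical):

    import happybase

    conn = happybase.Connection("hbase-host")
    batch = conn.table("batch_view")   # recomputed by the batch layer
    speed = conn.table("speed_view")   # updated by Storm in real time

    def page_views(url):
        b = batch.row(url).get(b"cf:count", b"0")
        s = speed.row(url).get(b"cf:count", b"0")
        return int(b) + int(s)

    print(page_views(b"example.com/home"))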

Stay KISS ;-)

Friday, October 18, 2013

Hadoop 2.0!

Apache Hadoop 2.0 was released just a few days ago! Hadoop is no longer only a MapReduce container but a multi-framework container, and it provides High Availability, HDFS Federation, NFS access and snapshots!

Sunday, October 13, 2013

Kafka!

Kafka is a good solution for high-throughput, large-scale message-processing applications!
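A minimal produce / consume sketch with the kafka-python client; broker address and topic name are hypothetical:

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"hello kafka")
    producer.flush()

    consumer = KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)  # stop after 5s idle
    for message in consumer:
        print(message.value)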

Friday, September 27, 2013

Business Intelligence & Hadoop

Most of the time, BI means a snowflake or star schema (or a hybrid, or a complex ER model). But with Hadoop you should rather think about denormalization, a big ODS, powerful ETL, a great place for your fact data, and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not!

Wednesday, September 25, 2013

Flume daemons!

A Flume agent is composed of three components:
  • Source (consumes events delivered to it by an external source)
  • Channel (temporarily stores the event data and helps provide end-to-end reliability of the flow)
  • Sink (removes the event from the channel and transfers / writes it)
The source and the sink run asynchronously, with the events staged in the channel.
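To give an idea, here is a minimal, hypothetical agent configuration wiring the three components together (netcat source -> memory channel -> logger sink):

    # flume.conf : one agent "a1" with one source, one channel, one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    a1.channels.c1.type = memory

    a1.sinks.k1.type = logger

    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Start it with: flume-ng agent --name a1 --conf-file flume.conf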