
Monday, November 16, 2020

Data Mesh on Google Cloud Platform (and this is excellent!)

Hello from Home 🏡,

Quick update, as I am leading the creation of a new deck explaining the Data Mesh architecture and why Google Cloud Platform is an amazing fit for this approach:
  • GCP Project per Data Domain / Team (<-> direct Data Mesh mapping 👌)
  • Serverless
  • Pay as you go
  • Google APIs
  • Looker (esp. Looker API and the semantic layer)
  • Scalability
  • BigQuery / Cloud Storage / Cloud Pub/Sub
  • Ephemeral Hadoop cluster (Dataproc)
  • IAM
  • Cloud Source Repositories / Cloud Build
This [new] architecture is not a huge revolution (and that is great): it comes from 40+ years of data platform innovation and follows the same approach as microservices / Kubernetes.
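As a small illustration of the "GCP project per data domain" idea, here is a minimal Python sketch of a domain team publishing a BigQuery dataset as a data product and granting read access to a consumer team (project, dataset and group names are hypothetical):

    # Publish a "data product": a BigQuery dataset owned by the sales domain,
    # readable by the marketing domain. All names are made up for the example.
    from google.cloud import bigquery

    client = bigquery.Client(project="sales-domain-prj")

    dataset = bigquery.Dataset("sales-domain-prj.orders_data_product")
    dataset.location = "EU"
    dataset = client.create_dataset(dataset, exists_ok=True)

    # Grant read access to the consumer team on the dataset itself.
    entries = list(dataset.access_entries)
    entries.append(bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-team@example.com",
    ))
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])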



Stay Data Mesh tuned!

Saturday, February 27, 2016

Postgres!

PostgreSQL is an excellent open-source database for small and medium projects. You will find a lot of amazing features like high availability, statistical functions, Row Level Security, JSON support and UPSERT.
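For instance, here is a minimal sketch of UPSERT (available since PostgreSQL 9.5) through the psycopg2 driver; table and connection details are hypothetical:

    # Insert a row, or update the quantity if the SKU already exists.
    import psycopg2

    conn = psycopg2.connect("dbname=demo user=demo")
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO inventory (sku, qty)
            VALUES (%s, %s)
            ON CONFLICT (sku) DO UPDATE SET qty = inventory.qty + EXCLUDED.qty
        """, ("ABC-123", 5))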

Tuesday, February 23, 2016

Tableau!

For those who still don't have a great reporting tool ;-): Tableau.

Sunday, October 11, 2015

Meta-development and [automatic] analytics!

It may sound crazy to some of you, but I think it is time for our IT platforms to become smarter and to start doing some data management / data preparation / analytics by themselves, or at least to make more suggestions.

On most of our BI-like / analytics projects, we spend around 70% of the time on data preparation. We humans need time to understand the data, the way it is produced, how to ingest / transform it into a structure, and how to enrich it with the semantics / standards our business users and reports need.

What if I dedicated 20% of the space / CPU of my [Hadoop] platform to this? What if my platform knew some heuristics and made some assumptions? What if I had an embedded portal [an Ambari view] showing metrics about the data and asking questions like these (a toy sketch of one such heuristic follows the list):

  • Those files seem to be received as a daily full dump, do you agree?
  • This dataset can be mapped to this schema <CREATE TABLE .... schema>
  • This column contains 20% NULL values, especially when <column name> = "October"; do you want to create a rule?
  • This file contains 45,000 lines on average, +/- 5%, except for 3 days
  • <column name> can be used to join these two tables; the match rate would be 74%
  • This column can be predicted using <analytics cube name> with 70% accuracy; the best model is <model name>, the top variables are <list of variable names>. Can you think of a new variable?
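As a toy illustration of the NULL-rate suggestion above, here is a minimal pandas sketch; the file name and the 20% threshold are made up:

    # Profile the NULL rate of each column and suggest a rule above a threshold.
    import pandas as pd

    def suggest_null_rules(df, threshold=0.2):
        suggestions = []
        for col in df.columns:
            null_rate = df[col].isna().mean()
            if null_rate >= threshold:
                suggestions.append(
                    f"Column '{col}' contains {null_rate:.0%} NULLs, "
                    "do you want to create a rule?"
                )
        return suggestions

    df = pd.read_csv("daily_dump.csv")  # hypothetical daily full dump
    for suggestion in suggest_null_rules(df):
        print(suggestion)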

Sunday, February 8, 2015

Meta-development!

I have been working on BI / IT projects for more than five years now. There are a lot of things to cover:
  • Data [capture / quality / preparation / management / usage / monetization]
  • Security
  • Workload management, scheduling
  • Backups strategy
  • Optimisation, design patterns, good practices
  • SLAs, service & access
  • Analytics
  • Migration
  • Team, role, organization, responsibilities
And technology is evolving very quickly (especially Hadoop). I have seen a lot of smart people working effectively to get it done.

(Yes, so far you haven't learned anything new.) What I am wondering is why the technology we are using is not smarter.

I think we should be more meta-development oriented: our technology should be able to understand more of the data and propose patterns to work with it. I don't think we gain any value by rewriting / reconnecting data and systems, or by indicating how to parse a date... The same goes for projects: some tasks should be created automatically. Last but not least, it should be the same for analytics: instead of writing some great SQL-like code, I am rather looking for a kind of "Find correlations", "Identify trends", "Build multi-axis reporting on this dataset".
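To make it concrete, here is a minimal sketch of what a "Find correlations" primitive could look like, simulated with pandas on a hypothetical dataset:

    # Report every pair of numeric columns whose correlation is strong.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input
    corr = df.corr(numeric_only=True)

    # Keep the upper triangle only, so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    print(pairs[pairs.abs() > 0.8])  # 0.8 is an arbitrary threshold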

It is quite slippery and cold in Stockholm, but have a good weekend ;-)!

Tuesday, December 30, 2014

Collaborative datalake!

It is the holidays now, so let's relax a little and imagine some fun things!

Why not a collaborative data lake, based on Hadoop and web technologies, which would allow users to share both their datasets and the code story behind them? I would add a voting system too!

Let's see ;-)

Saturday, December 6, 2014

Using Open Data & Machine Learning!

I used to be unconvinced that Open Data would bring more value to my projects. Lately, using only Open Data, I was able to build an efficient model to predict the dengue rate in Brazil with the Least Angle Regression algorithm. To do so, we used weather data (wind, temperature, precipitation, thunder / rain rates, ...), altitude, location, urbanization, Twitter / Wikipedia frequency and custom variables (mostly lags).
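Here is a minimal sketch of the modelling step with scikit-learn, assuming the Open Data features have already been joined into one table (file and column names are hypothetical):

    import pandas as pd
    from sklearn.linear_model import Lars
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dengue_features.csv")  # weather, altitude, lags, ...
    X, y = df.drop(columns=["dengue_rate"]), df["dengue_rate"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = Lars().fit(X_train, y_train)  # Least Angle Regression
    print("R^2 on held-out data:", model.score(X_test, y_test))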

Friday, October 17, 2014

Hive development!

Hive 0.14 now supports ACID (atomicity, consistency, isolation and durability) transactions, which brings (a quick sketch follows the list):
  • UPDATE, DELETE
  • BEGIN, COMMIT, ROLLBACK
  • INSERT ... VALUES
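As a minimal sketch (PyHive assumed as the client; host and table are hypothetical), note that an ACID table must be bucketed, stored as ORC and flagged as transactional before UPDATE / DELETE work:

    # Assumes the server is configured for transactions
    # (e.g. hive.txn.manager set to the DbTxnManager).
    from pyhive import hive

    conn = hive.connect(host="hive-server", port=10000)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE events (id INT, status STRING)
        CLUSTERED BY (id) INTO 4 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true')
    """)
    cur.execute("INSERT INTO events VALUES (1, 'new')")
    cur.execute("UPDATE events SET status = 'done' WHERE id = 1")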
Stinger.next will bring more SQL compliance (non-equi joins, more sub-queries, materialized views and more), and Apache Optiq (since renamed Apache Calcite) is bringing cost-based optimization to improve performance.

This is really impressive!

Tuesday, August 5, 2014

Scale Open Source R with AsterR or Teradata 15!

I recently contributed to a great project about using R in a distributed way within Aster and Teradata. I rediscovered that R is really permissive, flexible and powerful.

Wednesday, April 23, 2014

HBase coprocessor!

If you need to execute some custom code in your HBase cluster, you can use HBase coprocessors:
  • Observers : like triggers in an RDBMS
    • RegionObserver : to intercept every data operation : get, scan, put, delete, flush, split
    • WALObserver : to intercept WAL writing and reconstruction events
    • MasterObserver : to detect every DDL operation
  • Endpoints : a kind of stored procedure

Wednesday, April 2, 2014

Scikit-learn!

Scikit-learn is an open-source machine-learning library written in Python. It is fast, handles memory well and, thanks to Python, is very flexible!
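A minimal, self-contained taste of the API, fitting and scoring a random forest on the bundled iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print("Accuracy:", clf.score(X_test, y_test))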

Monday, March 31, 2014

Teradata & Hadoop!

Teradata and Hadoop interact well together, especially inside the UDA (Unified Data Architecture) with its InfiniBand interconnect. To decide which platform to use when, you should look at your needs, where the largest volume sits, and each platform's capabilities.

If you want to transfer data, you can consider:

Friday, March 28, 2014

Machine Learning with Aster!

I am now working with Aster to do machine learning and statistics. Here are the functions you can use:
  • Approximate Distinct Count : to quickly estimate the number of distinct values
  • Approximate Percentile : to compute approximate percentiles
  • Correlation : to determine whether one variable is useful for predicting another
  • Generalized Linear Regression & Prediction : to perform linear regression analysis
  • Principal Component Analysis : for dimensionality reduction
  • Simple | Weighted | Exponential Moving Average : to compute averages with a specific algorithm
  • K-Nearest Neighbor : classification algorithm based on proximity
  • Support Vector Machines : to build an SVM model and do predictions
  • Confusion Matrix [Plot] : to visualize ML algorithm performance
  • Kmeans : the famous clustering algorithm
  • Minhash : another clustering technique based on the set of products bought by users
  • Naïve Bayes : a useful classification method, especially for documents
  • Random Forest Functions : a predictive modelling approach broadly used for supervised classification

Monday, February 10, 2014

HDP [>] 2.1 natively available applications!

Stack components :
  • MapReduce (API v1 & v2) : software framework for processing vast amounts of data
  • Tez : more powerful framework for executing DAGs (directed acyclic graphs) of tasks
  • HOYA, HBase on YARN : distributed, column-oriented database
  • Accumulo : (Linux only) sorted, distributed key / value store
  • Hue : web application interface for Hadoop ecosystem (Hive, Pig, HDFS, ...)
  • HDFS : Hadoop Distributed File System
  • WebHDFS : interact with HDFS over HTTP (no need for a library)
  • WebHCat : interact with HCatalog over HTTP (no need for a library)
  • YARN : Yet Another Resource Negotiator, allows more applications to run on Hadoop
  • Oozie : workflow / coordination system
  • Mahout : machine-learning libraries which use MapReduce for computation
  • Zookeeper : centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • Flume : data ingestion and streaming tool
  • Sqoop : transfer data between Hadoop and relational databases
  • Pig : scripting platform for analyzing large data sets
  • Hive : tool to query the data using a SQL-like language
  • Solr : platform for indexing and search
  • HCatalog : metadata management service
  • Ambari : set up, monitor and configure your Hadoop cluster
  • Phoenix : SQL layer over HBase
Components being developed / integrated :
  • Spark : in memory engine for large-scale data processing
  • Falcon : data management framework
  • Knox : single point of secure access for Apache Hadoop clusters (uses WebHDFS)
  • Storm : distributed realtime computation system
  • Kafka : publish-subscribe messaging system
  • Giraph : iterative graph processing system
  • OpenMPI : high performance message passing library
  • S4 : stream computing platform
  • Samza : distributed stream processing framework
  • R : programming language for statistical computing and graphics
What else ;-)?

Thursday, December 5, 2013

Decision trees & R | Mahout!

Yesterday I was asked: "how can we visualize what leads to problems?" To me, one of the best ways is to use a decision tree with R or Mahout!

And you can make predictions & draw it nicely!
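If you prefer Python, scikit-learn can fit and draw a decision tree in a few lines (a quick sketch, with iris as a stand-in dataset):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    plot_tree(tree, filled=True)  # visualize what leads to each class
    plt.show()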

Saturday, November 30, 2013

Lambda architecture

In this post I would like to present one possible software stack for a lambda architecture:

Speed layer :  Storm, HBase

Storm is the real-time ETL, and HBase, thanks to its random, real-time read / write capability, is the storage!

Batch layer : Hadoop, HBase, Hive / Pig / [your data warehouse]

To allow recomputation, just copy your data, HAR / compress it and plug in a partitioned Hive external table. Then you can create complex Hive workflows and, why not, push some data (statistics, machine learning) back to HBase!

Serving layer : HBase, JEE & JS web application

JEE is convenient because of the HBase Java API, and because of JDBC if you need to cache some reference data. And you can use a JavaScript chart library.
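To give an idea, here is a minimal Python sketch of the serving-layer read, merging the precomputed batch view with the fresh speed-layer increments at query time (happybase as the HBase client; host, tables and columns are hypothetical):

    import happybase

    conn = happybase.Connection("hbase-host")
    batch = conn.table("batch_view")   # recomputed by the batch layer
    speed = conn.table("speed_view")   # updated by Storm in real time

    def page_views(url):
        b = batch.row(url).get(b"cf:count", b"0")
        s = speed.row(url).get(b"cf:count", b"0")
        return int(b) + int(s)

    print(page_views(b"example.com/home"))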

Stay KISS ;-)

Friday, October 18, 2013

Hadoop 2.0!

Apache Hadoop 2.0 was released just a few days ago! Hadoop is no longer only a MapReduce container but a multi-framework container, and it provides High Availability, HDFS Federation, NFS access and snapshots!

Sunday, October 13, 2013

Kafka!

Kafka is a good solution for high-throughput, large-scale message-processing applications!
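A minimal produce / consume sketch with the kafka-python client; broker address and topic name are hypothetical:

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"hello kafka")
    producer.flush()

    consumer = KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)  # stop after 5s idle
    for message in consumer:
        print(message.value)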

Friday, September 27, 2013

Business Intelligence & Hadoop

Most of the time, BI means a snowflake or star schema (or a hybrid, or a complex ER model). But with Hadoop you should rather think about denormalization, a big ODS, powerful ETL, a great place for your fact data, and a new way (Hive / Mahout / Pig / Cascading) to tackle your structured / semi-structured / unstructured data, in real time (HBase, Storm, Flume) or not!

Wednesday, September 25, 2013

Flume daemons!

A Flume agent is composed of three components:
  • Source (consumes events delivered to it by an external source)
  • Channel (temporarily stores the event data and helps provide end-to-end reliability of the flow)
  • Sink (removes the event from the channel and transfers / writes it)
The source and the sink run asynchronously, with the events staged in the channel.
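To give an idea, here is a minimal, hypothetical agent configuration wiring the three components together (netcat source -> memory channel -> logger sink):

    # flume.conf : one agent "a1" with one source, one channel, one sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    a1.channels.c1.type = memory

    a1.sinks.k1.type = logger

    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

Start it with: flume-ng agent --name a1 --conf-file flume.conf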