What do we want to achieve?
Use DataOps to monitor information from Twitter about Google.
- Without managing infrastructure (IaaS), so using only Google Cloud managed services or serverless technologies
- Making sure all assets are stored in a repository with dev and master branches
- No manual steps to test or push content to our Google Cloud project
- Being able to adapt to data structure changes by replaying all data processing from scratch
- Keep all data, in compressed form
What do we need:
- Use a repository: let's try Cloud Source Repositories (private Git hosted by Google Cloud)
- Schedule basic tasks: Cloud Scheduler (for more complex pipelines we have another option: Cloud Composer)
- Act on commits to the [dev] branch: Cloud Build
- Send messages between tasks: Cloud Pub/Sub
- React to events or notifications: Cloud Functions
- Store information between tasks: Cloud Datastore
- Read and write data across multiple sources and targets: Dataflow
- Use the best fully serverless data warehouse: BigQuery
- Monitor technical, business, and FinOps information: Data Studio, Stackdriver
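To make the "send a message, then react to it" pairing more concrete, here is a minimal sketch of what a Pub/Sub-triggered function could look like. It relies only on the fact that a Pub/Sub-triggered background function receives the message body base64-encoded under the "data" key. The function names and the "task" and "day" fields are assumptions made up for illustration; they are not part of the project code.

```python
import base64
import json


def decode_pubsub_event(event):
    """Return the JSON payload carried by a Pub/Sub-triggered event."""
    # Pub/Sub delivers the message body base64-encoded under the "data" key.
    raw = base64.b64decode(event.get("data", "")).decode("utf-8")
    return json.loads(raw) if raw else {}


def on_task_message(event, context=None):
    """Hypothetical entry point: decide what to do next from the message."""
    payload = decode_pubsub_event(event)
    # "task" and "day" are illustrative field names, not a fixed contract.
    task = payload.get("task", "unknown")
    day = payload.get("day")
    return f"run {task} for {day}"


# Local simulation of an incoming event, without any Google Cloud dependency.
message = {"task": "compress", "day": "2019-01-01"}
event = {"data": base64.b64encode(json.dumps(message).encode("utf-8")).decode("ascii")}
print(on_task_message(event))
```

Keeping the decoding in its own small function makes it easy to test locally before Cloud Build deploys anything.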
Let's do it!
- Schedule a task every minute to gather tweets from the Twitter API, then store them in GCS
- Schedule a task every day to compress all previous data into a tar.gz file
- Read the compressed archive and load it into BigQuery with adaptive schema capabilities
- Build the corresponding reporting
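The compress and reload steps above can be sketched locally with the Python standard library only. This is a rough illustration, not the actual pipeline: it assumes tweets are stored as one JSON object per line, the file names are invented, and "adaptive schema" is approximated by taking the union of all fields seen across records. Reading from GCS and loading into BigQuery are deliberately left out.

```python
import io
import json
import tarfile


def compress_day(files):
    """Pack {name: bytes} into an in-memory tar.gz archive and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def read_records(archive_bytes):
    """Yield every JSON record (one per line) stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            handle = archive.extractfile(member)
            if handle is None:  # skip directories and special entries
                continue
            for line in handle.read().decode("utf-8").splitlines():
                if line.strip():
                    yield json.loads(line)


def infer_schema(records):
    """Union of all field names seen, mapped to a coarse type name."""
    schema = {}
    for record in records:
        for key, value in record.items():
            # Keep the first type seen; a real loader would reconcile conflicts.
            schema.setdefault(key, type(value).__name__)
    return schema


# Hypothetical per-minute files; the second one carries a new "lang" field.
files = {
    "tweets-0001.json": b'{"id": 1, "text": "hello"}\n',
    "tweets-0002.json": b'{"id": 2, "text": "hi", "lang": "en"}\n',
}
archive_bytes = compress_day(files)
records = list(read_records(archive_bytes))
schema = infer_schema(records)
print(schema)
```

Because the schema is rebuilt from the archived records each time, a new field in the tweets only requires replaying the archives, which matches the goal of reprocessing everything from scratch.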
More information and code soon!