Module 1: Big Data Managed Services in the Cloud

Build ETL Pipelines using Cloud Dataflow
In this topic you will learn how you can use Cloud Dataflow to perform extract, transform, and load operations. Cloud Dataflow offers simplified streaming and batch data processing. It is a processing service based on Apache Beam for developing and executing a range of data processing patterns: extract-transform-load, batch, and streaming. You use Cloud Dataflow to build data pipelines, monitor their execution, and transform and analyse the data. Importantly, the same pipeline, the same code that you write, works for both batch data and streaming data. You will explore pipelines in more detail shortly.

Cloud Dataflow fully automates operational tasks like resource management and performance optimisation for your pipeline. All resources are provided on demand and automatically scale to meet requirements. Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern, or even the complexity of your pipeline. Through its integration with the GCP Console, Cloud Dataflow provides statistics such as pipeline throughput and lag, as well as consolidated log inspection, all in near real time. It also integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing; it is the glue that can hold it all together. It can also be extended to interact with other sources and sinks, such as HDFS.

Google provides quick-start templates that allow you to rapidly deploy a number of useful data pipelines without requiring any Apache Beam programming experience. The templates also remove the need to develop the pipeline code, and therefore the need to consider the management of component dependencies in that pipeline code. You will do a lab later where you will create a streaming pipeline using one of these Google Cloud Dataflow templates.

Let's look at pipelines now in more detail. A pipeline represents a complete process on one or more datasets. The data could be brought in from external data sources, and a series of transformation operations such as filters, joins, and aggregations are applied to that data to give it meaning and to achieve its desired form. This data can then be written to a sink. The sink could be within GCP or external, and it could even be the same as the data source. The pipeline itself is what's called a directed acyclic graph, or DAG.

PCollections are specialised containers of nearly unlimited size that represent a set of data in the pipeline. These datasets can be bounded, also referred to as fixed size, such as national census data, or unbounded, such as a Twitter feed or data from a weather sensor coming in continuously. PCollections are the input and the output of every single transform operation. Transforms are the data processing steps inside your pipeline. They take one or more PCollections, perform an operation that you specify on each element in that collection, and produce one or more PCollections as output. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data down to only the elements you want, or combining data elements into single data values.
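To make the pipeline, PCollection, and transform concepts concrete, here is a minimal sketch using the Apache Beam Python SDK, which is the SDK Cloud Dataflow executes. The sample names and step labels are made up for illustration; running this on Cloud Dataflow rather than locally would additionally require Dataflow pipeline options such as a project, region, and staging bucket.

import apache_beam as beam

# The pipeline is the DAG; each "| label >> transform" step adds a node to it.
with beam.Pipeline() as pipeline:
    names = (
        pipeline
        | "Create" >> beam.Create(["alice", "bob", "amy", "bella"])   # a small bounded PCollection
        | "Normalise" >> beam.Map(str.title)                          # element-wise transform
        | "StartsWithA" >> beam.Filter(lambda n: n.startswith("A"))   # filtering transform
    )
    names | "Print" >> beam.Map(print)   # stand-in for writing to a real sink

Each step consumes a PCollection and produces a new one, which is exactly the transform behaviour described above.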
Source and sink APIs provide functions to read data into and out of collections. The sources act as the roots of the pipeline and the sinks are the endpoints of the pipeline. Cloud Dataflow has a set of built-in sources and sinks, but it is also possible to write sources and sinks for custom data sources too.

Let's look at different pipeline examples to get a sense of the processing capabilities of Cloud Dataflow. In this multiple-transform pipeline example, data read from BigQuery is filtered into two collections based on the initial character of the name. Note that the inputs in these examples could come from a different data source, and that this pipeline does not go so far as to write the output to a sink. In this merged pipeline example, we take the data that was filtered into two collections in the previous multiple-transform example and merge those datasets together (see the code sketch at the end of this topic). This leaves us with a single dataset of names that start with A and B. In this multiple-input pipeline example, data is even joined from different data sources. The job of Cloud Dataflow is to ingest data from one or more sources, if necessary in parallel, transform the data, and then load the data into one or more sinks.

Google services can be used as both a source and a sink. In a simple but real example, the Cloud Dataflow pipeline reads data from a BigQuery table (the source), processes it in various ways (the transforms), and writes its output to Google Cloud Storage (the sink). Some of the transforms in this example are map operations and some are reduce operations. You can build really expressive pipelines. Each step in the pipeline is elastically scaled; there is no need to launch and manage your own cluster. Instead, the service provides all the resources on demand. It has automated and optimised work partitioning built in, which can dynamically rebalance lagging work. That reduces the need to worry about hot keys, that is, situations where disproportionately large chunks of your input get mapped to the same cluster.

We have discussed Cloud Dataproc and Cloud Dataflow as managed service solutions for processing your big data. This flowchart summarises what differentiates one from the other. Both Cloud Dataproc and Cloud Dataflow can perform MapReduce operations. The biggest difference between them is that Cloud Dataproc works similarly to how Hadoop would work on physical infrastructure: you still create a cluster of servers to perform ETL jobs. In the case of Cloud Dataflow, the process is serverless: you provide your Java or Python code and leverage the Apache Beam SDK to perform ETL operations on batch and streaming data in a serverless fashion.
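As a rough illustration of the multiple-transform and merged pipeline examples above, here is a sketch of the branch-and-merge pattern using the Apache Beam Python SDK. The in-memory list of names is a hypothetical stand-in for the BigQuery source, and printing stands in for a sink; on Cloud Dataflow you would typically read with beam.io.ReadFromBigQuery and write with beam.io.WriteToText to a Cloud Storage path instead.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Stand-in source; a real pipeline might use beam.io.ReadFromBigQuery here.
    names = pipeline | "Read" >> beam.Create(["Ann", "Ben", "Carlos", "Bea", "Adam"])

    # Multiple transforms applied to the same PCollection create two branches.
    a_names = names | "FilterA" >> beam.Filter(lambda n: n.startswith("A"))
    b_names = names | "FilterB" >> beam.Filter(lambda n: n.startswith("B"))

    # Flatten merges the branches back into one PCollection of A and B names.
    merged = (a_names, b_names) | "Merge" >> beam.Flatten()
    merged | "Write" >> beam.Map(print)   # stand-in sink for the sketch

Flatten is the Beam transform that merges PCollections of the same type, which matches the merged pipeline example described above.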