Module 1: Big Data Managed Services in the Cloud

Build ETL Pipelines using Cloud Dataflow
In this topic you will learn how you can use Cloud Dataflow to perform extract, transform, and load operations. Cloud Dataflow offers simplified streaming and batch data processing. It is a processing service based on Apache Beam for developing and executing a range of data processing patterns: extract-transform-load, batch, and streaming. You use Cloud Dataflow to build data pipelines, monitor their execution, and transform and analyse the data. Importantly, the same pipeline, the same code that you write, works for both batch data and streaming data. You will explore pipelines in more detail shortly.

Cloud Dataflow fully automates operational tasks like resource management and performance optimisation for your pipeline. All resources are provided on demand and automatically scale to meet requirements. Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern, or even the complexity of your pipeline. Through its integration with the GCP Console, Cloud Dataflow provides statistics such as pipeline throughput and lag, as well as consolidated log inspection, all in near real time. It also integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing; it is the glue that can hold it all together. It can also be extended to interact with other sources and sinks, such as HDFS.

Google provides quick-start templates that allow you to rapidly deploy a number of useful data pipelines without requiring any Apache Beam programming experience. The templates also remove the need to develop the pipeline code, and therefore the need to consider the management of component dependencies in that pipeline code. You will do a lab later where you will create a streaming pipeline using one of these Google Cloud Dataflow templates.

Let's look at pipelines now in more detail. A pipeline represents a complete process on one or more datasets. The data could be brought in from external data sources, and a series of transformation operations such as filters, joins, and aggregations are applied to that data to give it meaning and to achieve its desired form. This data can then be written to a sink. The sink could be within GCP or external, and it could even be the same as the data source. The pipeline itself is what's called a directed acyclic graph, or DAG.

PCollections are specialised containers of nearly unlimited size that represent a set of data in the pipeline. These datasets can be bounded, also referred to as fixed size, such as national census data, or unbounded, such as a Twitter feed or data from a weather sensor coming in continuously. PCollections are the input and the output of every single transform operation. Transforms are the data processing steps inside your pipeline. They take one or more PCollections, perform an operation that you specify on each element in that collection, and produce one or more PCollections as output. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data down to only the elements you want, or combining data elements into single data values.
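To make the pipeline, PCollection, and transform concepts concrete, here is a minimal sketch using the Apache Beam Python SDK, which is the SDK Cloud Dataflow executes. The sample names and step labels are made up for illustration; running this on Cloud Dataflow rather than locally would additionally require Dataflow pipeline options such as a project, region, and staging bucket.

import apache_beam as beam

# The pipeline is the DAG; each "| label >> transform" step adds a node to it.
with beam.Pipeline() as pipeline:
    names = (
        pipeline
        | "Create" >> beam.Create(["alice", "bob", "amy", "bella"])   # a small bounded PCollection
        | "Normalise" >> beam.Map(str.title)                          # element-wise transform
        | "StartsWithA" >> beam.Filter(lambda n: n.startswith("A"))   # filtering transform
    )
    names | "Print" >> beam.Map(print)   # stand-in for writing to a real sink

Each step consumes a PCollection and produces a new one, which is exactly the transform behaviour described above.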
Source and sink APIs provide functions to read data into and out of collections. The sources act as the roots of the pipeline and the sinks are the endpoints of the pipeline. Cloud Dataflow has a set of built-in sources and sinks, but it is also possible to write sources and sinks for custom data sources too.

Let's look at different pipeline examples to get a sense of the processing capabilities of Cloud Dataflow. In this multiple-transform pipeline example, data read from BigQuery is filtered into two collections based on the initial character of the name. Note that the inputs in these examples could come from a different data source, and that this pipeline does not go so far as to write the output to a sink. In this merged pipeline example, we take the data that was filtered into two collections in the previous multiple-transform example and merge those datasets together (see the code sketch at the end of this topic). This leaves us with a single dataset of names that start with A and B. In this multiple-input pipeline example, data is even joined from different data sources. The job of Cloud Dataflow is to ingest data from one or more sources, if necessary in parallel, transform the data, and then load the data into one or more sinks.

Google services can be used as both a source and a sink. In a simple but real example, the Cloud Dataflow pipeline reads data from a BigQuery table (the source), processes it in various ways (the transforms), and writes its output to Google Cloud Storage (the sink). Some of the transforms in this example are map operations and some are reduce operations. You can build really expressive pipelines. Each step in the pipeline is elastically scaled; there is no need to launch and manage your own cluster. Instead, the service provides all the resources on demand. It has automated and optimised work partitioning built in, which can dynamically rebalance lagging work. That reduces the need to worry about hot keys, that is, situations where disproportionately large chunks of your input get mapped to the same cluster.

We have discussed Cloud Dataproc and Cloud Dataflow as managed service solutions for processing your big data. This flowchart summarises what differentiates one from the other. Both Cloud Dataproc and Cloud Dataflow can perform MapReduce operations. The biggest difference between them is that Cloud Dataproc works similarly to how Hadoop would work on physical infrastructure: you still create a cluster of servers to perform ETL jobs. In the case of Cloud Dataflow, the process is serverless: you provide your Java or Python code and leverage the Apache Beam SDK to perform ETL operations on batch and streaming data in a serverless fashion.
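As a rough illustration of the multiple-transform and merged pipeline examples above, here is a sketch of the branch-and-merge pattern using the Apache Beam Python SDK. The in-memory list of names is a hypothetical stand-in for the BigQuery source, and printing stands in for a sink; on Cloud Dataflow you would typically read with beam.io.ReadFromBigQuery and write with beam.io.WriteToText to a Cloud Storage path instead.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    # Stand-in source; a real pipeline might use beam.io.ReadFromBigQuery here.
    names = pipeline | "Read" >> beam.Create(["Ann", "Ben", "Carlos", "Bea", "Adam"])

    # Multiple transforms applied to the same PCollection create two branches.
    a_names = names | "FilterA" >> beam.Filter(lambda n: n.startswith("A"))
    b_names = names | "FilterB" >> beam.Filter(lambda n: n.startswith("B"))

    # Flatten merges the branches back into one PCollection of A and B names.
    merged = (a_names, b_names) | "Merge" >> beam.Flatten()
    merged | "Write" >> beam.Map(print)   # stand-in sink for the sketch

Flatten is the Beam transform that merges PCollections of the same type, which matches the merged pipeline example described above.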