
Module 1: Big Data Managed Services in the Cloud


Leverage Big Data Operations with Cloud Dataproc
In this topic you'll learn more about how Cloud Dataproc provides a fast, easy, cost-effective way to run Apache Hadoop and Apache Spark, open source technologies that often form the backbone of big data processing. Hadoop is a set of tools and technologies that enables a cluster of computers to store and process large volumes of data; it intelligently ties together individual computers in a cluster to distribute the storage and processing of that data. Apache Spark is a unified analytics engine for large-scale data processing that achieves high performance for both batch and streaming data. Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of these open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly and manage them easily, and because clusters are typically run ephemerally, meaning they are short-lived, you save money because they're turned off when you don't need that processing power anymore.

Let's take a look at the key features of Cloud Dataproc. It's priced at 1 cent per virtual CPU per cluster per hour, on top of any other GCP resources that you use. In addition, Cloud Dataproc clusters can include preemptible instances that have lower compute prices, so you use and pay for resources only when you need them. Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less on average. Clusters can be created and scaled quickly with a variety of virtual machine sizes, machine types, numbers of nodes, and networking options. You can use Spark and Hadoop tools, libraries, and documentation with Cloud Dataproc. Cloud Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so there's no need to learn new tools or APIs, and it's possible to move your existing projects or ETL pipelines to Google Cloud without redevelopment. You can easily interact with clusters and Spark or Hadoop jobs, without the assistance of an administrator or special software, through the GCP console, the Cloud SDK, or the Cloud Dataproc REST API. When you're done with a cluster, you simply turn it off, so money isn't spent on an idle cluster. Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools. The built-in integration with Cloud Storage and Cloud Bigtable ensures data will never be lost, even when your cluster is down. This, together with Stackdriver Logging and Stackdriver Monitoring, provides a complete data platform and not just a Spark or Hadoop cluster. For example, you can use Cloud Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for your business reporting needs.

So how does Cloud Dataproc work? You spin up a cluster when it's needed, for example to answer a specific query or to run a specific ETL job. The architecture depicted here provides insight into how the cluster remains separate yet easily integrates with other important functionality, for example logging via Stackdriver, or Cloud Bigtable instead of HBase. This contributes to Cloud Dataproc's ability to run ephemerally, and therefore efficiently and cost-effectively, and the approach allows users to use Hadoop, Spark, Hive, and Pig only when they need them. Again, as mentioned, it takes only 90 seconds on average from the moment a user requests resources before they can submit their first job.
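To make the "create it, use it, turn it off" pattern concrete, here is a minimal sketch using the google-cloud-dataproc Python client library. The project ID, region, machine types, and cluster name are placeholder values rather than part of the course example, and the same steps can equally be done from the GCP console, the Cloud SDK, or the REST API mentioned above.

    # Minimal sketch: create a short-lived Dataproc cluster, then delete it.
    # Assumes the google-cloud-dataproc client library is installed; the project
    # ID, region, machine types, and cluster name below are placeholders.
    from google.cloud import dataproc_v1

    project_id = "my-project"          # hypothetical project
    region = "us-central1"
    cluster_name = "ephemeral-etl"

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }

    # Cluster creation is asynchronous; result() waits for it to finish.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    operation.result()

    # ... submit Spark or Hadoop jobs here ...

    # Turn the cluster off as soon as the work is done so it doesn't sit idle.
    cluster_client.delete_cluster(
        request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
    ).result()

Deleting the cluster as the last step is what keeps it ephemeral, so you only pay while the job is actually running.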
What makes this possible is the separation of storage and compute, which is a real game-changer. With the traditional approach, typical of an on-premises cluster, the storage (the hard drives) is attached to the compute nodes of the cluster, so if a node isn't available due to maintenance, neither is its storage. And because the storage is attached to the same compute nodes that do the processing, there is often contention for resources, for example input and output bottlenecks on the cluster. Cloud Dataproc, on the other hand, relies on storage resources being separated from compute resources: files are stored on Google Cloud Storage and accessed through the Cloud Storage connector, meaning that using Cloud Storage instead of HDFS is as easy as changing the prefix in a script from hdfs:// to gs:// (Google Storage), as shown in the sketch at the end of this section.

Also consider Cloud Dataproc in terms of Hadoop and Spark jobs and workflows. A workflow template allows users to configure and execute one or more jobs. It's important to remember that beyond making the process easier, for example by allowing the user to focus on jobs and view the logs in Stackdriver, users can always access the Hadoop components and applications, such as the YARN web UI, running on a Cloud Dataproc cluster if they want to. For running a cluster only when needed for a given job or to answer a specific query, this architecture shows what's possible and how it can integrate with managed services running outside the cluster, for example logging via Stackdriver, or Cloud Bigtable instead of traditional HBase.

Let's take a look at a few use cases, starting with how Cloud Dataproc can help with log processing. In this example, a customer processes 50 gigabytes of text log data per day from several sources to produce aggregated data that's then loaded into databases, from which metrics are gathered for things like daily reporting, management dashboards, and analysis. Up until now, they have used a dedicated on-premises cluster to store and process their logs with MapReduce. So what's the solution? Firstly, Cloud Storage can act as a landing zone for the log data at low cost. A Cloud Dataproc cluster can then be created in less than two minutes to process this data with the existing MapReduce code, and once the job is complete, the cluster can be removed immediately because it's not needed anymore. In terms of value, instead of running all the time and incurring costs when it's not used, Cloud Dataproc only runs to process those logs, which saves money and reduces overall complexity.

The second use case looks at how Cloud Dataproc can help with ad hoc analysis. In this example, an organisation's analysts are comfortable using the Spark shell; however, their IT department is concerned about the increasing usage and how to scale their cluster, which is running in standalone mode. As a solution, Cloud Dataproc can create clusters that scale for speed and mitigate any single point of failure. Since Cloud Dataproc supports Spark, Spark SQL, and PySpark, they can use the web interface, the Cloud SDK, or the native Spark shell via SSH. In terms of value, it quickly unlocks the power of the cloud for anyone without adding technical complexity; running complex computations now takes seconds instead of minutes or hours on premises.
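As a rough illustration of the prefix change and the log-processing use case above, here is a minimal PySpark sketch of a daily aggregation job. The bucket name, log layout, and severity levels are hypothetical, not taken from the customer example.

    # Minimal PySpark sketch of the "change the prefix" idea: the same job can read
    # from HDFS on-premises or from Cloud Storage on Dataproc; only the path differs.
    # The bucket name, log layout, and log levels here are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-log-aggregation").getOrCreate()

    # On-premises the raw logs might be read from HDFS:
    #   logs = spark.read.text("hdfs:///data/logs/2024-01-01/*.log")
    # On Dataproc, with the Cloud Storage connector, only the prefix changes:
    logs = spark.read.text("gs://example-log-bucket/raw/2024-01-01/*.log")

    # Aggregate log lines by severity level for the daily report.
    counts = (
        logs.select(F.regexp_extract("value", r"\b(INFO|WARN|ERROR)\b", 1).alias("level"))
            .groupBy("level")
            .count()
    )

    # Write the aggregated result back to Cloud Storage for loading into BigQuery.
    counts.write.mode("overwrite").csv("gs://example-log-bucket/reports/2024-01-01/")

Because the data lives in Cloud Storage rather than on the cluster's own disks, the cluster can be deleted as soon as this job finishes without losing anything.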
The third use case looks at how Cloud Dataproc can help with machine learning. In this example, a customer uses the Spark machine learning libraries to run classification algorithms on very large datasets, and they rely on cloud-based machines where they install and customise Spark. Because Spark and its machine learning libraries can be installed on any Cloud Dataproc cluster, the customer can save time by quickly creating Cloud Dataproc clusters, and any additional customisation can be applied easily to the entire cluster through what are called initialization actions. To keep an eye on workflows, they can use the built-in Stackdriver logging and monitoring solutions. In terms of value, resources can be focused on the data with Cloud Dataproc, not spent on things like cluster creation and management, and integrations with other GCP products can unlock new features for your Spark clusters.
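As a rough sketch of that machine learning scenario, the snippet below trains a simple classifier with Spark's MLlib on a Dataproc cluster, reading training data from Cloud Storage. The bucket, Parquet layout, and column names are hypothetical.

    # Minimal sketch of the machine learning use case: a classification algorithm
    # from Spark's MLlib reading training data from Cloud Storage. The bucket,
    # file layout, and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("classification-training").getOrCreate()

    # Hypothetical training data stored as Parquet in Cloud Storage.
    df = spark.read.parquet("gs://example-ml-bucket/training-data/")

    # Assemble the numeric feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    train = assembler.transform(df)

    # Fit a simple classifier and persist the model back to Cloud Storage.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.write().overwrite().save("gs://example-ml-bucket/models/logreg")

Any extra Python packages a job like this depends on could be installed across the whole cluster with an initialization action at cluster creation time.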