Migrating On-Premises Hadoop Infrastructure to Google Cloud Platform
This guide provides an overview of how to move your on-premises Apache Hadoop system to Google Cloud Platform (GCP). It describes a migration process that not only moves your Hadoop work to GCP, but also enables you to adapt your work to take advantage of the benefits of a Hadoop system optimized for cloud computing. It also introduces some fundamental concepts you need to understand in order to translate your Hadoop configuration to GCP.

This is the first of three guides describing how to move from on-premises Hadoop:

- This guide, which provides context and planning advice for your migration.
- Migrating HDFS Data from On-Premises to Google Cloud Platform provides additional context for incrementally moving your data to GCP.
- Migrating Hadoop Jobs from On-Premises to Google Cloud Platform describes the process of running your jobs on Cloud Dataproc and other GCP products.

The benefits of migrating to GCP

There are many ways in which using GCP can save you time, money, and effort compared to using an on-premises Hadoop solution. In many cases, adopting a cloud-based approach can make your overall solution simpler and easier to manage.

Built-in support for Hadoop

GCP includes Cloud Dataproc, which is a managed Hadoop and Spark environment. You can use Cloud Dataproc to run most of your existing jobs with minimal alteration, so you don't need to move away from all of the Hadoop tools you already know.

Managed hardware and configuration

When you run Hadoop on GCP, you never need to worry about physical hardware. You specify the configuration of your cluster, and Cloud Dataproc allocates resources for you. You can scale your cluster at any time.

Simplified version management

Keeping open source tools up to date and working together is one of the most complex parts of managing a Hadoop cluster. When you use Cloud Dataproc, much of that work is managed for you by Cloud Dataproc versioning.
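As a concrete illustration of managed configuration and versioning, the following sketch creates a Dataproc cluster with an explicit machine configuration and a pinned image version, then resizes it. The cluster name, region, machine types, and worker counts are placeholder values, not recommendations; substitute your own.

```shell
# Hypothetical example: create a managed Dataproc cluster.
# The --image-version flag pins a tested bundle of Hadoop/Spark components,
# which is how Cloud Dataproc versioning handles tool compatibility for you.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=2.1-debian11 \
    --master-machine-type=n1-standard-4 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4

# Scale the same cluster later without touching any physical hardware:
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=4
```

These commands require an active GCP project and credentials, so treat them as a template rather than a copy-paste recipe.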
Flexible job configuration

A typical on-premises Hadoop setup uses a single cluster that serves many purposes. When you move to GCP, you can focus on individual tasks, creating as many clusters as you need. This removes much of the complexity of maintaining a single cluster with growing dependencies and software configuration interactions.

Planning your migration

Migrating from an on-premises Hadoop solution to GCP requires a shift in approach. A typical on-premises Hadoop system consists of a monolithic cluster that supports many workloads, often across multiple business areas. As a result, the system becomes more complex over time and can require administrators to make compromises to get everything working in the monolithic cluster. When you move your Hadoop system to GCP, you can reduce the administrative complexity. However, to get that simplification, and to get the most efficient processing in GCP at minimal cost, you need to rethink how to structure your data and jobs.

Because Cloud Dataproc runs Hadoop on GCP, using a persistent Cloud Dataproc cluster to replicate your on-premises setup might seem like the easiest solution. However, there are some limitations to that approach:

- Keeping your data in a persistent HDFS cluster using Cloud Dataproc is more expensive than storing your data in Cloud Storage, which is what we recommend, as explained later. Keeping data in an HDFS cluster also limits your ability to use your data with other GCP products.
- Using open source tools on GCP is often not as efficient or economical as using similar GCP services.
- Using a single, persistent Cloud Dataproc cluster for your jobs is more difficult to manage than shifting to targeted clusters that serve individual jobs or job areas.
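In practice, moving data out of a persistent HDFS cluster and into Cloud Storage amounts to copying it once and then pointing jobs at gs:// URIs instead of hdfs:// paths. The following sketch shows one way to do this; the bucket, paths, class name, and jar are hypothetical placeholders.

```shell
# Hypothetical sketch: move a dataset from cluster HDFS to Cloud Storage.
# Run the copy on the cluster; Dataproc nodes ship with the Cloud Storage
# connector, so hadoop tools can address gs:// URIs directly.
hadoop distcp hdfs:///user/example/input gs://example-bucket/input

# Jobs then read and write gs:// paths, so the data outlives any
# individual cluster and stays usable by other GCP products:
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=com.example.WordCount \
    --jars=gs://example-bucket/jars/wordcount.jar \
    -- gs://example-bucket/input gs://example-bucket/output
```

Because the data now lives outside the cluster, the cluster itself becomes disposable, which is the premise of the ephemeral model described next.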
The most cost-effective and flexible way to migrate your Hadoop system to GCP is to shift away from thinking in terms of large, multi-purpose, persistent clusters and instead think about small, short-lived clusters that are designed to run specific jobs. You store your data in Cloud Storage to support multiple, temporary processing clusters. This model is often called the ephemeral model, because the clusters you use for processing jobs are allocated as needed and are released as jobs finish. The following diagram shows a hypothetical migration from an on-premises system to an ephemeral model on GCP.
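The ephemeral model described above can be sketched as a create-run-delete sequence. All names, regions, and sizes below are hypothetical placeholders, and the `--max-idle` flag is included only as a safety net against forgotten clusters.

```shell
# Hypothetical sketch of the ephemeral model: allocate a cluster for one
# job, run the job against data in Cloud Storage, then release the cluster.
gcloud dataproc clusters create ephemeral-wordcount \
    --region=us-central1 \
    --num-workers=2 \
    --max-idle=30m

gcloud dataproc jobs submit spark \
    --cluster=ephemeral-wordcount \
    --region=us-central1 \
    --class=com.example.WordCount \
    --jars=gs://example-bucket/jars/wordcount.jar \
    -- gs://example-bucket/input gs://example-bucket/output

# Delete the cluster as soon as the job finishes; the input and output
# persist in Cloud Storage, so nothing of value is lost.
gcloud dataproc clusters delete ephemeral-wordcount \
    --region=us-central1 --quiet
```

Because the cluster exists only for the duration of the job, you pay for compute only while work is running, and each job can get a cluster configuration sized for that job alone.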