Migrate on-premises Cloudera workloads to Cloudera Data Platform on AWS - AWS Prescriptive Guidance

Migrate on-premises Cloudera workloads to Cloudera Data Platform on AWS

Created by Battulga Purevragchaa (AWS), Nijjwol Lamsal (Partner), and Nidhi Gupta (AWS)

Environment: PoC or pilot

Source: Cloudera workloads

Target: Cloudera Data Platform (CDP) Public Cloud

R Type: N/A

Workload: All other workloads

Technologies: Migration; Big data; Databases; Analytics

AWS services: Amazon EC2; Amazon EKS; AWS Identity and Access Management; Amazon S3; Amazon RDS

Summary

This pattern describes the high-level steps for migrating your on-premises Cloudera Distributed Hadoop (CDH), Hortonworks Data Platform (HDP), and Cloudera Data Platform (CDP) workloads to CDP Public Cloud on AWS. We recommend that you partner with Cloudera Professional Services and a systems integrator (SI) to implement these steps.

There are many reasons Cloudera customers want to move their on-premises CDH, HDP, and CDP workloads to the cloud. Some typical reasons include:

  • Streamline adoption of new data platform paradigms such as data lakehouse or data mesh

  • Increase business agility, and democratize access to, and inference on, existing data assets

  • Lower the total cost of ownership (TCO)

  • Enhance workload elasticity

  • Enable greater scalability, and drastically reduce the time to provision data services compared with a legacy, on-premises install base

  • Retire legacy hardware; significantly reduce hardware refresh cycles

  • Take advantage of pay-as-you-go pricing, which is extended to Cloudera workloads on AWS with the Cloudera licensing model (CCU)

  • Take advantage of faster deployment and improved integration with continuous integration and continuous delivery (CI/CD) platforms

  • Use a single unified platform (CDP) for multiple workloads

Cloudera supports all major workloads, including Machine Learning, Data Engineering, Data Warehouse, Operational Database, Stream Processing (CSP), and data security and governance. Cloudera has offered these workloads for many years on premises, and you can migrate these workloads to the AWS Cloud by using CDP Public Cloud with Workload Manager and Replication Manager. 

Cloudera Shared Data Experience (SDX) provides a shared metadata catalog across these workloads to facilitate consistent data management and operations. SDX also includes comprehensive, granular security to protect against threats, and unified governance for audit and search capabilities for compliance with standards such as Payment Card Industry Data Security Standard (PCI DSS) and GDPR. 

CDP migration at a glance

Workload

  • Source workload: CDH, HDP, and CDP Private Cloud

  • Source environment: Windows or Linux; on-premises, colocation, or any other non-AWS environment

  • Destination workload: CDP Public Cloud on AWS

  • Destination environment: customer account (deployment model) with a customer/Cloudera control plane (operating model)

Migration

  • Migration strategy (7 Rs): Rehost, replatform, or refactor

  • Is this an upgrade in the workload version? Yes

  • Deployment duration: About 1 week to create the customer account, virtual private cloud (VPC), and CDP Public Cloud customer-managed environment

  • Migration duration: 1-4 months, depending on the complexity and size of the workload

Cost

Cost of running the workload on AWS:

  • At a high level, the cost of a CDH workload migration to AWS assumes that you will establish a new environment on AWS. It includes accounting for personnel time and effort as well as provisioning computing resources and licensing software for the new environment.

  • The Cloudera cloud consumption-based pricing model gives you the flexibility to take advantage of bursting and automatic scaling capabilities. For more information, see CDP Public Cloud service rates on the Cloudera website.

  • Cloudera Enterprise Data Hub is based on Amazon Elastic Compute Cloud (Amazon EC2) and closely models traditional clusters. Data Hub can be customized, but this will affect costs.

  • CDP Public Cloud Data Warehouse, Cloudera Machine Learning, and Cloudera Data Engineering (CDE) are container-based and can be configured to scale automatically.

Infrastructure agreements and framework

  • System requirements: See the Prerequisites section.

  • SLA: See Cloudera Service Level Agreement for CDP Public Cloud.

  • DR: See Disaster Recovery in the Cloudera documentation.

  • Licensing and operating model (for target AWS account): Bring Your Own License (BYOL) model

Compliance

  • Security requirements: See Cloudera Security Overview in the Cloudera documentation.

  • Other compliance certifications: See the information on the Cloudera website about General Data Protection Regulation (GDPR) compliance and the CDP Trust Center.

Prerequisites and limitations

Prerequisites

The migration requires the following roles and expertise:

  • Migration lead: Ensures executive support, team collaboration, planning, implementation, and assessment

  • Cloudera SME: Expert skills in CDH, HDP, and CDP administration, system administration, and architecture

  • AWS architect: Skills in AWS services, networking, security, and architectures

Architecture

Building to the appropriate architecture is a critical step to ensure that migration and performance meet your expectations. For your migration effort to meet this playbook’s assumptions, your target data environment in the AWS Cloud, either on virtual private cloud (VPC) hosted instances or CDP, must be an equivalent match to your source environment in terms of operating system and software versions as well as major machine specifications.
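As a quick sanity check before migration, the equivalence requirement above can be sketched as a small inventory comparison. This is an illustrative helper under assumed field names, not Cloudera or AWS tooling:

```python
# Illustrative parity check for the source/target equivalence requirement.
# The inventory fields (os, cdh_version, java) are assumptions for
# demonstration, not a Cloudera-defined schema.
def environment_parity_gaps(source: dict, target: dict) -> list:
    """Return the attributes where the target environment diverges from the source."""
    return [key for key in source if target.get(key) != source[key]]

source_env = {"os": "RHEL 8", "cdh_version": "7.1.7", "java": "11"}
target_env = {"os": "RHEL 8", "cdh_version": "7.1.7", "java": "8"}

# Any gaps reported here should be resolved before the migration starts.
print(environment_parity_gaps(source_env, target_env))  # → ['java']
```

In practice you would populate the two dictionaries from your cluster inventory and fail the readiness review if the list is non-empty.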

The following diagram (reproduced with permission from the Cloudera Shared Data Experience data sheet) shows the infrastructure components for the CDP environment and how the tiers or infrastructure components interact.

CDP environment components

The architecture includes the following CDP components:

  • Data Hub is a service for launching and managing workload clusters powered by Cloudera Runtime. You can use the cluster definitions in Data Hub to provision and access workload clusters for custom use cases and define custom cluster configurations. For more information, see the Cloudera website.

  • Data Flow and Streaming addresses the key challenges enterprises face with data in motion. It manages the following:

    • Processing real-time data streaming at high volume and high scale

    • Tracking data provenance and lineage of streaming data

    • Managing and monitoring edge applications and streaming sources

    For more information, see Cloudera DataFlow and CSP on the Cloudera website.

  • Data Engineering includes data integration, data quality, and data governance, which help organizations build and maintain data pipelines and workflows. For more information, see the Cloudera website. Learn about support for spot instances to facilitate cost savings on AWS for Cloudera Data Engineering workloads.

  • Data Warehouse enables you to create independent data warehouses and data marts that automatically scale to meet workload demands. This service provides isolated compute instances and automated optimization for each data warehouse and data mart, and helps you save costs while meeting SLAs. For more information, see the Cloudera website. Learn about managing costs and auto-scaling for Cloudera Data Warehouse on AWS.

  • Operational Database in CDP provides a reliable and flexible foundation for scalable, high-performance applications. It delivers a real-time, always available, scalable database that serves traditional structured data alongside new, unstructured data within a unified operational and warehousing platform. For more information, see the Cloudera website.

  • Machine Learning is a cloud-native machine learning platform that merges self-service data science and data engineering capabilities into a single, portable service within an enterprise data cloud. It enables scalable deployment of machine learning and artificial intelligence (AI) on data anywhere. For more information, see the Cloudera website.

CDP on AWS

The following diagram (adapted with permission from the Cloudera website) shows the high-level architecture of CDP on AWS. CDP implements its own security model to manage both accounts and data flow, and this model is integrated with AWS Identity and Access Management (IAM) through cross-account roles.

CDP on AWS high-level architecture

The CDP control plane resides in a Cloudera master account in its own VPC. Each customer account has its own sub-account and unique VPC. Cross-account IAM roles and SSL technologies route management traffic to and from the control plane to customer services that reside on internet-routable public subnets within each customer VPC. On the customer’s VPC, the Cloudera Shared Data Experience (SDX) provides enterprise-strength security with unified governance and compliance so you can get insights from your data faster. SDX is a design philosophy incorporated into all Cloudera products. For more information about SDX and the CDP Public Cloud network architecture for AWS, see the Cloudera documentation.
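The cross-account relationship described above rests on a standard IAM trust policy with an external ID. The following sketch shows the general shape of such a policy; the account ID and external ID are placeholders, and in practice CDP generates the exact policy for you when you register an AWS credential:

```python
import json

# Placeholders only: these are NOT real Cloudera values. Use the policy that
# the CDP credential-registration workflow generates for your account.
CONTROL_PLANE_ACCOUNT_ID = "111111111111"   # hypothetical control-plane account
EXTERNAL_ID = "example-external-id"         # hypothetical external ID from CDP

def build_cross_account_trust_policy(control_plane_account: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets a remote control plane assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{control_plane_account}:root"},
                "Action": "sts:AssumeRole",
                # The external ID condition guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

policy_json = json.dumps(
    build_cross_account_trust_policy(CONTROL_PLANE_ACCOUNT_ID, EXTERNAL_ID), indent=2
)
print(policy_json)
```

You would attach a policy of this shape to the cross-account role in your account, then grant that role the permissions CDP needs to provision resources.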

Tools

AWS services

Automation and tooling

Epics

Task | Description | Skills required

Engage the Cloudera team.

Cloudera pursues a standardized engagement model with its customers and can work with your systems integrator (SI) to promote the same approach. Contact the Cloudera customer team so they can provide guidance and the necessary technical resources to get the project started. Contacting the Cloudera team ensures that all necessary teams can prepare for the migration as its date approaches. 

You can contact Cloudera Professional Services to move your Cloudera deployment from pilot to production quickly, at lower cost, and with peak performance. For a complete list of offerings, see the Cloudera website.

Migration lead

Create a CDP Public Cloud environment in a VPC on AWS.

Work with Cloudera Professional Services or your SI to plan and deploy CDP Public Cloud into a VPC on AWS.

Cloud architect, Cloudera SME

Prioritize and assess workloads for migration.

Evaluate all your on-premises workloads to determine the workloads that are the easiest to migrate. Applications that aren’t mission-critical are the best to move first, because they will have minimal impact on your customers. Save the mission-critical workloads for last, after you successfully migrate other workloads.

Note: Transient (CDP Data Engineering) workloads are easier to migrate than persistent (CDP Data Warehouse) workloads. It’s also important to consider data volume and locations when migrating. Challenges can include replicating data continuously from an on-premises environment to the cloud, and changing the data ingestion pipelines to import data directly to the cloud.
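The prioritization guidance above can be sketched as a simple triage pass: migrate non-mission-critical, transient, smaller workloads first. The attributes and ordering rule below are illustrative assumptions for demonstration, not Cloudera-defined criteria:

```python
# Hypothetical workload inventory; names and attributes are examples only.
workloads = [
    {"name": "nightly-etl",   "mission_critical": False, "transient": True,  "data_tb": 5},
    {"name": "bi-warehouse",  "mission_critical": True,  "transient": False, "data_tb": 120},
    {"name": "adhoc-reports", "mission_critical": False, "transient": True,  "data_tb": 2},
]

def migration_priority(workload: dict) -> tuple:
    # Lower tuples sort first: non-critical before critical, transient before
    # persistent, then smaller data volumes before larger ones.
    return (workload["mission_critical"], not workload["transient"], workload["data_tb"])

ordered = sorted(workloads, key=migration_priority)
print([w["name"] for w in ordered])  # → ['adhoc-reports', 'nightly-etl', 'bi-warehouse']
```

The mission-critical data warehouse lands last, matching the advice to save persistent, customer-facing workloads for the end of the migration.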

Migration lead

Discuss CDH, HDP, CDP, and legacy application migration activities.

Consider and start planning for the following activities with Cloudera Workload Manager:

  • Data and workloads to copy to your AWS environment

  • Cloud-ready data

  • Noisy neighbors, which use up resources and create issues for other tenants

  • Elastic workloads

  • Small clusters with high operational overhead

Migration lead

Complete the Cloudera Replication Manager requirements and recommendations.

Work with Cloudera Professional Services and your SI to prepare to migrate workloads to your CDP Public Cloud environment on AWS. Understanding the following requirements and recommendations can help you avoid common issues during and after installation of the Replication Manager service.

  • Review Replication Manager supporting documents to confirm that you meet the environment and system requirements. For more information, see Support matrix for CDP Public Cloud Replication Manager on the Cloudera website.

  • You don’t need root access to the nodes on which the Replication Manager App and Data Lifecycle Manager (DLM) engine will be installed.

  • Install Apache Hive during the initial installation of Replication Manager, unless you are certain that you won’t use Hive replication in the future. If you decide to install Hive after creating HDFS replication policies in Replication Manager, you have to delete and then recreate all HDFS replication policies after you add Hive.

  • Clusters used in Replication Manager must have symmetrical configurations. Each cluster in a replication relationship must be configured exactly the same for security (Kerberos), user management (LDAP/AD), and Knox Proxy. Cluster services such as Hadoop Distributed File System (HDFS), Apache Hive, Apache Knox, Apache Ranger, and Apache Atlas can have different configurations for high availability (HA). For example, source and target clusters might have separate HA and non-HA configurations.
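The symmetry requirement above lends itself to a pre-flight check that compares security-related settings between the source and target clusters. The keys compared here (Kerberos, LDAP/AD, Knox) stand in for your real cluster configuration; this is an illustrative sketch, not part of Replication Manager:

```python
# Settings that must match exactly between clusters in a replication
# relationship (per the symmetry requirement). Key names are assumptions.
SYMMETRIC_KEYS = ("kerberos_enabled", "ldap_provider", "knox_proxy_enabled")

def check_replication_symmetry(source: dict, target: dict) -> list:
    """Return the security settings that differ between two clusters."""
    return [key for key in SYMMETRIC_KEYS if source.get(key) != target.get(key)]

source_cluster = {"kerberos_enabled": True, "ldap_provider": "AD", "knox_proxy_enabled": True}
target_cluster = {"kerberos_enabled": True, "ldap_provider": "AD", "knox_proxy_enabled": False}

# Any mismatch must be fixed before creating replication policies.
print(check_replication_symmetry(source_cluster, target_cluster))  # → ['knox_proxy_enabled']
```

Note that HA-related settings are deliberately excluded from the check, because (as stated above) source and target clusters may differ in their HA configurations.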

Migration lead
Task | Description | Skills required

Migrate the first workload for dev/test environments by using Cloudera Workload Manager.

Your SI can help you migrate your first workload to the AWS Cloud. This should be an application that isn’t customer-facing or mission-critical. Ideal candidates for dev/test migration are applications that have data that the cloud can easily ingest, such as CDP Data Engineering workloads. This is a transient workload that usually has fewer users accessing it, compared with a persistent workload such as a CDP Data Warehouse workload that could have many users who need uninterrupted access. Data Engineering workloads aren’t persistent, which minimizes the business impact if something goes wrong. However, these jobs could be critical for production reporting, so prioritize low-impact Data Engineering workloads first.

Migration lead

Repeat migration steps as necessary.

Cloudera Workload Manager helps identify workloads that are best suited for the cloud. It provides metrics such as cloud performance ratings, sizing/capacity plans for the target environment, and replication plans. The best candidates for migration are seasonal workloads, ad hoc reporting, and intermittent jobs that don’t consume many resources.

Cloudera Replication Manager moves data from on premises to the cloud, and from the cloud to on premises.

Proactively optimize workloads, applications, performance, and infrastructure capacity for data warehousing, data engineering, and machine learning by using Workload Manager. For a complete guide on how to modernize a data warehouse, see the Cloudera website.

Cloudera SME

Related resources

Cloudera documentation:

AWS documentation: