Migrate data from an on-premises Hadoop environment to Amazon S3 using DistCp with AWS PrivateLink for Amazon S3 - AWS Prescriptive Guidance

Migrate data from an on-premises Hadoop environment to Amazon S3 using DistCp with AWS PrivateLink for Amazon S3

Created by Jason Owens (AWS), Andres Cantor (AWS), Jeff Klopfenstein (AWS), Bruno Rocha Oliveira (AWS), and Samuel Schmidt (AWS)

Environment: Production

Source: Hadoop

Target: Any

R Type: Replatform

Workload: Open-source

Technologies: Storage & backup; Analytics

AWS services: Amazon S3; Amazon EMR

This pattern demonstrates how to migrate nearly any amount of data from an on-premises Apache Hadoop environment to the Amazon Web Services (AWS) Cloud by using the Apache open-source tool DistCp with AWS PrivateLink for Amazon Simple Storage Service (Amazon S3). Instead of using the public internet or a proxy solution to migrate data, you can use AWS PrivateLink for Amazon S3 to migrate data to Amazon S3 over a private network connection between your on-premises data center and an Amazon Virtual Private Cloud (Amazon VPC). If you use DNS entries in Amazon Route 53 or add entries to the /etc/hosts file on all nodes of your on-premises Hadoop cluster, requests are automatically directed to the correct interface endpoint.

This guide provides instructions for using DistCp to migrate data to the AWS Cloud. DistCp is the most commonly used tool, but other migration tools are available. For example, you can use offline AWS tools such as AWS Snowball or AWS Snowmobile, or online AWS tools such as AWS Storage Gateway or AWS DataSync. Additionally, you can use other open-source tools such as Apache NiFi.

Prerequisites

  • An active AWS account with a private network connection between your on-premises data center and the AWS Cloud

  • Hadoop, installed on premises with DistCp

  • A Hadoop user with access to the migration data in the Hadoop Distributed File System (HDFS)

  • AWS Command Line Interface (AWS CLI), installed and configured

  • Permissions to put objects into an S3 bucket
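The put permissions can come from any IAM identity that your Hadoop environment will use. As an illustration only, the following sketch attaches a minimal put-oriented policy to a hypothetical IAM user named hadoop-migration; the user name, policy name, and bucket name are placeholders, and your organization's policy requirements might differ.

# Hypothetical example: attach a minimal policy that lets DistCp list the bucket
# and put objects (including multipart uploads). Placeholders throughout.
cat > distcp-put-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:PutObject", "s3:AbortMultipartUpload"],
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    }
  ]
}
EOF
aws iam put-user-policy \
  --user-name hadoop-migration \
  --policy-name distcp-put-only \
  --policy-document file://distcp-put-policy.json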

Limitations

Virtual private cloud (VPC) limitations apply to AWS PrivateLink for Amazon S3. For more information, see Interface endpoint properties and limitations and AWS PrivateLink quotas (AWS PrivateLink documentation).

For details about the features that AWS PrivateLink for Amazon S3 doesn't support, see AWS PrivateLink for Amazon S3 (Amazon S3 documentation).

Source technology stack

  • Hadoop cluster with DistCp installed

Target technology stack

  • Amazon S3

  • Amazon VPC

Target architecture

Architecture diagram: a Hadoop cluster with DistCp copies data from the on-premises environment through AWS Direct Connect to Amazon S3.

The diagram shows how the Hadoop administrator uses DistCp to copy data from an on-premises environment through a private network connection, such as AWS Direct Connect, to Amazon S3 through an Amazon S3 interface endpoint.

AWS services

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

  • Amazon Virtual Private Cloud (Amazon VPC) helps you launch AWS resources into a virtual network that you've defined. This virtual network resembles a traditional network that you'd operate in your own data center, with the benefits of using the scalable infrastructure of AWS.

Other tools

  • Apache Hadoop DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. DistCp uses Apache MapReduce for distribution, error handling and recovery, and reporting.
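For example, a basic DistCp invocation copies a directory from one HDFS cluster to another; the NameNode host names and paths in this sketch are placeholders.

# Illustrative only: copy a directory between two HDFS clusters.
hadoop distcp hdfs://source-namenode:8020/user/data hdfs://target-namenode:8020/user/data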

Task | Description | Skills required

Create an endpoint for AWS PrivateLink for Amazon S3.

  1. Sign in to the AWS Management Console and open the Amazon VPC console.

  2. On the navigation pane, choose Endpoints, and then choose Create Endpoint.

  3. For Service category, choose AWS services.

  4. In the search box, enter s3, and then press Enter.

  5. In the search results, choose the com.amazonaws.<your-aws-region>.s3 service name where the value in the Type column is Interface.

  6. For VPC, choose your VPC. For Subnets, choose your subnets.

  7. For Security group, choose or create a security group that allows TCP 443.

  8. Add tags based on your requirements and then choose Create endpoint.
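If you prefer to script this step, the following AWS CLI sketch creates an equivalent interface endpoint; the VPC ID, subnet IDs, security group ID, and tag value are placeholders for your own values.

# Hypothetical CLI equivalent of the console steps; replace the IDs and the Region.
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0abc1234example \
  --service-name com.amazonaws.<your-aws-region>.s3 \
  --subnet-ids subnet-0abc1234example subnet-0def5678example \
  --security-group-ids sg-0abc1234example \
  --tag-specifications 'ResourceType=vpc-endpoint,Tags=[{Key=Name,Value=s3-interface-endpoint}]'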

AWS administrator

Verify the endpoints and find the DNS entries.

  1. Open the Amazon VPC console, choose Endpoints, and then select the endpoint that you created earlier.

  2. On the Details tab, find the first DNS entry for DNS names. This is the Regional DNS entry. When you use this DNS name, requests alternate between DNS entries specific to Availability Zones.

  3. Choose the Subnets tab. You can find the address of the endpoint’s elastic network interface in each Availability Zone.
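You can also list the DNS entries with the AWS CLI; in this sketch, the endpoint ID is a placeholder.

# Hypothetical check: print the Regional and zonal DNS names of the endpoint.
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-<your-VPC-endpoint-ID> \
  --query 'VpcEndpoints[0].DnsEntries[].DnsName' \
  --output text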

AWS administrator

Check the firewall rules and routing configurations.

To confirm that your firewall rules are open and that your networking configuration is correctly set up, use Telnet to test the endpoint on port 443. For example:

$ telnet vpce-<your-VPC-endpoint-ID>.s3.us-east-2.vpce.amazonaws.com 443
Trying 10.104.88.6...
Connected to vpce-<your-VPC-endpoint-ID>.s3.us-east-2.vpce.amazonaws.com.
...
$ telnet vpce-<your-VPC-endpoint-ID>.s3.us-east-2.vpce.amazonaws.com 443
Trying 10.104.71.141...
Connected to vpce-<your-VPC-endpoint-ID>.s3.us-east-2.vpce.amazonaws.com.

Note: If you use the Regional entry, a successful test shows that the DNS is alternating between the two IP addresses that you can see on the Subnets tab for your selected endpoint in the Amazon VPC console.
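As an additional check before involving the Hadoop cluster, you can resolve the Regional DNS name directly; this sketch uses dig, and repeated queries should return the endpoint's private IP addresses rather than public Amazon S3 addresses.

# Repeated lookups of the Regional endpoint name should alternate between the
# private IP addresses shown on the Subnets tab.
dig +short vpce-<your-VPC-endpoint-ID>.s3.us-east-2.vpce.amazonaws.com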

Network administrator, AWS administrator

Configure the name resolution.

You must configure name resolution to allow Hadoop to access the Amazon S3 interface endpoint. You can’t use the endpoint name itself. Instead, you must resolve <your-bucket-name>.s3.<your-aws-region>.amazonaws.com or *.s3.<your-aws-region>.amazonaws.com. For more information about this naming limitation, see Introducing the Hadoop S3A client (Hadoop website).

Choose one of the following configuration options:

  • Use on-premises DNS to resolve the private IP address of the endpoint. You can override behavior for all buckets or selected buckets. For more information, see “Option 2: Access Amazon S3 using Domain Name System Response Policy Zones (DNS RPZ)” in Secure hybrid access to Amazon S3 using AWS PrivateLink (AWS blog post).

  • Configure on-premises DNS to conditionally forward traffic to the resolver inbound endpoints in the VPC. Traffic is forwarded to Route 53. For more information, see “Option 3: Forwarding DNS requests from on premises using Amazon Route 53 Resolver Inbound Endpoints” in Secure hybrid access to Amazon S3 using AWS PrivateLink (AWS blog post).

  • Edit the /etc/hosts file on all the nodes in your Hadoop cluster. This is a temporary solution for testing and isn't recommended for production. To edit the /etc/hosts file, add an entry for either <your-bucket-name>.s3.<your-aws-region>.amazonaws.com or s3.<your-aws-region>.amazonaws.com, as shown in the sketch after this list. The /etc/hosts file can’t have multiple IP addresses for an entry. You must choose a single IP address from one of the Availability Zones, which then becomes a single point of failure.
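For the /etc/hosts option, an entry might look like the following sketch; the IP address is one of the endpoint's private addresses from the Subnets tab, and the bucket name is a placeholder.

# Example /etc/hosts entry (testing only): map the S3 name to a single private
# IP address of the interface endpoint. This IP is a single point of failure.
10.104.88.6   <your-bucket-name>.s3.<your-aws-region>.amazonaws.com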

AWS administrator

Configure authentication for Amazon S3.

To authenticate to Amazon S3 through Hadoop, we recommend that you export temporary role credentials to the Hadoop environment. For more information, see Authenticating with S3 (Hadoop website). For long-running jobs, you can create a user and assign a policy that has permissions to put data into an S3 bucket only. The access key and secret key can be stored on Hadoop, accessible only to the DistCp job itself and to the Hadoop administrator. For more information on storing secrets, see Storing secrets with Hadoop Credential Providers (Hadoop website). For more information on other authentication methods, see How to get credentials of an IAM role for use with CLI access to an AWS account in the documentation for AWS IAM Identity Center (successor to AWS Single Sign-On).

To use temporary credentials, add the temporary credentials to your credentials file, or run the following commands to export the credentials to your environment:

export AWS_SESSION_TOKEN=SECRET-SESSION-TOKEN
export AWS_ACCESS_KEY_ID=SESSION-ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SESSION-SECRET-KEY
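One way to obtain temporary credentials is to assume an IAM role with the AWS CLI and then export the returned values as shown above. In the following sketch, the role ARN and session name are placeholders.

# Hypothetical example: assume a role and print its temporary credentials
# (access key ID, secret access key, and session token).
aws sts assume-role \
  --role-arn arn:aws:iam::111122223333:role/distcp-migration-role \
  --role-session-name distcp-session \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text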

If you have a traditional access key and secret key combination, run the following commands:

export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key

Note: If you use an access key and secret key combination, then change the credentials provider in the DistCp commands from "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" to "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider".
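If you store a long-lived key pair instead of exporting it, the Hadoop credential provider approach mentioned above might look like the following sketch; the JCEKS path is a placeholder, and each command prompts you for the corresponding value.

# Hypothetical example: store the access key and secret key in a JCEKS
# credential store on HDFS that only the Hadoop administrator and the DistCp
# job can read.
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/root/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/root/s3.jceks
# Then reference the store in the DistCp command instead of passing keys directly:
#   -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/s3.jceks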

AWS administrator

Transfer data by using DistCp.

To use DistCp to transfer data, run the following command:

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
  -Dfs.s3a.access.key="${AWS_ACCESS_KEY_ID}" \
  -Dfs.s3a.secret.key="${AWS_SECRET_ACCESS_KEY}" \
  -Dfs.s3a.session.token="${AWS_SESSION_TOKEN}" \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.connection.ssl.enabled=true \
  -Dfs.s3a.endpoint=s3.<your-aws-region>.amazonaws.com \
  hdfs:///user/root/ s3a://<your-bucket-name>

Note: The AWS Region of the endpoint isn’t automatically discovered when you use the DistCp command with AWS PrivateLink for Amazon S3. Hadoop 3.3.2 and later versions resolve this issue by enabling the option to explicitly set the AWS Region of the S3 bucket. For more information, see S3A to add option fs.s3a.endpoint.region to set AWS region (Hadoop website).
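If your cluster runs Hadoop 3.3.2 or later, you can set the Region explicitly by adding the following option to the DistCp command; this is a sketch that assumes the fs.s3a.endpoint.region property is available in your Hadoop version.

# Hadoop 3.3.2 and later: set the AWS Region of the S3 bucket explicitly.
-Dfs.s3a.endpoint.region=<your-aws-region>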

For more information about other S3A configuration options, see General S3A Client configuration (Hadoop website). For example, if you use encryption, you can add the following option to the DistCp command above, depending on your type of encryption:

-Dfs.s3a.server-side-encryption-algorithm=AES256 [or SSE-C or SSE-KMS]
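For example, if you use SSE-KMS, you typically also specify the KMS key to use; in the following sketch, the key ARN is a placeholder, and the property names assume the S3A encryption options available in Hadoop 3.x.

# Hypothetical SSE-KMS example: set the algorithm and the KMS key for S3A.
-Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
-Dfs.s3a.server-side-encryption.key=arn:aws:kms:<your-aws-region>:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab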

Note: To use the interface endpoint with S3A, you must create a DNS alias entry that points the S3 Regional name (for example, s3.<your-aws-region>.amazonaws.com) to the interface endpoint. For instructions, see the Configure the name resolution task earlier in this pattern. This workaround is required for Hadoop versions earlier than 3.3.2; later versions of S3A don't require it because you can set the AWS Region explicitly, as described in the previous note.

If you have signature issues with Amazon S3, add an option to use Signature Version 4 (SigV4) signing:

-Dmapreduce.map.java.opts="-Dcom.amazonaws.services.s3.enableV4=true"

Migration engineer, AWS administrator