Migrate Hadoop data to Amazon S3 by using WANdisco LiveData Migrator
Created by Tony Velcich
Source: On-premises Hadoop cluster | Target: Amazon S3 | R Type: Rehost |
Environment: Production | Technologies: Data lakes; Big data; Hybrid cloud; Migration | Workload: All other workloads |
AWS services: Amazon S3 |
Summary
This pattern describes the process for migrating Apache Hadoop data from a Hadoop Distributed File System (HDFS) to Amazon Simple Storage Service (Amazon S3). It uses WANdisco LiveData Migrator to automate the data migration process.
Prerequisites and limitations
Prerequisites
Hadoop cluster edge node where LiveData Migrator will be installed. The node should meet the following requirements:
Minimum specification: 4 CPUs, 16 GB RAM, 100 GB storage.
2 Gbps minimum network.
Port 8081 accessible on your edge node to access the WANdisco UI.
Java 1.8 64-bit.
Hadoop client libraries installed on the edge node.
Ability to authenticate as the HDFS superuser (for example, "hdfs"). If Kerberos is enabled on your Hadoop cluster, a valid keytab that contains a suitable principal for the HDFS superuser must be available on the edge node.
See the release notes for a list of supported operating systems.
An active AWS account with access to an S3 bucket.
An AWS Direct Connect link established between your on-premises Hadoop cluster (specifically the edge node) and AWS.
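The edge-node prerequisites above can be verified with a few quick commands. This is a sketch only; the keytab path and Kerberos realm are placeholders you would replace with your cluster's values.

```shell
# Quick environment checks on the edge node (paths and realm are examples).
java -version                                   # expect a 64-bit Java 1.8 JVM
hadoop version                                  # confirms the Hadoop client libraries are installed
ss -ltn | grep 8081 || echo "port 8081 free"    # check that the WANdisco UI port is available
# If Kerberos is enabled, confirm you can authenticate as the HDFS superuser:
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM && klist
```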
Product versions
LiveData Migrator 1.8.6
WANdisco UI (OneUI) 5.8.0
Architecture
Source technology stack
On-premises Hadoop cluster
Target technology stack
Amazon S3
Architecture
The following diagram shows the LiveData Migrator solution architecture.
The workflow consists of four primary components for data migration from on-premises HDFS to Amazon S3:

LiveData Migrator – Automates the migration of data from HDFS to Amazon S3, and resides on an edge node of the Hadoop cluster.

HDFS – A distributed file system that provides high-throughput access to application data.

Amazon S3 – An object storage service that offers scalability, data availability, security, and performance.

AWS Direct Connect – A service that establishes a dedicated network connection from your on-premises data centers to AWS.
Automation and scale
You will typically create multiple migrations so that you can select specific content from your source file system by path or directory. You can also migrate data to multiple, independent file systems at the same time by defining multiple migration resources.
Epics
Task | Description | Skills required |
---|---|---|
Sign in to your AWS account. | Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/. | AWS experience |
Create an S3 bucket. | If you don't already have an S3 bucket to use as the target storage, choose "Create bucket" on the Amazon S3 console, and specify a bucket name, AWS Region, and block public access settings. AWS and WANdisco recommend that you enable the block public access options for the S3 bucket, and set up the bucket access and user permission policies to meet your organization's requirements. For an example, see https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example1.html. | AWS experience |
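If you prefer to create the bucket from the command line, the step above can be sketched with the AWS CLI. The bucket name and Region here are placeholders.

```shell
# Example only: bucket name and Region are placeholders.
aws s3api create-bucket \
  --bucket my-migration-bucket \
  --region us-east-1
# For Regions other than us-east-1, also pass:
#   --create-bucket-configuration LocationConstraint=<region>

# Enable all block public access settings, as recommended above.
aws s3api put-public-access-block \
  --bucket my-migration-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```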
Task | Description | Skills required |
---|---|---|
Download the LiveData Migrator installer. | Download the LiveData Migrator installer and upload it to the Hadoop edge node. You can download a free trial of LiveData Migrator at https://www2.wandisco.com/ldm-trial. You can also obtain access to LiveData Migrator from AWS Marketplace, at https://aws.amazon.com/marketplace/pp/B07B8SZND9. | Hadoop administrator, Application owner |
Install LiveData Migrator. | Use the downloaded installer and install LiveData Migrator as the HDFS superuser on an edge node in your Hadoop cluster. See the "Additional information" section for the installation commands. | Hadoop administrator, Application owner |
Check the status of LiveData Migrator and other services. | Check the status of LiveData Migrator, Hive migrator, and WANdisco UI by using the commands provided in the "Additional information" section. | Hadoop administrator, Application owner |
Task | Description | Skills required |
---|---|---|
Register your LiveData Migrator account. | Log in to the WANdisco UI through a web browser on port 8081 (on the Hadoop edge node) and provide your details for registration. For example, if you are running LiveData Migrator on a host named myldmhost.example.com, the URL would be: http://myldmhost.example.com:8081 | Application owner |
Configure your source HDFS storage. | Provide the configuration details needed for your source HDFS storage. This will include the "fs.defaultFS" value and a user-defined storage name. If Kerberos is enabled, provide the principal and keytab location for LiveData Migrator to use. If NameNode HA is enabled on the cluster, provide a path to the core-site.xml and hdfs-site.xml files on the edge node. | Hadoop administrator, Application owner |
Configure your target Amazon S3 storage. | Add your target storage as the S3a type. Provide the user-defined storage name and the S3 bucket name. Enter "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" for the Credentials Provider option, and provide the AWS access and secret keys for the S3 bucket. Additional S3a properties will also be needed. For details, see the "S3a Properties" section in the LiveData Migrator documentation at https://docs.wandisco.com/live-data-migrator/docs/command-reference/#filesystem-add-s3a. | AWS, Application owner |
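The two storage-configuration steps above can also be performed from the LiveData Migrator CLI. The following is a sketch based on the `filesystem add` commands in the command reference linked above; the storage IDs, bucket name, keytab path, and key values are illustrative, and you should confirm the exact flags against the documentation for your version.

```shell
# Illustrative LiveData Migrator CLI commands; identifiers and paths are examples.

# Source: HDFS (include Kerberos details if enabled on the cluster)
filesystem add hdfs --file-system-id sourceHdfs \
  --default-fs hdfs://nameservice01 \
  --kerberos-principal hdfs@EXAMPLE.COM \
  --kerberos-keytab /etc/security/keytabs/hdfs.headless.keytab

# Target: Amazon S3 via the S3a connector
filesystem add s3a --file-system-id targetS3 \
  --bucket-name my-migration-bucket \
  --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --access-key AKIAEXAMPLE \
  --secret-key <secret>
```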
Task | Description | Skills required |
---|---|---|
Add exclusions (if needed). | If you want to exclude specific datasets from migration, add exclusions for the source HDFS storage. These exclusions can be based on file size, file names (based on regex patterns), and modification date. | Hadoop administrator, Application owner |
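Exclusions can also be defined from the LiveData Migrator CLI. The commands below are a rough sketch from memory of the `exclusion add` subcommands; the exclusion IDs, flag names, and values are assumptions to be checked against the command reference for your release.

```shell
# Illustrative exclusion commands; IDs, flags, and values are examples only.
exclusion add file-size --exclusion-id hugeFiles \
  --description "Skip files over 100 GB" --value 100GB
exclusion add regex --exclusion-id tmpFiles \
  --description "Skip temporary files" --regex '^.*\.tmp$'
```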
Task | Description | Skills required |
---|---|---|
Create and configure the migration. | Create a migration in the dashboard of the WANdisco UI. Choose your source (HDFS) and target (the S3 bucket). Add any exclusions that you defined in the previous step. Select either the "Overwrite" or the "Skip if Size Match" option. Create the migration when all fields are complete. | Hadoop administrator, Application owner |
Start the migration. | On the dashboard, select the migration you created. Click to start the migration. You can also start a migration automatically by choosing the auto-start option when you create the migration. | Application owner |
Task | Description | Skills required |
---|---|---|
Set a network bandwidth limit between the source and target. | In the Storages list on the dashboard, select your source storage and select "Bandwidth Management" in the Grouping list. Clear the unlimited option, and define the maximum bandwidth limit and unit. Choose "Apply." | Application owner, Networking |
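When choosing a bandwidth limit, it helps to sanity-check how long the migration will take at that cap. The arithmetic below is a rough estimate that assumes sustained throughput and ignores protocol overhead; the dataset size and limit are example values.

```shell
# Rough transfer-time estimate for a chosen bandwidth cap.
# Assumes sustained throughput and ignores protocol overhead.
DATA_TB=50          # dataset size in terabytes (example value)
LIMIT_MBPS=1000     # bandwidth cap in megabits per second (example value)

DATA_MBITS=$((DATA_TB * 8 * 1000 * 1000))    # TB -> megabits
SECONDS_NEEDED=$((DATA_MBITS / LIMIT_MBPS))
HOURS=$((SECONDS_NEEDED / 3600))
echo "~${HOURS} hours at ${LIMIT_MBPS} Mbit/s"
```

At 1 Gbit/s, a 50-TB dataset needs roughly 111 hours of continuous transfer, which is useful context when deciding how aggressively to throttle.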
Task | Description | Skills required |
---|---|---|
View migration information using the WANdisco UI. | Use the WANdisco UI to view license, bandwidth, storage, and migration information. The UI also provides a notification system, so you can receive notifications about errors, warnings, or important milestones. | Hadoop administrator, Application owner |
Stop, resume, and delete migrations. | You can stop a migration from transferring content to its target by placing it in the STOPPED state. Stopped migrations can be resumed. Migrations in the STOPPED state can also be deleted. | Hadoop administrator, Application owner |
Related resources
Additional information
Installing LiveData Migrator
You can use the following commands to install LiveData Migrator, assuming that the installer is in your working directory:

```shell
su - hdfs
chmod +x livedata-migrator.sh && sudo ./livedata-migrator.sh
```
Checking the status of LiveData Migrator and other services after installation
Use the following commands to check the status of LiveData Migrator, Hive migrator, and WANdisco UI:
```shell
service livedata-migrator status
service hivemigrator status
service livedata-ui status
```