Connecting to Apache Hadoop databases with the AWS Schema Conversion Tool
You can use the AWS SCT command line interface (CLI) to migrate from Apache Hadoop to Amazon EMR. AWS SCT uses your Amazon S3 bucket as temporary storage for your data during migration.

As a source, AWS SCT supports Apache Hadoop version 2.2.0 and higher and Apache Hive version 0.13.0 and higher. As a target, AWS SCT supports Amazon EMR version 6.3.0 and higher, Apache Hadoop version 2.6.0 and higher, and Apache Hive version 0.13.0 and higher.
Prerequisites for using Apache Hadoop as a source
The following prerequisites are required to connect to Apache Hadoop with the AWS SCT CLI:

- Create an Amazon S3 bucket to store data during the migration. You can then copy data to Amazon EMR HDFS or use Amazon S3 as a data repository for your Hadoop workloads. For more information, see Creating a bucket in the Amazon S3 User Guide.
- Create an AWS Identity and Access Management (IAM) role with the AmazonS3FullAccess policy. AWS SCT uses this IAM role to access your Amazon S3 bucket. For a sketch of these first two steps, see the example after this list.
- Take note of your AWS access key and AWS secret access key. For more information about AWS access keys, see Managing access keys in the IAM User Guide.
- Create and configure a target Amazon EMR cluster. For more information, see Getting started with Amazon EMR in the Amazon EMR Management Guide.
- Install the distcp utility on your source Apache Hadoop cluster. Also, install the s3-dist-cp utility on your target Amazon EMR cluster. Make sure that your database users have permissions to run these utilities. For example invocations, see the sketch after this list.
- Configure the core-site.xml file in your source Hadoop cluster to use the s3a protocol. To do so, set the fs.s3a.aws.credentials.provider parameter to one of the following values:

      org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
      org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
      org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
      org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider

  You can add the following code example into the core-site.xml file.

      <property>
          <name>fs.s3a.aws.credentials.provider</name>
          <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
      </property>

  The preceding example shows one of the four providers from the preceding list. If you don't set the fs.s3a.aws.credentials.provider parameter in the core-site.xml file, AWS SCT chooses the provider automatically.
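The following commands are a minimal sketch of the first two prerequisites, assuming you use the AWS CLI. The bucket name, role name, and trust policy file are placeholder values, not names from this guide.

    # Create the Amazon S3 bucket that AWS SCT uses as temporary storage.
    aws s3 mb s3://amzn-s3-demo-bucket --region eu-west-1

    # Create an IAM role and attach the AmazonS3FullAccess managed policy to it.
    # The role name and the trust-policy.json file are placeholders; the trust
    # policy must allow the principal that assumes the role.
    aws iam create-role --role-name sct-migration-role --assume-role-policy-document file://trust-policy.json
    aws iam attach-role-policy --role-name sct-migration-role --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess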
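Likewise, the following sketch shows typical invocations of the distcp and s3-dist-cp utilities, assuming placeholder paths and the bucket from the previous example. Run the first command on the source Hadoop cluster and the second on the target Amazon EMR cluster.

    # Copy a source HDFS folder to Amazon S3 over the s3a protocol.
    hadoop distcp hdfs:///user/hdfs-data s3a://amzn-s3-demo-bucket/example-folder

    # Copy the migrated data from Amazon S3 into Amazon EMR HDFS.
    s3-dist-cp --src s3://amzn-s3-demo-bucket/example-folder --dest hdfs:///user/hdfs-data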
Permissions for using Hive as a source
The permissions required for a Hive source user are as follows:

- READ access to the source data folders and to the source Amazon S3 bucket
- READ+WRITE access to the intermediate and target Amazon S3 buckets
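As a quick sanity check of these permissions (a sketch, assuming the placeholder bucket and folder names used earlier), the Hive source user can try listing the bucket over s3a from the source cluster:

    # Verify READ access to the source Amazon S3 bucket.
    hadoop fs -ls s3a://amzn-s3-demo-bucket/example-folder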
To increase the migration speed, we recommend that you run compaction for ACID-transactional source tables.
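For example, you can request a major compaction from the Hive shell. The database and table names are placeholders; on transactional tables, compaction runs asynchronously, so you can track its progress with SHOW COMPACTIONS.

    # Request a major compaction for an ACID-transactional table.
    hive -e "ALTER TABLE example_db.example_table COMPACT 'major';"

    # Check the status of queued and running compactions.
    hive -e "SHOW COMPACTIONS;"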
The permissions required for an Amazon EMR Hive target user are as follows:

- READ access to the target Amazon S3 bucket
- READ+WRITE access to the intermediate Amazon S3 bucket
- READ+WRITE access to the target HDFS folders
Permissions for using HDFS as a source
The permissions required for HDFS as a source are as follows:

- EXECUTE for the NameNode
- EXECUTE+READ for all source folders and files that you include in the migration project
- READ+WRITE for the tmp directory in the NameNode to run Spark jobs and store files before the migration to Amazon S3
In HDFS, all operations require traversal access. Traversal access demands the EXECUTE permission on all existing components of the path, except for the final path component. For example, for any operation accessing /foo/bar/baz, your user must have EXECUTE permission on /, /foo, and /foo/bar.
The following code example demonstrates how to grant EXECUTE+READ permissions for your source folders and files, and READ+WRITE permissions for the tmp directory.

    hadoop fs -chmod -R 744 /user/hdfs-data
    hadoop fs -chmod -R 766 /tmp
Permissions for using HDFS as a target
The permissions required for Amazon EMR HDFS as a target are as follows:

- EXECUTE for the NameNode of the target Amazon EMR cluster
- READ+WRITE for the target HDFS folders where you will store data after migration
Connecting to Apache Hadoop as a source
You can use Apache Hadoop as a source in AWS SCT version 1.0.670 or higher. You can migrate Hadoop clusters to Amazon EMR only with the AWS SCT command line interface (CLI). Before you start, familiarize yourself with the AWS SCT CLI. For more information, see CLI Reference for AWS Schema Conversion Tool.
To connect to Apache Hadoop in the AWS SCT CLI
- Create a new AWS SCT CLI script or edit an existing scenario template. For example, you can download and edit the HadoopMigrationTemplate.scts template. For more information, see Getting CLI scenarios.
- Configure the AWS SCT application settings, such as the driver location and log folder.

  Download the required JDBC driver and specify the location where you store the file. For more information, see Installing JDBC drivers for AWS Schema Conversion Tool.

  The following code example shows you how to add the path to the Apache Hive driver. After you run this code example, AWS SCT stores log files in the c:\sct folder.

      SetGlobalSettings
          -save: 'true'
          -settings: '{
              "hive_driver_file": "c:\\sct\\HiveJDBC42.jar",
              "log_folder": "c:\\sct",
              "console_log_folder": "c:\\sct"
          }'
      /

  You can use this example and the following examples in Windows.
- Create a new AWS SCT project.

  The following code example creates the hadoop_emr project in the c:\sct folder.

      CreateProject
          -name: 'hadoop_emr'
          -directory: 'c:\sct'
      /
- Add your source Hadoop cluster to the project.

  Use the AddSourceCluster command to connect to the source Hadoop cluster. Make sure that you provide values for the following mandatory parameters: name, host, port, and user. Other parameters are optional.

  The following code example adds the source Hadoop cluster. This example sets HADOOP_SOURCE as the name of the source cluster. Use this object name to add Hive and HDFS services to the project and create mapping rules.

      AddSourceCluster
          -name: 'HADOOP_SOURCE'
          -vendor: 'HADOOP'
          -host: 'hadoop_address'
          -port: '22'
          -user: 'hadoop_user'
          -password: 'hadoop_password'
          -useSSL: 'true'
          -privateKeyPath: 'c:\path\name.pem'
          -passPhrase: 'hadoop_passphrase'
      /

  In the preceding example, replace hadoop_address with the IP address of your Hadoop cluster. If needed, configure the value of the port option. Next, replace hadoop_user and hadoop_password with the name of your Hadoop user and the password for this user. For path\name, enter the name and path to the PEM file for your source Hadoop cluster.
- Save your CLI script. Next, add the connection information for your Hive and HDFS services.
Connecting to your source Hive and HDFS services
You can connect to your source Hive and HDFS services with the AWS SCT CLI. To connect to Apache Hive, use the Hive JDBC driver version 2.3.4 or higher. For more information, see Installing JDBC drivers for AWS Schema Conversion Tool.
AWS SCT connects to Apache Hive with the hadoop cluster user. To do so, use the AddSourceClusterHive and AddSourceClusterHDFS commands. You can use one of the following approaches.
- Create a new SSH tunnel.

  For createTunnel, enter true. For host, enter the internal IP address of your source Hive or HDFS service. For port, enter the service port of your Hive or HDFS service.

  Next, enter your Hive or HDFS credentials for user and password. For more information about SSH tunnels, see Set up an SSH tunnel to the primary node using local port forwarding in the Amazon EMR Management Guide.
- Use an existing SSH tunnel.

  For host, enter localhost. For port, enter the local port from the SSH tunnel parameters. For an example of opening such a tunnel, see the sketch after this list.
- Connect to your Hive and HDFS services directly.

  For host, enter the IP address or hostname of your source Hive or HDFS service. For port, enter the service port of your Hive or HDFS service. Next, enter your Hive or HDFS credentials for user and password.
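For the second approach, the following is a hypothetical example of opening a local port forwarding tunnel to a Hive service; the key file, user, addresses, and ports are placeholders. AWS SCT then connects to localhost and the forwarded local port.

    # Forward local port 10005 to the Hive service port on the source cluster.
    ssh -i c:\path\name.pem -N -L 10005:hive_remote_address:hive_port hadoop_user@hadoop_address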
To connect to Hive and HDFS in the AWS SCT CLI
- Open your CLI script, which includes the connection information for your source Hadoop cluster. Make sure that you use the name of the Hadoop cluster that you defined in the previous step.
- Add your source Hive service to the project.

  Use the AddSourceClusterHive command to connect to the source Hive service. Make sure that you provide values for the following mandatory parameters: user, password, cluster, name, and port. Other parameters are optional.

  The following code example creates a tunnel for AWS SCT to work with your Hive service. In this example, AWS SCT connects to the Hive service through the tunnel endpoint on localhost. This example uses the HADOOP_SOURCE source cluster from the previous example.

      AddSourceClusterHive
          -cluster: 'HADOOP_SOURCE'
          -name: 'HIVE_SOURCE'
          -host: 'localhost'
          -port: '10005'
          -user: 'hive_user'
          -password: 'hive_password'
          -createTunnel: 'true'
          -localPort: '10005'
          -remoteHost: 'hive_remote_address'
          -remotePort: 'hive_port'
      /

  The following code example connects to your Hive service without a tunnel.

      AddSourceClusterHive
          -cluster: 'HADOOP_SOURCE'
          -name: 'HIVE_SOURCE'
          -host: 'hive_address'
          -port: 'hive_port'
          -user: 'hive_user'
          -password: 'hive_password'
      /

  In the preceding examples, replace hive_user and hive_password with the name of your Hive user and the password for this user. Next, replace hive_address and hive_port with the NameNode IP address and port of your source Hadoop cluster. For hive_remote_address, you might use the default value 127.0.0.1 or the NameNode IP address of your source Hive service.
- Add your source HDFS service to the project.

  Use the AddSourceClusterHDFS command to connect to the source HDFS service. Make sure that you provide values for the following mandatory parameters: user, password, cluster, name, and port. Other parameters are optional.

  Make sure that your user has the required permissions to migrate data from your source HDFS service. For more information, see Permissions for using HDFS as a source.

  The following code example creates a tunnel for AWS SCT to work with your Apache HDFS service. This example uses the HADOOP_SOURCE source cluster that you created before.

      AddSourceClusterHDFS
          -cluster: 'HADOOP_SOURCE'
          -name: 'HDFS_SOURCE'
          -host: 'localhost'
          -port: '9005'
          -user: 'hdfs_user'
          -password: 'hdfs_password'
          -createTunnel: 'true'
          -localPort: '9005'
          -remoteHost: 'hdfs_remote_address'
          -remotePort: 'hdfs_port'
      /

  The following code example connects to your Apache HDFS service without a tunnel.

      AddSourceClusterHDFS
          -cluster: 'HADOOP_SOURCE'
          -name: 'HDFS_SOURCE'
          -host: 'hdfs_address'
          -port: 'hdfs_port'
          -user: 'hdfs_user'
          -password: 'hdfs_password'
      /

  In the preceding examples, replace hdfs_user and hdfs_password with the name of your HDFS user and the password for this user. Next, replace hdfs_address and hdfs_port with the NameNode IP address and port of your source Hadoop cluster. For hdfs_remote_address, you might use the default value 127.0.0.1 or the NameNode IP address of your source HDFS service.
- Save your CLI script. Next, add the connection information for your target Amazon EMR cluster, and the migration commands.
Connecting to Amazon EMR as a target
You can connect to your target Amazon EMR cluster with the AWS SCT CLI. To do so, you authorize inbound traffic and use SSH. In this case, AWS SCT has all required permissions to work with your Amazon EMR cluster. For more information, see Before you connect and Connect to the primary node using SSH in the Amazon EMR Management Guide.
AWS SCT connects to Amazon EMR Hive with the hadoop cluster user. To connect to Amazon EMR Hive, use the Hive JDBC driver version 2.6.2.1002 or higher. For more information, see Installing JDBC drivers for AWS Schema Conversion Tool.
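As a hypothetical example of authorizing inbound SSH traffic with the AWS CLI (the security group ID and source CIDR range are placeholder values), you can add a rule to the primary node's security group:

    # Allow inbound SSH (TCP port 22) from your network to the primary node.
    aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.0/24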
To connect to Amazon EMR in the AWS SCT CLI
- Open your CLI script, which includes the connection information for your source Hadoop cluster. Add the target Amazon EMR credentials into this file.
- Add your target Amazon EMR cluster to the project.

  The following code example adds the target Amazon EMR cluster. This example sets HADOOP_TARGET as the name of the target cluster. Use this object name to add your Hive and HDFS services and an Amazon S3 bucket folder to the project and create mapping rules.

      AddTargetCluster
          -name: 'HADOOP_TARGET'
          -vendor: 'AMAZON_EMR'
          -host: 'ec2-44-44-55-66.eu-west-1.EXAMPLE.amazonaws.com'
          -port: '22'
          -user: 'emr_user'
          -password: 'emr_password'
          -useSSL: 'true'
          -privateKeyPath: 'c:\path\name.pem'
          -passPhrase: '1234567890abcdef0!'
          -s3Name: 'S3_TARGET'
          -accessKey: 'AKIAIOSFODNN7EXAMPLE'
          -secretKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
          -region: 'eu-west-1'
          -s3Path: 'doc-example-bucket/example-folder'
      /

  In the preceding example, enter your AWS resource names and Amazon EMR connection information. This includes the IP address of your Amazon EMR cluster, AWS access key, AWS secret access key, and Amazon S3 bucket. If needed, configure the value of the port variable. Next, replace emr_user and emr_password with the name of your Amazon EMR user and the password for this user. For path\name, enter the name and path to the PEM file for your target Amazon EMR cluster. For more information, see Download PEM File for EMR Cluster Access.
- Add your target Amazon S3 bucket to the project.

  The following code example adds the target Amazon S3 bucket. This example uses the HADOOP_TARGET cluster that you created before.

      AddTargetClusterS3
          -cluster: 'HADOOP_TARGET'
          -Name: 'S3_TARGET'
          -accessKey: 'AKIAIOSFODNN7EXAMPLE'
          -secretKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
          -region: 'eu-west-1'
          -s3Path: 'doc-example-bucket/example-folder'
      /

  In the preceding example, enter your AWS access key, AWS secret access key, and Amazon S3 bucket.
- Add your target Hive service to the project.

  The following code example creates a tunnel for AWS SCT to work with your target Hive service. This example uses the HADOOP_TARGET target cluster that you created before.

      AddTargetClusterHive
          -cluster: 'HADOOP_TARGET'
          -name: 'HIVE_TARGET'
          -host: 'localhost'
          -port: '10006'
          -user: 'hive_user'
          -password: 'hive_password'
          -createTunnel: 'true'
          -localPort: '10006'
          -remoteHost: 'hive_address'
          -remotePort: 'hive_port'
      /

  In the preceding example, replace hive_user and hive_password with the name of your Hive user and the password for this user. Next, replace hive_address with the default value 127.0.0.1 or with the NameNode IP address of your target Hive service. Then, replace hive_port with the port of your target Hive service.
- Add your target HDFS service to the project.

  The following code example creates a tunnel for AWS SCT to work with your Apache HDFS service. This example uses the HADOOP_TARGET target cluster that you created before.

      AddTargetClusterHDFS
          -cluster: 'HADOOP_TARGET'
          -name: 'HDFS_TARGET'
          -host: 'localhost'
          -port: '8025'
          -user: 'hdfs_user'
          -password: 'hdfs_password'
          -createTunnel: 'true'
          -localPort: '8025'
          -remoteHost: 'hdfs_address'
          -remotePort: 'hdfs_port'
      /

  In the preceding example, replace hdfs_user and hdfs_password with the name of your HDFS user and the password for this user. Next, replace hdfs_address and hdfs_port with the private IP address and port of the NameNode of your target HDFS service.
- Save your CLI script. Next, add mapping rules and migration commands. For more information, see Migrating Hadoop workloads.