Migrating data from an on-premises data warehouse to Amazon Redshift - AWS Schema Conversion Tool

Migrating data from an on-premises data warehouse to Amazon Redshift

You can use an AWS SCT agent to extract data from your on-premises data warehouse and migrate it to Amazon Redshift. The agent extracts your data and uploads the data to either Amazon S3 or, for large-scale migrations, an AWS Snowball Edge device. You can then use an AWS SCT agent to copy the data to Amazon Redshift.

Alternatively, you can use AWS Database Migration Service (AWS DMS) to migrate data to Amazon Redshift. The advantage of AWS DMS is its support for ongoing replication (change data capture). However, to increase the speed of data migration, use several AWS SCT agents in parallel. According to our tests, AWS SCT agents migrate data 15–35 percent faster than AWS DMS. The difference in speed is due to data compression, support for migrating table partitions in parallel, and different configuration settings. For more information, see Using an Amazon Redshift database as a target for AWS Database Migration Service.

Amazon S3 is a storage and retrieval service. To store an object in Amazon S3, you upload the file you want to store to an Amazon S3 bucket. When you upload a file, you can set permissions on the object and also on any metadata.

Large-scale migrations

Large-scale data migrations can include many terabytes of information, and can be slowed by network performance and by the sheer amount of data that has to be moved. AWS Snowball Edge is an AWS service you can use to transfer data to the cloud at faster-than-network speeds using an AWS-owned appliance. An AWS Snowball Edge device can hold up to 100 TB of data. It uses 256-bit encryption and an industry-standard Trusted Platform Module (TPM) to ensure both security and full chain-of-custody for your data. AWS SCT works with AWS Snowball Edge devices.

When you use AWS SCT and an AWS Snowball Edge device, you migrate your data in two stages. First, you use AWS SCT to process the data locally and move it to the AWS Snowball Edge device. You then send the device to AWS using the AWS Snowball Edge process, and AWS automatically loads the data into an Amazon S3 bucket. Next, when the data is available on Amazon S3, you use AWS SCT to migrate the data to Amazon Redshift. Data extraction agents can work in the background while AWS SCT is closed.

The following diagram shows the supported scenario.

Extraction agent architecture

Data extraction agents are currently supported for the following source data warehouses:

  • Azure Synapse Analytics

  • BigQuery

  • Greenplum Database (version 4.3)

  • Microsoft SQL Server (version 2008 and higher)

  • Netezza (version 7.0.3 and higher)

  • Oracle (version 10 and higher)

  • Snowflake (version 3)

  • Teradata (version 13 and higher)

  • Vertica (version 7.2.2 and higher)

You can connect to FIPS endpoints for Amazon Redshift if you need to comply with the Federal Information Processing Standard (FIPS) security requirements. FIPS endpoints are available in the following AWS Regions:

  • US East (N. Virginia) Region (redshift-fips.us-east-1.amazonaws.com)

  • US East (Ohio) Region (redshift-fips.us-east-2.amazonaws.com)

  • US West (N. California) Region (redshift-fips.us-west-1.amazonaws.com)

  • US West (Oregon) Region (redshift-fips.us-west-2.amazonaws.com)
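The US Region FIPS endpoints follow a predictable naming pattern, so they can be derived per Region, as the following sketch shows. The helper name is an assumption for illustration; always confirm the actual endpoint names against the AWS documentation for your Region.

```python
def redshift_fips_endpoint(region: str) -> str:
    """Derive the Amazon Redshift FIPS endpoint host name for a US Region."""
    return f"redshift-fips.{region}.amazonaws.com"

# The four Regions listed above:
for region in ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]:
    print(redshift_fips_endpoint(region))
```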

Use the information in the following topics to learn how to work with data extraction agents.

Prerequisites for using data extraction agents

Before you work with data extraction agents, add the required permissions for Amazon Redshift as a target to your Amazon Redshift user. For more information, see Permissions for Amazon Redshift as a target.

Then, store your Amazon S3 bucket information and set up your Secure Sockets Layer (SSL) trust and key store.

Amazon S3 settings

After your agents extract your data, they upload it to your Amazon S3 bucket. Before you continue, you must provide the credentials to connect to your AWS account and your Amazon S3 bucket. You store your credentials and bucket information in a profile in the global application settings, and then associate the profile with your AWS SCT project. If necessary, choose Global settings to create a new profile. For more information, see Storing AWS service profiles in AWS SCT.

To migrate data into your target Amazon Redshift database, the AWS SCT data extraction agent needs permission to access the Amazon S3 bucket on your behalf. To provide this permission, create an AWS Identity and Access Management (IAM) user with the following policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:GetObjectTagging",
        "s3:PutObjectTagging"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_name/*",
        "arn:aws:s3:::bucket_name"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_name"
      ],
      "Effect": "Allow"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:GetUser"
      ],
      "Resource": [
        "arn:aws:iam::111122223333:user/DataExtractionAgentName"
      ],
      "Effect": "Allow"
    }
  ]
}

In the preceding example, replace bucket_name with the name of your Amazon S3 bucket. Then, replace 111122223333:user/DataExtractionAgentName with the name of your IAM user.
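If you manage this policy in code, the substitutions above can be made programmatically. The following Python sketch builds the same policy document; the `build_agent_policy` helper and its parameter names are illustrative assumptions, not part of AWS SCT.

```python
import json

def build_agent_policy(bucket_name: str, account_id: str, agent_user: str) -> dict:
    """Build the S3 access policy for an AWS SCT data extraction agent user."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:PutObject", "s3:DeleteObject", "s3:GetObject",
                           "s3:GetObjectTagging", "s3:PutObjectTagging"],
                "Resource": [f"{bucket_arn}/*", bucket_arn],
                "Effect": "Allow",
            },
            {
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [bucket_arn],
                "Effect": "Allow",
            },
            {"Effect": "Allow", "Action": "s3:ListAllMyBuckets", "Resource": "*"},
            {
                "Action": ["iam:GetUser"],
                "Resource": [f"arn:aws:iam::{account_id}:user/{agent_user}"],
                "Effect": "Allow",
            },
        ],
    }

policy = build_agent_policy("my-sct-bucket", "111122223333", "DataExtractionAgentName")
print(json.dumps(policy, indent=2))
```

The bucket name and user name here are placeholders; substitute your own values before attaching the policy.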

Assuming IAM roles

For additional security, you can use AWS Identity and Access Management (IAM) roles to access your Amazon S3 bucket. To do so, create an IAM user for your data extraction agents without any permissions. Then, create an IAM role that enables Amazon S3 access, and specify the list of services and users that can assume this role. For more information, see IAM roles in the IAM User Guide.

To configure IAM roles to access your Amazon S3 bucket
  1. Create a new IAM user. For user credentials, choose Programmatic access type.

  2. Configure the host environment so that your data extraction agent can assume the role that AWS SCT provides. Make sure that the user that you configured in the previous step enables data extraction agents to use the credential provider chain. For more information, see Using credentials in the AWS SDK for Java Developer Guide.

  3. Create a new IAM role that has access to your Amazon S3 bucket.

  4. Modify the trust section of this role to trust the user that you created before to assume the role. In the following example, replace 111122223333:user/DataExtractionAgentName with the name of your user.

    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:user/DataExtractionAgentName"
      },
      "Action": "sts:AssumeRole"
    }
  5. Modify the trust section of this role to trust redshift.amazonaws.com to assume the role.

    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  6. Attach this role to your Amazon Redshift cluster.

Now, you can run your data extraction agent in AWS SCT.
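Because steps 4 and 5 both modify the same trust policy, the finished document contains two statements: one trusting the agent's IAM user and one trusting the Amazon Redshift service. As an illustration, the combined trust policy can be assembled like this (the `build_trust_policy` helper is an assumption for this sketch, not an AWS SCT API):

```python
import json

def build_trust_policy(account_id: str, agent_user: str) -> dict:
    """Combine both trust statements: the agent's IAM user and the
    Amazon Redshift service principal, each allowed to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/{agent_user}"},
                "Action": "sts:AssumeRole",
            },
            {
                "Effect": "Allow",
                "Principal": {"Service": ["redshift.amazonaws.com"]},
                "Action": "sts:AssumeRole",
            },
        ],
    }

trust = build_trust_policy("111122223333", "DataExtractionAgentName")
print(json.dumps(trust, indent=2))
```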

When you use IAM role assumption, data migration works as follows. The data extraction agent starts and gets the user credentials using the credential provider chain. Next, you create a data migration task in AWS SCT, specify the IAM role for the data extraction agents to assume, and start the task. AWS Security Token Service (AWS STS) generates temporary credentials to access Amazon S3. The data extraction agent uses these credentials to upload data to Amazon S3.

Then, AWS SCT provides Amazon Redshift with the IAM role. In turn, Amazon Redshift gets new temporary credentials from AWS STS to access Amazon S3. Amazon Redshift uses these credentials to copy data from Amazon S3 to your Amazon Redshift table.

Security settings

The AWS Schema Conversion Tool and the extraction agents can communicate through Secure Sockets Layer (SSL). To enable SSL, set up a trust store and key store.

To set up secure communication with your extraction agent
  1. Start the AWS Schema Conversion Tool.

  2. Open the Settings menu, and then choose Global settings. The Global settings dialog box appears.

  3. Choose Security.

  4. Choose Generate trust and key store, or choose Select existing trust store.

    If you choose Generate trust and key store, you then specify the name and password for the trust and key stores, and the path to the location for the generated files. You use these files in later steps.

    If you choose Select existing trust store, you then specify the password and file name for the trust and key stores. You use these files in later steps.

  5. After you have specified the trust store and key store, choose OK to close the Global settings dialog box.

Configuring the environment for data extraction agents

You can install several data extraction agents on a single host. However, we recommend that you run one data extraction agent on one host.

To run your data extraction agent, use a host with at least four vCPUs and 32 GB of memory. Also, set the minimum memory available to AWS SCT to at least 4 GB. For more information, see Configuring additional memory.

Optimal configuration and number of agent hosts depend on the specific situation of each customer. Make sure that you consider such factors as amount of data to migrate, network bandwidth, time to extract data, and so on. You can perform a proof of concept (PoC) first, and then configure your data extraction agents and hosts according to the results of this PoC.
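One way to turn PoC results into a starting configuration is a rough capacity calculation: divide the total data volume by what one agent can move within your migration window. The numbers and the `estimate_agents` helper below are illustrative assumptions, not AWS guidance; measure real per-agent throughput in your own PoC.

```python
import math

def estimate_agents(total_gb: float, per_agent_gb_per_hour: float,
                    window_hours: float) -> int:
    """Estimate how many extraction agents are needed to finish in the window."""
    per_agent_capacity = per_agent_gb_per_hour * window_hours
    return max(1, math.ceil(total_gb / per_agent_capacity))

# Example: 20 TB of data, ~50 GB/hour per agent (measured in a PoC), 48-hour window.
agents = estimate_agents(20_000, 50, 48)
print(agents)  # -> 9
```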

Installing extraction agents

We recommend that you install multiple extraction agents on individual computers, separate from the computer that is running the AWS Schema Conversion Tool.

Extraction agents are currently supported on the following operating systems:

  • Microsoft Windows

  • Red Hat Enterprise Linux (RHEL) 6.0

  • Ubuntu Linux (version 14.04 and higher)

Use the following procedure to install extraction agents. Repeat this procedure for each computer that you want to install an extraction agent on.

To install an extraction agent
  1. If you have not already downloaded the AWS SCT installer file, follow the instructions at Installing, verifying, and updating AWS SCT to download it. The .zip file that contains the AWS SCT installer file also contains the extraction agent installer file.

  2. Download and install the latest version of Amazon Corretto 11. For more information, see Downloads for Amazon Corretto 11 in the Amazon Corretto 11 User Guide.

  3. Locate the installer file for your extraction agent in a subfolder named agents. For each computer operating system, the correct file to install the extraction agent is shown following.

    • Microsoft Windows: aws-schema-conversion-tool-extractor-2.0.1.build-number.msi

    • RHEL: aws-schema-conversion-tool-extractor-2.0.1.build-number.x86_64.rpm

    • Ubuntu Linux: aws-schema-conversion-tool-extractor-2.0.1.build-number.deb

  4. Install the extraction agent on a separate computer by copying the installer file to the new computer.

  5. Run the installer file. Use the instructions for your operating system, shown following.

    • Microsoft Windows: Double-click the installer file to run it.

    • RHEL: Run the following commands in the folder that you downloaded or moved the file to:

      sudo rpm -ivh aws-schema-conversion-tool-extractor-2.0.1.build-number.x86_64.rpm
      sudo ./sct-extractor-setup.sh --config

    • Ubuntu Linux: Run the following commands in the folder that you downloaded or moved the file to:

      sudo dpkg -i aws-schema-conversion-tool-extractor-2.0.1.build-number.deb
      sudo ./sct-extractor-setup.sh --config
  6. Choose Next, accept the license agreement, and choose Next.

  7. Enter the path to install the AWS SCT data extraction agent, and choose Next.

  8. Choose Install to install your data extraction agent.

    AWS SCT installs your data extraction agent. To complete the installation, configure your data extraction agent. AWS SCT automatically launches the configuration setup program. For more information, see Configuring extraction agents.

  9. Choose Finish to close the installation wizard after you configure your data extraction agent.

Configuring extraction agents

Use the following procedure to configure extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

To configure your extraction agent
  1. Launch the configuration setup program:

    • In Windows, AWS SCT launches the configuration setup program automatically during the installation of a data extraction agent.

      As needed, you can launch the setup program manually. To do so, run the ConfigAgent.bat file in Windows. You can find this file in the folder where you installed the agent.

    • In RHEL and Ubuntu, run the sct-extractor-setup.sh file from the location where you installed the agent.

    The setup program prompts you for information. For each prompt, a default value appears.

  2. Accept the default value at each prompt, or enter a new value.

    Specify the following information:

    • For Listening port, enter the port number the agent listens on.

    • For Add a source vendor, enter yes, and then enter your source data warehouse platform.

    • For JDBC driver, enter the location where you installed the JDBC drivers.

    • For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

    • For Enable SSL communication, enter yes.

    • For Key store, enter the location of the key store file.

    • For Key store password, enter the password for the key store.

    • For Enable client SSL authentication, enter yes.

    • For Trust store, enter the location of the trust store file.

    • For Trust store password, enter the password for the trust store.

The setup program updates the settings file for the extraction agent. The settings file is named settings.properties, and is located where you installed the extraction agent.

The following is a sample settings file.

$ cat settings.properties
#extractor.start.fetch.size=20000
#extractor.out.file.size=10485760
#extractor.source.connection.pool.size=20
#extractor.source.connection.pool.min.evictable.idle.time.millis=30000
#extractor.extracting.thread.pool.size=10
vendor=TERADATA
driver.jars=/usr/share/lib/jdbc/terajdbc4.jar
port=8192
redshift.driver.jars=/usr/share/lib/jdbc/RedshiftJDBC42-1.2.43.1067.jar
working.folder=/data/sct
extractor.private.folder=/home/ubuntu
ssl.option=OFF

To change configuration settings, you can edit the settings.properties file using a text editor or run the agent configuration again.
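The settings file uses the Java-style `key=value` properties format, with `#` marking commented-out defaults. If you script your agent configuration, a minimal parser such as the following can read it; this sketch assumes simple one-line entries (no line continuations or escapes) and is not part of AWS SCT.

```python
def parse_properties(text: str) -> dict:
    """Parse a simple Java-style .properties file, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # commented-out defaults stay inactive
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

sample = """\
#extractor.start.fetch.size=20000
vendor=TERADATA
port=8192
working.folder=/data/sct
ssl.option=OFF
"""
settings = parse_properties(sample)
print(settings["vendor"], settings["port"])  # -> TERADATA 8192
```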

Installing and configuring extraction agents with dedicated copying agents

You can install extraction agents in a configuration that has shared storage and a dedicated copying agent. The following diagram illustrates this scenario.

Extractor agents with dedicated copy agent configuration

This configuration can be useful when your source database server supports up to 120 connections and your network has ample storage attached. Use the following procedure to configure extraction agents that have a dedicated copying agent.

To install and configure extraction agents and a dedicated copying agent
  1. Make sure that the working directory of all extracting agents uses the same folder on shared storage.

  2. Install extractor agents by following the steps in Installing extraction agents.

  3. Configure extraction agents by following the steps in Configuring extraction agents, but specify only the source JDBC driver.

  4. Configure a dedicated copying agent by following the steps in Configuring extraction agents, but specify only an Amazon Redshift JDBC driver.

Starting extraction agents

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start your extraction agent
  • On the computer that has the extraction agent installed, run the command listed following for your operating system.

    • Microsoft Windows: Double-click the StartAgent.bat batch file.

    • RHEL: Run the following command in the folder where you installed the agent:

      sudo initctl start sct-extractor

    • Ubuntu Linux: Run the command appropriate for your version of Ubuntu in the folder where you installed the agent:

      Ubuntu 14.04: sudo initctl start sct-extractor
      Ubuntu 15.04 and higher: sudo systemctl start sct-extractor

To check the status of the agent, run the same command but replace start with status.

To stop an agent, run the same command but replace start with stop.

Registering extraction agents with the AWS Schema Conversion Tool

You manage your extraction agents by using AWS SCT. The extraction agents act as listeners. When they receive instructions from AWS SCT, they extract data from your data warehouse.

Use the following procedure to register extraction agents with your AWS SCT project.

To register an extraction agent
  1. Start the AWS Schema Conversion Tool, and open a project.

  2. Open the View menu, and then choose Data Migration view (other). The Agents tab appears. If you have previously registered agents, AWS SCT displays them in a grid at the top of the tab.

  3. Choose Register.

    After you register an agent with an AWS SCT project, you can't register the same agent with a different project. If you're no longer using an agent in an AWS SCT project, you can unregister it. You can then register it with a different project.

  4. Choose Redshift data agent, and then choose OK.

  5. Enter your information on the Connection tab of the dialog box:

    1. For Description, enter a description of the agent.

    2. For Host Name, enter the host name or IP address of the computer of the agent.

    3. For Port, enter the port number that the agent is listening on.

    4. Choose Register to register the agent with your AWS SCT project.

  6. Repeat the previous steps to register multiple agents with your AWS SCT project.

Hiding and recovering information for an AWS SCT agent

An AWS SCT agent encrypts a significant amount of information, for example passwords to user key-trust stores, database accounts, AWS account information, and similar items. It does so using a special file called seed.dat. By default, the agent creates this file in the working folder of the user who first configures the agent.

Because different users can configure and run the agent, the path to seed.dat is stored in the {extractor.private.folder} parameter of the settings.properties file. When the agent starts, it can use this path to find the seed.dat file to access the key-trust store information for the database it acts on.

You might need to recover passwords that an agent has stored in these cases:

  • If the user loses the seed.dat file and the AWS SCT agent's location and port didn't change.

  • If the user loses the seed.dat file and the AWS SCT agent's location and port has changed. In this case, the change usually occurs because the agent was migrated to another host or port and the information in the seed.dat file is no longer valid.

In these cases, if an agent is started without SSL, it starts and then accesses the previously created agent storage. It then goes to the Waiting for recovery state.

However, in these cases, if an agent is started with SSL you can't restart it. This is because the agent can't decrypt the passwords to certificates stored in the settings.properties file. In this type of startup, the agent fails to start. An error similar to the following is written in the log: "The agent could not start with SSL mode enabled. Please reconfigure the agent. Reason: The password for keystore is incorrect."

To fix this, create a new agent and configure the agent to use the existing passwords for accessing the SSL certificates. To do so, use the following procedure.

After you perform this procedure, the agent should run and go to the Waiting for recovery state. AWS SCT automatically sends the needed passwords to an agent in the Waiting for recovery state. When the agent has the passwords, it restarts any tasks. No further user action is required on the AWS SCT side.

To reconfigure the agent and restore passwords for accessing SSL certificates
  1. Install a new AWS SCT agent and run configuration.

  2. To have the new agent work with the existing agent storage, change the agent.name property in the instance.properties file to the name of the agent that the storage was created for.

    The instance.properties file is stored in the agent's private folder, which is named using the following convention: {output.folder}\dmt\{hostName}_{portNumber}\.

  3. Change the name of {output.folder} to that of the previous agent's output folder.

    At this point, AWS SCT is still trying to access the old extractor at the old host and port. As a result, the inaccessible extractor gets the status FAILED. You can then change the host and port.

  4. Modify the host, port, or both for the old agent by using the Modify command to redirect the request flow to the new agent.

When AWS SCT can ping the new agent, AWS SCT receives the status Waiting for recovery from the agent. AWS SCT then automatically recovers the passwords for the agent.

Each agent that works with the agent storage updates a special file called storage.lck located at {output.folder}\{agentName}\storage\. This file contains the agent's network ID and the time until which the storage is locked. When the agent works with the agent storage, it updates the storage.lck file and extends the lease of the storage by 10 minutes every 5 minutes. No other instance can work with this agent storage before the lease expires.
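The lease rule described above (extend the lock by 10 minutes, renewing every 5 minutes) can be sketched as follows. This is an illustrative model of the locking behavior, not the actual storage.lck file format or AWS SCT code.

```python
from datetime import datetime, timedelta

LEASE_EXTENSION = timedelta(minutes=10)  # lease granted on each renewal
RENEW_INTERVAL = timedelta(minutes=5)    # how often an active agent renews

def renew_lease(now: datetime) -> datetime:
    """Return the lease expiry an agent would record when it renews the lock."""
    return now + LEASE_EXTENSION

def can_acquire(other_agent_expiry: datetime, now: datetime) -> bool:
    """Another instance may take over the storage only after the lease expires."""
    return now >= other_agent_expiry

start = datetime(2024, 1, 1, 12, 0)
expiry = renew_lease(start)                          # storage locked until 12:10
print(can_acquire(expiry, start + RENEW_INTERVAL))   # False: holder renews at 12:05
print(can_acquire(expiry, start + LEASE_EXTENSION))  # True only if no renewal happened
```

Because the holder renews every 5 minutes while the lease runs for 10, a healthy agent never lets its lease lapse; the storage becomes available only after the holder stops renewing.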

Creating data migration rules in AWS SCT

Before you extract your data with the AWS Schema Conversion Tool, you can set up filters that reduce the amount of data that you extract. You can create data migration rules by using WHERE clauses to reduce the data that you extract. For example, you can write a WHERE clause that selects data from a single table.

You can create data migration rules and save the filters as part of your project. With your project open, use the following procedure to create data migration rules.

To create data migration rules
  1. Open the View menu, and then choose Data Migration view (other).

  2. Choose Data migration rules, and then choose Add new rule.

  3. Configure your data migration rule:

    1. For Name, enter a name for your data migration rule.

    2. For Where schema name is like, enter a filter to apply to schemas. In this filter, a WHERE clause is evaluated by using a LIKE clause. To choose one schema, enter an exact schema name. To choose multiple schemas, use the “%” character as a wildcard to match any number of characters in the schema name.

    3. For table name like, enter a filter to apply to tables. In this filter, a WHERE clause is evaluated by using a LIKE clause. To choose one table, enter an exact name. To choose multiple tables, use the “%” character as a wildcard to match any number of characters in the table name.

    4. For Where clause, enter a WHERE clause to filter data.

  4. After you have configured your filter, choose Save to save your filter, or Cancel to cancel your changes.

  5. After you are done adding, editing, and deleting filters, choose Save all to save all your changes.

To turn off a filter without deleting it, use the toggle icon. To duplicate an existing filter, use the copy icon. To delete an existing filter, use the delete icon. To save any changes you make to your filters, choose Save all.
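The schema and table filters above behave like SQL LIKE patterns, where “%” matches any run of characters. To preview which tables a rule would select before running a migration, you can translate the pattern to a regular expression, as in this sketch (the helper names are assumptions; this only models the “%” wildcard described above):

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Convert a SQL LIKE pattern ('%' = any run of characters) to an anchored regex."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.compile(f"^{regex}$", re.IGNORECASE)

def matching_tables(tables: list, like_pattern: str) -> list:
    """Return the tables that the LIKE-style filter would select."""
    rx = like_to_regex(like_pattern)
    return [t for t in tables if rx.match(t)]

tables = ["SALES_2023", "SALES_2024", "CUSTOMERS", "SALES_ARCHIVE"]
print(matching_tables(tables, "SALES_%"))   # the three SALES_* tables
print(matching_tables(tables, "CUSTOMERS")) # an exact name matches one table
```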

Changing extractor and copy settings from project settings

From the Project settings window in AWS SCT, you can choose settings for data extraction agents and the Amazon Redshift COPY command.

To choose these settings, choose Settings, Project settings, and then choose Data migration. Here, you can edit Extraction settings, Amazon S3 settings, and Copy settings.

Use the instructions in the following table to provide the information for Extraction settings.

  • Compression format – Specify the compression format of the input files. Choose one of the following options: GZIP, BZIP2, ZSTD, or No compression.

  • Delimiter character – Specify the ASCII character that separates fields in the input files. Nonprinting characters aren't supported.

  • NULL value as a string – Turn this option on if your data includes a null terminator. If this option is turned off, the Amazon Redshift COPY command treats null as an end of the record and ends the load process.

  • Sorting strategy – Use sorting to restart the extraction from the point of failure. Choose one of the following sorting strategies: Use sorting after the first fail (recommended), Use sorting if possible, or Never use sorting. For more information, see Sorting data before migrating using AWS SCT.

  • Source temp schema – Enter the name of the schema in the source database where the extraction agent can create the temporary objects.

  • Out file size (in MB) – Enter the size, in MB, of the files uploaded to Amazon S3.

  • Snowball out file size (in MB) – Enter the size, in MB, of the files uploaded to AWS Snowball. Files can be 1–1,000 MB in size.

  • Use automatic partitioning. For Greenplum and Netezza, enter the minimal size of supported tables (in megabytes) – Turn this option on to use table partitioning, and then enter the size of tables to partition for Greenplum and Netezza source databases. For Oracle to Amazon Redshift migrations, you can keep this field empty because AWS SCT creates subtasks for all partitioned tables.

  • Extract LOBs – Turn this option on to extract large objects (LOBs) from your source database. LOBs include BLOBs, CLOBs, NCLOBs, XML files, and so on. For every LOB, AWS SCT extraction agents create a data file.

  • Amazon S3 bucket LOBs folder – Enter the location for AWS SCT extraction agents to store LOBs.

  • Apply RTRIM to string columns – Turn this option on to trim a specified set of characters from the end of the extracted strings.

  • Keep files locally after upload to Amazon S3 – Turn this option on to keep files on your local machine after data extraction agents upload them to Amazon S3.

Use the instructions in the following table to provide the information for Amazon S3 settings.

  • Use proxy – Turn this option on to use a proxy server to upload data to Amazon S3. Then choose the data transfer protocol, and enter the host name, port, user name, and password.

  • Endpoint type – Choose FIPS to use the Federal Information Processing Standard (FIPS) endpoint. Choose VPCE to use the virtual private cloud (VPC) endpoint. Then for VPC endpoint, enter the Domain Name System (DNS) name of your VPC endpoint.

  • Keep files on Amazon S3 after copying to Amazon Redshift – Turn this option on to keep extracted files on Amazon S3 after copying these files to Amazon Redshift.

Use the instructions in the following table to provide the information for Copy settings.

  • Maximum error count – Enter the number of load errors. After the operation reaches this limit, the AWS SCT data extraction agents end the data load process. The default value is 0, which means that the AWS SCT data extraction agents continue the data load regardless of the failures.

  • Replace not valid UTF-8 characters – Turn this option on to replace not valid UTF-8 characters with the specified character and continue the data load operation.

  • Use blank as null value – Turn this option on to load blank fields that consist of white space characters as null.

  • Use empty as null value – Turn this option on to load empty CHAR and VARCHAR fields as null.

  • Truncate columns – Turn this option on to truncate data in columns to fit the data type specification.

  • Automatic compression – Turn this option on to apply compression encoding during a copy operation.

  • Automatic statistics refresh – Turn this option on to refresh the statistics at the end of a copy operation.

  • Check file before load – Turn this option on to validate data files before loading them to Amazon Redshift.

Sorting data before migrating using AWS SCT

Sorting your data before migration with AWS SCT provides some benefits. If you sort data first, AWS SCT can restart the extraction agent at the last saved point after a failure. Also, if you are migrating data to Amazon Redshift and you sort data first, AWS SCT can insert data into Amazon Redshift faster.

These benefits have to do with how AWS SCT creates data extraction queries. In some cases, AWS SCT uses the DENSE_RANK analytic function in these queries. However, DENSE_RANK can consume significant time and server resources to sort the dataset that results from extraction, so if AWS SCT can work without it, it does.

To sort data before migrating using AWS SCT
  1. Open an AWS SCT project.

  2. Open the context (right-click) menu for the object, and then choose Create Local task.

  3. Choose the Advanced tab, and for Sorting strategy, choose an option:

    • Never use sorting – The extraction agent doesn't use the DENSE_RANK analytic function and restarts from the beginning if a failure occurs.

    • Use sorting if possible – The extraction agent uses DENSE_RANK if the table has a primary key or a unique constraint.

    • Use sorting after first fail (recommended) – The extraction agent first tries to get the data without using DENSE_RANK. If the first attempt fails, the extraction agent rebuilds the query using DENSE_RANK and preserves its location in case of failure.

  4. Set additional parameters as described following, and then choose Create to create your data extraction task.
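The recommended strategy amounts to a try-then-fall-back pattern: attempt the cheap unsorted extraction first, and only rebuild the query with sorting if that attempt fails. The following sketch models that control flow; the callables and exception type are placeholders for illustration, not actual AWS SCT internals.

```python
def extract_with_fallback(extract_plain, extract_sorted):
    """Model of 'Use sorting after first fail': try the fast unsorted
    extraction; on failure, switch to the restartable sorted query."""
    try:
        return extract_plain()
    except RuntimeError:
        # First attempt failed: rebuild with DENSE_RANK-style sorting so the
        # task can resume from the point of failure instead of restarting.
        return extract_sorted()

calls = []

def plain():
    calls.append("plain")
    raise RuntimeError("connection dropped mid-extraction")

def sorted_query():
    calls.append("sorted")
    return "rows extracted with sort order preserved"

result = extract_with_fallback(plain, sorted_query)
print(calls, result)
```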

Creating, running, and monitoring an AWS SCT data extraction task

Use the following procedures to create, run, and monitor data extraction tasks.

To assign tasks to agents and migrate data
  1. In the AWS Schema Conversion Tool, after you have converted your schema, choose one or more tables from the left panel of your project.

    You can choose all tables, but we recommend against that for performance reasons. We recommend that you create multiple tasks for multiple tables based on the size of the tables in your data warehouse.

  2. Open the context (right-click) menu for each table, and then choose Create task. The Create Local task dialog box opens.

  3. For Task name, enter a name for the task.

  4. For Migration mode, choose one of the following:

    • Extract only – Extract your data, and save the data to your local working folders.

    • Extract and upload – Extract your data, and upload your data to Amazon S3.

    • Extract, upload and copy – Extract your data, upload your data to Amazon S3, and copy it into your Amazon Redshift data warehouse.

  5. For Encryption type, choose one of the following:

    • NONE – Turn off data encryption for the entire data migration process.

    • CSE_SK – Use client-side encryption with a symmetric key to migrate data. AWS SCT automatically generates encryption keys and transmits them to data extraction agents using Secure Sockets Layer (SSL). AWS SCT doesn't encrypt large objects (LOBs) during data migration.

  6. Choose Extract LOBs to extract large objects. If you don't need to extract large objects, you can clear the check box. Doing this reduces the amount of data that you extract.

  7. To see detailed information about a task, choose Enable task logging. You can use the task log to debug problems.

    If you enable task logging, choose the level of detail that you want to see. The levels are the following, with each level including all messages from the previous level:

    • ERROR – The smallest amount of detail.

    • WARNING

    • INFO

    • DEBUG

    • TRACE – The largest amount of detail.

  8. To export data from BigQuery, AWS SCT uses a Google Cloud Storage bucket folder. Data extraction agents store your source data in this folder.

    To enter the path to your Google Cloud Storage bucket folder, choose Advanced. For Google CS bucket folder, enter the bucket name and the folder name.

  9. To assume a role for your data extraction agent user, choose Amazon S3 settings. For IAM role, enter the name of the role to use. For Region, choose the AWS Region for this role.

  10. Choose Test task to verify that you can connect to your working folder, Amazon S3 bucket, and Amazon Redshift data warehouse. The verification depends on the migration mode you chose.

  11. Choose Create to create the task.

  12. Repeat the previous steps to create tasks for all the data that you want to migrate.

To run and monitor tasks
  1. For View, choose Data Migration view. The Agents tab appears.

  2. Choose the Tasks tab. Your tasks appear in the grid at the top. You can see the status of a task in the top grid, and the status of its subtasks in the bottom grid.

  3. Choose a task in the top grid and expand it. Depending on the migration mode you chose, you see the task divided into Extract, Upload, and Copy.

  4. Choose Start for a task to start that task. You can monitor the status of your tasks while they work. The subtasks run in parallel, and the extract, upload, and copy stages also run in parallel.

  5. If you enabled logging when you set up the task, you can view the log:

    1. Choose Download log. A message appears with the name of the folder that contains the log file. Dismiss the message.

    2. A link appears in the Task details tab. Choose the link to open the folder that contains the log file.

You can close AWS SCT, and your agents and tasks continue to run. You can reopen AWS SCT later to check the status of your tasks and view the task logs.

You can save data extraction tasks to your local disk and restore them to the same or another project by using export and import. To export a task, make sure that you have at least one extraction task created in a project. You can import a single extraction task or all of the tasks created in the project.

When you export an extraction task, AWS SCT creates a separate .xml file for that task. The .xml file stores that task's metadata information, such as task properties, description, and subtasks. The .xml file doesn't contain information about processing of an extraction task. Information like the following is recreated when the task is imported:

  • Task progress

  • Subtask and stage states

  • Distribution of extracting agents by subtasks and stages

  • Task and subtask IDs

  • Task name

Exporting and importing an AWS SCT data extraction task

You can quickly save an existing task from one project and restore it in another project (or the same project) using AWS SCT export and import. Use the following procedure to export and import data extraction tasks.

To export and import a data extraction task
  1. For View, choose Data Migration view. The Agents tab appears.

  2. Choose the Tasks tab. Your tasks are listed in the grid that appears.

  3. Choose the three vertically aligned dots (ellipsis icon) located at the lower right corner under the list of tasks.

  4. Choose Export task from the pop-up menu.

  5. Choose the folder where you want AWS SCT to place the task export .xml file.

    AWS SCT creates the task export file with a file name format of TASK-DESCRIPTION_TASK-ID.xml.

  6. Choose the three vertically aligned dots (ellipsis icon) at lower right under the list of tasks.

  7. Choose Import task from the pop-up menu.

    You can import an extraction task only into a project that is connected to the same source database and has at least one active registered extraction agent.

  8. Select the .xml file for the extraction task you exported.

    AWS SCT gets the extraction task's parameters from the file, creates the task, and adds the task to the extracting agents.

  9. Repeat these steps to export and import additional data extraction tasks.

At the end of this process, your export and import are complete and your data extraction tasks are ready for use.

Data extraction using an AWS Snowball Edge device

The process of using AWS SCT and AWS Snowball Edge has several steps. The migration involves a local task, where AWS SCT uses a data extraction agent to move the data to the AWS Snowball Edge device; an intermediate action, where AWS copies the data from the AWS Snowball Edge device to an Amazon S3 bucket; and a final step, where AWS SCT loads the data from the Amazon S3 bucket to Amazon Redshift.

The sections following this overview provide a step-by-step guide to each of these tasks. The procedure assumes that you have AWS SCT installed and that you have configured and registered a data extraction agent on a dedicated machine.

Perform the following steps to migrate data from a local data store to an AWS data store using AWS Snowball Edge.

  1. Create an AWS Snowball Edge job using the AWS Snowball console.

  2. Unlock the AWS Snowball Edge device using the local, dedicated Linux machine.

  3. Create a new project in AWS SCT.

  4. Install and configure your data extraction agents.

  5. Create and set permissions for the Amazon S3 bucket to use.

  6. Import an AWS Snowball job to your AWS SCT project.

  7. Register your data extraction agent in AWS SCT.

  8. Create a local task in AWS SCT.

  9. Run and monitor the data migration task in AWS SCT.

Step-by-step procedures for migrating data using AWS SCT and AWS Snowball Edge

The following sections provide detailed information on the migration steps.

Step 1: Create an AWS Snowball Edge job

Create an AWS Snowball job by following the steps outlined in the section Creating an AWS Snowball Edge Job in the AWS Snowball Edge Developer Guide.

Step 2: Unlock the AWS Snowball Edge device

Run the commands that unlock and provide credentials to the Snowball Edge device from the machine where you installed the data extraction agent. By running these commands, you can be sure that the data extraction agent connects to the AWS Snowball Edge device. For more information about unlocking the AWS Snowball Edge device, see Unlocking the Snowball Edge. For example, the following command lists the contents of the Amazon S3 bucket through the Snowball Edge endpoint:

aws s3 ls s3://<bucket-name> --profile <Snowball Edge profile> --endpoint http://<Snowball IP>:8080 --recursive

Step 3: Create a new AWS SCT project

Next, create a new AWS SCT project.

To create a new project in AWS SCT
  1. Start the AWS Schema Conversion Tool. On the File menu, choose New project. The New project dialog box appears.

  2. Enter a name for your project, which is stored locally on your computer.

  3. Enter the location for your local project file.

  4. Choose OK to create your AWS SCT project.

  5. Choose Add source to add a new source database to your AWS SCT project.

  6. Choose Add target to add a new target platform in your AWS SCT project.

  7. Choose the source database schema in the left panel.

  8. In the right panel, specify the target database platform for the selected source schema.

  9. Choose Create mapping. This button becomes active after you choose the source database schema and the target database platform.

Step 4: Install and configure your data extraction agent

AWS SCT uses a data extraction agent to migrate data to Amazon Redshift. The .zip file that you downloaded to install AWS SCT includes the extraction agent installer file. You can install the data extraction agent in Windows, Red Hat Enterprise Linux, or Ubuntu. For more information, see Installing extraction agents.

To configure your data extraction agent, enter your source and target database engines. Also, make sure that you downloaded JDBC drivers for your source and target databases on the computer where you run your data extraction agent. Data extraction agents use these drivers to connect to your source and target databases. For more information, see Downloading the required database drivers.

In Windows, the data extraction agent installer launches the configuration wizard in the command prompt window. In Linux, run the sct-extractor-setup.sh file from the location where you installed the agent.

Step 5: Configure AWS SCT to access the Amazon S3 bucket

For information on configuring an Amazon S3 bucket, see Buckets overview in the Amazon Simple Storage Service User Guide.

Step 6: Import an AWS Snowball job to your AWS SCT project

To connect your AWS SCT project with your AWS Snowball Edge device, import your AWS Snowball job.

To import your AWS Snowball job
  1. Open the Settings menu, and then choose Global settings. The Global settings dialog box appears.

  2. Choose AWS service profiles, and then choose Import job.

  3. Choose your AWS Snowball job.

  4. Enter your AWS Snowball IP. For more information, see Changing Your IP Address in the AWS Snowball User Guide.

  5. Enter your AWS Snowball Port. For more information, see Ports Required to Use AWS Services on an AWS Snowball Edge Device in the AWS Snowball Edge Developer Guide.

  6. Enter your AWS Snowball access key and AWS Snowball secret key. For more information, see Authorization and Access Control in AWS Snowball in the AWS Snowball User Guide.

  7. Choose Apply, and then choose OK.

Step 7: Register a data extraction agent in AWS SCT

In this section, you register the data extraction agent in AWS SCT.

To register a data extraction agent
  1. On the View menu, choose Data migration view (other), and then choose Register.

  2. For Description, enter a name for your data extraction agent.

  3. For Host name, enter the IP address of the computer where you run your data extraction agent.

  4. For Port, enter the listening port that you configured.

  5. Choose Register.

Step 8: Create a local task

Next, you create the migration task. The task includes two subtasks. One subtask migrates data from the source database to the AWS Snowball Edge appliance. The other subtask takes the data that the appliance loads into an Amazon S3 bucket and migrates it to the target database.

To create the migration task
  1. On the View menu, choose Data migration view (other).

  2. In the left panel that displays the schema from your source database, choose a schema object to migrate. Open the context (right-click) menu for the object, and then choose Create local task.

  3. For Task name, enter a descriptive name for your data migration task.

  4. For Migration mode, choose Extract, upload and copy.

  5. Choose Amazon S3 settings.

  6. Select Use Snowball.

  7. Enter folders and subfolders in your Amazon S3 bucket where the data extraction agent can store data.

  8. Choose Create to create the task.

Step 9: Run and monitor the data migration task in AWS SCT

To start your data migration task, choose Start. Before you do, make sure that you have established connections to the source database, the Amazon S3 bucket, the AWS Snowball Edge device, and the target database on AWS.

You can monitor and manage the data migration tasks and their subtasks in the Tasks tab. You can see the data migration progress, as well as pause or restart your data migration tasks.

Data extraction task output

After your migration tasks complete, your data is ready. Use the following information to determine how to proceed based on the migration mode you chose and the location of your data.

The following list describes where your data is located for each migration mode.

Extract, upload and copy

The data is already in your Amazon Redshift data warehouse. You can verify that the data is there, and start using it. For more information, see Connecting to clusters from client tools and code.

Extract and upload

The extraction agents saved your data as files in your Amazon S3 bucket. You can use the Amazon Redshift COPY command to load your data to Amazon Redshift. For more information, see Loading data from Amazon S3 in the Amazon Redshift documentation.

There are multiple folders in your Amazon S3 bucket, corresponding to the extraction tasks that you set up. When you load your data to Amazon Redshift, specify the name of the manifest file created by each task. The manifest file appears in the task folder in your Amazon S3 bucket.

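As a sketch of that manifest-based COPY step, the following shows how a load command might be assembled. The table name, bucket path, and IAM role here are hypothetical placeholders, not values generated by AWS SCT.

```python
# Hypothetical sketch of loading task output with a manifest-based COPY.
# The table, manifest path, and IAM role below are placeholders.
def build_copy_command(table, manifest_s3_path, iam_role_arn):
    # MANIFEST tells COPY to read the list of data files from the manifest
    # file instead of treating the path as a key prefix.
    return (
        f"COPY {table} "
        f"FROM '{manifest_s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "MANIFEST;"
    )

cmd = build_copy_command(
    "ssb.lineorder",
    "s3://my-sct-bucket/task-folder/task.manifest",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(cmd)
```

You would run the resulting statement against your Amazon Redshift cluster with a SQL client of your choice.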

Extract only

The extraction agents saved your data as files in your working folder. Manually copy your data to your Amazon S3 bucket, and then proceed with the instructions for Extract and upload.

Using virtual partitioning with AWS Schema Conversion Tool

You can often best manage large nonpartitioned tables by creating subtasks that create virtual partitions of the table data using filtering rules. In AWS SCT, you can create virtual partitions for your migrated data. There are three partition types, which work with specific data types:

  • The RANGE partition type works with numeric and date and time data types.

  • The LIST partition type works with numeric, character, and date and time data types.

  • The DATE AUTO SPLIT partition type works with numeric, date, and time data types.

AWS SCT validates the values you provide for creating a partition. For example, if you attempt to partition a column with data type NUMERIC but you provide values of a different data type, AWS SCT throws an error.

Also, if you are using AWS SCT to migrate data to Amazon Redshift, you can use native partitioning to manage the migration of large tables. For more information, see Using native partitioning.

Limits when creating virtual partitioning

The following limitations apply when creating a virtual partition:

  • You can use virtual partitioning only for nonpartitioned tables.

  • You can use virtual partitioning only in the data migration view.

  • You can't use the option UNION ALL VIEW with virtual partitioning.

RANGE partition type

The RANGE partition type partitions data based on a range of column values for numeric and date and time data types. This partition type creates a WHERE clause, and you provide the range of values for each partition. To specify a list of values for the partitioned column, use the Values box. You can load value information by using a .csv file.

The RANGE partition type creates default partitions at both ends of the partition values. These default partitions catch any data that is less than or greater than the specified partition values.

For example, you can create multiple partitions based on a value range that you provide. In the following example, the partitioning values for LO_TAX are specified to create multiple partitions.

Partition1: WHERE LO_TAX <= 10000.9
Partition2: WHERE LO_TAX > 10000.9 AND LO_TAX <= 15005.5
Partition3: WHERE LO_TAX > 15005.5 AND LO_TAX <= 25005.95
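The boundary logic behind these clauses can be sketched as follows. This is an illustration of how RANGE boundary values map to WHERE clauses, not AWS SCT's actual code.

```python
# Illustrative sketch (not AWS SCT code): how RANGE partition boundaries
# translate into per-partition WHERE clauses, including the default
# partitions at both ends of the value range.
def range_partitions(column, boundaries):
    # First partition catches everything up to the lowest boundary.
    clauses = [f"WHERE {column} <= {boundaries[0]}"]
    for low, high in zip(boundaries, boundaries[1:]):
        clauses.append(f"WHERE {column} > {low} AND {column} <= {high}")
    # Default partition that catches values above the last boundary.
    clauses.append(f"WHERE {column} > {boundaries[-1]}")
    return clauses

for clause in range_partitions("LO_TAX", [10000.9, 15005.5, 25005.95]):
    print(clause)
```

The last clause is the default partition described above, catching data greater than the highest specified partition value.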
To create a RANGE virtual partition
  1. Open AWS SCT.

  2. Choose Data Migration view (other) mode.

  3. Choose the table where you want to set up virtual partitioning. Open the context (right-click) menu for the table, and choose Add virtual partitioning.

  4. In the Add virtual partitioning dialog box, enter the information as follows.

    • Partition type – Choose RANGE. The dialog box UI changes depending on the type you choose.

    • Column name – Choose the column that you want to partition.

    • Column type – Choose the data type for the values in the column.

    • Values – Add new values by typing each value in the New Value box, then choosing the plus sign to add the value.

    • Load from file – (Optional) Enter the name of a .csv file that contains the partition values.

  5. Choose OK.

LIST partition type

The LIST partition type partitions data based on column values for numeric, character, and date and time data types. This partition type creates a WHERE clause, and you provide the values for each partition. To specify a list of values for the partitioned column, use the Values box. You can load value information by using a .csv file.

For example, you can create multiple partitions based on a value you provide. In the following example, the partitioning values for LO_ORDERKEY are specified to create multiple partitions.

Partition1: WHERE LO_ORDERKEY = 1
Partition2: WHERE LO_ORDERKEY = 2
Partition3: WHERE LO_ORDERKEY = 3
…
PartitionN: WHERE LO_ORDERKEY = USER_VALUE_N

You can also create a default partition for values not included in the ones specified.

You can use the LIST partition type to filter the source data if you want to exclude particular values from the migration. For example, suppose that you want to omit rows with LO_ORDERKEY = 4. In this case, don't include the value 4 in the list of partition values and make sure that Include other values isn't chosen.
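The LIST behavior described above, including the optional default partition and value filtering, can be sketched like this. Again, this is an illustration rather than AWS SCT's actual implementation.

```python
# Illustrative sketch (not AWS SCT code): LIST partitioning builds one
# WHERE clause per listed value, with an optional default partition for
# all values that don't meet the partitioning criteria.
def list_partitions(column, values, include_other_values=False):
    clauses = [f"WHERE {column} = {v}" for v in values]
    if include_other_values:
        listed = ", ".join(str(v) for v in values)
        # Default partition for rows whose value isn't in the list.
        clauses.append(f"WHERE {column} NOT IN ({listed})")
    return clauses

# Omitting the value 4 and leaving include_other_values off filters
# LO_ORDERKEY = 4 rows out of the migration, as described above.
for clause in list_partitions("LO_ORDERKEY", [1, 2, 3]):
    print(clause)
```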

To create a LIST virtual partition
  1. Open AWS SCT.

  2. Choose Data Migration view (other) mode.

  3. Choose the table where you want to set up virtual partitioning. Open the context (right-click) menu for the table, and choose Add virtual partitioning.

  4. In the Add virtual partitioning dialog box, enter the information as follows.

    • Partition type – Choose LIST. The dialog box UI changes depending on the type you choose.

    • Column name – Choose the column that you want to partition.

    • New value – Type a value here to add it to the set of partitioning values.

    • Include other values – Choose this option to create a default partition where all values that don't meet the partitioning criteria are stored.

    • Load from file – (Optional) Enter the name of a .csv file that contains the partition values.

  5. Choose OK.

DATE AUTO SPLIT partition type

The DATE AUTO SPLIT partition type is an automated way of generating RANGE partitions. With DATE AUTO SPLIT, you tell AWS SCT the partitioning attribute, where to start and end, and the size of the range between the values. Then AWS SCT calculates the partition values automatically.

DATE AUTO SPLIT automates a lot of the work that is involved with creating range partitions. The tradeoff between using this technique and range partitioning is how much control you need over the partition boundaries. The automatic split process always creates equal size (uniform) ranges. Range partitioning enables you to vary the size of each range as needed for your particular data distribution. For example, you can use daily, weekly, biweekly, monthly, and so on.

Partition1: WHERE LO_ORDERDATE >= '1954-10-10' AND LO_ORDERDATE < '1954-10-24'
Partition2: WHERE LO_ORDERDATE >= '1954-10-24' AND LO_ORDERDATE < '1954-11-06'
Partition3: WHERE LO_ORDERDATE >= '1954-11-06' AND LO_ORDERDATE < '1954-11-20'
…
PartitionN: WHERE LO_ORDERDATE >= USER_VALUE_N AND LO_ORDERDATE <= '2017-08-13'
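A uniform split like the one above can be sketched as follows. This is an illustration of the technique, assuming a fixed interval in days; it isn't AWS SCT's actual code, and AWS SCT's calendar handling may differ slightly from this fixed-day sketch.

```python
# Illustrative sketch (not AWS SCT code): DATE AUTO SPLIT generates
# uniform RANGE partitions from a start date, end date, and interval.
from datetime import date, timedelta

def date_auto_split(column, start, end, interval_days):
    clauses = []
    low = start
    while low < end:
        high = min(low + timedelta(days=interval_days), end)
        # The final partition closes the range with <= on the end date.
        op = "<=" if high == end else "<"
        clauses.append(f"WHERE {column} >= '{low}' AND {column} {op} '{high}'")
        low = high
    return clauses

for clause in date_auto_split("LO_ORDERDATE", date(1954, 10, 10), date(1954, 11, 20), 14):
    print(clause)
```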
To create a DATE AUTO SPLIT virtual partition
  1. Open AWS SCT.

  2. Choose Data Migration view (other) mode.

  3. Choose the table where you want to set up virtual partitioning. Open the context (right-click) menu for the table, and choose Add virtual partitioning.

  4. In the Add virtual partitioning dialog box, enter information as follows.

    • Partition type – Choose DATE AUTO SPLIT. The dialog box UI changes depending on the type you choose.

    • Column name – Choose the column that you want to partition.

    • Start date – Type a start date.

    • End date – Type an end date.

    • Interval – Enter the interval unit, and choose the value for that unit.

  5. Choose OK.

Using native partitioning

To speed up data migration, your data extraction agents can use native partitions of tables on your source data warehouse server. AWS SCT supports native partitioning for migrations from Greenplum, Netezza, and Oracle to Amazon Redshift.

For example, after you create a project, you might collect statistics on a schema and analyze the size of the tables selected for migration. For tables that exceed the size specified, AWS SCT triggers the native partitioning mechanism.

To use native partitioning
  1. Open AWS SCT. On the File menu, choose New project. The New project dialog box appears.

  2. Create a new project, add your source and target servers, and create mapping rules. For more information, see Creating an AWS SCT project.

  3. Choose View, and then choose Main view.

  4. For Project settings, choose the Data migration tab. Choose Use automatic partitioning. For Greenplum and Netezza source databases, enter the minimal size of supported tables in megabytes (for example, 100). AWS SCT automatically creates separate migration subtasks for each native partition that isn't empty. For Oracle to Amazon Redshift migrations, AWS SCT creates subtasks for all partitioned tables.

  5. In the left panel that displays the schema from your source database, choose a schema. Open the context (right-click) menu for the object, and choose Collect statistics. For data migration from Oracle to Amazon Redshift, you can skip this step.

  6. Choose all tables to migrate.

  7. Register the required number of agents. For more information, see Registering extraction agents with the AWS Schema Conversion Tool.

  8. Create a data extraction task for the selected tables. For more information, see Creating, running, and monitoring an AWS SCT data extraction task.

    Check that large tables are split into subtasks, and that each subtask matches the dataset that represents the part of the table located on one slice of your source data warehouse.

  9. Start and monitor the migration process until AWS SCT data extraction agents complete the migration of data from your source tables.

Migrating LOBs to Amazon Redshift

Amazon Redshift doesn't support storing large binary objects (LOBs). However, if you need to migrate one or more LOBs to Amazon Redshift, AWS SCT can perform the migration. To do so, AWS SCT uses an Amazon S3 bucket to store the LOBs and writes the URL for the Amazon S3 bucket into the migrated data stored in Amazon Redshift.
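The substitution described above can be sketched conceptually as follows. This is a mental model of the behavior, not AWS SCT's implementation; the bucket, folder, and key naming here are hypothetical, and the dictionary stands in for an S3 upload.

```python
# Conceptual sketch of the LOB handling described above (not AWS SCT code):
# each LOB value is written to the S3 LOBs folder, and the migrated row
# stores the object URL instead of the binary data. Names are hypothetical.
def offload_lob(row_id, lob_bytes, bucket, lobs_folder, store):
    key = f"{lobs_folder}/{row_id}.lob"
    store[key] = lob_bytes  # stand-in for an S3 upload
    # The URL below is what lands in the Amazon Redshift column.
    return f"https://{bucket}.s3.amazonaws.com/{key}"

fake_s3 = {}
url = offload_lob(42, b"\x89PNG...", "my-sct-bucket", "lobs", fake_s3)
print(url)
```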

To migrate LOBs to Amazon Redshift
  1. Open an AWS SCT project.

  2. Connect to the source and target databases. Refresh metadata from the target database, and make sure that the converted tables exist there.

  3. For Actions, choose Create local task.

  4. For Migration mode, choose one of the following:

    • Extract and upload to extract your data, and upload your data to Amazon S3.

    • Extract, upload and copy to extract your data, upload your data to Amazon S3, and copy it into your Amazon Redshift data warehouse.

  5. Choose Amazon S3 settings.

  6. For Amazon S3 bucket LOBs folder, enter the name of the folder in an Amazon S3 bucket where you want the LOBs stored.

    If you use an AWS service profile, this field is optional. AWS SCT can use the default settings from your profile. To use another Amazon S3 bucket, enter the path here.

  7. Turn on the Use proxy option to use a proxy server to upload data to Amazon S3. Then choose the data transfer protocol, enter the host name, port, user name, and password.

  8. For Endpoint type, choose FIPS to use the Federal Information Processing Standard (FIPS) endpoint. Choose VPCE to use the virtual private cloud (VPC) endpoint. Then for VPC endpoint, enter the Domain Name System (DNS) of your VPC endpoint.

  9. Turn on the Keep files on Amazon S3 after copying to Amazon Redshift option to keep extracted files on Amazon S3 after copying these files to Amazon Redshift.

  10. Choose Create to create the task.

Best practices and troubleshooting for data extraction agents

The following are some best practices and troubleshooting suggestions for using extraction agents.


Performance is slow

To improve performance, we recommend the following:

  • Install multiple agents.

  • Install agents on computers close to your data warehouse.

  • Don't run all tables on a single agent task.

Contention delays

Avoid having too many agents accessing your data warehouse at the same time.

An agent goes down temporarily

If an agent is down, the status of each of its tasks appears as failed in AWS SCT. If you wait, in some cases the agent can recover. In this case, the status of its tasks updates in AWS SCT.

An agent goes down permanently

If the computer running an agent goes down permanently, and that agent is running a task, you can substitute a new agent to continue the task. You can substitute a new agent only if the working folder of the original agent was not on the same computer as the original agent. To substitute a new agent, do the following:

  • Install an agent on a new computer.

  • Configure the new agent with the same settings, including port number and working folder, as the original agent.

  • Start the agent. After the agent starts, the task discovers the new available agent and continues running on the new agent.