

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Copy Data to Amazon Redshift Using AWS Data Pipeline
<a name="dp-copydata-redshift"></a>

This tutorial walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift using either the **Copy to Redshift** template in the AWS Data Pipeline console, or a pipeline definition file with the AWS Data Pipeline CLI.

Amazon S3 is a web service that enables you to store data in the cloud. For more information, see the [Amazon Simple Storage Service User Guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/).

Amazon Redshift is a data warehouse service in the cloud. For more information, see the [Amazon Redshift Management Guide](https://docs.aws.amazon.com/redshift/latest/mgmt/).

This tutorial has several prerequisites. After completing the following steps, you can continue the tutorial using either the console or the CLI.

**Topics**
+ [Before You Begin: Configure COPY Options and Load Data](dp-learn-copy-redshift.md)
+ [Set up Pipeline, Create a Security Group, and Create an Amazon Redshift Cluster](dp-copydata-redshift-prereq.md)
+ [Copy Data to Amazon Redshift Using the Command Line](dp-copydata-redshift-cli.md)

# Before You Begin: Configure COPY Options and Load Data
<a name="dp-learn-copy-redshift"></a>

Before copying data to Amazon Redshift within AWS Data Pipeline, ensure that you: 
+ Load data from Amazon S3.
+ Set up the `COPY` activity in Amazon Redshift. 

Once you have these options working and can successfully complete a data load, transfer these same options to AWS Data Pipeline so that it can perform the copy for you.

For `COPY` options, see [COPY](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) in the Amazon Redshift *Database Developer Guide*. 

For steps to load data from Amazon S3, see [Loading data from Amazon S3](https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html) in the Amazon Redshift *Database Developer Guide*. 

For example, the following SQL command in Amazon Redshift creates a new table named `LISTING` and copies sample data from a publicly available bucket in Amazon S3. 

Replace the `<iam-role-arn>` and region with your own. 

For details about this example, see [Load Sample Data from Amazon S3](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html) in the Amazon Redshift *Getting Started Guide*.

```
create table listing(
	listid integer not null distkey,
	sellerid integer not null,
	eventid integer not null,
	dateid smallint not null  sortkey,
	numtickets smallint not null,
	priceperticket decimal(8,2),
	totalprice decimal(8,2),
	listtime timestamp);

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';
```
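The `delimiter '|'` option tells `COPY` to expect pipe-delimited input. As a rough illustration of that format (the column values below are invented for this sketch, not taken from the actual TICKIT sample file), a row matching the `LISTING` schema can be assembled and sanity-checked like this:

```python
# Build a pipe-delimited row matching the LISTING table's eight columns.
# The values below are made up for illustration; the real sample file is
# s3://awssampledbuswest2/tickit/listings_pipe.txt.
columns = ["listid", "sellerid", "eventid", "dateid",
           "numtickets", "priceperticket", "totalprice", "listtime"]
row = [1, 36861, 7872, 1850, 10, "182.00", "1820.00", "2008-01-24 06:43:29"]

line = "|".join(str(v) for v in row)
print(line)  # 1|36861|7872|1850|10|182.00|1820.00|2008-01-24 06:43:29

# Quick sanity check: the field count must match the table definition.
assert len(line.split("|")) == len(columns)
```

A row whose field count doesn't match the table definition is a common cause of `COPY` load errors, so checking your input format locally before loading can save a debugging cycle.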

# Set up Pipeline, Create a Security Group, and Create an Amazon Redshift Cluster
<a name="dp-copydata-redshift-prereq"></a>

**To set up for the tutorial**

1. Complete the tasks in [Setting up for AWS Data Pipeline](dp-get-setup.md).

1. Create a security group.

   1. Open the Amazon EC2 console.

   1. In the navigation pane, click **Security Groups**.

   1. Click **Create Security Group**.

   1. Specify a name and description for the security group.

   1. [EC2-Classic] Select `No VPC` for **VPC**.

   1. [EC2-VPC] Select the ID of your VPC for **VPC**.

   1. Click **Create**.

1. [EC2-Classic] Create an Amazon Redshift cluster security group and specify the Amazon EC2 security group.

   1. Open the Amazon Redshift console.

   1. In the navigation pane, click **Security Groups**.

   1. Click **Create Cluster Security Group**.

   1. In the **Create Cluster Security Group** dialog box, specify a name and description for the cluster security group.

   1. Click the name of the new cluster security group.

   1. Click **Add Connection Type**.

   1. In the **Add Connection Type** dialog box, select **EC2 Security Group** from **Connection Type**, select the security group that you created from **EC2 Security Group Name**, and then click **Authorize**.

1. [EC2-VPC] Create an Amazon Redshift cluster security group and specify the VPC security group.

   1. Open the Amazon EC2 console.

   1. In the navigation pane, click **Security Groups**.

   1. Click **Create Security Group**.

   1. In the **Create Security Group** dialog box, specify a name and description for the security group, and select the ID of your VPC for **VPC**.

   1. Click **Add Rule**. Specify the type, protocol, and port range, and start typing the ID of the security group in **Source**. Select the security group that you created in step 2.

   1. Click **Create**.

1. Create an Amazon Redshift cluster.

   If you have an existing Amazon Redshift cluster, make a note of the cluster ID. 

   To create a new cluster and load sample data, follow the steps in [Getting Started with Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html). For more information about creating clusters, see [Creating a Cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#create-cluster) in the *Amazon Redshift Management Guide*. The following is a summary of the steps:

   1. Open the Amazon Redshift console.

   1. Click **Launch Cluster**.

   1. Provide the required details for your cluster, and then click **Continue**.

   1. Provide the node configuration, and then click **Continue**.

   1. On the page for additional configuration information, select the cluster security group that you created, and then click **Continue**.

   1. Review the specifications for your cluster, and then click **Launch Cluster**.

# Copy Data to Amazon Redshift Using the Command Line
<a name="dp-copydata-redshift-cli"></a>

This tutorial demonstrates how to copy data from Amazon S3 to Amazon Redshift. You'll create a new table in Amazon Redshift, and then use AWS Data Pipeline to transfer data to this table from a public Amazon S3 bucket, which contains sample input data in CSV format. The logs are saved to an Amazon S3 bucket that you own.

Amazon S3 is a web service that enables you to store data in the cloud. For more information, see the [Amazon Simple Storage Service User Guide](https://docs.aws.amazon.com/AmazonS3/latest/userguide/). Amazon Redshift is a data warehouse service in the cloud. For more information, see the [Amazon Redshift Management Guide](https://docs.aws.amazon.com/redshift/latest/mgmt/).

**Prerequisites**

Before you begin, you must complete the following steps:

1. Install and configure a command line interface (CLI). For more information, see [Accessing AWS Data Pipeline](what-is-datapipeline.md#accessing-datapipeline).

1. Ensure that the IAM roles named **DataPipelineDefaultRole** and **DataPipelineDefaultResourceRole** exist. The AWS Data Pipeline console creates these roles for you automatically. If you haven't used the AWS Data Pipeline console at least once, then you must create these roles manually. For more information, see [IAM Roles for AWS Data Pipeline](dp-iam-roles.md).

1. Set up the `COPY` command in Amazon Redshift, since you will need to have these same options working when you perform the copying within AWS Data Pipeline. For information, see [Before You Begin: Configure COPY Options and Load Data](dp-learn-copy-redshift.md).

1. Set up an Amazon Redshift database. For more information, see [Set up Pipeline, Create a Security Group, and Create an Amazon Redshift Cluster](dp-copydata-redshift-prereq.md).

**Topics**
+ [Define a Pipeline in JSON Format](dp-copydata-redshift-define-pipeline-cli.md)
+ [Upload and Activate the Pipeline Definition](dp-copydata-redshift-upload-cli.md)

# Define a Pipeline in JSON Format
<a name="dp-copydata-redshift-define-pipeline-cli"></a>

This example scenario shows how to copy data from an Amazon S3 bucket to Amazon Redshift.

The following is the full pipeline definition JSON file, followed by an explanation of each of its sections. We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and that you name the file using the `.json` file extension.

```
{
  "objects": [
    {
      "id": "CSVId1",
      "name": "DefaultCSV1",
      "type": "CSV"
    },
    {
      "id": "RedshiftDatabaseId1",
      "databaseName": "dbname",
      "username": "user",
      "name": "DefaultRedshiftDatabase1",
      "*password": "password",
      "type": "RedshiftDatabase",
      "clusterId": "redshiftclusterId"
    },
    {
      "id": "Default",
      "scheduleType": "timeseries",
      "failureAndRerunMode": "CASCADE",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "RedshiftDataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "tableName": "orders",
      "name": "DefaultRedshiftDataNode1",
      "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
      "type": "RedshiftDataNode",
      "database": {
        "ref": "RedshiftDatabaseId1"
      }
    },
    {
      "id": "Ec2ResourceId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "securityGroups": "MySecurityGroup",
      "name": "DefaultEc2Resource1",
      "role": "DataPipelineDefaultRole",
      "logUri": "s3://myLogs",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "type": "Ec2Resource"
    },
    {
      "id": "ScheduleId1",
      "startDateTime": "yyyy-mm-ddT00:00:00",
      "name": "DefaultSchedule1",
      "type": "Schedule",
      "period": "period",
      "endDateTime": "yyyy-mm-ddT00:00:00"
    },
    {
      "id": "S3DataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
      "name": "DefaultS3DataNode1",
      "dataFormat": {
        "ref": "CSVId1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "RedshiftCopyActivityId1",
      "input": {
        "ref": "S3DataNodeId1"
      },
      "schedule": {
        "ref": "ScheduleId1"
      },
      "insertMode": "KEEP_EXISTING",
      "name": "DefaultRedshiftCopyActivity1",
      "runsOn": {
        "ref": "Ec2ResourceId1"
      },
      "type": "RedshiftCopyActivity",
      "output": {
        "ref": "RedshiftDataNodeId1"
      }
    }
  ]
}
```
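Because every `ref` in the definition must point at the `id` of another object in the same file, a mistyped reference is a common source of validation errors. The following standalone sketch (our own helper, not part of any AWS tooling) checks that all top-level references in a definition like the one above resolve:

```python
import json

def unresolved_refs(definition):
    """Return the set of ref values that don't match any object id."""
    objects = definition["objects"]
    ids = {obj["id"] for obj in objects}
    refs = set()
    for obj in objects:
        for value in obj.values():
            if isinstance(value, dict) and "ref" in value:
                refs.add(value["ref"])
    return refs - ids

# A trimmed-down definition with one deliberately broken reference:
# the schedule object ScheduleId1 is referenced but never defined.
definition = json.loads("""
{
  "objects": [
    {"id": "CSVId1", "name": "DefaultCSV1", "type": "CSV"},
    {"id": "S3DataNodeId1",
     "schedule": {"ref": "ScheduleId1"},
     "dataFormat": {"ref": "CSVId1"},
     "type": "S3DataNode"}
  ]
}
""")

print(unresolved_refs(definition))  # {'ScheduleId1'}
```

Running a check like this on your own `.json` file before uploading it catches dangling references earlier than the `put-pipeline-definition` round trip.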

For more information about these objects, see the following documentation.

**Topics**
+ [Data Nodes](dp-copydata-redshift-node-cli.md)
+ [Resource](dp-copydata-redshift-resource-cli.md)
+ [Activity](dp-copydata-redshift-activity-cli.md)

# Data Nodes
<a name="dp-copydata-redshift-node-cli"></a>

This example uses an input data node, an output data node, and a database.

**Input Data Node**  
The input `S3DataNode` pipeline component defines the location of the input data in Amazon S3 and the data format of the input data. For more information, see [S3DataNode](dp-object-s3datanode.md).

This input component is defined by the following fields:

```
{
  "id": "S3DataNodeId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
  "name": "DefaultS3DataNode1",
  "dataFormat": {
    "ref": "CSVId1"
  },
  "type": "S3DataNode"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
A reference to the schedule component.

`filePath`  
The path to the data associated with the data node, which is a CSV input file in this example.

`name`  
The user-defined name, which is a label for your reference only.

`dataFormat`  
A reference to the format of the data for the activity to process.

**Output Data Node**  
The output `RedshiftDataNode` pipeline component defines a location for the output data; in this case, a table in an Amazon Redshift database. For more information, see [RedshiftDataNode](dp-object-redshiftdatanode.md). This output component is defined by the following fields: 

```
{
  "id": "RedshiftDataNodeId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "tableName": "orders",
  "name": "DefaultRedshiftDataNode1",
  "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
  "type": "RedshiftDataNode",
  "database": {
    "ref": "RedshiftDatabaseId1"
  }
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
A reference to the schedule component.

`tableName`  
The name of the Amazon Redshift table.

`name`  
The user-defined name, which is a label for your reference only.

`createTableSql`  
A SQL expression to create the table in the database.

`database`  
A reference to the Amazon Redshift database.

**Database**  
The `RedshiftDatabase` component is defined by the following fields. For more information, see [RedshiftDatabase](dp-object-redshiftdatabase.md).

```
{
  "id": "RedshiftDatabaseId1",
  "databaseName": "dbname",
  "username": "user",
  "name": "DefaultRedshiftDatabase1",
  "*password": "password",
  "type": "RedshiftDatabase",
  "clusterId": "redshiftclusterId"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`databaseName`  
The name of the logical database.

`username`  
The user name to connect to the database.

`name`  
The user-defined name, which is a label for your reference only.

`password`  
The password to connect to the database.

`clusterId`  
The ID of the Amazon Redshift cluster. 

# Resource
<a name="dp-copydata-redshift-resource-cli"></a>

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the instance after the task completes. The fields defined here control the creation and function of the instance that does the work. For more information, see [Ec2Resource](dp-object-ec2resource.md).

The `Ec2Resource` is defined by the following fields:

```
{
  "id": "Ec2ResourceId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "securityGroups": "MySecurityGroup",
  "name": "DefaultEc2Resource1",
  "role": "DataPipelineDefaultRole",
  "logUri": "s3://myLogs",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "type": "Ec2Resource"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
The schedule on which to create this computational resource.

`securityGroups`  
The security group to use for the instances in the resource pool.

`name`  
The user-defined name, which is a label for your reference only.

`role`  
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.

`logUri`  
The Amazon S3 destination path to back up Task Runner logs from the `Ec2Resource`.

`resourceRole`  
The IAM role of the account that creates resources, such as creating and configuring an EC2 instance on your behalf. `role` and `resourceRole` can be the same role, but specifying them separately provides greater granularity in your security configuration.

# Activity
<a name="dp-copydata-redshift-activity-cli"></a>

The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a `RedshiftCopyActivity` component to copy data from Amazon S3 to Amazon Redshift. For more information, see [RedshiftCopyActivity](dp-object-redshiftcopyactivity.md).

The `RedshiftCopyActivity` component is defined by the following fields:

```
{
  "id": "RedshiftCopyActivityId1",
  "input": {
    "ref": "S3DataNodeId1"
  },
  "schedule": {
    "ref": "ScheduleId1"
  },
  "insertMode": "KEEP_EXISTING",
  "name": "DefaultRedshiftCopyActivity1",
  "runsOn": {
    "ref": "Ec2ResourceId1"
  },
  "type": "RedshiftCopyActivity",
  "output": {
    "ref": "RedshiftDataNodeId1"
  }
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`input`  
A reference to the Amazon S3 source file.

`schedule`  
The schedule on which to run this activity.

`insertMode`  
The insert type (`KEEP_EXISTING`, `OVERWRITE_EXISTING`, or `TRUNCATE`).

`name`  
The user-defined name, which is a label for your reference only.

`runsOn`  
The computational resource that performs the work that this activity defines.

`output`  
A reference to the Amazon Redshift destination table.
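The three `insertMode` values determine how incoming rows interact with rows already in the target table. As a rough sketch of the semantics, keyed on the table's primary key (this is an illustration of the idea, not the actual Amazon Redshift implementation):

```python
def apply_copy(existing, incoming, insert_mode):
    """Simulate insertMode semantics on dicts keyed by primary key."""
    table = dict(existing)
    if insert_mode == "TRUNCATE":
        table.clear()              # delete all existing rows first
        table.update(incoming)
    elif insert_mode == "OVERWRITE_EXISTING":
        table.update(incoming)     # incoming rows replace matching keys
    elif insert_mode == "KEEP_EXISTING":
        for key, row in incoming.items():
            table.setdefault(key, row)  # existing rows win on key conflicts
    else:
        raise ValueError(insert_mode)
    return table

existing = {1: "old-row-1", 2: "old-row-2"}
incoming = {2: "new-row-2", 3: "new-row-3"}

print(apply_copy(existing, incoming, "KEEP_EXISTING"))
# {1: 'old-row-1', 2: 'old-row-2', 3: 'new-row-3'}
print(apply_copy(existing, incoming, "OVERWRITE_EXISTING"))
# {1: 'old-row-1', 2: 'new-row-2', 3: 'new-row-3'}
print(apply_copy(existing, incoming, "TRUNCATE"))
# {2: 'new-row-2', 3: 'new-row-3'}
```

The example pipeline uses `KEEP_EXISTING`, so repeated runs leave rows already in the `orders` table untouched.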

# Upload and Activate the Pipeline Definition
<a name="dp-copydata-redshift-upload-cli"></a>

You must upload your pipeline definition and activate your pipeline. In the following example commands, replace *pipeline_name* with a label for your pipeline and *pipeline_file* with the fully qualified path to the pipeline definition `.json` file.

**AWS CLI**

To create your pipeline definition and activate your pipeline, use the following [create-pipeline](https://docs.aws.amazon.com/cli/latest/reference/datapipeline/create-pipeline.html) command. Note the ID of your pipeline, because you'll use this value with most CLI commands.

```
aws datapipeline create-pipeline --name pipeline_name --unique-id token
{
    "pipelineId": "df-00627471SOVYZEXAMPLE"
}
```

To upload your pipeline definition, use the following [put-pipeline-definition](https://docs.aws.amazon.com/cli/latest/reference/datapipeline/put-pipeline-definition.html) command.

```
aws datapipeline put-pipeline-definition --pipeline-id df-00627471SOVYZEXAMPLE --pipeline-definition file://MyRedshiftPipelineDefinition.json
```

If your pipeline validates successfully, the `validationErrors` field is empty. You should review any warnings.
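The `put-pipeline-definition` response contains `validationErrors` and `validationWarnings` arrays and an `errored` flag. A minimal sketch of inspecting a saved response (the field names match the documented response shape, but the warning text below is invented for illustration):

```python
import json

# Example response in the shape put-pipeline-definition returns; the
# warning message content here is fabricated for illustration.
response = json.loads("""
{
  "validationErrors": [],
  "validationWarnings": [
    {"id": "Ec2ResourceId1",
     "warnings": ["'terminateAfter' is missing. Resource may run indefinitely."]}
  ],
  "errored": false
}
""")

# Fail hard on errors, but print warnings for review.
if response["validationErrors"]:
    raise SystemExit("pipeline definition failed validation")
for warning in response["validationWarnings"]:
    for message in warning["warnings"]:
        print(f"{warning['id']}: {message}")
```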

To activate your pipeline, use the following [activate-pipeline](https://docs.aws.amazon.com/cli/latest/reference/datapipeline/activate-pipeline.html) command.

```
aws datapipeline activate-pipeline --pipeline-id df-00627471SOVYZEXAMPLE
```

You can verify that your pipeline appears in the pipeline list using the following [list-pipelines](https://docs.aws.amazon.com/cli/latest/reference/datapipeline/list-pipelines.html) command.

```
aws datapipeline list-pipelines
```