

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Define a Pipeline in JSON Format
<a name="dp-copydata-redshift-define-pipeline-cli"></a>

This example scenario shows how to copy data from an Amazon S3 bucket to Amazon Redshift.

The following is the full pipeline definition JSON file, followed by an explanation of each of its sections. We recommend that you use a text editor that can help you verify the syntax of JSON-formatted files, and that you name the file with the `.json` file extension.

```
{
  "objects": [
    {
      "id": "CSVId1",
      "name": "DefaultCSV1",
      "type": "CSV"
    },
    {
      "id": "RedshiftDatabaseId1",
      "databaseName": "dbname",
      "username": "user",
      "name": "DefaultRedshiftDatabase1",
      "*password": "password",
      "type": "RedshiftDatabase",
      "clusterId": "redshiftclusterId"
    },
    {
      "id": "Default",
      "scheduleType": "timeseries",
      "failureAndRerunMode": "CASCADE",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "RedshiftDataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "tableName": "orders",
      "name": "DefaultRedshiftDataNode1",
      "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
      "type": "RedshiftDataNode",
      "database": {
        "ref": "RedshiftDatabaseId1"
      }
    },
    {
      "id": "Ec2ResourceId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "securityGroups": "MySecurityGroup",
      "name": "DefaultEc2Resource1",
      "role": "DataPipelineDefaultRole",
      "logUri": "s3://myLogs",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "type": "Ec2Resource"
    },
    {
      "id": "ScheduleId1",
      "startDateTime": "yyyy-mm-ddT00:00:00",
      "name": "DefaultSchedule1",
      "type": "Schedule",
      "period": "period",
      "endDateTime": "yyyy-mm-ddT00:00:00"
    },
    {
      "id": "S3DataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
      "name": "DefaultS3DataNode1",
      "dataFormat": {
        "ref": "CSVId1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "RedshiftCopyActivityId1",
      "input": {
        "ref": "S3DataNodeId1"
      },
      "schedule": {
        "ref": "ScheduleId1"
      },
      "insertMode": "KEEP_EXISTING",
      "name": "DefaultRedshiftCopyActivity1",
      "runsOn": {
        "ref": "Ec2ResourceId1"
      },
      "type": "RedshiftCopyActivity",
      "output": {
        "ref": "RedshiftDataNodeId1"
      }
    }
  ]
}
```
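
Because every `"ref"` in the definition must name the `"id"` of another object in the same file, a quick consistency check can catch broken references before you submit the definition. The following sketch is a convenience helper, not part of AWS Data Pipeline; the inline definition is a trimmed example with one deliberately dangling reference:

```python
import json

def check_refs(definition):
    """Return (object id, field, target) for every {"ref": ...} value
    that does not name the "id" of another object in the definition."""
    ids = {obj["id"] for obj in definition["objects"]}
    dangling = []
    for obj in definition["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                dangling.append((obj["id"], field, value["ref"]))
    return dangling

# Minimal example: one resolvable reference, one dangling one.
definition = json.loads("""
{
  "objects": [
    {"id": "ScheduleId1", "type": "Schedule"},
    {"id": "S3DataNodeId1", "type": "S3DataNode",
     "schedule": {"ref": "ScheduleId1"},
     "dataFormat": {"ref": "MissingId"}}
  ]
}
""")
print(check_refs(definition))
# [('S3DataNodeId1', 'dataFormat', 'MissingId')]
```

Running the same check against the full definition above returns an empty list, because each of its references resolves to a defined object.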

For more information about these objects, see the following documentation.

**Topics**
+ [Data Nodes](dp-copydata-redshift-node-cli.md)
+ [Resource](dp-copydata-redshift-resource-cli.md)
+ [Activity](dp-copydata-redshift-activity-cli.md)

# Data Nodes
<a name="dp-copydata-redshift-node-cli"></a>

This example uses an input data node, an output data node, and a database.

**Input Data Node**  
The input `S3DataNode` pipeline component defines the location of the input data in Amazon S3 and the data format of the input data. For more information, see [S3DataNode](dp-object-s3datanode.md).

This input component is defined by the following fields:

```
{
  "id": "S3DataNodeId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
  "name": "DefaultS3DataNode1",
  "dataFormat": {
    "ref": "CSVId1"
  },
  "type": "S3DataNode"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
A reference to the schedule component.

`filePath`  
The path to the data associated with the data node, which is a CSV input file in this example.

`name`  
The user-defined name, which is a label for your reference only.

`dataFormat`  
A reference to the format of the data for the activity to process.
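
When the input is a set of files under a common prefix rather than a single file, `S3DataNode` can take a `directoryPath` field in place of `filePath`. The following variant of the component above is illustrative only; the bucket and prefix names are placeholders:

```
{
  "id": "S3DataNodeId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "directoryPath": "s3://mybucket/myinputdata/",
  "name": "DefaultS3DataNode1",
  "dataFormat": {
    "ref": "CSVId1"
  },
  "type": "S3DataNode"
},
```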

**Output Data Node**  
The output `RedshiftDataNode` pipeline component defines a location for the output data; in this case, a table in an Amazon Redshift database. For more information, see [RedshiftDataNode](dp-object-redshiftdatanode.md). This output component is defined by the following fields: 

```
{
  "id": "RedshiftDataNodeId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "tableName": "orders",
  "name": "DefaultRedshiftDataNode1",
  "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
  "type": "RedshiftDataNode",
  "database": {
    "ref": "RedshiftDatabaseId1"
  }
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
A reference to the schedule component.

`tableName`  
The name of the Amazon Redshift table.

`name`  
The user-defined name, which is a label for your reference only.

`createTableSql`  
A SQL expression to create the table in the database.

`database`  
A reference to the Amazon Redshift database.
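
Because `createTableSql` is a single JSON string, line breaks and any embedded quotes in the SQL must be escaped when you add the statement to the definition. The following sketch (a convenience for authoring the file, assuming Python is available) collapses a multi-line statement onto one line and emits the escaped JSON; the SQL is the example statement above:

```python
import json

# The createTableSql value must be a single JSON string, so line breaks
# and any embedded quotes in the SQL need JSON escaping; json.dumps
# handles both.
create_table_sql = """create table StructuredLogs (
    requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY,
    requestEndTime CHAR(30),
    hostname CHAR(100),
    requestDate varchar(20));"""

# Collapse the statement onto one line, matching the style used in the
# pipeline definition above, then let json.dumps do the escaping.
field = {"createTableSql": " ".join(create_table_sql.split())}
print(json.dumps(field))
```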

**Database**  
The `RedshiftDatabase` component is defined by the following fields. For more information, see [RedshiftDatabase](dp-object-redshiftdatabase.md).

```
{
  "id": "RedshiftDatabaseId1",
  "databaseName": "dbname",
  "username": "user",
  "name": "DefaultRedshiftDatabase1",
  "*password": "password",
  "type": "RedshiftDatabase",
  "clusterId": "redshiftclusterId"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`databaseName`  
The name of the logical database.

`username`  
The user name to connect to the database.

`name`  
The user-defined name, which is a label for your reference only.

`password`  
The password to connect to the database. The asterisk (`*`) prefix on the field name indicates that AWS Data Pipeline must encrypt the field's value.

`clusterId`  
The identifier of the Amazon Redshift cluster.
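
`RedshiftDatabase` also accepts a JDBC `connectionString` in place of `clusterId` to identify the cluster by its endpoint. The following variant is illustrative only; the endpoint, port, and database name are placeholders:

```
{
  "id": "RedshiftDatabaseId1",
  "databaseName": "dbname",
  "username": "user",
  "name": "DefaultRedshiftDatabase1",
  "*password": "password",
  "type": "RedshiftDatabase",
  "connectionString": "jdbc:postgresql://endpoint:5439/dbname"
},
```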

# Resource
<a name="dp-copydata-redshift-resource-cli"></a>

This is a definition of the computational resource that performs the copy operation. In this example, AWS Data Pipeline should automatically create an EC2 instance to perform the copy task and terminate the instance after the task completes. The fields defined here control the creation and function of the instance that does the work. For more information, see [Ec2Resource](dp-object-ec2resource.md).

The `Ec2Resource` is defined by the following fields:

```
{
  "id": "Ec2ResourceId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "securityGroups": "MySecurityGroup",
  "name": "DefaultEc2Resource1",
  "role": "DataPipelineDefaultRole",
  "logUri": "s3://myLogs",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "type": "Ec2Resource"
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`schedule`  
The schedule on which to create this computational resource.

`securityGroups`  
The security group to use for the instances in the resource pool.

`name`  
The user-defined name, which is a label for your reference only.

`role`  
The IAM role of the account that accesses resources, such as accessing an Amazon S3 bucket to retrieve data.

`logUri`  
The Amazon S3 destination path to back up Task Runner logs from the `Ec2Resource`.

`resourceRole`  
The IAM role of the account that creates resources, such as creating and configuring an EC2 instance on your behalf. `role` and `resourceRole` can specify the same role, but using separate roles provides greater granularity in your security configuration.
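
`Ec2Resource` also supports optional fields that are often useful with this pattern, such as `instanceType` (the EC2 instance type to launch) and `terminateAfter` (an upper bound on how long the instance runs). The following variant uses example values; adjust them to your workload:

```
{
  "id": "Ec2ResourceId1",
  "schedule": {
    "ref": "ScheduleId1"
  },
  "securityGroups": "MySecurityGroup",
  "instanceType": "t1.micro",
  "terminateAfter": "2 hours",
  "name": "DefaultEc2Resource1",
  "role": "DataPipelineDefaultRole",
  "logUri": "s3://myLogs",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "type": "Ec2Resource"
},
```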

# Activity
<a name="dp-copydata-redshift-activity-cli"></a>

The last section in the JSON file is the definition of the activity that represents the work to perform. In this case, we use a `RedshiftCopyActivity` component to copy data from Amazon S3 to Amazon Redshift. For more information, see [RedshiftCopyActivity](dp-object-redshiftcopyactivity.md).

The `RedshiftCopyActivity` component is defined by the following fields:

```
{
  "id": "RedshiftCopyActivityId1",
  "input": {
    "ref": "S3DataNodeId1"
  },
  "schedule": {
    "ref": "ScheduleId1"
  },
  "insertMode": "KEEP_EXISTING",
  "name": "DefaultRedshiftCopyActivity1",
  "runsOn": {
    "ref": "Ec2ResourceId1"
  },
  "type": "RedshiftCopyActivity",
  "output": {
    "ref": "RedshiftDataNodeId1"
  }
},
```

`id`  
The user-defined ID, which is a label for your reference only.

`input`  
A reference to the Amazon S3 source file.

`schedule`  
The schedule on which to run this activity.

`insertMode`  
The insert type. `KEEP_EXISTING` adds new rows and leaves rows with a matching primary key unchanged, `OVERWRITE_EXISTING` replaces rows with a matching primary key, and `TRUNCATE` deletes all data in the destination table before writing.

`name`  
The user-defined name, which is a label for your reference only.

`runsOn`  
The computational resource that performs the work that this activity defines.

`output`  
A reference to the Amazon Redshift destination table.
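
The difference between the insert modes is easiest to see on a toy table keyed by primary key. The following sketch simulates the behavior of the three `insertMode` values listed above on in-memory dictionaries; it is an illustration of the semantics, not what Amazon Redshift executes:

```python
def apply_copy(table, incoming, mode):
    """Illustrate RedshiftCopyActivity insertMode semantics on an
    in-memory table represented as {primary_key: row}."""
    if mode == "TRUNCATE":
        result = {}                      # drop all existing rows first
        result.update(incoming)
    elif mode == "OVERWRITE_EXISTING":
        result = dict(table)
        result.update(incoming)          # matching keys are replaced
    elif mode == "KEEP_EXISTING":
        result = dict(table)
        for key, row in incoming.items():
            result.setdefault(key, row)  # matching keys keep the old row
    else:
        raise ValueError(f"unsupported insertMode: {mode}")
    return result

existing = {"a": "old-a", "b": "old-b"}
new_rows = {"b": "new-b", "c": "new-c"}
print(apply_copy(existing, new_rows, "KEEP_EXISTING"))
# {'a': 'old-a', 'b': 'old-b', 'c': 'new-c'}
```

With `OVERWRITE_EXISTING`, the same inputs would instead yield `new-b` for key `b`; with `TRUNCATE`, only the incoming rows would remain.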