

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Pipeline Object Reference
<a name="dp-pipeline-objects"></a>

You can use the following pipeline objects and components in your pipeline definition.

**Topics**
+ [Data Nodes](dp-object-datanodes.md)
+ [Activities](dp-object-activities.md)
+ [Resources](dp-object-resources.md)
+ [Preconditions](dp-object-preconditions.md)
+ [Databases](dp-object-databases.md)
+ [Data Formats](dp-object-dataformats.md)
+ [Actions](dp-object-actions.md)
+ [Schedule](dp-object-schedule.md)
+ [Utilities](dp-object-utilities.md)

**Note**  
For an example application that uses the AWS Data Pipeline Java SDK, see [Data Pipeline DynamoDB Export Java Sample](https://github.com/awslabs/data-pipeline-samples/tree/master/samples/DynamoDBExportJava) on GitHub.

The following is the object hierarchy for AWS Data Pipeline.

![\[AWS Data Pipeline object hierarchy\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/images/object_hierarchy.png)


# Data Nodes
<a name="dp-object-datanodes"></a>

The following are the AWS Data Pipeline data node objects:

**Topics**
+ [DynamoDBDataNode](dp-object-dynamodbdatanode.md)
+ [MySqlDataNode](dp-object-mysqldatanode.md)
+ [RedshiftDataNode](dp-object-redshiftdatanode.md)
+ [S3DataNode](dp-object-s3datanode.md)
+ [SqlDataNode](dp-object-sqldatanode.md)

# DynamoDBDataNode
<a name="dp-object-dynamodbdatanode"></a>

Defines a data node using DynamoDB, which is specified as an input to a `HiveActivity` or `EmrActivity` object.

**Note**  
The `DynamoDBDataNode` object does not support the `Exists` precondition.

## Example
<a name="dynamodbdatanode-example"></a>

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object and `Ready` is a precondition object.

```
{
  "id" : "MyDynamoDBTable",
  "type" : "DynamoDBDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "tableName" : "adEvents",
  "precondition" : { "ref" : "Ready" }
}
```

## Syntax
<a name="dynamodbdatanode-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| tableName | The DynamoDB table. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 
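
The recommendation above can be sketched as a default object that carries the schedule reference, so that every object in the pipeline inherits it. This is a minimal sketch; the `DefaultSchedule` id, period, and start time are illustrative:

```
{
  "id" : "Default",
  "scheduleType" : "cron",
  "schedule" : { "ref" : "DefaultSchedule" }
},
{
  "id" : "DefaultSchedule",
  "type" : "Schedule",
  "period" : "1 hour",
  "startDateTime" : "2024-01-01T00:00:00"
}
```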

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If this field is set, a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dataFormat | DataFormat for the data described by this data node. Currently supported for HiveActivity and HiveCopyActivity. | Reference Object, e.g. "dataFormat":{"ref":"myDynamoDBDataFormatId"} | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| readThroughputPercent | Sets the rate of read operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is a double between 0.1 and 1.0, inclusive. | Double | 
| region | The code for the region where the DynamoDB table exists. For example, us-east-1. This is used by HiveActivity when it performs staging for DynamoDB tables in Hive. | Enumeration | 
| reportProgressTimeout | Timeout for successive calls to reportProgress by remote work. If set, remote activities that do not report progress for the specified period may be considered stalled and retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
| writeThroughputPercent | Sets the rate of write operations to keep your DynamoDB provisioned throughput rate in the allocated range for your table. The value is a double between 0.1 and 1.0, inclusive. | Double | 
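
As a sketch of how several of these optional fields combine, the earlier example node could be extended as follows. The `region`, throughput, and data format values are illustrative, and `MyDynamoDBDataFormat` is a hypothetical data format object you would define separately:

```
{
  "id" : "MyDynamoDBTable",
  "type" : "DynamoDBDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "tableName" : "adEvents",
  "region" : "us-east-1",
  "readThroughputPercent" : "0.5",
  "dataFormat" : { "ref" : "MyDynamoDBDataFormat" }
}
```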

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# MySqlDataNode
<a name="dp-object-mysqldatanode"></a>

 Defines a data node using MySQL. 

**Note**  
The `MySqlDataNode` type is deprecated. We recommend that you use [SqlDataNode](dp-object-sqldatanode.md) instead.

## Example
<a name="mysqldatanode-example"></a>

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object and `Ready` is a precondition object.

```
{
  "id" : "Sql Table",
  "type" : "MySqlDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "table" : "adEvents",
  "username": "user_name",
  "*password": "my_password",
  "connectionString": "jdbc:mysql://mysqlinstance-rds.example.us-east-1.rds.amazonaws.com:3306/database_name",
  "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
  "precondition" : { "ref" : "Ready" }
}
```
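
Because `MySqlDataNode` is deprecated, the same node can be sketched as a `SqlDataNode` instead, with the connection supplied through a referenced database object rather than inline credentials. This is a sketch; `MyRdsDatabase` is a hypothetical database object you would define separately in the pipeline:

```
{
  "id" : "Sql Table",
  "type" : "SqlDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "table" : "adEvents",
  "database" : { "ref" : "MyRdsDatabase" },
  "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
  "precondition" : { "ref" : "Ready" }
}
```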

## Syntax
<a name="mysqldatanode-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| table | The name of the table in the MySQL database. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, a remote activity that does not complete within the set time of starting may be retried. | Period | 
| createTableSql | An SQL create table expression that creates the table. | String | 
| database | The name of the database. | Reference Object, e.g. "database":{"ref":"myDatabaseId"} | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| insertQuery | An SQL statement to insert data into the table. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for successive calls to reportProgress by remote work. If set, remote activities that do not report progress for the specified period may be considered stalled and retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| schemaName | The name of the schema holding the table. | String | 
| selectQuery | An SQL statement to fetch data from the table. | String | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
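
The on-demand behavior described under `scheduleType` can be sketched as a minimal default object; no `Schedule` object is needed, and each run is started by calling the ActivatePipeline operation:

```
{
  "id" : "Default",
  "scheduleType" : "ondemand"
}
```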

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="mysqldatanode-seealso"></a>
+ [S3DataNode](dp-object-s3datanode.md)

# RedshiftDataNode
<a name="dp-object-redshiftdatanode"></a>

Defines a data node using Amazon Redshift. `RedshiftDataNode` represents the properties of the data inside a database, such as a data table, used by your pipeline.

## Example
<a name="redshiftdatanode-example"></a>

The following is an example of this object type.

```
{
  "id" : "MyRedshiftDataNode",
  "type" : "RedshiftDataNode",
  "database": { "ref": "MyRedshiftDatabase" },
  "tableName": "adEvents",
  "schedule": { "ref": "Hour" }
}
```
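
As a sketch of how `createTableSql` and `schemaName` work together (see the guidance in the Syntax tables that follow), the schema is repeated inside the CREATE TABLE statement because AWS Data Pipeline does not parse it. Table, schema, and column names here are illustrative:

```
{
  "id" : "MyRedshiftDataNode",
  "type" : "RedshiftDataNode",
  "database" : { "ref" : "MyRedshiftDatabase" },
  "schemaName" : "mySchema",
  "tableName" : "adEvents",
  "createTableSql" : "CREATE TABLE mySchema.adEvents (eventId varchar(64) primary key distkey, eventTime timestamp sortkey)",
  "schedule" : { "ref" : "Hour" }
}
```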

## Syntax
<a name="redshiftdatanode-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| database | The database on which the table resides. | Reference Object, e.g. "database":{"ref":"myRedshiftDatabaseId"} | 
| tableName | The name of the Amazon Redshift table. The table is created if it doesn't already exist and you've provided createTableSql. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, a remote activity that does not complete within the set time of starting may be retried. | Period | 
| createTableSql | An SQL expression to create the table in the database. We recommend that you specify the schema where the table should be created, for example: CREATE TABLE mySchema.myTable (bestColumn varchar(25) primary key distkey, numberOfWins integer sortKey). AWS Data Pipeline runs the script in the createTableSql field if the table, specified by tableName, does not exist in the schema, specified by the schemaName field. For example, if you specify schemaName as mySchema but do not include mySchema in the createTableSql field, the table is created in the wrong schema (by default, it would be created in PUBLIC). This occurs because AWS Data Pipeline does not parse your CREATE TABLE statements. | String | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| primaryKeys | If you do not specify primaryKeys for a destination table in RedshiftCopyActivity, you can specify a list of columns using primaryKeys, which will act as a mergeKey. However, if you have an existing primaryKey defined in an Amazon Redshift table, this setting overrides the existing key. | String | 
| reportProgressTimeout | Timeout for successive calls to reportProgress by remote work. If set, remote activities that do not report progress for the specified period may be considered stalled and retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| schemaName | This optional field specifies the name of the schema for the Amazon Redshift table. If not specified, the schema name is PUBLIC, which is the default schema in Amazon Redshift. For more information, see the Amazon Redshift Database Developer Guide. | String | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# S3DataNode
<a name="dp-object-s3datanode"></a>

Defines a data node using Amazon S3. By default, `S3DataNode` uses server-side encryption. To disable server-side encryption, set s3EncryptionType to NONE.

**Note**  
When you use an `S3DataNode` as input to `CopyActivity`, only the CSV and TSV data formats are supported.

## Example
<a name="s3datanode-example"></a>

The following is an example of this object type. This object references another object that you'd define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object.

```
{
  "id" : "OutputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "filePath" : "s3://amzn-s3-demo-bucket/#{@scheduledStartTime}.csv"
}
```
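
When a node refers to a whole folder of files rather than a single object, `directoryPath` is used instead of `filePath`. A sketch (the bucket path and the `MyCsvDataFormat` data format object are illustrative):

```
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "directoryPath" : "s3://amzn-s3-demo-bucket/input/#{@scheduledStartTime}/",
  "dataFormat" : { "ref" : "MyCsvDataFormat" }
}
```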

## Syntax
<a name="s3datanode-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| compression | The type of compression for the data described by the S3DataNode. "none" is no compression and "gzip" is compressed with the gzip algorithm. This field is only supported for use with Amazon Redshift and when you use S3DataNode with CopyActivity. | Enumeration | 
| dataFormat | DataFormat for the data described by this S3DataNode. | Reference Object, e.g. "dataFormat":\$1"ref":"myDataFormatId"\$1 | 
| dependsOn | Specify dependency on another runnable object | Reference Object, e.g. "dependsOn":\$1"ref":"myActivityId"\$1 | 
| directoryPath | Amazon S3 directory path as a URI: s3://my-bucket/my-key-for-directory. You must provide either a filePath or directoryPath value. | String | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun | Enumeration | 
| filePath | The path to the object in Amazon S3 as a URI, for example: s3://my-bucket/my-key-for-file. You must provide either a filePath or directoryPath value. These represent a folder and a file name. Use the directoryPath value to accommodate multiple files in a directory. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| manifestFilePath | The Amazon S3 path to a manifest file in the format supported by Amazon Redshift. AWS Data Pipeline uses the manifest file to copy the specified Amazon S3 files into the table. This field is valid only when a RedShiftCopyActivity references the S3DataNode. | String | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| s3EncryptionType | Overrides the Amazon S3 encryption type. Values are SERVER_SIDE_ENCRYPTION or NONE. Server-side encryption is enabled by default. | Enumeration | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
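
To illustrate how several of these fields combine, the following is a hedged sketch of an `S3DataNode` that writes gzip-compressed output to a dated directory. The `CopyPeriod` schedule and the bucket path are placeholders you would define in your own pipeline, and recall that `compression` is supported only when the node is used with Amazon Redshift through `CopyActivity`.

```
{
  "id" : "MyCompressedOutput",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "directoryPath" : "s3://amzn-s3-demo-bucket/output/#{@scheduledStartTime.format('YYYY-MM-dd')}",
  "compression" : "gzip"
}
```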

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object | String | 
| @pipelineId | Id of the pipeline to which this object belongs | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects | String | 

## See Also
<a name="s3datanode-seealso"></a>
+ [MySqlDataNode](dp-object-mysqldatanode.md)

# SqlDataNode
<a name="dp-object-sqldatanode"></a>

 Defines a data node using SQL. 

## Example
<a name="example-sql-data-node"></a>

The following is an example of this object type. This object references two other objects that you'd define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object and `Ready` is a precondition object.

```
{
  "id" : "Sql Table",
  "type" : "SqlDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "table" : "adEvents",
  "database":"myDataBaseName",
  "selectQuery" : "select * from #{table} where eventTime >= '#{@scheduledStartTime.format('YYYY-MM-dd HH:mm:ss')}' and eventTime < '#{@scheduledEndTime.format('YYYY-MM-dd HH:mm:ss')}'",
  "precondition" : { "ref" : "Ready" }
}
```

## Syntax
<a name="sql-data-node-slots"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| table | The name of the table in the SQL database. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| createTableSql | An SQL create table expression that creates the table. | String | 
| database | The name of the database. | Reference Object, e.g. "database":{"ref":"myDatabaseId"} | 
| dependsOn | Specifies the dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| insertQuery | An SQL statement to insert data into the table. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| schemaName | The name of the schema holding the table | String | 
| selectQuery | A SQL statement to fetch data from the table. | String | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
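
As a sketch of how these optional fields fit together, the following hypothetical `SqlDataNode` serves as a copy target: `createTableSql` creates the table if it does not already exist, and `insertQuery` controls how rows are written. The object ids, column list, and schedule reference are placeholders, not values from this guide.

```
{
  "id" : "OutputSqlTable",
  "type" : "SqlDataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "database" : { "ref" : "MyDatabase" },
  "table" : "adEvents",
  "createTableSql" : "create table adEvents (eventId varchar(64), eventTime timestamp)",
  "insertQuery" : "insert into #{table} (eventId, eventTime) values (?, ?)"
}
```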

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object | String | 
| @pipelineId | Id of the pipeline to which this object belongs | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects | String | 

## See Also
<a name="sql-data-node-see-also"></a>
+ [S3DataNode](dp-object-s3datanode.md)

# Activities
<a name="dp-object-activities"></a>

The following are the AWS Data Pipeline activity objects:

**Topics**
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrActivity](dp-object-emractivity.md)
+ [HadoopActivity](dp-object-hadoopactivity.md)
+ [HiveActivity](dp-object-hiveactivity.md)
+ [HiveCopyActivity](dp-object-hivecopyactivity.md)
+ [PigActivity](dp-object-pigactivity.md)
+ [RedshiftCopyActivity](dp-object-redshiftcopyactivity.md)
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [SqlActivity](dp-object-sqlactivity.md)

# CopyActivity
<a name="dp-object-copyactivity"></a>

Copies data from one location to another. `CopyActivity` supports [S3DataNode](dp-object-s3datanode.md) and [SqlDataNode](dp-object-sqldatanode.md) as input and output and the copy operation is normally performed record-by-record. However, `CopyActivity` provides a high-performance Amazon S3 to Amazon S3 copy when all the following conditions are met:
+ The input and output are S3DataNodes
+ The `dataFormat` field is the same for input and output

If you provide compressed data files as input and do not indicate this using the `compression` field on the S3 data nodes, `CopyActivity` might fail. In this case, `CopyActivity` does not properly detect the end of record character and the operation fails. Further, `CopyActivity` supports copying from a directory to another directory and copying a file to a directory, but record-by-record copy occurs when copying a directory to a file. Finally, `CopyActivity` does not support copying multipart Amazon S3 files. 

`CopyActivity` has specific limitations to its CSV support. When you use an S3DataNode as input for `CopyActivity`, you can only use a Unix/Linux variant of the CSV data file format for the Amazon S3 input and output fields. The Unix/Linux variant requires the following: 
+ The separator must be the "," (comma) character.
+ The records are not quoted.
+ The default escape character is ASCII value 92 (backslash).
+ The end of record identifier is ASCII value 10 (or "\n").

Windows-based systems typically use a different end-of-record character sequence: a carriage return and line feed together (ASCII value 13 and ASCII value 10). You must accommodate this difference using an additional mechanism, such as a pre-copy script to modify the input data, to ensure that `CopyActivity` can properly detect the end of a record; otherwise, the `CopyActivity` fails repeatedly.
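
One such mechanism is a `ShellCommandActivity` that strips carriage returns before the copy runs. The sketch below is hypothetical: it assumes staged S3 data nodes (`RawWindowsData` and `InputData`), an EC2 resource defined elsewhere in the pipeline, and that the downstream `CopyActivity` reads from the cleaned node.

```
{
  "id" : "NormalizeLineEndings",
  "type" : "ShellCommandActivity",
  "schedule" : { "ref" : "CopyPeriod" },
  "runsOn" : { "ref" : "MyEc2Resource" },
  "input" : { "ref" : "RawWindowsData" },
  "output" : { "ref" : "InputData" },
  "stage" : "true",
  "command" : "tr -d '\\r' < ${INPUT1_STAGING_DIR}/data.csv > ${OUTPUT1_STAGING_DIR}/data.csv"
}
```

With `stage` set to `true`, AWS Data Pipeline copies the input to the instance, the `tr` command removes each carriage return, and the cleaned file is staged back out for the copy.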

When using `CopyActivity` to export from a PostgreSQL RDS object to a TSV data format, the default NULL character is `\n`.

## Example
<a name="copyactivity-example"></a>

The following is an example of this object type. This object references four other objects that you would define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object, `InputData` and `OutputData` are data node objects, and `MyEc2Resource` is a resource object.

```
{
  "id" : "S3ToS3Copy",
  "type" : "CopyActivity",
  "schedule" : { "ref" : "CopyPeriod" },
  "input" : { "ref" : "InputData" },
  "output" : { "ref" : "OutputData" },
  "runsOn" : { "ref" : "MyEc2Resource" }
}
```
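
The `runsOn` slot in the example points at a resource object. A minimal hypothetical `Ec2Resource` definition for it might look like the following; the instance type and termination window are illustrative choices, not requirements.

```
{
  "id" : "MyEc2Resource",
  "type" : "Ec2Resource",
  "schedule" : { "ref" : "CopyPeriod" },
  "instanceType" : "m1.medium",
  "terminateAfter" : "2 Hours"
}
```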

## Syntax
<a name="copyactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun | Enumeration | 
| input | The input data source. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
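
For example, to use the `ondemand` schedule type described above, you set it on the default object (a minimal hedged sketch follows) and then call the `ActivatePipeline` operation for each run:

```
{
  "id" : "Default",
  "scheduleType" : "ondemand"
}
```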

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object | String | 
| @pipelineId | Id of the pipeline to which this object belongs | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects | String | 

## See Also
<a name="copyactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)
+ [Export MySQL Data to Amazon S3 Using AWS Data Pipeline](dp-copydata-mysql.md)

# EmrActivity
<a name="dp-object-emractivity"></a>

 Runs an EMR cluster. 

AWS Data Pipeline uses a different format for steps than Amazon EMR; for example, AWS Data Pipeline uses comma-separated arguments after the JAR name in the `EmrActivity` step field. The following example shows a step formatted for Amazon EMR, followed by its AWS Data Pipeline equivalent:

```
s3://amzn-s3-demo-bucket/MyWork.jar arg1 arg2 arg3
```

```
"s3://amzn-s3-demo-bucket/MyWork.jar,arg1,arg2,arg3"
```

## Examples
<a name="emractivity-example"></a>

The following is an example of this object type. This example uses an older version of Amazon EMR. Verify this example for correctness against the version of the Amazon EMR cluster that you are using. 

This object references three other objects that you would define in the same pipeline definition file. `MyEmrCluster` is an `EmrCluster` object and `MyS3Input` and `MyS3Output` are `S3DataNode` objects. 

**Note**  
In this example, you can replace the `step` field with your desired cluster string, which could be a Pig script, a Hadoop streaming job, your own custom JAR including its parameters, and so on.

Hadoop 2.x (AMI 3.x)

```
{
  "id" : "MyEmrActivity",
  "type" : "EmrActivity",
  "runsOn" : { "ref" : "MyEmrCluster" },
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : ["s3://amzn-s3-demo-bucket/myPath/myStep.jar,firstArg,secondArg,-files,s3://amzn-s3-demo-bucket/myPath/myFile.py,-input,s3://myinputbucket/path,-output,s3://myoutputbucket/path,-mapper,myFile.py,-reducer,reducerName","s3://amzn-s3-demo-bucket/myPath/myotherStep.jar,..."],
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" }
}
```

**Note**  
To pass arguments to an application in a step, you need to specify the Region in the path of the script, as in the following example. In addition, you may need to escape the arguments that you pass. For example, if you use `script-runner.jar` to run a shell script and want to pass arguments to the script, you must escape the commas that separate them. The following step slot illustrates how to do this:   

```
"step" : "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://datapipeline/echo.sh,a\\\\,b\\\\,c"
```
This step uses `script-runner.jar` to run the `echo.sh` shell script and passes `a`, `b`, and `c` as a single argument to the script. Because the first escape character is removed from the resultant argument, you may need to escape again. For example, if you had `File\.gz` as an argument in JSON, you could escape it using `File\\\\.gz`. However, because the first escape is discarded, you must use `File\\\\\\\\.gz`.

## Syntax
<a name="emractivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The Amazon EMR cluster on which this job will run. | Reference Object, for example, "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, for example, "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The location of the input data. | Reference Object, for example, "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| output | The location of the output data. | Reference Object, for example, "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as 's3://BucketName/Prefix/' for uploading logs for the pipeline. | String | 
| postStepCommand | Shell scripts to be run after all steps are finished. To specify multiple scripts, up to 255, add multiple postStepCommand fields. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, for example, "precondition":{"ref":"myPreconditionId"} | 
| preStepCommand | Shell scripts to be run before any steps are run. To specify multiple scripts, up to 255, add multiple preStepCommand fields. | String | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning |  Resize the cluster before performing this activity to accommodate DynamoDB tables specified as inputs or outputs.   If your `EmrActivity` uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.   | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval, or end of the interval. Values are: cron, ondemand, and timeseries. The timeseries scheduling means that instances are scheduled at the end of each interval. The cron scheduling means that instances are scheduled at the beginning of each interval. An ondemand schedule allows you to run a pipeline one time per activation. You do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run.  | Enumeration | 
| step | One or more steps for the cluster to run. To specify multiple steps, up to 255, add multiple step fields. Use comma-separated arguments after the JAR name; for example, "s3://amzn-s3-demo-bucket/MyWork.jar,arg1,arg2,arg3". | String | 
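
As the scheduleType row above notes, an ondemand schedule must be declared on the pipeline's default object and be the only schedule type in the pipeline. A minimal sketch of that pattern (the surrounding pipeline objects are omitted and illustrative):

```
{
  "id": "Default",
  "scheduleType": "ondemand"
}
```

With this in place, each call to the ActivatePipeline operation (for example, through the AWS CLI `activate-pipeline` command) runs the pipeline once; you do not clone or re-create the pipeline between runs.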

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version that the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="emractivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrCluster](dp-object-emrcluster.md)

# HadoopActivity
<a name="dp-object-hadoopactivity"></a>

Runs a MapReduce job on a cluster. The cluster can be an EMR cluster managed by AWS Data Pipeline, or another resource if you use Task Runner. Use HadoopActivity when you want to run work in parallel. This allows you to use the scheduling resources of the YARN framework or the MapReduce resource negotiator in Hadoop 1. If you want to run work sequentially using the Amazon EMR Step action, you can still use [EmrActivity](dp-object-emractivity.md).

## Examples
<a name="hadoopactivity-example"></a>

**HadoopActivity using an EMR cluster managed by AWS Data Pipeline**  
The following HadoopActivity object uses an EmrCluster resource to run a program:

```
 {
   "name": "MyHadoopActivity",
   "schedule": {"ref": "ResourcePeriod"},
   "runsOn": {"ref": "MyEmrCluster"},
   "type": "HadoopActivity",
   "preActivityTaskConfig": {"ref": "preTaskScriptConfig"},
   "jarUri": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
   "argument": [
     "-files",
     "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
     "-mapper",
     "wordSplitter.py",
     "-reducer",
     "aggregate",
     "-input",
     "s3://elasticmapreduce/samples/wordcount/input/",
     "-output",
     "s3://amzn-s3-demo-bucket/MyHadoopActivity/#{@pipelineId}/#{format(@scheduledStartTime,'YYYY-MM-dd')}"
   ],
   "maximumRetries": "0",
   "postActivityTaskConfig": {"ref": "postTaskScriptConfig"},
   "hadoopQueue": "high"
 }
```

Here is the corresponding *MyEmrCluster*, which configures the FairScheduler and queues in YARN for Hadoop 2-based AMIs:

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopSchedulerType" : "PARALLEL_FAIR_SCHEDULING",
  "amiVersion" : "3.7.0",
  "bootstrapAction" : ["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-z,yarn.scheduler.capacity.root.queues=low\,high\,default,-z,yarn.scheduler.capacity.root.high.capacity=50,-z,yarn.scheduler.capacity.root.low.capacity=10,-z,yarn.scheduler.capacity.root.default.capacity=30"]
}
```

This is the EmrCluster you use to configure FairScheduler in Hadoop 1:

```
{
      "id": "MyEmrCluster",
      "type": "EmrCluster",    
      "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING",
      "amiVersion": "2.4.8",
      "bootstrapAction": "s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-m,mapred.queue.names=low\\\\,high\\\\,default,-m,mapred.fairscheduler.poolnameproperty=mapred.job.queue.name"
          }
```

The following EmrCluster configures CapacityScheduler for Hadoop 2-based AMIs:

```
{
      "id": "MyEmrCluster",
      "type": "EmrCluster",
      "hadoopSchedulerType": "PARALLEL_CAPACITY_SCHEDULING",
      "amiVersion": "3.7.0",
      "bootstrapAction": "s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-z,yarn.scheduler.capacity.root.queues=low\\\\,high,-z,yarn.scheduler.capacity.root.high.capacity=40,-z,yarn.scheduler.capacity.root.low.capacity=60"
    }
```

**HadoopActivity using an existing EMR cluster**  
In this example, you use worker groups and Task Runner to run a program on an existing EMR cluster. The following pipeline definition uses HadoopActivity to:
+ Run a MapReduce program only on *myWorkerGroup* resources. For more information about worker groups, see [Executing Work on Existing Resources Using Task Runner](dp-how-task-runner-user-managed.md).
+ Run a preActivityTaskConfig and postActivityTaskConfig.

```
{
  "objects": [
    {
      "argument": [
        "-files",
        "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "-mapper",
        "wordSplitter.py",
        "-reducer",
        "aggregate",
        "-input",
        "s3://elasticmapreduce/samples/wordcount/input/",
        "-output",
        "s3://amzn-s3-demo-bucket/MyHadoopActivity/#{@pipelineId}/#{format(@scheduledStartTime,'YYYY-MM-dd')}"
      ],
      "id": "MyHadoopActivity",
      "jarUri": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "name": "MyHadoopActivity",
      "type": "HadoopActivity"
    },
    {
      "id": "SchedulePeriod",
      "startDateTime": "start_datetime",
      "name": "SchedulePeriod",
      "period": "1 day",
      "type": "Schedule",
      "endDateTime": "end_datetime"
    },
    {
      "id": "ShellScriptConfig",
      "scriptUri": "s3://amzn-s3-demo-bucket/scripts/preTaskScript.sh",
      "name": "preTaskScriptConfig",
      "scriptArgument": [
        "test",
        "argument"
      ],
      "type": "ShellScriptConfig"
    },
    {
      "id": "ShellScriptConfig",
      "scriptUri": "s3://amzn-s3-demo-bucket/scripts/postTaskScript.sh",
      "name": "postTaskScriptConfig",
      "scriptArgument": [
        "test",
        "argument"
      ],
      "type": "ShellScriptConfig"
    },
    {
      "id": "Default",
      "scheduleType": "cron",
      "schedule": {
        "ref": "SchedulePeriod"
      },
      "name": "Default",
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/logs/2015-05-22T18:02:00.343Z642f3fe415",
      "maximumRetries": "0",    
      "workerGroup": "myWorkerGroup",
      "preActivityTaskConfig": {
        "ref": "preTaskScriptConfig"
      },
      "postActivityTaskConfig": {
        "ref": "postTaskScriptConfig"
      }    
    }
  ] 
}
```

## Syntax
<a name="hadoopactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| jarUri | Location of a JAR in Amazon S3 or the local file system of the cluster to run with HadoopActivity. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The EMR cluster on which this job will run. | Reference Object, e.g. "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| argument | Arguments to pass to the JAR. | String | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopQueue | The Hadoop scheduler queue name on which the activity will be submitted. | String | 
| input | Location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| mainClass | The main class of the JAR you are executing with HadoopActivity. | String | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | Location of the output data. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Values are: cron, ondemand, and timeseries. The timeseries scheduling means that instances are scheduled at the end of each interval, and the cron scheduling means that instances are scheduled at the beginning of each interval. An ondemand schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run. | Enumeration | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hadoopactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrCluster](dp-object-emrcluster.md)

# HiveActivity
<a name="dp-object-hiveactivity"></a>

Runs a Hive query on an EMR cluster. `HiveActivity` makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with `${input1}`, `${input2}`, and so on, based on the input fields in the `HiveActivity` object. 

For Amazon S3 inputs, the `dataFormat` field is used to create the Hive column names. 

For MySQL (Amazon RDS) inputs, the column names for the SQL query are used to create the Hive column names.

**Note**  
This activity uses the Hive [CSV Serde](https://cwiki.apache.org/confluence/display/Hive/CSV+Serde).

## Example
<a name="hiveactivity-example"></a>

The following is an example of this object type. This object references three other objects that you define in the same pipeline definition file. `MySchedule` is a `Schedule` object and `MyS3Input` and `MyS3Output` are data node objects.

```
{
  "name" : "ProcessLogData",
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : { "ref": "MySchedule" },
  "hiveScript" : "INSERT OVERWRITE TABLE ${output1} select host,user,time,request,status,size from ${input1};",
  "input" : { "ref": "MyS3Input" },
  "output" : { "ref": "MyS3Output" },
  "runsOn" : { "ref": "MyEmrCluster" }
}
```
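
The `MyS3Input` and `MyS3Output` nodes referenced above would be `S3DataNode` objects defined in the same pipeline. Because the input comes from Amazon S3, the `dataFormat` field supplies the Hive column names, as noted earlier. The following is a sketch of what the input node and its data format might look like; the bucket path and the choice of a `CSV` data format are illustrative assumptions, not part of the example above:

```
{
  "id" : "MyS3Input",
  "name" : "MyS3Input",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "MySchedule" },
  "directoryPath" : "s3://amzn-s3-demo-bucket/logs/#{format(@scheduledStartTime,'YYYY-MM-dd')}",
  "dataFormat" : { "ref" : "MyInputDataFormat" }
},
{
  "id" : "MyInputDataFormat",
  "name" : "MyInputDataFormat",
  "type" : "CSV",
  "column" : [
    "host STRING",
    "user STRING",
    "time STRING",
    "request STRING",
    "status STRING",
    "size STRING"
  ]
}
```

Under these assumptions, AWS Data Pipeline would create the staged Hive table `${input1}` with those six columns, which is why the `hiveScript` above can select `host`, `user`, `time`, `request`, `status`, and `size` directly.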

## Syntax
<a name="hiveactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| hiveScript | The Hive script to run. | String | 
| scriptUri | The location of the Hive script to run (for example, s3://scriptLocation). | String | 

 


****  

| Required Group | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The EMR cluster on which this HiveActivity runs. | Reference Object, e.g. "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
| input | The input data source. | Reference Object, such as "input":{"ref":"myDataNodeId"} | 
| output | The output data source. | Reference Object, such as "output":{"ref":"myDataNodeId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, such as "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopQueue | The Hadoop scheduler queue name on which the job will be submitted. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, such as "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, such as "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, such as "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, such as "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, such as "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, such as "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, such as "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs.  If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.  | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Values are: cron, ondemand, and timeseries. The timeseries scheduling means that instances are scheduled at the end of each interval, and the cron scheduling means that instances are scheduled at the beginning of each interval. An ondemand schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run. | Enumeration | 
| scriptVariable | Specifies script variables for Amazon EMR to pass to Hive while running a script. For example, the following example script variables would pass a SAMPLE and FILTER_DATE variable to Hive: SAMPLE=s3://elasticmapreduce/samples/hive-ads and FILTER_DATE=#{format(@scheduledStartTime,'YYYY-MM-dd')}%. This field accepts multiple values and works with both script and scriptUri fields. In addition, scriptVariable functions regardless of whether stage is set to true or false. This field is especially useful for sending dynamic values to Hive using AWS Data Pipeline expressions and functions. | String | 
| stage | Determines whether staging is enabled before or after running the script. Not permitted with Hive 11, so use an Amazon EMR AMI version 3.2.0 or greater. | Boolean | 
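
To illustrate the `scriptVariable` field described above, the following sketch passes a sample path and the scheduled date into a Hive script at run time. The script location, cluster reference, and variable names are hypothetical, not part of this reference:

```
{
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "runsOn" : { "ref" : "MyEmrCluster" },
  "scriptUri" : "s3://amzn-s3-demo-bucket/scripts/filter-by-date.q",
  "scriptVariable" : [
    "SAMPLE=s3://elasticmapreduce/samples/hive-ads",
    "FILTER_DATE=#{format(@scheduledStartTime,'YYYY-MM-dd')}"
  ]
}
```

Inside `filter-by-date.q`, the values would then be available to Hive as `${SAMPLE}` and `${FILTER_DATE}`.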

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, such as "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, such as "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for an object. | DateTime | 
| @scheduledStartTime | Schedule start time for an object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, such as "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hiveactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# HiveCopyActivity
<a name="dp-object-hivecopyactivity"></a>

Runs a Hive query on an EMR cluster. `HiveCopyActivity` makes it easier to copy data between DynamoDB tables. `HiveCopyActivity` accepts a HiveQL statement to filter input data from DynamoDB at the column and row level.

## Example
<a name="hivecopyactivity-example"></a>

The following example shows how to use `HiveCopyActivity` and `DynamoDBExportDataFormat` to copy data from one `DynamoDBDataNode` to another, while filtering data based on a timestamp.

```
{
  "objects": [
    {
      "id" : "DataFormat.1",
      "name" : "DataFormat.1",
      "type" : "DynamoDBExportDataFormat",
      "column" : "timeStamp BIGINT"
    },
    {
      "id" : "DataFormat.2",
      "name" : "DataFormat.2",
      "type" : "DynamoDBExportDataFormat"
    },
    {
      "id" : "DynamoDBDataNode.1",
      "name" : "DynamoDBDataNode.1",
      "type" : "DynamoDBDataNode",
      "tableName" : "item_mapped_table_restore_temp",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.1" }
    },
    {
      "id" : "DynamoDBDataNode.2",
      "name" : "DynamoDBDataNode.2",
      "type" : "DynamoDBDataNode",
      "tableName" : "restore_table",
      "region" : "us_west_1",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.2" }
    },
    {
      "id" : "EmrCluster.1",
      "name" : "EmrCluster.1",
      "type" : "EmrCluster",
      "schedule" : { "ref" : "ResourcePeriod" },
      "masterInstanceType" : "m1.xlarge",
      "coreInstanceCount" : "4"
    },
    {
      "id" : "HiveTransform.1",
      "name" : "Hive Copy Transform.1",
      "type" : "HiveCopyActivity",
      "input" : { "ref" : "DynamoDBDataNode.1" },
      "output" : { "ref" : "DynamoDBDataNode.2" },
      "schedule" :{ "ref" : "ResourcePeriod" },
      "runsOn" : { "ref" : "EmrCluster.1" },
      "filterSql" : "`timeStamp` > unix_timestamp(\"#{@scheduledStartTime}\", \"yyyy-MM-dd'T'HH:mm:ss\")"
    },
    {
      "id" : "ResourcePeriod",
      "name" : "ResourcePeriod",
      "type" : "Schedule",
      "period" : "1 Hour",
      "startDateTime" : "2013-06-04T00:00:00",
      "endDateTime" : "2013-06-04T01:00:00"
    }
  ]
}
```
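
The `filterSql` fragment above compares each item's `timeStamp` attribute against the scheduled start time converted to epoch seconds. As a rough sanity check of that conversion (a Python sketch, assuming the cluster interprets the timestamp as UTC — the actual result of Hive's `unix_timestamp` depends on the cluster's time zone):

```python
import calendar
import time

# The value that #{@scheduledStartTime} expands to for the first
# scheduled interval of the ResourcePeriod schedule above.
scheduled_start = "2013-06-04T00:00:00"

# Equivalent of Hive's unix_timestamp(ts, "yyyy-MM-dd'T'HH:mm:ss"),
# treating the timestamp as UTC.
epoch = calendar.timegm(time.strptime(scheduled_start, "%Y-%m-%dT%H:%M:%S"))
print(epoch)  # -> 1370304000; items with timeStamp > this value are copied
```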

## Syntax
<a name="hivecopyactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | Specify the cluster to run on. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specifies the dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| filterSql | A Hive SQL statement fragment that filters a subset of DynamoDB or Amazon S3 data to copy. The filter should only contain predicates and not begin with a WHERE clause, because AWS Data Pipeline adds it automatically. | String | 
| input | The input data source. This must be an S3DataNode or DynamoDBDataNode. If you use DynamoDBDataNode, specify a DynamoDBExportDataFormat. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. If input is S3DataNode, this must be DynamoDBDataNode. Otherwise, this can be S3DataNode or DynamoDBDataNode. If you use DynamoDBDataNode, specify a DynamoDBExportDataFormat. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as  's3://BucketName/Key/', for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | The post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | The pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs.  If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.  | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, which means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
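
For cron-style scheduling, instances are created at the beginning of each interval between the schedule's start and end times. The expansion of a `Schedule` such as `ResourcePeriod` above can be modeled with this illustrative sketch (not part of the service):

```python
from datetime import datetime, timedelta

def cron_style_instances(start, end, period):
    """Yield an instance start time at the beginning of each interval."""
    t = start
    while t < end:
        yield t
        t += period

# The ResourcePeriod schedule: "1 Hour" from 2013-06-04T00:00:00 to T01:00:00.
instances = list(cron_style_instances(
    datetime(2013, 6, 4, 0, 0, 0),
    datetime(2013, 6, 4, 1, 0, 0),
    timedelta(hours=1),
))
print(instances)  # a single instance, at the schedule's start time
```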

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hivecopyactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# PigActivity
<a name="dp-object-pigactivity"></a>

PigActivity provides native support for Pig scripts in AWS Data Pipeline without the requirement to use `ShellCommandActivity` or `EmrActivity`. In addition, PigActivity supports data staging. When the stage field is set to true, AWS Data Pipeline stages the input data as a schema in Pig without additional code from the user. 

## Example
<a name="pigactivity-example"></a>

The following example pipeline shows how to use `PigActivity`. The example pipeline performs the following steps:
+ MyPigActivity1 loads data from Amazon S3 and runs a Pig script that selects a few columns of data and uploads it to Amazon S3.
+ MyPigActivity2 loads the first output, selects a few columns and three rows of data, and uploads it to Amazon S3 as a second output.
+ MyPigActivity3 loads the second output, selects two rows of data and only the column named "Fifth", and uploads it to Amazon RDS.
+ MyPigActivity4 loads Amazon RDS data, selects the first row of data, and uploads it to Amazon S3.

```
{
  "objects": [
    {
      "id": "MyInputData1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "directoryPath": "s3://amzn-s3-demo-bucket/pigTestInput",
      "name": "MyInputData1",
      "dataFormat": {
        "ref": "MyInputDataType1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyPigActivity4",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData3"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path/",
      "name": "MyPigActivity4",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "type": "PigActivity",
      "dependsOn": {
        "ref": "MyPigActivity3"
      },
      "output": {
        "ref": "MyOutputData4"
      },
      "script": "B = LIMIT ${input1} 1; ${output1} = FOREACH B GENERATE one;",
      "stage": "true"
    },
    {
      "id": "MyPigActivity3",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData2"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "name": "MyPigActivity3",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "script": "B = LIMIT ${input1} 2; ${output1} = FOREACH B GENERATE Fifth;",
      "type": "PigActivity",
      "dependsOn": {
        "ref": "MyPigActivity2"
      },
      "output": {
        "ref": "MyOutputData3"
      },
      "stage": "true"
    },
    {
      "id": "MyOutputData2",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "name": "MyOutputData2",
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput2",
      "dataFormat": {
        "ref": "MyOutputDataType2"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyOutputData1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "name": "MyOutputData1",
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput1",
      "dataFormat": {
        "ref": "MyOutputDataType1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyInputDataType1",
      "name": "MyInputDataType1",
      "column": [
        "First STRING",
        "Second STRING",
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING",
        "Ninth STRING",
        "Tenth STRING"
      ],
      "inputRegEx": "^(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+)",
      "type": "RegEx"
    },
    {
      "id": "MyEmrResource",
      "region": "us-east-1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "keyPair": "example-keypair",
      "masterInstanceType": "m1.small",
      "enableDebugging": "true",
      "name": "MyEmrResource",
      "actionOnTaskFailure": "continue",
      "type": "EmrCluster"
    },
    {
      "id": "MyOutputDataType4",
      "name": "MyOutputDataType4",
      "column": "one STRING",
      "type": "CSV"
    },
    {
      "id": "MyOutputData4",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput3",
      "name": "MyOutputData4",
      "dataFormat": {
        "ref": "MyOutputDataType4"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyOutputDataType1",
      "name": "MyOutputDataType1",
      "column": [
        "First STRING",
        "Second STRING",
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING"
      ],
      "columnSeparator": "*",
      "type": "Custom"
    },
    {
      "id": "MyOutputData3",
      "username": "___",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "insertQuery": "insert into #{table} (one) values (?)",
      "name": "MyOutputData3",
      "*password": "___",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "connectionString": "jdbc:mysql://example-database-instance:3306/example-database",
      "selectQuery": "select * from #{table}",
      "table": "example-table-name",
      "type": "MySqlDataNode"
    },
    {
      "id": "MyOutputDataType2",
      "name": "MyOutputDataType2",
      "column": [
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING"
      ],
      "type": "TSV"
    },
    {
      "id": "MyPigActivity2",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData1"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "name": "MyPigActivity2",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "dependsOn": {
        "ref": "MyPigActivity1"
      },
      "type": "PigActivity",
      "script": "B = LIMIT ${input1} 3; ${output1} = FOREACH B GENERATE Third, Fourth, Fifth, Sixth, Seventh, Eighth;",
      "output": {
        "ref": "MyOutputData2"
      },
      "stage": "true"
    },
    {
      "id": "MyEmrResourcePeriod",
      "startDateTime": "2013-05-20T00:00:00",
      "name": "MyEmrResourcePeriod",
      "period": "1 day",
      "type": "Schedule",
      "endDateTime": "2013-05-21T00:00:00"
    },
    {
      "id": "MyPigActivity1",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyInputData1"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "scriptUri": "s3://amzn-s3-demo-bucket/script/pigTestScript.q",
      "name": "MyPigActivity1",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "scriptVariable": [
        "column1=First",
        "column2=Second",
        "three=3"
      ],
      "type": "PigActivity",
      "output": {
        "ref": "MyOutputData1"
      },
      "stage": "true"
    }
  ]
}
```

The content of `pigTestScript.q` is as follows.

```
B = LIMIT ${input1} $three; ${output1} = FOREACH B GENERATE $column1, $column2, Third, Fourth, Fifth, Sixth, Seventh, Eighth;
```
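
The `scriptVariable` entries on `MyPigActivity1` fill in the `$column1`, `$column2`, and `$three` placeholders in this script, while `${input1}` and `${output1}` are resolved separately by the staging mechanism. A rough sketch of the variable substitution (an illustration, not the service's actual implementation):

```python
from string import Template

# The content of pigTestScript.q shown above.
script = ("B = LIMIT ${input1} $three; "
          "${output1} = FOREACH B GENERATE $column1, $column2, "
          "Third, Fourth, Fifth, Sixth, Seventh, Eighth;")

# scriptVariable values from MyPigActivity1, parsed into a mapping.
variables = dict(v.split("=", 1)
                 for v in ["column1=First", "column2=Second", "three=3"])

# safe_substitute leaves ${input1}/${output1} for the staging step to resolve.
resolved = Template(script).safe_substitute(variables)
print(resolved)
```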

## Syntax
<a name="pigactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| script | The Pig script to run. | String | 
| scriptUri | The location of the Pig script to run (for example, s3://scriptLocation). | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | EMR cluster on which this PigActivity runs. | Reference Object, for example, "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specifies the dependency on another runnable object. | Reference Object, for example, "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The input data source. | Reference Object, for example, "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. | Reference Object, for example, "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, for example, "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, for example, "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, for example, "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs. If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.  | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning or the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval, and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, which means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| scriptVariable | The arguments to pass to the Pig script. You can use scriptVariable with script or scriptUri. | String | 
| stage | Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}. | Boolean | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version that the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="pigactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# RedshiftCopyActivity
<a name="dp-object-redshiftcopyactivity"></a>

Copies data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.

Here is an overview of a use case in which to use `RedshiftCopyActivity`:

1. Start by using AWS Data Pipeline to stage your data in Amazon S3. 

1. Use `RedshiftCopyActivity` to move the staged data from Amazon S3 to Amazon Redshift.

   This lets you load your data into Amazon Redshift where you can analyze it.

1. Use [SqlActivity](dp-object-sqlactivity.md) to perform SQL queries on the data that you've loaded into Amazon Redshift.

In addition, `RedshiftCopyActivity` lets you work with an `S3DataNode`, since it supports a manifest file. For more information, see [S3DataNode](dp-object-s3datanode.md).
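
A manifest file is a JSON document in Amazon S3 that lists the data files to load. As a hypothetical illustration (the bucket and key names are invented), one could build such a manifest with Python:

```python
import json

# Hypothetical input files; bucket and key names are invented for illustration.
manifest = {
    "entries": [
        {"url": "s3://example-bucket/data/part-0000", "mandatory": True},
        {"url": "s3://example-bucket/data/part-0001", "mandatory": True},
    ]
}

# The manifest document itself would be uploaded to Amazon S3 and
# referenced by the S3DataNode's manifestFilePath.
print(json.dumps(manifest, indent=2))
```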

## Example
<a name="redshiftcopyactivity-example"></a>

The following is an example of this object type. 

To ensure correct data conversion, this example uses the [EMPTYASNULL](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-emptyasnull) and [IGNOREBLANKLINES](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-ignoreblanklines) conversion parameters in `commandOptions`. For more information, see [Data Conversion Parameters](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html) in the *Amazon Redshift Database Developer Guide*.

```
{
  "id" : "S3ToRedshiftCopyActivity",
  "type" : "RedshiftCopyActivity",
  "input" : { "ref": "MyS3DataNode" },
  "output" : { "ref": "MyRedshiftDataNode" },
  "insertMode" : "KEEP_EXISTING",
  "schedule" : { "ref": "Hour" },
  "runsOn" : { "ref": "MyEc2Resource" },
  "commandOptions": ["EMPTYASNULL", "IGNOREBLANKLINES"]
}
```
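
What these two options do during `COPY` can be sketched as follows (a simplified Python model for illustration, not Amazon Redshift's implementation): `IGNOREBLANKLINES` skips lines that contain nothing but whitespace, and `EMPTYASNULL` loads empty fields as NULL.

```python
def model_copy(lines):
    """Simplified model of COPY with IGNOREBLANKLINES and EMPTYASNULL."""
    rows = []
    for line in lines:
        if line.strip() == "":          # IGNOREBLANKLINES: skip blank lines
            continue
        fields = line.rstrip("\n").split(",")
        # EMPTYASNULL: treat empty fields as NULL (None here)
        rows.append([f if f != "" else None for f in fields])
    return rows

sample = ["1,aaa\n", "\n", "2,\n"]
print(model_copy(sample))  # -> [['1', 'aaa'], ['2', None]]
```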

The following example pipeline definition shows an activity that uses the `APPEND` insert mode:

```
{
  "objects": [
    {
      "id": "CSVId1",
      "name": "DefaultCSV1",
      "type": "CSV"
    },
    {
      "id": "RedshiftDatabaseId1",
      "databaseName": "dbname",
      "username": "user",
      "name": "DefaultRedshiftDatabase1",
      "*password": "password",
      "type": "RedshiftDatabase",
      "clusterId": "redshiftclusterId"
    },
    {
      "id": "Default",
      "scheduleType": "timeseries",
      "failureAndRerunMode": "CASCADE",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "RedshiftDataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "tableName": "orders",
      "name": "DefaultRedshiftDataNode1",
      "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
      "type": "RedshiftDataNode",
      "database": {
        "ref": "RedshiftDatabaseId1"
      }
    },
    {
      "id": "Ec2ResourceId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "securityGroups": "MySecurityGroup",
      "name": "DefaultEc2Resource1",
      "role": "DataPipelineDefaultRole",
      "logUri": "s3://myLogs",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "type": "Ec2Resource"
    },
    {
      "id": "ScheduleId1",
      "startDateTime": "yyyy-mm-ddT00:00:00",
      "name": "DefaultSchedule1",
      "type": "Schedule",
      "period": "period",
      "endDateTime": "yyyy-mm-ddT00:00:00"
    },
    {
      "id": "S3DataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
      "name": "DefaultS3DataNode1",
      "dataFormat": {
        "ref": "CSVId1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "RedshiftCopyActivityId1",
      "input": {
        "ref": "S3DataNodeId1"
      },
      "schedule": {
        "ref": "ScheduleId1"
      },
      "insertMode": "APPEND",
      "name": "DefaultRedshiftCopyActivity1",
      "runsOn": {
        "ref": "Ec2ResourceId1"
      },
      "type": "RedshiftCopyActivity",
      "output": {
        "ref": "RedshiftDataNodeId1"
      }
    }
  ]
}
```

The `APPEND` operation adds rows to a table regardless of the primary or sort keys. For example, suppose you have the following table.

```
ID(PK)     USER
1          aaa
2          bbb
```

You can append a record with the same ID and user value:

```
ID(PK)     USER
1          aaa
2          bbb
1          aaa
```

**Note**  
If an `APPEND` operation is interrupted and retried, the resulting rerun pipeline potentially appends the data again from the beginning. This can cause additional duplicate rows, so be aware of this behavior, especially if you have any logic that counts the number of rows.
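
The difference between `APPEND` and `KEEP_EXISTING` on the example rows above can be sketched like this (a simplified model of the insert modes, not the actual staging-table mechanics):

```python
def append(table, rows):
    """APPEND: add rows unconditionally, duplicates and all."""
    return table + rows

def keep_existing(table, rows, key=0):
    """KEEP_EXISTING: add only rows whose primary key is not already present."""
    existing = {r[key] for r in table}
    return table + [r for r in rows if r[key] not in existing]

table = [(1, "aaa"), (2, "bbb")]
print(append(table, [(1, "aaa")]))         # three rows; ID 1 now duplicated
print(keep_existing(table, [(1, "aaa")]))  # unchanged; ID 1 already exists
```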

For a tutorial, see [Copy Data to Amazon Redshift Using AWS Data Pipeline](dp-copydata-redshift.md).

## Syntax
<a name="redshiftcopyactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| insertMode |  Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are: `KEEP_EXISTING`, `OVERWRITE_EXISTING`, `TRUNCATE`, and `APPEND`. `KEEP_EXISTING` adds new rows to the table, while leaving any existing rows unmodified. `KEEP_EXISTING` and `OVERWRITE_EXISTING` use the primary key, sort, and distribution keys to identify which incoming rows to match with existing rows. See [Updating and Inserting New Data](https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html) in the Amazon Redshift *Database Developer Guide*.  `TRUNCATE` deletes all the data in the destination table before writing the new data.  `APPEND` adds all records to the end of the Amazon Redshift table. `APPEND` does not require a primary, distribution, or sort key, so rows that are potential duplicates may be appended.  | Enumeration | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a schedule interval.  Specify a schedule reference to another object to set the dependency execution order for this object.  In most cases, we recommend putting the schedule reference on the default pipeline object so that all objects inherit that schedule. Alternatively, you can explicitly set a schedule on the object by specifying `"schedule": {"ref": "DefaultSchedule"}`.  If the master schedule in your pipeline contains nested schedules, create a parent object that has a schedule reference.  For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html).   | Reference Object, such as: "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| commandOptions |  Takes parameters to pass to the Amazon Redshift data node during the `COPY` operation. For information on parameters, see [COPY](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) in the Amazon Redshift *Database Developer Guide*. As it loads the table, `COPY` attempts to implicitly convert the strings to the data type of the target column. In addition to the default data conversions that happen automatically, if you receive errors or have other conversion needs, you can specify additional conversion parameters. For information, see [Data Conversion Parameters](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html) in the Amazon Redshift *Database Developer Guide*. If a data format is associated with the input or output data node, then the provided parameters are ignored.  Because the copy operation first uses `COPY` to insert data into a staging table, and then uses an `INSERT` command to copy the data from the staging table into the destination table, some `COPY` parameters do not apply, such as the `COPY` command's ability to enable automatic compression of the table. If compression is required, add column encoding details to the `CREATE TABLE` statement.  Also, in some cases when it needs to unload data from the Amazon Redshift cluster and create files in Amazon S3, the `RedshiftCopyActivity` relies on the `UNLOAD` operation from Amazon Redshift. To improve performance during copying and unloading, specify `PARALLEL OFF` parameter from the `UNLOAD` command. For information on parameters, see [UNLOAD](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) in the Amazon Redshift *Database Developer Guide*.  | String | 
| dependsOn | Specify dependency on another runnable object. | Reference Object: "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The input data node. The data source can be Amazon S3, DynamoDB, or Amazon Redshift. | Reference Object: "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object: "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object: "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object: "onSuccess":{"ref":"myActionId"} | 
| output | The output data node. The output location can be Amazon S3 or Amazon Redshift. | Reference Object: "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object: "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object: "precondition":{"ref":"myPreconditionId"} | 
| queue |  Corresponds to the `query_group` setting in Amazon Redshift, which allows you to assign and prioritize concurrent activities based on their placement in queues.  Amazon Redshift limits the number of simultaneous connections to 15. For more information, see [Assigning Queries to Queues](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-executing-queries.html) in the Amazon Redshift *Database Developer Guide*.  | String | 
| reportProgressTimeout |  Timeout for remote work successive calls to `reportProgress`.  If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried.  | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType |  Allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Values are: `cron`, `ondemand`, and `timeseries`. `timeseries` scheduling means instances are scheduled at the end of each interval. `cron` scheduling means instances are scheduled at the beginning of each interval.  An `ondemand` schedule allows you to run a pipeline one time per activation, which means you do not have to clone or re-create the pipeline to run it again.  To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run.  If you use an `ondemand` schedule, you must specify it in the default object, and it must be the only `scheduleType` specified for objects in the pipeline.  | Enumeration | 
| transformSql |  The `SQL SELECT` expression used to transform the input data.  The `transformSql` expression runs on the table named `staging`.  When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called "staging" and initially loads the data there. Data from this table is used to update the target table.  The output schema of `transformSql` must match the final target table's schema. If you specify the `transformSql` option, a second staging table is created from the specified SQL statement. The data from this second staging table is then used to update the final target table.  | String | 
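
As a hedged sketch of how `transformSql` fits into an activity definition, the fragment below filters the `staging` table before the final load; the object IDs, references, and column names are hypothetical.

```
{
  "id" : "RedshiftCopyWithTransform",
  "type" : "RedshiftCopyActivity",
  "insertMode" : "OVERWRITE_EXISTING",
  "transformSql" : "SELECT id, user_name FROM staging WHERE user_name IS NOT NULL",
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyRedshiftOutput" },
  "runsOn" : { "ref" : "MyEc2Resource" },
  "schedule" : { "ref" : "MySchedule" }
}
```

Because `transformSql` runs against the staging table, the SELECT list must match the schema of the final target table.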

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object: "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object: "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object: "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object. Denotes its place in the life cycle. For example, Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# ShellCommandActivity
<a name="dp-object-shellcommandactivity"></a>

 Runs a command or script. You can use `ShellCommandActivity` to run time-series or cron-like scheduled tasks. 

When the `stage` field is set to true and used with an `S3DataNode`, `ShellCommandActivity` supports the concept of staging data, which means that you can move data from Amazon S3 to a stage location, such as Amazon EC2 or your local environment, perform work on the data using scripts and the `ShellCommandActivity`, and move it back to Amazon S3. 

In this case, when your shell command is connected to an input `S3DataNode`, your shell scripts operate directly on the data using `${INPUT1_STAGING_DIR}`, `${INPUT2_STAGING_DIR}`, and other fields, referring to the `ShellCommandActivity` input fields. 

Similarly, output from the shell-command can be staged in an output directory to be automatically pushed to Amazon S3, referred to by `${OUTPUT1_STAGING_DIR}`, `${OUTPUT2_STAGING_DIR}`, and so on. 

These expressions can be passed as command-line arguments to the shell command for you to use in data transformation logic.

`ShellCommandActivity` returns Linux-style error codes and strings. If a `ShellCommandActivity` results in an error, the `error` returned is a non-zero value.

## Example
<a name="shellcommandactivity-example"></a>

The following is an example of this object type.

```
{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "command" : "mkdir new-directory"
}
```
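The staging behavior described earlier can be sketched as follows. This is a hypothetical example: `MyS3Input` and `MyS3Output` stand in for `S3DataNode` objects, and `MyEc2Resource` for a resource, all defined elsewhere in the same pipeline.

```
{
  "id" : "FilterComments",
  "type" : "ShellCommandActivity",
  "command" : "grep -v '^#' ${INPUT1_STAGING_DIR}/data.csv > ${OUTPUT1_STAGING_DIR}/filtered.csv",
  "stage" : "true",
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" },
  "runsOn" : { "ref" : "MyEc2Resource" }
}
```

With `stage` set to true, the input data is downloaded to `${INPUT1_STAGING_DIR}` before the command runs, and whatever the command writes to `${OUTPUT1_STAGING_DIR}` is pushed back to Amazon S3.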

## Syntax
<a name="shellcommandactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a `schedule` interval. To set the dependency execution order for this object, specify a `schedule` reference to another object.  To satisfy this requirement, explicitly set a `schedule` on the object, for example, by specifying `"schedule": {"ref": "DefaultSchedule"}`.  In most cases, it is better to put the `schedule` reference on the default pipeline object so that all objects inherit that schedule. If the pipeline consists of a tree of schedules (schedules within the master schedule), create a parent object that has a schedule reference.  To spread the load, AWS Data Pipeline creates physical objects slightly ahead of schedule, but runs them on schedule.  For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html).  | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| command | The command to run. Use $ to reference positional parameters and scriptArgument to specify the parameters for the command. This value and any associated parameters must function in the environment from which you are running the Task Runner. | String | 
| scriptUri | An Amazon S3 URI path for a file to download and run as a shell command. Specify only one scriptUri or command field. scriptUri cannot use parameters; use command instead. | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command, for example, an Amazon EC2 instance or an Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | Used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for the remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The location of the output data. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as 's3://BucketName/Key/', for uploading logs for the pipeline. | String | 
| precondition | Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for successive calls to reportProgress by remote activities. If set, then remote activities that do not report progress for the specified period may be considered stalled and are retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType |  Allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval.  The values are: `cron`, `ondemand`, and `timeseries`. If set to `timeseries`, instances are scheduled at the end of each interval.  If set to `cron`, instances are scheduled at the beginning of each interval.  If set to `ondemand`, you can run a pipeline one time, per activation. This means you do not have to clone or recreate the pipeline to run it again. If you use an `ondemand` schedule, specify it in the default object as the only `scheduleType` for objects in the pipeline. To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run.   | Enumeration | 
| scriptArgument | A JSON-formatted array of strings to pass to the command specified by command. For example, if command is echo $1 $2, specify scriptArgument as "param1", "param2". For multiple arguments and parameters, pass the scriptArgument as follows: "scriptArgument":"arg1","scriptArgument":"param1","scriptArgument":"arg2","scriptArgument":"param2". The scriptArgument can only be used with command; using it with scriptUri causes an error. | String | 
| stage | Determines whether staging is enabled and allows your shell commands to have access to the staged-data variables, such as ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}. | Boolean | 
| stderr | The path that receives redirected system error messages from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 
| stdout | The Amazon S3 path that receives redirected output from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 
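
Combining `command` and `scriptArgument`, a hedged sketch of positional parameters (the field values are hypothetical):

```
{
  "id" : "EchoArgs",
  "type" : "ShellCommandActivity",
  "command" : "echo $1 $2",
  "scriptArgument" : [ "param1", "param2" ],
  "runsOn" : { "ref" : "myResourceId" }
}
```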

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | The list of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | The time when the execution of this object finished. | DateTime | 
| @actualStartTime | The time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | The description of the dependency chain that caused the object failure. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on Amazon EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which the object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for Amazon EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | The Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was updated last time. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | The time of the latest run for which the execution completed. | DateTime | 
| @latestRunTime | The time of the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | The time of the run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | The schedule end time for object. | DateTime | 
| @scheduledStartTime | The schedule start time for object. | DateTime | 
| @status | The status of the object. | String | 
| @version | The AWS Data Pipeline version used to create the object. | String | 
| @waitingOn | The description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | The error describing the ill-formed object. | String | 
| @pipelineId | The Id of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="shellcommandactivity-seealso"></a>
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# SqlActivity
<a name="dp-object-sqlactivity"></a>

Runs an SQL query (script) on a database.

## Example
<a name="sqlactivity-example"></a>

The following is an example of this object type.

```
{
  "id" : "MySqlActivity",
  "type" : "SqlActivity",
  "database" : { "ref": "MyDatabaseID" },
  "script" : "SQLQuery",
  "schedule" : { "ref": "MyScheduleID" }
}
```

Alternatively, specify `scriptUri` (for example, `"scriptUri" : "s3://scriptBucket/query.sql"`) instead of `script`, but not both.
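As a variant of the example above with the script stored in Amazon S3 (the bucket, key, and argument are hypothetical):

```
{
  "id" : "RunStoredQuery",
  "type" : "SqlActivity",
  "database" : { "ref": "MyDatabaseID" },
  "scriptUri" : "s3://scriptBucket/query.sql",
  "scriptArgument" : [ "param1" ],
  "schedule" : { "ref": "MyScheduleID" }
}
```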

## Syntax
<a name="sqlactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| database | The database on which to run the supplied SQL script. | Reference Object, e.g. "database":{"ref":"myDatabaseId"} | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can set a schedule explicitly on the object, for example, by specifying `"schedule": {"ref": "DefaultSchedule"}`.  In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule.  If the pipeline has a tree of schedules nested within the master schedule, create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html).  | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| script | The SQL script to run. You must specify script or scriptUri. When the script is stored in Amazon S3, then script is not evaluated as an expression. Specifying multiple values for scriptArgument is helpful when the script is stored in Amazon S3. | String | 
| scriptUri | A URI specifying the location of an SQL script to execute in this activity. | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | Location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The time period since the scheduled start of the pipeline within which the object run must start. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed in the time period since the scheduled start of the pipeline as specified by 'lateAfterTimeout'. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | Location of the output data. This is only useful for referencing from within a script (for example #{output.tablename}) and for creating the output table by setting 'createTableSql' in the output data node. The output of the SQL query is not written to the output data node. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| queue | [Amazon Redshift only] Corresponds to the query_group setting in Amazon Redshift, which allows you to assign and prioritize concurrent activities based on their placement in queues. Amazon Redshift limits the number of simultaneous connections to 15. For more information, see [Assigning Queries to Queues](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-executing-queries.html) in the Amazon Redshift Database Developer Guide. | String | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType |  Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Values are: `cron`, `ondemand`, and `timeseries`.  `timeseries` scheduling means instances are scheduled at the end of each interval. `cron` scheduling means instances are scheduled at the beginning of each interval.  An `ondemand` schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an `ondemand` schedule, it must be specified in the default object and must be the only `scheduleType` specified for objects in the pipeline. To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run.  | Enumeration | 
| scriptArgument | A list of variables for the script. You can alternatively put expressions directly into the script field. Multiple values for scriptArgument are helpful when the script is stored in Amazon S3. Example: #{format(@scheduledStartTime, "YY-MM-DD HH:MM:SS")}\n#{format(plusPeriod(@scheduledStartTime, "1 day"), "YY-MM-DD HH:MM:SS")} | String | 
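
An expression-valued `scriptArgument` might look like the following in a pipeline definition. This is a sketch, and the script text is hypothetical; the `?` placeholder in the script is filled from the `scriptArgument` value at run time.

```
{
  "id" : "DailyLoad",
  "type" : "SqlActivity",
  "database" : { "ref": "MyDatabaseID" },
  "script" : "insert into loads (window_start) values (?);",
  "scriptArgument" : "#{format(@scheduledStartTime, \"YY-MM-DD HH:MM:SS\")}",
  "schedule" : { "ref": "MyScheduleID" }
}
```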

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# Resources
<a name="dp-object-resources"></a>

The following are the AWS Data Pipeline resource objects:

**Topics**
+ [Ec2Resource](dp-object-ec2resource.md)
+ [EmrCluster](dp-object-emrcluster.md)
+ [HttpProxy](dp-object-httpproxy.md)

# Ec2Resource
<a name="dp-object-ec2resource"></a>

An Amazon EC2 instance that performs the work defined by a pipeline activity.

AWS Data Pipeline now supports IMDSv2 for Amazon EC2 instances, which uses a session-oriented method to better handle authentication when retrieving metadata information from instances. A session begins and ends a series of requests that software running on an Amazon EC2 instance uses to access the locally stored instance metadata and credentials. The software starts a session with a simple HTTP PUT request to IMDSv2. IMDSv2 returns a secret token to the software running on the Amazon EC2 instance, which uses the token as a password for subsequent metadata and credential requests to IMDSv2.

**Note**  
To use IMDSv2 for your Amazon EC2 instance, you must modify the instance settings, because the default AMI is not compatible with IMDSv2. You can specify a new AMI version, which you can retrieve via the following SSM parameter: `/aws/service/ami-amazon-linux-latest/amzn-ami-hvm-x86_64-ebs`.
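
As a sketch, an `Ec2Resource` that enforces IMDSv2 combines the `disableIMDSv1` optional field (see the Syntax section below) with an IMDSv2-compatible AMI. The AMI ID here is a placeholder; use the value retrieved from the SSM parameter above.

```
{
  "id" : "MyIMDSv2Resource",
  "type" : "Ec2Resource",
  "instanceType" : "m5.large",
  "imageId" : "ami-0123456789abcdef0",
  "disableIMDSv1" : "true"
}
```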

For information about default Amazon EC2 instances that AWS Data Pipeline creates if you do not specify an instance, see [Default Amazon EC2 Instances by AWS Region](dp-ec2-default-instance-types.md).

## Examples
<a name="ec2resource-example"></a>

**EC2-Classic**

**Important**  
Only AWS accounts created before December 4, 2013 support the EC2-Classic platform. If you have one of these accounts, you may have the option to create EC2Resource objects for a pipeline in an EC2-Classic network rather than a VPC. We strongly recommend that you create resources for all your pipelines in VPCs. In addition, if you have existing resources in EC2-Classic, we recommend that you migrate them to a VPC.

The following example object launches an EC2 instance into EC2-Classic, with some optional fields set.

```
{
  "id" : "MyEC2Resource",
  "type" : "Ec2Resource",
  "actionOnTaskFailure" : "terminate",
  "actionOnResourceFailure" : "retryAll",
  "maximumRetries" : "1",
  "instanceType" : "m5.large",
  "securityGroups" : [
    "test-group",
    "default"
  ],
  "keyPair" : "my-key-pair"
}
```

**EC2-VPC**

The following example object launches an EC2 instance into a nondefault VPC, with some optional fields set.

```
{
  "id" : "MyEC2Resource",
  "type" : "Ec2Resource",
  "actionOnTaskFailure" : "terminate",
  "actionOnResourceFailure" : "retryAll",
  "maximumRetries" : "1",
  "instanceType" : "m5.large",
  "securityGroupIds" : [
    "sg-12345678",
    "sg-87654321"
  ],
  "subnetId": "subnet-12345678",
  "associatePublicIpAddress": "true",
  "keyPair" : "my-key-pair"
}
```
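
Objects such as the examples above are submitted as part of a pipeline definition file with a top-level `objects` array. The following Python sketch assembles such a definition around the EC2-VPC example; the field values are the same placeholders used above.

```python
import json

# A pipeline definition file wraps pipeline objects in a top-level
# "objects" array. This reuses the EC2-VPC example object from above.
ec2_resource = {
    "id": "MyEC2Resource",
    "type": "Ec2Resource",
    "actionOnTaskFailure": "terminate",
    "actionOnResourceFailure": "retryAll",
    "maximumRetries": "1",
    "instanceType": "m5.large",
    "subnetId": "subnet-12345678",
    "associatePublicIpAddress": "true",
    "keyPair": "my-key-pair",
}

definition = {"objects": [ec2_resource]}
print(json.dumps(definition, indent=2))
```

Note that scalar field values (including numbers and booleans such as `maximumRetries` and `associatePublicIpAddress`) are written as strings, matching the examples above.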

## Syntax
<a name="ec2resource-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| resourceRole | The IAM role that controls the resources that the Amazon EC2 instance can access. | String | 
| role | The IAM role that AWS Data Pipeline uses to create the EC2 instance. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a schedule interval. To set the dependency execution order for this object, specify a schedule reference to another object. You can do this in one of the following ways: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html)  | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| actionOnResourceFailure | The action taken after a resource failure for this resource. Valid values are "retryall" and "retrynone". | String | 
| actionOnTaskFailure | The action taken after a task failure for this resource. Valid values are "continue" or "terminate". | String | 
| associatePublicIpAddress | Indicates whether to assign a public IP address to the instance. If the instance is in Amazon EC2 or Amazon VPC, the default value is true. Otherwise, the default value is false. | Boolean | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for the remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| availabilityZone | The Availability Zone in which to launch the Amazon EC2 instance. | String | 
| disableIMDSv1 | The default value is false, which enables both IMDSv1 and IMDSv2. If you set this field to true, IMDSv1 is disabled and only IMDSv2 is used. | Boolean | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| httpProxy | The proxy host that clients use to connect to AWS services. | Reference Object, for example, "httpProxy":{"ref":"myHttpProxyId"} | 
| imageId | The ID of the AMI to use for the instance. By default, AWS Data Pipeline uses the HVM AMI virtualization type. The specific AMI IDs used vary by Region. You can override the default AMI by specifying the HVM AMI of your choice. For more information about AMI types, see [Linux AMI Virtualization Types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html) and [Finding a Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html) in the *Amazon EC2 User Guide*.  | String | 
| initTimeout | The amount of time to wait for the resource to start.  | Period | 
| instanceCount | Deprecated. | Integer | 
| instanceType | The type of Amazon EC2 instance to start. | String | 
| keyPair | The name of the key pair. If you launch an Amazon EC2 instance without specifying a key pair, you cannot log on to it. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| minInstanceCount | Deprecated. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still running. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| parent | The parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| region |  The code for the Region in which the Amazon EC2 instance should run. By default, the instance runs in the same Region as the pipeline. You can run the instance in the same Region as a dependent dataset. | Enumeration | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and will be retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runAsUser | The user to run the TaskRunner. | String | 
| runsOn | This field is not allowed on this object. | Reference Object, for example, "runsOn":{"ref":"myResourceId"} | 
| scheduleType |  Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of an interval, at the end of the interval, or on demand. Values are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html)  | Enumeration | 
| securityGroupIds | The IDs of one or more Amazon EC2 security groups to use for the instances in the resource pool. | String | 
| securityGroups | One or more Amazon EC2 security groups to use for the instances in the resource pool. | String | 
| spotBidPrice | The maximum amount per hour for your Spot Instance in dollars, which is a decimal value between 0 and 20.00, exclusive. | String | 
| subnetId | The ID of the Amazon EC2 subnet in which to start the instance. | String | 
| terminateAfter | The number of hours after which to terminate the resource. | Period | 
| useOnDemandOnLastAttempt | On the last attempt to request a Spot Instance, make a request for On-Demand Instances rather than a Spot Instance. This ensures that if all previous attempts have failed, the last attempt is not interrupted. | Boolean | 
| workerGroup | This field is not allowed on this object. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain on which the object failed. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Step logs are available only on Amazon EMR activity attempts. | String | 
| errorId | The error ID if this object failed. | String | 
| errorMessage | The error message if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @failureReason | The reason for the resource failure. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for Amazon EMR activities. | String | 
| @healthStatus | The health status of the object that reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was last updated. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that the remote activity reported progress. | DateTime | 
| @scheduledEndTime | The schedule end time for the object. | DateTime | 
| @scheduledStartTime | The schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | The pipeline version with which the object was created. | String | 
| @waitingOn | Description of the list of dependencies on which this object is waiting. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component objects give rise to instance objects, which execute attempt objects. | String | 

# EmrCluster
<a name="dp-object-emrcluster"></a>

Represents the configuration of an Amazon EMR cluster. This object is used by [EmrActivity](dp-object-emractivity.md) and [HadoopActivity](dp-object-hadoopactivity.md) to launch a cluster.

**Topics**
+ [Schedulers](#emrcluster-schedulers)
+ [Amazon EMR Release Versions](#dp-emrcluster-release-versions)
+ [Amazon EMR permissions](#w2aac52c17b9c11)
+ [Syntax](#emrcluster-syntax)
+ [Examples](emrcluster-example.md)
+ [See Also](#emrcluster-seealso)

## Schedulers
<a name="emrcluster-schedulers"></a>

Schedulers provide a way to specify resource allocation and job prioritization within a Hadoop cluster. Administrators or users can choose a scheduler for various classes of users and applications. A scheduler could use queues to allocate resources to users and applications. You set up those queues when you create the cluster. You can then set up priority for certain types of work and users over others. This provides for efficient use of cluster resources, while allowing more than one user to submit work to the cluster. There are three types of scheduler available:
+ [FairScheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html) — Attempts to schedule resources evenly over a significant period of time.
+ [CapacityScheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) — Uses queues to allow cluster administrators to assign users to queues of varying priority and resource allocation. 
+ Default — The scheduler the cluster uses by default, which your site can configure.
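
The scheduler is selected on the `EmrCluster` object with the `hadoopSchedulerType` field (see the Syntax table below). The following is a minimal sketch; the instance fields are placeholder values.

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopSchedulerType" : "PARALLEL_FAIR_SCHEDULING",
  "masterInstanceType" : "m5.xlarge",
  "coreInstanceType" : "m5.xlarge",
  "coreInstanceCount" : "2"
}
```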

## Amazon EMR Release Versions
<a name="dp-emrcluster-release-versions"></a>

An Amazon EMR release is a set of open-source applications from the big data ecosystem. Each release comprises different big data applications, components, and features that you select to have Amazon EMR install and configure when you create a cluster. You specify the release version using the release label. Release labels are in the form `emr-x.x.x`. For example, `emr-5.30.0`. Amazon EMR clusters based on release label `emr-4.0.0` and later use the `releaseLabel` property to specify the release label of an `EmrCluster` object. Earlier versions use the `amiVersion` property.
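
The property split described above can be expressed as a small helper. This is an illustrative sketch, not part of any AWS SDK: given a release string, it reports which `EmrCluster` property should carry it.

```python
def emr_release_property(release):
    """Return the EmrCluster property for a given EMR release string.

    Releases labeled emr-4.0.0 and later are specified with
    `releaseLabel`; earlier AMI-style versions use `amiVersion`.
    """
    if release.startswith("emr-"):
        major = int(release[len("emr-"):].split(".")[0])
        if major >= 4:
            return "releaseLabel"
    return "amiVersion"

print(emr_release_property("emr-5.30.0"))
print(emr_release_property("3.11.0"))
```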

**Important**  
All Amazon EMR clusters created using release version 5.22.0 or later use [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html) to authenticate requests to Amazon S3. Some earlier release versions use Signature Version 2. Signature Version 2 support is being discontinued. For more information, see [Amazon S3 Update – SigV2 Deprecation Period Extended and Modified](https://aws.amazon.com/blogs/aws/amazon-s3-update-sigv2-deprecation-period-extended-modified/). We strongly recommend that you use an Amazon EMR release version that supports Signature Version 4. For earlier version releases, beginning with EMR 4.7.x, the most recent release in the series has been updated to support Signature Version 4. When using an earlier version EMR release, we recommend that you use the latest release in the series. In addition, avoid releases earlier than EMR 4.7.0.

### Considerations and Limitations
<a name="dp-emrcluster-considerations"></a>

#### Use the latest version of Task Runner
<a name="dp-task-runner-latest"></a>

If you are using a self-managed `EmrCluster` object with a release label, use the latest Task Runner. For more information about Task Runner, see [Working with Task Runner](dp-using-task-runner.md). You can configure property values for all Amazon EMR configuration classifications. For more information, see [Configuring Applications](http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html) in the *Amazon EMR Release Guide*, the [EmrConfiguration](dp-object-emrconfiguration.md), and [Property](dp-object-property.md) object references. 

#### Support for IMDSv2
<a name="dp-emr-imdsv2-support"></a>

Earlier, AWS Data Pipeline supported only IMDSv1. Now, AWS Data Pipeline supports IMDSv2 in Amazon EMR 5.23.1, 5.27.1, and 5.32 or later, and Amazon EMR 6.2 or later. IMDSv2 uses a session-oriented method to better handle authentication when retrieving metadata information from instances. You should configure your instances to make IMDSv2 calls by creating user-managed resources using TaskRunner-2.0.

#### Amazon EMR 5.32 or later and Amazon EMR 6.x
<a name="dp-emr-6-classpath"></a>

The Amazon EMR 5.32 or later and 6.x release series uses Hadoop version 3.x, which introduced breaking changes in how Hadoop's classpath is evaluated as compared to Hadoop version 2.x. Common libraries like Joda-Time were removed from the classpath.

If [EmrActivity](dp-object-emractivity.md) or [HadoopActivity](dp-object-hadoopactivity.md) runs a Jar file that has dependencies on a library that was removed in Hadoop 3.x, the step fails with the error `java.lang.NoClassDefFoundError` or `java.lang.ClassNotFoundException`. This can happen for Jar files that ran with no issues using Amazon EMR 5.x release versions.

To fix the issue, you must copy Jar file dependencies to the Hadoop classpath on an `EmrCluster` object before starting the `EmrActivity` or the `HadoopActivity`. We provide a bash script to do this. The bash script is available in the following location, where *MyRegion* is the AWS Region where your `EmrCluster` object runs, for example `us-west-2`.

```
s3://datapipeline-MyRegion/MyRegion/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh
```

The way to run the script depends on whether `EmrActivity` or `HadoopActivity` runs on a resource managed by AWS Data Pipeline or runs on a self-managed resource.

If you use a resource managed by AWS Data Pipeline, add a `bootstrapAction` to the `EmrCluster` object. The `bootstrapAction` specifies the script and the Jar files to copy as arguments. You can add up to 255 `bootstrapAction` fields per `EmrCluster` object, and you can add a `bootstrapAction` field to an `EmrCluster` object that already has bootstrap actions.

To specify this script as a bootstrap action, use the following syntax, where `JarFileRegion` is the Region where the Jar file is saved, and each *MyJarFileN* is the absolute path in Amazon S3 of a Jar file to be copied to the Hadoop classpath. Do not specify Jar files that are in the Hadoop classpath by default.

```
s3://datapipeline-MyRegion/MyRegion/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh,JarFileRegion,MyJarFile1,MyJarFile2[, ...]
```
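
The following Python helper (an illustrative sketch, not part of any AWS tooling) assembles that comma-separated `bootstrapAction` value from the Region that hosts the script, the Jar file Region, and the Jar paths:

```python
def copy_jars_bootstrap_action(pipeline_region, jar_region, jar_paths):
    """Assemble the comma-separated bootstrapAction value shown above.

    pipeline_region locates the AWS-provided script bucket; jar_region
    and each jar path are passed as arguments to the script.
    """
    script = ("s3://datapipeline-{r}/{r}/bootstrap-actions/latest/"
              "TaskRunner/copy-jars-to-hadoop-classpath.sh").format(r=pipeline_region)
    return ",".join([script, jar_region] + list(jar_paths))

action = copy_jars_bootstrap_action(
    "us-west-2", "us-west-2", ["s3://path/to/my-jar-file.jar"])
print(action)
```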

The following example specifies a bootstrap action that copies two Jar files in Amazon S3: `my-jar-file.jar` and `emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar`. The Region used in the example is us-west-2.

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m5.xlarge",
  "coreInstanceType" : "m5.xlarge",
  "coreInstanceCount" : "2",
  "taskInstanceType" : "m5.xlarge",
  "taskInstanceCount": "2",
  "bootstrapAction" : ["s3://datapipeline-us-west-2/us-west-2/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh,us-west-2,s3://path/to/my-jar-file.jar,s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar"]
}
```

You must save and activate the pipeline for the new `bootstrapAction` to take effect.

If you use a self-managed resource, you can download the script to the cluster instance and run it from the command line using SSH. The script creates a directory named `/etc/hadoop/conf/shellprofile.d` and a file named `datapipeline-jars.sh` in that directory. The jar files provided as command-line arguments are copied to a directory that the script creates named `/home/hadoop/datapipeline_jars`. If your cluster is set up differently, modify the script appropriately after downloading it.

The syntax for running the script on the command line is slightly different from using the `bootstrapAction` shown in the previous example. Use spaces instead of commas between arguments, as shown in the following example.

```
./copy-jars-to-hadoop-classpath.sh us-west-2 s3://path/to/my-jar-file.jar s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar
```

## Amazon EMR permissions
<a name="w2aac52c17b9c11"></a>

When you create a custom IAM role, carefully consider the minimum permissions necessary for your cluster to perform its work. Be sure to grant access to required resources, such as files in Amazon S3 or data in Amazon RDS, Amazon Redshift, or DynamoDB. If you wish to set `visibleToAllUsers` to False, your role must have the proper permissions to do so. Note that `DataPipelineDefaultRole` does not have these permissions. You must either provide a union of the `DefaultDataPipelineResourceRole` and `DataPipelineDefaultRole` roles as the `EmrCluster` object role, or create your own role for this purpose.

## Syntax
<a name="emrcluster-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| actionOnResourceFailure | The action taken after a resource failure for this resource. Valid values are "retryall", which retries all tasks to the cluster for the specified duration, and "retrynone". | String | 
| actionOnTaskFailure | The action taken after task failure for this resource. Valid values are "continue", meaning do not terminate the cluster, and "terminate." | String | 
| additionalMasterSecurityGroupIds | The identifier of additional master security groups of the EMR cluster, which follows the form sg-01XXXX6a. For more information, see [Amazon EMR Additional Security Groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-additional-sec-groups.html) in the Amazon EMR Management Guide. | String | 
| additionalSlaveSecurityGroupIds | The identifier of additional slave security groups of the EMR cluster, which follows the form sg-01XXXX6a. | String | 
| amiVersion | The Amazon Machine Image (AMI) version that Amazon EMR uses to install the cluster nodes. For more information, see the [Amazon EMR Management Guide](https://docs.aws.amazon.com/emr/latest/ManagementGuide/). | String | 
| applications | Applications to install in the cluster with comma-separated arguments. By default, Hive and Pig are installed. This parameter is applicable only for Amazon EMR version 4.0 and later. | String | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| availabilityZone | The Availability Zone in which to run the cluster. | String | 
| bootstrapAction | An action to run when the cluster starts. You can specify comma-separated arguments. To specify multiple actions, up to 255, add multiple bootstrapAction fields. The default behavior is to start the cluster without any bootstrap actions. | String | 
| configuration | Configuration for the Amazon EMR cluster. This parameter is applicable only for Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration":{"ref":"myEmrConfigurationId"} | 
| coreInstanceBidPrice | The maximum Spot price you are willing to pay for Amazon EC2 instances. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. Specified in USD. | String | 
| coreInstanceCount | The number of core nodes to use for the cluster. | Integer | 
| coreInstanceType | The type of Amazon EC2 instance to use for core nodes. See [Supported Amazon EC2 Instances for Amazon EMR Clusters](dp-emr-supported-instance-types.md). | String | 
| coreGroupConfiguration | The configuration for the Amazon EMR cluster core instance group. This parameter is applicable only for Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration":{"ref":"myEmrConfigurationId"} | 
| coreEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the core nodes in the core group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example, "coreEbsConfiguration":{"ref":"myEbsConfiguration"} | 
| customAmiId | Applies only to Amazon EMR release version 5.7.0 and later. Specifies the AMI ID of a custom AMI to use when Amazon EMR provisions Amazon EC2 instances. It can also be used instead of bootstrap actions to customize cluster node configurations. For more information, see [Using a custom AMI](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html) in the Amazon EMR Management Guide. | String | 
| EbsBlockDeviceConfig |  The configuration of a requested Amazon EBS block device associated with the instance group. Includes a specified number of volumes that will be associated with each instance in the instance group. Includes `volumesPerInstance` and `volumeSpecification`, where:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html)  | Reference Object, for example, "EbsBlockDeviceConfig":{"ref":"myEbsBlockDeviceConfig"} | 
| emrManagedMasterSecurityGroupId | The identifier of the master security group of the Amazon EMR cluster, which follows the form sg-01XXXX6a. For more information, see [Configure Security Groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-groups.html) in the Amazon EMR Management Guide. | String | 
| emrManagedSlaveSecurityGroupId | The identifier of the slave security group of the Amazon EMR cluster, which follows the form sg-01XXXX6a. | String | 
| enableDebugging | Enables debugging on the Amazon EMR cluster. | String | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopSchedulerType | The scheduler type of the cluster. Valid types are: PARALLEL_FAIR_SCHEDULING, PARALLEL_CAPACITY_SCHEDULING, and DEFAULT_SCHEDULER. | Enumeration | 
| httpProxy | The proxy host that clients use to connect to AWS services. | Reference Object, for example, "httpProxy":{"ref":"myHttpProxyId"} | 
| initTimeout | The amount of time to wait for the resource to start.  | Period | 
| keyPair | The Amazon EC2 key pair to use to log on to the master node of the Amazon EMR cluster. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| masterInstanceBidPrice | The maximum Spot price you are willing to pay for Amazon EC2 instances. It is a decimal value between 0 and 20.00, exclusive. Specified in USD. Setting this value enables Spot Instances for the Amazon EMR cluster master node. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. | String | 
| masterInstanceType | The type of Amazon EC2 instance to use for the master node. See [Supported Amazon EC2 Instances for Amazon EMR Clusters](dp-emr-supported-instance-types.md). | String | 
| masterGroupConfiguration | The configuration for the Amazon EMR cluster master instance group. This parameter is applicable only for Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration":{"ref":"myEmrConfigurationId"} | 
| masterEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the master nodes in the master group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example, "masterEbsConfiguration":{"ref":"myEbsConfiguration"} | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| parent | The parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| region | The code for the Region in which the Amazon EMR cluster should run. By default, the cluster runs in the same Region as the pipeline. You can run the cluster in the same Region as a dependent dataset.  | Enumeration | 
| releaseLabel | Release label for the EMR cluster. | String | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and will be retried. | Period | 
| resourceRole | The IAM role that AWS Data Pipeline uses to create the Amazon EMR cluster. The default role is DataPipelineDefaultRole.  | String | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| role | The IAM role passed to Amazon EMR to create EC2 nodes. | String | 
| runsOn | This field is not allowed on this object. | Reference Object, for example, "runsOn":{"ref":"myResourceId"} | 
| securityConfiguration | The identifier of the EMR security configuration that will be applied to the cluster. This parameter is applicable only for Amazon EMR version 4.8.0 and later. | String | 
| serviceAccessSecurityGroupId | The identifier for the service access security group of the Amazon EMR cluster, which follows the form sg-01XXXX6a (for example, sg-1234abcd). | String | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval, or end of the interval. Values are: cron, ondemand, and timeseries. The timeseries scheduling means that instances are scheduled at the end of each interval. The cron scheduling means that instances are scheduled at the beginning of each interval. An ondemand schedule allows you to run a pipeline one time per activation. You do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run. | Enumeration | 
| subnetId | The identifier of the subnet into which to launch the Amazon EMR cluster. | String | 
| supportedProducts | A parameter that installs third-party software on an Amazon EMR cluster, for example, a third-party distribution of Hadoop. | String | 
| taskInstanceBidPrice | The maximum Spot price you are willing to pay for EC2 instances. A decimal value between 0 and 20.00, exclusive. Specified in USD. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. | String | 
| taskInstanceCount | The number of task nodes to use for the Amazon EMR cluster. | Integer | 
| taskInstanceType | The type of Amazon EC2 instance to use for task nodes. | String | 
| taskGroupConfiguration | The configuration for the Amazon EMR cluster task instance group. This parameter is applicable only for Amazon EMR version 4.0 and later. | Reference Object, for example "configuration": {"ref": "myEmrConfigurationId"} | 
| taskEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the task nodes in the task group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example "taskEbsConfiguration": {"ref": "myEbsConfiguration"} | 
| terminateAfter | Terminate the resource after this many hours. | Integer | 
| VolumeSpecification | The Amazon EBS volume specifications, such as volume type, IOPS, and size in gibibytes (GiB), that will be requested for the Amazon EBS volume attached to an Amazon EC2 instance in the Amazon EMR cluster. The node can be a core, master, or task node. The `VolumeSpecification` includes: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html) | Reference Object, for example "VolumeSpecification": {"ref": "myVolumeSpecification"} | 
| useOnDemandOnLastAttempt | On the last attempt to request a resource, make a request for On-Demand Instances rather than Spot Instances. This ensures that if all previous attempts have failed, the last attempt is not interrupted.  | Boolean | 
| workerGroup | Field not allowed on this object. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain on which the object failed. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Step logs available only on Amazon EMR activity attempts. | String | 
| errorId | The error ID if this object failed. | String | 
| errorMessage | The error message if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @failureReason | The reason for the resource failure. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for Amazon EMR activities. | String | 
| @healthStatus | The health status of the object that reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | ID of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version with which the object was created. | String | 
| @waitingOn | Description of the list of dependencies on which this object is waiting. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component objects give rise to instance objects, which execute attempt objects. | String | 

# Examples
<a name="emrcluster-example"></a>

The following are examples of this object type.

**Topics**
+ [Launch an Amazon EMR cluster with hadoopVersion](emrcluster-example-launch.md)
+ [Launch an Amazon EMR cluster with release label emr-4.x or greater](emrcluster-example-release-label.md)
+ [Install additional software on your Amazon EMR cluster](emrcluster-example-install-software.md)
+ [Disable server-side encryption on 3.x releases](emrcluster-example1-disable-encryption.md)
+ [Disable server-side encryption on 4.x releases](emrcluster-example2-disable-encryption.md)
+ [Configure Hadoop KMS ACLs and create encryption zones in HDFS](emrcluster-example-hadoop-kms.md)
+ [Specify custom IAM roles](emrcluster-example-custom-iam-roles.md)
+ [Use EmrCluster Resource in AWS SDK for Java](emrcluster-example-java.md)
+ [Configure an Amazon EMR cluster in a private subnet](emrcluster-example-private-subnet.md)
+ [Attach EBS volumes to cluster nodes](emrcluster-example-ebs.md)

# Launch an Amazon EMR cluster with hadoopVersion
<a name="emrcluster-example-launch"></a>

**Example**  <a name="example1"></a>
The following example launches an Amazon EMR cluster using AMI version 1.0 and Hadoop 0.20.  

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m3.xlarge",
  "coreInstanceType" : "m3.xlarge",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m3.xlarge",
  "taskInstanceCount": "10",
  "bootstrapAction" : ["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,arg1,arg2,arg3","s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop/configure-other-stuff,arg1,arg2"]
}
```

# Launch an Amazon EMR cluster with release label emr-4.x or greater
<a name="emrcluster-example-release-label"></a>

**Example**  
The following example launches an Amazon EMR cluster using the newer `releaseLabel` field:  

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m3.xlarge",
  "coreInstanceType" : "m3.xlarge",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m3.xlarge",
  "taskInstanceCount": "10",
  "releaseLabel": "emr-4.1.0",
  "applications": ["spark", "hive", "pig"],
  "configuration": {"ref":"myConfiguration"}  
}
```

# Install additional software on your Amazon EMR cluster
<a name="emrcluster-example-install-software"></a>

**Example**  <a name="example2"></a>
`EmrCluster` provides the `supportedProducts` field, which installs third-party software on an Amazon EMR cluster; for example, it lets you install a custom distribution of Hadoop, such as MapR. The field accepts a comma-separated list of arguments for the third-party software to read and act on. The following example shows how to use the `supportedProducts` field of `EmrCluster` to create a custom MapR M3 edition cluster with Karmasphere Analytics installed, and run an `EmrActivity` object on it.  

```
{
    "id": "MyEmrActivity",
    "type": "EmrActivity",
    "schedule": {"ref": "ResourcePeriod"},
    "runsOn": {"ref": "MyEmrCluster"},
    "postStepCommand": "echo Ending job >> /mnt/var/log/stepCommand.txt",    
    "preStepCommand": "echo Starting job > /mnt/var/log/stepCommand.txt",
    "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output, \
     hdfs:///output32113/,-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate"
  },
  {    
    "id": "MyEmrCluster",
    "type": "EmrCluster",
    "schedule": {"ref": "ResourcePeriod"},
    "supportedProducts": ["mapr,--edition,m3,--version,1.2,--key1,value1","karmasphere-enterprise-utility"],
    "masterInstanceType": "m3.xlarge",
    "taskInstanceType": "m3.xlarge"
}
```

# Disable server-side encryption on 3.x releases
<a name="emrcluster-example1-disable-encryption"></a>

**Example**  <a name="example3"></a>
An `EmrCluster` object with Hadoop version 2.x created by AWS Data Pipeline enables server-side encryption by default. To disable server-side encryption, you must specify a bootstrap action in the cluster object definition.  
The following example creates an `EmrCluster` activity with server-side encryption disabled:  

```
{  
   "id":"NoSSEEmrCluster",
   "type":"EmrCluster",
   "hadoopVersion":"2.x",
   "keyPair":"my-key-pair",
   "masterInstanceType":"m3.xlarge",
   "coreInstanceType":"m3.large",
   "coreInstanceCount":"10",
   "taskInstanceType":"m3.large",
   "taskInstanceCount":"10",
   "bootstrapAction":["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-e, fs.s3.enableServerSideEncryption=false"]
}
```

# Disable server-side encryption on 4.x releases
<a name="emrcluster-example2-disable-encryption"></a>

**Example**  <a name="example4"></a>
You must disable server-side encryption using an `EmrConfiguration` object.  
The following example creates an `EmrCluster` activity with server-side encryption disabled:  

```
   {
      "name": "ReleaseLabelCluster",
      "releaseLabel": "emr-4.1.0",
      "applications": ["spark", "hive", "pig"],
      "id": "myResourceId",
      "type": "EmrCluster",
      "configuration": {
        "ref": "disableSSE"
      }
    },
    {
      "name": "disableSSE",
      "id": "disableSSE",
      "type": "EmrConfiguration",
      "classification": "emrfs-site",
      "property": [{
        "ref": "enableServerSideEncryption"
      }
      ]
    },
    {
      "name": "enableServerSideEncryption",
      "id": "enableServerSideEncryption",
      "type": "Property",
      "key": "fs.s3.enableServerSideEncryption",
      "value": "false"
    }
```

# Configure Hadoop KMS ACLs and create encryption zones in HDFS
<a name="emrcluster-example-hadoop-kms"></a>

**Example**  <a name="example5"></a>
The following objects create ACLs for Hadoop KMS and create encryption zones and corresponding encryption keys in HDFS:  

```
{
      "name": "kmsAcls",
      "id": "kmsAcls",
      "type": "EmrConfiguration",
      "classification": "hadoop-kms-acls",
      "property": [
        {"ref":"kmsBlacklist"},
        {"ref":"kmsAcl"}
      ]
    },
    {
      "name": "hdfsEncryptionZone",
      "id": "hdfsEncryptionZone",
      "type": "EmrConfiguration",
      "classification": "hdfs-encryption-zones",
      "property": [
        {"ref":"hdfsPath1"},
        {"ref":"hdfsPath2"}
      ]
    },
    {
      "name": "kmsBlacklist",
      "id": "kmsBlacklist",
      "type": "Property",
      "key": "hadoop.kms.blacklist.CREATE",
      "value": "foo,myBannedUser"
    },
    {
      "name": "kmsAcl",
      "id": "kmsAcl",
      "type": "Property",
      "key": "hadoop.kms.acl.ROLLOVER",
      "value": "myAllowedUser"
    },
    {
      "name": "hdfsPath1",
      "id": "hdfsPath1",
      "type": "Property",
      "key": "/myHDFSPath1",
      "value": "path1_key"
    },
    {
      "name": "hdfsPath2",
      "id": "hdfsPath2",
      "type": "Property",
      "key": "/myHDFSPath2",
      "value": "path2_key"
    }
```

# Specify custom IAM roles
<a name="emrcluster-example-custom-iam-roles"></a>

**Example**  <a name="example6"></a>
By default, AWS Data Pipeline passes `DataPipelineDefaultRole` as the Amazon EMR service role and `DataPipelineDefaultResourceRole` as the Amazon EC2 instance profile to create resources on your behalf. However, you can create a custom Amazon EMR service role and a custom instance profile and use them instead. AWS Data Pipeline should have sufficient permissions to create clusters using the custom role, and you must add AWS Data Pipeline as a trusted entity.  
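The exact trust policy depends on your account setup, but as a rough sketch, a custom role that AWS Data Pipeline can assume needs a trust relationship along the following lines. The service principals shown here are for AWS Data Pipeline and Amazon EMR; adjust them to match the services that must assume the role:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "datapipeline.amazonaws.com",
          "elasticmapreduce.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```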
The following example object specifies custom roles for the Amazon EMR cluster:  

```
{  
   "id":"MyEmrCluster",
   "type":"EmrCluster",
   "hadoopVersion":"2.x",
   "keyPair":"my-key-pair",
   "masterInstanceType":"m3.xlarge",
   "coreInstanceType":"m3.large",
   "coreInstanceCount":"10",
   "taskInstanceType":"m3.large",
   "taskInstanceCount":"10",
   "role":"emrServiceRole",
   "resourceRole":"emrInstanceProfile"
}
```

# Use EmrCluster Resource in AWS SDK for Java
<a name="emrcluster-example-java"></a>

**Example**  <a name="example7"></a>
The following example shows how to use an `EmrCluster` and `EmrActivity` to create an Amazon EMR 4.x cluster to run a Spark step using the Java SDK:  

```
public class dataPipelineEmr4 {

  public static void main(String[] args) {
    
	AWSCredentials credentials = null;
	credentials = new ProfileCredentialsProvider("/path/to/AwsCredentials.properties","default").getCredentials();
	DataPipelineClient dp = new DataPipelineClient(credentials);
	CreatePipelineRequest createPipeline = new CreatePipelineRequest().withName("EMR4SDK").withUniqueId("unique");
	CreatePipelineResult createPipelineResult = dp.createPipeline(createPipeline);
	String pipelineId = createPipelineResult.getPipelineId();
    
	PipelineObject emrCluster = new PipelineObject()
	    .withName("EmrClusterObj")
	    .withId("EmrClusterObj")
	    .withFields(
			new Field().withKey("releaseLabel").withStringValue("emr-4.1.0"),
			new Field().withKey("coreInstanceCount").withStringValue("3"),
			new Field().withKey("applications").withStringValue("spark"),
			new Field().withKey("applications").withStringValue("Presto-Sandbox"),
			new Field().withKey("type").withStringValue("EmrCluster"),
			new Field().withKey("keyPair").withStringValue("myKeyName"),
			new Field().withKey("masterInstanceType").withStringValue("m3.xlarge"),
			new Field().withKey("coreInstanceType").withStringValue("m3.xlarge")        
			);
  
	PipelineObject emrActivity = new PipelineObject()
	    .withName("EmrActivityObj")
	    .withId("EmrActivityObj")
	    .withFields(
			new Field().withKey("step").withStringValue("command-runner.jar,spark-submit,--executor-memory,1g,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10"),
			new Field().withKey("runsOn").withRefValue("EmrClusterObj"),
			new Field().withKey("type").withStringValue("EmrActivity")
			);
      
	PipelineObject schedule = new PipelineObject()
	    .withName("Every 15 Minutes")
	    .withId("DefaultSchedule")
	    .withFields(
			new Field().withKey("type").withStringValue("Schedule"),
			new Field().withKey("period").withStringValue("15 Minutes"),
			new Field().withKey("startAt").withStringValue("FIRST_ACTIVATION_DATE_TIME")
			);
      
	PipelineObject defaultObject = new PipelineObject()
	    .withName("Default")
	    .withId("Default")
	    .withFields(
			new Field().withKey("failureAndRerunMode").withStringValue("CASCADE"),
			new Field().withKey("schedule").withRefValue("DefaultSchedule"),
			new Field().withKey("resourceRole").withStringValue("DataPipelineDefaultResourceRole"),
			new Field().withKey("role").withStringValue("DataPipelineDefaultRole"),
			new Field().withKey("pipelineLogUri").withStringValue("s3://myLogUri"),
			new Field().withKey("scheduleType").withStringValue("cron")
			);     
      
	List<PipelineObject> pipelineObjects = new ArrayList<PipelineObject>();
    
	pipelineObjects.add(emrActivity);
	pipelineObjects.add(emrCluster);
	pipelineObjects.add(defaultObject);
	pipelineObjects.add(schedule);
    
	PutPipelineDefinitionRequest putPipelineDefintion = new PutPipelineDefinitionRequest()
	    .withPipelineId(pipelineId)
	    .withPipelineObjects(pipelineObjects);
    
	PutPipelineDefinitionResult putPipelineResult = dp.putPipelineDefinition(putPipelineDefintion);
	System.out.println(putPipelineResult);
    
	ActivatePipelineRequest activatePipelineReq = new ActivatePipelineRequest()
	    .withPipelineId(pipelineId);
	ActivatePipelineResult activatePipelineRes = dp.activatePipeline(activatePipelineReq);
	
      System.out.println(activatePipelineRes);
      System.out.println(pipelineId);
    
    }

}
```

# Configure an Amazon EMR cluster in a private subnet
<a name="emrcluster-example-private-subnet"></a>

**Example**  <a name="example8"></a>
This example includes a configuration that launches the cluster into a private subnet in a VPC. For more information, see [Launch Amazon EMR Clusters into a VPC](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-vpc-launching-job-flows.html) in the *Amazon EMR Management Guide*. This configuration is optional. You can use it in any pipeline that uses an `EmrCluster` object.  
To launch an Amazon EMR cluster in a private subnet, specify `subnetId`, `emrManagedMasterSecurityGroupId`, `emrManagedSlaveSecurityGroupId`, and `serviceAccessSecurityGroupId` in your `EmrCluster` configuration.  

```
{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "false"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "taskInstanceCount": "1",
      "taskInstanceType": "m4.xlarge",
      "coreInstanceType": "m4.xlarge",
      "releaseLabel": "emr-4.7.0",
      "masterInstanceType": "m4.xlarge",
      "id": "EmrClusterForBackup",
      "subnetId": "#{mySubnetId}",
      "emrManagedMasterSecurityGroupId": "#{myMasterSecurityGroup}",
      "emrManagedSlaveSecurityGroupId": "#{mySlaveSecurityGroup}",
      "serviceAccessSecurityGroupId": "#{myServiceAccessSecurityGroup}",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "keyPair": "user-key-pair"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myPipelineLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
     "myDDBRegion": "us-east-1",
      "myDDBTableName": "ddb_table",
      "myDDBReadThroughputRatio": "0.25",
      "myOutputS3Loc": "s3://s3_path",
      "mySubnetId": "subnet_id",
      "myServiceAccessSecurityGroup":  "service access security group",
      "mySlaveSecurityGroup": "slave security group",
      "myMasterSecurityGroup": "master security group",
      "myPipelineLogUri": "s3://s3_path"
  }
}
```

# Attach EBS volumes to cluster nodes
<a name="emrcluster-example-ebs"></a>

**Example**  <a name="example8"></a>
You can attach EBS volumes to any type of node in the EMR cluster within your pipeline. To attach EBS volumes to nodes, use `coreEbsConfiguration`, `masterEbsConfiguration`, and `taskEbsConfiguration` in your `EmrCluster` configuration.  
This example Amazon EMR cluster uses Amazon EBS volumes for its master, task, and core nodes. For more information, see [Amazon EBS volumes in Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) in the *Amazon EMR Management Guide*.  
These configurations are optional. You can use them in any pipeline that uses an `EmrCluster` object.  
In the pipeline, click the `EmrCluster` object configuration, choose **Master EBS Configuration**, **Core EBS Configuration**, or **Task EBS Configuration**, and enter the configuration details similar to the following example.  

```
{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "false"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "taskInstanceCount": "1",
      "taskInstanceType": "m4.xlarge",
      "coreInstanceType": "m4.xlarge",
      "releaseLabel": "emr-4.7.0",
      "masterInstanceType": "m4.xlarge",
      "id": "EmrClusterForBackup",
      "subnetId": "#{mySubnetId}",
      "emrManagedMasterSecurityGroupId": "#{myMasterSecurityGroup}",
      "emrManagedSlaveSecurityGroupId": "#{mySlaveSecurityGroup}",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "coreEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "masterEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "taskEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "keyPair": "user-key-pair"
    },
    {
       "name": "EBSConfiguration",
        "id": "EBSConfiguration",
        "ebsOptimized": "true",
        "ebsBlockDeviceConfig" : [
            { "ref": "EbsBlockDeviceConfig" }
        ],
        "type": "EbsConfiguration"
    },
    {
        "name": "EbsBlockDeviceConfig",
        "id": "EbsBlockDeviceConfig",
        "type": "EbsBlockDeviceConfig",
        "volumesPerInstance" : "2",
        "volumeSpecification" : {
            "ref": "VolumeSpecification"
        }
    },
    {
      "name": "VolumeSpecification",
      "id": "VolumeSpecification",
      "type": "VolumeSpecification",
      "sizeInGB": "500",
      "volumeType": "io1",
      "iops": "1000"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myPipelineLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
     "myDDBRegion": "us-east-1",
      "myDDBTableName": "ddb_table",
      "myDDBReadThroughputRatio": "0.25",
      "myOutputS3Loc": "s3://s3_path",
      "mySubnetId": "subnet_id",
      "mySlaveSecurityGroup": "slave security group",
      "myMasterSecurityGroup": "master security group",
      "myPipelineLogUri": "s3://s3_path"
  }
}
```

## See Also
<a name="emrcluster-seealso"></a>
+ [EmrActivity](dp-object-emractivity.md)

# HttpProxy
<a name="dp-object-httpproxy"></a>

HttpProxy allows you to configure your own proxy and make Task Runner access the AWS Data Pipeline service through it. You do not need to configure a running Task Runner with this information.

## Example of an HttpProxy in TaskRunner
<a name="example9"></a>

The following pipeline definition shows an `HttpProxy` object:

```
{
  "objects": [
    {
      "schedule": {
        "ref": "Once"
      },
      "pipelineLogUri": "s3://myDPLogUri/path",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "test_proxy",
      "hostname": "hostname",
      "port": "port",
      "username": "username",
      "*password": "password",
      "windowsDomain": "windowsDomain",
      "type": "HttpProxy",
      "id": "test_proxy",
    },
    {
      "name": "ShellCommand",
      "id": "ShellCommand",
      "runsOn": {
        "ref": "Resource"
      },
      "type": "ShellCommandActivity",
      "command": "echo 'hello world' "
    },
    {
      "period": "1 day",
      "startDateTime": "2013-03-09T00:00:00",
      "name": "Once",
      "id": "Once",
      "endDateTime": "2013-03-10T00:00:00",
      "type": "Schedule"
    },
    {
      "role": "dataPipelineRole",
      "httpProxy": {
        "ref": "test_proxy"
      },
      "actionOnResourceFailure": "retrynone",
      "maximumRetries": "0",
      "type": "Ec2Resource",
      "terminateAfter": "10 minutes",
      "resourceRole": "resourceRole",
      "name": "Resource",
      "actionOnTaskFailure": "terminate",
      "securityGroups": "securityGroups",
      "keyPair": "keyPair",
      "id": "Resource",
      "region": "us-east-1"
    }
  ],
  "parameters": []
}
```

## Syntax
<a name="httpproxy-slots"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| hostname | The host of the proxy that clients use to connect to AWS services. | String | 
| port | The port of the proxy host that clients use to connect to AWS services. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| \*password | Password for proxy. | String | 
| s3NoProxy | Disable the HTTP proxy when connecting to Amazon S3. | Boolean | 
| username | User name for proxy. | String | 
| windowsDomain | The Windows domain name for NTLM Proxy. | String | 
| windowsWorkgroup | The Windows workgroup name for NTLM Proxy. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# Preconditions
<a name="dp-object-preconditions"></a>

The following are the AWS Data Pipeline precondition objects:

**Topics**
+ [DynamoDBDataExists](dp-dynamodbdataexists.md)
+ [DynamoDBTableExists](dp-dynamodbtableexists.md)
+ [Exists](dp-object-exists.md)
+ [S3KeyExists](dp-object-S3KeyExists.md)
+ [S3PrefixNotEmpty](dp-object-s3prefixnotempty.md)
+ [ShellCommandPrecondition](dp-object-shellcommandprecondition.md)

# DynamoDBDataExists
<a name="dp-dynamodbdataexists"></a>

 A precondition to check that data exists in a DynamoDB table. 
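The following is a minimal sketch of this object type; the object ID, role, and table name are illustrative placeholders:

```
{
  "id" : "MyDynamoDBDataExists",
  "type" : "DynamoDBDataExists",
  "role" : "DataPipelineDefaultRole",
  "tableName" : "MyTable"
}
```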

## Syntax
<a name="dp-dynamodbdataexists-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| role | Specifies the role to be used to execute the precondition. | String | 
| tableName | DynamoDB Table to check. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which the precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| currentRetryCount | Number of times the precondition was tried in this attempt. | String | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of client that picked up the task attempt. | String | 
| lastRetryTime | Last time when the precondition was tried within this attempt. | String | 
| node | The node for which this precondition is being performed | Reference Object, e.g. "node":\$1"ref":"myRunnableObjectId"\$1 | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":\$1"ref":"myRunnableObjectId"\$1 | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# DynamoDBTableExists
<a name="dp-dynamodbtableexists"></a>

 A precondition to check that the DynamoDB table exists. 
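## Example
<a name="dynamodbtableexists-example"></a>

A minimal definition might look like the following; the role name and table name are illustrative.

```
{
  "id" : "TableReady",
  "type" : "DynamoDBTableExists",
  "role" : "test-role",
  "tableName" : "my-dynamodb-table"
}
```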

## Syntax
<a name="dp-dynamodbtableexists-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| role | Specifies the role to be used to execute the precondition. | String | 
| tableName | The DynamoDB table to check. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which the precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| currentRetryCount | Number of times the precondition was tried in this attempt. | String | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| lastRetryTime | Last time when the precondition was tried within this attempt. | String | 
| node | The node for which this precondition is being performed. | Reference Object, e.g. "node":{"ref":"myRunnableObjectId"} | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# Exists
<a name="dp-object-exists"></a>

 Checks whether a data node object exists. 

**Note**  
We recommend that you use system-managed preconditions instead. For more information, see [Preconditions](dp-concepts-preconditions.md).

## Example
<a name="exists-example"></a>

The following is an example of this object type. The `InputData` object references this object, `Ready`, plus another object that you'd define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object.

```
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "filePath" : "s3://amzn-s3-demo-bucket/InputData/#{@scheduledStartTime.format('YYYY-MM-dd-hh:mm')}.csv",
  "precondition" : { "ref" : "Ready" }
},
{
  "id" : "Ready",
  "type" : "Exists"
}
```

## Syntax
<a name="exists-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which the precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| node | The node for which this precondition is being performed. | Reference Object, e.g. "node":{"ref":"myRunnableObjectId"} | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="exists-seealso"></a>
+ [ShellCommandPrecondition](dp-object-shellcommandprecondition.md)

# S3KeyExists
<a name="dp-object-S3KeyExists"></a>

 Checks whether a key exists in an Amazon S3 data node.

## Example
<a name="dp-object-S3KeyExists-example"></a>

The following is an example of this object type. The precondition succeeds when the key referenced by the `s3Key` field, `s3://amzn-s3-demo-bucket/mykey`, exists. 

```
{
  "id" : "InputReady",
  "type" : "S3KeyExists",
  "role" : "test-role",
  "s3Key" : "s3://amzn-s3-demo-bucket/mykey"
}
```

You can also use `S3KeyExists` as a precondition on a second pipeline that waits for a first pipeline to finish. To do so:

1. Write a file to Amazon S3 when the first pipeline completes.

1. Create an `S3KeyExists` precondition on the second pipeline that checks for that file.
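For example, the first pipeline's final activity could write a marker object that the second pipeline's precondition watches for; the bucket and key below are illustrative.

```
{
  "id" : "WaitForUpstream",
  "type" : "S3KeyExists",
  "role" : "test-role",
  "s3Key" : "s3://amzn-s3-demo-bucket/pipeline1/_DONE"
}
```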

## Syntax
<a name="S3KeyExists-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| role | Specifies the role to be used to execute the precondition. | String | 
| s3Key | The Amazon S3 key. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout before attempting to complete remote work one more time. If set, then a remote activity that does not complete within the set time after starting is attempted again. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempts that are initiated on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and are retried. | Period | 
| retryDelay | The timeout duration between two successive attempts. | Period | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| currentRetryCount | Number of times the precondition was tried in this attempt. | String | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| lastRetryTime | Last time when the precondition was tried within this attempt. | String | 
| node | The node for which this precondition is being performed. | Reference Object, e.g. "node":{"ref":"myRunnableObjectId"} | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="S3KeyExists-seealso"></a>
+ [ShellCommandPrecondition](dp-object-shellcommandprecondition.md)

# S3PrefixNotEmpty
<a name="dp-object-s3prefixnotempty"></a>

A precondition to check that the Amazon S3 objects with the given prefix (represented as a URI) are present.

## Example
<a name="s3prefixnotempty-example"></a>

The following is an example of this object type using required fields and an expression.

```
{
  "id" : "InputReady",
  "type" : "S3PrefixNotEmpty",
  "role" : "test-role",
  "s3Prefix" : "#{node.filePath}"
}
```

## Syntax
<a name="s3prefixnotempty-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| role | Specifies the role to be used to execute the precondition. | String | 
| s3Prefix | The Amazon S3 prefix to check for existence of objects. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which the precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| currentRetryCount | Number of times the precondition was tried in this attempt. | String | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| lastRetryTime | Last time when the precondition was tried within this attempt. | String | 
| node | The node for which this precondition is being performed. | Reference Object, e.g. "node":{"ref":"myRunnableObjectId"} | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="s3prefixnotempty-seealso"></a>
+ [ShellCommandPrecondition](dp-object-shellcommandprecondition.md)

# ShellCommandPrecondition
<a name="dp-object-shellcommandprecondition"></a>

 A Unix/Linux shell command that can be run as a precondition. 

## Example
<a name="shellcommandprecondition-example"></a>

The following is an example of this object type.

```
{
  "id" : "VerifyDataReadiness",
  "type" : "ShellCommandPrecondition",
  "command" : "perl check-data-ready.pl"
}
```
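If the script lives in Amazon S3, a variant using `scriptUri` and `scriptArgument` might look like the following; the bucket path and argument value are illustrative.

```
{
  "id" : "VerifyDataReadiness",
  "type" : "ShellCommandPrecondition",
  "scriptUri" : "s3://amzn-s3-demo-bucket/scripts/check-data-ready.sh",
  "scriptArgument" : "2015-01-01"
}
```

Because `scriptUri` cannot use parameters, values such as dates are passed through `scriptArgument` instead.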

## Syntax
<a name="shellcommandprecondition-syntax"></a>


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| command | The command to run. This value and any associated parameters must function in the environment from which you are running the Task Runner. | String | 
| scriptUri | An Amazon S3 URI path for a file to download and run as a shell command. Specify either scriptUri or command, not both. scriptUri cannot use parameters; use command instead. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maximumRetries | Maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| preconditionTimeout | The period from start after which the precondition is marked as failed if still not satisfied. | Period | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scriptArgument | An argument to pass to the shell script. | String | 
| stderr | The Amazon S3 path that receives redirected system error messages from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 
| stdout | The Amazon S3 path that receives redirected output from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| node | The node for which this precondition is being performed. | Reference Object, e.g. "node":{"ref":"myRunnableObjectId"} | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="shellcommandprecondition-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [Exists](dp-object-exists.md)

# Databases
<a name="dp-object-databases"></a>

The following are the AWS Data Pipeline database objects:

**Topics**
+ [JdbcDatabase](dp-object-jdbcdatabase.md)
+ [RdsDatabase](dp-object-rdsdatabase.md)
+ [RedshiftDatabase](dp-object-redshiftdatabase.md)

# JdbcDatabase
<a name="dp-object-jdbcdatabase"></a>

Defines a JDBC database.

## Example
<a name="jdbcdatabase-example"></a>

The following is an example of this object type.

```
{
  "id" : "MyJdbcDatabase",
  "type" : "JdbcDatabase",
  "connectionString" : "jdbc:redshift://hostname:portnumber/dbname",
  "jdbcDriverClass" : "com.amazon.redshift.jdbc41.Driver",
  "jdbcDriverJarUri" : "s3://redshift-downloads/drivers/RedshiftJDBC41-1.1.6.1006.jar",
  "username" : "user_name",
  "*password" : "my_password"
}
```

## Syntax
<a name="jdbcdatabase-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| connectionString | The JDBC connection string to access the database. | String | 
| jdbcDriverClass | The driver class to load before establishing the JDBC connection. | String | 
| \*password | The password to supply. | String | 
| username | The user name to supply when connecting to the database. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| databaseName | Name of the logical database to attach to. | String | 
| jdbcDriverJarUri | The location in Amazon S3 of the JDBC driver JAR file used to connect to the database. AWS Data Pipeline must have permission to read this JAR file. | String | 
| jdbcProperties | Pairs of the form A=B that will be set as properties on JDBC connections for this database. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# RdsDatabase
<a name="dp-object-rdsdatabase"></a>

Defines an Amazon RDS database.

**Note**  
RdsDatabase does not support Aurora. Use [JdbcDatabase](dp-object-jdbcdatabase.md) for Aurora instead.
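For example, an Aurora MySQL-compatible cluster could be defined as a `JdbcDatabase` along these lines; the endpoint, driver version, and JAR location are illustrative.

```
{
  "id" : "MyAuroraDatabase",
  "type" : "JdbcDatabase",
  "connectionString" : "jdbc:mysql://my-aurora-cluster.cluster-example.us-east-1.rds.amazonaws.com:3306/dbname",
  "jdbcDriverClass" : "com.mysql.jdbc.Driver",
  "jdbcDriverJarUri" : "s3://amzn-s3-demo-bucket/drivers/mysql-connector-java-5.1.36.jar",
  "username" : "user_name",
  "*password" : "my_password"
}
```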

## Example
<a name="rdsdatabase-example"></a>

The following is an example of this object type.

```
{
  "id" : "MyRdsDatabase",
  "type" : "RdsDatabase",
  "region" : "us-east-1",
  "username" : "user_name",
  "*password" : "my_password",
  "rdsInstanceId" : "my_db_instance_identifier"
}
```

For the Oracle engine, the `jdbcDriverJarUri` field is required; you can download the driver from `http://www.oracle.com/technetwork/database/features/jdbc/jdbc-drivers-12c-download-1958347.html`. For the SQL Server engine, the `jdbcDriverJarUri` field is also required; you can download the driver from `https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774`. For the MySQL and PostgreSQL engines, the `jdbcDriverJarUri` field is optional.

## Syntax
<a name="rdsdatabase-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| \*password | The password to supply. | String | 
| rdsInstanceId | The DBInstanceIdentifier property of the DB instance. | String | 
| username | The user name to supply when connecting to the database. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| databaseName | Name of the logical database to attach to. | String | 
| jdbcDriverJarUri | The location in Amazon S3 of the JDBC driver JAR file used to connect to the database. AWS Data Pipeline must have permission to read this JAR file. For the MySQL and PostgreSQL engines, the default driver is used if this field is not specified, but you can override the default using this field. For the Oracle and SQL Server engines, this field is required. | String | 
| jdbcProperties | Pairs of the form A=B that will be set as properties on JDBC connections for this database. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| region | The code for the region where the database exists. For example, us-east-1. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# RedshiftDatabase
<a name="dp-object-redshiftdatabase"></a>

Defines an Amazon Redshift database. `RedshiftDatabase` represents the properties of the database used by your pipeline.

## Example
<a name="redshiftdatabase-example"></a>

The following is an example of this object type.

```
{
  "id" : "MyRedshiftDatabase",
  "type" : "RedshiftDatabase",
  "clusterId" : "myRedshiftClusterId",
  "username" : "user_name",
  "*password" : "my_password",
  "databaseName" : "database_name"
}
```

By default, the object uses the Postgres driver, which requires the `clusterId` field. To use the Amazon Redshift driver, specify the Amazon Redshift database connection string from the Amazon Redshift console (starts with "jdbc:redshift:") in the `connectionString` field instead.
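For example, a definition that uses `connectionString` instead of `clusterId` might look like the following; the endpoint and port are illustrative.

```
{
  "id" : "MyRedshiftDatabase",
  "type" : "RedshiftDatabase",
  "connectionString" : "jdbc:redshift://hostname:5439/database_name",
  "username" : "user_name",
  "*password" : "my_password",
  "databaseName" : "database_name"
}
```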

## Syntax
<a name="redshiftdatabase-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| \*password | The password to supply. | String | 
| username | The user name to supply when connecting to the database. | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| clusterId | The identifier provided by the user when the Amazon Redshift cluster was created. For example, if the endpoint for your Amazon Redshift cluster is mydb.example.us-east-1.redshift.amazonaws.com, the correct identifier is mydb. In the Amazon Redshift console, you can get this value from Cluster Identifier or Cluster Name. | String | 
| connectionString | The JDBC endpoint for connecting to an Amazon Redshift instance owned by a different account than the one that owns the pipeline. You can't specify both connectionString and clusterId. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| databaseName | Name of the logical database to attach to. | String | 
| jdbcProperties | Pairs of the form A=B to be set as properties on JDBC connections for this database. | String | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| region | The code for the region where the database exists. For example, us-east-1. | Enumeration | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# Data Formats
<a name="dp-object-dataformats"></a>

The following are the AWS Data Pipeline data format objects:

**Topics**
+ [CSV Data Format](dp-object-csv.md)
+ [Custom Data Format](dp-object-custom.md)
+ [DynamoDBDataFormat](dp-object-dynamodbdataformat.md)
+ [DynamoDBExportDataFormat](dp-object-dynamodbexportdataformat.md)
+ [RegEx Data Format](dp-object-regex.md)
+ [TSV Data Format](dp-object-tsv.md)

# CSV Data Format
<a name="dp-object-csv"></a>

A comma-delimited data format where the column separator is a comma and the record separator is a newline character.

## Example
<a name="csv-example"></a>

The following is an example of this object type. 

```
{
  "id" : "MyOutputDataType",
  "type" : "CSV",
  "column" : [
    "Name STRING",
    "Score INT",
    "DateOfBirth TIMESTAMP"
  ]
}
```

## Syntax
<a name="csv-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | Column name with the data type specified by each field for the data described by this data node. For example, hostname STRING. For multiple values, use column names and data types separated by a space. | String | 
| escapeChar | A character, for example "\\" (backslash), that instructs the parser to ignore the next character. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 
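
If your field values can contain embedded commas, you can escape them with `escapeChar`. The following sketch assumes a backslash escape character; use whatever character your data actually contains.

```
{
  "id" : "MyEscapedCsvDataType",
  "type" : "CSV",
  "escapeChar" : "\\",
  "column" : [
    "Name STRING",
    "Address STRING"
  ]
}
```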

# Custom Data Format
<a name="dp-object-custom"></a>

A custom data format defined by a combination of a certain column separator, record separator, and escape character.

## Example
<a name="custom-example"></a>

The following is an example of this object type. 

```
{
  "id" : "MyOutputDataType",
  "type" : "Custom",
  "columnSeparator" : ",",
  "recordSeparator" : "\n",
  "column" : [
    "Name STRING",
    "Score INT",
    "DateOfBirth TIMESTAMP"
  ]
}
```

## Syntax
<a name="custom-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| columnSeparator | A character that indicates the end of a column in a data file. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | Column name with the data type specified by each field for the data described by this data node. For example, hostname STRING. For multiple values, use column names and data types separated by a space. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| recordSeparator | A character that indicates the end of a row in a data file, for example "\n". Only single characters are supported. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# DynamoDBDataFormat
<a name="dp-object-dynamodbdataformat"></a>

Applies a schema to a DynamoDB table to make it accessible by a Hive query. `DynamoDBDataFormat` is used with a `HiveActivity` object and a `DynamoDBDataNode` input and output. `DynamoDBDataFormat` requires that you specify all columns in your Hive query. For more flexibility to specify certain columns in a Hive query or Amazon S3 support, see [DynamoDBExportDataFormat](dp-object-dynamodbexportdataformat.md).

**Note**  
DynamoDB Boolean types are not mapped to Hive Boolean types. However, it is possible to map DynamoDB integer values of 0 or 1 to Hive Boolean types.

## Example
<a name="dynamodbdataformat-example"></a>

The following example shows how to use `DynamoDBDataFormat` to assign a schema to a `DynamoDBDataNode` input, which allows a `HiveActivity` object to access the data by named columns and copy the data to a `DynamoDBDataNode` output. 

```
{
  "objects": [
    {
      "id" : "Exists.1",
      "name" : "Exists.1",
      "type" : "Exists"
    },
    {
      "id" : "DataFormat.1",
      "name" : "DataFormat.1",
      "type" : "DynamoDBDataFormat",
      "column" : [
        "hash STRING",
        "range STRING"
      ]
    },
    {
      "id" : "DynamoDBDataNode.1",
      "name" : "DynamoDBDataNode.1",
      "type" : "DynamoDBDataNode",
      "tableName" : "$INPUT_TABLE_NAME",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.1" }
    },
    {
      "id" : "DynamoDBDataNode.2",
      "name" : "DynamoDBDataNode.2",
      "type" : "DynamoDBDataNode",
      "tableName" : "$OUTPUT_TABLE_NAME",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.1" }
    },
    {
      "id" : "EmrCluster.1",
      "name" : "EmrCluster.1",
      "type" : "EmrCluster",
      "schedule" : { "ref" : "ResourcePeriod" },
      "masterInstanceType" : "m1.small",
      "keyPair" : "$KEYPAIR"
    },
    {
      "id" : "HiveActivity.1",
      "name" : "HiveActivity.1",
      "type" : "HiveActivity",
      "input" : { "ref" : "DynamoDBDataNode.1" },
      "output" : { "ref" : "DynamoDBDataNode.2" },
      "schedule" : { "ref" : "ResourcePeriod" },
      "runsOn" : { "ref" : "EmrCluster.1" },
      "hiveScript" : "insert overwrite table ${output1} select * from ${input1} ;"
    },
    {
      "id" : "ResourcePeriod",
      "name" : "ResourcePeriod",
      "type" : "Schedule",
      "period" : "1 day",
      "startDateTime" : "2012-05-04T00:00:00",
      "endDateTime" : "2012-05-05T00:00:00"
    }
  ]
}
```

## Syntax
<a name="dynamodbdataformat-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | The column name with data type specified by each field for the data described by this data node. For example, hostname STRING. For multiple values, use column names and data types separated by a space. | String | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, such as "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | The pipeline version used to create the object. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | The error describing the ill-formed object. | String | 
| @pipelineId | The ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# DynamoDBExportDataFormat
<a name="dp-object-dynamodbexportdataformat"></a>

Applies a schema to a DynamoDB table to make it accessible by a Hive query. Use `DynamoDBExportDataFormat` with a `HiveCopyActivity` object and `DynamoDBDataNode` or `S3DataNode` input and output. `DynamoDBExportDataFormat` has the following benefits: 
+ Provides both DynamoDB and Amazon S3 support
+ Allows you to filter data by certain columns in your Hive query
+ Exports all attributes from DynamoDB even if you have a sparse schema

**Note**  
DynamoDB Boolean types are not mapped to Hive Boolean types. However, it is possible to map DynamoDB integer values of 0 or 1 to Hive Boolean types.

## Example
<a name="dynamodbexportdataformat-example"></a>

The following example shows how to use `HiveCopyActivity` and `DynamoDBExportDataFormat` to copy data from one `DynamoDBDataNode` to another, while filtering based on a time stamp.

```
{
  "objects": [
    {
      "id" : "DataFormat.1",
      "name" : "DataFormat.1",
      "type" : "DynamoDBExportDataFormat",
      "column" : "timeStamp BIGINT"
    },
    {
      "id" : "DataFormat.2",
      "name" : "DataFormat.2",
      "type" : "DynamoDBExportDataFormat"
    },
    {
      "id" : "DynamoDBDataNode.1",
      "name" : "DynamoDBDataNode.1",
      "type" : "DynamoDBDataNode",
      "tableName" : "item_mapped_table_restore_temp",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.1" }
    },
    {
      "id" : "DynamoDBDataNode.2",
      "name" : "DynamoDBDataNode.2",
      "type" : "DynamoDBDataNode",
      "tableName" : "restore_table",
      "region" : "us-west-1",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.2" }
    },
    {
      "id" : "EmrCluster.1",
      "name" : "EmrCluster.1",
      "type" : "EmrCluster",
      "schedule" : { "ref" : "ResourcePeriod" },
      "masterInstanceType" : "m1.xlarge",
      "coreInstanceCount" : "4"
    },
    {
      "id" : "HiveTransform.1",
      "name" : "Hive Copy Transform.1",
      "type" : "HiveCopyActivity",
      "input" : { "ref" : "DynamoDBDataNode.1" },
      "output" : { "ref" : "DynamoDBDataNode.2" },
      "schedule" : { "ref" : "ResourcePeriod" },
      "runsOn" : { "ref" : "EmrCluster.1" },
      "filterSql" : "`timeStamp` > unix_timestamp(\"#{@scheduledStartTime}\", \"yyyy-MM-dd'T'HH:mm:ss\")"
    },
    {
      "id" : "ResourcePeriod",
      "name" : "ResourcePeriod",
      "type" : "Schedule",
      "period" : "1 Hour",
      "startDateTime" : "2013-06-04T00:00:00",
      "endDateTime" : "2013-06-04T01:00:00"
    }
  ]
}
```

## Syntax
<a name="dynamodbexportdataformat-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | Column name with the data type specified by each field for the data described by this data node. For example, hostname STRING. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# RegEx Data Format
<a name="dp-object-regex"></a>

A custom data format defined by a regular expression.

## Example
<a name="regex-example"></a>

The following is an example of this object type. 

```
{
  "id" : "MyInputDataType",
  "type" : "RegEx",
  "inputRegEx" : "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "outputFormat" : "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s",
  "column" : [
    "host STRING",
    "identity STRING",
    "user STRING",
    "time STRING",
    "request STRING",
    "status STRING",
    "size STRING",
    "referer STRING",
    "agent STRING"
  ]
}
```

## Syntax
<a name="regex-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | Column name with the data type specified by each field for the data described by this data node. For example, hostname STRING. For multiple values, use column names and data types separated by a space. | String | 
| inputRegEx | The regular expression to parse an S3 input file. inputRegEx provides a way to retrieve columns from relatively unstructured data in a file. | String | 
| outputFormat | The column fields retrieved by inputRegEx, but referenced as %1$s %2$s, and so on, using Java formatter syntax. | String | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# TSV Data Format
<a name="dp-object-tsv"></a>

A tab-delimited data format where the column separator is a tab character and the record separator is a newline character.

## Example
<a name="tsv-example"></a>

The following is an example of this object type. 

```
{
  "id" : "MyOutputDataType",
  "type" : "TSV",
  "column" : [
    "Name STRING",
    "Score INT",
    "DateOfBirth TIMESTAMP"
  ]
}
```
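
Because the separators are configurable, you can reuse this format for other single-character delimiters. The following sketch assumes a pipe-delimited file; the "|" value is illustrative.

```
{
  "id" : "MyPipeDelimitedDataType",
  "type" : "TSV",
  "columnSeparator" : "|",
  "column" : [
    "Name STRING",
    "Score INT",
    "DateOfBirth TIMESTAMP"
  ]
}
```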

## Syntax
<a name="tsv-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| column | Column name and data type for the data described by this data node. For example "Name STRING" denotes a column named Name with fields of data type STRING. Separate multiple column name and data type pairs with commas (as shown in the example). | String | 
| columnSeparator | The character that separates fields in one column from fields in the next column. Defaults to '\t'. | String | 
| escapeChar | A character, for example "\\" (backslash), that instructs the parser to ignore the next character. | String | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| recordSeparator | The character that separates records. Defaults to '\n'. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# Actions
<a name="dp-object-actions"></a>

The following are the AWS Data Pipeline action objects:

**Topics**
+ [SnsAlarm](dp-object-snsalarm.md)
+ [Terminate](dp-object-terminate.md)

# SnsAlarm
<a name="dp-object-snsalarm"></a>

 Sends an Amazon SNS notification message when an activity fails or finishes successfully. 

## Example
<a name="snsalarm-example"></a>

The following is an example of this object type. The values for `node.input` and `node.output` come from the data node or activity that references this object in its `onSuccess` field. 

```
{
  "id" : "SuccessNotify",
  "name" : "SuccessNotify",
  "type" : "SnsAlarm",
  "topicArn" : "arn:aws:sns:us-east-1:28619EXAMPLE:ExampleTopic",
  "subject" : "COPY SUCCESS: #{node.@scheduledStartTime}",
  "message" : "Files were copied from #{node.input} to #{node.output}."
}
```

## Syntax
<a name="snsalarm-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| message | The body text of the Amazon SNS notification. | String | 
| role | The IAM role to use to create the Amazon SNS alarm. | String | 
| subject | The subject line of the Amazon SNS notification message. | String | 
| topicArn | The destination Amazon SNS topic ARN for the message. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| node | The node for which this action is being performed. | Reference Object, for example, "node":{"ref":"myRunnableObjectId"} | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# Terminate
<a name="dp-object-terminate"></a>

An action to trigger the cancellation of a pending or unfinished activity, resource, or data node. AWS Data Pipeline attempts to put the activity, resource, or data node into the CANCELLED state if it does not start by the `lateAfterTimeout` value. 

You cannot terminate actions that include `onSuccess`, `onFail`, or `onLateAction` resources.

## Example
<a name="terminate-example"></a>

The following is an example of this object type. In this example, the `onLateAction` field of `MyActivity` contains a reference to the action `DefaultAction1`. When you provide an action for `onLateAction`, you must also provide a `lateAfterTimeout` value to indicate the period of time since the scheduled start of the pipeline after which the activity is considered late.

```
{
  "name" : "MyActivity",
  "id" : "DefaultActivity1",
  "schedule" : {
    "ref" : "MySchedule"
  },
  "runsOn" : {
    "ref" : "MyEmrCluster"
  },
  "lateAfterTimeout" : "1 Hours",
  "type" : "EmrActivity",
  "onLateAction" : {
    "ref" : "DefaultAction1"
  },
  "step" : [
    "s3://amzn-s3-demo-bucket/myPath/myStep.jar,firstArg,secondArg",
    "s3://amzn-s3-demo-bucket/myPath/myOtherStep.jar,anotherArg"
  ]
},
{
  "name" : "TerminateTasks",
  "id" : "DefaultAction1",
  "type" : "Terminate"
}
```

## Syntax
<a name="terminate-syntax"></a>


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| node | The node for which this action is being performed. | Reference Object, for example, "node":{"ref":"myRunnableObjectId"} | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# Schedule
<a name="dp-object-schedule"></a>

Defines the timing of a scheduled event, such as when an activity runs.

**Note**  
When a schedule's start time is in the past, AWS Data Pipeline backfills your pipeline and begins scheduling runs immediately, beginning at the specified start time. For testing and development, use a relatively short interval; otherwise, AWS Data Pipeline attempts to queue and schedule all runs of your pipeline for that interval. To prevent accidental backfills, AWS Data Pipeline blocks pipeline activation when a pipeline component's `scheduledStartTime` is earlier than one day ago.

## Examples
<a name="schedule-example"></a>

The following is an example of this object type. It defines a schedule of every hour starting at 00:00:00 hours on 2012-09-01 and ending at 00:00:00 hours on 2012-10-01. The first period ends at 01:00:00 on 2012-09-01.

```
{
  "id" : "Hourly",
  "type" : "Schedule",
  "period" : "1 hours",
  "startDateTime" : "2012-09-01T00:00:00",
  "endDateTime" : "2012-10-01T00:00:00"
}
```

The following pipeline will start at the `FIRST_ACTIVATION_DATE_TIME` and run every hour until 22:00:00 hours on 2014-04-25.

```
{
     "id": "SchedulePeriod",
     "name": "SchedulePeriod",
     "startAt": "FIRST_ACTIVATION_DATE_TIME",
     "period": "1 hours",
     "type": "Schedule",
     "endDateTime": "2014-04-25T22:00:00"
   }
```

The following pipeline will start at the `FIRST_ACTIVATION_DATE_TIME` and run every hour and complete after three occurrences.

```
{
     "id": "SchedulePeriod",
     "name": "SchedulePeriod",
     "startAt": "FIRST_ACTIVATION_DATE_TIME",
     "period": "1 hours",
     "type": "Schedule",
     "occurrences": "3"
   }
```

The following pipeline will start at 22:00:00 on 2014-04-25, run hourly, and end after three occurrences.

```
{
     "id": "SchedulePeriod",
     "name": "SchedulePeriod",
     "startDateTime": "2014-04-25T22:00:00",
     "period": "1 hours",
     "type": "Schedule",
     "occurrences": "3"
   }
```

On-demand using the Default object

```
{
  "name": "Default",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",
  "scheduleType": "ondemand"
}
```

On-demand with explicit Schedule object

```
{
  "name": "Default",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",
  "scheduleType": "ondemand"
},
{
  "name": "DefaultSchedule",
  "type": "Schedule",
  "id": "DefaultSchedule",
  "period": "ONDEMAND_PERIOD",
  "startAt": "ONDEMAND_ACTIVATION_TIME"
},
```

The following examples demonstrate how a Schedule can be inherited from the default object, be explicitly set for that object, or be given by a parent reference:

Schedule inherited from Default object

```
{
  "objects": [
  {       
      "id": "Default",
      "failureAndRerunMode":"cascade",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://myLogsbucket",
      "scheduleType": "cron",
      "schedule": {
        "ref": "DefaultSchedule"
      }
   },
   {
      "type": "Schedule",
      "id": "DefaultSchedule",
      "occurrences": "1",
      "period": "1 Day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    { 
      "id": "A_Fresh_NewEC2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "ShellCommandActivity_HelloWorld",
      "runsOn": {
        "ref": "A_Fresh_NewEC2Instance"
      },
      "type": "ShellCommandActivity",
      "command": "echo 'Hello World!'"
    }
  ]
}
```

Explicit schedule on the object

```
{
  "objects": [
  {       
      "id": "Default",
      "failureAndRerunMode":"cascade",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://myLogsbucket",
      "scheduleType": "cron"
      
   },
   {
      "type": "Schedule",
      "id": "DefaultSchedule",
      "occurrences": "1",
      "period": "1 Day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    { 
      "id": "A_Fresh_NewEC2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "ShellCommandActivity_HelloWorld",
      "runsOn": {
        "ref": "A_Fresh_NewEC2Instance"
      },
      "schedule": {
        "ref": "DefaultSchedule"
      },
      "type": "ShellCommandActivity",
      "command": "echo 'Hello World!'"
    }
  ]
}
```

Schedule from Parent reference

```
{
  "objects": [
  {       
      "id": "Default",
      "failureAndRerunMode":"cascade",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://myLogsbucket",
      "scheduleType": "cron"
      
   },
   {       
      "id": "parent1",
      "schedule": {
        "ref": "DefaultSchedule"
      }
      
   },
   {
      "type": "Schedule",
      "id": "DefaultSchedule",
      "occurrences": "1",
      "period": "1 Day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    { 
      "id": "A_Fresh_NewEC2Instance",
      "type": "Ec2Resource",
      "terminateAfter": "1 Hour"
    },
    {
      "id": "ShellCommandActivity_HelloWorld",
      "runsOn": {
        "ref": "A_Fresh_NewEC2Instance"
      },
      "parent": {
        "ref": "parent1"
      },
      "type": "ShellCommandActivity",
      "command": "echo 'Hello World!'"
    }
  ]
}
```

## Syntax
<a name="schedule-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| period | How often the pipeline should run. The format is "N [minutes\|hours\|days\|weeks\|months]", where N is a number followed by one of the time specifiers. For example, "15 minutes" runs the pipeline every 15 minutes. The minimum period is 15 minutes and the maximum period is 3 years. | Period | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| startAt | The date and time at which to start the scheduled pipeline runs. Valid value is FIRST_ACTIVATION_DATE_TIME, which is deprecated in favor of creating an on-demand pipeline. | Enumeration | 
| startDateTime | The date and time to start the scheduled runs. You must use either startDateTime or startAt but not both. | DateTime | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| endDateTime | The date and time to end the scheduled runs. Must be a date and time later than the value of startDateTime or startAt. The default behavior is to schedule runs until the pipeline is shut down.  | DateTime | 
| occurrences | The number of times to execute the pipeline after it's activated. You can't use occurrences with endDateTime. | Integer | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @firstActivationTime | The time of object creation. | DateTime | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# Utilities
<a name="dp-object-utilities"></a>

The following utility objects configure other pipeline objects:

**Topics**
+ [ShellScriptConfig](dp-object-shellscriptconfig.md)
+ [EmrConfiguration](dp-object-emrconfiguration.md)
+ [Property](dp-object-property.md)

# ShellScriptConfig
<a name="dp-object-shellscriptconfig"></a>

Use with an Activity to run a shell script for preActivityTaskConfig and postActivityTaskConfig. This object is available for [HadoopActivity](dp-object-hadoopactivity.md), [HiveActivity](dp-object-hiveactivity.md), [HiveCopyActivity](dp-object-hivecopyactivity.md), and [PigActivity](dp-object-pigactivity.md). You specify an S3 URI and a list of arguments for the script.

## Example
<a name="shellscriptconfig-example"></a>

A ShellScriptConfig with arguments:

```
{
   "id" : "ShellScriptConfig_1",
   "name" : "prescript",
   "type" : "ShellScriptConfig",
   "scriptUri": "s3://my-bucket/shell-cleanup.sh",
   "scriptArgument" : ["arg1","arg2"]
 }
```

## Syntax
<a name="shellscriptconfig-syntax"></a>

This object includes the following fields.


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| scriptArgument | A list of arguments to use with the shell script. | String | 
| scriptUri | The script URI in Amazon S3 that should be downloaded and run. | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

# EmrConfiguration
<a name="dp-object-emrconfiguration"></a>

The EmrConfiguration object is the configuration used for EMR clusters with release 4.0.0 or later. Configurations (as a list) are a parameter to the RunJobFlow API call. The configuration API for Amazon EMR takes a classification and properties. AWS Data Pipeline uses EmrConfiguration with corresponding Property objects to configure an [EmrCluster](dp-object-emrcluster.md) application such as Hadoop, Hive, Spark, or Pig on EMR clusters launched in a pipeline execution. Because configuration can only be changed for new clusters, you cannot provide an EmrConfiguration object for existing resources. For more information, see the [Amazon EMR Release Guide](http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/).

## Example
<a name="emrconfiguration-example"></a>

The following configuration object sets the `io.file.buffer.size` and `fs.s3.block.size` properties in `core-site.xml`:

```
[
   {  
      "classification":"core-site",
      "properties":
      {
         "io.file.buffer.size": "4096",
         "fs.s3.block.size": "67108864"
      }
   }
]
```

The corresponding pipeline object definition uses a EmrConfiguration object and a list of Property objects in the `property` field:

```
{
  "objects": [
    {
      "name": "ReleaseLabelCluster",
      "releaseLabel": "emr-4.1.0",
      "applications": ["spark", "hive", "pig"],
      "id": "ResourceId_I1mCc",
      "type": "EmrCluster",
      "configuration": {
        "ref": "coresite"
      }
    },
    {
      "name": "coresite",
      "id": "coresite",
      "type": "EmrConfiguration",
      "classification": "core-site",
      "property": [{
        "ref": "io-file-buffer-size"
      },
      {
        "ref": "fs-s3-block-size"
      }
      ]
    },
    {
      "name": "io-file-buffer-size",
      "id": "io-file-buffer-size",
      "type": "Property",
      "key": "io.file.buffer.size",
      "value": "4096"
    },
    {
      "name": "fs-s3-block-size",
      "id": "fs-s3-block-size",
      "type": "Property",
      "key": "fs.s3.block.size",
      "value": "67108864"
    }
  ]
}
```
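To see how the pipeline objects above relate to the configuration list at the top of this section, the flattening can be sketched in Python. This is an illustrative sketch only; the function name and logic are ours, not part of the AWS Data Pipeline SDK, which performs this resolution internally.

```python
# Sketch: flatten EmrConfiguration/Property pipeline objects into the
# classification/properties shape that Amazon EMR expects (illustrative only).

def to_emr_configuration(objects, config_id):
    """Resolve an EmrConfiguration object and its Property refs by id."""
    by_id = {o["id"]: o for o in objects}
    conf = by_id[config_id]
    props = conf.get("property", [])
    if isinstance(props, dict):  # a single ref may appear without a list
        props = [props]
    return {
        "classification": conf["classification"],
        "properties": {
            by_id[p["ref"]]["key"]: by_id[p["ref"]]["value"] for p in props
        },
    }

pipeline_objects = [
    {"name": "coresite", "id": "coresite", "type": "EmrConfiguration",
     "classification": "core-site",
     "property": [{"ref": "io-file-buffer-size"}, {"ref": "fs-s3-block-size"}]},
    {"name": "io-file-buffer-size", "id": "io-file-buffer-size",
     "type": "Property", "key": "io.file.buffer.size", "value": "4096"},
    {"name": "fs-s3-block-size", "id": "fs-s3-block-size",
     "type": "Property", "key": "fs.s3.block.size", "value": "67108864"},
]

config = to_emr_configuration(pipeline_objects, "coresite")
print(config)
```

Each Property object contributes one key-value pair to the `properties` map of its parent classification.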

The following example is a nested configuration used to set the Hadoop environment with the `hadoop-env` classification:

```
[
  {
    "classification": "hadoop-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "YARN_PROXYSERVER_HEAPSIZE": "2396"
        }
      }
    ]
  }
]
```

The corresponding pipeline definition object that uses this configuration is as follows:

```
{
  "objects": [
    {
      "name": "ReleaseLabelCluster",
      "releaseLabel": "emr-4.0.0",
      "applications": ["spark", "hive", "pig"],
      "id": "ResourceId_I1mCc",
      "type": "EmrCluster",
      "configuration": {
        "ref": "hadoop-env"
      }
    },
    {
      "name": "hadoop-env",
      "id": "hadoop-env",
      "type": "EmrConfiguration",
      "classification": "hadoop-env",
      "configuration": {
        "ref": "export"
      }
    },
    {
      "name": "export",
      "id": "export",
      "type": "EmrConfiguration",
      "classification": "export",
      "property": {
        "ref": "yarn-proxyserver-heapsize"
      }
    },
    {
      "name": "yarn-proxyserver-heapsize",
      "id": "yarn-proxyserver-heapsize",
      "type": "Property",
      "key": "YARN_PROXYSERVER_HEAPSIZE",
      "value": "2396"
    }
  ]
}
```
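For the nested case, each EmrConfiguration object that references another through `configuration` produces a `configurations` sub-list in the resolved output. The recursive resolution can be sketched as follows (again illustrative; the function name and logic are ours, not part of the AWS Data Pipeline SDK):

```python
# Sketch: recursively resolve nested EmrConfiguration pipeline objects into
# the nested classification/properties/configurations shape (illustrative only).

def resolve_configuration(objects, config_id):
    """Build a (possibly nested) EMR configuration dict from pipeline objects."""
    by_id = {o["id"]: o for o in objects}

    def build(cid):
        conf = by_id[cid]
        out = {"classification": conf["classification"], "properties": {}}
        props = conf.get("property", [])
        if isinstance(props, dict):  # a single ref may appear without a list
            props = [props]
        for ref in props:
            prop = by_id[ref["ref"]]
            out["properties"][prop["key"]] = prop["value"]
        sub = conf.get("configuration")
        if sub:  # recurse into the referenced sub-configuration
            out["configurations"] = [build(sub["ref"])]
        return out

    return build(config_id)

pipeline_objects = [
    {"name": "hadoop-env", "id": "hadoop-env", "type": "EmrConfiguration",
     "classification": "hadoop-env", "configuration": {"ref": "export"}},
    {"name": "export", "id": "export", "type": "EmrConfiguration",
     "classification": "export",
     "property": {"ref": "yarn-proxyserver-heapsize"}},
    {"name": "yarn-proxyserver-heapsize", "id": "yarn-proxyserver-heapsize",
     "type": "Property", "key": "YARN_PROXYSERVER_HEAPSIZE", "value": "2396"},
]

resolved = resolve_configuration(pipeline_objects, "hadoop-env")
print(resolved)
```

The `hadoop-env` classification itself carries no properties here; the heap-size setting lives in the nested `export` sub-configuration.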

The following example modifies a Hive-specific property for an EMR cluster:

```
{
    "objects": [
        {
            "name": "hivesite",
            "id": "hivesite",
            "type": "EmrConfiguration",
            "classification": "hive-site",
            "property": [
                {
                    "ref": "hive-client-timeout"
                }
            ]
        },
        {
            "name": "hive-client-timeout",
            "id": "hive-client-timeout",
            "type": "Property",
            "key": "hive.metastore.client.socket.timeout",
            "value": "2400s"
        }
    ]
}
```

## Syntax
<a name="emrconfiguration-syntax"></a>

This object includes the following fields.


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| classification | Classification for the configuration. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| configuration | Sub-configuration for this configuration. | Reference Object, for example, "configuration":{"ref":"myEmrConfigurationId"} | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| property | Configuration property. | Reference Object, for example, "property":{"ref":"myPropertyId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

## See Also
<a name="emrconfiguration-seealso"></a>
+ [EmrCluster](dp-object-emrcluster.md)
+ [Property](dp-object-property.md)
+ [Amazon EMR Release Guide](http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/)

# Property
<a name="dp-object-property"></a>

A single key-value property for use with an EmrConfiguration object.

## Example
<a name="property-example"></a>

The following pipeline definition shows an EmrConfiguration object and corresponding Property objects to launch an EmrCluster:

```
{
  "objects": [
    {
      "name": "ReleaseLabelCluster",
      "releaseLabel": "emr-4.1.0",
      "applications": ["spark", "hive", "pig"],
      "id": "ResourceId_I1mCc",
      "type": "EmrCluster",
      "configuration": {
        "ref": "coresite"
      }
    },
    {
      "name": "coresite",
      "id": "coresite",
      "type": "EmrConfiguration",
      "classification": "core-site",
      "property": [{
        "ref": "io-file-buffer-size"
      },
      {
        "ref": "fs-s3-block-size"
      }
      ]
    },
    {
      "name": "io-file-buffer-size",
      "id": "io-file-buffer-size",
      "type": "Property",
      "key": "io.file.buffer.size",
      "value": "4096"
    },
    {
      "name": "fs-s3-block-size",
      "id": "fs-s3-block-size",
      "type": "Property",
      "key": "fs.s3.block.size",
      "value": "67108864"
    }
  ]
}
```

## Syntax
<a name="property-syntax"></a>

This object includes the following fields.


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| key | The key of the configuration property. | String | 
| value | The value of the configuration property. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects, which execute Attempt Objects. | String | 

## See Also
<a name="property-seealso"></a>
+ [EmrCluster](dp-object-emrcluster.md)
+ [EmrConfiguration](dp-object-emrconfiguration.md)
+ [Amazon EMR Release Guide](http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/)