

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Activities
<a name="dp-object-activities"></a>

The following are the AWS Data Pipeline activity objects:

**Topics**
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrActivity](dp-object-emractivity.md)
+ [HadoopActivity](dp-object-hadoopactivity.md)
+ [HiveActivity](dp-object-hiveactivity.md)
+ [HiveCopyActivity](dp-object-hivecopyactivity.md)
+ [PigActivity](dp-object-pigactivity.md)
+ [RedshiftCopyActivity](dp-object-redshiftcopyactivity.md)
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [SqlActivity](dp-object-sqlactivity.md)

# CopyActivity
<a name="dp-object-copyactivity"></a>

Copies data from one location to another. `CopyActivity` supports [S3DataNode](dp-object-s3datanode.md) and [SqlDataNode](dp-object-sqldatanode.md) as input and output, and the copy operation is normally performed record-by-record. However, `CopyActivity` provides a high-performance Amazon S3 to Amazon S3 copy when all of the following conditions are met:
+ The input and output are S3DataNodes
+ The `dataFormat` field is the same for input and output

If you provide compressed data files as input and do not indicate this using the `compression` field on the S3 data nodes, `CopyActivity` might fail. In this case, `CopyActivity` does not properly detect the end of record character and the operation fails. Further, `CopyActivity` supports copying from a directory to another directory and copying a file to a directory, but record-by-record copy occurs when copying a directory to a file. Finally, `CopyActivity` does not support copying multipart Amazon S3 files. 
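
For example, a minimal sketch of an input S3DataNode that declares its files as gzip-compressed using the `compression` field; the `id`, path, and data format reference here are illustrative placeholders, not prescribed values:

```
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "directoryPath" : "s3://amzn-s3-demo-bucket/input/",
  "dataFormat" : { "ref" : "MyCsvFormat" },
  "compression" : "gzip"
}
```

With `compression` set, `CopyActivity` can decompress the input and detect record boundaries correctly.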

`CopyActivity` has specific limitations to its CSV support. When you use an S3DataNode as input for `CopyActivity`, you can only use a Unix/Linux variant of the CSV data file format for the Amazon S3 input and output fields. The Unix/Linux variant requires the following: 
+ The separator must be the "," (comma) character.
+ The records are not quoted.
+ The default escape character is ASCII value 92 (backslash).
+ The end of record identifier is ASCII value 10 (or "\n").

Windows-based systems typically use a different end-of-record character sequence: a carriage return and line feed together (ASCII value 13 and ASCII value 10). You must accommodate this difference using an additional mechanism, such as a pre-copy script to modify the input data, to ensure that `CopyActivity` can properly detect the end of a record; otherwise, the `CopyActivity` fails repeatedly.
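
As one hedged sketch of such a pre-copy mechanism, a `ShellCommandActivity` staged through Amazon S3 could strip carriage returns before the copy runs. The object ids, file names, and `tr` invocation below are illustrative assumptions, not a prescribed pattern:

```
{
  "id" : "FixLineEndings",
  "type" : "ShellCommandActivity",
  "runsOn" : { "ref" : "MyEc2Resource" },
  "stage" : "true",
  "input" : { "ref" : "WindowsInputData" },
  "output" : { "ref" : "UnixInputData" },
  "command" : "tr -d '\\r' < ${INPUT1_STAGING_DIR}/data.csv > ${OUTPUT1_STAGING_DIR}/data.csv"
}
```

`CopyActivity` would then reference `UnixInputData` as its input data node.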

When using `CopyActivity` to export from a PostgreSQL RDS object to a TSV data format, the default NULL character is \n.

## Example
<a name="copyactivity-example"></a>

The following is an example of this object type. This object references three other objects that you would define in the same pipeline definition file. `CopyPeriod` is a `Schedule` object and `InputData` and `OutputData` are data node objects.

```
{
  "id" : "S3ToS3Copy",
  "type" : "CopyActivity",
  "schedule" : { "ref" : "CopyPeriod" },
  "input" : { "ref" : "InputData" },
  "output" : { "ref" : "OutputData" },
  "runsOn" : { "ref" : "MyEc2Resource" }
}
```
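
The referenced objects might look like the following sketch; the period, start date, and file paths are placeholders, not prescribed values:

```
{
  "id" : "CopyPeriod",
  "type" : "Schedule",
  "period" : "1 day",
  "startDateTime" : "2014-09-01T00:00:00"
},
{
  "id" : "InputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "filePath" : "s3://amzn-s3-demo-bucket/input/data.csv"
},
{
  "id" : "OutputData",
  "type" : "S3DataNode",
  "schedule" : { "ref" : "CopyPeriod" },
  "filePath" : "s3://amzn-s3-demo-bucket/output/data.csv"
}
```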

## Syntax
<a name="copyactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The input data source. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
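
As a hedged illustration of the ondemand case, the schedule type can be set once on the default object so that no other object specifies a scheduleType; the role names below are the common defaults and are assumptions, not requirements:

```
{
  "id" : "Default",
  "scheduleType" : "ondemand",
  "role" : "DataPipelineDefaultRole",
  "resourceRole" : "DataPipelineDefaultResourceRole"
}
```

Each call to the ActivatePipeline operation then starts a single run of the pipeline.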

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object | DateTime | 
| @scheduledStartTime | Schedule start time for object | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="copyactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)
+ [Export MySQL Data to Amazon S3 Using AWS Data Pipeline](dp-copydata-mysql.md)

# EmrActivity
<a name="dp-object-emractivity"></a>

 Runs an EMR cluster. 

AWS Data Pipeline uses a different format for steps than Amazon EMR; for example, AWS Data Pipeline uses comma-separated arguments after the JAR name in the `EmrActivity` step field. The following example shows a step formatted for Amazon EMR, followed by its AWS Data Pipeline equivalent:

```
s3://amzn-s3-demo-bucket/MyWork.jar arg1 arg2 arg3
```

```
"s3://amzn-s3-demo-bucket/MyWork.jar,arg1,arg2,arg3"
```

## Examples
<a name="emractivity-example"></a>

The following is an example of this object type. This example uses older versions of Amazon EMR. Verify this example for correctness with the version of Amazon EMR cluster that you are using. 

This object references three other objects that you would define in the same pipeline definition file. `MyEmrCluster` is an `EmrCluster` object and `MyS3Input` and `MyS3Output` are `S3DataNode` objects. 

**Note**  
In this example, you can replace the `step` field with your desired cluster string, which could be a Pig script, Hadoop streaming cluster, your own custom JAR including its parameters, or so on.

Hadoop 2.x (AMI 3.x)

```
{
  "id" : "MyEmrActivity",
  "type" : "EmrActivity",
  "runsOn" : { "ref" : "MyEmrCluster" },
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : ["s3://amzn-s3-demo-bucket/myPath/myStep.jar,firstArg,secondArg,-files,s3://amzn-s3-demo-bucket/myPath/myFile.py,-input,s3://myinputbucket/path,-output,s3://myoutputbucket/path,-mapper,myFile.py,-reducer,reducerName","s3://amzn-s3-demo-bucket/myPath/myotherStep.jar,..."],
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" }
}
```

**Note**  
To pass arguments to an application in a step, you need to specify the Region in the path of the script, as in the following example. In addition, you may need to escape the arguments that you pass. For example, if you use `script-runner.jar` to run a shell script and want to pass arguments to the script, you must escape the commas that separate them. The following step slot illustrates how to do this:   

```
"step" : "s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://datapipeline/echo.sh,a\\\\,b\\\\,c"
```
This step uses `script-runner.jar` to run the `echo.sh` shell script and passes `a`, `b`, and `c` as a single argument to the script. The first escape character is removed from the resultant argument, so you may need to escape again. For example, if you had `File\.gz` as an argument in JSON, you could escape it using `File\\.gz`. However, because the first escape is discarded, you must use `File\\\\.gz`.

## Syntax
<a name="emractivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The Amazon EMR cluster on which this job will run. | Reference Object, for example, "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, for example, "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The location of the input data. | Reference Object, for example, "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| output | The location of the output data. | Reference Object, for example, "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as 's3://BucketName/Prefix/' for uploading logs for the pipeline. | String | 
| postStepCommand | Shell scripts to be run after all steps are finished. To specify multiple scripts, up to 255, add multiple postStepCommand fields. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, for example, "precondition":{"ref":"myPreconditionId"} | 
| preStepCommand | Shell scripts to be run before any steps are run. To specify multiple scripts, up to 255, add multiple preStepCommand fields. | String | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning |  Resize the cluster before performing this activity to accommodate DynamoDB tables specified as inputs or outputs.   If your `EmrActivity` uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.   | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval, or end of the interval. Values are: cron, ondemand, and timeseries. The timeseries scheduling means that instances are scheduled at the end of each interval. The cron scheduling means that instances are scheduled at the beginning of each interval. An ondemand schedule allows you to run a pipeline one time per activation. You do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run.  | Enumeration | 
| step | One or more steps for the cluster to run. To specify multiple steps, up to 255, add multiple step fields. Use comma-separated arguments after the JAR name; for example, "s3://amzn-s3-demo-bucket/MyWork.jar,arg1,arg2,arg3". | String | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version that the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="emractivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrCluster](dp-object-emrcluster.md)

# HadoopActivity
<a name="dp-object-hadoopactivity"></a>

 Runs a MapReduce job on a cluster. The cluster can be an EMR cluster managed by AWS Data Pipeline or another resource if you use TaskRunner. Use HadoopActivity when you want to run work in parallel. This allows you to use the scheduling resources of the YARN framework or the MapReduce resource negotiator in Hadoop 1. If you would like to run work sequentially using the Amazon EMR Step action, you can still use [EmrActivity](dp-object-emractivity.md).

## Examples
<a name="hadoopactivity-example"></a>

**HadoopActivity using an EMR cluster managed by AWS Data Pipeline**  
The following HadoopActivity object uses an EmrCluster resource to run a program:

```
{
  "name": "MyHadoopActivity",
  "schedule": {"ref": "ResourcePeriod"},
  "runsOn": {"ref": "MyEmrCluster"},
  "type": "HadoopActivity",
  "preActivityTaskConfig": {"ref": "preTaskScriptConfig"},
  "jarUri": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
  "argument": [
    "-files",
    "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
    "-mapper",
    "wordSplitter.py",
    "-reducer",
    "aggregate",
    "-input",
    "s3://elasticmapreduce/samples/wordcount/input/",
    "-output",
    "s3://amzn-s3-demo-bucket/MyHadoopActivity/#{@pipelineId}/#{format(@scheduledStartTime,'YYYY-MM-dd')}"
  ],
  "maximumRetries": "0",
  "postActivityTaskConfig": {"ref": "postTaskScriptConfig"},
  "hadoopQueue": "high"
}
```

Here is the corresponding *MyEmrCluster*, which configures the FairScheduler and queues in YARN for Hadoop 2-based AMIs:

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
   "hadoopSchedulerType" : "PARALLEL_FAIR_SCHEDULING",
  “amiVersion” : “3.7.0”,
  "bootstrapAction" : ["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-z,yarn.scheduler.capacity.root.queues=low\,high\,default,-z,yarn.scheduler.capacity.root.high.capacity=50,-z,yarn.scheduler.capacity.root.low.capacity=10,-z,yarn.scheduler.capacity.root.default.capacity=30”]
}
```

This is the EmrCluster you use to configure FairScheduler in Hadoop 1:

```
{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "hadoopSchedulerType": "PARALLEL_FAIR_SCHEDULING",
  "amiVersion": "2.4.8",
  "bootstrapAction": "s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-m,mapred.queue.names=low\\,high\\,default,-m,mapred.fairscheduler.poolnameproperty=mapred.job.queue.name"
}
```

The following EmrCluster configures CapacityScheduler for Hadoop 2-based AMIs:

```
{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "hadoopSchedulerType": "PARALLEL_CAPACITY_SCHEDULING",
  "amiVersion": "3.7.0",
  "bootstrapAction": "s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-z,yarn.scheduler.capacity.root.queues=low\\,high,-z,yarn.scheduler.capacity.root.high.capacity=40,-z,yarn.scheduler.capacity.root.low.capacity=60"
}
```

**HadoopActivity using an existing EMR cluster**  
In this example, you use worker groups and a Task Runner to run a program on an existing EMR cluster. The following pipeline definition uses HadoopActivity to: 
+ Run a MapReduce program only on *myWorkerGroup* resources. For more information about worker groups, see [Executing Work on Existing Resources Using Task Runner](dp-how-task-runner-user-managed.md).
+ Run a preActivityTaskConfig and a postActivityTaskConfig.

```
{
  "objects": [
    {
      "argument": [
        "-files",
        "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "-mapper",
        "wordSplitter.py",
        "-reducer",
        "aggregate",
        "-input",
        "s3://elasticmapreduce/samples/wordcount/input/",
        "-output",
        "s3://amzn-s3-demo-bucket/MyHadoopActivity/#{@pipelineId}/#{format(@scheduledStartTime,'YYYY-MM-dd')}"
      ],
      "id": "MyHadoopActivity",
      "jarUri": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "name": "MyHadoopActivity",
      "type": "HadoopActivity"
    },
    {
      "id": "SchedulePeriod",
      "startDateTime": "start_datetime",
      "name": "SchedulePeriod",
      "period": "1 day",
      "type": "Schedule",
      "endDateTime": "end_datetime"
    },
    {
      "id": "ShellScriptConfig",
      "scriptUri": "s3://amzn-s3-demo-bucket/scripts/preTaskScript.sh",
      "name": "preTaskScriptConfig",
      "scriptArgument": [
        "test",
        "argument"
      ],
      "type": "ShellScriptConfig"
    },
    {
      "id": "ShellScriptConfig",
      "scriptUri": "s3://amzn-s3-demo-bucket/scripts/postTaskScript.sh",
      "name": "postTaskScriptConfig",
      "scriptArgument": [
        "test",
        "argument"
      ],
      "type": "ShellScriptConfig"
    },
    {
      "id": "Default",
      "scheduleType": "cron",
      "schedule": {
        "ref": "SchedulePeriod"
      },
      "name": "Default",
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/logs/2015-05-22T18:02:00.343Z642f3fe415",
      "maximumRetries": "0",    
      "workerGroup": "myWorkerGroup",
      "preActivityTaskConfig": {
        "ref": "preTaskScriptConfig"
      },
      "postActivityTaskConfig": {
        "ref": "postTaskScriptConfig"
      }    
    }
  ] 
}
```

## Syntax
<a name="hadoopactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| jarUri | Location of a JAR in Amazon S3 or the local file system of the cluster to run with HadoopActivity. | String | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html) | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The EMR cluster on which this job will run. | Reference Object, e.g. "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| argument | Arguments to pass to the JAR. | String | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopQueue | The Hadoop scheduler queue name on which the activity will be submitted. | String | 
| input | Location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| mainClass | The main class of the JAR you are executing with HadoopActivity. | String | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | Location of the output data. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
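
The on-demand option described in the scheduleType row above can be sketched as a pipeline fragment. This is a minimal, hypothetical sketch (the cluster reference, bucket path, and class name are placeholders, not values from this guide): with `ondemand` set on the default object, no schedule references are needed, and each call to the ActivatePipeline operation runs the pipeline once.

```
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "ondemand"
    },
    {
      "id": "MyHadoopActivity",
      "type": "HadoopActivity",
      "runsOn": { "ref": "MyEmrCluster" },
      "jarUri": "s3://amzn-s3-demo-bucket/apps/my-app.jar",
      "mainClass": "com.example.Main"
    }
  ]
}
```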

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for an object. | DateTime | 
| @scheduledStartTime | Schedule start time for an object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hadoopactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrCluster](dp-object-emrcluster.md)

# HiveActivity
<a name="dp-object-hiveactivity"></a>

Runs a Hive query on an EMR cluster. `HiveActivity` makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with `${input1}`, `${input2}`, and so on, based on the input fields in the `HiveActivity` object. 

For Amazon S3 inputs, the `dataFormat` field is used to create the Hive column names. 

For MySQL (Amazon RDS) inputs, the column names for the SQL query are used to create the Hive column names.

**Note**  
This activity uses the Hive [CSV Serde](https://cwiki.apache.org/confluence/display/Hive/CSV+Serde).

## Example
<a name="hiveactivity-example"></a>

The following is an example of this object type. This object references three other objects that you define in the same pipeline definition file. `MySchedule` is a `Schedule` object and `MyS3Input` and `MyS3Output` are data node objects.

```
{
  "name" : "ProcessLogData",
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : { "ref": "MySchedule" },
  "hiveScript" : "INSERT OVERWRITE TABLE ${output1} select host,user,time,request,status,size from ${input1};",
  "input" : { "ref": "MyS3Input" },
  "output" : { "ref": "MyS3Output" },
  "runsOn" : { "ref": "MyEmrCluster" }
}
```

## Syntax
<a name="hiveactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| hiveScript | The Hive script to run. | String | 
| scriptUri | The location of the Hive script to run (for example, s3://scriptLocation). | String | 

 


****  

| Required Group | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The EMR cluster on which this HiveActivity runs. | Reference Object, e.g. "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
| input | The input data source. | Reference Object, such as "input":{"ref":"myDataNodeId"} | 
| output | The output data source. | Reference Object, such as "output":{"ref":"myDataNodeId"} | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specify dependency on another runnable object. | Reference Object, such as "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopQueue | The Hadoop scheduler queue name on which the job will be submitted. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, such as "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, such as "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when current object succeeds. | Reference Object, such as "onSuccess":{"ref":"myActionId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, such as "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, such as "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, such as "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, such as "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | Timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs. If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs. | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| scriptVariable | Specifies script variables for Amazon EMR to pass to Hive while running a script. For example, the following example script variables would pass a SAMPLE and FILTER_DATE variable to Hive: SAMPLE=s3://elasticmapreduce/samples/hive-ads and FILTER_DATE=#{format(@scheduledStartTime,'YYYY-MM-dd')}. This field accepts multiple values and works with both the script and scriptUri fields. In addition, scriptVariable functions regardless of whether stage is set to true or false. This field is especially useful for sending dynamic values to Hive using AWS Data Pipeline expressions and functions. | String | 
| stage | Determines whether staging is enabled before or after running the script. Not permitted with Hive 11, so use an Amazon EMR AMI version 3.2.0 or greater. | Boolean | 
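
To illustrate the scriptUri and scriptVariable fields from the tables above, the following sketch runs a Hive script stored in Amazon S3 and passes it a dynamic date variable. The bucket, script path, and referenced object names are hypothetical, not values defined in this guide:

```
{
  "id" : "MyHiveActivity",
  "type" : "HiveActivity",
  "schedule" : { "ref" : "MySchedule" },
  "runsOn" : { "ref" : "MyEmrCluster" },
  "scriptUri" : "s3://amzn-s3-demo-bucket/hive/my-script.q",
  "scriptVariable" : [
    "SAMPLE=s3://elasticmapreduce/samples/hive-ads",
    "FILTER_DATE=#{format(@scheduledStartTime,'YYYY-MM-dd')}"
  ],
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" }
}
```

The `#{...}` expression is evaluated by AWS Data Pipeline before the values are handed to Hive, so the script receives a resolved date string rather than the expression itself.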

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, such as "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, such as "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for an object. | DateTime | 
| @scheduledStartTime | Schedule start time for an object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, such as "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hiveactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# HiveCopyActivity
<a name="dp-object-hivecopyactivity"></a>

Runs a Hive query on an EMR cluster. `HiveCopyActivity` makes it easier to copy data between DynamoDB tables. `HiveCopyActivity` accepts a HiveQL statement to filter input data from DynamoDB at the column and row level.

## Example
<a name="hivecopyactivity-example"></a>

The following example shows how to use `HiveCopyActivity` and `DynamoDBExportDataFormat` to copy data from one `DynamoDBDataNode` to another, while filtering data based on a timestamp.

```
{
  "objects": [
    {
      "id" : "DataFormat.1",
      "name" : "DataFormat.1",
      "type" : "DynamoDBExportDataFormat",
      "column" : "timeStamp BIGINT"
    },
    {
      "id" : "DataFormat.2",
      "name" : "DataFormat.2",
      "type" : "DynamoDBExportDataFormat"
    },
    {
      "id" : "DynamoDBDataNode.1",
      "name" : "DynamoDBDataNode.1",
      "type" : "DynamoDBDataNode",
      "tableName" : "item_mapped_table_restore_temp",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.1" }
    },
    {
      "id" : "DynamoDBDataNode.2",
      "name" : "DynamoDBDataNode.2",
      "type" : "DynamoDBDataNode",
      "tableName" : "restore_table",
      "region" : "us-west-1",
      "schedule" : { "ref" : "ResourcePeriod" },
      "dataFormat" : { "ref" : "DataFormat.2" }
    },
    {
      "id" : "EmrCluster.1",
      "name" : "EmrCluster.1",
      "type" : "EmrCluster",
      "schedule" : { "ref" : "ResourcePeriod" },
      "masterInstanceType" : "m1.xlarge",
      "coreInstanceCount" : "4"
    },
    {
      "id" : "HiveTransform.1",
      "name" : "Hive Copy Transform.1",
      "type" : "HiveCopyActivity",
      "input" : { "ref" : "DynamoDBDataNode.1" },
      "output" : { "ref" : "DynamoDBDataNode.2" },
      "schedule" :{ "ref" : "ResourcePeriod" },
      "runsOn" : { "ref" : "EmrCluster.1" },
      "filterSql" : "`timeStamp` > unix_timestamp(\"#{@scheduledStartTime}\", \"yyyy-MM-dd'T'HH:mm:ss\")"
    },
    {
      "id" : "ResourcePeriod",
      "name" : "ResourcePeriod",
      "type" : "Schedule",
      "period" : "1 Hour",
      "startDateTime" : "2013-06-04T00:00:00",
      "endDateTime" : "2013-06-04T01:00:00"
    }
  ]
}
```

## Syntax
<a name="hivecopyactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | Specify cluster to run on. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
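
As the table above notes, workerGroup can stand in for runsOn when you route tasks to your own Task Runner installation instead of a managed cluster. A minimal sketch follows; the worker group name is hypothetical, and the data nodes are the ones defined in the example earlier in this section:

```
{
  "id" : "HiveTransform.2",
  "name" : "Hive Copy Transform.2",
  "type" : "HiveCopyActivity",
  "schedule" : { "ref" : "ResourcePeriod" },
  "workerGroup" : "myWorkerGroup",
  "input" : { "ref" : "DynamoDBDataNode.1" },
  "output" : { "ref" : "DynamoDBDataNode.2" }
}
```

A Task Runner started with the same worker group name polls for and runs tasks from this activity.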

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specifies the dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| filterSql | A Hive SQL statement fragment that filters a subset of DynamoDB or Amazon S3 data to copy. The filter should only contain predicates and not begin with a WHERE clause, because AWS Data Pipeline adds it automatically. | String | 
| input | The input data source. This must be an S3DataNode or DynamoDBDataNode. If you use a DynamoDBDataNode, specify a DynamoDBExportDataFormat. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or still not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. If input is S3DataNode, this must be DynamoDBDataNode. Otherwise, this can be S3DataNode or DynamoDBDataNode. If you use a DynamoDBDataNode, specify a DynamoDBExportDataFormat. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots will be inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as 's3://BucketName/Key/', for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | The post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | The pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, e.g. "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for successive calls to reportProgress by remote work. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs. If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs. | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="hivecopyactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# PigActivity
<a name="dp-object-pigactivity"></a>

PigActivity provides native support for Pig scripts in AWS Data Pipeline without the requirement to use `ShellCommandActivity` or `EmrActivity`. In addition, PigActivity supports data staging. When the stage field is set to true, AWS Data Pipeline stages the input data as a schema in Pig without additional code from the user. 

## Example
<a name="pigactivity-example"></a>

The following example pipeline shows how to use `PigActivity`. The example pipeline performs the following steps:
+ MyPigActivity1 loads data from Amazon S3 and runs a Pig script that selects a few columns of data and uploads it to Amazon S3.
+ MyPigActivity2 loads the first output, selects a few columns and three rows of data, and uploads it to Amazon S3 as a second output.
+ MyPigActivity3 loads the second output data, selects two rows of data and only the column named "Fifth", and uploads it to Amazon RDS.
+ MyPigActivity4 loads Amazon RDS data, selects the first row of data, and uploads it to Amazon S3.

```
{
  "objects": [
    {
      "id": "MyInputData1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "directoryPath": "s3://amzn-s3-demo-bucket/pigTestInput",
      "name": "MyInputData1",
      "dataFormat": {
        "ref": "MyInputDataType1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyPigActivity4",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData3"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path/",
      "name": "MyPigActivity4",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "type": "PigActivity",
      "dependsOn": {
        "ref": "MyPigActivity3"
      },
      "output": {
        "ref": "MyOutputData4"
      },
      "script": "B = LIMIT ${input1} 1; ${output1} = FOREACH B GENERATE one;",
      "stage": "true"
    },
    {
      "id": "MyPigActivity3",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData2"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "name": "MyPigActivity3",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "script": "B = LIMIT ${input1} 2; ${output1} = FOREACH B GENERATE Fifth;",
      "type": "PigActivity",
      "dependsOn": {
        "ref": "MyPigActivity2"
      },
      "output": {
        "ref": "MyOutputData3"
      },
      "stage": "true"
    },
    {
      "id": "MyOutputData2",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "name": "MyOutputData2",
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput2",
      "dataFormat": {
        "ref": "MyOutputDataType2"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyOutputData1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "name": "MyOutputData1",
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput1",
      "dataFormat": {
        "ref": "MyOutputDataType1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyInputDataType1",
      "name": "MyInputDataType1",
      "column": [
        "First STRING",
        "Second STRING",
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING",
        "Ninth STRING",
        "Tenth STRING"
      ],
      "inputRegEx": "^(\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+) (\\\\S+)",
      "type": "RegEx"
    },
    {
      "id": "MyEmrResource",
      "region": "us-east-1",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "keyPair": "example-keypair",
      "masterInstanceType": "m1.small",
      "enableDebugging": "true",
      "name": "MyEmrResource",
      "actionOnTaskFailure": "continue",
      "type": "EmrCluster"
    },
    {
      "id": "MyOutputDataType4",
      "name": "MyOutputDataType4",
      "column": "one STRING",
      "type": "CSV"
    },
    {
      "id": "MyOutputData4",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "directoryPath": "s3://amzn-s3-demo-bucket/PigActivityOutput3",
      "name": "MyOutputData4",
      "dataFormat": {
        "ref": "MyOutputDataType4"
      },
      "type": "S3DataNode"
    },
    {
      "id": "MyOutputDataType1",
      "name": "MyOutputDataType1",
      "column": [
        "First STRING",
        "Second STRING",
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING"
      ],
      "columnSeparator": "*",
      "type": "Custom"
    },
    {
      "id": "MyOutputData3",
      "username": "___",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "insertQuery": "insert into #{table} (one) values (?)",
      "name": "MyOutputData3",
      "*password": "___",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "connectionString": "jdbc:mysql://example-database-instance:3306/example-database",
      "selectQuery": "select * from #{table}",
      "table": "example-table-name",
      "type": "MySqlDataNode"
    },
    {
      "id": "MyOutputDataType2",
      "name": "MyOutputDataType2",
      "column": [
        "Third STRING",
        "Fourth STRING",
        "Fifth STRING",
        "Sixth STRING",
        "Seventh STRING",
        "Eighth STRING"
      ],
      "type": "TSV"
    },
    {
      "id": "MyPigActivity2",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyOutputData1"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "name": "MyPigActivity2",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "dependsOn": {
        "ref": "MyPigActivity1"
      },
      "type": "PigActivity",
      "script": "B = LIMIT ${input1} 3; ${output1} = FOREACH B GENERATE Third, Fourth, Fifth, Sixth, Seventh, Eighth;",
      "output": {
        "ref": "MyOutputData2"
      },
      "stage": "true"
    },
    {
      "id": "MyEmrResourcePeriod",
      "startDateTime": "2013-05-20T00:00:00",
      "name": "MyEmrResourcePeriod",
      "period": "1 day",
      "type": "Schedule",
      "endDateTime": "2013-05-21T00:00:00"
    },
    {
      "id": "MyPigActivity1",
      "scheduleType": "CRON",
      "schedule": {
        "ref": "MyEmrResourcePeriod"
      },
      "input": {
        "ref": "MyInputData1"
      },
      "pipelineLogUri": "s3://amzn-s3-demo-bucket/path",
      "scriptUri": "s3://amzn-s3-demo-bucket/script/pigTestScipt.q",
      "name": "MyPigActivity1",
      "runsOn": {
        "ref": "MyEmrResource"
      },
      "scriptVariable": [
        "column1=First",
        "column2=Second",
        "three=3"
      ],
      "type": "PigActivity",
      "output": {
        "ref": "MyOutputData1"
      },
      "stage": "true"
    }
  ]
}
```

The content of `pigTestScript.q` is as follows.

```
B = LIMIT ${input1} $three; ${output1} = FOREACH B GENERATE $column1, $column2, Third, Fourth, Fifth, Sixth, Seventh, Eighth;
```

## Syntax
<a name="pigactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying `"schedule": {"ref": "DefaultSchedule"}`. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| script | The Pig script to run. | String | 
| scriptUri | The location of the Pig script to run (for example, s3://scriptLocation). | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The EMR cluster on which this `PigActivity` runs. | Reference Object, for example, "runsOn":{"ref":"myEmrClusterId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| dependsOn | Specifies the dependency on another runnable object. | Reference Object, for example, "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The input data source. | Reference Object, for example, "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| output | The output data source. | Reference Object, for example, "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| postActivityTaskConfig | Post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, for example, "postActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| preActivityTaskConfig | Pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments. | Reference Object, for example, "preActivityTaskConfig":{"ref":"myShellScriptConfigId"} | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, for example, "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried. | Period | 
| resizeClusterBeforeRunning | Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs. If your activity uses a `DynamoDBDataNode` as either an input or output data node, and if you set the `resizeClusterBeforeRunning` to `TRUE`, AWS Data Pipeline starts using `m3.xlarge` instance types. This overwrites your instance type choices with `m3.xlarge`, which could increase your monthly costs.  | Boolean | 
| resizeClusterMaxInstances | A limit on the maximum number of instances that can be requested by the resize algorithm. | Integer | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning or at the end of the interval. Time-series style scheduling means that instances are scheduled at the end of each interval, and cron-style scheduling means that instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation, which means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries. | Enumeration | 
| scriptVariable | The arguments to pass to the Pig script. You can use scriptVariable with script or scriptUri. | String | 
| stage | Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}. | Boolean | 

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for the object. | DateTime | 
| @scheduledStartTime | Schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version that the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="pigactivity-seealso"></a>
+ [ShellCommandActivity](dp-object-shellcommandactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# RedshiftCopyActivity
<a name="dp-object-redshiftcopyactivity"></a>

Copies data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.

Here is an overview of a use case in which to use `RedshiftCopyActivity`:

1. Start by using AWS Data Pipeline to stage your data in Amazon S3. 

1. Use `RedshiftCopyActivity` to move the data from Amazon RDS and Amazon EMR to Amazon Redshift.

   This lets you load your data into Amazon Redshift where you can analyze it.

1. Use [SqlActivity](dp-object-sqlactivity.md) to perform SQL queries on the data that you've loaded into Amazon Redshift.

In addition, `RedshiftCopyActivity` lets you work with an `S3DataNode`, since it supports a manifest file. For more information, see [S3DataNode](dp-object-s3datanode.md).
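
For example, instead of a `directoryPath` or `filePath`, the input `S3DataNode` can reference a manifest that lists the exact files to copy. The following is a minimal sketch; the bucket path and schedule reference are illustrative.

```
{
  "id" : "MyS3ManifestInput",
  "type" : "S3DataNode",
  "schedule" : { "ref": "Hour" },
  "manifestFilePath" : "s3://amzn-s3-demo-bucket/manifests/copy-manifest.json"
}
```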

## Example
<a name="redshiftcopyactivity-example"></a>

The following is an example of this object type. 

To ensure format conversion, this example uses the [EMPTYASNULL](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-emptyasnull) and [IGNOREBLANKLINES](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-ignoreblanklines) special conversion parameters in `commandOptions`. For more information, see [Data Conversion Parameters](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html) in the *Amazon Redshift Database Developer Guide*.

```
{
  "id" : "S3ToRedshiftCopyActivity",
  "type" : "RedshiftCopyActivity",
  "input" : { "ref": "MyS3DataNode" },
  "output" : { "ref": "MyRedshiftDataNode" },
  "insertMode" : "KEEP_EXISTING",
  "schedule" : { "ref": "Hour" },
  "runsOn" : { "ref": "MyEc2Resource" },
  "commandOptions": ["EMPTYASNULL", "IGNOREBLANKLINES"]
}
```

The following example pipeline definition shows an activity that uses the `APPEND` insert mode:

```
{
  "objects": [
    {
      "id": "CSVId1",
      "name": "DefaultCSV1",
      "type": "CSV"
    },
    {
      "id": "RedshiftDatabaseId1",
      "databaseName": "dbname",
      "username": "user",
      "name": "DefaultRedshiftDatabase1",
      "*password": "password",
      "type": "RedshiftDatabase",
      "clusterId": "redshiftclusterId"
    },
    {
      "id": "Default",
      "scheduleType": "timeseries",
      "failureAndRerunMode": "CASCADE",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "RedshiftDataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "tableName": "orders",
      "name": "DefaultRedshiftDataNode1",
      "createTableSql": "create table StructuredLogs (requestBeginTime CHAR(30) PRIMARY KEY DISTKEY SORTKEY, requestEndTime CHAR(30), hostname CHAR(100), requestDate varchar(20));",
      "type": "RedshiftDataNode",
      "database": {
        "ref": "RedshiftDatabaseId1"
      }
    },
    {
      "id": "Ec2ResourceId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "securityGroups": "MySecurityGroup",
      "name": "DefaultEc2Resource1",
      "role": "DataPipelineDefaultRole",
      "logUri": "s3://myLogs",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "type": "Ec2Resource"
    },
    {
      "id": "ScheduleId1",
      "startDateTime": "yyyy-mm-ddT00:00:00",
      "name": "DefaultSchedule1",
      "type": "Schedule",
      "period": "period",
      "endDateTime": "yyyy-mm-ddT00:00:00"
    },
    {
      "id": "S3DataNodeId1",
      "schedule": {
        "ref": "ScheduleId1"
      },
      "filePath": "s3://datapipeline-us-east-1/samples/hive-ads-samples.csv",
      "name": "DefaultS3DataNode1",
      "dataFormat": {
        "ref": "CSVId1"
      },
      "type": "S3DataNode"
    },
    {
      "id": "RedshiftCopyActivityId1",
      "input": {
        "ref": "S3DataNodeId1"
      },
      "schedule": {
        "ref": "ScheduleId1"
      },
      "insertMode": "APPEND",
      "name": "DefaultRedshiftCopyActivity1",
      "runsOn": {
        "ref": "Ec2ResourceId1"
      },
      "type": "RedshiftCopyActivity",
      "output": {
        "ref": "RedshiftDataNodeId1"
      }
    }
  ]
}
```

The `APPEND` operation adds items to a table regardless of the primary or sort keys. For example, suppose you have the following table.

```
ID(PK)     USER
1          aaa
2          bbb
```

You can append a record with the same ID and user value:

```
ID(PK)     USER
1          aaa
2          bbb
1          aaa
```

**Note**  
If an `APPEND` operation is interrupted and retried, the resulting rerun pipeline potentially appends from the beginning. This may cause further duplication, so you should be aware of this behavior, especially if you have any logic that counts the number of rows.

For a tutorial, see [Copy Data to Amazon Redshift Using AWS Data Pipeline](dp-copydata-redshift.md).

## Syntax
<a name="redshiftcopyactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| insertMode | Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are: `KEEP_EXISTING`, `OVERWRITE_EXISTING`, `TRUNCATE`, and `APPEND`. `KEEP_EXISTING` adds new rows to the table, while leaving any existing rows unmodified. `KEEP_EXISTING` and `OVERWRITE_EXISTING` use the primary, sort, and distribution keys to identify which incoming rows to match with existing rows. See [Updating and Inserting New Data](https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html) in the *Amazon Redshift Database Developer Guide*. `TRUNCATE` deletes all the data in the destination table before writing the new data. `APPEND` adds all records to the end of the Amazon Redshift table. `APPEND` does not require a primary, distribution, or sort key, so items that may be potential duplicates may be appended. | Enumeration | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. Specify a schedule reference to another object to set the dependency execution order for this object. In most cases, we recommend putting the schedule reference on the default pipeline object so that all objects inherit that schedule. For example, you can explicitly set a schedule on the object by specifying `"schedule": {"ref": "DefaultSchedule"}`. If the master schedule in your pipeline contains nested schedules, create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, such as: "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | Most recently reported status from the remote activity. | String | 
| attemptTimeout | Timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried. | Period | 
| commandOptions | Takes parameters to pass to the Amazon Redshift data node during the `COPY` operation. For information on parameters, see [COPY](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) in the *Amazon Redshift Database Developer Guide*. As it loads the table, `COPY` attempts to implicitly convert the strings to the data type of the target column. In addition to the default data conversions that happen automatically, if you receive errors or have other conversion needs, you can specify additional conversion parameters. For information, see [Data Conversion Parameters](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html) in the *Amazon Redshift Database Developer Guide*. If a data format is associated with the input or output data node, then the provided parameters are ignored. Because the copy operation first uses `COPY` to insert data into a staging table, and then uses an `INSERT` command to copy the data from the staging table into the destination table, some `COPY` parameters do not apply, such as the `COPY` command's ability to enable automatic compression of the table. If compression is required, add column encoding details to the `CREATE TABLE` statement. Also, in some cases when it needs to unload data from the Amazon Redshift cluster and create files in Amazon S3, `RedshiftCopyActivity` relies on the `UNLOAD` operation from Amazon Redshift. To improve performance during copying and unloading, specify the `PARALLEL OFF` parameter of the `UNLOAD` command. For information on parameters, see [UNLOAD](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) in the *Amazon Redshift Database Developer Guide*. | String | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object: "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The input data node. The data source can be Amazon S3, DynamoDB, or Amazon Redshift. | Reference Object: "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object: "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed. | Reference Object: "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object: "onSuccess":{"ref":"myActionId"} | 
| output | The output data node. The output location can be Amazon S3 or Amazon Redshift. | Reference Object: "output":{"ref":"myDataNodeId"} | 
| parent | Parent of the current object from which slots will be inherited. | Reference Object: "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally define a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object: "precondition":{"ref":"myPreconditionId"} | 
| queue | Corresponds to the `query_group` setting in Amazon Redshift, which allows you to assign and prioritize concurrent activities based on their placement in queues. Amazon Redshift limits the number of simultaneous connections to 15. For more information, see [Assigning Queries to Queues](https://docs.aws.amazon.com/AmazonRDS/latest/DeveloperGuide/cm-c-executing-queries.html) in the *Amazon RDS Database Developer Guide*. | String | 
| reportProgressTimeout |  Timeout for remote work successive calls to `reportProgress`.  If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried.  | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType | Allows you to specify whether objects in your pipeline definition should be scheduled at the beginning or at the end of the interval. Values are: `cron`, `ondemand`, and `timeseries`. `timeseries` scheduling means that instances are scheduled at the end of each interval, and `cron` scheduling means that instances are scheduled at the beginning of each interval. An `ondemand` schedule allows you to run a pipeline one time per activation, which means you do not have to clone or re-create the pipeline to run it again. To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run. If you use an `ondemand` schedule, you must specify it in the default object, and it must be the only `scheduleType` specified for objects in the pipeline. | Enumeration | 
| transformSql |  The `SQL SELECT` expression used to transform the input data.  Run the `transformSql` expression on the table named `staging`.  When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called "staging" and initially loads data in there. Data from this table is used to update the target table.  The output schema of `transformSql` must match the final target table's schema. If you specify the `transformSql` option, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table.  | String | 
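
To tie `transformSql` back to the duplication caveat noted for `APPEND`, the following sketch deduplicates the staging table before the target table is updated. The `SELECT DISTINCT` statement and node references are illustrative.

```
{
  "id" : "S3ToRedshiftDedupCopy",
  "type" : "RedshiftCopyActivity",
  "input" : { "ref": "MyS3DataNode" },
  "output" : { "ref": "MyRedshiftDataNode" },
  "insertMode" : "OVERWRITE_EXISTING",
  "transformSql" : "SELECT DISTINCT * FROM staging",
  "runsOn" : { "ref": "MyEc2Resource" },
  "schedule" : { "ref": "Hour" }
}
```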

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object: "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | Time when the execution of this object finished. | DateTime | 
| @actualStartTime | Time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain the object failed on. | Reference Object: "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | EMR step logs available only on EMR activity attempts | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for EMR-based activities. | String | 
| @healthStatus | The health status of the object which reflects success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | Time at which the health status was updated last time. | DateTime | 
| hostname | The host name of client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | Time the latest run for which the execution completed. | DateTime | 
| @latestRunTime | Time the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | Time of run to be scheduled next. | DateTime | 
| reportProgressTime | Most recent time that remote activity reported progress. | DateTime | 
| @scheduledEndTime | Schedule end time for object. | DateTime | 
| @scheduledStartTime | Schedule start time for object. | DateTime | 
| @status | The status of this object. | String | 
| @version | Pipeline version the object was created with. | String | 
| @waitingOn | Description of list of dependencies this object is waiting on. | Reference Object: "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object. Denotes its place in the life cycle. For example, Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

# ShellCommandActivity
<a name="dp-object-shellcommandactivity"></a>

 Runs a command or script. You can use `ShellCommandActivity` to run time-series or cron-like scheduled tasks. 

When the `stage` field is set to true and used with an `S3DataNode`, `ShellCommandActivity` supports the concept of staging data, which means that you can move data from Amazon S3 to a stage location, such as Amazon EC2 or your local environment, perform work on the data using scripts and the `ShellCommandActivity`, and move it back to Amazon S3. 

In this case, when your shell command is connected to an input `S3DataNode`, your shell scripts operate directly on the data using `${INPUT1_STAGING_DIR}`, `${INPUT2_STAGING_DIR}`, and other fields, referring to the `ShellCommandActivity` input fields. 

Similarly, output from the shell-command can be staged in an output directory to be automatically pushed to Amazon S3, referred to by `${OUTPUT1_STAGING_DIR}`, `${OUTPUT2_STAGING_DIR}`, and so on. 

These expressions can be passed as command-line arguments to the shell command for you to use in data transformation logic.
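
Putting the staging pieces together, the following sketch stages an input `S3DataNode` into `${INPUT1_STAGING_DIR}`, transforms it with a shell command, and writes the result to `${OUTPUT1_STAGING_DIR}`, from which it is pushed back to Amazon S3. The node references and the `grep` command are illustrative.

```
{
  "id" : "TransformData",
  "type" : "ShellCommandActivity",
  "runsOn" : { "ref": "MyEc2Resource" },
  "input" : { "ref": "MyS3Input" },
  "output" : { "ref": "MyS3Output" },
  "stage" : "true",
  "command" : "grep -v '^#' ${INPUT1_STAGING_DIR}/*.csv > ${OUTPUT1_STAGING_DIR}/filtered.csv"
}
```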

`ShellCommandActivity` returns Linux-style error codes and strings. If a `ShellCommandActivity` results in an error, the `error` returned is a non-zero value.

## Example
<a name="shellcommandactivity-example"></a>

The following is an example of this object type.

```
{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "command" : "mkdir new-directory"
}
```

## Syntax
<a name="shellcommandactivity-syntax"></a>


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a `schedule` interval. To set the dependency execution order for this object, specify a `schedule` reference to another object. To satisfy this requirement, explicitly set a `schedule` on the object, for example, by specifying `"schedule": {"ref": "DefaultSchedule"}`. In most cases, it is better to put the `schedule` reference on the default pipeline object so that all objects inherit that schedule. If the pipeline consists of a tree of schedules (schedules within the master schedule), create a parent object that has a schedule reference. To spread the load, AWS Data Pipeline creates physical objects slightly ahead of schedule, but runs them on schedule. For more information about example optional schedule configurations, see [Schedule](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| command | The command to run. Use `$` to reference positional parameters and scriptArgument to specify the parameters for the command. This value and any associated parameters must function in the environment from which you are running the Task Runner. | String | 
| scriptUri | An Amazon S3 URI path for a file to download and run as a shell command. Specify only one scriptUri or command field. scriptUri cannot use parameters; use command instead. | String | 
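
As a sketch of how `command` and `scriptArgument` work together, the following activity passes two positional parameters, referenced as `$1` and `$2`, to the command. The resource reference and parameter values are illustrative.

```
{
  "id" : "EchoArguments",
  "type" : "ShellCommandActivity",
  "runsOn" : { "ref": "MyEc2Resource" },
  "command" : "echo $1 $2",
  "scriptArgument" : [ "hello", "world" ]
}
```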

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command, for example, an Amazon EC2 instance or an Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | Used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 
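As an alternative to `runsOn`, a task can be routed to a Task Runner that you manage yourself by naming its worker group. The following is a minimal sketch with a hypothetical worker group name:

```
{
  "id" : "MyShellCommand",
  "type" : "ShellCommandActivity",
  "workerGroup" : "myOnPremWorkerGroup",
  "command" : "echo hello"
}
```

A Task Runner started with a matching worker group value polls for and executes this task on the machine where it runs.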

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for the remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is not completed. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The location of the output data. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots are inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI, such as 's3://BucketName/Key/', for uploading logs for the pipeline. | String | 
| precondition | Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| reportProgressTimeout | The timeout for successive calls to reportProgress by remote activities. If set, then remote activities that do not report progress for the specified period may be considered stalled and are retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType |  Allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval.  The values are: `cron`, `ondemand`, and `timeseries`. If set to `timeseries`, instances are scheduled at the end of each interval. If set to `cron`, instances are scheduled at the beginning of each interval.  If set to `ondemand`, you can run a pipeline one time per activation. This means you do not have to clone or recreate the pipeline to run it again. If you use an `ondemand` schedule, specify it in the default object as the only `scheduleType` for objects in the pipeline. To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run.   | Enumeration | 
| scriptArgument | A JSON-formatted array of strings to pass to the command. For example, if command is echo $1 $2, specify scriptArgument as "param1", "param2". For multiple arguments and parameters, pass scriptArgument as follows: "scriptArgument":"arg1","scriptArgument":"param1","scriptArgument":"arg2","scriptArgument":"param2". scriptArgument can only be used with command; using it with scriptUri causes an error. | String | 
| stage | Determines whether staging is enabled, which allows your shell commands to access the staged-data variables, such as ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR}. | Boolean | 
| stderr | The path that receives redirected system error messages from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 
| stdout | The Amazon S3 path that receives redirected output from the command. If you use the runsOn field, this must be an Amazon S3 path because of the transitory nature of the resource running your activity. However, if you specify the workerGroup field, a local file path is permitted. | String | 
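To show how `command`, `scriptArgument`, `stage`, and the staged-data variables fit together, the following is a minimal sketch; the object names and bucket are hypothetical:

```
{
  "id" : "MyShellCommand",
  "type" : "ShellCommandActivity",
  "runsOn" : { "ref" : "MyEC2Resource" },
  "input" : { "ref" : "MyS3Input" },
  "output" : { "ref" : "MyS3Output" },
  "stage" : "true",
  "command" : "grep $1 ${INPUT1_STAGING_DIR}/* > ${OUTPUT1_STAGING_DIR}/matches.txt",
  "scriptArgument" : "ERROR",
  "stdout" : "s3://myLogBucket/stdout.txt",
  "stderr" : "s3://myLogBucket/stderr.txt"
}
```

Because `stage` is true, the input data node's files are staged locally under `${INPUT1_STAGING_DIR}`, and files written to `${OUTPUT1_STAGING_DIR}` are uploaded to the output data node when the command finishes.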

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | The list of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | The time when the execution of this object finished. | DateTime | 
| @actualStartTime | The time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | The description of the dependency chain that caused the object failure. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on Amazon EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which the object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for Amazon EMR-based activities. | String | 
| @healthStatus | The health status of the object, which reflects the success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | The Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was last updated. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | The time of the latest run for which the execution completed. | DateTime | 
| @latestRunTime | The time of the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | The time of the run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that the remote activity reported progress. | DateTime | 
| @scheduledEndTime | The schedule end time for the object. | DateTime | 
| @scheduledStartTime | The schedule start time for the object. | DateTime | 
| @status | The status of the object. | String | 
| @version | The AWS Data Pipeline version used to create the object. | String | 
| @waitingOn | The description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | The error describing the ill-formed object. | String | 
| @pipelineId | The Id of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 

## See Also
<a name="shellcommandactivity-seealso"></a>
+ [CopyActivity](dp-object-copyactivity.md)
+ [EmrActivity](dp-object-emractivity.md)

# SqlActivity
<a name="dp-object-sqlactivity"></a>

Runs an SQL query (script) on a database.

## Example
<a name="sqlactivity-example"></a>

The following is an example of this object type. Specify either `script` or `scriptUri`; to run a script stored in Amazon S3, replace the `script` field with, for example, `"scriptUri" : "s3://scriptBucket/query.sql"`.

```
{
  "id" : "MySqlActivity",
  "type" : "SqlActivity",
  "database" : { "ref": "MyDatabaseID" },
  "script" : "SQLQuery",
  "schedule" : { "ref": "MyScheduleID" }
}
```

## Syntax
<a name="sqlactivity-syntax"></a>


****  

| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| database | The database on which to run the supplied SQL script. | Reference Object, e.g. "database":{"ref":"myDatabaseId"} | 

 


****  

| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can set a schedule explicitly on the object, for example, by specifying `"schedule": {"ref": "DefaultSchedule"}`.  In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule.  If the pipeline has a tree of schedules nested within the master schedule, create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html)  | Reference Object, e.g. "schedule":{"ref":"myScheduleId"} | 
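In keeping with the guidance above, a schedule is typically defined once and referenced from the `Default` object so that every component inherits it. A sketch, with a hypothetical period:

```
{
  "objects" : [
    {
      "id" : "DefaultSchedule",
      "type" : "Schedule",
      "period" : "1 day",
      "startAt" : "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id" : "Default",
      "scheduleType" : "cron",
      "schedule" : { "ref" : "DefaultSchedule" }
    }
  ]
}
```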

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| script | The SQL script to run. You must specify either script or scriptUri. When the script is stored in Amazon S3, it is not evaluated as an expression; specifying multiple values for scriptArgument is helpful in that case. | String | 
| scriptUri | A URI specifying the location of an SQL script to execute in this activity. | String | 

 


****  

| Required Group (One of the following is required) | Description | Slot Type | 
| --- | --- | --- | 
| runsOn | The computational resource to run the activity or command. For example, an Amazon EC2 instance or Amazon EMR cluster. | Reference Object, e.g. "runsOn":{"ref":"myResourceId"} | 
| workerGroup | The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists, workerGroup is ignored. | String | 

 


****  

| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, then a remote activity that does not complete within the specified time after starting may be retried. | Period | 
| dependsOn | Specifies a dependency on another runnable object. | Reference Object, e.g. "dependsOn":{"ref":"myActivityId"} | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| input | The location of the input data. | Reference Object, e.g. "input":{"ref":"myDataNodeId"} | 
| lateAfterTimeout | The time period since the scheduled start of the pipeline within which the object run must start. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of attempt retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, e.g. "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still not completed within the time period since the scheduled start of the pipeline, as specified by 'lateAfterTimeout'. | Reference Object, e.g. "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, e.g. "onSuccess":{"ref":"myActionId"} | 
| output | The location of the output data. This is only useful for referencing from within a script (for example, #{output.tablename}) and for creating the output table by setting 'createTableSql' in the output data node. The output of the SQL query is not written to the output data node. | Reference Object, e.g. "output":{"ref":"myDataNodeId"} | 
| parent | The parent of the current object from which slots are inherited. | Reference Object, e.g. "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| precondition | Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met. | Reference Object, e.g. "precondition":{"ref":"myPreconditionId"} | 
| queue | [Amazon Redshift only] Corresponds to the query_group setting in Amazon Redshift, which allows you to assign and prioritize concurrent activities based on their placement in queues. Amazon Redshift limits the number of simultaneous connections to 15. For more information, see [Assigning Queries to Queues](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-executing-queries.html) in the Amazon Redshift Database Developer Guide. | String | 
| reportProgressTimeout | The timeout for successive calls to reportProgress by remote activities. If set, then remote activities that do not report progress for the specified period may be considered stalled and are retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| scheduleType |  Allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of the interval or at the end of the interval. Values are: `cron`, `ondemand`, and `timeseries`. `timeseries` scheduling means instances are scheduled at the end of each interval. `cron` scheduling means instances are scheduled at the beginning of each interval.  An `ondemand` schedule allows you to run a pipeline one time per activation, so you do not have to clone or re-create the pipeline to run it again. If you use an `ondemand` schedule, it must be specified in the default object and must be the only `scheduleType` specified for objects in the pipeline. To use `ondemand` pipelines, call the `ActivatePipeline` operation for each subsequent run.  | Enumeration | 
| scriptArgument | A list of variables for the script. You can alternatively put expressions directly into the script field. Multiple values for scriptArgument are helpful when the script is stored in Amazon S3. Example: #{format(@scheduledStartTime, "YY-MM-DD HH:MM:SS")}\n#{format(plusPeriod(@scheduledStartTime, "1 day"), "YY-MM-DD HH:MM:SS")} | String | 
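As a sketch of passing evaluated expressions into a script stored in Amazon S3 (where the script text itself is not evaluated as an expression), `scriptArgument` can carry the values; the bucket and file names here are hypothetical:

```
{
  "id" : "MyDailySqlActivity",
  "type" : "SqlActivity",
  "database" : { "ref" : "MyDatabaseID" },
  "scriptUri" : "s3://myScriptBucket/daily-report.sql",
  "scriptArgument" : [
    "#{format(@scheduledStartTime, 'YYYY-MM-dd')}",
    "#{format(plusPeriod(@scheduledStartTime, '1 day'), 'YYYY-MM-dd')}"
  ],
  "schedule" : { "ref" : "MyScheduleID" }
}
```

The expressions are evaluated at run time and the script receives the resulting values as its parameters.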

 


****  

| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | The list of the currently scheduled active instance objects. | Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | The time when the execution of this object finished. | DateTime | 
| @actualStartTime | The time when the execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | The description of the dependency chain that caused the object failure. | Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Amazon EMR step logs available only on Amazon EMR activity attempts. | String | 
| errorId | The errorId if this object failed. | String | 
| errorMessage | The errorMessage if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs available on attempts for Amazon EMR-based activities. | String | 
| @healthStatus | The health status of the object, which reflects the success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | The Id of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was last updated. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | The time of the latest run for which the execution completed. | DateTime | 
| @latestRunTime | The time of the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | The time of the run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that the remote activity reported progress. | DateTime | 
| @scheduledEndTime | The schedule end time for the object. | DateTime | 
| @scheduledStartTime | The schedule start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | The AWS Data Pipeline version used to create the object. | String | 
| @waitingOn | The description of the list of dependencies this object is waiting on. | Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"} | 

 


****  

| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error describing the ill-formed object. | String | 
| @pipelineId | The Id of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Objects. | String | 