

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Working with Task Runner
<a name="dp-using-task-runner"></a>

 Task Runner is a task agent application that polls AWS Data Pipeline for scheduled tasks and executes them on Amazon EC2 instances, Amazon EMR clusters, or other computational resources, reporting status as it does so. Depending on your application, you may choose to: 
+ Allow AWS Data Pipeline to install and manage one or more Task Runner applications for you. When a pipeline is activated, the default `Ec2Instance` or `EmrCluster` object referenced by an activity's `runsOn` field is automatically created. AWS Data Pipeline takes care of installing Task Runner on an EC2 instance or on the master node of an EMR cluster. In this pattern, AWS Data Pipeline can do most of the instance or cluster management for you. 
+ Run all or parts of a pipeline on resources that you manage. The potential resources include a long-running Amazon EC2 instance, an Amazon EMR cluster, or a physical server. You can install a task runner (which can be either Task Runner or a custom task agent of your own devise) almost anywhere, provided that it can communicate with the AWS Data Pipeline web service. In this pattern, you assume almost complete control over which resources are used and how they are managed, and you must manually install and configure Task Runner. To do so, use the procedures in this section, as described in [Executing Work on Existing Resources Using Task Runner](dp-how-task-runner-user-managed.md). 

# Task Runner on AWS Data Pipeline-Managed Resources
<a name="dp-how-task-runner-dp-managed"></a>

 When a resource is launched and managed by AWS Data Pipeline, the web service automatically installs Task Runner on that resource to process tasks in the pipeline. You specify a computational resource (either an Amazon EC2 instance or an Amazon EMR cluster) for the `runsOn` field of an activity object. When AWS Data Pipeline launches this resource, it installs Task Runner on that resource and configures it to process all activity objects that have their `runsOn` field set to that resource. When AWS Data Pipeline terminates the resource, the Task Runner logs are published to an Amazon S3 location before it shuts down.

![\[Task runner life cycle on an AWS Data Pipeline-managed resource\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/images/dp-task-runner-managed-emr-jobflow.png)


 For example, suppose you use `EmrActivity` in a pipeline and specify an `EmrCluster` resource in its `runsOn` field. When AWS Data Pipeline processes that activity, it launches an Amazon EMR cluster and installs Task Runner onto the master node. This Task Runner then processes the tasks for activities that have their `runsOn` field set to that `EmrCluster` object. The following excerpt from a pipeline definition shows this relationship between the two objects. 

```
{
  "id" : "MyEmrActivity",
  "name" : "Work to perform on my data",
  "type" : "EmrActivity",
  "runsOn" : {"ref" : "MyEmrCluster"},
  "preStepCommand" : "scp remoteFiles localFiles",
  "step" : "s3://amzn-s3-demo-bucket/myPath/myStep.jar,firstArg,secondArg",
  "step" : "s3://amzn-s3-demo-bucket/myPath/myOtherStep.jar,anotherArg",
  "postStepCommand" : "scp localFiles remoteFiles",
  "input" : {"ref" : "MyS3Input"},
  "output" : {"ref" : "MyS3Output"}
},
{
  "id" : "MyEmrCluster",
  "name" : "EMR cluster to perform the work",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keypair" : "myKeyPair",
  "masterInstanceType" : "m1.xlarge",
  "coreInstanceType" : "m1.small",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m1.small",
  "taskInstanceCount": "10",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-hadoop,arg1,arg2,arg3",
  "bootstrapAction" : "s3://elasticmapreduce/libs/ba/configure-other-stuff,arg1,arg2"
}
```

For information and examples of running this activity, see [EmrActivity](dp-object-emractivity.md).

 If you have multiple AWS Data Pipeline-managed resources in a pipeline, Task Runner is installed on each of them, and they all poll AWS Data Pipeline for tasks to process. 

# Executing Work on Existing Resources Using Task Runner
<a name="dp-how-task-runner-user-managed"></a>

You can install Task Runner on computational resources that you manage, such as an Amazon EC2 instance, a physical server, or a workstation. Task Runner can be installed anywhere, on any compatible hardware or operating system, provided that it can communicate with the AWS Data Pipeline web service.

This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization’s firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. The Task Runner logs persist after pipeline execution is complete. 

To use Task Runner on a resource that you manage, you must first download Task Runner, and then install it on your computational resource, using the procedures in this section. 

**Note**  
 You can only install Task Runner on Linux, UNIX, or macOS. Task Runner is not supported on the Windows operating system.   
To use Task Runner 2.0, the minimum Java version needed is 1.7.

To connect a Task Runner that you've installed to the pipeline activities it should process, add a `workerGroup` field to the object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, `--workerGroup=wg-12345`) when you run the Task Runner JAR file. 

![\[AWS Data Pipeline diagram showing monthly task flow for archiving processed invoices using Task Runner.\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/images/dp-task-runner-user-emr-jobflow.png)

For example, the following pipeline definition excerpt assigns a `ShellCommandActivity` to the worker group `wg-12345`:

```
{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "workerGroup" : "wg-12345",
  "command" : "mkdir new-directory"
}
```
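The matching between activity and agent is a plain string comparison on the worker group value. The following sketch illustrates the dispatch idea only; the task records are hypothetical stand-ins, not real AWS Data Pipeline poll responses or Task Runner source code:

```python
def tasks_for_worker(tasks, worker_group):
    """Keep only the tasks whose workerGroup matches the value this
    agent was started with (e.g. --workerGroup=wg-12345)."""
    return [t for t in tasks if t.get("workerGroup") == worker_group]

# Hypothetical polled batch for illustration.
polled = [
    {"id": "CreateDirectory", "workerGroup": "wg-12345"},
    {"id": "OtherActivity", "workerGroup": "wg-99999"},
]
print(tasks_for_worker(polled, "wg-12345"))
# → [{'id': 'CreateDirectory', 'workerGroup': 'wg-12345'}]
```

Because the match is exact, a typo in either the pipeline's `workerGroup` field or the `--workerGroup` parameter means the activity is never picked up.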

## Installing Task Runner
<a name="dp-installing-taskrunner"></a>

This section explains how to install and configure Task Runner and its prerequisites. Installation is a straightforward manual process.

**To install Task Runner**

1. Task Runner requires Java version 1.6 or 1.8. To determine whether Java is installed, and which version is running, use the following command:

   ```
   java -version
   ```

    If you do not have Java 1.6 or 1.8 installed on your computer, download one of these versions from [http://www.oracle.com/technetwork/java/index.html](http://www.oracle.com/technetwork/java/index.html). Download and install Java, and then proceed to the next step.

1. Download `TaskRunner-1.0.jar` from [https://s3.amazonaws.com/datapipeline-us-east-1/us-east-1/software/latest/TaskRunner/TaskRunner-1.0.jar](https://s3.amazonaws.com/datapipeline-us-east-1/us-east-1/software/latest/TaskRunner/TaskRunner-1.0.jar) and then copy it into a folder on the target computing resource. For Amazon EMR clusters running `EmrActivity` tasks, install Task Runner on the master node of the cluster.

1. Task Runner connects to the AWS Data Pipeline web service to process your commands, so users need programmatic access to a role that has permissions to create or manage data pipelines. For more information, see [Granting programmatic access](dp-get-setup.md#dp-grant-programmatic-access).

1. Task Runner connects to the AWS Data Pipeline web service using HTTPS. If you are using an AWS resource, ensure that HTTPS is enabled in the appropriate routing table and subnet ACL. If you are using a firewall or proxy, ensure that port 443 is open.
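As a quick sanity check that the host can reach the service over HTTPS, you can attempt a TCP connection to the regional endpoint on port 443. This is a convenience sketch, not part of Task Runner; substitute your own region's endpoint for the `us-east-1` host name shown here:

```python
import socket

def https_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Substitute your region's AWS Data Pipeline endpoint.
print(https_reachable("datapipeline.us-east-1.amazonaws.com"))
```

A `False` result suggests that a routing table, network ACL, firewall, or proxy is blocking outbound port 443.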

# (Optional) Granting Task Runner Access to Amazon RDS
<a name="dp-taskrunner-rdssecurity"></a>

Amazon RDS allows you to control access to your DB instances using database security groups (DB security groups). A DB security group acts like a firewall controlling network access to your DB instance. By default, network access is turned off for your DB instances. You must modify your DB security groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner. 

**To grant access to Task Runner in EC2-Classic**

1. Open the Amazon RDS console.

1. In the navigation pane, choose **Instances**, and then select your DB instance.

1. Under **Security and Network**, select the security group, which opens the **Security Groups** page with this DB security group selected. Select the details icon for the DB security group.

1. Under **Security Group Details**, create a rule with the appropriate **Connection Type** and **Details**. These fields depend on where Task Runner is running, as described here:
   + `Ec2Resource`
     + **Connection Type**: `EC2 Security Group`

       **Details**: *my-security-group-name* (the name of the security group you created for the EC2 instance)
   + `EmrResource`
     + **Connection Type**: `EC2 Security Group`

       **Details**: `ElasticMapReduce-master`
     + **Connection Type**: `EC2 Security Group`

       **Details**: `ElasticMapReduce-slave`
   + Your local environment (on-premises)
     + **Connection Type**: `CIDR/IP`:

       **Details**: *my-ip-address* (the IP address of your computer or the IP address range of your network, if your computer is behind a firewall)

1. Click **Add**.

**To grant access to Task Runner in EC2-VPC**

1. Open the Amazon RDS console.

1. In the navigation pane, choose **Instances**.

1. Select the details icon for the DB instance. Under **Security and Network**, open the link to the security group, which takes you to the Amazon EC2 console. If you're using the old console design for security groups, switch to the new console design by selecting the icon that's displayed at the top of the console page.

1. On the **Inbound** tab, choose **Edit**, **Add Rule**. Specify the database port that you used when you launched the DB instance. The source depends on where Task Runner is running, as described here:
   + `Ec2Resource`
     + *my-security-group-id* (the ID of the security group you created for the EC2 instance)
   + `EmrResource`
     + *master-security-group-id* (the ID of the `ElasticMapReduce-master` security group)
     + *slave-security-group-id* (the ID of the `ElasticMapReduce-slave` security group)
   + Your local environment (on-premises)
     + *ip-address* (the IP address of your computer or the IP address range of your network, if your computer is behind a firewall)

1. Click **Save**.

## Starting Task Runner
<a name="dp-activate-task-runner"></a>

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command.

```
java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup --region=MyRegion --logUri=s3://amzn-s3-demo-bucket/foldername
```

The `--config` option points to your credentials file.

The `--workerGroup` option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.

The `--region` option specifies the service region from which to pull tasks to execute.

The `--logUri` option is used for pushing your compressed logs to a location in Amazon S3.

When Task Runner is active, it prints the path to where log files are written in the terminal window. The following is an example.

```
Logging to /Computer_Name/.../output/logs
```

Task Runner should be run detached from your login shell. If you are using a terminal application to connect to your computer, you may need to use a utility like `nohup` or `screen` to prevent the Task Runner application from exiting when you log out. For more information about command line options, see [Task Runner Configuration Options](dp-taskrunner-config-options.md).

## Verifying Task Runner Logging
<a name="dp-verify-task-runner"></a>

The easiest way to verify that Task Runner is working is to check whether it is writing log files. Task Runner writes hourly log files to the `output/logs` directory under the directory where Task Runner is installed. The file name is `Task Runner.log.YYYY-MM-DD-HH`, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with GZip.
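The hourly file name follows directly from the current UTC time. A small sketch of the naming scheme described above:

```python
from datetime import datetime, timezone

def log_file_name(when):
    """Build the hourly Task Runner log file name for a UTC timestamp,
    following the 'Task Runner.log.YYYY-MM-DD-HH' pattern."""
    return when.strftime("Task Runner.log.%Y-%m-%d-%H")

ts = datetime(2015, 1, 31, 14, 30, tzinfo=timezone.utc)
print(log_file_name(ts))  # → Task Runner.log.2015-01-31-14
```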

# Task Runner Threads and Preconditions
<a name="dp-taskrunner-threading"></a>

 Task Runner uses a separate thread pool for each of tasks, activities, and preconditions. The default setting for `--tasks` is 2, meaning that two threads are allocated from the tasks pool, each polling the AWS Data Pipeline service for new tasks. Thus, `--tasks` is a performance tuning attribute that can be used to help optimize pipeline throughput.
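The polling model can be pictured as a fixed pool of worker threads, each repeatedly asking the service for a task. The following is a simplified illustration only, not Task Runner's actual implementation; the in-process queue stands in for the AWS Data Pipeline poll API:

```python
import queue
import threading

NUM_TASK_THREADS = 2  # analogous to the --tasks default of 2

def worker(poll_queue, results):
    # Each thread repeatedly polls for a task; a None sentinel stops it.
    while True:
        task = poll_queue.get()
        if task is None:
            break
        results.append(f"ran {task}")

poll_queue = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(poll_queue, results))
           for _ in range(NUM_TASK_THREADS)]
for t in threads:
    t.start()

for task in ["task-1", "task-2", "task-3"]:
    poll_queue.put(task)
for _ in threads:
    poll_queue.put(None)  # one sentinel per thread
for t in threads:
    t.join()

print(sorted(results))  # → ['ran task-1', 'ran task-2', 'ran task-3']
```

Raising the thread count increases how many tasks can be claimed and run concurrently, which is why `--tasks` affects throughput.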

 Pipeline retry logic for preconditions happens in Task Runner. Two precondition threads are allocated to poll AWS Data Pipeline for precondition objects. Task Runner honors the precondition object **retryDelay** and **preconditionTimeout** fields that you define on preconditions. 

In many cases, decreasing the precondition polling timeout and number of retries helps to improve the performance of your application. Similarly, applications with long-running preconditions may need to have the timeout and retry values increased. For more information about precondition objects, see [Preconditions](dp-concepts-preconditions.md).

# Task Runner Configuration Options
<a name="dp-taskrunner-config-options"></a>

These are the configuration options available from the command line when you launch Task Runner. 



| Command Line Parameter | Description | 
| --- | --- | 
| `--help` | Command line help. Example: `java -jar TaskRunner-1.0.jar --help` | 
| `--config` | The path and file name of your `credentials.json` file. | 
| `--accessId` | Your AWS access key ID for Task Runner to use when making requests. The `--accessId` and `--secretKey` options provide an alternative to using a `credentials.json` file. If a `credentials.json` file is also provided, the `--accessId` and `--secretKey` options take precedence.  | 
| `--secretKey` | Your AWS secret key for Task Runner to use when making requests. For more information, see `--accessId`.  | 
| `--endpoint` | An endpoint is a URL that is the entry point for a web service. The AWS Data Pipeline service endpoint in the region where you are making requests. Optional. In general, it is sufficient to specify a region, and you do not need to set the endpoint. For a listing of AWS Data Pipeline regions and endpoints, see [AWS Data Pipeline Regions and Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region) in the *AWS General Reference*. | 
| `--workerGroup` | The name of the worker group for which Task Runner retrieves work. Required. When Task Runner polls the web service, it uses the credentials you supplied and the value of `workerGroup` to select which (if any) tasks to retrieve. You can use any name that is meaningful to you; the only requirement is that the string must match between the Task Runner and its corresponding pipeline activities. The worker group name is bound to a region. Even if there are identical worker group names in other regions, Task Runner always gets tasks from the region specified in `--region`. | 
| `--taskrunnerId` | The ID of the task runner to use when reporting progress. Optional. | 
| `--output` | The Task Runner directory for log output files. Optional. Log files are stored in a local directory until they are pushed to Amazon S3. This option overrides the default directory.  | 
| `--region` | The region to use. Optional, but it is recommended to always set the region. If you do not specify the region, Task Runner retrieves tasks from the default service region, `us-east-1`. Other supported regions are: `eu-west-1`, `ap-northeast-1`, `ap-southeast-2`, `us-west-2`.  | 
| `--logUri` | The Amazon S3 destination path for Task Runner to back up log files to every hour. When Task Runner terminates, active logs in the local directory are pushed to the Amazon S3 destination folder.  | 
| `--proxyHost` | The host of the proxy used by Task Runner clients to connect to AWS services. | 
| `--proxyPort` | The port of the proxy host used by Task Runner clients to connect to AWS services. | 
| `--proxyUsername` | The user name for the proxy. | 
| `--proxyPassword` | The password for the proxy. | 
| `--proxyDomain` | The Windows domain name for NTLM proxy. | 
| `--proxyWorkstation` | The Windows workstation name for NTLM proxy. | 

# Using Task Runner with a Proxy
<a name="dp-taskrunner-proxy"></a>

If you are using a proxy host, you can either specify its [configuration](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-taskrunner-config-options.html) when invoking Task Runner or set the `HTTPS_PROXY` environment variable. The environment variable used with Task Runner accepts the same configuration used for the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/cli-http-proxy.html). 

# Task Runner and Custom AMIs
<a name="dp-custom-ami"></a>

When you specify an `Ec2Resource` object for your pipeline, AWS Data Pipeline creates an EC2 instance for you, using an AMI that installs and configures Task Runner for you. A PV-compatible instance type is required in this case. Alternatively, you can create a custom AMI with Task Runner, and then specify the ID of this AMI using the `imageId` field of the `Ec2Resource` object. For more information, see [Ec2Resource](dp-object-ec2resource.md).

A custom AMI must meet the following requirements for AWS Data Pipeline to use it successfully for Task Runner:
+ Create the AMI in the same region in which the instances will run. For more information, see [Creating Your Own AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami.html) in the *Amazon EC2 User Guide*.
+ Ensure that the virtualization type of the AMI is supported by the instance type you plan to use. For example, the I2 and G2 instance types require an HVM AMI and the T1, C1, M1, and M2 instance types require a PV AMI. For more information, see [Linux AMI Virtualization Types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html) in the *Amazon EC2 User Guide*.
+ Install the following software:
  + Linux
  + Bash
  + wget
  + unzip
  + Java 1.6 or 1.8
  + cloud-init
+ Create and configure a user named `ec2-user`.
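To sanity-check an image against the software list above, you can probe for each required executable on `PATH`. This is a convenience sketch to run on an instance launched from the AMI, not an AWS-provided validation tool; it assumes each item (including `cloud-init`) is installed as a command on `PATH`:

```python
import shutil

# Executables from the requirements list above.
required = ["bash", "wget", "unzip", "java", "cloud-init"]

missing = [tool for tool in required if shutil.which(tool) is None]
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all required tools found")
```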