

AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Executing Work on Existing Resources Using Task Runner
<a name="dp-how-task-runner-user-managed"></a>

You can install Task Runner on computational resources that you manage, such as an Amazon EC2 instance, or a physical server or workstation. Task Runner can be installed anywhere, on any compatible hardware or operating system, provided that it can communicate with the AWS Data Pipeline web service.

This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization’s firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. The Task Runner logs persist after pipeline execution is complete. 

To use Task Runner on a resource that you manage, you must first download Task Runner, and then install it on your computational resource, using the procedures in this section. 

**Note**  
 You can only install Task Runner on Linux, UNIX, or macOS. Task Runner is not supported on the Windows operating system.   
To use Task Runner 2.0, the minimum Java version needed is 1.7.

To connect a Task Runner that you've installed to the pipeline activities it should process, add a `workerGroup` field to the object, and configure Task Runner to poll for that worker group value. You do this by passing the worker group string as a parameter (for example, `--workerGroup=wg-12345`) when you run the Task Runner JAR file. 

![\[AWS Data Pipeline diagram showing monthly task flow for archiving processed invoices using Task Runner.\]](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/images/dp-task-runner-user-emr-jobflow.png)


```
{
  "id" : "CreateDirectory",
  "type" : "ShellCommandActivity",
  "workerGroup" : "wg-12345",
  "command" : "mkdir new-directory"
}
```

## Installing Task Runner
<a name="dp-installing-taskrunner"></a>

This section explains how to install and configure Task Runner and its prerequisites. Installation is a straightforward manual process.

**To install Task Runner**

1. Task Runner requires Java version 1.6 or 1.8. To determine whether Java is installed, and which version is running, use the following command:

   ```
   java -version
   ```

   If Java 1.6 or 1.8 is not installed on your computer, download and install one of these versions from [http://www.oracle.com/technetwork/java/index.html](http://www.oracle.com/technetwork/java/index.html), and then proceed to the next step.

1. Download `TaskRunner-1.0.jar` from [https://s3.amazonaws.com/datapipeline-us-east-1/us-east-1/software/latest/TaskRunner/TaskRunner-1.0.jar](https://s3.amazonaws.com/datapipeline-us-east-1/us-east-1/software/latest/TaskRunner/TaskRunner-1.0.jar) and then copy it into a folder on the target computing resource. For Amazon EMR clusters running `EmrActivity` tasks, install Task Runner on the master node of the cluster.

1. Users who run Task Runner to connect to the AWS Data Pipeline web service and process commands need programmatic access to a role that has permissions to create or manage data pipelines. For more information, see [Granting programmatic access](dp-get-setup.md#dp-grant-programmatic-access).

1. Task Runner connects to the AWS Data Pipeline web service using HTTPS. If you are using an AWS resource, ensure that HTTPS is enabled in the appropriate routing table and subnet ACL. If you are using a firewall or proxy, ensure that port 443 is open.
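
The Java check in step 1 can also be scripted. The following is a minimal sketch, not part of Task Runner: `extract_java_version` is a hypothetical helper name, and the sketch assumes that the first line printed by `java -version` contains the version in double quotes.

```shell
# extract_java_version is a hypothetical helper: it pulls the quoted version
# string out of a line such as: java version "1.8.0_292"
extract_java_version() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p'
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so merge it into stdout before reading it.
  first_line=$(java -version 2>&1 | head -n 1)
  echo "Found Java version: $(extract_java_version "$first_line")"
else
  echo "Java is not on the PATH; install it before continuing."
fi
```

Compare the reported version with the requirement stated in step 1 before you download the Task Runner JAR file.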

## (Optional) Granting Task Runner Access to Amazon RDS
<a name="dp-taskrunner-rdssecurity"></a>

Amazon RDS allows you to control access to your DB instances using database security groups (DB security groups). A DB security group acts like a firewall controlling network access to your DB instance. By default, network access is turned off for your DB instances. You must modify your DB security groups to let Task Runner access your Amazon RDS instances. Task Runner gains Amazon RDS access from the instance on which it runs, so the accounts and security groups that you add to your Amazon RDS instance depend on where you install Task Runner. 

**To grant access to Task Runner in EC2-Classic**

1. Open the Amazon RDS console.

1. In the navigation pane, choose **Instances**, and then select your DB instance.

1. Under **Security and Network**, select the security group, which opens the **Security Groups** page with this DB security group selected. Select the details icon for the DB security group.

1. Under **Security Group Details**, create a rule with the appropriate **Connection Type** and **Details**. These fields depend on where Task Runner is running, as described here:
   + `Ec2Resource`
     + **Connection Type**: `EC2 Security Group`

       **Details**: *my-security-group-name* (the name of the security group you created for the EC2 instance)
   + `EmrResource`
     + **Connection Type**: `EC2 Security Group`

       **Details**: `ElasticMapReduce-master`
     + **Connection Type**: `EC2 Security Group`

       **Details**: `ElasticMapReduce-slave`
   + Your local environment (on-premises)
     + **Connection Type**: `CIDR/IP`

       **Details**: *my-ip-address* (the IP address of your computer or the IP address range of your network, if your computer is behind a firewall)

1. Click **Add**.

**To grant access to Task Runner in EC2-VPC**

1. Open the Amazon RDS console.

1. In the navigation pane, choose **Instances**.

1. Select the details icon for the DB instance. Under **Security and Network**, open the link to the security group, which takes you to the Amazon EC2 console. If you're using the old console design for security groups, switch to the new console design by selecting the icon that's displayed at the top of the console page.

1. On the **Inbound** tab, choose **Edit**, **Add Rule**. Specify the database port that you used when you launched the DB instance. The source depends on where Task Runner is running, as described here:
   + `Ec2Resource`
     + *my-security-group-id* (the ID of the security group you created for the EC2 instance)
   + `EmrResource`
     + *master-security-group-id* (the ID of the `ElasticMapReduce-master` security group)
     + *slave-security-group-id* (the ID of the `ElasticMapReduce-slave` security group)
   + Your local environment (on-premises)
     + *ip-address* (the IP address of your computer or the IP address range of your network, if your computer is behind a firewall)

1. Click **Save**.
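
If you prefer the command line to the console, a similar inbound rule for the EC2-VPC case can be added with the AWS CLI. The following sketch only prints the command instead of running it. The security group IDs and the port are placeholder values, and `build_ingress_command` is a hypothetical helper name; the sketch assumes that the AWS CLI is installed and configured on the computer where you run the printed command.

```shell
# Sketch only: print the AWS CLI command that would add an inbound rule.
# DB_SG_ID, SOURCE_SG_ID, and DB_PORT are placeholder values (assumptions).
DB_SG_ID="sg-0123456789abcdef0"      # security group attached to the DB instance
SOURCE_SG_ID="sg-0fedcba9876543210"  # security group of the Task Runner host
DB_PORT=3306                         # the port you chose when launching the DB instance

build_ingress_command() {
  # $1 = DB security group ID, $2 = database port, $3 = source security group ID
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --source-group %s\n' "$1" "$2" "$3"
}

build_ingress_command "$DB_SG_ID" "$DB_PORT" "$SOURCE_SG_ID"
```

Review the printed command, substitute your own values, and then run it. For an on-premises Task Runner, replace `--source-group` with `--cidr` and your IP address range.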

## Starting Task Runner
<a name="dp-activate-task-runner"></a>

In a new command prompt window that is set to the directory where you installed Task Runner, start Task Runner with the following command.

```
java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myWorkerGroup --region=MyRegion --logUri=s3://amzn-s3-demo-bucket/foldername
```

The `--config` option points to your credentials file.

The `--workerGroup` option specifies the name of your worker group, which must be the same value as specified in your pipeline for tasks to be processed.

The `--region` option specifies the service region from which to pull tasks to execute.

The `--logUri` option is used for pushing your compressed logs to a location in Amazon S3.

When Task Runner is active, it prints the path of the directory where it writes log files to the terminal window. The following is an example.

```
Logging to /Computer_Name/.../output/logs
```

Task Runner should be run detached from your login shell. If you are using a terminal application to connect to your computer, you may need to use a utility like nohup or screen to prevent the Task Runner application from exiting when you log out. For more information about command line options, see [Task Runner Configuration Options](dp-taskrunner-config-options.md).
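
One way to run Task Runner detached is with `nohup`. The following is a minimal sketch under stated assumptions: `start_task_runner_detached` is a hypothetical function name, `taskrunner.out` is an arbitrary file chosen here to capture console output, and the option values are the same placeholders used in the command above.

```shell
# Sketch: start Task Runner in the background so that it survives logout.
# The worker group, Region, bucket, and taskrunner.out are placeholder values.
start_task_runner_detached() {
  # nohup makes the process ignore the hangup signal sent at logout;
  # the trailing & returns control to the shell immediately.
  nohup java -jar TaskRunner-1.0.jar \
    --config ~/credentials.json \
    --workerGroup=myWorkerGroup \
    --region=MyRegion \
    --logUri=s3://amzn-s3-demo-bucket/foldername \
    > taskrunner.out 2>&1 &
  echo "Task Runner started with PID $!"
}
```

Call `start_task_runner_detached` from the directory that contains the JAR file. The printed PID identifies the background process, which you can later stop with `kill`.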

## Verifying Task Runner Logging
<a name="dp-verify-task-runner"></a>

The easiest way to verify that Task Runner is working is to check whether it is writing log files. Task Runner writes hourly log files to the `output/logs` directory under the directory where Task Runner is installed. The file name is `Task Runner.log.YYYY-MM-DD-HH`, where HH runs from 00 to 23, in UTC. To save storage space, any log files older than eight hours are compressed with gzip.
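
This check can be sketched as a short script. It assumes the file name pattern given above and that you run it from the Task Runner installation directory; `LOG_DIR` and `current_log_name` are hypothetical names introduced for illustration.

```shell
# Sketch: build the log file name for the current UTC hour and check for it.
# LOG_DIR is an assumption; set it to output/logs under your install directory.
LOG_DIR="${LOG_DIR:-./output/logs}"

current_log_name() {
  # date -u formats the time in UTC, matching the YYYY-MM-DD-HH suffix.
  printf 'Task Runner.log.%s\n' "$(date -u +%Y-%m-%d-%H)"
}

log_file="$LOG_DIR/$(current_log_name)"
if [ -f "$log_file" ]; then
  echo "Found current log: $log_file"
else
  echo "No log for the current hour at: $log_file"
fi
```

If no file is found shortly after Task Runner starts, list the contents of the log directory to confirm the actual file names before concluding that logging has failed.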