AWS Data Pipeline is no longer available to new customers. Existing customers of AWS Data Pipeline can continue to use the service as normal. [Learn more](https://aws.amazon.com/blogs/big-data/migrate-workloads-from-aws-data-pipeline/)

# Resources
<a name="dp-object-resources"></a>

The following are the AWS Data Pipeline resource objects:

**Topics**
+ [Ec2Resource](dp-object-ec2resource.md)
+ [EmrCluster](dp-object-emrcluster.md)
+ [HttpProxy](dp-object-httpproxy.md)

# Ec2Resource
<a name="dp-object-ec2resource"></a>

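An Amazon EC2 instance that performs the work defined by a pipeline activity.

AWS Data Pipeline now supports IMDSv2 for Amazon EC2 instances. IMDSv2 uses a session-oriented method to better handle authentication when retrieving metadata information from an instance. A session begins and ends a series of requests that software running on the Amazon EC2 instance uses to access the locally stored instance metadata and credentials. The software starts a session with a simple HTTP PUT request to IMDSv2. IMDSv2 returns a secret token to the software, which then uses the token as a password when requesting metadata and credentials from IMDSv2.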
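
As a rough illustration of the session flow described above, the following sketch builds the two HTTP requests involved: a PUT that asks IMDSv2 for a session token, and a GET that presents the token. The `169.254.169.254` endpoint, the `/latest/api/token` path, and the `X-aws-ec2-metadata-token` headers come from the Amazon EC2 instance metadata documentation rather than from this page. The function names and the TTL value are illustrative assumptions. Nothing is sent until `fetch_metadata` is called, and that call only succeeds from inside an EC2 instance.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service (from the EC2 docs,
# not from this page).
IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds=21600):
    """Build the HTTP PUT request that starts an IMDSv2 session."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path, token):
    """Build a GET request that presents the session token like a password."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path, timeout=2):
    """Run the whole session: request a token, then read one metadata path.

    Hypothetical helper. It only works on an EC2 instance; elsewhere the
    connection attempt fails or times out.
    """
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request(path, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance, `fetch_metadata("instance-id")` would return the instance ID. A plain GET without the token header is the IMDSv1 style of access.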

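**Note**  
To use IMDSv2 for your Amazon EC2 instance, you need to modify the settings, because the default AMI is not compatible with IMDSv2. You can specify a new AMI version, which you can retrieve through the following SSM parameter: `/aws/service/ami-amazon-linux-latest/amzn-ami-hvm-x86_64-ebs`.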
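
The sketch below shows one way the settings mentioned in this note might be expressed: an `Ec2Resource` object that points `imageId` at a newer AMI and sets `disableIMDSv1` so that only IMDSv2 is served. Both fields appear in the syntax table for this object. The helper name is hypothetical and the AMI ID is a placeholder; in practice you would substitute the value resolved from the SSM parameter above.

```python
import json


def imdsv2_only_ec2_resource(image_id, instance_type="m5.large"):
    """Return an Ec2Resource pipeline object that serves only IMDSv2.

    image_id is assumed to be an IMDSv2-compatible AMI, for example the value
    read from the SSM parameter
    /aws/service/ami-amazon-linux-latest/amzn-ami-hvm-x86_64-ebs.
    """
    return {
        "id": "MyEC2Resource",
        "type": "Ec2Resource",
        "instanceType": instance_type,
        "imageId": image_id,
        # Written as a string, like "associatePublicIpAddress" in the
        # EC2-VPC example on this page.
        "disableIMDSv1": "true",
    }


# "ami-0123456789abcdef0" is a placeholder, not a real AMI ID.
print(json.dumps(imdsv2_only_ec2_resource("ami-0123456789abcdef0"), indent=2))
```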

For information about the default Amazon EC2 instances that AWS Data Pipeline creates if you do not specify an instance, see [Default Amazon EC2 instances by AWS Region](dp-ec2-default-instance-types.md).

## Examples
<a name="ec2resource-example"></a>

**EC2-Classic**

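**Important**  
Only AWS accounts created before December 4, 2013 support the EC2-Classic platform. If you have one of these accounts, you may have the option to create EC2Resource objects for a pipeline in an EC2-Classic network rather than a VPC. We strongly recommend that you create resources for all your pipelines in VPCs. In addition, if you have existing resources in EC2-Classic, we recommend that you migrate them to a VPC.

The following example object launches an EC2 instance into EC2-Classic, with some optional fields set.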

```
{
  "id" : "MyEC2Resource",
  "type" : "Ec2Resource",
  "actionOnTaskFailure" : "terminate",
  "actionOnResourceFailure" : "retryAll",
  "maximumRetries" : "1",
  "instanceType" : "m5.large",
  "securityGroups" : [
    "test-group",
    "default"
  ],
  "keyPair" : "my-key-pair"
}
```

**EC2-VPC**

The following example object launches an EC2 instance into a nondefault VPC, with some optional fields set.

```
{
  "id" : "MyEC2Resource",
  "type" : "Ec2Resource",
  "actionOnTaskFailure" : "terminate",
  "actionOnResourceFailure" : "retryAll",
  "maximumRetries" : "1",
  "instanceType" : "m5.large",
  "securityGroupIds" : [
    "sg-12345678",
    "sg-12345678"
  ],
  "subnetId": "subnet-12345678",
  "associatePublicIpAddress": "true",
  "keyPair" : "my-key-pair"
}
```

## Syntax
<a name="ec2resource-syntax"></a>


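| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| resourceRole | The IAM role that controls the resources that the Amazon EC2 instance can access. | String | 
| role | The IAM role that AWS Data Pipeline uses to create the EC2 instance. | String | 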

 
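| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule |  This object is invoked within the execution of a schedule interval. To set the dependency execution order for this object, specify a schedule reference to another object. You can do this in one of the following ways: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html)  | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 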

 
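| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| actionOnResourceFailure | The action taken after a resource failure for this resource. Valid values are "retryall" and "retrynone". | String | 
| actionOnTaskFailure | The action taken after a task failure for this resource. Valid values are "continue" or "terminate". | String | 
| associatePublicIpAddress | Indicates whether to assign a public IP address to the instance. If the instance is in Amazon EC2 or Amazon VPC, the default value is true. Otherwise, the default value is false. | Boolean | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, a remote activity that does not complete within the set time of starting may be retried. | Period | 
| availabilityZone | The Availability Zone in which to launch the Amazon EC2 instance. | String | 
| disableIMDSv1 | The default value is false, which enables both IMDSv1 and IMDSv2. If you set it to true, IMDSv1 is disabled and only IMDSv2 is provided. | Boolean | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| httpProxy | The proxy host that clients use to connect to AWS services. | Reference Object, for example, "httpProxy":{"ref":"myHttpProxyId"} | 
| imageId | The ID of the AMI to use for the instance. By default, AWS Data Pipeline uses the HVM AMI virtualization type. The specific AMI IDs used depend on the Region. You can override the default AMI by specifying an HVM AMI of your choice. For more information about AMI types, see [Linux AMI Virtualization Types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualization_types.html) and [Finding a Linux AMI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/finding-an-ami.html) in the *Amazon EC2 User Guide*.  | String | 
| initTimeout | The amount of time to wait for the resource to start. | Period | 
| instanceCount | Deprecated. | Integer | 
| instanceType | The type of Amazon EC2 instance to start. | String | 
| keyPair | The name of the key pair. If you launch an Amazon EC2 instance without specifying a key pair, you cannot log on to it. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Reruns do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of retries on failure. | Integer | 
| minInstanceCount | Deprecated. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or is still running. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| parent | The parent of the current object, from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| region |  The code for the Region in which the Amazon EC2 instance should run. By default, the instance runs in the same Region as the pipeline. You can run the instance in the same Region as a dependent dataset. | Enumeration | 
| reportProgressTimeout | The timeout for successive calls by remote work to reportProgress. If set, remote activities that do not report progress for the specified period may be considered stalled and will be retried. | Period | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| runAsUser | The user to run TaskRunner as. | String | 
| runsOn | This field is not allowed on this object. | Reference Object, for example, "runsOn":{"ref":"myResourceId"} | 
| scheduleType |  The schedule type lets you specify whether the objects in your pipeline definition should be scheduled at the beginning of an interval, at the end of an interval, or on demand. Values are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html)  | Enumeration | 
| securityGroupIds | The IDs of one or more Amazon EC2 security groups to use for the instances in the resource pool. | String | 
| securityGroups | One or more Amazon EC2 security groups to use for the instances in the resource pool. | String | 
| spotBidPrice | The maximum amount per hour for your Spot Instance in dollars, which is a decimal value between 0 and 20.00, exclusive. | String | 
| subnetId | The ID of the Amazon EC2 subnet in which to start the instance. | String | 
| terminateAfter | The number of hours after which to terminate the resource. | Period | 
| useOnDemandOnLastAttempt | On the last attempt to request a Spot Instance, make a request for an On-Demand Instance rather than a Spot Instance. This ensures that if all previous attempts have failed, the last attempt is not interrupted. | Boolean | 
| workerGroup | This field is not allowed on this object. | String | 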
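
To make the Spot-related fields in this table concrete, here is a minimal sketch that checks a bid against the documented range before producing the `spotBidPrice` and `useOnDemandOnLastAttempt` fields. The helper name is hypothetical, and treating both ends of the range as excluded is an assumption; the table only says the value lies between 0 and 20.00, exclusive.

```python
from decimal import Decimal, InvalidOperation


def spot_fields(bid_price, fall_back_to_on_demand=True):
    """Return Ec2Resource fields for a Spot request after checking the bid."""
    try:
        price = Decimal(str(bid_price))
    except InvalidOperation:
        raise ValueError(f"not a decimal value: {bid_price!r}")
    # Assumption: 0 and 20.00 are both excluded from the valid range.
    if not price.is_finite() or not (Decimal("0") < price < Decimal("20.00")):
        raise ValueError("spotBidPrice must be greater than 0 and less than 20.00")
    return {
        # The table lists spotBidPrice with the slot type String.
        "spotBidPrice": str(price),
        "useOnDemandOnLastAttempt": "true" if fall_back_to_on_demand else "false",
    }


resource = {"id": "MyEC2Resource", "type": "Ec2Resource", **spot_fields("0.25")}
print(resource)
```

Setting `useOnDemandOnLastAttempt` trades cost for reliability: earlier attempts use Spot capacity, and only the final retry requests an On-Demand Instance.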

 
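| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | The time when execution of this object finished. | DateTime | 
| @actualStartTime | The time when execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain on which the object failed. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Step logs, available only on Amazon EMR activity attempts. | String | 
| errorId | The error ID if this object failed. | String | 
| errorMessage | The error message if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @failureReason | The reason for the resource failure. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs, available on attempts for Amazon EMR activities. | String | 
| @healthStatus | The health status of the object, which reflects the success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | The ID of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was last updated. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | The time of the latest run for which the execution completed. | DateTime | 
| @latestRunTime | The time of the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | The time of the run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that the remote activity reported progress. | DateTime | 
| @scheduledEndTime | The scheduled end time for the object. | DateTime | 
| @scheduledStartTime | The scheduled start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | The pipeline version with which the object was created. | String | 
| @waitingOn | Description of the list of dependencies on which this object is waiting. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 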

 
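| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | The error describing the ill-formed object. | String | 
| @pipelineId | The ID of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component objects give rise to instance objects, which execute attempt objects. | String | 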

# EmrCluster
<a name="dp-object-emrcluster"></a>

Represents the configuration of an Amazon EMR cluster. This object is used by [EmrActivity](dp-object-emractivity.md) and [HadoopActivity](dp-object-hadoopactivity.md) to launch a cluster.

**Topics**
+ [Schedulers](#emrcluster-schedulers)
+ [Amazon EMR release versions](#dp-emrcluster-release-versions)
+ [Amazon EMR permissions](#w2aac52c17b9c11)
+ [Syntax](#emrcluster-syntax)
+ [Examples](emrcluster-example.md)
+ [See also](#emrcluster-seealso)

## Schedulers
<a name="emrcluster-schedulers"></a>

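Schedulers provide a way to specify resource allocation and job prioritization within a Hadoop cluster. Administrators or users can choose a scheduler for various classes of users and applications. A scheduler may use queues to allocate resources to users and applications. You set up those queues when you create the cluster. You can then give certain types of work and users priority over others. This makes efficient use of cluster resources while allowing more than one user to submit work to the cluster. Three types of scheduler are available:
+ [FairScheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html): Attempts to schedule resources evenly over a significant period of time.
+ [CapacityScheduler](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html): Uses queues to let cluster administrators assign users to queues of varying priority and resource allocation.
+ Default: Used by the cluster, and can be configured by your site.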
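
A scheduler is selected on an `EmrCluster` object through the `hadoopSchedulerType` field listed in the syntax table. The following sketch validates a value before setting it. The helper name is hypothetical, and the pairing of each value with a scheduler in the comments is an inference from the names; this page does not state the mapping explicitly.

```python
# Valid hadoopSchedulerType values as listed in the EmrCluster syntax table.
VALID_SCHEDULER_TYPES = (
    "PARALLEL_FAIR_SCHEDULING",      # presumably FairScheduler
    "PARALLEL_CAPACITY_SCHEDULING",  # presumably CapacityScheduler
    "DEFAULT_SCHEDULER",             # the cluster default
)


def with_scheduler(cluster, scheduler_type):
    """Return a copy of an EmrCluster object with hadoopSchedulerType set."""
    if scheduler_type not in VALID_SCHEDULER_TYPES:
        raise ValueError(f"unknown scheduler type: {scheduler_type!r}")
    return {**cluster, "hadoopSchedulerType": scheduler_type}


cluster = {"id": "MyEmrCluster", "type": "EmrCluster"}
print(with_scheduler(cluster, "PARALLEL_FAIR_SCHEDULING"))
```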

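## Amazon EMR release versions
<a name="dp-emrcluster-release-versions"></a>

An Amazon EMR release is a set of open-source applications from the big data ecosystem. Each release comprises the big data applications, components, and features that you choose to have Amazon EMR install and configure when you create a cluster. You specify the release version using a release label. Release labels have the form `emr-x.x.x`, for example `emr-5.30.0`. Amazon EMR clusters based on release label `emr-4.0.0` and later use the `releaseLabel` property to specify the release label of an `EmrCluster` object. Earlier versions use the `amiVersion` property.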
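
The rule above can be sketched as a small helper that picks the property name from the major version. The function name is hypothetical, and passing pre-4.0 versions through unchanged as `amiVersion` values is an assumption made for illustration.

```python
def version_field(version):
    """Map a dotted version such as "4.1.0" to the matching EmrCluster field.

    Versions 4.0.0 and later become a release label of the form emr-x.x.x.
    Earlier versions are returned unchanged as amiVersion (an assumption about
    how AMI versions are written).
    """
    major = int(version.split(".")[0])
    if major >= 4:
        return {"releaseLabel": f"emr-{version}"}
    return {"amiVersion": version}


cluster = {"id": "MyEmrCluster", "type": "EmrCluster", **version_field("5.30.0")}
print(cluster)
```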

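**Important**  
All Amazon EMR clusters created using release version 5.22.0 or later use [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html) to authenticate requests to Amazon S3. Some earlier release versions use Signature Version 2. Support for Signature Version 2 is being discontinued. For more information, see [Amazon S3 Update: SigV2 Deprecation Period Extended and Modified](https://aws.amazon.com/blogs/aws/amazon-s3-update-sigv2-deprecation-period-extended-modified/). We strongly recommend that you use an Amazon EMR release version that supports Signature Version 4. For earlier releases, beginning with EMR 4.7.x, the most recent release in each series has been updated to support Signature Version 4. When using an earlier EMR release, we recommend that you use the latest release in the series. In addition, avoid releases earlier than EMR 4.7.0.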

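### Considerations and limitations
<a name="dp-emrcluster-considerations"></a>

#### Use the latest version of Task Runner
<a name="dp-task-runner-latest"></a>

If you use a self-managed `EmrCluster` object with a release label, use the latest Task Runner. For more information about Task Runner, see [Working with Task Runner](dp-using-task-runner.md). You can configure property values for all Amazon EMR configuration classifications. For more information, see [Configuring Applications](https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html) in the *Amazon EMR Release Guide*, and the [EmrConfiguration](dp-object-emrconfiguration.md) and [Property](dp-object-property.md) object references.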

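#### Support for IMDSv2
<a name="dp-emr-imdsv2-support"></a>

Previously, AWS Data Pipeline supported only IMDSv1. Now, AWS Data Pipeline supports IMDSv2 in Amazon EMR 5.23.1, 5.27.1, and 5.32 or later, and in Amazon EMR 6.2 or later. IMDSv2 uses a session-oriented method to better handle authentication when retrieving metadata information from instances. To configure your instances to make IMDSv2 calls, create user-managed resources using TaskRunner-2.0.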

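#### Amazon EMR 5.32 or later and Amazon EMR 6.x
<a name="dp-emr-6-classpath"></a>

The Amazon EMR 5.32 or later and 6.x release series use Hadoop version 3.x, which introduced breaking changes in how the Hadoop classpath is evaluated compared to Hadoop version 2.x. Common libraries such as Joda-Time were removed from the classpath.

If [EmrActivity](dp-object-emractivity.md) or [HadoopActivity](dp-object-hadoopactivity.md) runs a Jar file that depends on a library that was removed in Hadoop 3.x, the step fails with the error `java.lang.NoClassDefFoundError` or `java.lang.ClassNotFoundException`. This can happen for Jar files that ran without issues on Amazon EMR 5.x release versions.

To fix the issue, you must copy the Jar file dependencies to the Hadoop classpath on the `EmrCluster` object before starting the `EmrActivity` or the `HadoopActivity`. We provide a bash script to do this. The script is available at the following location, where *MyRegion* is the AWS Region in which your `EmrCluster` object runs, for example `us-west-2`.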

```
s3://datapipeline-MyRegion/MyRegion/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh
```

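How you run the script depends on whether `EmrActivity` or `HadoopActivity` runs on a resource managed by AWS Data Pipeline or on a self-managed resource.

If you use a resource managed by AWS Data Pipeline, add a `bootstrapAction` to the `EmrCluster` object. The `bootstrapAction` specifies the script and the Jar files to copy as arguments. You can add up to 255 `bootstrapAction` fields per `EmrCluster` object, and you can add a `bootstrapAction` field to an `EmrCluster` object that already has bootstrap actions.

To specify this script as a bootstrap action, use the following syntax, where `JarFileRegion` is the Region in which the Jar files are saved, and each `MyJarFileN` (`MyJarFile1`, `MyJarFile2`, and so on) is the absolute path in Amazon S3 of a Jar file to copy to the Hadoop classpath. Do not specify Jar files that are in the Hadoop classpath by default.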

```
s3://datapipeline-MyRegion/MyRegion/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh,JarFileRegion,MyJarFile1,MyJarFile2[, ...]
```
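
Because the bootstrap action is a single comma-separated string, it can be convenient to assemble it programmatically. The following is a minimal sketch under that reading of the syntax above; the function names are hypothetical, and the S3 paths passed in are the sample values used on this page.

```python
def script_url(region):
    """Return the S3 location of the copy script for the cluster's Region."""
    return (
        f"s3://datapipeline-{region}/{region}"
        "/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh"
    )


def bootstrap_action(cluster_region, jar_file_region, jar_files):
    """Join the script URL, the Jar file Region, and the Jar paths with commas."""
    if not jar_files:
        raise ValueError("at least one Jar file path is required")
    return ",".join([script_url(cluster_region), jar_file_region, *jar_files])


action = bootstrap_action(
    "us-west-2",
    "us-west-2",
    [
        "s3://path/to/my-jar-file.jar",
        "s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/"
        "emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar",
    ],
)
print(action)
```

The resulting string is what goes inside the `bootstrapAction` array of the `EmrCluster` object.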

The following example specifies a bootstrap action that copies two Jar files in Amazon S3: `my-jar-file.jar` and `emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar`. The example uses the us-west-2 Region.

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m5.xlarge",
  "coreInstanceType" : "m5.xlarge",
  "coreInstanceCount" : "2",
  "taskInstanceType" : "m5.xlarge",
  "taskInstanceCount": "2",
  "bootstrapAction" : ["s3://datapipeline-us-west-2/us-west-2/bootstrap-actions/latest/TaskRunner/copy-jars-to-hadoop-classpath.sh,us-west-2,s3://path/to/my-jar-file.jar,s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar"]
}
```

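Save and activate the pipeline so that the change to the new `bootstrapAction` takes effect.

If you use a self-managed resource, you can download the script to the cluster instance and run it from the command line using SSH. The script creates a directory named `/etc/hadoop/conf/shellprofile.d` and a file named `datapipeline-jars.sh` in that directory. The Jar files provided as command-line arguments are copied to a directory that the script creates, named `/home/hadoop/datapipeline_jars`. If your cluster is set up differently, modify the script accordingly after downloading it.

The syntax for running the script on the command line differs slightly from the `bootstrapAction` syntax shown in the previous example. Use spaces instead of commas between arguments, as shown in the following example.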

```
./copy-jars-to-hadoop-classpath.sh us-west-2 s3://path/to/my-jar-file.jar s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar
```
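
If you already have a `bootstrapAction` string, the command-line form can be derived from it mechanically. The sketch below drops the S3 script URL, substitutes the local script path, and separates the remaining arguments with spaces. The function name is hypothetical, and the approach assumes that no Jar path contains a comma.

```python
import shlex


def bootstrap_action_to_command(action, script="./copy-jars-to-hadoop-classpath.sh"):
    """Turn a comma-separated bootstrapAction string into a shell command."""
    # The first element is the S3 URL of the script; the rest are its arguments.
    _script_url, *args = action.split(",")
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join([script, *args])


action = (
    "s3://datapipeline-us-west-2/us-west-2/bootstrap-actions/latest/TaskRunner/"
    "copy-jars-to-hadoop-classpath.sh,us-west-2,s3://path/to/my-jar-file.jar,"
    "s3://dynamodb-dpl-us-west-2/emr-ddb-storage-handler/4.14.0/"
    "emr-dynamodb-tools-4.14.0-jar-with-dependencies.jar"
)
print(bootstrap_action_to_command(action))
```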

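## Amazon EMR permissions
<a name="w2aac52c17b9c11"></a>

When you create a custom IAM role, carefully consider the minimum permissions that your cluster needs to perform its work. Be sure to grant access to required resources, such as files in Amazon S3 or data in Amazon RDS, Amazon Redshift, or DynamoDB. If you want to set `visibleToAllUsers` to False, your role must have the proper permissions to do so. Note that `DataPipelineDefaultRole` does not have these permissions. You must either provide a union of the `DefaultDataPipelineResourceRole` and `DataPipelineDefaultRole` roles as the `EmrCluster` object role, or create your own role for this purpose.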

## Syntax
<a name="emrcluster-syntax"></a>


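| Object Invocation Fields | Description | Slot Type | 
| --- | --- | --- | 
| schedule | This object is invoked within the execution of a schedule interval. You must specify a schedule reference to another object to set the dependency execution order for this object. You can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Alternatively, if the pipeline has a tree of schedules (schedules within the master schedule), you can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see [https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html](https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html). | Reference Object, for example, "schedule":{"ref":"myScheduleId"} | 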

 
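| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| actionOnResourceFailure | The action taken after a resource failure for this resource. Valid values are "retryall", which retries all tasks to the cluster for the specified duration, and "retrynone". | String | 
| actionOnTaskFailure | The action taken after a task failure for this resource. Valid values are "continue", meaning do not terminate the cluster, and "terminate". | String | 
| additionalMasterSecurityGroupIds | The identifier of additional master security groups of the EMR cluster, which follows the form sg-01XXXX6a. For more information, see [Amazon EMR Additional Security Groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-additional-sec-groups.html) in the Amazon EMR Management Guide. | String | 
| additionalSlaveSecurityGroupIds | The identifier of additional slave security groups of the EMR cluster, which follows the form sg-01XXXX6a. | String | 
| amiVersion | The Amazon Machine Image (AMI) version that Amazon EMR uses to install the cluster nodes. For more information, see the [Amazon EMR Management Guide](https://docs.aws.amazon.com/emr/latest/ManagementGuide/). | String | 
| applications | Applications to install in the cluster, with comma-separated arguments. By default, Hive and Pig are installed. This parameter applies only to Amazon EMR version 4.0 and later. | String | 
| attemptStatus | The most recently reported status from the remote activity. | String | 
| attemptTimeout | The timeout for remote work completion. If set, a remote activity that does not complete within the set time of starting may be retried. | Period | 
| availabilityZone | The Availability Zone in which to run the cluster. | String | 
| bootstrapAction | An action to run when the cluster starts. You can specify comma-separated arguments. To specify multiple actions, up to 255, add multiple bootstrapAction fields. The default behavior is to start the cluster without any bootstrap actions. | String | 
| configuration | Configuration for the Amazon EMR cluster. This parameter applies only to Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration":{"ref":"myEmrConfigurationId"} | 
| coreInstanceBidPrice | The maximum Spot price that you are willing to pay for Amazon EC2 instances. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. Specified in USD. | String | 
| coreInstanceCount | The number of core nodes to use for the cluster. | Integer | 
| coreInstanceType | The type of Amazon EC2 instance to use for core nodes. See [Supported Amazon EC2 instances for Amazon EMR clusters](dp-emr-supported-instance-types.md). | String | 
| coreGroupConfiguration | The configuration for the Amazon EMR cluster core instance group. This parameter applies only to Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration": {"ref": "myEmrConfigurationId"} | 
| coreEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the core nodes in the core group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example, "coreEbsConfiguration": {"ref": "myEbsConfiguration"} | 
| customAmiId | Applies only to Amazon EMR release version 5.7.0 and later. Specifies the AMI ID of a custom AMI to use when Amazon EMR provisions Amazon EC2 instances. It can also be used instead of bootstrap actions to customize cluster node configurations. For more information, see the following topic in the Amazon EMR Management Guide: [Using a custom AMI](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-custom-ami.html) | String | 
| EbsBlockDeviceConfig |  The configuration of a requested Amazon EBS block device associated with the instance group. Includes a specified number of volumes that will be associated with each instance in the instance group. Includes `volumesPerInstance` and `volumeSpecification`, where: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html)  | Reference Object, for example, "EbsBlockDeviceConfig": {"ref": "myEbsBlockDeviceConfig"} | 
| emrManagedMasterSecurityGroupId | The identifier of the master security group of the Amazon EMR cluster, which follows the form sg-01XXXX6a. For more information, see [Configure Security Groups](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security-groups.html) in the Amazon EMR Management Guide. | String | 
| emrManagedSlaveSecurityGroupId | The identifier of the slave security group of the Amazon EMR cluster, which follows the form sg-01XXXX6a. | String | 
| enableDebugging | Enables debugging on the Amazon EMR cluster. | String | 
| failureAndRerunMode | Describes consumer node behavior when dependencies fail or are rerun. | Enumeration | 
| hadoopSchedulerType | The scheduler type of the cluster. Valid types are: PARALLEL_FAIR_SCHEDULING, PARALLEL_CAPACITY_SCHEDULING, and DEFAULT_SCHEDULER. | Enumeration | 
| httpProxy | The proxy host that clients use to connect to AWS services. | Reference Object, for example, "httpProxy":{"ref":"myHttpProxyId"} | 
| initTimeout | The amount of time to wait for the resource to start. | Period | 
| keyPair | The Amazon EC2 key pair to use to log on to the master node of the Amazon EMR cluster. | String | 
| lateAfterTimeout | The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand. | Period | 
| masterInstanceBidPrice | The maximum Spot price that you are willing to pay for Amazon EC2 instances. It is a decimal value between 0 and 20.00, exclusive. Specified in USD. Setting this value enables Spot Instances for the Amazon EMR cluster master node. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. | String | 
| masterInstanceType | The type of Amazon EC2 instance to use for the master node. See [Supported Amazon EC2 instances for Amazon EMR clusters](dp-emr-supported-instance-types.md). | String | 
| masterGroupConfiguration | The configuration for the Amazon EMR cluster master instance group. This parameter applies only to Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration": {"ref": "myEmrConfigurationId"} | 
| masterEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the master nodes in the master group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example, "masterEbsConfiguration": {"ref": "myEbsConfiguration"} | 
| maxActiveInstances | The maximum number of concurrent active instances of a component. Reruns do not count toward the number of active instances. | Integer | 
| maximumRetries | The maximum number of retries on failure. | Integer | 
| onFail | An action to run when the current object fails. | Reference Object, for example, "onFail":{"ref":"myActionId"} | 
| onLateAction | Actions that should be triggered if an object has not yet been scheduled or has still not completed. | Reference Object, for example, "onLateAction":{"ref":"myActionId"} | 
| onSuccess | An action to run when the current object succeeds. | Reference Object, for example, "onSuccess":{"ref":"myActionId"} | 
| parent | The parent of the current object, from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| pipelineLogUri | The Amazon S3 URI (such as 's3://BucketName/Key/') for uploading logs for the pipeline. | String | 
| region | The code for the Region in which the Amazon EMR cluster should run. By default, the cluster runs in the same Region as the pipeline. You can run the cluster in the same Region as a dependent dataset. | Enumeration | 
| releaseLabel | The release label for the EMR cluster. | String | 
| reportProgressTimeout | The timeout for successive calls by remote work to reportProgress. If set, remote activities that do not report progress for the specified period may be considered stalled and are retried. | Period | 
| resourceRole |  The IAM role that AWS Data Pipeline uses to create the Amazon EMR cluster. The default role is DataPipelineDefaultRole. | String | 
| retryDelay | The timeout duration between two retry attempts. | Period | 
| role | The IAM role passed to Amazon EMR to create EC2 nodes. | String | 
| runsOn | This field is not allowed on this object. | Reference Object, for example, "runsOn":{"ref":"myResourceId"} | 
| securityConfiguration | The identifier of the EMR security configuration to apply to the cluster. This parameter applies only to Amazon EMR version 4.8.0 and later. | String | 
| serviceAccessSecurityGroupId | The identifier of the service access security group of the Amazon EMR cluster. | String. It follows the form sg-01XXXX6a, for example, sg-1234abcd. | 
| scheduleType | The schedule type lets you specify whether the objects in your pipeline definition should be scheduled at the beginning or at the end of the interval. Values are: cron, ondemand, and timeseries. With timeseries scheduling, instances are scheduled at the end of each interval. With cron scheduling, instances are scheduled at the beginning of each interval. An ondemand schedule lets you run a pipeline one time per activation. You do not have to clone or re-create the pipeline to run it again. If you use an ondemand schedule, it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use ondemand pipelines, call the ActivatePipeline operation for each subsequent run. | Enumeration | 
| subnetId | The identifier of the subnet in which to launch the Amazon EMR cluster. | String | 
| supportedProducts | A parameter that installs third-party software on an Amazon EMR cluster, for example, a third-party distribution of Hadoop. | String | 
| taskInstanceBidPrice | The maximum Spot price that you are willing to pay for EC2 instances. A decimal value between 0 and 20.00, exclusive. Specified in USD. If a bid price is specified, Amazon EMR uses Spot Instances for the instance group. | String | 
| taskInstanceCount | The number of task nodes to use for the Amazon EMR cluster. | Integer | 
| taskInstanceType | The type of Amazon EC2 instance to use for task nodes. | String | 
| taskGroupConfiguration | The configuration for the Amazon EMR cluster task instance group. This parameter applies only to Amazon EMR version 4.0 and later. | Reference Object, for example, "configuration": {"ref": "myEmrConfigurationId"} | 
| taskEbsConfiguration | The configuration for Amazon EBS volumes that will be attached to each of the task nodes in the task group in the Amazon EMR cluster. For more information, see [Instance Types That Support EBS Optimization](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html) in the Amazon EC2 User Guide. | Reference Object, for example, "taskEbsConfiguration": {"ref": "myEbsConfiguration"} | 
| terminateAfter | The number of hours after which to terminate the resource. | Integer | 
| VolumeSpecification |   The Amazon EBS volume specification, such as the volume type, IOPS, and size in gibibytes (GiB), requested for the Amazon EBS volume attached to an Amazon EC2 instance in the Amazon EMR cluster. The node can be a core, master, or task node. `VolumeSpecification` includes: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html)  | Reference Object, for example, "VolumeSpecification": {"ref": "myVolumeSpecification"} | 
| useOnDemandOnLastAttempt | On the last attempt to request a resource, make a request for an On-Demand Instance rather than a Spot Instance. This ensures that if all previous attempts have failed, the last attempt is not interrupted. | Boolean | 
| workerGroup | This field is not allowed on this object. | String | 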
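
The `bootstrapAction` row above caps the number of actions at 255 per cluster. As a hedged sketch, a helper like the following could enforce that limit when actions are appended to an existing `EmrCluster` object. The helper name and the choice to raise `ValueError` are illustrative, and the bootstrap actions are assumed to be held as a JSON array, as in the examples on this page.

```python
MAX_BOOTSTRAP_ACTIONS = 255  # limit stated in the bootstrapAction description


def add_bootstrap_actions(cluster, new_actions):
    """Return a copy of an EmrCluster object with extra bootstrap actions.

    Existing actions are kept; a single string value is treated as a
    one-element list.
    """
    existing = cluster.get("bootstrapAction", [])
    if isinstance(existing, str):
        existing = [existing]
    combined = [*existing, *new_actions]
    if len(combined) > MAX_BOOTSTRAP_ACTIONS:
        raise ValueError(
            f"{len(combined)} bootstrap actions exceed the limit of {MAX_BOOTSTRAP_ACTIONS}"
        )
    return {**cluster, "bootstrapAction": combined}


cluster = {"id": "MyEmrCluster", "type": "EmrCluster", "bootstrapAction": ["s3://path/to/first.sh"]}
# "s3://path/to/second.sh,arg1" is a placeholder action with one argument.
print(add_bootstrap_actions(cluster, ["s3://path/to/second.sh,arg1"]))
```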

 
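| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @activeInstances | List of the currently scheduled active instance objects. | Reference Object, for example, "activeInstances":{"ref":"myRunnableObjectId"} | 
| @actualEndTime | The time when execution of this object finished. | DateTime | 
| @actualStartTime | The time when execution of this object started. | DateTime | 
| cancellationReason | The cancellationReason if this object was cancelled. | String | 
| @cascadeFailedOn | Description of the dependency chain on which the object failed. | Reference Object, for example, "cascadeFailedOn":{"ref":"myRunnableObjectId"} | 
| emrStepLog | Step logs, available only on Amazon EMR activity attempts. | String | 
| errorId | The error ID if this object failed. | String | 
| errorMessage | The error message if this object failed. | String | 
| errorStackTrace | The error stack trace if this object failed. | String | 
| @failureReason | The reason for the resource failure. | String | 
| @finishedTime | The time at which this object finished its execution. | DateTime | 
| hadoopJobLog | Hadoop job logs, available on attempts for Amazon EMR activities. | String | 
| @healthStatus | The health status of the object, which reflects the success or failure of the last object instance that reached a terminated state. | String | 
| @healthStatusFromInstanceId | The ID of the last instance object that reached a terminated state. | String | 
| @healthStatusUpdatedTime | The time at which the health status was last updated. | DateTime | 
| hostname | The host name of the client that picked up the task attempt. | String | 
| @lastDeactivatedTime | The time at which this object was last deactivated. | DateTime | 
| @latestCompletedRunTime | The time of the latest run for which the execution completed. | DateTime | 
| @latestRunTime | The time of the latest run for which the execution was scheduled. | DateTime | 
| @nextRunTime | The time of the run to be scheduled next. | DateTime | 
| reportProgressTime | The most recent time that the remote activity reported progress. | DateTime | 
| @scheduledEndTime | The scheduled end time for the object. | DateTime | 
| @scheduledStartTime | The scheduled start time for the object. | DateTime | 
| @status | The status of this object. | String | 
| @version | The pipeline version with which the object was created. | String | 
| @waitingOn | Description of the list of dependencies on which this object is waiting. | Reference Object, for example, "waitingOn":{"ref":"myRunnableObjectId"} | 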

 
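| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | The error describing the ill-formed object. | String | 
| @pipelineId | The ID of the pipeline to which this object belongs. | String | 
| @sphere | The place of an object in the lifecycle. Component objects give rise to instance objects, which execute attempt objects. | String | 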

# Examples
<a name="emrcluster-example"></a>

The following are examples of this object type.

**Topics**
+ [Launch an Amazon EMR cluster with hadoopVersion](emrcluster-example-launch.md)
+ [Launch an Amazon EMR cluster with release label emr-4.x or greater](emrcluster-example-release-label.md)
+ [Install additional software on your Amazon EMR cluster](emrcluster-example-install-software.md)
+ [Disable server-side encryption on 3.x releases](emrcluster-example1-disable-encryption.md)
+ [Disable server-side encryption on 4.x releases](emrcluster-example2-disable-encryption.md)
+ [Configure Hadoop KMS ACLs and create encryption zones in HDFS](emrcluster-example-hadoop-kms.md)
+ [Specify custom IAM roles](emrcluster-example-custom-iam-roles.md)
+ [Use the EmrCluster resource in the AWS SDK for Java](emrcluster-example-java.md)
+ [Configure an Amazon EMR cluster in a private subnet](emrcluster-example-private-subnet.md)
+ [Attach EBS volumes to cluster nodes](emrcluster-example-ebs.md)

# Launch an Amazon EMR cluster with hadoopVersion
<a name="emrcluster-example-launch"></a>

**Example**  <a name="example1"></a>
The following example launches an Amazon EMR cluster using AMI version 1.0 and Hadoop 0.20.  

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "hadoopVersion" : "0.20",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m3.xlarge",
  "coreInstanceType" : "m3.xlarge",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m3.xlarge",
  "taskInstanceCount": "10",
  "bootstrapAction" : ["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,arg1,arg2,arg3","s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop/configure-other-stuff,arg1,arg2"]
}
```

# Launch an Amazon EMR cluster with release label emr-4.x or greater
<a name="emrcluster-example-release-label"></a>

**Example**  
The following example launches an Amazon EMR cluster using the newer `releaseLabel` field:  

```
{
  "id" : "MyEmrCluster",
  "type" : "EmrCluster",
  "keyPair" : "my-key-pair",
  "masterInstanceType" : "m3.xlarge",
  "coreInstanceType" : "m3.xlarge",
  "coreInstanceCount" : "10",
  "taskInstanceType" : "m3.xlarge",
  "taskInstanceCount": "10",
  "releaseLabel": "emr-4.1.0",
  "applications": ["spark", "hive", "pig"],
  "configuration": {"ref":"myConfiguration"}  
}
```

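# Install additional software on your Amazon EMR cluster
<a name="emrcluster-example-install-software"></a>

**Example**  <a name="example2"></a>
`EmrCluster` provides the `supportedProducts` field, which installs third-party software on an Amazon EMR cluster. For example, it lets you install a custom distribution of Hadoop, such as MapR. It accepts a comma-separated list of arguments for the third-party software to read and act on. The following example shows how to use the `supportedProducts` field of `EmrCluster` to create a custom MapR M3 edition cluster with Karmasphere Analytics installed, and run an `EmrActivity` object on it.  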

```
{
  "id": "MyEmrActivity",
  "type": "EmrActivity",
  "schedule": {"ref": "ResourcePeriod"},
  "runsOn": {"ref": "MyEmrCluster"},
  "postStepCommand": "echo Ending job >> /mnt/var/log/stepCommand.txt",
  "preStepCommand": "echo Starting job > /mnt/var/log/stepCommand.txt",
  "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,hdfs:///output32113/,-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate"
},
{
  "id": "MyEmrCluster",
  "type": "EmrCluster",
  "schedule": {"ref": "ResourcePeriod"},
  "supportedProducts": ["mapr,--edition,m3,--version,1.2,--key1,value1","karmasphere-enterprise-utility"],
  "masterInstanceType": "m3.xlarge",
  "taskInstanceType": "m3.xlarge"
}
```

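# Disable server-side encryption on 3.x releases
<a name="emrcluster-example1-disable-encryption"></a>

**Example**  <a name="example3"></a>
By default, an `EmrCluster` activity with Hadoop version 2.x created by AWS Data Pipeline enables server-side encryption. If you want to disable server-side encryption, you must specify a bootstrap action in the cluster object definition.  
The following example creates an `EmrCluster` activity with server-side encryption disabled:  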

```
{  
   "id":"NoSSEEmrCluster",
   "type":"EmrCluster",
   "hadoopVersion":"2.x",
   "keyPair":"my-key-pair",
   "masterInstanceType":"m3.xlarge",
   "coreInstanceType":"m3.large",
   "coreInstanceCount":"10",
   "taskInstanceType":"m3.large",
   "taskInstanceCount":"10",
   "bootstrapAction":["s3://Region.elasticmapreduce/bootstrap-actions/configure-hadoop,-e, fs.s3.enableServerSideEncryption=false"]
}
```

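# Disable server-side encryption on 4.x releases
<a name="emrcluster-example2-disable-encryption"></a>

**Example**  <a name="example4"></a>
You must disable server-side encryption using an `EmrConfiguration` object.  
The following example creates an `EmrCluster` activity with server-side encryption disabled:  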

```
   {
      "name": "ReleaseLabelCluster",
      "releaseLabel": "emr-4.1.0",
      "applications": ["spark", "hive", "pig"],
      "id": "myResourceId",
      "type": "EmrCluster",
      "configuration": {
        "ref": "disableSSE"
      }
    },
    {
      "name": "disableSSE",
      "id": "disableSSE",
      "type": "EmrConfiguration",
      "classification": "emrfs-site",
      "property": [{
        "ref": "enableServerSideEncryption"
      }
      ]
    },
    {
      "name": "enableServerSideEncryption",
      "id": "enableServerSideEncryption",
      "type": "Property",
      "key": "fs.s3.enableServerSideEncryption",
      "value": "false"
    }
```
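
The example above chains three objects through `ref` links: the cluster refers to an `EmrConfiguration`, which in turn refers to one `Property` object per key. The following sketch generates the configuration and property objects from a plain dictionary. The function name and the generated property IDs are hypothetical; only the object shapes follow the example.

```python
import json


def emr_configuration(config_id, classification, properties):
    """Return an EmrConfiguration object followed by its Property objects.

    Each key/value pair becomes a Property object, and the configuration
    links to it with {"ref": <property id>}.
    """
    property_objects = [
        {
            # Hypothetical naming scheme: <config id>_<position>.
            "name": f"{config_id}_{index}",
            "id": f"{config_id}_{index}",
            "type": "Property",
            "key": key,
            "value": value,
        }
        for index, (key, value) in enumerate(properties.items(), start=1)
    ]
    configuration = {
        "name": config_id,
        "id": config_id,
        "type": "EmrConfiguration",
        "classification": classification,
        "property": [{"ref": prop["id"]} for prop in property_objects],
    }
    return [configuration, *property_objects]


objects = emr_configuration(
    "disableSSE", "emrfs-site", {"fs.s3.enableServerSideEncryption": "false"}
)
print(json.dumps(objects, indent=2))
```

An `EmrCluster` object would then point at the result with `"configuration": {"ref": "disableSSE"}`.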

# Configure Hadoop KMS ACLs and create encryption zones in HDFS
<a name="emrcluster-example-hadoop-kms"></a>

**Example**  <a name="example5"></a>
The following objects create ACLs for Hadoop KMS, and create encryption zones and the corresponding encryption keys in HDFS:  

```
{
      "name": "kmsAcls",
      "id": "kmsAcls",
      "type": "EmrConfiguration",
      "classification": "hadoop-kms-acls",
      "property": [
        {"ref":"kmsBlacklist"},
        {"ref":"kmsAcl"}
      ]
    },
    {
      "name": "hdfsEncryptionZone",
      "id": "hdfsEncryptionZone",
      "type": "EmrConfiguration",
      "classification": "hdfs-encryption-zones",
      "property": [
        {"ref":"hdfsPath1"},
        {"ref":"hdfsPath2"}
      ]
    },
    {
      "name": "kmsBlacklist",
      "id": "kmsBlacklist",
      "type": "Property",
      "key": "hadoop.kms.blacklist.CREATE",
      "value": "foo,myBannedUser"
    },
    {
      "name": "kmsAcl",
      "id": "kmsAcl",
      "type": "Property",
      "key": "hadoop.kms.acl.ROLLOVER",
      "value": "myAllowedUser"
    },
    {
      "name": "hdfsPath1",
      "id": "hdfsPath1",
      "type": "Property",
      "key": "/myHDFSPath1",
      "value": "path1_key"
    },
    {
      "name": "hdfsPath2",
      "id": "hdfsPath2",
      "type": "Property",
      "key": "/myHDFSPath2",
      "value": "path2_key"
    }
```

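# Specify custom IAM roles
<a name="emrcluster-example-custom-iam-roles"></a>

**Example**  <a name="example6"></a>
By default, AWS Data Pipeline passes `DataPipelineDefaultRole` as the Amazon EMR service role and `DataPipelineDefaultResourceRole` as the Amazon EC2 instance profile to create resources on your behalf. However, you can create a custom Amazon EMR service role and a custom instance profile and use them instead. AWS Data Pipeline must have sufficient permissions to create clusters using the custom role, and you must add AWS Data Pipeline as a trusted entity.  
The following example object specifies custom roles for an Amazon EMR cluster:  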

```
{  
   "id":"MyEmrCluster",
   "type":"EmrCluster",
   "hadoopVersion":"2.x",
   "keyPair":"my-key-pair",
   "masterInstanceType":"m3.xlarge",
   "coreInstanceType":"m3.large",
   "coreInstanceCount":"10",
   "taskInstanceType":"m3.large",
   "taskInstanceCount":"10",
   "role":"emrServiceRole",
   "resourceRole":"emrInstanceProfile"
}
```

# Use the EmrCluster resource in the AWS SDK for Java
<a name="emrcluster-example-java"></a>

**Example**  <a name="example7"></a>
The following example shows how to use an `EmrCluster` and an `EmrActivity` to create an Amazon EMR 4.x cluster that runs a Spark step using the Java SDK:  

```
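import java.util.ArrayList;
import java.util.List;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.datapipeline.DataPipelineClient;
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest;
import com.amazonaws.services.datapipeline.model.ActivatePipelineResult;
import com.amazonaws.services.datapipeline.model.CreatePipelineRequest;
import com.amazonaws.services.datapipeline.model.CreatePipelineResult;
import com.amazonaws.services.datapipeline.model.Field;
import com.amazonaws.services.datapipeline.model.PipelineObject;
import com.amazonaws.services.datapipeline.model.PutPipelineDefinitionRequest;
import com.amazonaws.services.datapipeline.model.PutPipelineDefinitionResult;

public class dataPipelineEmr4 {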

  public static void main(String[] args) {
    
	AWSCredentials credentials = null;
	credentials = new ProfileCredentialsProvider("/path/to/AwsCredentials.properties","default").getCredentials();
	DataPipelineClient dp = new DataPipelineClient(credentials);
	CreatePipelineRequest createPipeline = new CreatePipelineRequest().withName("EMR4SDK").withUniqueId("unique");
	CreatePipelineResult createPipelineResult = dp.createPipeline(createPipeline);
	String pipelineId = createPipelineResult.getPipelineId();
    
	PipelineObject emrCluster = new PipelineObject()
	    .withName("EmrClusterObj")
	    .withId("EmrClusterObj")
	    .withFields(
			new Field().withKey("releaseLabel").withStringValue("emr-4.1.0"),
			new Field().withKey("coreInstanceCount").withStringValue("3"),
			new Field().withKey("applications").withStringValue("spark"),
			new Field().withKey("applications").withStringValue("Presto-Sandbox"),
			new Field().withKey("type").withStringValue("EmrCluster"),
			new Field().withKey("keyPair").withStringValue("myKeyName"),
			new Field().withKey("masterInstanceType").withStringValue("m3.xlarge"),
			new Field().withKey("coreInstanceType").withStringValue("m3.xlarge")        
			);
  
	PipelineObject emrActivity = new PipelineObject()
	    .withName("EmrActivityObj")
	    .withId("EmrActivityObj")
	    .withFields(
			new Field().withKey("step").withStringValue("command-runner.jar,spark-submit,--executor-memory,1g,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10"),
			new Field().withKey("runsOn").withRefValue("EmrClusterObj"),
			new Field().withKey("type").withStringValue("EmrActivity")
			);
      
	PipelineObject schedule = new PipelineObject()
	    .withName("Every 15 Minutes")
	    .withId("DefaultSchedule")
	    .withFields(
			new Field().withKey("type").withStringValue("Schedule"),
			new Field().withKey("period").withStringValue("15 Minutes"),
			new Field().withKey("startAt").withStringValue("FIRST_ACTIVATION_DATE_TIME")
			);
      
	PipelineObject defaultObject = new PipelineObject()
	    .withName("Default")
	    .withId("Default")
	    .withFields(
			new Field().withKey("failureAndRerunMode").withStringValue("CASCADE"),
			new Field().withKey("schedule").withRefValue("DefaultSchedule"),
			new Field().withKey("resourceRole").withStringValue("DataPipelineDefaultResourceRole"),
			new Field().withKey("role").withStringValue("DataPipelineDefaultRole"),
			new Field().withKey("pipelineLogUri").withStringValue("s3://myLogUri"),
			new Field().withKey("scheduleType").withStringValue("cron")
			);     
      
	List<PipelineObject> pipelineObjects = new ArrayList<PipelineObject>();
    
	pipelineObjects.add(emrActivity);
	pipelineObjects.add(emrCluster);
	pipelineObjects.add(defaultObject);
	pipelineObjects.add(schedule);
    
	PutPipelineDefinitionRequest putPipelineDefintion = new PutPipelineDefinitionRequest()
	    .withPipelineId(pipelineId)
	    .withPipelineObjects(pipelineObjects);
    
	PutPipelineDefinitionResult putPipelineResult = dp.putPipelineDefinition(putPipelineDefintion);
	System.out.println(putPipelineResult);
    
	ActivatePipelineRequest activatePipelineReq = new ActivatePipelineRequest()
	    .withPipelineId(pipelineId);
	ActivatePipelineResult activatePipelineRes = dp.activatePipeline(activatePipelineReq);
	
      System.out.println(activatePipelineRes);
      System.out.println(pipelineId);
    
    }

}
```

# 在私有子网中配置 Amazon EMR 集群
<a name="emrcluster-example-private-subnet"></a>

**Example**  <a name="example8"></a>
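This example includes a configuration that launches the cluster in a private subnet of a VPC. For more information, see [Launch Amazon EMR clusters into a VPC](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-vpc-launching-job-flows.html) in the *Amazon EMR Management Guide*. This configuration is optional. You can use it in any pipeline that uses an `EmrCluster` object.  
To launch an Amazon EMR cluster in a private subnet, specify `subnetId`, `emrManagedMasterSecurityGroupId`, `emrManagedSlaveSecurityGroupId`, and `serviceAccessSecurityGroupId` in your `EmrCluster` configuration.  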

```
{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "false"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "taskInstanceCount": "1",
      "taskInstanceType": "m4.xlarge",
      "coreInstanceType": "m4.xlarge",
      "releaseLabel": "emr-4.7.0",
      "masterInstanceType": "m4.xlarge",
      "id": "EmrClusterForBackup",
      "subnetId": "#{mySubnetId}",
      "emrManagedMasterSecurityGroupId": "#{myMasterSecurityGroup}",
      "emrManagedSlaveSecurityGroupId": "#{mySlaveSecurityGroup}",
      "serviceAccessSecurityGroupId": "#{myServiceAccessSecurityGroup}",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "keyPair": "user-key-pair"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myPipelineLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
     "myDDBRegion": "us-east-1",
      "myDDBTableName": "ddb_table",
      "myDDBReadThroughputRatio": "0.25",
      "myOutputS3Loc": "s3://s3_path",
      "mySubnetId": "subnet_id",
      "myServiceAccessSecurityGroup":  "service access security group",
      "mySlaveSecurityGroup": "slave security group",
      "myMasterSecurityGroup": "master security group",
      "myPipelineLogUri": "s3://s3_path"
  }
}
```
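
The four `EmrCluster` fields that the private-subnet configuration relies on can be checked locally before you upload a definition. The following is a minimal sketch, not part of the AWS SDK: the class `PrivateSubnetFieldCheck` and its `missingFields` method are hypothetical names, and the cluster object is modeled as a plain map of field names to values. AWS Data Pipeline performs its own validation when you put the pipeline definition, so treat this only as an illustration of which fields are expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the AWS SDK.
class PrivateSubnetFieldCheck {

    // Field names taken from the EmrCluster object in the example above.
    static final List<String> REQUIRED = List.of(
            "subnetId",
            "emrManagedMasterSecurityGroupId",
            "emrManagedSlaveSecurityGroupId",
            "serviceAccessSecurityGroupId");

    // Returns the names of the private-subnet fields that are absent or blank.
    static List<String> missingFields(Map<String, String> emrClusterFields) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = emrClusterFields.get(key);
            if (value == null || value.isBlank()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // An EmrCluster object that omits serviceAccessSecurityGroupId.
        Map<String, String> fields = Map.of(
                "type", "EmrCluster",
                "subnetId", "#{mySubnetId}",
                "emrManagedMasterSecurityGroupId", "#{myMasterSecurityGroup}",
                "emrManagedSlaveSecurityGroupId", "#{mySlaveSecurityGroup}");
        System.out.println(missingFields(fields)); // prints [serviceAccessSecurityGroupId]
    }
}
```

An empty result means that all four fields are present; it does not confirm that the referenced subnet or security groups exist.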

# Attach EBS volumes to cluster nodes
<a name="emrcluster-example-ebs"></a>

**Example**  <a name="example8"></a>
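You can attach EBS volumes to any type of node in the EMR cluster within your pipeline. To attach EBS volumes to nodes, use `coreEbsConfiguration`, `masterEbsConfiguration`, and `taskEbsConfiguration` in your `EmrCluster` configuration.  
This Amazon EMR cluster example uses Amazon EBS volumes for its master, task, and core nodes. For more information, see [Amazon EBS volumes in Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html) in the *Amazon EMR Management Guide*.  
These configurations are optional. You can use them in any pipeline that uses an `EmrCluster` object.  
In the pipeline, click the `EmrCluster` object configuration, choose **Master EBS Configuration**, **Core EBS Configuration**, or **Task EBS Configuration**, and enter configuration details similar to the following example.  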

```
{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "false"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "taskInstanceCount": "1",
      "taskInstanceType": "m4.xlarge",
      "coreInstanceType": "m4.xlarge",
      "releaseLabel": "emr-4.7.0",
      "masterInstanceType": "m4.xlarge",
      "id": "EmrClusterForBackup",
      "subnetId": "#{mySubnetId}",
      "emrManagedMasterSecurityGroupId": "#{myMasterSecurityGroup}",
      "emrManagedSlaveSecurityGroupId": "#{mySlaveSecurityGroup}",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "coreEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "masterEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "taskEbsConfiguration": {
        "ref": "EBSConfiguration"
      },
      "keyPair": "user-key-pair"
    },
    {
       "name": "EBSConfiguration",
        "id": "EBSConfiguration",
        "ebsOptimized": "true",
        "ebsBlockDeviceConfig" : [
            { "ref": "EbsBlockDeviceConfig" }
        ],
        "type": "EbsConfiguration"
    },
    {
        "name": "EbsBlockDeviceConfig",
        "id": "EbsBlockDeviceConfig",
        "type": "EbsBlockDeviceConfig",
        "volumesPerInstance" : "2",
        "volumeSpecification" : {
            "ref": "VolumeSpecification"
        }
    },
    {
      "name": "VolumeSpecification",
      "id": "VolumeSpecification",
      "type": "VolumeSpecification",
      "sizeInGB": "500",
      "volumeType": "io1",
      "iops": "1000"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "#{myPipelineLogUri}",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
     "myDDBRegion": "us-east-1",
      "myDDBTableName": "ddb_table",
      "myDDBReadThroughputRatio": "0.25",
      "myOutputS3Loc": "s3://s3_path",
      "mySubnetId": "subnet_id",
      "mySlaveSecurityGroup": "slave security group",
      "myMasterSecurityGroup": "master security group",
      "myPipelineLogUri": "s3://s3_path"
  }
}
```
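
In the definition above, the `EmrCluster` object refers to an `EbsConfiguration`, which refers to an `EbsBlockDeviceConfig`, which in turn refers to a `VolumeSpecification`. The following sketch follows that chain of references and multiplies `volumesPerInstance` by `sizeInGB`. It is an illustration only: the class `EbsCapacitySketch` and its method are hypothetical, each object is modeled as a plain map keyed by its `id`, and `ebsBlockDeviceConfig` is simplified to a single reference even though the definition allows a list. It also assumes that every volume in a block device configuration has the size given in its volume specification.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the AWS SDK.
class EbsCapacitySketch {

    // Follows cluster -> EbsConfiguration -> EbsBlockDeviceConfig -> VolumeSpecification
    // and returns volumesPerInstance * sizeInGB for the given EBS field of the cluster.
    static int ebsGbPerInstance(Map<String, Map<String, String>> objectsById,
                                String clusterId, String ebsField) {
        Map<String, String> cluster = objectsById.get(clusterId);
        Map<String, String> ebsConfig = objectsById.get(cluster.get(ebsField));
        Map<String, String> blockDevice = objectsById.get(ebsConfig.get("ebsBlockDeviceConfig"));
        Map<String, String> volumeSpec = objectsById.get(blockDevice.get("volumeSpecification"));

        int volumes = Integer.parseInt(blockDevice.get("volumesPerInstance"));
        int sizeInGb = Integer.parseInt(volumeSpec.get("sizeInGB"));
        return volumes * sizeInGb;
    }

    public static void main(String[] args) {
        // Same ids and values as the example definition; references are stored as target ids.
        Map<String, Map<String, String>> objects = Map.of(
                "EmrClusterForBackup", Map.of("coreEbsConfiguration", "EBSConfiguration"),
                "EBSConfiguration", Map.of("ebsBlockDeviceConfig", "EbsBlockDeviceConfig"),
                "EbsBlockDeviceConfig", Map.of(
                        "volumesPerInstance", "2",
                        "volumeSpecification", "VolumeSpecification"),
                "VolumeSpecification", Map.of(
                        "sizeInGB", "500", "volumeType", "io1", "iops", "1000"));

        System.out.println(ebsGbPerInstance(objects, "EmrClusterForBackup", "coreEbsConfiguration")); // prints 1000
    }
}
```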

## See also
<a name="emrcluster-seealso"></a>
+ [EmrActivity](dp-object-emractivity.md)

# HttpProxy
<a name="dp-object-httpproxy"></a>

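HttpProxy allows you to configure your own proxy and have Task Runner access the AWS Data Pipeline service through it. You do not need to configure a running Task Runner with this information.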

## Example of an HttpProxy in TaskRunner
<a name="example9"></a>

The following pipeline definition shows an `HttpProxy` object:

```
{
  "objects": [
    {
      "schedule": {
        "ref": "Once"
      },
      "pipelineLogUri": "s3://myDPLogUri/path",
      "name": "Default",
      "id": "Default"
    },
    {
      "name": "test_proxy",
      "hostname": "hostname",
      "port": "port",
      "username": "username",
      "*password": "password",
      "windowsDomain": "windowsDomain",
      "type": "HttpProxy",
      "id": "test_proxy"
    },
    {
      "name": "ShellCommand",
      "id": "ShellCommand",
      "runsOn": {
        "ref": "Resource"
      },
      "type": "ShellCommandActivity",
      "command": "echo 'hello world' "
    },
    {
      "period": "1 day",
      "startDateTime": "2013-03-09T00:00:00",
      "name": "Once",
      "id": "Once",
      "endDateTime": "2013-03-10T00:00:00",
      "type": "Schedule"
    },
    {
      "role": "dataPipelineRole",
      "httpProxy": {
        "ref": "test_proxy"
      },
      "actionOnResourceFailure": "retrynone",
      "maximumRetries": "0",
      "type": "Ec2Resource",
      "terminateAfter": "10 minutes",
      "resourceRole": "resourceRole",
      "name": "Resource",
      "actionOnTaskFailure": "terminate",
      "securityGroups": "securityGroups",
      "keyPair": "keyPair",
      "id": "Resource",
      "region": "us-east-1"
    }
  ],
  "parameters": []
}
```
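
To make the role of the `hostname` and `port` fields concrete, the following sketch maps them onto a standard `java.net.Proxy`. This is an illustration only and is not how Task Runner is implemented: the class `HttpProxySketch`, its `toJavaProxy` method, and the host `proxy.example.com` with port `8080` are hypothetical. Authentication fields such as `username` and `*password` are not handled here.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Map;

// Hypothetical helper for illustration; not part of Task Runner or the AWS SDK.
class HttpProxySketch {

    // Builds a java.net.Proxy from the hostname and port fields of an HttpProxy object.
    static Proxy toJavaProxy(Map<String, String> httpProxyFields) {
        String hostname = httpProxyFields.get("hostname");
        String port = httpProxyFields.get("port");
        if (hostname == null || port == null) {
            throw new IllegalArgumentException("hostname and port are both required");
        }
        // The port field is a string in the pipeline definition, so parse it here.
        // createUnresolved avoids a DNS lookup while the address is being built.
        InetSocketAddress address =
                InetSocketAddress.createUnresolved(hostname, Integer.parseInt(port));
        return new Proxy(Proxy.Type.HTTP, address);
    }

    public static void main(String[] args) {
        Proxy proxy = toJavaProxy(Map.of(
                "type", "HttpProxy",
                "hostname", "proxy.example.com", // placeholder host
                "port", "8080"));                // placeholder port
        System.out.println(proxy.type()); // prints HTTP
    }
}
```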

## Syntax
<a name="httpproxy-slots"></a>


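| Required Fields | Description | Slot Type | 
| --- | --- | --- | 
| hostname | Host of the proxy that clients use to connect to Amazon Web Services. | String | 
| port | Port of the proxy host that clients use to connect to Amazon Web Services. | String | 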

 
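| Optional Fields | Description | Slot Type | 
| --- | --- | --- | 
| parent | Parent of the current object from which slots are inherited. | Reference Object, for example, "parent":{"ref":"myBaseObjectId"} | 
| \*password | Password for the proxy. | String | 
| s3NoProxy | Disables the HTTP proxy when connecting to Amazon S3. | Boolean | 
| username | User name for the proxy. | String | 
| windowsDomain | The Windows domain name for an NTLM proxy. | String | 
| windowsWorkgroup | The Windows workgroup name for an NTLM proxy. | String | 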

 
| Runtime Fields | Description | Slot Type | 
| --- | --- | --- | 
| @version | Pipeline version that the object was created with. | String | 

 
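| System Fields | Description | Slot Type | 
| --- | --- | --- | 
| @error | Error message describing the ill-formed object. | String | 
| @pipelineId | ID of the pipeline to which this object belongs. | String | 
| @sphere | The sphere of an object denotes its place in the lifecycle: component objects give rise to instance objects, which execute attempt objects. | String |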