RunJobFlow creates and starts running a new job flow. The job flow will run the steps
specified. Once the job flow completes, the cluster is stopped and the HDFS partition is lost.
To prevent loss of data, configure the last step of the job flow to store results in Amazon S3.
If the JobFlowInstancesConfig KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the job flow will transition to the WAITING state rather than shutting down once the steps have completed.
For additional protection, you can set the JobFlowInstancesConfig TerminationProtected parameter to TRUE to lock the job flow and prevent it from being terminated by API call, user intervention, or in the event of a job flow error.
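Both flags are simply entries in the $instances configuration array passed to run_job_flow(). A minimal sketch (the array below is illustrative, not a complete job flow configuration):

```php
<?php
// Keep the cluster alive after all steps finish, and lock it against
// accidental termination. These keys belong in the $instances array
// passed to run_job_flow().
$instances = array(
    'KeepJobFlowAliveWhenNoSteps' => 'true', // WAITING state instead of shutdown
    'TerminationProtected'        => 'true', // block termination by API call or user
);

var_dump($instances['KeepJobFlowAliveWhenNoSteps']); // string(4) "true"
```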
A maximum of 256 steps are allowed in each job flow.
If your job flow is long-running (such as a Hive data warehouse) or complex, you may require more than 256 steps to process your data. You can bypass the 256-step limitation in various ways, including using the SSH shell to connect to the master node and submitting queries directly to the software running on the master node, such as Hive and Hadoop. For more information on how to do this, go to Add More than 256 Steps to a Job Flow in the Amazon Elastic MapReduce Developer's Guide.
For long-running job flows, we recommend that you periodically store your results.
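One way to follow this advice is to make the final step of the flow write its output to Amazon S3 before the cluster (and its HDFS partition) goes away. A sketch of such a step configuration, where the JAR path, bucket names, and the '--output' flag are hypothetical placeholders for whatever your own job uses:

```php
<?php
// Hypothetical final step whose job persists results to Amazon S3.
// The JAR path, bucket names, and '--output' argument are placeholders.
$final_step = array(
    'Name'            => 'Persist results to S3',
    'ActionOnFailure' => 'CANCEL_AND_WAIT',
    'HadoopJarStep'   => array(
        'Jar'  => 's3://my-bucket/my-export-job.jar',
        'Args' => array('--output', 's3n://my-results-bucket/output/'),
    ),
);

var_dump($final_step['HadoopJarStep']['Args'][1]); // string(36) "s3n://my-results-bucket/output/"
```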
Parameters
$name - string - Required - The name of the job flow. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
$instances - array - Required - A specification of the number and type of Amazon EC2 instances on which to run the job flow.
- x - array - Optional - This represents a simple array index.
  - MasterInstanceType - string - Optional - The EC2 instance type of the master node. [Constraints: The value must be between 1 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
  - SlaveInstanceType - string - Optional - The EC2 instance type of the slave nodes. [Constraints: The value must be between 1 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
  - InstanceCount - integer - Optional - The number of Amazon EC2 instances used to execute the job flow.
  - InstanceGroups - array - Optional - Configuration for the job flow's instance groups.
    - x - array - Optional - This represents a simple array index.
      - Name - string - Optional - Friendly name given to the instance group. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
      - Market - string - Optional - Market type of the Amazon EC2 instances used to create a cluster node. [Allowed values: ON_DEMAND, SPOT]
      - InstanceRole - string - Required - The role of the instance group in the cluster. [Allowed values: MASTER, CORE, TASK]
      - BidPrice - string - Optional - Bid price for each Amazon EC2 instance in the instance group when launching nodes as Spot Instances, expressed in USD. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
      - InstanceType - string - Required - The Amazon EC2 instance type for all instances in the instance group. [Constraints: The value must be between 1 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
      - InstanceCount - integer - Required - Target number of instances for the instance group.
  - Ec2KeyName - string - Optional - Specifies the name of the Amazon EC2 key pair that can be used to ssh to the master node as the user called "hadoop." [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
  - Placement - array - Optional - Specifies the Availability Zone the job flow will run in.
    - x - array - Optional - This represents a simple array index.
      - AvailabilityZone - string - Required - The Amazon EC2 Availability Zone for the job flow. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
  - KeepJobFlowAliveWhenNoSteps - boolean - Optional - Specifies whether the job flow should terminate after completing all steps.
  - TerminationProtected - boolean - Optional - Specifies whether to lock the job flow to prevent the Amazon EC2 instances from being terminated by API call, user intervention, or in the event of a job flow error.
  - HadoopVersion - string - Optional - Specifies the Hadoop version for the job flow. Valid inputs are "0.18", "0.20", or "0.20.205". If you do not set this value, the default of 0.18 is used, unless the AmiVersion parameter is set in the RunJobFlow call, in which case the default version of Hadoop for that AMI version is used. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
  - Ec2SubnetId - string - Optional - To launch the job flow in Amazon Virtual Private Cloud (Amazon VPC), set this parameter to the identifier of the Amazon VPC subnet where you want the job flow to launch. If you do not specify this value, the job flow is launched in the normal Amazon Web Services cloud, outside of an Amazon VPC. Amazon VPC currently does not support cluster compute quadruple extra large (cc1.4xlarge) instances. Thus you cannot specify the cc1.4xlarge instance type for nodes of a job flow launched in an Amazon VPC. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
$opt - array - Optional - An associative array of parameters that can have the following keys:
- LogUri - string - Optional - Specifies the location in Amazon S3 to write the log files of the job flow. If a value is not provided, logs are not created. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
- AdditionalInfo - string - Optional - A JSON string for selecting additional features. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
- AmiVersion - string - Optional - The version of the Amazon Machine Image (AMI) to use when launching Amazon EC2 instances in the job flow. The following values are valid:
  - "latest" (uses the latest AMI)
  - The version number of the AMI to use, for example, "2.0"
  If this value is not specified, the job flow uses the default of (AMI 1.0, Hadoop 0.18). If the AMI supports multiple versions of Hadoop (for example, AMI 1.0 supports both Hadoop 0.18 and 0.20), you can use the JobFlowInstancesConfig HadoopVersion parameter to modify the version of Hadoop from the defaults shown above. For details about the AMI versions currently supported by Amazon Elastic MapReduce, go to AMI Versions Supported in Elastic MapReduce in the Amazon Elastic MapReduce Developer's Guide. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
- Steps - array - Optional - A list of steps to be executed by the job flow.
  - x - array - Optional - This represents a simple array index.
    - Name - string - Required - The name of the job flow step. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
    - ActionOnFailure - string - Optional - Specifies the action to take if the job flow step fails. [Allowed values: TERMINATE_JOB_FLOW, CANCEL_AND_WAIT, CONTINUE]
    - HadoopJarStep - array - Required - Specifies the JAR file used for the job flow step.
      - x - array - Optional - This represents a simple array index.
        - Properties - array - Optional - A list of Java properties that are set when the step runs. You can use these properties to pass key value pairs to your main function.
          - x - array - Optional - This represents a simple array index.
            - Key - string - Optional - The unique identifier of a key value pair. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
            - Value - string - Optional - The value part of the identified key. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
        - Jar - string - Required - A path to a JAR file run during the step. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
        - MainClass - string - Optional - The name of the main class in the specified Java file. If not specified, the JAR file should specify a Main-Class in its manifest file. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
        - Args - string|array - Optional - A list of command line arguments passed to the JAR file's main function when executed. Pass a string for a single value, or an indexed array for multiple values.
- BootstrapActions - array - Optional - A list of bootstrap actions that will be run before Hadoop is started on the cluster nodes.
  - x - array - Optional - This represents a simple array index.
    - Name - string - Required - The name of the bootstrap action. [Constraints: The value must be between 0 and 256 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
    - ScriptBootstrapAction - array - Required - The script run by the bootstrap action.
      - x - array - Optional - This represents a simple array index.
        - Path - string - Required - Location of the script to run during a bootstrap action. Can be either a location in Amazon S3 or on a local file system. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
        - Args - string|array - Optional - A list of command line arguments to pass to the bootstrap action script. Pass a string for a single value, or an indexed array for multiple values.
- SupportedProducts - string|array - Optional - A list of strings that indicates third-party software to use with the job flow. For more information, go to Use Third Party Applications with Amazon EMR. Currently supported values are:
  - "karmasphere-enterprise-utility" - tag the job flow for management by Karmasphere.
  - "mapr-m3" - launch the job flow using MapR M3 Edition.
  - "mapr-m5" - launch the job flow using MapR M5 Edition.
  Pass a string for a single value, or an indexed array for multiple values.
- VisibleToAllUsers - boolean - Optional - Whether the job flow is visible to all IAM users of the AWS account associated with the job flow. If this value is set to true, all IAM users of that AWS account can view and (if they have the proper policy permissions set) manage the job flow. If it is set to false, only the IAM user that created the job flow can view and manage it.
- JobFlowRole - string - Optional - An IAM role for the job flow. The EC2 instances of the job flow assume this role. The default role is EMRJobflowDefault. In order to use the default role, you must have already created it using the CLI. [Constraints: The value must be between 0 and 10280 characters, and must match the following regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]* ]
- curlopts - array - Optional - A set of values to pass directly into curl_setopt(), where the key is a pre-defined CURLOPT_* constant.
- returnCurlHandle - boolean - Optional - A private toggle specifying that the cURL handle be returned rather than actually completing the request. This toggle is useful for manually managed batch requests.
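As a quick illustration of the optional keys described above, here is a sketch of an $opt array; the bucket name and product choice are placeholders, not recommendations:

```php
<?php
// Illustrative $opt array exercising several of the optional keys.
// The bucket name and product value are placeholders.
$opt = array(
    'LogUri'            => 's3n://my-logging-bucket/',
    'AmiVersion'        => 'latest',
    'SupportedProducts' => 'mapr-m3', // a single string is allowed here
    'VisibleToAllUsers' => true,      // let all IAM users of the account see the job flow
    // 'curlopts' => array(CURLOPT_TIMEOUT => 30), // values passed directly to curl_setopt()
);

var_dump($opt['SupportedProducts']); // string(7) "mapr-m3"
```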
Returns
CFResponse - A CFResponse object containing a parsed HTTP response.
Examples
Create and run a new Hadoop job flow without any starting steps.
If a new Hadoop job is created this way (i.e., without steps), steps must be added to the job with add_job_flow_steps().
$emr = new AmazonEMR();

$response = $emr->run_job_flow('my-hadoop-job', array(
    'Ec2KeyName' => 'hadoop',
    'HadoopVersion' => '0.20',
    'KeepJobFlowAliveWhenNoSteps' => 'true',
    'InstanceGroups' => array(
        array( // Group #1
            'InstanceCount' => 1,
            'InstanceRole' => 'MASTER',
            'InstanceType' => 'm1.small',
            'Market' => 'ON_DEMAND',
            'Name' => 'my-hadoop-cluster',
        ),
        array( // Group #2
            'InstanceCount' => 1,
            'InstanceRole' => 'CORE',
            'InstanceType' => 'm1.small',
            'Market' => 'ON_DEMAND',
            'Name' => 'my-hadoop-core-cluster',
        )
    ),
    'Placement' => array(
        'AvailabilityZone' => 'us-east-1d'
    )
));

// Success?
var_dump($response->isOK());
// Success?
var_dump($response->isOK());
Result:
bool(true)
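Because the job flow above was started without steps, work has to be submitted to it afterwards. The sketch below builds a step array and shows the follow-up call; the call itself is commented out because it needs live AWS credentials and the JobFlowId returned by run_job_flow(), and the add_job_flow_steps() argument shape shown here is an assumption you should verify against your SDK version:

```php
<?php
// Build a step to submit to the already-running job flow.
// The JAR path and arguments are placeholders.
$steps = array(
    array(
        'Name' => 'My First Step',
        'ActionOnFailure' => 'CANCEL_AND_WAIT',
        'HadoopJarStep' => array(
            'Jar'  => 's3://my-bucket/my-hadoop.jar',
            'Args' => array('--input', 's3n://my-bucket/input/'),
        ),
    ),
);

// Requires live credentials and the JobFlowId from the earlier response
// (signature assumed; check your SDK version):
// $response = $emr->add_job_flow_steps($job_flow_id, $steps);

var_dump($steps[0]['Name']); // string(13) "My First Step"
```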
Create and run a new Hadoop job flow using pre-built steps.
The CFHadoopStep class contains pre-built Hadoop steps that can be used.
$emr = new AmazonEMR();

$response = $emr->run_job_flow('my-hadoop-job', array(
    // Required parameters
    'Ec2KeyName' => 'hadoop',
    'KeepJobFlowAliveWhenNoSteps' => 'true',
    'InstanceGroups' => array(
        array( // Group #1
            'InstanceCount' => 1,
            'InstanceRole' => 'MASTER',
            'InstanceType' => 'm1.small',
            'Market' => 'ON_DEMAND',
            'Name' => 'my-hadoop-cluster',
        )
    ),
    'Placement' => array(
        'AvailabilityZone' => 'us-east-1d'
    )
), array(
    // Optional parameters
    'LogUri' => 's3n://my-hadoop-logging-bucket',
    'AmiVersion' => '2.0',
    'AdditionalInfo' => json_encode(array(
        'Username' => 'user',
        'Password' => 'secret-pass'
    )),
    'Steps' => array(
        new CFStepConfig(array(
            'Name' => 'Install Hive',
            'ActionOnFailure' => 'CONTINUE',
            'HadoopJarStep' => CFHadoopStep::install_hive()
        )),
        new CFStepConfig(array(
            'Name' => 'Install Pig',
            'ActionOnFailure' => 'CONTINUE',
            'HadoopJarStep' => CFHadoopStep::install_pig()
        )),
        new CFStepConfig(array(
            'Name' => 'Custom Job Step',
            'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
            'HadoopJarStep' => array(
                'Jar' => 's3://my-bucket/my-hadoop.jar',
                'Args' => array('--my-task', '-v', 'my-value')
            )
        ))
    )
));
// Success?
var_dump($response->isOK());
Result:
bool(true)
Construct a Bootstrap Action
This example shows how to:
- Pass arguments to Hive scripts run by the job flow.
- Invoke a bootstrap action to configure Hadoop.
- Specify the Hive versions for a job flow step.
$emr = new AmazonEMR();

$response = $emr->run_job_flow('my-hadoop-job', array(
    // Required parameters
    'TerminationProtected' => 'false',
    'HadoopVersion' => '0.20.205',
    'Ec2KeyName' => 'hadoop',
    'KeepJobFlowAliveWhenNoSteps' => 'false',
    'InstanceGroups' => array(
        array(
            'Name' => 'Master Instance Group',
            'Market' => 'ON_DEMAND',
            'InstanceType' => 'm1.small',
            'InstanceCount' => 1,
            'InstanceRole' => 'MASTER',
        ),
        array(
            'Name' => 'Core Instance Group',
            'Market' => 'ON_DEMAND',
            'InstanceType' => 'm1.small',
            'InstanceCount' => 1,
            'InstanceRole' => 'CORE',
        ),
    ),
), array(
    // Optional parameters
    'Name' => 'My Job',
    'AmiVersion' => 'latest',
    'LogUri' => 's3n://my-hadoop-logging-bucket/',
    'AdditionalInfo' => json_encode(array(
        'Username' => 'user',
        'Password' => 'secret-pass'
    )),
    'BootstrapActions' => array(
        array(
            'Name' => 'Bootstrap Action 1',
            'ScriptBootstrapAction' => array(
                'Path' => 's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
                'Args' => array(
                    '--mapred-config-file',
                    's3://bandish-test/mapred-site.xml',
                ),
            ),
        ),
    ),
    'Steps' => array(
        new CFStepConfig(array(
            'Name' => 'Setup Hive',
            'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
            'HadoopJarStep' => CFHadoopStep::install_hive(),
        )),
        new CFStepConfig(array(
            'Name' => 'Run Hive Script',
            'ActionOnFailure' => 'CANCEL_AND_WAIT',
            'HadoopJarStep' => array(
                'Jar' => 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
                'Args' => array(
                    's3://us-east-1.elasticmapreduce/libs/hive/hive-script',
                    '--run-hive-script',
                    '--base-path', 's3://us-east-1.elasticmapreduce/libs/hive/',
                    '--hive-versions', '0.7.1.3',
                    '--args',
                    '-f', 's3n://myBucket/myHiveScript.hql',
                    '-d', 'INPUT=Var_Value1',
                    '-d', 'LIB=Var_Value2',
                    '-d', 'OUTPUT=Var_Value3',
                ),
            ),
        )),
    ),
));
Source
Method defined in services/emr.class.php
public function run_job_flow($name, $instances, $opt = null)
{
    if (!$opt) $opt = array();
    $opt['Name'] = $name;

    // Required map (non-list)
    $opt = array_merge($opt, CFComplexType::map(array(
        'Instances' => (is_array($instances) ? $instances : array($instances))
    ), 'member'));

    // Optional list + map
    if (isset($opt['Steps']))
    {
        $opt = array_merge($opt, CFComplexType::map(array(
            'Steps' => $opt['Steps']
        ), 'member'));
        unset($opt['Steps']);
    }

    // Optional list + map
    if (isset($opt['BootstrapActions']))
    {
        $opt = array_merge($opt, CFComplexType::map(array(
            'BootstrapActions' => $opt['BootstrapActions']
        ), 'member'));
        unset($opt['BootstrapActions']);
    }

    // Optional list (non-map)
    if (isset($opt['SupportedProducts']))
    {
        $opt = array_merge($opt, CFComplexType::map(array(
            'SupportedProducts' => (is_array($opt['SupportedProducts']) ? $opt['SupportedProducts'] : array($opt['SupportedProducts']))
        ), 'member'));
        unset($opt['SupportedProducts']);
    }

    return $this->authenticate('RunJobFlow', $opt);
}
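The CFComplexType::map() calls above flatten the nested configuration arrays into the dotted, "member"-indexed query parameters the EMR API expects. The following is a simplified re-implementation for illustration only, not the SDK's actual CFComplexType class (which handles more cases, such as objects):

```php
<?php
// Simplified sketch of AWS Query-style flattening: nested associative
// arrays become dotted keys, and numerically-indexed list entries get a
// 1-based "member.N" segment.
function flatten_complex_type(array $input, $member = 'member', $prefix = '')
{
    $out = array();
    foreach ($input as $key => $value) {
        // Numeric keys are list positions: emit "member.N" (1-based).
        $part = is_int($key) ? $member . '.' . ($key + 1) : $key;
        $path = ($prefix === '') ? $part : $prefix . '.' . $part;
        if (is_array($value)) {
            $out = array_merge($out, flatten_complex_type($value, $member, $path));
        } else {
            $out[$path] = $value;
        }
    }
    return $out;
}

$flat = flatten_complex_type(array(
    'Instances' => array(
        'InstanceGroups' => array(
            array('InstanceRole' => 'MASTER', 'InstanceCount' => 1),
        ),
    ),
));

var_dump($flat['Instances.InstanceGroups.member.1.InstanceRole']); // string(6) "MASTER"
```

This is why the method merges the mapped keys into $opt and then unsets the original nested key: the request is ultimately signed and sent as a flat list of name/value pairs.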