Create a cluster with Apache Spark - Amazon EMR

Create a cluster with Apache Spark

The following procedure creates a cluster with Spark installed using Quick Options in the Amazon EMR console.

You can alternatively use Advanced Options to further customize your cluster setup, or to submit steps to programmatically install applications and then run custom applications. With either cluster creation option, you can choose to use AWS Glue as your Spark SQL metastore. See Use AWS Glue Data Catalog catalog with Spark on Amazon EMR for more information.

To launch a cluster with Spark installed
  1. Open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Choose Create cluster to use Quick Options.

  3. Enter a Cluster name. Your cluster name can't contain the characters <, >, $, |, or ` (backtick).

  4. For Software Configuration, choose a Release option.

  5. For Applications, choose the Spark application bundle.

  6. Select other options as necessary and then choose Create cluster.

    Note

    To configure Spark when you are creating the cluster, see Configure Spark.

To launch a cluster with Spark installed using the AWS CLI
  • Create the cluster with the following command.

    aws emr create-cluster --name "Spark cluster" --release-label emr-7.6.0 --applications Name=Spark \ --ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles
Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

To launch a cluster with Spark installed using the SDK for Java

Specify Spark as an application with SupportedProductConfig used in RunJobFlowRequest.

  • The following example shows how to create a cluster with Spark using Java.

    import com.amazonaws.AmazonClientException; import com.amazonaws.auth.AWSCredentials; import com.amazonaws.auth.AWSStaticCredentialsProvider; import com.amazonaws.auth.profile.ProfileCredentialsProvider; import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce; import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder; import com.amazonaws.services.elasticmapreduce.model.*; import com.amazonaws.services.elasticmapreduce.util.StepFactory; public class Main { public static void main(String[] args) { AWSCredentials credentials_profile = null; try { credentials_profile = new ProfileCredentialsProvider("default").getCredentials(); } catch (Exception e) { throw new AmazonClientException( "Cannot load credentials from .aws/credentials file. " + "Make sure that the credentials file exists and the profile name is specified within it.", e); } AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard() .withCredentials(new AWSStaticCredentialsProvider(credentials_profile)) .withRegion(Regions.US_WEST_1) .build(); // create a step to enable debugging in the AWS Management Console StepFactory stepFactory = new StepFactory(); StepConfig enabledebugging = new StepConfig() .withName("Enable debugging") .withActionOnFailure("TERMINATE_JOB_FLOW") .withHadoopJarStep(stepFactory.newEnableDebuggingStep()); Application spark = new Application().withName("Spark"); RunJobFlowRequest request = new RunJobFlowRequest() .withName("Spark Cluster") .withReleaseLabel("emr-5.20.0") .withSteps(enabledebugging) .withApplications(spark) .withLogUri("s3://path/to/my/logs/") .withServiceRole("EMR_DefaultRole") .withJobFlowRole("EMR_EC2_DefaultRole") .withInstances(new JobFlowInstancesConfig() .withEc2SubnetId("subnet-12ab3c45") .withEc2KeyName("myEc2Key") .withInstanceCount(3) .withKeepJobFlowAliveWhenNoSteps(true) .withMasterInstanceType("m4.large") .withSlaveInstanceType("m4.large")); RunJobFlowResult result = emr.runJobFlow(request); System.out.println("The cluster ID is " + result.toString()); } }