Managing Amazon EMR on EC2 - Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio is in preview release and is subject to change.

Managing Amazon EMR on EC2

When you work in a project, you can manage that project's Amazon EC2 resources and view monitoring and logging data for them. You can create and configure Amazon EMR on EC2 clusters, and terminate and remove them when you no longer need them. While a cluster is running, its metrics are automatically sent to Amazon CloudWatch, and logging data is preserved in the Spark UI.

Adding Amazon EMR on EC2 clusters

As a data worker, you can use Amazon EMR on EC2 by adding existing or new Amazon EMR on EC2 clusters as compute instances to a project in Amazon SageMaker Unified Studio.

To create a new Amazon EMR on EC2 cluster, your admin must first enable blueprints. On-demand creation isn't supported for Amazon EMR on EC2 in Quick setup.

Adding new Amazon EMR on EC2 clusters

After your Admin has enabled blueprints:

  1. From inside the project management view, select "Compute" from the navigation bar.

  2. In the Compute panel, select the "Data analytics" tab.

  3. To create a new Amazon EMR on EC2 cluster, select the "Add compute" dropdown menu and then choose "New compute".

  4. In the "Add compute" modal, you can select the type of compute you would like to add to your project. Select "Create new compute resources".

  5. Select "Amazon EMR on EC2 cluster".

  6. In the "Add Compute" dialog box, specify a name for the Amazon EMR on EC2 cluster, provide a description, and choose the Amazon EMR release (such as EMR 7.5) to install on your cluster.

  7. After configuring these settings, select "Add compute". After some time, your Amazon EMR on EC2 cluster will be added to your project.

Adding existing Amazon EMR on EC2 clusters

  1. From inside the project management view, select "Compute" from the navigation bar.

  2. In the Compute panel, select the "Data analytics" tab.

  3. To add an existing Amazon EMR on EC2 cluster, select the "Add compute" dropdown menu and then choose "Connect to existing compute resources".

  4. In the "Add compute" modal, you can select the type of compute you would like to add to your project. Select "EMR on EC2 cluster".

  5. To complete the "Add Compute" dialog box, you need the correct permissions to access the Amazon EMR on EC2 cluster. Select the "Copy project information" button to copy the data your admin needs to grant you access, then send that information to your admin.

  6. After your admin has granted you access, specify the ARNs associated with the cluster. You must fill in the "Access role ARN", "EMR on EC2 cluster ARN", "Compute name", and "Instance profile role ARN" fields. After configuring these settings, select "Add compute". Your Amazon EMR on EC2 cluster will be added to your project.

After you add a cluster to a project, the cluster appears on the "Data analytics" tab of the Compute panel. Select a cluster in the list to view its details.
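Before you paste values into the fields for an existing cluster, it can help to check that each ARN at least has the expected shape. The following is a minimal sketch, not part of Amazon SageMaker Unified Studio: the function name and the patterns are illustrative assumptions based on the general ARN layout (`arn:partition:service:region:account-id:resource`), and the sketch accepts either a role ARN or an instance profile ARN for the "Instance profile role ARN" field because the exact form the dialog expects is not stated here. A well-formed ARN can still be rejected if the permissions behind it are missing.

```python
import re

# IAM is a global service, so the region field of an IAM ARN is empty.
IAM_ROLE_ARN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")
# Assumption: accept either form for the instance profile field.
IAM_ROLE_OR_PROFILE_ARN = re.compile(
    r"^arn:aws[a-z-]*:iam::\d{12}:(role|instance-profile)/[\w+=,.@/-]+$"
)
# Amazon EMR cluster IDs start with "j-".
EMR_CLUSTER_ARN = re.compile(
    r"^arn:aws[a-z-]*:elasticmapreduce:[a-z0-9-]+:\d{12}:cluster/j-[A-Z0-9]+$"
)


def check_existing_cluster_inputs(access_role_arn, cluster_arn,
                                  instance_profile_role_arn, compute_name):
    """Return a list of problems; an empty list means the values look well-formed."""
    problems = []
    if not IAM_ROLE_ARN.match(access_role_arn):
        problems.append("Access role ARN does not look like an IAM role ARN")
    if not EMR_CLUSTER_ARN.match(cluster_arn):
        problems.append("EMR on EC2 cluster ARN does not look like a cluster ARN")
    if not IAM_ROLE_OR_PROFILE_ARN.match(instance_profile_role_arn):
        problems.append("Instance profile role ARN does not look like an IAM ARN")
    if not compute_name.strip():
        problems.append("Compute name is empty")
    return problems
```

An empty result only means the strings are syntactically plausible; your admin still has to grant access before the cluster can be added.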

Terminating and removing clusters

When you no longer need an Amazon EMR on EC2 cluster, the cluster can be terminated and removed.

To remove a cluster:

  1. Log in to Amazon SageMaker Unified Studio and navigate to the "Data analytics" tab of the Compute section. Select the name of the compute instance you would like to remove.

  2. On the compute details page, select the "Terminate and remove" option.

  3. A dialog box asks you to confirm that you want to terminate and remove the compute, which in this case is your Amazon EMR on EC2 cluster. To confirm, type "confirm" in the text box.

  4. Select "Terminate and remove compute" to begin termination and removal.

  5. After a few minutes, your cluster should be removed from the project.
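If you also have direct access to the Amazon EMR API in the account that owns the cluster, you can confirm that the underlying cluster has stopped. The sketch below is an optional check outside of Amazon SageMaker Unified Studio and rests on assumptions: that `boto3` is installed, that your credentials allow `DescribeCluster`, and that you know the cluster ID and Region. The helper names are hypothetical. It only reads the cluster state and does not terminate anything.

```python
# States the Amazon EMR API reports once a cluster has shut down.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def is_terminated(state: str) -> bool:
    """True when the reported state means the cluster is no longer running."""
    return state in TERMINAL_STATES


def cluster_state(cluster_id: str, region: str) -> str:
    """Look up the current state of a cluster through the Amazon EMR API."""
    import boto3  # imported lazily so is_terminated works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.describe_cluster(ClusterId=cluster_id)
    return response["Cluster"]["Status"]["State"]
```

A cluster that is still shutting down reports `TERMINATING`, so `is_terminated` returns `False` until the shutdown completes.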

Monitoring

You can monitor the performance of your Amazon EMR on EC2 clusters to ensure optimal resource use and efficient job execution. While an Amazon EMR cluster is running, its metrics are automatically collected and sent to Amazon CloudWatch.

You can see CloudWatch metrics for a specific cluster by selecting the cluster you're interested in from the list of clusters under the Cluster tab. Selecting a cluster will bring you to the Detail view for that cluster. After you've selected a cluster, select the "Monitoring" tab.

You will be able to see a grid view of the CloudWatch Metrics for the cluster you selected.

You can see information presented through different views by using the "Dashboard View" drop-down menu: Cluster Overview, Primary Node Group, Core Node Group, Task Node Group. You can also adjust the time range.
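The same metrics can also be read outside the Monitoring tab through the CloudWatch API. The following is a minimal sketch under stated assumptions: `boto3` is installed, your credentials allow `GetMetricStatistics`, and you know the cluster ID and Region. Amazon EMR publishes cluster metrics in the `AWS/ElasticMapReduce` namespace with a `JobFlowId` dimension. The function names, the three-hour default window, and the choice of the `Average` statistic are illustrative, not something the Studio prescribes.

```python
from datetime import datetime, timedelta, timezone


def emr_metric_request(cluster_id, metric_name, hours=3, period_seconds=300):
    """Build the keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric_name,  # for example "IsIdle" or "CoreNodesRunning"
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,  # seconds per datapoint
        "Statistics": ["Average"],
    }


def fetch_datapoints(cluster_id, metric_name, region, **window):
    """Call CloudWatch and return the datapoints ordered by timestamp."""
    import boto3  # imported lazily so the request builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    request = emr_metric_request(cluster_id, metric_name, **window)
    response = cloudwatch.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Changing `hours` plays the same role as adjusting the time range in the Monitoring tab, for example `fetch_datapoints(cluster_id, "IsIdle", region, hours=1)` with your own cluster ID and Region.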

Spark History Server

You can use the live Spark UI in a notebook session to view details about Spark jobs, such as tasks, executors, and logs.

You can explore the Spark History Server for a cluster at any time. To do this, select your cluster from the list of all clusters assigned to a project, which brings up the Detail view for the cluster. In the Detail view, select the Applications tab and then select the "Spark History Server" link.