Configure Docker for use with Amazon EMR clusters
Amazon EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the Amazon EMR cluster or inside a Docker container. Docker containers provide custom execution environments in which application code runs. The custom execution environment is isolated from the execution environment of the YARN NodeManager and other applications.
Docker containers can include special libraries used by the application and they can provide different versions of native tools and libraries, such as R and Python. You can use familiar Docker tooling to define libraries and runtime dependencies for your applications.
Amazon EMR 6.x clusters are configured by default to allow YARN applications, such as Spark, to run using Docker containers. To customize your container configuration, edit the Docker support options defined in the yarn-site.xml and container-executor.cfg files available in the /etc/hadoop/conf directory. For details about each configuration option and how it is used, see Launching applications using Docker containers.
You can choose to use Docker when you submit a job. Use the following variables to specify the Docker runtime and Docker image.
YARN_CONTAINER_RUNTIME_TYPE=docker
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}
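For a Spark job, one way to pass these variables is through Spark's spark.executorEnv.* and spark.yarn.appMasterEnv.* configuration prefixes, which set environment variables on the executors and on the application master. The following is a minimal sketch rather than a definitive recipe; the helper name docker_runtime_confs, the image name, and the script my_job.py are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the spark-submit --conf arguments that ask YARN
# to run the Spark application master and executors inside a Docker image.
docker_runtime_confs() {
  local image="$1"
  local prefix
  for prefix in spark.executorEnv spark.yarn.appMasterEnv; do
    printf -- '--conf %s.YARN_CONTAINER_RUNTIME_TYPE=docker\n' "$prefix"
    printf -- '--conf %s.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=%s\n' "$prefix" "$image"
  done
}

# Example usage (assumed image and script names; not run here):
#   spark-submit --master yarn --deploy-mode cluster \
#     $(docker_runtime_confs "your-public-repo/your-image:latest") my_job.py
```

The helper only prints arguments, so you can inspect them before passing them to spark-submit.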
When you use Docker containers to run your YARN applications, YARN downloads the Docker image that you specify when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. The configuration options for a Docker registry depend on whether you deploy the cluster using a public or private subnet.
Docker registries
A Docker registry is a storage and distribution system for Docker images. For Amazon EMR we recommend that you use Amazon ECR, which is a fully managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.
Deployment considerations
Docker registries require network access from each host in the cluster. This is because each host downloads images from the Docker registry when your YARN application is running on the cluster. These network connectivity requirements may limit your choice of Docker registry, depending on whether you deploy your Amazon EMR cluster into a public or private subnet.
Public subnet
When EMR clusters are deployed in a public subnet, the nodes running YARN NodeManager can directly access any registry available over the internet.
Private subnet
When EMR clusters are deployed in a private subnet, the nodes running YARN NodeManager don't have direct access to the internet. Docker images can be hosted in Amazon ECR and accessed through AWS PrivateLink.
For more information about how to use AWS PrivateLink to allow access to Amazon ECR in a private subnet scenario, see Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.
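As a rough illustration of what that setup involves, image pulls from Amazon ECR in a private subnet typically rely on two interface endpoints (for the ECR API and the Docker registry) plus a gateway endpoint for Amazon S3, where image layers are stored. The sketch below only lists the service names for a Region; the helper name ecr_endpoint_services and the placeholder IDs in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the VPC endpoint types and service names that
# image pulls from Amazon ECR typically need in a given Region.
ecr_endpoint_services() {
  local region="$1"
  printf 'Interface com.amazonaws.%s.ecr.api\n' "$region"
  printf 'Interface com.amazonaws.%s.ecr.dkr\n' "$region"
  # Image layers are stored in Amazon S3, so a gateway endpoint is also used.
  printf 'Gateway com.amazonaws.%s.s3\n' "$region"
}

# Example usage (placeholder IDs; not run here):
#   aws ec2 create-vpc-endpoint --vpc-id <vpc-id> \
#     --vpc-endpoint-type Interface \
#     --service-name com.amazonaws.us-east-1.ecr.dkr \
#     --subnet-ids <subnet-id> --security-group-ids <security-group-id> \
#     --private-dns-enabled
```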
Configuring Docker registries
To use Docker registries with Amazon EMR, you must configure Docker to trust the specific registry that you want to use to resolve Docker images. The default trusted registries are local (private) and centos. To use other public repositories or Amazon ECR, you can override the docker.trusted.registries settings in /etc/hadoop/conf/container-executor.cfg using the EMR Classification API with the container-executor classification key.
The following example shows how to configure the cluster to trust both a public repository, named your-public-repo, and an ECR registry endpoint, 123456789123.dkr.ecr.us-east-1.amazonaws.com. If you use ECR, replace this endpoint with your specific ECR endpoint.
[
  {
    "Classification": "container-executor",
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.trusted.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com",
          "docker.privileged-containers.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com"
        }
      }
    ]
  }
]
To launch an Amazon EMR 6.0.0 cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file named container-executor.json with the contents of the preceding container-executor JSON configuration. Then, use the following commands to launch the cluster.
export KEYPAIR=<Name of your Amazon EC2 key-pair>
export SUBNET_ID=<ID of the subnet to which to deploy the cluster>
export INSTANCE_TYPE=<Name of the instance type to use>
export REGION=<Region to which to deploy the cluster>

aws emr create-cluster \
    --name "EMR-6.0.0" \
    --region $REGION \
    --release-label emr-6.0.0 \
    --applications Name=Hadoop Name=Spark \
    --service-role EMR_DefaultRole \
    --ec2-attributes KeyName=$KEYPAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=2,InstanceType=$INSTANCE_TYPE \
    --configurations file://container-executor.json
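After the cluster starts, you may want to confirm on a node that the registry actually landed in container-executor.cfg. The following sketch assumes the file uses the usual key=value layout with a comma-separated docker.trusted.registries entry; the helper name registry_is_trusted is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if REGISTRY appears as an exact entry in the
# comma-separated docker.trusted.registries line of a container-executor.cfg.
registry_is_trusted() {
  local cfg="$1" registry="$2"
  grep -E '^[[:space:]]*docker\.trusted\.registries[[:space:]]*=' "$cfg" \
    | head -n 1 \
    | cut -d '=' -f 2- \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qxF -- "$registry"
}

# Example usage on a cluster node (not run here):
#   registry_is_trusted /etc/hadoop/conf/container-executor.cfg \
#     123456789123.dkr.ecr.us-east-1.amazonaws.com && echo "trusted"
```

The check matches whole entries only, so a partial name such as cent does not count as a match for centos.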
Configuring YARN to access Amazon ECR on EMR 6.0.0 and earlier
If you're new to Amazon ECR, follow the instructions in Getting started with Amazon ECR and verify that you have access to Amazon ECR from each instance in your Amazon EMR cluster.
On EMR 6.0.0 and earlier, to access Amazon ECR using the Docker command, you must first generate credentials. To verify that YARN can access images from Amazon ECR, use the container environment variable YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG to pass a reference to the credentials that you generated.
Run the following command on one of the core nodes to get the login line for your ECR account.
aws ecr get-login --region us-east-1 --no-include-email
The get-login command generates the correct Docker CLI command to run to create credentials. Copy and run the output from get-login.
sudo docker login -u AWS -p <password> https://<account-id>.dkr.ecr.us-east-1.amazonaws.com
This command generates a config.json file in the /root/.docker folder. Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to Amazon ECR.
Run the commands below to copy the config.json file to your home directory.
mkdir -p ~/.docker
sudo cp /root/.docker/config.json ~/.docker/config.json
sudo chmod 644 ~/.docker/config.json
Run the command below to put the config.json file in HDFS so that jobs running on the cluster can use it.
hadoop fs -put ~/.docker/config.json /user/hadoop/
YARN can now access Amazon ECR as a Docker image registry and pull images during job execution.
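To reference the uploaded credentials from a Spark job, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG variable can be set for both the executors and the application master. The sketch below assumes the file was uploaded to /user/hadoop/config.json as in the preceding command; the helper name docker_client_config_confs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the spark-submit --conf arguments that point
# YARN at the Docker client configuration stored in HDFS.
docker_client_config_confs() {
  # Default matches the HDFS location used in the upload step (an assumption
  # if you stored the file elsewhere).
  local config_uri="${1:-hdfs:///user/hadoop/config.json}"
  local prefix
  for prefix in spark.executorEnv spark.yarn.appMasterEnv; do
    printf -- '--conf %s.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=%s\n' \
      "$prefix" "$config_uri"
  done
}

# Example usage (not run here): add the printed arguments to a spark-submit
# command that also sets the Docker runtime type and image.
#   docker_client_config_confs
```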
After configuring Docker registries and YARN, you can run YARN applications using Docker containers. For more information, see Run Spark applications with Docker using Amazon EMR 6.0.0.
In EMR 6.1.0 and later, you don't have to manually set up authentication to Amazon ECR. If an Amazon ECR registry is detected in the container-executor classification key, the Amazon ECR auto authentication feature activates, and YARN handles the authentication process when you submit a Spark job with an ECR image. You can confirm whether automatic authentication is enabled by checking yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled in yarn-site. Automatic authentication is enabled and the YARN authentication setting is set to true if the docker.trusted.registries setting contains an ECR registry URL.
Prerequisites for using automatic authentication to Amazon ECR
EMR version 6.1.0 or later
The ECR registry included in the configuration is in the same Region as the cluster
An IAM role with permissions to get an authorization token and pull any image
Refer to Setting up with Amazon ECR for more information.
How to enable automatic authentication
Follow Configuring Docker registries to set an Amazon ECR registry as a trusted registry, and make sure the Amazon ECR repository and the cluster are in the same Region.
To enable this feature even when the ECR registry is not set in the trusted registry, use the configuration classification to set yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled to true.
How to disable automatic authentication
By default, automatic authentication is disabled if no Amazon ECR registry is detected in the trusted registry.
To disable automatic authentication, even when the Amazon ECR registry is set in the trusted registry, use the configuration classification to set yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled to false.
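Both the enable and disable cases come down to one yarn-site property. As a minimal sketch, assuming the property is set through the yarn-site configuration classification, the following helper writes a classification file with the chosen value; the helper name write_ecr_auto_auth_config and the output file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a yarn-site configuration classification that
# turns Amazon ECR automatic authentication on ("true") or off ("false").
write_ecr_auto_auth_config() {
  local enabled="$1" out_file="$2"
  case "$enabled" in
    true|false) ;;
    *) echo "expected true or false, got: $enabled" >&2; return 1 ;;
  esac
  cat > "$out_file" <<EOF
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "$enabled"
    }
  }
]
EOF
}

# Example usage (assumed file name; not run here):
#   write_ecr_auto_auth_config false yarn-site-ecr.json
```

To apply this together with a container-executor classification, place both objects in the same JSON array, because a cluster takes a single list of configurations.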
How to check if automatic authentication is enabled on a cluster
On the master node, use a text editor such as vi to view the contents of the file: vi /etc/hadoop/conf.empty/yarn-site.xml. Check the value of yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled.
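If you prefer a non-interactive check, the value can also be read with standard text tools. The sketch below assumes the conventional Hadoop layout, with each property's name and value elements on the same or adjacent lines; the helper name get_hadoop_property is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the <value> of a named property from a Hadoop
# *-site.xml file. Assumes <value> is on the <name> line or within the two
# lines that follow it.
get_hadoop_property() {
  local file="$1" name="$2"
  grep -F -A 2 "<name>${name}</name>" "$file" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
    | head -n 1
}

# Example usage on the master node (not run here):
#   get_hadoop_property /etc/hadoop/conf.empty/yarn-site.xml \
#     yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled
```

An empty result means the property was not found in that layout, in which case viewing the file directly is the safer check.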