

# Getting started with AWS Batch tutorials
<a name="Batch_GetStarted"></a>

You can use the AWS Batch first-run wizard to get started quickly with AWS Batch. After you complete the Prerequisites, you can use the first-run wizard to create a compute environment, a job definition, and a job queue. 

You can also submit a sample "Hello World" job using the AWS Batch first-run wizard to test your configuration. If you already have a Docker image that you want to launch in AWS Batch, you can use that image to create a job definition. 


# Getting started with Amazon EC2 orchestration using the Wizard
<a name="getting-started-ec2"></a>

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. 

You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

## Overview
<a name="getting-started-ec2-context"></a>

This tutorial demonstrates how to set up AWS Batch with the Wizard to configure Amazon EC2 and run `Hello World`.

**Intended Audience**  
This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.

**Features Used**  
This tutorial shows you how to use the AWS Batch console wizard to:  
+ Create and configure an Amazon EC2 compute environment
+ Create a job queue
+ Create a job definition
+ Create and submit a job to run
+ View the output of the job in CloudWatch

**Time Required**  
It should take about 10–15 minutes to complete this tutorial.

**Regional Restrictions**  
There are no country or regional restrictions associated with using this solution.

**Resource Usage Costs**  
There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/batch/latest/userguide/getting-started-ec2.html)

## Prerequisites
<a name="getting-started-ec2-prerequisite"></a>

Before you begin:
+ Create an AWS account if you don't have one.
+ Create the [`ecsInstanceRole` Instance role](batch-check-ecsinstancerole.md).

## Step 1: Create a compute environment
<a name="create-ce-1"></a>

**Important**  
To get started as simply and quickly as possible, this tutorial includes steps with default settings. Before creating for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.

To create a compute environment for an Amazon EC2 orchestration, do the following:

1. Open the [AWS Batch console first-run wizard](https://console.aws.amazon.com/batch/home#wizard).

1. For **Configure job and orchestration type**, choose **Amazon Elastic Compute Cloud (Amazon EC2)**.

1. Choose **Next**.

1. In the **Compute environment configuration** section for **Name**, specify a unique name for your compute environment. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For **Instance role**, choose an existing instance role that has the required IAM permissions attached. This instance role allows the Amazon ECS container instances in your compute environment to make calls to the required AWS API operations. For more information, see [Amazon ECS instance role](instance_IAM_role.md). 

   The default name of the **Instance role** is `ecsInstanceRole`. 

1. For **Instance configuration** you can leave the default settings.

1. For **Network configuration** use your default VPC for the AWS Region.

1. Choose **Next**.

## Step 2: Create a job queue
<a name="create-job-queue-1"></a>

A job queue stores your submitted jobs until the AWS Batch Scheduler runs the job on a resource in your compute environment. For more information, see [Job queues](job_queues.md).

To create a job queue for an Amazon EC2 orchestration, do the following:

1. In the **Job queue configuration** section, for **Name**, specify a unique name for your job queue. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 3: Create a job definition
<a name="create-job-definition-1"></a>

AWS Batch job definitions specify how jobs are to be run. Even though each job must reference a job definition, many of the parameters that are specified in the job definition can be overridden at runtime. 
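
Many job definition parameters can be overridden when you submit a job. As an illustration only (the job, queue, and definition names below are placeholders), the AWS CLI lets you replace the job definition's default command at submission time:

```
# Submit a job that overrides the job definition's default command.
# The job name, queue, and definition names are placeholder values.
aws batch submit-job \
    --job-name my-override-job \
    --job-queue my-job-queue \
    --job-definition my-job-def \
    --container-overrides '{"command": ["echo", "overridden message"]}'
```

`--container-overrides` can also override environment variables and resource requirements for a single submission.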

To create the job definition:

1. In the **Create a job definition** section:

   1. For **Name**, specify a unique name for your job definition. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

   1. For **Command - *optional*** you can change `hello world` to a custom message or leave it as is.

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 4: Create a job
<a name="create-job-1"></a>

To create a job, do the following:

1. In the **Job configuration** section for **Name**, specify a unique name for the job. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 5: Review and create
<a name="review-create-1"></a>

On the **Review and create** page, review the configuration steps. If you need to make changes, choose **Edit**.

1. For **Review and create** choose **Create resources**.

1. A window opens as AWS Batch starts to allocate your resources. Once complete, choose **Go to dashboard**. On the dashboard, you should see all of your allocated resources and that the job is in the `Runnable` state. Your job is scheduled to run and should complete in 2–3 minutes.

## Step 6: View the Job's output
<a name="view-job-1"></a>

To view the Job's output, do the following:

1. In the navigation pane choose **Jobs**. 

1. In the **Job queue** drop down choose the Job queue you created for the tutorial.

1. The **Jobs** table lists all of your Jobs and their current status. Once the Job's **Status** is **Succeeded**, choose the **Name** of the Job to view the Job's details. 

1. In the **Details** pane choose **Log stream name**. The CloudWatch console for the Job will open and there should be one event with the **Message** of `hello world` or your custom message.

## Step 7: Clean up your tutorial resources
<a name="delete-1"></a>

You are charged for Amazon EC2 instances while your compute environment is enabled. You can delete the resources you created to stop incurring charges.

To delete the resources you created, do the following:

1. In the navigation pane choose **Job queue**. 

1. In the **Job queue** table choose the Job queue you created for the tutorial.

1. Choose **Disable**. Once the Job queue **State** is **Disabled**, you can choose **Delete**.

1. Once the Job queue is deleted, in the navigation pane choose **Compute environments**.

1. Choose the compute environment you created for this tutorial and then choose **Disable**. It may take 1–2 minutes for the compute environment to be disabled.

1. Once the compute environment's **State** is **Disabled**, choose **Delete**. It may take 1–2 minutes for the compute environment to be deleted.

## Additional resources
<a name="procedure_additional_resources"></a>

After you complete the tutorial, you might want to explore the following topics:
+ Explore the AWS Batch core components. For more information, see [Components of AWS Batch](batch_components.md).
+ Learn more about the different [Compute Environments](compute_environments.md) available in AWS Batch.
+ Learn more about [Job queues](job_queues.md) and their different scheduling options.
+ Learn more about [Job definitions](job_definitions.md) and the different configuration options.
+ Learn more about the different types of [Jobs](jobs.md).

# Getting started with AWS Batch and Fargate orchestration using the Wizard
<a name="getting-started-fargate"></a>

AWS Fargate launches and scales the compute to closely match the resource requirements that you specify for the container. With Fargate, you don't need to over-provision or pay for additional servers. For more information, see [Fargate](https://docs.aws.amazon.com/batch/latest/userguide/fargate.html#when-to-use-fargate).

## Overview
<a name="getting-started-fargate-contextual"></a>

This tutorial demonstrates how to set up AWS Batch with the Wizard to configure AWS Fargate and run `Hello World`.

**Intended Audience**  
This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.

**Features Used**  
This tutorial shows you how to use the AWS Batch console wizard to:  
+ Create and configure an AWS Fargate compute environment
+ Create a job queue
+ Create a job definition
+ Create and submit a job to run
+ View the output of the job in CloudWatch

**Time Required**  
It should take about 10–15 minutes to complete this tutorial.

**Regional Restrictions**  
There are no country or regional restrictions associated with using this solution.

**Resource Usage Costs**  
There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html)

## Prerequisites
<a name="getting-started-fargate-prerequisite"></a>

Before you begin: 
+ Create an AWS account if you don't have one.
+ Create the task execution role. If you haven't already created the [Task Execution Role](create-execution-role.md) then you can create it as part of this tutorial.

## Step 1: Create a compute environment
<a name="create-ce-2"></a>

**Important**  
To get started as simply and quickly as possible, this tutorial includes steps with default settings. Before creating for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.

To create a compute environment for a Fargate orchestration, do the following:

1. Open the [AWS Batch console first-run wizard](https://console.aws.amazon.com/batch/home#wizard).

1. For **Configure job and orchestration type**, choose **Fargate**.

1. Choose **Next**.

1. In the **Compute environment configuration** section for **Name**, specify a unique name for your compute environment. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 2: Create a job queue
<a name="create-job-queue-2"></a>

A job queue stores your submitted jobs until the AWS Batch Scheduler runs the job on a resource in your compute environment.

To create a job queue for a Fargate orchestration, do the following:

1. In the **Job queue configuration** section for **Name**, specify a unique name for your job queue. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For **Priority**, enter 900 for the job queue. 

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 3: Create a job definition
<a name="create-job-definition-2"></a>

To create the job definition:

1. In the **General configuration** section:

   1. For **Name**, specify a unique name for your job definition. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. In the **Fargate platform configuration** section:

   1. Turn on **Assign public IP** to assign a public IP address. You need a public IP to download the container image unless you've set up a private image repository.

   1. For **Execution role**, choose a task execution role that lets Amazon Elastic Container Service (Amazon ECS) agents make AWS calls on your behalf. Choose either **ecsTaskExecutionRole** or **BatchEcsTaskExecutionRole**. 

      To create the **Execution role** choose **Create an execution role**. In the **Create IAM role** modal choose **Create IAM role**. 

      1. The IAM console has the permission setting already configured for creating the execution role. 

      1. For **Trusted entity type** verify that **AWS service** is selected.

      1. For **Service or use case** verify that **Elastic Container Service** is selected.

      1. Choose **Next**.

      1. For **Permissions policies** verify that **AmazonECSTaskExecutionRolePolicy** is selected.

      1. Choose **Next**.

      1. For **Name, review, and create** verify that the role name is **BatchEcsTaskExecutionRole**. 

      1. Choose **Create role**.

      1. In the AWS Batch console choose the refresh button next to **Execution role**. Choose the **BatchEcsTaskExecutionRole** execution role.

1. In the **Container configuration** section:

   1. For **Command**, you can change `hello world` to a custom message or leave it as is.

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 4: Create a job
<a name="create-job-2"></a>

To create a Fargate job, do the following:

1. In the **Job configuration** section for **Name**, specify a unique name for the job. The name can be up to 128 characters in length. It can contain uppercase and lowercase letters, numbers, hyphens (-), and underscores (_).

1. For all other configuration options you can leave the default value.

1. Choose **Next**.

## Step 5: Review and create
<a name="review-create-2"></a>

On the **Review and create** page, review the configuration steps. If you need to make changes, choose **Edit**. When you're finished, choose **Create resources**.

## Step 6: View the Job's output
<a name="view-job-fargate"></a>

To view the Job's output, do the following:

1. In the navigation pane choose **Jobs**. 

1. In the **Job queue** drop down choose the Job queue you created for the tutorial.

1. The **Jobs** table lists all of your Jobs and their current status. Once the Job's **Status** is **Succeeded**, choose the **Name** of the Job to view the Job's details. 

1. In the **Details** pane choose **Log stream name**. The CloudWatch console for the Job will open and there should be one event with the **Message** of `hello world` or your custom message.

## Step 7: Clean up your tutorial resources
<a name="delete-fargate"></a>

You are charged for Fargate compute resources while your jobs run. You can delete the resources you created to stop incurring charges.

To delete the resources you created, do the following:

1. In the navigation pane choose **Job queue**. 

1. In the **Job queue** table choose the Job queue you created for the tutorial.

1. Choose **Disable**. Once the Job queue **State** is **Disabled**, you can choose **Delete**.

1. Once the Job queue is deleted, in the navigation pane choose **Compute environments**.

1. Choose the compute environment you created for this tutorial and then choose **Disable**. It may take 1–2 minutes for the compute environment to be disabled.

1. Once the compute environment's **State** is **Disabled**, choose **Delete**. It may take 1–2 minutes for the compute environment to be deleted.

## Additional resources
<a name="fargate-additional-resources"></a>

After you complete the tutorial, you might want to explore the following topics:
+ Learn more about the [Best practices](best-practices.md).
+ Explore the AWS Batch core components. For more information, see [Components of AWS Batch](batch_components.md).
+ Learn more about the different [Compute Environments](compute_environments.md) available in AWS Batch.
+ Learn more about [Job queues](job_queues.md) and their different scheduling options.
+ Learn more about [Job definitions](job_definitions.md) and the different configuration options.
+ Learn more about the different types of [Jobs](jobs.md).

# Getting started with AWS Batch and Fargate using the AWS CLI
<a name="getting-started-with-fargate-using-the-aws-cli"></a>

This tutorial demonstrates how to set up AWS Batch with AWS Fargate orchestration and run a simple "Hello World" job using the AWS Command Line Interface (AWS CLI). You'll learn how to create compute environments, job queues, job definitions, and submit jobs to AWS Batch.

**Topics**
+ [Prerequisites](#prerequisites)
+ [Create an IAM execution role](#tutorial-fargate-using-the-aws-cli-create-an-iam-execution-role)
+ [Create a compute environment](#create-a-compute-environment)
+ [Create a job queue](#cli-create-a-job-queue)
+ [Create a job definition](#cli-create-a-job-definition)
+ [Submit and monitor a job](#cli-submit-and-monitor-a-job)
+ [View job output](#cli-view-job-output)
+ [Clean up resources](#cli-clean-up-resources)
+ [Going to production](#cli-going-to-production)
+ [Next steps](#cli-next-steps)

## Prerequisites
<a name="prerequisites"></a>

Before you begin this tutorial, make sure you have the following.

1. The AWS CLI. If you need to install it, follow the [AWS CLI installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). You can also [use AWS CloudShell](https://docs.aws.amazon.com/cloudshell/latest/userguide/welcome.html), which includes the AWS CLI.

1. Your AWS CLI configured with appropriate credentials. Run `aws configure` if you haven't set up your credentials yet.

1. Basic familiarity with command line interfaces and containerization concepts.

1. Permissions to create and manage AWS Batch resources, IAM roles, and VPC resources in your AWS account. For more information, see [How AWS Batch works with IAM](security_iam_service-with-iam.md).

1. A subnet ID and security group ID from a VPC in your AWS account. If you don't have a VPC, you can [create one](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html). For more information about using the AWS CLI to retrieve these resource IDs, see [describe-subnets](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/describe-subnets.html) and [describe-security-groups](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ec2/describe-security-groups.html) in the *AWS CLI Command Reference*.
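
For example, the following commands sketch one way to retrieve these IDs from your default VPC (this assumes a default VPC exists in your Region):

```
# Find the default VPC, then print one of its subnet IDs
# and its default security group ID.
VPC_ID=$(aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query 'Vpcs[0].VpcId' --output text)

aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=$VPC_ID \
    --query 'Subnets[0].SubnetId' --output text

aws ec2 describe-security-groups \
    --filters Name=vpc-id,Values=$VPC_ID Name=group-name,Values=default \
    --query 'SecurityGroups[0].GroupId' --output text
```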

**Time Required**: Approximately 15–20 minutes to complete this tutorial.

**Cost**: This tutorial uses Fargate compute resources. The estimated cost for completing this tutorial is less than \$10.01 USD, assuming you follow the cleanup instructions to delete resources immediately after completion. Fargate pricing is based on vCPU and memory resources consumed, charged per second with a 1-minute minimum. For current pricing information, see [AWS Fargate pricing](https://aws.amazon.com/fargate/pricing/).

## Create an IAM execution role
<a name="tutorial-fargate-using-the-aws-cli-create-an-iam-execution-role"></a>

AWS Batch requires an execution role that allows Amazon Elastic Container Service (Amazon ECS) agents to make AWS API calls on your behalf. This role is necessary for Fargate tasks to pull container images and write logs to Amazon CloudWatch.

**Create a trust policy document**

First, create a trust policy that allows the Amazon ECS tasks service to assume the role.

```
cat > batch-execution-role-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
```

**Create the execution role**

The following command creates an IAM role named `BatchEcsTaskExecutionRoleTutorial` using the trust policy you just created.

```
aws iam create-role \
    --role-name BatchEcsTaskExecutionRoleTutorial \
    --assume-role-policy-document file://batch-execution-role-trust-policy.json
```

**Attach the required policy**

Attach the AWS managed policy that provides the necessary permissions for Amazon ECS task execution.

```
aws iam attach-role-policy \
    --role-name BatchEcsTaskExecutionRoleTutorial \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```

The role is now ready to be used by AWS Batch for Fargate task execution.
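
As an optional check, you can confirm the role and its attached policy before moving on:

```
# Print the role ARN to confirm the role exists.
aws iam get-role \
    --role-name BatchEcsTaskExecutionRoleTutorial \
    --query 'Role.Arn' --output text

# List attached policies; AmazonECSTaskExecutionRolePolicy should appear.
aws iam list-attached-role-policies \
    --role-name BatchEcsTaskExecutionRoleTutorial
```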

## Create a compute environment
<a name="create-a-compute-environment"></a>

A compute environment defines the compute resources where your batch jobs will run. For this tutorial, you'll create a managed Fargate compute environment that automatically provisions and scales resources based on job requirements.

**Create the compute environment**

The following command creates a Fargate compute environment. Replace the example subnet and security group IDs with your own per the [Prerequisites](#prerequisites).

```
aws batch create-compute-environment \
    --compute-environment-name my-fargate-compute-env \
    --type MANAGED \
    --state ENABLED \
    --compute-resources type=FARGATE,maxvCpus=128,subnets=subnet-a123456b,securityGroupIds=sg-a12b3456
```

The following shows how the output looks when the command runs successfully.

```
{
    "computeEnvironmentName": "my-fargate-compute-env",
    "computeEnvironmentArn": "arn:aws:batch:us-west-2:123456789012:compute-environment/my-fargate-compute-env"
}
```

**Wait for the compute environment to be ready**

Check the status of your compute environment to ensure it's ready before proceeding.

```
aws batch describe-compute-environments \
    --compute-environments my-fargate-compute-env \
    --query 'computeEnvironments[0].status'
```

```
"VALID"
```

When the status shows `VALID`, your compute environment is ready to accept jobs.
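
If you're scripting these steps, you can poll until the environment is valid instead of checking by hand (a minimal sketch):

```
# Poll every 5 seconds until the compute environment reports VALID.
while [ "$(aws batch describe-compute-environments \
    --compute-environments my-fargate-compute-env \
    --query 'computeEnvironments[0].status' --output text)" != "VALID" ]; do
    echo "Waiting for compute environment..."
    sleep 5
done
echo "Compute environment is ready."
```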

## Create a job queue
<a name="cli-create-a-job-queue"></a>

A job queue stores submitted jobs until the AWS Batch scheduler runs them on resources in your compute environment. Jobs are processed in priority order within the queue.

**Create the job queue**

The following command creates a job queue with priority 900 that uses your Fargate compute environment.

```
aws batch create-job-queue \
    --job-queue-name my-fargate-job-queue \
    --state ENABLED \
    --priority 900 \
    --compute-environment-order order=1,computeEnvironment=my-fargate-compute-env
```

The following shows how the output looks when the command runs successfully.

```
{
    "jobQueueName": "my-fargate-job-queue",
    "jobQueueArn": "arn:aws:batch:us-west-2:123456789012:job-queue/my-fargate-job-queue"
}
```

**Verify the job queue is ready**

Check that your job queue is in the `ENABLED` state and ready to accept jobs.

```
aws batch describe-job-queues \
    --job-queues my-fargate-job-queue \
    --query 'jobQueues[0].state'
```

The following shows how the output looks when the job queue is ready.

```
"ENABLED"
```

## Create a job definition
<a name="cli-create-a-job-definition"></a>

A job definition specifies how jobs are to be run, including the Docker image to use, resource requirements, and other parameters. For Fargate, you'll use resource requirements instead of traditional vCPU and memory parameters.

**Create the job definition**

The following command creates a job definition that runs a simple "hello world" command using the busybox container image. Replace `123456789012` with your actual AWS account ID and replace the example AWS Region with your own.

```
aws batch register-job-definition \
    --job-definition-name my-fargate-job-def \
    --type container \
    --platform-capabilities FARGATE \
    --container-properties '{
        "image": "busybox",
        "resourceRequirements": [
            {"type": "VCPU", "value": "0.25"},
            {"type": "MEMORY", "value": "512"}
        ],
        "command": ["echo", "hello world"],
        "networkConfiguration": {
            "assignPublicIp": "ENABLED"
        },
        "executionRoleArn": "arn:aws:iam::123456789012:role/BatchEcsTaskExecutionRoleTutorial"
    }'
```

The following shows how the output looks when the command runs successfully.

```
{
    "jobDefinitionName": "my-fargate-job-def",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:123456789012:job-definition/my-fargate-job-def:1",
    "revision": 1
}
```

The job definition specifies 0.25 vCPU and 512 MB of memory, which are the minimum resources for a Fargate task. The `assignPublicIp` setting is enabled so the container can pull the busybox image from Docker Hub.

## Submit and monitor a job
<a name="cli-submit-and-monitor-a-job"></a>

Now that you have all the necessary components, you can submit a job to your queue and monitor its progress.

**Submit a job**

The following command submits a job to your queue using the job definition you created.

```
aws batch submit-job \
    --job-name my-hello-world-job \
    --job-queue my-fargate-job-queue \
    --job-definition my-fargate-job-def
```

The following shows how the output looks when the command runs successfully.

```
{
    "jobArn": "arn:aws:batch:us-west-2:123456789012:job/my-hello-world-job",
    "jobName": "my-hello-world-job",
    "jobId": "1509xmpl-4224-4da6-9ba9-1d1acc96431a"
}
```

Make note of the `jobId` returned in the response, as you'll use it to monitor the job's progress.

**Monitor job status**

Use the job ID to check the status of your job. The job will progress through several states: `SUBMITTED`, `PENDING`, `RUNNABLE`, `STARTING`, `RUNNING`, and finally `SUCCEEDED` or `FAILED`.

```
aws batch describe-jobs --jobs 1509xmpl-4224-4da6-9ba9-1d1acc96431a
```

The following shows how the output looks when the command runs successfully.

```
{
    "jobs": [
        {
            "jobArn": "arn:aws:batch:us-west-2:123456789012:job/my-hello-world-job",
            "jobName": "my-hello-world-job",
            "jobId": "1509xmpl-4224-4da6-9ba9-1d1acc96431a",
            "jobQueue": "arn:aws:batch:us-west-2:123456789012:job-queue/my-fargate-job-queue",
            "status": "SUCCEEDED",
            "createdAt": 1705161908000,
            "jobDefinition": "arn:aws:batch:us-west-2:123456789012:job-definition/my-fargate-job-def:1"
        }
    ]
}
```

When the status shows `SUCCEEDED`, your job has completed successfully.
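
Instead of re-running the command by hand, you can poll until the job reaches a terminal state (a sketch; replace the job ID with your own):

```
# Poll the job status every 10 seconds until it succeeds or fails.
JOB_ID=1509xmpl-4224-4da6-9ba9-1d1acc96431a
while true; do
    STATUS=$(aws batch describe-jobs --jobs "$JOB_ID" \
        --query 'jobs[0].status' --output text)
    echo "Current status: $STATUS"
    if [ "$STATUS" = "SUCCEEDED" ] || [ "$STATUS" = "FAILED" ]; then
        break
    fi
    sleep 10
done
```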

## View job output
<a name="cli-view-job-output"></a>

After your job completes, you can view its output in Amazon CloudWatch Logs.

**Get the log stream name**

First, retrieve the log stream name from the job details. Replace the example job ID with your own.

```
aws batch describe-jobs --jobs 1509xmpl-4224-4da6-9ba9-1d1acc96431a \
    --query 'jobs[0].attempts[0].containers[0].logStreamName' \
    --output text
```

```
my-fargate-job-def/default/1509xmpl-4224-4da6-9ba9-1d1acc96431a
```

**View the job logs**

Use the log stream name to retrieve the job's output from CloudWatch Logs.

```
aws logs get-log-events \
    --log-group-name /aws/batch/job \
    --log-stream-name my-fargate-job-def/default/1509xmpl-4224-4da6-9ba9-1d1acc96431a \
    --query 'events[*].message' \
    --output text
```

The output shows "hello world", confirming that your job ran successfully.
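
If you prefer a single step, you can chain the two commands with command substitution (a sketch using the example job ID from this tutorial):

```
# Fetch the log stream name, then pass it directly to get-log-events.
LOG_STREAM=$(aws batch describe-jobs --jobs 1509xmpl-4224-4da6-9ba9-1d1acc96431a \
    --query 'jobs[0].attempts[0].containers[0].logStreamName' --output text)

aws logs get-log-events \
    --log-group-name /aws/batch/job \
    --log-stream-name "$LOG_STREAM" \
    --query 'events[*].message' --output text
```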

## Clean up resources
<a name="cli-clean-up-resources"></a>

To avoid ongoing charges, clean up the resources you created in this tutorial. You must delete resources in the correct order due to dependencies.

**Disable and delete the job queue**

First, disable the job queue, then delete it.

```
aws batch update-job-queue \
    --job-queue my-fargate-job-queue \
    --state DISABLED
```

```
aws batch delete-job-queue \
    --job-queue my-fargate-job-queue
```

**Disable and delete the compute environment**

After the job queue is deleted, disable and delete the compute environment.

```
aws batch update-compute-environment \
    --compute-environment my-fargate-compute-env \
    --state DISABLED
```

```
aws batch delete-compute-environment \
    --compute-environment my-fargate-compute-env
```

**Clean up the IAM role**

Remove the policy attachment and delete the IAM role.

```
aws iam detach-role-policy \
    --role-name BatchEcsTaskExecutionRoleTutorial \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```

```
aws iam delete-role \
    --role-name BatchEcsTaskExecutionRoleTutorial
```

**Remove temporary files**

Delete the trust policy file you created.

```
rm batch-execution-role-trust-policy.json
```

All resources have been successfully cleaned up.

## Going to production
<a name="cli-going-to-production"></a>

This tutorial is designed to help you understand how AWS Batch works with Fargate. For production deployments, consider the following additional requirements:

**Security considerations:**
+ Create dedicated security groups with minimal required access instead of using default security groups
+ Use private subnets with NAT Gateway instead of public IP assignment for containers
+ Store container images in Amazon ECR instead of using public repositories
+ Implement VPC endpoints for AWS service communication to avoid internet traffic

**Architecture considerations:**
+ Deploy across multiple Availability Zones for high availability
+ Implement job retry strategies and dead letter queues for error handling
+ Use multiple job queues with different priorities for workload management
+ Configure auto scaling policies based on queue depth and resource utilization
+ Implement monitoring and alerting for job failures and resource utilization

**Operational considerations:**
+ Set up CloudWatch dashboards and alarms for monitoring
+ Implement proper logging and audit trails
+ Use CloudFormation or the AWS CDK for infrastructure as code
+ Establish backup and disaster recovery procedures

For comprehensive guidance on production-ready architectures, see the [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) and [AWS Security Best Practices](https://aws.amazon.com/architecture/security-identity-compliance/).

## Next steps
<a name="cli-next-steps"></a>

Now that you've completed this tutorial, you can explore more advanced AWS Batch features:
+ [Job queues](job_queues.md) – Learn about job queue scheduling and priority management
+ [Job definitions](job_definitions.md) – Explore advanced job definition configurations including environment variables, volumes, and retry strategies
+ [Compute environments for AWS Batch](compute_environments.md) – Understand different compute environment types and scaling options
+ [Multi-node parallel jobs](multi-node-parallel-jobs.md) – Run jobs that span multiple compute nodes
+ [Array jobs](array_jobs.md) – Submit large numbers of similar jobs efficiently
+ [Best practices for AWS Batch](best-practices.md) – Learn optimization techniques for production workloads

# Getting started with AWS Batch on Amazon EKS
<a name="getting-started-eks"></a>

AWS Batch on Amazon EKS is a managed service for scheduling and scaling batch workloads into existing Amazon EKS clusters. AWS Batch doesn't create, administer, or perform lifecycle operations of your Amazon EKS clusters on your behalf. AWS Batch orchestration scales the nodes that it manages up and down, and runs pods on those nodes.

AWS Batch doesn't modify nodes, auto scaling node groups, or pod lifecycles that aren't associated with AWS Batch compute environments within your Amazon EKS cluster. For AWS Batch to operate effectively, its [service-linked role](using-service-linked-roles.md) needs Kubernetes role-based access control (RBAC) permissions in your existing Amazon EKS cluster. For more information, see [Using RBAC Authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) in the *Kubernetes documentation*.

AWS Batch requires a Kubernetes namespace where it can scope the pods that it runs as AWS Batch jobs. We recommend a dedicated namespace to isolate the AWS Batch pods from your other cluster workloads.

After AWS Batch has been given RBAC access and a namespace has been established, you can associate that Amazon EKS cluster with an AWS Batch compute environment using the [CreateComputeEnvironment](https://docs.aws.amazon.com/batch/latest/APIReference/API_CreateComputeEnvironment.html) API operation. A job queue can be associated with this new Amazon EKS compute environment. AWS Batch jobs are submitted to the job queue based on an Amazon EKS job definition using the [SubmitJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html) API operation. AWS Batch then launches AWS Batch managed nodes and places jobs from the job queue as Kubernetes pods into the Amazon EKS cluster that's associated with the compute environment.

The following sections cover how to get set up for AWS Batch on Amazon EKS.

**Contents**
+ [Overview](#getting-started-eks-context)
+ [Prerequisites](#getting-started-eks-prerequisites)
+ [Step 1: Create your Amazon EKS cluster for AWS Batch](#getting-started-eks-step-0)
+ [Step 2: Prepare your Amazon EKS cluster for AWS Batch](#getting-started-eks-step-1)
+ [Step 3: Create an Amazon EKS compute environment](#getting-started-eks-step-2)
+ [Step 4: Create a job queue and attach the compute environment](#getting-started-eks-step-3)
+ [Step 5: Create a job definition](#getting-started-eks-step-4)
+ [Step 6: Submit a job](#getting-started-eks-step-5)
+ [Step 7: View the Job's output](#getting-started-eks-step-7)
+ [Step 8: (Optional) Submit a job with overrides](#getting-started-eks-step-6)
+ [Step 9: Clean up your tutorial resources](#getting-started-eks-step-8)
+ [Additional resources](#getting-started-eks-additional-resources)

## Overview
<a name="getting-started-eks-context"></a>

This tutorial demonstrates how to set up AWS Batch with Amazon EKS using the AWS CLI, `kubectl`, and `eksctl`. 

**Intended Audience**  
This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.

**Features Used**  
This tutorial shows you how to use the AWS CLI to:  
+ Create and configure an Amazon EKS compute environment
+ Create a job queue
+ Create a job definition
+ Create and submit a job to run
+ Submit a job with overrides

**Time Required**  
It should take about 30–40 minutes to complete this tutorial.

**Regional Restrictions**  
There are no country or regional restrictions associated with using this solution.

**Resource Usage Costs**  
There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/batch/latest/userguide/getting-started-eks.html)

## Prerequisites
<a name="getting-started-eks-prerequisites"></a>

Before starting this tutorial, you must install and configure the following tools and resources that you need to create and manage both AWS Batch and Amazon EKS resources.
+ **AWS CLI** – A command line tool for working with AWS services, including Amazon EKS. This guide requires that you use version 2.8.6 or later or 1.26.0 or later. For more information, see [Installing, updating, and uninstalling the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) in the *AWS Command Line Interface User Guide*. After installing the AWS CLI, we recommend that you also configure it. For more information, see [Quick configuration with `aws configure`](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) in the *AWS Command Line Interface User Guide*.
+ **`kubectl`** – A command line tool for working with Kubernetes clusters. This guide requires that you use version `1.23` or later. For more information, see [Installing or updating `kubectl`](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in the *Amazon EKS User Guide*.
+ **`eksctl`** – A command line tool for working with Amazon EKS clusters that automates many individual tasks. This guide requires that you use version `0.115.0` or later. For more information, see [Installing or updating `eksctl`](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) in the *Amazon EKS User Guide*.
+ **Required IAM permissions** – The IAM security principal that you're using must have permissions to work with Amazon EKS IAM roles and service linked roles, CloudFormation, and a VPC and related resources. For more information, see [Actions, resources, and condition keys for Amazon Elastic Kubernetes Service](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelastickubernetesservice.html) and [Using service-linked roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/using-service-linked-roles.html) in the *IAM User Guide*. You must complete all steps in this guide as the same user.
+ **Permissions** – Users calling the [CreateComputeEnvironment](https://docs.aws.amazon.com/batch/latest/APIReference/API_CreateComputeEnvironment.html) API operation to create a compute environment that uses Amazon EKS resources require permissions to the `eks:DescribeCluster` API operation. 
+ **AWS account number** – You need to know your AWS account ID. Follow the directions in [Viewing your AWS account ID](https://docs.aws.amazon.com/IAM/latest/UserGuide/console-account-id.html).
+ **(Optional) CloudWatch** – To examine the details of [(Optional) Submit a job with overrides](#getting-started-eks-step-6), logging must be configured. For more information, see [Use CloudWatch Logs to monitor AWS Batch on Amazon EKS jobs](batch-eks-cloudwatch-logs.md).
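
If you already have the AWS CLI configured, one way to retrieve your account ID from the command line is with `aws sts get-caller-identity` (a sketch; it prints the 12-digit account ID for the credentials in use):

```
$ aws sts get-caller-identity --query Account --output text
```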

## Step 1: Create your Amazon EKS cluster for AWS Batch
<a name="getting-started-eks-step-0"></a>

**Important**  
To get started as simply and quickly as possible, this tutorial includes steps with default settings. Before creating for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.

Once you have installed the prerequisites, create your cluster using `eksctl`. Creating the cluster can take 10–15 minutes. 

```
$ eksctl create cluster --name my-cluster-name --region region-code
```

In the preceding command:
+ Replace *my-cluster-name* with the name that you want to use for your cluster. 
+ Replace *region-code* with the AWS Region to create the cluster in, for example `us-west-2`.

The cluster name and Region are needed later in this tutorial.
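
To confirm that the cluster was created and to double-check its name before continuing, you can list the clusters in your Region with `eksctl` (a sketch, assuming the same *region-code* as above):

```
$ eksctl get cluster --region region-code
```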

## Step 2: Prepare your Amazon EKS cluster for AWS Batch
<a name="getting-started-eks-step-1"></a>

All steps are required.

1. **Create a dedicated namespace for AWS Batch jobs**

   Use `kubectl` to create a new namespace.

   ```
   $ namespace=my-aws-batch-namespace
   ```

   ```
   $ cat - <<EOF | kubectl create -f -
   {
     "apiVersion": "v1",
     "kind": "Namespace",
     "metadata": {
       "name": "${namespace}",
       "labels": {
         "name": "${namespace}"
       }
     }
   }
   EOF
   ```

   Output:

   ```
   namespace/my-aws-batch-namespace created
   ```

1. **Enable access via role-based access control (RBAC)**

   Use `kubectl` to create a Kubernetes role for the cluster that allows AWS Batch to watch nodes and pods, and to bind the role. You must do this once for each EKS cluster.

   ```
   $ cat - <<EOF | kubectl apply -f -
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: aws-batch-cluster-role
   rules:
     - apiGroups: [""]
       resources: ["namespaces"]
       verbs: ["get"]
     - apiGroups: [""]
       resources: ["nodes"]
       verbs: ["get", "list", "watch"]
     - apiGroups: [""]
       resources: ["pods"]
       verbs: ["get", "list", "watch"]
     - apiGroups: [""]
       resources: ["events"]
       verbs: ["list"]
     - apiGroups: [""]
       resources: ["configmaps"]
       verbs: ["get", "list", "watch"]
     - apiGroups: ["apps"]
       resources: ["daemonsets", "deployments", "statefulsets", "replicasets"]
       verbs: ["get", "list", "watch"]
     - apiGroups: ["rbac.authorization.k8s.io"]
       resources: ["clusterroles", "clusterrolebindings"]
       verbs: ["get", "list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: aws-batch-cluster-role-binding
   subjects:
   - kind: User
     name: aws-batch
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: aws-batch-cluster-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   ```

   Output:

   ```
   clusterrole.rbac.authorization.k8s.io/aws-batch-cluster-role created
   clusterrolebinding.rbac.authorization.k8s.io/aws-batch-cluster-role-binding created
   ```

1. Create a namespace-scoped Kubernetes role that allows AWS Batch to manage the lifecycle of pods, and bind it. You must do this once for each unique namespace.

   ```
   $ namespace=my-aws-batch-namespace
   ```

   ```
   $ cat - <<EOF | kubectl apply -f - --namespace "${namespace}"
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     name: aws-batch-compute-environment-role
     namespace: ${namespace}
   rules:
     - apiGroups: [""]
       resources: ["pods"]
       verbs: ["create", "get", "list", "watch", "delete", "patch"]
     - apiGroups: [""]
       resources: ["serviceaccounts"]
       verbs: ["get", "list"]
     - apiGroups: ["rbac.authorization.k8s.io"]
       resources: ["roles", "rolebindings"]
       verbs: ["get", "list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: aws-batch-compute-environment-role-binding
     namespace: ${namespace}
   subjects:
   - kind: User
     name: aws-batch
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: Role
     name: aws-batch-compute-environment-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   ```

   Output:

   ```
   role.rbac.authorization.k8s.io/aws-batch-compute-environment-role created
   rolebinding.rbac.authorization.k8s.io/aws-batch-compute-environment-role-binding created
   ```

1. Update the Kubernetes `aws-auth` configuration map to map the preceding RBAC permissions to the AWS Batch service-linked role.

   In the following command, replace *<your-account-number>* with your AWS account number.

   ```
   $ eksctl create iamidentitymapping \
       --cluster my-cluster-name \
       --arn "arn:aws:iam::<your-account-number>:role/AWSServiceRoleForBatch" \
       --username aws-batch
   ```

   Output:

   ```
   2022-10-25 20:19:57 [ℹ]  adding identity "arn:aws:iam::<your-account-number>:role/AWSServiceRoleForBatch" to auth ConfigMap
   ```
**Note**  
The path `aws-service-role/batch.amazonaws.com/` has been removed from the ARN of the service-linked role because of an issue with the `aws-auth` configuration map. For more information, see [Roles with paths don't work when the path is included in their ARN in the aws-auth configmap](https://github.com/kubernetes-sigs/aws-iam-authenticator/issues/268).
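
To verify that the mapping was created, you can list the cluster's identity mappings with `eksctl`; the output should include the `aws-batch` username that you mapped above:

```
$ eksctl get iamidentitymapping --cluster my-cluster-name
```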

## Step 3: Create an Amazon EKS compute environment
<a name="getting-started-eks-step-2"></a>

AWS Batch compute environments define compute resource parameters to meet your batch workload needs. In a managed compute environment, AWS Batch helps you to manage the capacity and instance types of the compute resources (Kubernetes nodes) within your Amazon EKS cluster. This is based on the compute resource specification that you define when you create the compute environment. You can use EC2 On-Demand Instances or EC2 Spot Instances.

Now that the **AWSServiceRoleForBatch** service-linked role has access to your Amazon EKS cluster, you can create AWS Batch resources. First, create a compute environment that points to your Amazon EKS cluster.
+ For `subnets`, run `eksctl get cluster my-cluster-name` to get the subnets used by the cluster. 
+ For the `securityGroupIds` parameter, you can use the same security group as the Amazon EKS cluster. The following command retrieves the security group ID for the cluster.

  ```
  $ aws eks describe-cluster \
      --name my-cluster-name \
      --query cluster.resourcesVpcConfig.clusterSecurityGroupId
  ```
+ The `instanceRole` is created when you create the cluster. To find the `instanceRole`, list all entities that use the `AmazonEKSWorkerNodePolicy` policy: 

  ```
  $ aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
  ```

  The name of the role contains the name of the cluster that you created, for example `eksctl-my-cluster-name-nodegroup-example`. 

  To find the `instanceRole` ARN, run the following command:

  ```
  $ aws iam list-instance-profiles-for-role --role-name eksctl-my-cluster-name-nodegroup-example
  ```

  Output:

  ```
  INSTANCEPROFILES        arn:aws:iam::<your-account-number>:instance-profile/eks-04cb2200-94b9-c297-8dbe-87f12example
  ```

  For more information, see [Creating the Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html#create-worker-node-role) and [Enabling IAM principal access to your cluster](https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html) in the *Amazon EKS User Guide*. If you're using pod networking, see [Configuring the Amazon VPC CNI plugin for Kubernetes to use IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html) in the **Amazon EKS User Guide**.
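
If you prefer the AWS CLI over `eksctl` for looking up the cluster's subnets, the same `describe-cluster` call used above for the security group can also return them (a sketch; the output is a JSON list of subnet IDs):

```
$ aws eks describe-cluster \
    --name my-cluster-name \
    --query cluster.resourcesVpcConfig.subnetIds
```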

```
$ cat <<EOF > ./batch-eks-compute-environment.json
{
  "computeEnvironmentName": "My-Eks-CE1",
  "type": "MANAGED",
  "state": "ENABLED",
  "eksConfiguration": {
    "eksClusterArn": "arn:aws:eks:region-code:your-account-number:cluster/my-cluster-name",
    "kubernetesNamespace": "my-aws-batch-namespace"
  },
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 0,
    "maxvCpus": 128,
    "instanceTypes": [
        "m5"
    ],
    "subnets": [
        "<eks-cluster-subnets-with-access-to-internet-for-image-pull>"
    ],
    "securityGroupIds": [
        "<eks-cluster-sg>"
    ],
    "instanceRole": "<eks-instance-profile>"
  }
}
EOF
```

```
$ aws batch create-compute-environment --cli-input-json file://./batch-eks-compute-environment.json
```
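
Compute environment creation is asynchronous. One way to wait for the environment to become usable is to poll its `status` with the AWS CLI until it reaches `VALID` (a sketch; Step 4 explains how to interpret other status values):

```
$ until [ "$(aws batch describe-compute-environments \
      --compute-environments My-Eks-CE1 \
      --query 'computeEnvironments[0].status' --output text)" = "VALID" ]; do
    echo "Waiting for compute environment..."
    sleep 10
  done
```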

**Notes**
+ Maintenance of an Amazon EKS compute environment is a shared responsibility. For more information, see [Shared responsibility of the Kubernetes nodes](eks-ce-shared-responsibility.md).

## Step 4: Create a job queue and attach the compute environment
<a name="getting-started-eks-step-3"></a>

**Important**  
It's important to confirm that the compute environment is healthy before proceeding. The [DescribeComputeEnvironments](https://docs.aws.amazon.com/batch/latest/APIReference/API_DescribeComputeEnvironments.html) API operation can be used to do this.  

```
$ aws batch describe-compute-environments --compute-environments My-Eks-CE1
```
Confirm that the `status` parameter is not `INVALID`. If it is, look at the `statusReason` parameter for the cause. For more information, see [Troubleshooting AWS Batch](troubleshooting.md).

Jobs submitted to this new job queue are run as pods on AWS Batch managed nodes that joined the Amazon EKS cluster that's associated with your compute environment.

```
$ cat <<EOF > ./batch-eks-job-queue.json
 {
    "jobQueueName": "My-Eks-JQ1",
    "priority": 10,
    "computeEnvironmentOrder": [
      {
        "order": 1,
        "computeEnvironment": "My-Eks-CE1"
      }
    ]
  }
EOF
```

```
$ aws batch create-job-queue --cli-input-json file://./batch-eks-job-queue.json
```
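
To confirm that the job queue was created and is ready to accept jobs, you can check that its `state` is `ENABLED` and its `status` is `VALID`:

```
$ aws batch describe-job-queues --job-queues My-Eks-JQ1 \
    --query 'jobQueues[0].[state,status]' --output text
```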

## Step 5: Create a job definition
<a name="getting-started-eks-step-4"></a>

The following job definition instructs the pod to sleep for 60 seconds.

```
$ cat <<EOF > ./batch-eks-job-definition.json
{
  "jobDefinitionName": "MyJobOnEks_Sleep",
  "type": "container",
  "eksProperties": {
    "podProperties": {
      "hostNetwork": true,
      "containers": [
        {
          "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
          "command": [
            "sleep",
            "60"
          ],
          "resources": {
            "limits": {
              "cpu": "1",
              "memory": "1024Mi"
            }
          }
        }
      ],
      "metadata": {
        "labels": {
          "environment": "test"
        }
      }
    }
  }
}
EOF
```

```
$ aws batch register-job-definition --cli-input-json file://./batch-eks-job-definition.json
```

**Notes**
+ There are considerations for the `cpu` and `memory` parameters. For more information, see [Memory and vCPU considerations for AWS Batch on Amazon EKS](memory-cpu-batch-eks.md).

## Step 6: Submit a job
<a name="getting-started-eks-step-5"></a>

Run the following AWS CLI command to submit a new job.

```
$ aws batch submit-job --job-queue My-Eks-JQ1 \
    --job-definition MyJobOnEks_Sleep --job-name My-Eks-Job1
```

To check the status of a job:

```
$ aws batch describe-jobs --jobs <jobId-from-submit-response>
```
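
To poll just the job's status rather than reading the full response, a `--query` filter can be added (a sketch; the status moves through values such as `SUBMITTED`, `RUNNABLE`, `RUNNING`, and `SUCCEEDED`):

```
$ aws batch describe-jobs --jobs <jobId-from-submit-response> \
    --query 'jobs[0].status' --output text
```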

**Notes**
+ For more information about running jobs on Amazon EKS resources, see [Amazon EKS jobs](eks-jobs.md).

## Step 7: View the Job's output
<a name="getting-started-eks-step-7"></a>

To view the job's output, do the following:

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Jobs**. 

1. In the **Job queue** dropdown, choose the job queue that you created for the tutorial.

1. The **Jobs** table lists all of your jobs and their current status. Once the job's **Status** is **Succeeded**, choose the **Name** of the job, *My-Eks-Job1*, to view its details. 

1. In the **Details** pane, the **Started at** and **Stopped at** times should be about one minute apart. 

## Step 8: (Optional) Submit a job with overrides
<a name="getting-started-eks-step-6"></a>

This job overrides the command passed to the container. AWS Batch aggressively cleans up the pods after the jobs complete to reduce the load to Kubernetes. To examine the details of a job, logging must be configured. For more information, see [Use CloudWatch Logs to monitor AWS Batch on Amazon EKS jobs](batch-eks-cloudwatch-logs.md).

```
$ cat <<EOF > ./submit-job-override.json
{
  "jobName": "EksWithOverrides",
  "jobQueue": "My-Eks-JQ1",
  "jobDefinition": "MyJobOnEks_Sleep",
  "eksPropertiesOverride": {
    "podProperties": {
      "containers": [
        {
          "command": [
            "/bin/sh"
          ],
          "args": [
            "-c",
            "echo hello world"
          ]
        }
      ]
    }
  }
}
EOF
```

```
$ aws batch submit-job --cli-input-json file://./submit-job-override.json
```

**Notes**
+ For improved visibility into the details of the operations, enable Amazon EKS control plane logging. For more information, see [Amazon EKS control plane logging](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) in the *Amazon EKS User Guide*.
+ Daemonsets and kubelets overhead affects available vCPU and memory resources, specifically scaling and job placement. For more information, see [Memory and vCPU considerations for AWS Batch on Amazon EKS](memory-cpu-batch-eks.md).

To view the job's output, do the following:

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Jobs**. 

1. In the **Job queue** dropdown, choose the job queue that you created for the tutorial.

1. The **Jobs** table lists all of your jobs and their current status. Once the job's **Status** is **Succeeded**, choose the **Name** of the job to view its details. 

1. In the **Details** pane, choose **Log stream name**. The CloudWatch console opens to the job's log stream, and there should be one event with the **Message** of `hello world` or your custom message.

## Step 9: Clean up your tutorial resources
<a name="getting-started-eks-step-8"></a>

You are charged for the Amazon EC2 instances that AWS Batch launches while the compute environment is enabled. Delete your tutorial resources to stop incurring charges.

To delete the resources you created, do the following:

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Job queues**. 

1. In the **Job queues** table, choose the job queue that you created for the tutorial.

1. Choose **Disable**. Once the job queue **State** is **Disabled**, you can choose **Delete**.

1. Once the job queue is deleted, in the navigation pane, choose **Compute environments**.

1. Choose the compute environment that you created for this tutorial, and then choose **Disable**. It may take 1–2 minutes for the compute environment to finish being disabled.

1. Once the compute environment's **State** is **Disabled**, choose **Delete**. It may take 1–2 minutes for the compute environment to be deleted.
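
If you prefer to clean up from the command line, the same sequence can be sketched with the AWS CLI and `eksctl`. Each delete must wait until the preceding disable (or delete) completes, so you may need to pause between commands:

```
$ aws batch update-job-queue --job-queue My-Eks-JQ1 --state DISABLED
$ aws batch delete-job-queue --job-queue My-Eks-JQ1
$ aws batch update-compute-environment --compute-environment My-Eks-CE1 --state DISABLED
$ aws batch delete-compute-environment --compute-environment My-Eks-CE1
$ eksctl delete cluster --name my-cluster-name --region region-code
```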

## Additional resources
<a name="getting-started-eks-additional-resources"></a>

After you complete the tutorial, you might want to explore the following topics:
+ Learn more about the [Best practices](best-practices.md).
+ Explore the AWS Batch core components. For more information, see [Components of AWS Batch](batch_components.md).
+ Learn more about the different [Compute Environments](compute_environments.md) available in AWS Batch.
+ Learn more about [Job queues](job_queues.md) and their different scheduling options.
+ Learn more about [Job definitions](job_definitions.md) and the different configuration options.
+ Learn more about the different types of [Jobs](jobs.md).

# Getting started with AWS Batch on Amazon EKS Private Clusters
<a name="getting-started-eks-privateclusters"></a>

AWS Batch is a managed service that orchestrates batch workloads in your Amazon Elastic Kubernetes Service (Amazon EKS) clusters. This includes queuing, dependency tracking, managed job retries and priorities, pod management, and node scaling. This feature connects your existing private Amazon EKS cluster with AWS Batch to run your jobs at scale. You can use [`eksctl`](https://eksctl.io/usage/eks-private-cluster/) (a command line interface for Amazon EKS), the AWS Management Console, or the [AWS Command Line Interface](https://aws.amazon.com/cli/) to create a private Amazon EKS cluster with all the other necessary resources. 

[Amazon EKS private only clusters](https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html#private-access) by default have no inbound/outbound internet access, and you can only access the API server from within your VPC or a connected network. Amazon VPC endpoints are used to enable private access to other AWS services. `eksctl` supports creating fully private clusters using a pre-existing Amazon VPC and subnets. `eksctl` also creates Amazon VPC endpoints in the supplied Amazon VPC and modifies route tables for the supplied subnets.

Each subnet should have an explicit route table associated with it because `eksctl` doesn't modify the main route table. Your cluster must pull images from a container registry that's in your Amazon VPC. For example, you can create an Amazon Elastic Container Registry (Amazon ECR) repository in your Amazon VPC and copy container images to it for your nodes to pull from. For more information, see [Copy a container image from one repository to another repository](https://docs.aws.amazon.com/eks/latest/userguide/copy-image-to-repository.html). To get started with Amazon ECR private repositories, see [Amazon ECR private repositories](https://docs.aws.amazon.com/AmazonECR/latest/userguide/Repositories.html).

You can optionally create a [pull through cache rule](https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html) with Amazon ECR. Once a pull through cache rule is created for an external public registry, you can pull an image from that external public registry using your Amazon ECR private registry uniform resource identifier (URI). Then Amazon ECR creates a repository and caches the image. When a cached image is pulled using the Amazon ECR private registry URI, Amazon ECR checks the remote registry to see if there is a new version of the image and updates your private registry up to once every 24 hours.
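
For example, a pull through cache rule for the Amazon ECR Public registry can be created with the AWS CLI (a sketch; the `ecr-public` repository prefix is an assumption, and you can choose any prefix):

```
$ aws ecr create-pull-through-cache-rule \
    --ecr-repository-prefix ecr-public \
    --upstream-registry-url public.ecr.aws
```

Images can then be pulled through your private registry URI, for example `<your-account-number>.dkr.ecr.<region>.amazonaws.com/ecr-public/<repository>:<tag>`.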

**Contents**
+ [Overview](#getting-started-eks-context)
+ [Prerequisites](#getting-started-eks-privateclusters-prerequisites)
+ [Step 1: Create your EKS cluster for AWS Batch](#getting-started-eks-privateclusters-step-0)
+ [Step 2: Prepare your EKS cluster for AWS Batch](#getting-started-eks-privateclusters-step-1)
+ [Step 3: Create an Amazon EKS compute environment](#getting-started-eks-privateclusters-2)
+ [Step 4: Create a job queue and attach the compute environment](#getting-started-eks-privateclusters-step-3)
+ [Step 5: Create an Amazon ECR with pull through cache](#getting-started-eks-privateclusters-step-ecr)
+ [Step 6: Register a job definition](#getting-started-eks-privateclusters-step-4)
+ [Step 7: Submit a job to run](#getting-started-eks-privateclusters-step-5)
+ [Step 8: View the Job's output](#getting-started-eks-privateclusters-step-7)
+ [Step 9: (Optional) Submit a job with overrides](#getting-started-eks-privateclusters-step-6)
+ [Step 10: Clean up your tutorial resources](#getting-started-eks-privateclusters-step-8)
+ [Additional resources](#getting-started-eks-additional-resources)
+ [Troubleshooting](#getting-started-eks-privateclusters-troubleshooting)

## Overview
<a name="getting-started-eks-context"></a>

This tutorial demonstrates how to set up AWS Batch with a private Amazon EKS cluster using AWS CloudShell, `kubectl`, and `eksctl`. 

**Intended Audience**  
This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.

**Features Used**  
This tutorial shows you how to use the AWS CLI to:  
+ Use Amazon Elastic Container Registry (Amazon ECR) to store container images 
+ Create and configure an Amazon EKS compute environment
+ Create a job queue
+ Create a job definition
+ Create and submit a job to run
+ Submit a job with overrides

**Time Required**  
It should take about 40–50 minutes to complete this tutorial.

**Regional Restrictions**  
There are no country or regional restrictions associated with using this solution.

**Resource Usage Costs**  
There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/batch/latest/userguide/getting-started-eks-privateclusters.html)

## Prerequisites
<a name="getting-started-eks-privateclusters-prerequisites"></a>

This tutorial uses AWS CloudShell, a browser-based, pre-authenticated shell that you launch directly from the AWS Management Console. AWS CloudShell allows you to access the cluster after it no longer has public internet access. The AWS CLI, `kubectl`, and `eksctl` may already be installed as part of AWS CloudShell. For more information, see the [*AWS CloudShell User Guide*](https://docs.aws.amazon.com/cloudshell/latest/userguide/welcome.html). An alternative to AWS CloudShell is to connect to your cluster's VPC or a [connected network](https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/introduction.html).

To run `kubectl` commands, you need private access to your Amazon EKS cluster. This means that all traffic to your cluster's API server must come from within your cluster's VPC or a connected network.
+ **AWS CLI** – A command line tool for working with AWS services, including Amazon EKS. This guide requires that you use version 2.8.6 or later or 1.26.0 or later. For more information, see [Installing, updating, and uninstalling the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) in the *AWS Command Line Interface User Guide*. After installing the AWS CLI, we recommend that you also configure it. For more information, see [Quick configuration with `aws configure`](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) in the *AWS Command Line Interface User Guide*.
+ **`kubectl`** – A command line tool for working with Kubernetes clusters. This guide requires that you use version `1.23` or later. For more information, see [Installing or updating `kubectl`](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) in the *Amazon EKS User Guide*.
+ **`eksctl`** – A command line tool for working with Amazon EKS clusters that automates many individual tasks. This guide requires that you use version `0.115.0` or later. For more information, see [Installing or updating `eksctl`](https://eksctl.io/installation/) in the `eksctl` documentation.
+ **Permissions** – Users calling the [CreateComputeEnvironment](https://docs.aws.amazon.com/batch/latest/APIReference/API_CreateComputeEnvironment.html) API operation to create a compute environment that uses Amazon EKS resources require permissions to the `eks:DescribeCluster` and `eks:ListClusters` API operations. You can attach the [AWSBatchFullAccess](batch_managed_policies.md) managed policy to your user account by following the directions in [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*. 
+ **InstanceRole** – You need to create an `InstanceRole` for your Amazon EKS nodes that has the `AmazonEKSWorkerNodePolicy` and `AmazonEC2ContainerRegistryPullOnly` policies. For directions on how to create the `InstanceRole`, see [Creating the Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html#create-worker-node-role). You will need the ARN of the `InstanceRole`.
+ **AWS account ID** – You need to know your AWS account ID. Follow the directions in [Viewing your AWS account ID](https://docs.aws.amazon.com/IAM/latest/UserGuide/console-account-id.html).
+ **(Optional) CloudWatch** – To examine the details of [(Optional) Submit a job with overrides](getting-started-eks.md#getting-started-eks-step-6), logging must be configured. For more information, see [Use CloudWatch Logs to monitor AWS Batch on Amazon EKS jobs](batch-eks-cloudwatch-logs.md).

## Step 1: Create your EKS cluster for AWS Batch
<a name="getting-started-eks-privateclusters-step-0"></a>

**Important**  
To get started as simply and quickly as possible, this tutorial includes steps with default settings. Before creating resources for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.

We recommend that you use `eksctl` and the following config file to create your cluster. To set up your cluster manually, follow the directions in [Deploy private clusters with limited internet access](https://docs.aws.amazon.com/eks/latest/userguide/private-clusters.html) in the *Amazon EKS User Guide*.

1. Open the [AWS CloudShell console](https://console.aws.amazon.com/cloudshell/home) and set the Region to `us-east-1`. For the rest of the tutorial, make sure you are using `us-east-1`.

1. Create a private EKS cluster in the `us-east-1` Region using the sample `eksctl` config file. Save the following YAML file to your AWS CloudShell environment and name it `clusterConfig.yaml`. You can replace *my-test-cluster* with the name that you want to use for your cluster.

   ```
   kind: ClusterConfig 
   apiVersion: eksctl.io/v1alpha5
   metadata:
       name: my-test-cluster 
       region: us-east-1
   availabilityZones: 
       - us-east-1a 
       - us-east-1b 
       - us-east-1c
   managedNodeGroups:
       - name: ng-1
         privateNetworking: true 
   privateCluster: 
       enabled: true 
       skipEndpointCreation: false
   ```

1. Create your resources using the command `eksctl create cluster -f clusterConfig.yaml`. Cluster creation can take 10–15 minutes.

1. After the cluster is created, you have to add your AWS CloudShell IP address to the allow list. To find your AWS CloudShell public IP address, run the following command:

   ```
   curl http://checkip.amazonaws.com
   ```

   After you have the public IP address, create an allow list rule. In the following command, replace *<Public IP>* with your public IP address:

   ```
   aws eks update-cluster-config \
     --name my-test-cluster \
     --region us-east-1 \
     --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=["<Public IP>/32"]
   ```

   Then update your `kubeconfig` file:

   ```
   aws eks update-kubeconfig --name my-test-cluster --region us-east-1
   ```
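   If you script this step, you can build the CIDR value directly from the `checkip` output. The following is a minimal sketch that uses a stand-in IP address so it runs offline; in AWS CloudShell you would populate `PUBLIC_IP` with the `curl` command shown above:

   ```
   # Build the /32 CIDR for the allow-list rule from your public IP.
   # Stand-in value shown; in CloudShell use: PUBLIC_IP=$(curl -s https://checkip.amazonaws.com)
   PUBLIC_IP="203.0.113.7"
   CIDR="${PUBLIC_IP}/32"
   echo "publicAccessCidrs=[\"${CIDR}\"]"
   ```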

1. To verify that you have access to the nodes, run the following command:

   ```
   kubectl get nodes
   ```

   The output is similar to the following:

   ```
   NAME                              STATUS   ROLES    AGE     VERSION
   ip-192-168-107-235.ec2.internal   Ready    <none>   1h   v1.32.3-eks-473151a
   ip-192-168-165-40.ec2.internal    Ready    <none>   1h   v1.32.3-eks-473151a
   ip-192-168-98-54.ec2.internal     Ready    <none>   1h   v1.32.1-eks-5d632ec
   ```
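   When automating the setup, you can gate later steps on every node reporting `Ready`. The following is a sketch that parses `kubectl get nodes` output; sample lines are captured in a variable here so the snippet runs without a cluster:

   ```
   # Fail fast if any node is not Ready. In practice, capture live output with:
   #   NODES=$(kubectl get nodes --no-headers)
   NODES='ip-192-168-107-235.ec2.internal   Ready    <none>   1h   v1.32.3-eks-473151a
   ip-192-168-165-40.ec2.internal    Ready    <none>   1h   v1.32.3-eks-473151a'
   NOT_READY=$(printf '%s\n' "$NODES" | awk '$2 != "Ready"' | wc -l | tr -d ' ')
   if [ "$NOT_READY" = "0" ]; then
     echo "all nodes Ready"
   else
     echo "${NOT_READY} node(s) not Ready" >&2
   fi
   ```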

## Step 2: Prepare your EKS cluster for AWS Batch
<a name="getting-started-eks-privateclusters-step-1"></a>

All steps are required and must be done in AWS CloudShell.

1. 

**Create a dedicated namespace for AWS Batch jobs**

   Use `kubectl` to create a new namespace.

   ```
   $ namespace=my-aws-batch-namespace
   ```

   ```
   $ cat - <<EOF | kubectl create -f -
   {
     "apiVersion": "v1",
     "kind": "Namespace",
     "metadata": {
       "name": "${namespace}",
       "labels": {
         "name": "${namespace}"
       }
     }
   }
   EOF
   ```

   Output:

   ```
   namespace/my-aws-batch-namespace created
   ```

1. 

**Enable access via role-based access control (RBAC)**

   Use `kubectl` to create a Kubernetes role for the cluster to allow AWS Batch to watch nodes and pods, and to bind the role. You must do this once for each Amazon EKS cluster.

   ```
   $ cat - <<EOF | kubectl apply -f -
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRole
   metadata:
     name: aws-batch-cluster-role
   rules:
     - apiGroups: [""]
       resources: ["namespaces"]
       verbs: ["get"]
     - apiGroups: [""]
       resources: ["nodes"]
       verbs: ["get", "list", "watch"]
     - apiGroups: [""]
       resources: ["pods"]
       verbs: ["get", "list", "watch"]
     - apiGroups: [""]
       resources: ["events"]
       verbs: ["list"]
     - apiGroups: [""]
       resources: ["configmaps"]
       verbs: ["get", "list", "watch"]
     - apiGroups: ["apps"]
       resources: ["daemonsets", "deployments", "statefulsets", "replicasets"]
       verbs: ["get", "list", "watch"]
     - apiGroups: ["rbac.authorization.k8s.io"]
       resources: ["clusterroles", "clusterrolebindings"]
       verbs: ["get", "list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: aws-batch-cluster-role-binding
   subjects:
   - kind: User
     name: aws-batch
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: aws-batch-cluster-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   ```

   Output:

   ```
   clusterrole.rbac.authorization.k8s.io/aws-batch-cluster-role created
   clusterrolebinding.rbac.authorization.k8s.io/aws-batch-cluster-role-binding created
   ```

   Create a namespace-scoped Kubernetes role for AWS Batch to manage and lifecycle pods, and bind it. You must do this once for each unique namespace.

   ```
   $ namespace=my-aws-batch-namespace
   ```

   ```
   $ cat - <<EOF | kubectl apply -f - --namespace "${namespace}"
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     name: aws-batch-compute-environment-role
     namespace: ${namespace}
   rules:
     - apiGroups: [""]
       resources: ["pods"]
       verbs: ["create", "get", "list", "watch", "delete", "patch"]
     - apiGroups: [""]
       resources: ["serviceaccounts"]
       verbs: ["get", "list"]
     - apiGroups: ["rbac.authorization.k8s.io"]
       resources: ["roles", "rolebindings"]
       verbs: ["get", "list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: aws-batch-compute-environment-role-binding
     namespace: ${namespace}
   subjects:
   - kind: User
     name: aws-batch
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: Role
     name: aws-batch-compute-environment-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   ```

   Output:

   ```
   role.rbac.authorization.k8s.io/aws-batch-compute-environment-role created
   rolebinding.rbac.authorization.k8s.io/aws-batch-compute-environment-role-binding created
   ```

   Update the Kubernetes `aws-auth` configuration map to map the preceding RBAC permissions to the AWS Batch service-linked role.

   ```
   $ eksctl create iamidentitymapping \
       --cluster my-test-cluster \
       --arn "arn:aws:iam::<your-account-ID>:role/AWSServiceRoleForBatch" \
       --username aws-batch
   ```

   Output:

   ```
   2022-10-25 20:19:57 [ℹ]  adding identity "arn:aws:iam::<your-account-ID>:role/AWSServiceRoleForBatch" to auth ConfigMap
   ```
**Note**  
The path `aws-service-role/batch.amazonaws.com/` has been removed from the ARN of the service-linked role because of an issue with the `aws-auth` configuration map. For more information, see [Roles with paths don't work when the path is included in their ARN in the aws-auth configmap](https://github.com/kubernetes-sigs/aws-iam-authenticator/issues/268).

## Step 3: Create an Amazon EKS compute environment
<a name="getting-started-eks-privateclusters-2"></a>

AWS Batch compute environments define compute resource parameters to meet your batch workload needs. In a managed compute environment, AWS Batch helps you to manage the capacity and instance types of the compute resources (Kubernetes nodes) within your Amazon EKS cluster. This is based on the compute resource specification that you define when you create the compute environment. You can use EC2 On-Demand Instances or EC2 Spot Instances.

Now that the **AWSServiceRoleForBatch** service-linked role has access to your Amazon EKS cluster, you can create AWS Batch resources. First, create a compute environment that points to your Amazon EKS cluster.
+ For the `subnets` parameter, run `eksctl get cluster my-test-cluster` to get the subnets used by the cluster.
+ For the `securityGroupIds` parameter, you can use the same security group as the Amazon EKS cluster. The following command retrieves the security group ID for the cluster.

  ```
  $ aws eks describe-cluster \
      --name my-test-cluster \
      --query cluster.resourcesVpcConfig.clusterSecurityGroupId
  ```
+ Use the ARN of the `instanceRole` you created in the Prerequisites.

```
$ cat <<EOF > ./batch-eks-compute-environment.json
{
  "computeEnvironmentName": "My-Eks-CE1",
  "type": "MANAGED",
  "state": "ENABLED",
  "eksConfiguration": {
    "eksClusterArn": "arn:aws:eks:us-east-1:<your-account-ID>:cluster/my-test-cluster",
    "kubernetesNamespace": "my-aws-batch-namespace"
  },
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 0,
    "maxvCpus": 128,
    "instanceTypes": [
        "m5"
    ],
    "subnets": [
        "<eks-cluster-subnets-with-access-to-the-image-for-image-pull>"
    ],
    "securityGroupIds": [
        "<eks-cluster-sg>"
    ],
    "instanceRole": "<eks-instance-profile>"
  }
}
EOF
```

```
$ aws batch create-compute-environment --cli-input-json file://./batch-eks-compute-environment.json
```
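If you prefer to script the substitutions, the placeholders can be filled with `sed`. The following sketch uses illustrative stand-in IDs and an abridged copy of the template; in practice, run it against `batch-eks-compute-environment.json` with the values you collected above:

```
# Fill the template placeholders with your values (stand-in IDs shown).
SUBNET_ID="subnet-0abcd1234efgh5678"
SG_ID="sg-0abcd1234efgh5678"
INSTANCE_PROFILE="eksInstanceProfile"
# Abridged stand-in for batch-eks-compute-environment.json:
printf '%s' '{"subnets":["<eks-cluster-subnets-with-access-to-the-image-for-image-pull>"],"securityGroupIds":["<eks-cluster-sg>"],"instanceRole":"<eks-instance-profile>"}' > ce-template.json
sed -e "s|<eks-cluster-subnets-with-access-to-the-image-for-image-pull>|${SUBNET_ID}|" \
    -e "s|<eks-cluster-sg>|${SG_ID}|" \
    -e "s|<eks-instance-profile>|${INSTANCE_PROFILE}|" \
    ce-template.json > ce-resolved.json
grep -q "${SG_ID}" ce-resolved.json && echo "placeholders resolved"
```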

**Notes**
+ Maintenance of an Amazon EKS compute environment is a shared responsibility. For more information, see [Security in Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/security.html).

## Step 4: Create a job queue and attach the compute environment
<a name="getting-started-eks-privateclusters-step-3"></a>

**Important**  
It's important to confirm that the compute environment is healthy before proceeding. You can use the [DescribeComputeEnvironments](https://docs.aws.amazon.com/batch/latest/APIReference/API_DescribeComputeEnvironments.html) API operation to do this.

```
$ aws batch describe-compute-environments --compute-environments My-Eks-CE1
```
Confirm that the `status` parameter is not `INVALID`. If it is, look at the `statusReason` parameter for the cause. For more information, see [Troubleshooting AWS Batch](troubleshooting.md).
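In a script, you can turn this check into a gate. The following is a sketch using a stand-in status value so it runs without AWS credentials; in practice, populate `STATUS` with the `--query` expression shown in the comment:

```
# Health gate for the compute environment. In practice, fetch the status with:
#   STATUS=$(aws batch describe-compute-environments \
#     --compute-environments My-Eks-CE1 \
#     --query 'computeEnvironments[0].status' --output text)
STATUS="VALID"   # stand-in value for illustration
case "$STATUS" in
  VALID)   echo "compute environment healthy" ;;
  INVALID) echo "inspect statusReason before continuing" >&2 ;;
  *)       echo "status is ${STATUS}; keep polling" ;;
esac
```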

Jobs submitted to this new job queue are run as pods on AWS Batch managed nodes that joined the Amazon EKS cluster that's associated with your compute environment.

```
$ cat <<EOF > ./batch-eks-job-queue.json
 {
    "jobQueueName": "My-Eks-JQ1",
    "priority": 10,
    "computeEnvironmentOrder": [
      {
        "order": 1,
        "computeEnvironment": "My-Eks-CE1"
      }
    ]
  }
EOF
```

```
$ aws batch create-job-queue --cli-input-json file://./batch-eks-job-queue.json
```

## Step 5: Create an Amazon ECR repository with a pull through cache
<a name="getting-started-eks-privateclusters-step-ecr"></a>

Because the cluster doesn't have public internet access, you have to use Amazon ECR for container images. The following directions create an Amazon ECR private repository with a pull through cache rule to store the image.

1. The following command creates the pull through cache rule. You can replace *my-prefix* with a different prefix.

   ```
   aws ecr create-pull-through-cache-rule \
       --ecr-repository-prefix "my-prefix" \
       --upstream-registry-url "public.ecr.aws" \
       --region us-east-1
   ```

1. Authenticate your Docker client with your private Amazon ECR registry.

   ```
   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-account-ID>.dkr.ecr.us-east-1.amazonaws.com
   ```

   Now you can pull an image.

   ```
   docker pull <your-account-ID>.dkr.ecr.us-east-1.amazonaws.com/my-prefix/amazonlinux/amazonlinux:2
   ```

1. You can verify the repository and image by running the following commands:

   ```
   aws ecr describe-repositories
   ```

   ```
   aws ecr describe-images --repository-name my-prefix/amazonlinux/amazonlinux
   ```

1. The image string to use for pulling the container is in the following format:

   ```
   <your-account-ID>.dkr.ecr.us-east-1.amazonaws.com/my-prefix/amazonlinux/amazonlinux:2
   ```
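   The following is a sketch that assembles that image string from its parts; the account ID is a stand-in, and the Region and prefix match this tutorial:

   ```
   # Compose the private ECR image URI used in the job definition.
   ACCOUNT_ID="111122223333"   # stand-in; use your AWS account ID
   REGION="us-east-1"
   PREFIX="my-prefix"
   IMAGE="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${PREFIX}/amazonlinux/amazonlinux:2"
   echo "${IMAGE}"
   ```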

## Step 6: Register a job definition
<a name="getting-started-eks-privateclusters-step-4"></a>

The following job definition instructs the pod to sleep for 60 seconds.

In the `image` field of the job definition, instead of providing a link to an image in a public Amazon ECR repository, provide the link to the image stored in your private Amazon ECR repository. See the following sample job definition:

```
$ cat <<EOF > ./batch-eks-job-definition.json
{
  "jobDefinitionName": "MyJobOnEks_Sleep",
  "type": "container",
  "eksProperties": {
    "podProperties": {
      "hostNetwork": true,
      "containers": [
        {
          "image": "<your-account-ID>.dkr.ecr.us-east-1.amazonaws.com/my-prefix/amazonlinux/amazonlinux:2",
          "command": [
            "sleep",
            "60"
          ],
          "resources": {
            "limits": {
              "cpu": "1",
              "memory": "1024Mi"
            }
          }
        }
      ],
      "metadata": {
        "labels": {
          "environment": "test"
        }
      }
    }
  }
}
EOF
```

```
$ aws batch register-job-definition --cli-input-json file://./batch-eks-job-definition.json
```



**Notes**
+ There are considerations for the `cpu` and `memory` parameters. For more information, see [Memory and vCPU considerations for AWS Batch on Amazon EKS](memory-cpu-batch-eks.md).

## Step 7: Submit a job to run
<a name="getting-started-eks-privateclusters-step-5"></a>

Run the following AWS CLI command in AWS CloudShell to submit a new job. The command returns a unique job ID.

```
$ aws batch submit-job --job-queue My-Eks-JQ1 \
    --job-definition MyJobOnEks_Sleep --job-name My-Eks-Job1
```

**Notes**
+ For more information about running jobs on Amazon EKS resources, see [Amazon EKS jobs](eks-jobs.md).

## Step 8: View the job output
<a name="getting-started-eks-privateclusters-step-7"></a>

To check the status of a job:

```
$ aws batch describe-jobs --jobs <JobID-from-submit-response>
```

The `startedAt` and `stoppedAt` timestamps should be about one minute apart.
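AWS Batch reports `startedAt` and `stoppedAt` as Unix epoch timestamps in milliseconds, so the runtime check can be scripted. A sketch with sample values:

```
# Compute job runtime in seconds from describe-jobs timestamps (milliseconds).
STARTED_AT=1753718820000   # sample value
STOPPED_AT=1753718880000   # sample value
RUNTIME=$(( (STOPPED_AT - STARTED_AT) / 1000 ))
echo "runtime: ${RUNTIME} seconds"
```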

## Step 9: (Optional) Submit a job with overrides
<a name="getting-started-eks-privateclusters-step-6"></a>

This job overrides the command passed to the container.

```
$ cat <<EOF > ./submit-job-override.json
{
  "jobName": "EksWithOverrides",
  "jobQueue": "My-Eks-JQ1",
  "jobDefinition": "MyJobOnEks_Sleep",
  "eksPropertiesOverride": {
    "podProperties": {
      "containers": [
        {
          "command": [
            "/bin/sh"
          ],
          "args": [
            "-c",
            "echo hello world"
          ]
        }
      ]
    }
  }
}
EOF
```

```
$ aws batch submit-job --cli-input-json file://./submit-job-override.json
```

**Notes**
+ For improved visibility into the details of the operations, enable Amazon EKS control plane logging. For more information, see [Amazon EKS control plane logging](https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html) in the *Amazon EKS User Guide*.
+ DaemonSet and kubelet overhead affects the available vCPU and memory resources, specifically scaling and job placement. For more information, see [Memory and vCPU considerations for AWS Batch on Amazon EKS](memory-cpu-batch-eks.md).

## Step 10: Clean up your tutorial resources
<a name="getting-started-eks-privateclusters-step-8"></a>

You are charged for the Amazon EC2 instances and the Amazon EKS cluster created in this tutorial while they are running. Delete the resources when you finish to stop incurring charges.

To delete the resources you created, do the following:

1. Open the AWS Batch console at [https://console.aws.amazon.com/batch/](https://console.aws.amazon.com/batch/).

1. In the navigation pane, choose **Job queues**.

1. In the **Job queues** table, choose the job queue you created for the tutorial.

1. From **Actions**, choose **Disable**. Once the job queue **State** is **Disabled**, you can choose **Delete**.

1. Once the job queue is deleted, in the navigation pane choose **Compute environments**.

1. Choose the compute environment you created for this tutorial and then choose **Disable** from **Actions**. It may take 1–2 minutes for the compute environment to be disabled.

1. Once the compute environment’s **State** is **Disabled**, choose **Delete**. It may take 1–2 minutes for the compute environment to be deleted.

## Additional resources
<a name="getting-started-eks-additional-resources"></a>

After you complete the tutorial, you might want to explore the following topics:
+ Learn more about the [Best practices](best-practices.md).
+ Explore the AWS Batch core components. For more information, see [Components of AWS Batch](batch_components.md).
+ Learn more about the different [Compute Environments](compute_environments.md) available in AWS Batch.
+ Learn more about [Job queues](job_queues.md) and their different scheduling options.
+ Learn more about [Job definitions](job_definitions.md) and the different configuration options.
+ Learn more about the different types of [Jobs](jobs.md).

## Troubleshooting
<a name="getting-started-eks-privateclusters-troubleshooting"></a>

If nodes launched by AWS Batch don't have access to the Amazon ECR repository (or any other repository) that stores your image, your jobs can remain in the `STARTING` state. This is because the pod can't download the image and run your AWS Batch job. If you select the pod name launched by AWS Batch, you should be able to see the error message and confirm the issue. The error message should look similar to the following:

```
Failed to pull image "public.ecr.aws/amazonlinux/amazonlinux:2": rpc error: code =
Unknown desc = failed to pull and unpack image
"public.ecr.aws/amazonlinux/amazonlinux:2": failed to resolve reference
"public.ecr.aws/amazonlinux/amazonlinux:2": failed to do request: Head
"https://public.ecr.aws/v2/amazonlinux/amazonlinux/manifests/2": dial tcp: i/o timeout
```

For other common troubleshooting scenarios, see [Troubleshooting AWS Batch](https://docs.aws.amazon.com/batch/latest/userguide/batch-eks-troubleshooting.html). For troubleshooting based on pod status, see [https://repost.aws/knowledge-center/eks-pod-status-troubleshooting](https://repost.aws/knowledge-center/eks-pod-status-troubleshooting).

# Getting started with AWS Batch on SageMaker AI
<a name="getting-started-sagemaker"></a>

AWS Batch service jobs enable you to submit SageMaker Training jobs through AWS Batch job queues with scheduling, prioritization, and queuing capabilities. This tutorial demonstrates how to set up and run a simple SageMaker Training job using AWS Batch service jobs.

**Contents**
+ [Overview](#getting-started-sagemaker-context)
+ [Prerequisites](#getting-started-sagemaker-prerequisites)
+ [Step 1: Create a SageMaker AI execution role](#getting-started-sagemaker-step-1)
+ [Step 2: Create your service environment](#getting-started-sagemaker-step-2)
+ [Step 3: Create your SageMaker job queue](#getting-started-sagemaker-step-3)
+ [Step 4: Create and submit a training job](#getting-started-sagemaker-step-4)
+ [Step 5: Monitor job status](#getting-started-sagemaker-step-5)
+ [Step 6: View job output](#getting-started-sagemaker-step-6)
+ [Step 7: Clean up your tutorial resources](#getting-started-sagemaker-step-7)
+ [Additional resources](#getting-started-sagemaker-additional-resources)

## Overview
<a name="getting-started-sagemaker-context"></a>

This tutorial demonstrates how to set up AWS Batch service jobs for SageMaker Training jobs using the AWS CLI.

**Intended Audience**  
This tutorial is designed for data scientists and developers responsible for setting up and running machine learning training jobs at scale.

**Features Used**  
This tutorial shows you how to use the AWS CLI to:  
+ Create a service environment for SageMaker Training jobs
+ Create a SageMaker Training job queue
+ Submit service jobs using the `SubmitServiceJob` API
+ Monitor job status and view outputs
+ Access CloudWatch logs for training jobs

**Time Required**  
It should take about 15 minutes to complete this tutorial.

**Regional Restrictions**  
This tutorial can be completed in any AWS Region where both AWS Batch and SageMaker AI are available.

**Resource Usage Costs**  
There's no charge for creating an AWS account. However, by implementing this solution, you might incur costs for the following resources:      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/batch/latest/userguide/getting-started-sagemaker.html)

## Prerequisites
<a name="getting-started-sagemaker-prerequisites"></a>

Before starting this tutorial, you must install and configure the following tools and resources that you need to create and manage both AWS Batch and SageMaker AI resources.
+ **AWS CLI** – A command line tool for working with AWS services, including AWS Batch and SageMaker AI. This guide requires that you use version 2.8.6 or later. For more information, see [Installing, updating, and uninstalling the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) in the *AWS Command Line Interface User Guide*. After installing the AWS CLI, we recommend that you also configure it. For more information, see [Quick configuration with `aws configure`](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) in the *AWS Command Line Interface User Guide*.

## Step 1: Create a SageMaker AI execution role
<a name="getting-started-sagemaker-step-1"></a>

SageMaker AI uses execution roles to perform operations on your behalf using other AWS services. You must create an execution role and grant SageMaker AI permissions to use the services and resources needed for training jobs. Use the `AmazonSageMakerFullAccess` managed policy as it includes permissions for Amazon S3. 

**Note**  
Use the following directions to create the SageMaker AI execution role for this tutorial.  
Before you create an execution role for your production environment, we recommend that you review [How to use SageMaker AI execution roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) in the *[SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)*.

1. 

**Create the IAM role**

   Create a JSON file named `sagemaker-trust-policy.json` with the following trust policy:

------
#### [ JSON ]


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "sagemaker.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------
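   Before you pass the file to the AWS CLI, you can confirm that it parses as JSON. The following is a minimal sketch using Python's standard library; the policy is written here in compact form and is intended to match the document above (`sagemaker.amazonaws.com` is the standard SageMaker service principal):

   ```
   # Write the trust policy in compact form and confirm it is well-formed JSON.
   TRUST_POLICY='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"sagemaker.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
   printf '%s' "${TRUST_POLICY}" > sagemaker-trust-policy.json
   python3 -m json.tool sagemaker-trust-policy.json > /dev/null && echo "valid JSON"
   ```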

   Create the IAM role using the trust policy:

   ```
   aws iam create-role \
       --role-name SageMakerExecutionRole \
       --assume-role-policy-document file://sagemaker-trust-policy.json \
       --description "Execution role for SageMaker training jobs"
   ```

1. 

**Attach managed policies**

   Attach the required managed policies to the role:

   ```
   aws iam attach-role-policy \
       --role-name SageMakerExecutionRole \
       --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
   ```

   ```
   aws iam attach-role-policy \
       --role-name SageMakerExecutionRole \
       --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
   ```

1. 

**Note the role ARN**

   Get the role ARN, which you'll need in later steps:

   ```
   aws iam get-role --role-name SageMakerExecutionRole --query 'Role.Arn' --output text
   ```

   Save this ARN as you'll use it when creating your training job payload.

## Step 2: Create your service environment
<a name="getting-started-sagemaker-step-2"></a>

A service environment defines the capacity constraints for SageMaker Training jobs. The service environment encapsulates the maximum number of training instances that can run concurrently.

**Important**  
When you create your first service environment for SageMaker Training, AWS Batch automatically creates a service-linked role called `AWSServiceRoleForAWSBatchWithSagemaker` in your account. This role allows AWS Batch to queue and manage SageMaker Training jobs on your behalf. For more information about this service-linked role and its permissions, see [Using roles for AWS Batch with SageMaker AI](using-service-linked-roles-batch-sagemaker.md).

Create a service environment that can handle up to 5 instances:

```
aws batch create-service-environment \
    --service-environment-name TutorialServiceEnvironment \
    --service-environment-type SAGEMAKER_TRAINING \
    --capacity-limits capacityUnit=NUM_INSTANCES,maxCapacity=5
```

Output:

```
{
    "serviceEnvironmentName": "TutorialServiceEnvironment",
    "serviceEnvironmentArn": "arn:aws:batch:your-region:your-account-id:service-environment/TutorialServiceEnvironment"
}
```

Verify that your service environment was created successfully:

```
aws batch describe-service-environments --service-environments TutorialServiceEnvironment
```

Output:

```
{
    "serviceEnvironments": [
        {
            "serviceEnvironmentName": "TutorialServiceEnvironment",
            "serviceEnvironmentArn": "arn:aws:batch:your-region:your-account-id:service-environment/TutorialServiceEnvironment",
            "serviceEnvironmentType": "SAGEMAKER_TRAINING",
            "state": "ENABLED",
            "status": "VALID",
            "capacityLimits": [
                {
                    "maxCapacity": 5,
                    "capacityUnit": "NUM_INSTANCES"
                }
            ],
            "tags": {}
        }
    ]
}
```

For more information about service environments, see [Service environments for AWS Batch](service-environments.md).

## Step 3: Create your SageMaker job queue
<a name="getting-started-sagemaker-step-3"></a>

A SageMaker job queue manages the scheduling and execution of service jobs. Jobs submitted to this queue will be dispatched to your service environment based on available capacity.

Create a SageMaker Training job queue:

```
aws batch create-job-queue \
    --job-queue-name my-sm-training-fifo-jq \
    --job-queue-type SAGEMAKER_TRAINING \
    --priority 1 \
    --service-environment-order order=1,serviceEnvironment=TutorialServiceEnvironment
```

Output:

```
{
    "jobQueueName": "my-sm-training-fifo-jq",
    "jobQueueArn": "arn:aws:batch:your-region:your-account-id:job-queue/my-sm-training-fifo-jq"
}
```

Verify that your job queue was created successfully:

```
aws batch describe-job-queues --job-queues my-sm-training-fifo-jq
```

Output:

```
{
    "jobQueues": [
        {
            "jobQueueName": "my-sm-training-fifo-jq",
            "jobQueueArn": "arn:aws:batch:your-region:your-account-id:job-queue/my-sm-training-fifo-jq",
            "state": "ENABLED",
            "status": "VALID",
            "statusReason": "JobQueue Healthy",
            "priority": 1,
            "computeEnvironmentOrder": [],
            "serviceEnvironmentOrder": [
                {
                    "order": 1,
                    "serviceEnvironment": "arn:aws:batch:your-region:your-account-id:service-environment/TutorialServiceEnvironment"
                }
            ],
            "jobQueueType": "SAGEMAKER_TRAINING",
            "tags": {}
        }
    ]
}
```

For more information about SageMaker job queues, see [Create a SageMaker Training job queue in AWS Batch](create-sagemaker-job-queue.md).

## Step 4: Create and submit a training job
<a name="getting-started-sagemaker-step-4"></a>

Now you'll create a simple training job and submit it to your job queue. This example uses a basic "hello world" training job that demonstrates the service job functionality.

Create a file named `my_training_job.json` with the following content. Replace *your-account-id* with your AWS account ID:

**Note**  
`S3OutputPath` is required for creating the SageMaker Training job, but the results of this tutorial are not stored in the Amazon S3 bucket, so you can use the placeholder path in the following JSON. In your production environment, you need a valid Amazon S3 bucket if you choose to store output.

```
{
    "TrainingJobName": "my-simple-training-job",
    "RoleArn": "arn:aws:iam::your-account-id:role/SageMakerExecutionRole",
    "AlgorithmSpecification": {
        "TrainingInputMode": "File",
        "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
        "ContainerEntrypoint": [
            "echo",
            "hello world"
        ]
    },
    "ResourceConfig": {
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 1
    },
    "OutputDataConfig": {
        "S3OutputPath": "s3://your-s3-bucket/output"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 30
    }
}
```

Submit the training job using the [SubmitServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitServiceJob.html) API:

```
aws batch submit-service-job \
    --job-queue my-sm-training-fifo-jq \
    --job-name my-batch-sm-job \
    --service-job-type SAGEMAKER_TRAINING \
    --retry-strategy attempts=1 \
    --timeout-config attemptDurationSeconds=60 \
    --service-request-payload file://my_training_job.json
```

Output:

```
{
    "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id",
    "jobName": "my-batch-sm-job",
    "jobId": "your-job-id"
}
```
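In scripts, capture the `jobId` from this response for the describe and monitoring calls that follow. The following is a sketch using a stand-in response so it runs offline; in practice, pipe the real `submit-service-job` output:

```
# Extract jobId from the submit-service-job JSON response (stand-in shown).
RESPONSE='{"jobArn":"arn:aws:batch:us-east-1:111122223333:service-job/abcd1234","jobName":"my-batch-sm-job","jobId":"abcd1234"}'
JOB_ID=$(printf '%s' "${RESPONSE}" | python3 -c 'import json,sys; print(json.load(sys.stdin)["jobId"])')
echo "job ID: ${JOB_ID}"
```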

For more information about service job payloads, see [Service job payloads in AWS Batch](service-job-payload.md). For more information about submitting service jobs, see [Submit a service job in AWS Batch](service-job-submit.md).

## Step 5: Monitor job status
<a name="getting-started-sagemaker-step-5"></a>

You can monitor your training jobs using the following AWS Batch APIs: [DescribeServiceJob](https://docs.aws.amazon.com/batch/latest/APIReference/API_DescribeServiceJob.html), [ListServiceJobs](https://docs.aws.amazon.com/batch/latest/APIReference/API_ListServiceJobs.html), and [GetJobQueueSnapshot](https://docs.aws.amazon.com/batch/latest/APIReference/API_GetJobQueueSnapshot.html). This section shows different ways to check job status and queue information.

View running jobs in your queue:

```
aws batch list-service-jobs \
    --job-queue my-sm-training-fifo-jq --job-status RUNNING
```

Output:

```
{
    "jobSummaryList": [
        {
            "latestAttempt": {
                "serviceResourceId": {
                    "name": "TrainingJobArn",
                    "value": "arn:aws:sagemaker:your-region:your-account-id:training-job/AWSBatch<my-simple-training-job><your-attempt-id>"
                }
            },
            "createdAt": 1753718760,
            "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id",
            "jobId": "your-job-id",
            "jobName": "my-batch-sm-job",
            "serviceJobType": "SAGEMAKER_TRAINING",
            "status": "RUNNING",
            "startedAt": 1753718820
        }
    ]
}
```

View jobs that are in the `RUNNABLE` state:

```
aws batch list-service-jobs \
    --job-queue my-sm-training-fifo-jq --job-status RUNNABLE
```

Get a snapshot of upcoming jobs in your queue:

```
aws batch get-job-queue-snapshot --job-queue my-sm-training-fifo-jq
```

Output:

```
{
    "frontOfQueue": {
        "jobs": [
            {
                "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id",
                "earliestTimeAtPosition": 1753718880
            },
            {
                "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id-2",
                "earliestTimeAtPosition": 1753718940
            }
        ],
        "lastUpdatedAt": 1753718970
    }
}
```
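The `earliestTimeAtPosition` and `lastUpdatedAt` values are epoch seconds, so you can estimate how long each job has been waiting at the front of the queue. A small sketch, using the timestamps from the sample output above (the ARNs are shortened placeholders):

```
from datetime import datetime, timezone

# Epoch-second timestamps taken from the sample snapshot output.
# The ARNs are abbreviated placeholders.
snapshot = {
    "frontOfQueue": {
        "jobs": [
            {"jobArn": "arn:aws:batch:placeholder:service-job/job-1", "earliestTimeAtPosition": 1753718880},
            {"jobArn": "arn:aws:batch:placeholder:service-job/job-2", "earliestTimeAtPosition": 1753718940},
        ],
        "lastUpdatedAt": 1753718970,
    }
}

last_updated = snapshot["frontOfQueue"]["lastUpdatedAt"]
for job in snapshot["frontOfQueue"]["jobs"]:
    # Seconds the job has held its position, as of the snapshot time.
    waited = last_updated - job["earliestTimeAtPosition"]
    at = datetime.fromtimestamp(job["earliestTimeAtPosition"], tz=timezone.utc)
    print(f"{job['jobArn']}: at front since {at.isoformat()} ({waited}s as of snapshot)")
```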

Search for jobs by name:

```
aws batch list-service-jobs \
    --job-queue my-sm-training-fifo-jq \
    --filters name=JOB_NAME,values="my-batch-sm-job"
```

Output:

```
{
    "jobSummaryList": [
        {
            "latestAttempt": {
                "serviceResourceId": {
                    "name": "TrainingJobArn",
                    "value": "arn:aws:sagemaker:your-region:your-account-id:training-job/AWSBatch<my-simple-training-job><your-attempt-id>"
                }
            },
            "createdAt": 1753718760,
            "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id",
            "jobId": "your-job-id",
            "jobName": "my-batch-sm-job",
            "serviceJobType": "SAGEMAKER_TRAINING",
            "status": "RUNNING"
        }
    ]
}
```

For more information about job state mapping, see [Mapping AWS Batch service job status to SageMaker AI status](service-job-status.md).

## Step 6: View job output
<a name="getting-started-sagemaker-step-6"></a>

After your job completes, you can view its output and logs through both AWS Batch and SageMaker AI APIs.

Get detailed information about your job from AWS Batch:

```
aws batch describe-service-job \
    --job-id your-job-id
```

Output:

```
{
    "attempts": [
        {
            "serviceResourceId": {
                "name": "TrainingJobArn",
                "value": "arn:aws:sagemaker:your-region:your-account-id:training-job/AWSBatch<my-simple-training-job><your-attempt-id>"
            },
            "startedAt": 1753718820,
            "stoppedAt": 1753718880,
            "statusReason": "Received status from SageMaker: Training job completed"
        }
    ],
    "createdAt": 1753718760,
    "jobArn": "arn:aws:batch:your-region:your-account-id:service-job/your-job-id",
    "jobId": "your-job-id",
    "jobName": "my-batch-sm-job",
    "jobQueue": "arn:aws:batch:your-region:your-account-id:job-queue/my-sm-training-fifo-jq",
    "latestAttempt": {
        "serviceResourceId": {
            "name": "TrainingJobArn",
            "value": "arn:aws:sagemaker:your-region:your-account-id:training-job/AWSBatch<my-simple-training-job><your-attempt-id>"
        }
    },
    "retryStrategy": {
        "attempts": 1,
        "evaluateOnExit": []
    },
    "serviceRequestPayload": "your-training-job-request-json",
    "serviceJobType": "SAGEMAKER_TRAINING",
    "startedAt": 1753718820,
    "status": "SUCCEEDED",
    "statusReason": "Received status from SageMaker: Training job completed",
    "stoppedAt": 1753718880,
    "tags": {},
    "timeoutConfig": {
        "attemptDurationSeconds": 60
    }
}
```

This command returns comprehensive job information, including the SageMaker training job ARN, which you can use to access the job directly through SageMaker AI:

```
aws sagemaker describe-training-job \
    --training-job-name AWSBatch<my-simple-training-job><your-attempt-id>
```
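Note that `describe-training-job` takes the training job *name*, which is the final segment of the `TrainingJobArn` that AWS Batch returns. If you hold only the ARN, a simple split recovers the name. The ARN below is a hypothetical value in the same shape as the sample output, not a real resource:

```
# Hypothetical ARN in the shape SageMaker returns for Batch-managed training jobs.
training_job_arn = (
    "arn:aws:sagemaker:us-east-1:111122223333:"
    "training-job/AWSBatch-my-simple-training-job-attempt-1"
)

# The training job name is everything after the "training-job/" resource prefix.
training_job_name = training_job_arn.split("/", 1)[1]
print(training_job_name)
```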

To view the CloudWatch logs for your training job, first get the log stream name:

```
aws logs describe-log-streams \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name-prefix AWSBatchmy-simple-training-job
```

Output:

```
{
    "logStreams": [
        {
            "logStreamName": "your-log-stream-name",
            "creationTime": 1753718830,
            "firstEventTimestamp": 1753718840,
            "lastEventTimestamp": 1753718850,
            "lastIngestionTime": 1753718860,
            "uploadSequenceToken": upload-sequence-token,
            "arn": "arn:aws:logs:your-region:your-account-id:log-group:/aws/sagemaker/TrainingJobs:log-stream:AWSBatch<my-simple-training-job><your-attempt-id>/algo-1-algo-id",
            "storedBytes": 0
        }
    ]
}
```

Then retrieve the logs using the log stream name from the previous response:

```
aws logs get-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name your-log-stream-name
```

Output:

```
{
    "events": [
        {
            "timestamp": 1753718845,
            "message": "hello world",
            "ingestionTime": 1753718865
        }
    ],
    "nextForwardToken": "next-forward-token",
    "nextBackwardToken": "next-backward-token"
}
```
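`get-log-events` is paginated: each response includes a `nextForwardToken`, and when a call with that token returns the same token unchanged, you have reached the end of the stream. The sketch below shows that loop; `fetch_page` is a stand-in for the real CloudWatch Logs call, not an actual API, so the pagination logic can be shown without credentials:

```
# Stand-in for the CloudWatch Logs call: two pages of events, then the
# token repeats, which signals the end of the stream.
PAGES = {
    None: {"events": [{"message": "hello world"}], "nextForwardToken": "t1"},
    "t1": {"events": [{"message": "job done"}], "nextForwardToken": "t2"},
    "t2": {"events": [], "nextForwardToken": "t2"},
}

def fetch_page(token):
    return PAGES[token]

def read_all_events():
    events, token = [], None
    while True:
        page = fetch_page(token)
        events.extend(page["events"])
        if page["nextForwardToken"] == token:  # unchanged token: no more events
            break
        token = page["nextForwardToken"]
    return events

messages = [e["message"] for e in read_all_events()]
print(messages)
```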

The log output shows the "hello world" message from your training job, confirming that the job executed successfully.

## Step 7: Clean up your tutorial resources
<a name="getting-started-sagemaker-step-7"></a>

When you're done with the tutorial, clean up the resources you created to avoid ongoing charges.

First, disable and delete the job queue:

```
aws batch update-job-queue \
    --job-queue my-sm-training-fifo-jq \
    --state DISABLED
```

Wait for the job queue to be disabled, then delete it:

```
aws batch delete-job-queue \
    --job-queue my-sm-training-fifo-jq
```

Next, disable and delete the service environment:

```
aws batch update-service-environment \
    --service-environment TutorialServiceEnvironment \
    --state DISABLED
```

Wait for the service environment to be disabled, then delete it:

```
aws batch delete-service-environment \
    --service-environment TutorialServiceEnvironment
```
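Both delete operations require the resource to reach the `DISABLED` state first. If you script the cleanup, a small poll-until helper avoids racing the delete. This is a generic sketch: `get_state` is a placeholder for whatever `describe-*` call reports the resource state, not a real API:

```
import time

def wait_for_state(get_state, target, timeout=300, interval=5):
    """Poll get_state() until it returns target, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() == target:
            return
        time.sleep(interval)
    raise TimeoutError(f"resource did not reach {target} in {timeout}s")

# Example with a stubbed state source that becomes DISABLED on the third check.
states = iter(["DISABLING", "DISABLING", "DISABLED"])
wait_for_state(lambda: next(states), "DISABLED", timeout=10, interval=0)
print("disabled")
```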

## Additional resources
<a name="getting-started-sagemaker-additional-resources"></a>

After you complete the tutorial, you might want to explore the following topics:
+ We recommend using the SageMaker Python SDK to create service jobs and submit them to your job queue, because it provides helper classes and utilities. For an example, see [SageMaker AI examples](https://github.com/aws/amazon-sagemaker-examples) on GitHub.
+ Learn more about [Service jobs in AWS Batch](service-jobs.md).
+ Explore [Service job payloads in AWS Batch](service-job-payload.md) for more complex training job configurations.
+ Learn about [Submit a service job in AWS Batch](service-job-submit.md) and the `SubmitServiceJob` API.
+ Review [Mapping AWS Batch service job status to SageMaker AI status](service-job-status.md) to understand job state transitions.
+ Visit the [SageMaker AI Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/) for more feature-rich ways to create and submit SageMaker Training jobs using Python.
+ Explore [SageMaker example notebooks](https://github.com/aws/amazon-sagemaker-examples) for more complex machine learning workflows.