Reserve training plans for your training jobs or HyperPod clusters
Amazon SageMaker training plans is a capability that allows you to reserve and help maximize the use of accelerated compute capacity for large-scale AI model training workloads. This feature provides access to highly sought-after instance types that cover a range of accelerated computing options, including the latest NVIDIA GPU technologies and AWS Trainium chips. With SageMaker training plans, you can secure predictable access to these high-demand, high-performance computational resources within your specified timelines and budgets, without the need to manage underlying infrastructure. This is particularly valuable for organizations that struggle to acquire and schedule these oversubscribed compute instances for their mission-critical AI workloads.
What is SageMaker training plans
SageMaker training plans allow you to reserve compute capacity tailored to your target resource needs, such as SageMaker training jobs or SageMaker HyperPod clusters. The service automatically handles the reservation, provisioning of accelerated compute resources, infrastructure setup, workload execution, and recovery from infrastructure failures.
SageMaker training plans consist of one or more Reserved Capacity blocks, each defined by the following parameters:
-
Specific instance type
-
Quantity of instances
-
Availability Zone
-
Duration
-
Start and end times
Note
-
Training plans are specific to their target resource (either SageMaker Training Job or SageMaker HyperPod) and cannot be interchanged.
-
Multiple Reserved Capacity blocks in a single training plan may be discontinuous. This means there can be gaps between the Reserved Capacity blocks.
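As a rough illustration of the parameters listed above, the following Python sketch models a Reserved Capacity block and finds the gaps between blocks in a plan. The class name, field names, and the Availability Zone value are hypothetical and chosen for this example only; they are not the service's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ReservedCapacityBlock:
    """Illustrative model of one Reserved Capacity block (hypothetical names)."""
    instance_type: str
    instance_count: int
    availability_zone: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def find_gaps(blocks):
    """Return (gap_start, gap_end) pairs between consecutive blocks."""
    ordered = sorted(blocks, key=lambda b: b.start)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start > earlier.end:
            gaps.append((earlier.end, later.start))
    return gaps


utc = timezone.utc
# A plan with two discontinuous blocks (example values only).
plan = [
    ReservedCapacityBlock("ml.p5.48xlarge", 4, "us-west-2a",
                          datetime(2025, 3, 1, 12, 0, tzinfo=utc),
                          datetime(2025, 3, 2, 11, 30, tzinfo=utc)),
    ReservedCapacityBlock("ml.p5.48xlarge", 4, "us-west-2a",
                          datetime(2025, 3, 4, 12, 0, tzinfo=utc),
                          datetime(2025, 3, 5, 11, 30, tzinfo=utc)),
]

for gap_start, gap_end in find_gaps(plan):
    print(f"No reserved capacity from {gap_start} to {gap_end}")
```

Checking for gaps up front can help you decide whether a workload needs checkpointing so that it can resume in a later block.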
Benefits of SageMaker training plans
SageMaker training plans offer the following benefits:
-
Predictable Access: Reserve GPU capacity for your machine learning workloads within specified time frames.
-
Cost Management: Plan and budget for large-scale training requirements in advance.
-
Automated Resource Management: SageMaker training plans handle the provisioning and management of infrastructure.
-
Flexibility: Create training plans for various resources, including SageMaker training jobs and SageMaker HyperPod clusters.
-
Fault Tolerance: Benefit from automatic recovery from infrastructure failures and workload migration across Availability Zones for SageMaker AI training jobs.
SageMaker training plans advance reservation and flexible start times
SageMaker training plans allow you to reserve compute capacity in advance, with flexible start times and durations.
-
Advance reservation: You can reserve a training plan up to 8 weeks (56 days) in advance of the start date.
-
Minimum lead time: SageMaker training plans offerings may be available to start within 30 minutes of reservation, subject to availability.
Note
You can search for and purchase a plan that will be accessible within 30 minutes. To ensure timely activation, the payment transaction must successfully complete at least 5 minutes before the desired start time. For example, if you want a plan to start at 2:00 PM, you can make a last-minute search as late as 1:30 PM and complete your purchase by 1:55 PM to guarantee the plan is ready by 2:00 PM.
-
Reservation duration and instance quantity: SageMaker training plans allow you to reserve instances with specific duration and quantity options. For available instance types in a given AWS Region, duration, and quantity options, see Supported instance types, AWS Regions, and pricing.
-
End time: Training plans always end at 11:30 AM UTC on the final day of the reservation.
-
Training plan termination: When 30 minutes remain in a Reserved Capacity block, SageMaker training plans begins terminating any running instances within that block. Capacity becomes available again when the next Reserved Capacity block becomes active. You retain full access to your training plan until 30 minutes before the final Reserved Capacity block's end time.
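The timing rules above can be expressed as simple date arithmetic. The following is a minimal sketch that encodes the 8-week advance window, the 5-minute payment cutoff, and the 30-minute wind-down as constants. The function and constant names are hypothetical; the service enforces these rules itself, so this is only useful for planning on the client side.

```python
from datetime import datetime, timedelta, timezone

MAX_ADVANCE = timedelta(days=56)       # reserve up to 8 weeks before the start date
PAYMENT_CUTOFF = timedelta(minutes=5)  # payment must complete this long before start
WIND_DOWN = timedelta(minutes=30)      # termination begins this long before a block ends


def latest_payment_time(plan_start):
    """Latest moment the payment can complete for the plan to start on time."""
    return plan_start - PAYMENT_CUTOFF


def can_purchase(now, plan_start):
    """True if a purchase completing at `now` satisfies both timing rules."""
    within_advance_window = plan_start - now <= MAX_ADVANCE
    before_cutoff = now <= latest_payment_time(plan_start)
    return within_advance_window and before_cutoff


def usable_until(block_end):
    """Time at which instances in a block start being terminated."""
    return block_end - WIND_DOWN


utc = timezone.utc
start = datetime(2025, 3, 1, 14, 0, tzinfo=utc)  # a plan starting at 2:00 PM
print(latest_payment_time(start))                # 1:55 PM on the same day
print(usable_until(datetime(2025, 3, 3, 11, 30, tzinfo=utc)))
```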
SageMaker training plans user workflow
SageMaker training plans work through the following steps:
Admin steps:
-
Search and review: Find available plan offerings that match your compute requirements, such as instance type, count, start time, and duration.
-
Create a plan: Reserve a training plan that meets your needs using the ID of your chosen plan offering.
-
Payment and scheduling: Upon successful upfront payment, the plan status becomes Scheduled.
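The admin steps above could be scripted roughly as follows. This sketch assumes a client object that behaves like the boto3 SageMaker client and exposes `search_training_plan_offerings` and `create_training_plan` with the parameter names shown; check those names against the current API reference before relying on them. The helper name `reserve_first_offering` is hypothetical. Creating a plan triggers the upfront payment, so a real script should review the price and the Reserved Capacity blocks of an offering before purchasing it.

```python
def reserve_first_offering(client, plan_name, instance_type, instance_count,
                           duration_hours, target_resource, start_after=None):
    """Search for plan offerings and reserve the first one returned.

    `client` is assumed to behave like boto3.client("sagemaker").
    Returns the create-plan response, or None if the search found nothing.
    """
    request = {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "DurationHours": duration_hours,
        # Assumed values: "training-job" or "hyperpod-cluster".
        "TargetResources": [target_resource],
    }
    if start_after is not None:
        request["StartTimeAfter"] = start_after

    # Step 1: search and review.
    search = client.search_training_plan_offerings(**request)
    offerings = search.get("TrainingPlanOfferings", [])
    if not offerings:
        return None

    # Step 2: create a plan from the chosen offering ID.
    # Simplification: this sketch picks the first offering without review.
    offering_id = offerings[0]["TrainingPlanOfferingId"]
    return client.create_training_plan(
        TrainingPlanName=plan_name,
        TrainingPlanOfferingId=offering_id,
    )


# Example usage (requires AWS credentials and incurs charges):
# import boto3
# client = boto3.client("sagemaker")
# reserve_first_offering(client, "my-plan", "ml.p5.48xlarge", 4, 48, "training-job")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real account.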
Steps for plan users / ML engineers:
-
Resource allocation: Use your plan to queue SageMaker AI training jobs or allocate to a SageMaker HyperPod cluster instance group.
-
Activation: When the plan start date arrives, it becomes Active. Based on available reserved capacity, SageMaker training plans automatically launch training jobs or provision instance groups.
Note
The status of the training plan transitions from Scheduled to Active when a Reserved Capacity period begins, and then back to Scheduled when waiting for the next Reserved Capacity period to start.
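The status transitions described in this note can be sketched as a function of the current time and the plan's Reserved Capacity periods. The function name is hypothetical, and the "Expired" label for a plan whose last period has ended is an assumption of this example; the text above only describes Scheduled and Active.

```python
from datetime import datetime, timezone


def plan_status(now, periods):
    """Derive a plan status from its Reserved Capacity periods.

    `periods` is a list of (start, end) datetime pairs.
    """
    if any(start <= now < end for start, end in periods):
        return "Active"
    if any(now < start for start, _ in periods):
        return "Scheduled"
    return "Expired"  # assumption: label used once the final period has ended


utc = timezone.utc
periods = [
    (datetime(2025, 3, 1, 12, 0, tzinfo=utc), datetime(2025, 3, 2, 11, 30, tzinfo=utc)),
    (datetime(2025, 3, 4, 12, 0, tzinfo=utc), datetime(2025, 3, 5, 11, 30, tzinfo=utc)),
]

# Between the two periods, the plan goes back to Scheduled.
print(plan_status(datetime(2025, 3, 3, 0, 0, tzinfo=utc), periods))
```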
The following diagrams provide a comprehensive overview of how SageMaker training plans interact with different target resources, illustrating a plan's lifecycle and its role in resource allocation for both SageMaker training jobs and SageMaker HyperPod clusters.
-
Training plans for SageMaker Training Job: The first diagram illustrates the end-to-end workflow of the interaction between a training plan and SageMaker Training Job.
-
Training plans for SageMaker HyperPod clusters: The second diagram illustrates the end-to-end workflow of the interaction between a training plan and a SageMaker HyperPod instance group.
Supported instance types, AWS Regions, and pricing
Training plans support reservations for the following specific high-performance instance types, each available in select AWS Regions:
-
ml.p4d.24xlarge
-
ml.p5.48xlarge
-
ml.p5e.48xlarge
-
ml.p5en.48xlarge
-
ml.trn1.32xlarge
-
ml.trn2.48xlarge
Note
The availability of instance types may change over time. For the most up-to-date information on available instance types by Region, as well as their respective prices, see SageMaker Pricing.
Availability across multiple Regions lets you choose the most suitable location for your workloads, considering factors such as data residency requirements and proximity to other AWS services.
Important
-
You can use SageMaker training plans to reserve instances with the following reservation duration and instance quantity options.
-
Reservation durations are available in 1-day increments from 1 to 182 days.
-
The reservation instance quantity options are 1, 2, 4, 8, 16, 32 or 64 instances.
-
Make sure that your Training Jobs or HyperPod service quotas allow a maximum number of instances per instance type that exceeds the number of instances specified in your plan. To view your current quotas or request a quota increase, see View SageMaker training plans quotas using the AWS Management Console.
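The constraints in this section lend themselves to a quick client-side check before you search for offerings. The following sketch validates a request against the instance types, durations, and quantities listed above. The function name and the `quota` argument are hypothetical, and the strict quota comparison follows the wording above. Because instance type availability varies by Region and over time, treat the hard-coded set as a snapshot.

```python
SUPPORTED_INSTANCE_TYPES = {
    "ml.p4d.24xlarge", "ml.p5.48xlarge", "ml.p5e.48xlarge",
    "ml.p5en.48xlarge", "ml.trn1.32xlarge", "ml.trn2.48xlarge",
}
ALLOWED_QUANTITIES = (1, 2, 4, 8, 16, 32, 64)
MIN_DAYS, MAX_DAYS = 1, 182


def validate_reservation(instance_type, quantity, duration_days, quota):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        problems.append(f"unsupported instance type: {instance_type}")
    if quantity not in ALLOWED_QUANTITIES:
        problems.append(f"quantity must be one of {ALLOWED_QUANTITIES}")
    if not (isinstance(duration_days, int) and MIN_DAYS <= duration_days <= MAX_DAYS):
        problems.append(f"duration must be a whole number of days from {MIN_DAYS} to {MAX_DAYS}")
    # Following the wording above, the quota must exceed the plan's instance count.
    if quota <= quantity:
        problems.append("service quota must exceed the plan's instance count")
    return problems


print(validate_reservation("ml.p5.48xlarge", 8, 30, quota=16))
print(validate_reservation("ml.p5.48xlarge", 3, 30, quota=16))
```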
SageMaker training plans search behavior
When searching for a training plan offering, SageMaker training plans use the following approach to maximize resource availability and flexibility for users, even when demand is high and Reserved Capacity blocks are scarce:
-
Initial continuous search: SageMaker training plans first attempt to find a single, continuous block of Reserved Capacity that matches the specified duration within the start and end dates, while meeting all other specified criteria, including target resource, requested instance type, and number of instances.
-
Two-block search: If no single continuous Reserved Capacity block meets all criteria, SageMaker training plans don't return a "no capacity" result. Instead, they automatically attempt to fulfill the request using two separate Reserved Capacity blocks, splitting the total duration across two time segments.
This two-block approach provides more flexibility in resource allocation, potentially securing high-demand instances that would otherwise be unavailable.
Note
SageMaker training plans return up to three offerings, each consisting of one or two segments. For example, for a 48-hour plan, SageMaker training plans might return three offerings: one continuous 48-hour block, two 24-hour blocks, and two blocks of uneven duration.
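To make the continuous-then-two-block idea concrete, here is a toy model of that search order. It is not the service's algorithm: it assumes capacity is given as non-overlapping free windows measured in hours, returns only one result rather than up to three offerings, and uses hypothetical names throughout.

```python
def search_offering(windows, duration, earliest, latest):
    """Toy model of the continuous-then-two-block search.

    `windows` is a list of non-overlapping (start, end) free-capacity
    intervals in hours. Returns a list of (start, end) segments: one segment
    if a continuous block fits, two if the duration can be split across two
    windows, or None if neither works.
    """
    # Clip each window to the requested search range and drop empty ones.
    clipped = sorted(
        (max(s, earliest), min(e, latest))
        for s, e in windows
        if min(e, latest) > max(s, earliest)
    )

    # Step 1: look for a single continuous block.
    for s, e in clipped:
        if e - s >= duration:
            return [(s, s + duration)]

    # Step 2: split the duration across two separate windows.
    for i, (s1, e1) in enumerate(clipped):
        for s2, e2 in clipped[i + 1:]:
            remaining = duration - (e1 - s1)
            if remaining <= e2 - s2:
                return [(s1, e1), (s2, s2 + remaining)]
    return None


# A 48-hour request where no single window is long enough.
print(search_offering([(0, 30), (40, 60)], duration=48, earliest=0, latest=100))
```

In this example the result has two segments of uneven length, 30 hours and 18 hours, which mirrors the uneven-duration case mentioned in the note.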
Considerations
Important
-
Training plans cannot be modified once purchased.
-
Training plans cannot be shared across AWS accounts or within your AWS Organization.
-
When searching for training plan offerings, SageMaker training plans adapts its search strategy based on the target resources:
For SageMaker HyperPod clusters:
-
Offerings are limited to a single Availability Zone (AZ).
-
This ensures consistent network performance and data locality within the cluster.
For SageMaker training jobs:
-
Offerings can span multiple Availability Zones.
-
This is particularly relevant when the plan offering contains multiple discontinuous reserved capacities.
-
For example, a plan might include capacity in AZ-A for one Reserved Capacity block and AZ-B for another. SageMaker training plans can automatically move workloads across Availability Zones (AZs) based on resource availability.
This multi-AZ approach for training jobs provides greater flexibility in resource allocation, increasing the chances of finding suitable capacity for your workload. However, you should be aware that your jobs may run in different AZs during different parts of your reservation period.
-
When presented with a two-block offering, users should carefully consider if this split allocation meets their workload requirements. This may require adjusting job scheduling or workload distribution to accommodate the non-continuous nature of the reservation.
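When reviewing an offering, the considerations above can be turned into a short checklist. The following sketch assumes each Reserved Capacity block is represented as a dictionary with an "AvailabilityZone" key and that target resources are identified as "training-job" or "hyperpod-cluster"; the function name, the data shape, and the Availability Zone values are illustrative only.

```python
def offering_notes(target_resource, blocks):
    """Return human-readable notes to review before purchasing an offering."""
    notes = []
    zones = {block["AvailabilityZone"] for block in blocks}

    if target_resource == "hyperpod-cluster" and len(zones) > 1:
        # HyperPod offerings are described above as limited to a single AZ.
        notes.append("unexpected: HyperPod offering spans multiple Availability Zones")
    if target_resource == "training-job" and len(zones) > 1:
        notes.append("jobs may run in different Availability Zones during the reservation")
    if len(blocks) > 1:
        notes.append("split reservation: check that your workload tolerates non-continuous capacity")
    return notes


# A two-block training-job offering that spans two Availability Zones.
blocks = [
    {"AvailabilityZone": "us-west-2a"},
    {"AvailabilityZone": "us-west-2b"},
]
for note in offering_notes("training-job", blocks):
    print(note)
```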