Reserve training plans for your training jobs or HyperPod clusters

Amazon SageMaker training plans is a capability that lets you reserve GPU capacity, and helps you maximize its use, for large-scale AI model training workloads. It provides access to highly sought-after instance types that cover a range of accelerated computing options, including the latest NVIDIA GPU technologies and AWS Trainium chips. With SageMaker training plans, you can secure predictable access to these high-demand, high-performance computational resources within your specified timelines and budgets, without the need to manage underlying infrastructure. This predictability is particularly valuable for organizations that struggle to acquire and schedule these oversubscribed compute instances for their mission-critical AI workloads.

What are SageMaker training plans

SageMaker training plans allow you to create reservations for compute capacity tailored to your specific resource needs, such as SageMaker training jobs or SageMaker HyperPod clusters. The service automatically handles the provisioning of accelerated compute resources, infrastructure setup, workload execution, and recovery from infrastructure failures.

Benefits of SageMaker training plans

SageMaker training plans offer the following benefits:

  • Predictable Access: Reserve GPU capacity for your machine learning workloads within specified time frames.

  • Cost Management: Plan and budget for large-scale training requirements in advance.

  • Automated Resource Management: SageMaker training plans handle the provisioning and management of infrastructure.

  • Flexibility: Create training plans for various resources, including SageMaker training jobs and SageMaker HyperPod clusters.

  • Fault Tolerance: Benefit from automatic recovery from infrastructure failures and workload migration across Availability Zones for SageMaker AI training jobs.

SageMaker training plans user workflow

SageMaker training plans work through the following steps:

Admin steps:

  1. Search and review: Find available plan offerings that match your compute requirements, such as instance type, count, start time, and duration.

  2. Create a plan: Reserve a training plan that meets your needs using the ID of your chosen plan offering.

  3. Payment and scheduling: Upon successful upfront payment, the plan status becomes Scheduled.
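The admin steps above can be sketched in Python. The helper below only assembles the search criteria; the parameter names (InstanceType, InstanceCount, DurationHours, StartTimeAfter, EndTimeBefore, TargetResources) and the "training-job" value reflect our understanding of the SageMaker SearchTrainingPlanOfferings API and should be checked against the current API reference. The helper name, dates, and plan name are hypothetical.

```python
from datetime import datetime, timezone


def build_search_request(instance_type, instance_count, duration_hours,
                         start_after, end_before, target_resource):
    """Assemble the criteria for a plan-offering search (step 1).

    Key names are assumed to match the SearchTrainingPlanOfferings request.
    """
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "DurationHours": duration_hours,
        "StartTimeAfter": start_after,
        "EndTimeBefore": end_before,
        "TargetResources": [target_resource],
    }


# Example criteria: 4 instances for 48 hours, within a two-week window.
request = build_search_request(
    "ml.p5.48xlarge", 4, 48,
    datetime(2030, 1, 6, tzinfo=timezone.utc),
    datetime(2030, 1, 20, tzinfo=timezone.utc),
    "training-job",
)

# With boto3 installed and AWS credentials configured, steps 1 and 2 might
# then look like this (not executed here; creating a plan incurs a charge):
#
#   import boto3
#   sm = boto3.client("sagemaker")
#   offerings = sm.search_training_plan_offerings(**request)["TrainingPlanOfferings"]
#   sm.create_training_plan(
#       TrainingPlanName="my-plan",  # hypothetical name
#       TrainingPlanOfferingId=offerings[0]["TrainingPlanOfferingId"],
#   )
```

Keeping the request as a plain dictionary makes it easy to review the criteria before running a search.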

Steps for plan users / ML engineers:

  1. Resource allocation: Use your plan to queue SageMaker AI training jobs or allocate to a SageMaker HyperPod cluster instance group.

  2. Activation: When the plan start date arrives, it becomes Active. Based on available reserved capacity, SageMaker training plans automatically launch training jobs or provision instance groups.
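As a rough illustration of the resource allocation step for a training job, the sketch below builds a resource configuration that points at a plan. It assumes that the training job ResourceConfig accepts a TrainingPlanArn field alongside the usual instance settings; verify this against the CreateTrainingJob API reference. The function name, the ARN, and the volume size are placeholders.

```python
def training_job_resource_config(plan_arn, instance_type, instance_count,
                                 volume_size_gb=100):
    """Build a ResourceConfig-style dict that targets a training plan.

    Assumption: the TrainingPlanArn key ties the job to the plan's
    reserved capacity.
    """
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": volume_size_gb,
        "TrainingPlanArn": plan_arn,
    }


# Placeholder ARN for illustration only.
config = training_job_resource_config(
    "arn:aws:sagemaker:us-east-1:111122223333:training-plan/my-plan",
    "ml.p5.48xlarge",
    4,
)
```

The instance type and count should stay within what the plan reserved, because the job draws on the plan's reserved capacity.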

The following diagrams provide a comprehensive overview of how SageMaker training plans interact with different target resources, illustrating a plan's lifecycle and its role in resource allocation for both SageMaker training jobs and SageMaker HyperPod clusters.

  • Training plans for SageMaker training jobs: The first diagram illustrates the end-to-end workflow of the interaction between a training plan and a SageMaker training job.

    [Diagram: Billing, capacity reservation with training plans, and SageMaker training job workflow. Shows the training plan lifecycle and the training job states managed by administrators and ML engineers.]

  • Training plans for SageMaker HyperPod clusters: The second diagram illustrates the end-to-end workflow of the interaction between a training plan and a SageMaker HyperPod instance group.

    [Diagram: Billing, capacity reservation with training plans, and instance group management workflow. Shows the training plan lifecycle and the instance group states managed by administrators and ML engineers.]

Supported instance types and AWS Regions

Training plans support reservations for the following specific high-performance instance types, each available in select AWS Regions:

  • ml.p4d.24xlarge

  • ml.p5.48xlarge

  • ml.p5e.48xlarge

  • ml.p5en.48xlarge

  • ml.trn2.48xlarge

Note

The availability of instance types may change over time. For the most up-to-date information on available instance types according to Region, as well as their respective prices, see SageMaker AI Pricing. Scroll down to the Amazon SageMaker HyperPod flexible training plans section under On-Demand Pricing. Select a Region to view the list of available instance types.

Availability across multiple Regions allows you to choose the most suitable location for your workloads, considering factors such as data residency requirements and proximity to other AWS services.

Important

You can use SageMaker training plans to reserve instances with the following reservation duration and instance quantity options.

  • Reservation durations are available in 1-day increments from 1 to 182 days.

  • The reservation instance quantity options are 1, 2, 4, 8, 16, 32 or 64 instances.
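A small pre-check, sketched here under the limits listed above, can catch invalid requests before you search for offerings. The function name and error messages are illustrative and not part of any AWS SDK.

```python
ALLOWED_INSTANCE_COUNTS = (1, 2, 4, 8, 16, 32, 64)
MIN_DAYS, MAX_DAYS = 1, 182


def validate_reservation(duration_days, instance_count):
    """Return a list of problems; an empty list means the request is valid."""
    errors = []
    # Durations come in whole-day increments from 1 to 182 days.
    if not isinstance(duration_days, int) or not MIN_DAYS <= duration_days <= MAX_DAYS:
        errors.append(
            f"duration must be a whole number of days from {MIN_DAYS} to {MAX_DAYS}"
        )
    # Only the listed instance quantities can be reserved.
    if instance_count not in ALLOWED_INSTANCE_COUNTS:
        errors.append(f"instance count must be one of {ALLOWED_INSTANCE_COUNTS}")
    return errors
```

For example, a 7-day reservation of 8 instances passes, while 183 days or 3 instances is rejected.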

Plan composition

A SageMaker training plan can consist of one or more Reserved Capacity blocks, each defined by:

  • Specific instance type

  • Quantity of instances

  • Availability Zone

  • Duration

  • Start and end times

Note

  • Training plans are specific to their target resource (either SageMaker Training Job or SageMaker HyperPod) and cannot be interchanged.

  • Multiple Reserved Capacity blocks in a single training plan may be discontinuous. This means there can be gaps between the reserved time blocks.

  • The status of the training plan transitions from Scheduled to Active when a Reserved Capacity period begins, and then back to Scheduled when waiting for the next Reserved Capacity period to start.

  • Reserved Capacity termination process: You have full access to all reserved instances until 30 minutes before the Reserved Capacity end time. When there are 30 minutes remaining in your Reserved Capacity, SageMaker training plans begins the process of terminating any running instances within that Reserved Capacity.
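The plan composition and the notes above can be modeled as a simple data structure. This is a minimal sketch, not the service's implementation: the class and function names are hypothetical, the Availability Zone name and dates are made up, and the status after the final block ends is left undefined because it is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Instance termination begins this long before a block's end time.
TERMINATION_LEAD = timedelta(minutes=30)


@dataclass(frozen=True)
class ReservedCapacity:
    instance_type: str
    instance_count: int
    availability_zone: str
    start: datetime
    end: datetime

    @property
    def duration(self):
        return self.end - self.start

    @property
    def termination_starts(self):
        """Moment at which running instances start being terminated."""
        return self.end - TERMINATION_LEAD


def plan_status(blocks, now):
    """Return "Active" inside a block, "Scheduled" while a block is still ahead."""
    if any(b.start <= now < b.end for b in blocks):
        return "Active"
    if any(now < b.start for b in blocks):
        return "Scheduled"
    return None  # after the last block; not covered by the notes above


# A plan with two discontinuous 24-hour blocks (hypothetical values).
blocks = [
    ReservedCapacity("ml.p5.48xlarge", 4, "us-east-1a",
                     datetime(2030, 1, 6), datetime(2030, 1, 7)),
    ReservedCapacity("ml.p5.48xlarge", 4, "us-east-1a",
                     datetime(2030, 1, 9), datetime(2030, 1, 10)),
]
```

With these blocks, the plan is Scheduled on January 5, Active at noon on January 6, Scheduled again on January 8, and termination of the first block begins at 23:30 on January 6.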

SageMaker training plans search behavior

When searching for a training plan offering, SageMaker training plans use the following approach to maximize resource availability and flexibility for users, even when demand is high and continuous blocks of time are scarce:

  • Initial continuous search: The system first attempts to find a single, continuous block of Reserved Capacity that matches all of the specified criteria (Target resource, requested instance type, number of instances, duration of the reservation, start and end dates).

  • Two-block search:

    • SageMaker training plans doesn't immediately return a "no capacity" result if a single continuous Reserved Capacity block meeting all criteria is unavailable. Instead, it automatically attempts to fulfill the request using two separate Reserved Capacity blocks.

    • In this scenario, the total duration of the request is split across two non-contiguous time segments. For example, if you request a 48-hour reservation, the system might offer a plan with two 24-hour blocks, potentially on different days or weeks, depending on availability and the requested start and end dates.

    • This two-block approach provides more flexibility in resource allocation, potentially allowing you to secure high-demand instances that would otherwise be unavailable for your full requested duration.

Note

User consideration:

  • When presented with a two-block offering, users should carefully consider if this split allocation meets their workload requirements.

  • This may require adjusting job scheduling or workload distribution to accommodate the non-continuous nature of the reservation.
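The search behavior described above can be illustrated with a toy model. This is only a sketch of the idea, assuming availability is given as a list of free time windows; the function name and the way the duration is split (the whole first window, then the remainder from a later one) are our simplifications, not the service's actual algorithm.

```python
from datetime import datetime, timedelta


def search_offering(free_windows, duration):
    """Return one or two (start, end) blocks covering `duration`, or [].

    free_windows: non-overlapping (start, end) tuples of available capacity.
    """
    windows = sorted(free_windows)

    # Step 1: look for a single continuous block.
    for start, end in windows:
        if end - start >= duration:
            return [(start, start + duration)]

    # Step 2: fall back to two separate blocks. Every window is shorter than
    # `duration` at this point, so the remainder is always positive.
    for i, (s1, e1) in enumerate(windows):
        remaining = duration - (e1 - s1)
        for s2, e2 in windows[i + 1:]:
            if e2 - s2 >= remaining:
                return [(s1, e1), (s2, s2 + remaining)]

    return []  # no capacity, even when split in two


# Hypothetical availability: a 24-hour window and a later 36-hour window.
windows = [
    (datetime(2030, 1, 6), datetime(2030, 1, 7)),
    (datetime(2030, 1, 9), datetime(2030, 1, 10, 12)),
]
offer = search_offering(windows, timedelta(hours=48))
```

Here no single window holds 48 hours, so the sketch returns two 24-hour blocks: January 6 to 7 and January 9 to 10. A 30-hour request would instead fit in the second window as one continuous block.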

When searching for training plan offerings, SageMaker training plans adapts its search strategy based on the target resource:

  • For SageMaker HyperPod clusters:

    • Offerings are limited to a single Availability Zone (AZ).

    • This ensures consistent network performance and data locality within the cluster.

  • For SageMaker training jobs:

    • Offerings can span multiple Availability Zones.

    • This is particularly relevant when the plan offering contains multiple discontinuous reserved capacities.

    • For example, a plan might include capacity in AZ-A for one Reserved Capacity block and AZ-B for another. SageMaker training plans can automatically move workloads across Availability Zones (AZs) based on resource availability.

    This multi-AZ approach for training jobs provides greater flexibility in resource allocation, increasing the chances of finding suitable capacity for your workload. However, users should be aware that their jobs may run in different AZs during different parts of their reservation period.
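The Availability Zone rule for each target resource can be expressed as a short check. This is an illustrative sketch: the function name is hypothetical, and the target identifiers "hyperpod-cluster" and "training-job" are assumed to match the values the API uses.

```python
def offering_matches_target(target_resource, block_azs):
    """Check the AZ layout of an offering's Reserved Capacity blocks.

    block_azs: one Availability Zone name per Reserved Capacity block.
    """
    distinct_azs = set(block_azs)
    if target_resource == "hyperpod-cluster":
        # HyperPod offerings stay within a single Availability Zone.
        return len(distinct_azs) == 1
    if target_resource == "training-job":
        # Training job offerings may span several Availability Zones.
        return len(distinct_azs) >= 1
    raise ValueError(f"unknown target resource: {target_resource}")
```

A plan with blocks in AZ-A and AZ-B is therefore a valid shape for a training job offering but not for a HyperPod cluster offering.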