

# Using Amazon EC2 Capacity Blocks for ML with AWS PCS
<a name="capacity-blocks"></a>

Amazon EC2 Capacity Blocks for ML is an Amazon EC2 purchasing option that enables you to pay in advance to reserve GPU-based accelerated computing instances within a specific date and time range to support short-duration workloads. Instances that run inside a Capacity Block are automatically placed close together inside Amazon EC2 UltraClusters, for low-latency, petabit-scale, non-blocking networking. For more information, see [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) in the *Amazon Elastic Compute Cloud User Guide*.

You can use a launch template to have AWS PCS use a Capacity Block when it launches instances for a compute node group.

**Note**  
AWS PCS supports Capacity Blocks with Slurm version 24.05 and later.

## Limitations
<a name="capacity-blocks-limitations"></a>
+ AWS PCS only supports Capacity Blocks with P6-B300, P6-B200, P5en, P5e, P5, and P4d instance families.
+ You can only associate a compute node group with 1 Capacity Block at a time.
+ You can't associate a compute node group with a capacity reservation group that combines multiple Capacity Blocks.
+ Capacity Blocks must be in a `scheduled` or `active` state to use with AWS PCS. You can't use Capacity Blocks in other states, such as `payment-failed`. For more information, see [View Capacity Blocks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-view.html) in the *Amazon Elastic Compute Cloud User Guide*.
+ For P6 and P5 instance types, see [Software Requirements for P6 Instances](https://docs.aws.amazon.com/dlami/latest/devguide/p6-support-dlami.html#dlami-support-p6) and [Maximize network bandwidth on Amazon EC2 instances with multiple network cards](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html).

## Capacity Block expiration
<a name="capacity-blocks-expiration"></a>

Capacity Blocks are limited to a specific date and time range. When a Capacity Block expires:
+ The compute node group associated with that Capacity Block continues to exist and remains associated with the same queues.
+ All instances in the compute node group are terminated and active jobs might fail, based on your Slurm settings.
+ AWS PCS can't launch new instances in the compute node group.
+ All queued or newly submitted jobs remain in a pending state until you attach another compute node group to the queue or you update the compute node group to use a new launch template that specifies a new Capacity Block.

# Configure an AWS PCS compute node group to use a Capacity Block
<a name="capacity-blocks-configure-cng"></a>

**To associate a Capacity Block with a compute node group**

1. Create an Amazon EC2 launch template for AWS PCS that specifies your Capacity Block. For more information about creating a launch template for AWS PCS, see [Using Amazon EC2 launch templates with AWS PCS](working-with_launch-templates.md).

   Your launch template must include:
   + An `InstanceMarketOptions` with `MarketType` set to `capacity-block`.
   + A `CapacityReservationSpecification` with a valid `CapacityReservationId`.
   + An `InstanceType` that matches the instance type of the Capacity Block you purchased.
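
   The requirements above can be sketched as an AWS CLI call. This is a minimal example, assuming a `p5.48xlarge` Capacity Block; the template name and reservation ID (`cr-0123456789abcdef0`) are placeholders that you must replace with your own values.

   ```
   # Sketch: a launch template that targets a Capacity Block.
   # The template name and the reservation ID are placeholders.
   aws ec2 create-launch-template \
     --launch-template-name pcs-capacity-block-template \
     --launch-template-data '{
       "InstanceMarketOptions": { "MarketType": "capacity-block" },
       "CapacityReservationSpecification": {
         "CapacityReservationTarget": { "CapacityReservationId": "cr-0123456789abcdef0" }
       },
       "InstanceType": "p5.48xlarge"
     }'
   ```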

1. Create a compute node group that uses the launch template. For more information, see [Creating a compute node group in AWS PCS](working-with_cng_create.md). You can also update an existing compute node group to use the launch template. For more information, see [Updating an AWS PCS compute node group](working-with_cng_update.md).

   When you create or update the compute node group:
   + The IAM identity you use to create or update the compute node group must have the following permission:

     ```
     ec2:DescribeCapacityReservations
     ```

     For more information, see [Minimum permissions for AWS PCS](security-min-permissions.md).
   + The Capacity Block must be in a `scheduled` or `active` state.
   + Set the `purchaseOption` of the compute node group to `CAPACITY_BLOCK`.
   + The `maxInstanceCount` of the compute node group must not exceed the size of the Capacity Block.
   + The Availability Zone of the Capacity Block must match the Availability Zone of 1 of the compute node group's subnets.
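
   The settings above can be sketched with the AWS CLI for PCS. This is a hypothetical example, not a complete recipe: the cluster identifier, launch template ID, subnet ID, and instance profile ARN are placeholders, and your account's values will differ.

   ```
   # Sketch: a compute node group backed by a Capacity Block.
   # All identifiers below are placeholders.
   aws pcs create-compute-node-group \
     --cluster-identifier pcs_abc123 \
     --compute-node-group-name cb-nodes \
     --subnet-ids subnet-0123456789abcdef0 \
     --custom-launch-template id=lt-0123456789abcdef0,version=1 \
     --iam-instance-profile-arn arn:aws:iam::111122223333:instance-profile/pcs-instance-profile \
     --scaling-configuration minInstanceCount=0,maxInstanceCount=4 \
     --instance-configs instanceType=p5.48xlarge \
     --purchase-option CAPACITY_BLOCK
   ```

   Note that `maxInstanceCount` (4 here) must not exceed the size of the Capacity Block.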

**Important**  
You can't change the instance type of a compute node group when you update it. You can only use a Capacity Block with the same instance type as the compute node group. If you want to use a Capacity Block with a different instance type, you must create a new compute node group.

# Frequently asked questions about using Capacity Blocks with AWS PCS
<a name="capacity-blocks-faq"></a>

**I just paid for a Capacity Block and immediately attempted to use it with AWS PCS but compute node group creation failed. What happened?**  
Your Capacity Block might not be in a `scheduled` or `active` state. Try again after the Capacity Block is `scheduled` or `active`.
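
You can check the state from the command line. This is a minimal sketch; the reservation ID is a placeholder.

```
# Sketch: check the state of a Capacity Block reservation.
# The reservation ID is a placeholder.
aws ec2 describe-capacity-reservations \
  --capacity-reservation-ids cr-0123456789abcdef0 \
  --query 'CapacityReservations[0].State' \
  --output text
```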

**I am using a Capacity Block in AWS PCS and I purchased an extension before it expired. How do I continue using it in AWS PCS?**  
You don't have to do anything to continue using the Capacity Block in AWS PCS. The end date of your Capacity Block updates after your extension payment succeeds. As long as your Capacity Block doesn't expire, the compute node group continues to operate. If your extension payment fails, your Capacity Block remains `active` and the compute node group operates until the Capacity Block expires on its original end date.

**What happens to my queued and running jobs if my Capacity Block expires?**  
Queued jobs that didn't start before the Capacity Block expired remain pending until you attach another compute node group to the queue or you update the compute node group with a new Capacity Block. You can still submit jobs to the queue. Your Slurm settings affect active jobs. By default, active jobs are automatically re-queued, but might have errors or fail.

**My Capacity Block expired. Should I do something?**  
You don't have to do anything. You can check the Amazon EC2 console for the status of your EC2 capacity reservations. When a Capacity Block expires, the compute node group associated with that Capacity Block continues to exist and handle the same queues. The compute node group doesn't have any instances to run jobs. You can delete the compute node group or disassociate it from the queues to prevent users from submitting jobs that won't run.

**I want to use a new Capacity Block with my AWS PCS compute node group. What should I do?**  
We recommend you create a new compute node group to use the new Capacity Block. For more information, see [Configure an AWS PCS compute node group to use a Capacity Block](capacity-blocks-configure-cng.md).

**How can I share 1 Capacity Block across clusters and services?**  
You can split a Capacity Block across multiple clusters and services. For example, to split a Capacity Block of 64 `p5.48xlarge` instances into 20 nodes for PCS-Cluster-1, 16 nodes for PCS-Cluster-2, and the remaining nodes for other services, set both `minInstanceCount` and `maxInstanceCount` to 20 for the compute node group on PCS-Cluster-1 and to 16 for the compute node group on PCS-Cluster-2.
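
The split described above can be sketched with the AWS CLI for PCS. The cluster and compute node group identifiers are placeholders; setting `minInstanceCount` equal to `maxInstanceCount` pins each group to a fixed share of the reservation.

```
# Sketch: pin each compute node group to a fixed share of a
# 64-instance Capacity Block. Identifiers are placeholders.
aws pcs update-compute-node-group \
  --cluster-identifier pcs-cluster-1 \
  --compute-node-group-identifier cb-nodes \
  --scaling-configuration minInstanceCount=20,maxInstanceCount=20

aws pcs update-compute-node-group \
  --cluster-identifier pcs-cluster-2 \
  --compute-node-group-identifier cb-nodes \
  --scaling-configuration minInstanceCount=16,maxInstanceCount=16
```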

**Can I use more than 1 Capacity Block or combined capacity with 1 compute node group?**  
No. Only 1 Capacity Block can be associated with a single compute node group. AWS PCS doesn't support capacity reservation groups that combine multiple Capacity Blocks.

**How do I know when my Capacity Blocks start or expire?**  
Independent from AWS PCS, Amazon EC2 sends a `Capacity Block Reservation Delivered` event through EventBridge when a Capacity Block reservation starts and a `Capacity Block Reservation Expiration Warning` event 40 minutes before the Capacity Block reservation expires. For more information, see [Monitor Capacity Blocks using EventBridge](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-monitor.html) in the *Amazon Elastic Compute Cloud User Guide*.
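
As a sketch, an EventBridge rule that matches both lifecycle events named above might look like the following. The rule name is a placeholder, and you would still need to add a target (for example, an SNS topic) with `aws events put-targets`.

```
# Sketch: an EventBridge rule matching Capacity Block lifecycle events.
# The rule name is a placeholder.
aws events put-rule \
  --name capacity-block-lifecycle \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": [
      "Capacity Block Reservation Delivered",
      "Capacity Block Reservation Expiration Warning"
    ]
  }'
```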

**How does Slurm track the state of my Capacity Block?**  
You can run `sinfo` to understand how AWS PCS uses the Capacity Block. In the following example output, a queue is associated with a compute node group that runs 4 instances from an `active` Capacity Block. The nodes are in the `idle` Slurm state (available for use and not yet allocated to any jobs).  

```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
fanout       up   infinite      4   idle node-fanout-[1-4]
```
If the nodes are instead in `maint` state, you can run `scontrol show res` to see details about the Slurm reservation that controls this state. In the following example output, the Capacity Block is `scheduled` with a future start date.  

```
$ scontrol show res
ReservationName=node-fanout-scheduled StartTime=2025-10-14T13:09:17 EndTime=2025-10-14T13:11:17 Duration=00:02:00
   Nodes=node-fanout-[1-4] NodeCnt=4 CoreCnt=16 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES
   TRES=cpu=16
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null)
   MaxStartDelay=(null)
   Comment=node-fanout Scheduled
```

**How can I tell if the errors I'm getting while launching capacity are because my Capacity Block is shared?**  
Check **Capacity Reservations** in the Amazon EC2 console to find how many instances from the Capacity Block are actively provisioned. Check the tags of each instance to find which service or cluster uses it. For example, all instances for AWS PCS have AWS PCS tags such as `aws:pcs:cluster-id = pcs_l0mizqyk5o | aws:pcs:compute-node-group-id = pcs_ic7onkmfqk` that indicate which clusters and compute node groups the instance belongs to. You can then check if the Capacity Block is at maximum capacity.  
You can use `scontrol show nodes` to check whether a Capacity Block node in an AWS PCS cluster is reporting `ReservationCapacityExceeded`:  

```
[root@ip-172-16-10-54 ~]# scontrol show nodes test-8-gamma-cb-2
NodeName=test-8-gamma-cb-2 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=8 CPUTot=8 CPULoad=0.00
   AvailableFeatures=test-8-gamma-cb,gpu
   ActiveFeatures=test-8-gamma-cb,gpu
   Gres=gpu:H100:1
   NodeAddr=test-8-gamma-cb-2 NodeHostName=test-8-gamma-cb-2
   RealMemory=249036 AllocMem=0 FreeMem=N/A Sockets=8 Boards=1
   State=IDLE+CLOUD+POWERING_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=my-q
   BootTime=None SlurmdStartTime=None
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=8,mem=249036M,billing=8
   AllocTRES=
   CurrentWatts=0 AveWatts=0
   Reason=Failed to launch backing instance (Error Code: ReservationCapacityExceeded) [root@2025-08-28T15:15:33]
```

**When multiple compute node groups are attached to the same queue, how can I force a job to run on Capacity Block-backed instances?**  
You can use Slurm features and constraints to lock a job to a certain set of nodes. We recommend that you don't set Slurm weights for each compute node group because that only works with nodes that aren't in the `maint` state.
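
As a sketch, assuming your Capacity Block-backed compute node group exposes a Slurm feature (for example, `cb-nodes`, as shown in the `AvailableFeatures` field of `scontrol show nodes` output), you can constrain a job to those nodes at submission time:

```
# Sketch: pin a job to nodes that advertise a given Slurm feature.
# The feature name (cb-nodes) and script name are placeholders.
sbatch --constraint=cb-nodes job.sh
```

You can also set the constraint inside the batch script with an `#SBATCH --constraint=cb-nodes` directive instead of passing it on the command line.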