Slurm accounting with AWS ParallelCluster
Starting with version 3.3.0, AWS ParallelCluster supports Slurm accounting with the cluster configuration parameter SlurmSettings / Database.
Starting with version 3.10.0, AWS ParallelCluster supports Slurm accounting with an external Slurmdbd with the cluster configuration parameter SlurmSettings / ExternalSlurmdbd. Using an external Slurmdbd is recommended if multiple clusters share the same database.
With Slurm accounting, you can integrate an external accounting database to do the following:
- Manage cluster users or groups of users and other entities. With this capability, you can use Slurm's more advanced features, such as resource limit enforcement, fairshare, and QOSs.
- Collect and save job data, such as the user that ran the job, the job's duration, and the resources it uses. You can view the saved data with the sacct utility.
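As a rough illustration of viewing saved job data, the following sketch wraps a sacct query in a shell function. The function name show_recent_jobs and the selected columns are illustrative assumptions, not part of AWS ParallelCluster; --allusers, --starttime, and --format are standard sacct options.

```shell
# Hypothetical helper: list accounting records for all users since a start time.
# $1: start time in a format that sacct accepts, for example 2024-01-01
show_recent_jobs() {
  sacct --allusers --starttime "$1" --format=JobID,User,Elapsed,AllocCPUS,State
}
```

On a head node with accounting enabled, calling show_recent_jobs 2024-01-01 would list each job's ID, user, elapsed time, allocated CPUs, and final state.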
Note
AWS ParallelCluster supports Slurm accounting for Slurm-supported MySQL database servers.
Working with Slurm accounting using external Slurmdbd in AWS ParallelCluster v3.10.0 and later
Before you configure Slurm accounting, you must have an existing external Slurmdbd server that connects to an existing external database server.
To configure this, define the following:
- The address of the external Slurmdbd server in ExternalSlurmdbd / Host. The server must exist and be reachable from the head node.
- The munge key to communicate with the external Slurmdbd server in MungeKeySecretArn.
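The two settings above might look like the following in a cluster configuration. This is a minimal sketch: the host name, port, secret ARN, and output file name are placeholders, and the placement of MungeKeySecretArn directly under SlurmSettings is an assumption to confirm against the configuration reference for your version.

```shell
# Write a partial cluster configuration to merge into your full config file.
# All values below are placeholders (assumptions), not real resources.
cat > external-slurmdbd-fragment.yaml <<'EOF'
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    MungeKeySecretArn: arn:aws:secretsmanager:us-east-1:111122223333:secret:munge-key-EXAMPLE
    ExternalSlurmdbd:
      Host: slurmdbd.example.internal
      Port: 6819   # 6819 is the default slurmdbd port
EOF
```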
To step through a tutorial, see Creating a cluster with an external Slurmdbd accounting.
Note
You are responsible for managing the Slurm database accounting entities.
The architecture of the AWS ParallelCluster external Slurmdbd support feature enables multiple clusters to share the same Slurmdbd and the same database.
Warning
Traffic between AWS ParallelCluster and the external Slurmdbd is not encrypted. We recommend that you run the cluster and the external Slurmdbd in a trusted network.
Working with Slurm accounting using head node Slurmdbd in AWS ParallelCluster v3.3.0 and later
Before you configure Slurm accounting, you must have an existing external database server and database that uses the mysql protocol.
To configure Slurm accounting with AWS ParallelCluster, you must define the following:
- The URI for the external database server in Database / Uri. The server must exist and be reachable from the head node.
- Credentials to access the external database that are defined in Database / PasswordSecretArn and Database / UserName. AWS ParallelCluster uses this information to configure accounting at the Slurm level and the slurmdbd service on the head node. slurmdbd is the daemon that manages communication between the cluster and the database server.
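A cluster configuration with these settings could be sketched as follows. The database endpoint, user name, secret ARN, and output file name are placeholders, and the host:port form of Uri is an assumption to verify against the configuration reference.

```shell
# Write a partial cluster configuration to merge into your full config file.
# All values below are placeholders (assumptions), not real resources.
cat > slurm-accounting-fragment.yaml <<'EOF'
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Database:
      Uri: slurm-accounting-db.example.internal:3306
      UserName: clusteradmin
      PasswordSecretArn: arn:aws:secretsmanager:us-east-1:111122223333:secret:db-password-EXAMPLE
EOF
```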
To step through a tutorial, see Creating a cluster with Slurm accounting.
Note
AWS ParallelCluster performs a basic bootstrap of the Slurm accounting database by setting the default cluster user as database admin in the Slurm database. AWS ParallelCluster doesn't add any other user to the accounting database. The customer is responsible for managing the accounting entities in the Slurm database.
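AWS ParallelCluster configures slurmdbd, including the StorageLoc parameter in the slurmdbd configuration file. In Slurm, StorageLoc is the name of the database where accounting records are written.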
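Because managing accounting entities is left to you, a first step is often to create an account and associate a user with it. The sketch below uses sacctmgr for this; the helper name add_accounting_user and the description text are hypothetical, while --immediate, add account, and add user are standard sacctmgr usage.

```shell
# Hypothetical helper: create an account and associate a user with it.
# $1: account name, $2: user name
add_accounting_user() {
  account="$1"
  user="$2"
  # --immediate commits each change without an interactive confirmation.
  sacctmgr --immediate add account "$account" Description="example account"
  sacctmgr --immediate add user "$user" Account="$account"
}
```

Run as the default cluster user on the head node, a call such as add_accounting_user research alice would create the research account and add alice to it.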
Warning
- We don't recommend setting up more than one cluster to use the same database at once. Doing so can lead to performance issues or even database deadlock situations.
- If Slurm accounting is enabled on the head node of a cluster, we recommend using an instance type with a powerful CPU, more memory, and higher network bandwidth. Slurm accounting can add strain on the head node of the cluster.
In the current architecture of the AWS ParallelCluster Slurm accounting feature, each cluster has its own instance of the slurmdbd daemon, as shown in the following diagram of example configurations.
If you're adding custom Slurm multi-cluster or federation functionalities to your cluster environment, all clusters must reference the same slurmdbd instance. For this alternative, we recommend that you enable AWS ParallelCluster Slurm accounting on one cluster and manually configure the other clusters to connect to the slurmdbd that is hosted on the first cluster.
If you're using AWS ParallelCluster versions prior to version 3.3.0, refer to the alternative method to implement Slurm accounting that's described in this HPC Blog Post.
Slurm accounting considerations
Database and cluster on different VPCs
To enable Slurm accounting, a database server is needed to serve as a backend for the read and write operations that the slurmdbd daemon performs. Before the cluster is created or updated to enable Slurm accounting, the head node must be able to reach the database server.
If you need to deploy the database server on a VPC other than the one that the cluster uses, consider the following:
- To enable communication between the slurmdbd on the cluster side and the database server, you must set up connectivity between the two VPCs. For more information, see VPC Peering in the Amazon Virtual Private Cloud User Guide.
- You must create the security group that you want to attach to the head node on the VPC of the cluster. After the two VPCs have been peered, cross-linking between the database side and the cluster side security groups is available. For more information, see Security Group Rules in the Amazon Virtual Private Cloud User Guide.
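Once the VPCs are peered, cross-linking the security groups could be sketched with the AWS CLI as follows. The helper name allow_db_access is hypothetical, the security group IDs are supplied by you, and port 3306 assumes the database listens on the default MySQL port.

```shell
# Hypothetical helper: let the head node security group reach the database.
# $1: database security group ID, $2: head node security group ID
allow_db_access() {
  db_sg="$1"
  head_node_sg="$2"
  # Port 3306 is the MySQL default; change it if your server uses another port.
  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port 3306 \
    --source-group "$head_node_sg"
}
```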
Configuring TLS encryption between slurmdbd and the database server
With the default Slurm accounting configuration that AWS ParallelCluster provides, slurmdbd establishes a TLS-encrypted connection to the database server, if the server supports TLS encryption. AWS database services such as Amazon RDS and Amazon Aurora support TLS encryption by default.
You can require secure connections on the server side by setting the require_secure_transport parameter on the database server. This is configured in the provided CloudFormation template.
Following security best practice, we recommend that you also enable server identity verification on the slurmdbd client. To do this, configure StorageParameters in slurmdbd.conf. Upload the server CA certificate to the head node of the cluster. Next, set the SSL_CA option of StorageParameters in slurmdbd.conf to the path of the server CA certificate on the head node. Doing this enables server identity verification on the slurmdbd side. After you make these changes, restart the slurmdbd service to re-establish connectivity to the database server with identity verification enabled.
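The slurmdbd.conf change described above can be sketched as a small shell helper. The function name set_storage_ssl_ca is hypothetical, and the sketch assumes SSL_CA is the only StorageParameters option you need, because it replaces any existing StorageParameters line.

```shell
# Hypothetical helper: set StorageParameters=SSL_CA=<path> in a slurmdbd.conf.
# $1: path to slurmdbd.conf, $2: path to the server CA certificate
set_storage_ssl_ca() {
  conf_file="$1"
  ca_path="$2"
  tmp_file="$(mktemp)"
  # Keep every line except an existing StorageParameters entry.
  grep -v '^StorageParameters=' "$conf_file" > "$tmp_file"
  printf 'StorageParameters=SSL_CA=%s\n' "$ca_path" >> "$tmp_file"
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}
```

On the head node, you would run this with root privileges against the slurmdbd.conf that your cluster uses (its location is not stated here, so check your installation), and then restart the service, for example with sudo systemctl restart slurmdbd, assuming slurmdbd runs as a systemd unit.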
Updating the database credentials
To update the values for Database / UserName or Database / PasswordSecretArn, you must first stop the compute fleet. If the secret value that's stored in the AWS Secrets Manager secret changes but its ARN doesn't, the cluster doesn't automatically update the database password to the new value. To update the cluster for the new secret value, run the following command from the head node.
$ sudo /opt/parallelcluster/scripts/slurm/update_slurm_database_password.sh
Warning
To avoid losing accounting data, we recommend that you only change the database password when the compute fleet is stopped.
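Stopping and restarting the compute fleet around a password change could be sketched with the pcluster CLI as follows. The helper names are hypothetical; the cluster name is supplied by you.

```shell
# Hypothetical helpers, run where the pcluster CLI is configured.
# $1: cluster name
stop_compute_fleet() {
  pcluster update-compute-fleet --cluster-name "$1" --status STOP_REQUESTED
}

start_compute_fleet() {
  pcluster update-compute-fleet --cluster-name "$1" --status START_REQUESTED
}
```

A typical sequence would be to call stop_compute_fleet, wait for the fleet to stop, run the password update script on the head node, and then call start_compute_fleet.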
Database monitoring
We recommend that you enable the monitoring features of the AWS database services. For more information, see Amazon RDS monitoring or Amazon Aurora monitoring documentation.