

# Usage reporting for cost attribution in SageMaker HyperPod
<a name="sagemaker-hyperpod-usage-reporting"></a>

Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides granular visibility into compute resource consumption. The capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU hours, CPU hours, and Neuron Core utilization, captured in *both team-level aggregates and task-specific breakdowns*, usage reporting complements HyperPod's [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality and supports fair cost distribution in shared multi-tenant clusters by:
+ Eliminating guesswork in cost allocation
+ Directly linking expenses to measurable resource consumption
+ Enforcing usage-based accountability in shared infrastructure environments

## Prerequisites
<a name="sagemaker-hyperpod-usage-reporting-prerequisites"></a>

To use this capability:
+ You need:
  + An active **SageMaker HyperPod environment** with a running EKS-orchestrated cluster.
  + (Strongly recommended) **Task Governance configured** with compute quotas and priority rules. For setup instructions, see [Task Governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ Familiarize yourself with these core concepts:
  + **Allocated compute quota:** Resources reserved for a team based on predefined quotas in their Task Governance policies. This is *guaranteed capacity* for their workloads.
  + **Borrowed compute:** Idle resources from the shared cluster pool that teams can temporarily use *beyond their allocated quota*. Borrowed compute is assigned dynamically based on priority rules in the Task Governance policies and availability of unused resources.
  + **Compute usage:** The measurement of resources (GPU, CPU, Neuron Core hours) consumed by a team, tracked as:
    + **Allocated utilization**: Usage within the team's quota.
    + **Borrowed utilization**: Usage beyond the quota, drawn from the shared pool.
  + **Cost attribution:** The process of allocating cluster costs to teams based on their *actual compute usage*, including both resources consumed within their predefined quota and resources temporarily used from the shared cluster pool beyond their quota.

## Report types
<a name="sagemaker-hyperpod-usage-reporting-report-types"></a>

HyperPod's usage reports provide varying operational granularity:
+ **Summary reports** provide organization-wide visibility into compute usage, aggregating total GPU/CPU/Neuron Core hours per team (namespace) while distinguishing between *regular usage* (resources from a team's allocated quota) and *borrowed compute* (overflow capacity from shared pools).
+ **Detailed reports** offer task-level breakdowns by team, tracking exact compute hours spent running specific tasks – including preempted tasks, hourly utilization patterns, and namespace-specific allocations.

**Important**  
HyperPod usage reporting tracks compute utilization across *all Kubernetes namespaces* in a cluster—including those managed by Task Governance, default namespaces, and namespaces created **outside of Task Governance** (e.g., via direct Kubernetes API calls or external tools). This infrastructure-level monitoring ensures comprehensive usage-based accountability, preventing gaps in cost attribution for shared clusters regardless of how namespaces are managed.

## Report formats and time range
<a name="sagemaker-hyperpod-usage-reporting-formats"></a>

Using the Python script provided in [Generate reports](sagemaker-hyperpod-usage-reporting-generate.md), administrators can generate usage reports on demand in CSV or PDF format, selecting time ranges from daily snapshots to 180-day (6-month) historical windows.

**Note**  
You can configure the historical window to extend beyond the default 180-day maximum when setting up the reporting infrastructure. For more information on configuring the data retention period, see [Install Usage Report Infrastructure using CloudFormation](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#install-usage-report-infrastructure-using-cloudformation). 

## Illustrative use cases
<a name="sagemaker-hyperpod-usage-reporting-use-cases"></a>

This capability addresses critical scenarios in multi-tenant AI/ML environments such as:

1. **Cost allocation for shared clusters**: An administrator manages a HyperPod cluster shared by 20 teams training generative AI models. Using a *summary usage report*, they analyze daily GPU utilization over 180 days and discover Team A consumed 200 GPU hours of a specific instance type—170 from their allocated quota and 30 from borrowed compute. The administrator invoices Team A based on this reported usage.

1. **Auditing and dispute resolution**: A finance team questions cost attribution accuracy, citing inconsistencies. The administrator can export a *detailed task-level report* to audit discrepancies. By cross-referencing timestamps, instance types, and preempted jobs within the team's namespace, the administrator can use the report to reconcile the disputed usage data transparently.
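
The cost allocation scenario above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the administrator splits a known cluster cost in proportion to each team's reported GPU hours, which is one possible charge-back policy and not something the usage reports prescribe. The function name `attribute_costs`, the team names, and the cost figure are hypothetical.

```python
def attribute_costs(cluster_cost, gpu_hours_by_team):
    """Split a cluster cost across teams in proportion to reported GPU hours.

    Assumption: proportional allocation is the chosen charge-back policy.
    """
    total_hours = sum(gpu_hours_by_team.values())
    if total_hours == 0:
        # No usage reported for the period, so nothing is attributed.
        return {team: 0.0 for team in gpu_hours_by_team}
    return {
        team: cluster_cost * hours / total_hours
        for team, hours in gpu_hours_by_team.items()
    }


# Hypothetical figures: Team A used 200 GPU hours (170 allocated + 30 borrowed).
shares = attribute_costs(1000.0, {"team-a": 170 + 30, "team-b": 600})
print(shares)  # {'team-a': 250.0, 'team-b': 750.0}
```

Other policies, such as pricing borrowed hours differently from allocated hours, would need the allocated and borrowed figures kept separate instead of summed.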

# Report details and data breakdown
<a name="sagemaker-hyperpod-usage-reporting-content"></a>

SageMaker HyperPod's usage reports provide two distinct lenses for analyzing compute resource consumption: **summary reports** for cost allocation and **detailed reports** for granular auditing. Summary reports aggregate cluster-wide usage by team or namespace, highlighting trends in allocated versus borrowed compute across GPU, CPU, and Neuron Core resources. Detailed reports drill into individual tasks, exposing metrics such as execution windows, task status, and priority-class utilization. This section breaks down the structure of these reports, explains their key metrics, and shows how administrators and finance teams can cross-reference summary trends with task-level data to validate cost attribution accuracy, resolve discrepancies, and optimize shared infrastructure.

## Common report headers
<a name="sagemaker-hyperpod-usage-reporting-content-headers"></a>

Both summary and detailed reports include the following metadata to contextualize the usage data:
+ **ClusterName:** The name of the EKS-orchestrated HyperPod cluster where resources were consumed.
+ **Type:** The report category (`Summary Utilization Report` or `Detailed Utilization Report`).
+ **Date Generated:** When the report was created (e.g., `2025-04-18`).
+ **Date Range (UTC):** The timeframe covered (e.g., `2025-04-16 to 2025-04-18`).
+ **Missing data periods:** Gaps in data collection due to cluster downtime or monitoring issues (e.g., `2025-04-16 00:00:00 to 2025-04-19 00:00:00`).

## Summary reports
<a name="sagemaker-hyperpod-usage-reporting-content-summary"></a>

Summary reports provide a per-day, high-level overview of compute resource consumption across teams (namespaces) and instance types, distinguishing between allocated (reserved quota) and borrowed (shared pool) utilization. These reports are ideal for invoice generation, cost attribution statements, or capacity forecasting.

*Example: A summary report might show that Team A used 200 GPU hours—170 from their allocated quota and 30 borrowed.*

Here's a structured breakdown of the key columns in a summary report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Instance Type:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Total/Allocated/Borrowed Utilization (Hours):** The breakdown of GPU, CPU, or Neuron Core usage by category.

  Where:
  + **Total utilization = Allocated utilization + Borrowed utilization**
  + **Allocated utilization** is the actual GPU, CPU, or Neuron Core hours a team has used, capped at 100% of their allocated quota.
  + **Borrowed utilization** is the actual GPU, CPU, or Neuron Core hours a team has used *beyond their allocated quota*, drawn from the shared cluster pool based on Task Governance priority rules and resource availability.

Example: 72 GPU hours total (48 allocated, 24 borrowed).
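
The relationship between the three figures can be expressed as a short Python sketch. It assumes the quota is already expressed in resource-hours for the reporting period; the function name `split_utilization` and its parameters are hypothetical and are not part of the reporting tooling.

```python
def split_utilization(used_hours, quota_hours):
    """Split used resource-hours into (allocated, borrowed) portions.

    Assumption: quota_hours is the team's quota converted to resource-hours
    for the same period that used_hours covers.
    """
    if used_hours < 0 or quota_hours < 0:
        raise ValueError("hours must be non-negative")
    allocated = min(used_hours, quota_hours)  # capped at 100% of the quota
    borrowed = used_hours - allocated         # usage beyond the quota
    return allocated, borrowed


allocated, borrowed = split_utilization(used_hours=72, quota_hours=48)
print(allocated, borrowed)  # 48 24
```

By construction, the two values always sum back to the total utilization.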

**Note**  
Only total utilization is displayed for namespaces not managed by Task Governance.
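
As an illustration of working with a CSV summary report, the following sketch totals utilization per namespace using only the Python standard library. The column headers and sample rows are assumptions made for this example; check the header row of an actual exported report before reusing the names. Rows for namespaces not managed by Task Governance are modeled here with empty allocated and borrowed cells, which is also an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; real reports may use different column names.
SAMPLE_CSV = """Date,Namespace,Team,Instance Type,Total Utilization (Hours),Allocated Utilization (Hours),Borrowed Utilization (Hours)
2025-04-17,hyperpod-ns-ml-team,ml-team,ml.g5.4xlarge,72,48,24
2025-04-18,hyperpod-ns-ml-team,ml-team,ml.g5.4xlarge,40,40,0
2025-04-18,default,,ml.g5.4xlarge,10,,
"""


def to_hours(value):
    """Convert a CSV cell to float, treating an empty cell as 0."""
    return float(value) if value.strip() else 0.0


def totals_by_namespace(csv_text):
    """Sum total, allocated, and borrowed hours for each namespace."""
    totals = defaultdict(lambda: {"total": 0.0, "allocated": 0.0, "borrowed": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["Namespace"]]
        entry["total"] += to_hours(row["Total Utilization (Hours)"])
        entry["allocated"] += to_hours(row["Allocated Utilization (Hours)"])
        entry["borrowed"] += to_hours(row["Borrowed Utilization (Hours)"])
    return dict(totals)


for namespace, hours in totals_by_namespace(SAMPLE_CSV).items():
    print(namespace, hours)
```

To process a downloaded report instead of the inline sample, read the file contents into a string and pass it to `totals_by_namespace`.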

## Detailed reports
<a name="sagemaker-hyperpod-usage-reporting-content-detailed"></a>

Detailed reports provide forensic-level visibility into compute usage, breaking down resource consumption by task and exposing granular metrics such as task execution windows, status (e.g., Succeeded, Failed), and priority-class usage. These reports are ideal for validating billing discrepancies or ensuring compliance with governance policies.

Here's a structured breakdown of the key columns in a detailed report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Period Start/End:** The exact execution window (UTC) for the task (e.g., `19:54:34`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Task:** The identifier for the job/pod (e.g., `pytorchjob-ml-pytorch-job-2p5zt-db686`).
+ **Instance:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Status:** Task outcome (Succeeded, Failed, Preempted).
+ **Total Utilization:** Total consumption (hours and instance count) of GPU, CPU, or Neuron Core resources.
+ **Priority Class:** The priority tier assigned (e.g., `training-priority`).
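
When auditing, one practical check is to compare task-level hours against the per-day totals from a summary report. The sketch below shows the idea with in-memory rows. The field names (`date`, `namespace`, `task`, `status`, `hours`), the task names, and the sample values are hypothetical, and the sketch assumes that task hours for a day and namespace add up to the summary total for that same day and namespace.

```python
from collections import defaultdict

# Hypothetical task-level rows, already parsed from a detailed report.
TASK_ROWS = [
    {"date": "2025-04-18", "namespace": "hyperpod-ns-ml-team",
     "task": "task-a", "status": "Succeeded", "hours": 30.0},
    {"date": "2025-04-18", "namespace": "hyperpod-ns-ml-team",
     "task": "task-b", "status": "Preempted", "hours": 6.0},
    {"date": "2025-04-18", "namespace": "hyperpod-ns-ml-team",
     "task": "task-c", "status": "Failed", "hours": 4.0},
]


def hours_by_status(rows):
    """Sum task hours for each task status."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["status"]] += row["hours"]
    return dict(totals)


def find_mismatches(rows, summary_totals, tolerance=1e-6):
    """Return {(date, namespace): (task_sum, summary_total)} where they differ."""
    task_sums = defaultdict(float)
    for row in rows:
        task_sums[(row["date"], row["namespace"])] += row["hours"]
    mismatches = {}
    for key, reported in summary_totals.items():
        task_sum = task_sums.get(key, 0.0)
        if abs(task_sum - reported) > tolerance:
            mismatches[key] = (task_sum, reported)
    return mismatches


summary = {("2025-04-18", "hyperpod-ns-ml-team"): 40.0}
print(hours_by_status(TASK_ROWS))
print(find_mismatches(TASK_ROWS, summary))  # {} means the reports agree
```

Any key returned by `find_mismatches` points to a day and namespace worth inspecting task by task, including preempted and failed tasks.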

# Generate reports
<a name="sagemaker-hyperpod-usage-reporting-generate"></a>

This guide provides step-by-step instructions to configure and manage usage reporting for your SageMaker HyperPod clusters. Follow these procedures to deploy infrastructure, generate custom reports, and remove resources when no longer needed.

## Set up usage reporting
<a name="sagemaker-hyperpod-usage-reporting-install"></a>

**Note**  
Before configuring the SageMaker HyperPod usage report infrastructure in your SageMaker HyperPod cluster, ensure you have met all prerequisites detailed in the [Prerequisites](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#prerequisites) section of the SageMaker HyperPod usage report README.

Usage reporting in HyperPod requires:
+ Deploying SageMaker HyperPod usage report AWS resources using an AWS CloudFormation stack
+ Installing the SageMaker HyperPod usage report Kubernetes operator via a Helm chart

You can find comprehensive installation instructions in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md). Specifically, follow the steps in the [Set up](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#set-up-usage-reporting) section.

## Generate usage reports on demand
<a name="sagemaker-hyperpod-usage-reporting-use"></a>

Once the usage reporting infrastructure and Kubernetes operator are installed, job data for your SageMaker HyperPod cluster is automatically collected and stored in the S3 bucket you configured during setup. The operator continuously captures detailed usage metrics in the background, creating raw data files in the `raw` directory of your designated S3 bucket.

To generate an on-demand usage report, you can use the `run.py` script provided in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md) to extract and export usage metrics. Specifically, you can find the script and comprehensive instructions for generating a report in the [Generate Reports](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#generate-reports) section.

The script allows you to:
+ Specify custom date ranges for report generation
+ Choose between detailed and summary report types
+ Export reports in CSV or PDF formats
+ Direct reports to a specific S3 location

## Clean up usage reporting resources
<a name="sagemaker-hyperpod-usage-reporting-cleanup"></a>

When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow the steps in [Clean Up Resources](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#clean-up-resources) to clean up the Kubernetes operator and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.