# Balancing workloads across HA pairs
<a name="monitor-workload-balance"></a>

If you have a file system with multiple high-availability (HA) pairs, then its throughput and storage are spread across each of your HA pairs. FSx for ONTAP automatically balances your files as they are written to your file system, but your workload data and I/O are no longer balanced once you add HA pairs. Additionally, in rare cases, your workload data or I/O could become unbalanced across your file system's existing HA pairs, which can impact your workload's overall performance. If your workload is ever imbalanced, you can rebalance it across each of your file system’s HA pairs (and their commensurate file servers and *aggregates*—the storage pools which make up your primary storage tier).

**Topics**
+ [Primary storage utilization balance](#primary-storage-balance)
+ [File server and disk performance utilization imbalance](#server-disk-imbalance)
+ [Mapping CloudWatch dimensions to ONTAP CLI and REST API resources](#map-dimensions-to-resources)
+ [Rebalancing clients](#rebalancing-clients)
+ [Rebalancing volumes](#rebalancing-volumes)

## Primary storage utilization balance
<a name="primary-storage-balance"></a>

Your file system’s primary storage capacity is divided evenly among each of your HA pairs in storage pools called aggregates. Each HA pair has one aggregate. We recommend that you maintain an average utilization no higher than 80% for your primary storage tier on an ongoing basis. For file systems with multiple HA pairs, we recommend that you maintain an average utilization of up to 80% for every aggregate.

Maintaining 80% utilization ensures there is free space for new incoming data, and maintains a healthy overhead for maintenance operations which can temporarily claim free space on your aggregates.

If you notice that your aggregates are imbalanced, you can either increase your file system’s primary storage capacity (commensurately increasing the storage capacity of each aggregate), or you can move your volumes between aggregates. For more information, see [Moving volumes between aggregates](moving-fg-volumes.md).

## File server and disk performance utilization imbalance
<a name="server-disk-imbalance"></a>

Your file system’s total performance capabilities (such as the network throughput, file server to disk throughput and IOPS, and disk IOPS) is divided evenly among your file system’s HA pairs. We recommend that you maintain an average utilization below 50% (and a maximum peak utilization below 80%) for all performance limits on an ongoing basis—this goes for both the overall utilization of your file system’s file server resources across all HA pairs, as well as on a per-file server basis.

If you notice that your file server performance utilization is imbalanced—and the file servers on which your workload is imbalanced have an ongoing utilization of over 80%—you can use the ONTAP CLI and REST API to further diagnose the cause of performance imbalance and remediate it. Following is a table of possible imbalance indicators and next steps for further diagnosis.


| If your file system's... | Then... | 
| --- | --- | 
| File server disk throughput or file server disk IOPS are imbalanced | You may be experiencing I/O hotspotting on a subset of HA pairs (a subset of your volumes containing an outsized amount of data being accessed) which can limit your workload's overall performance because it's bottlenecked against a subset of HA pairs. For each highly-utilized file server, check the most-utilized volumes to see which volumes have the most activity within an aggregate. For more information on this procedure, see [Rebalancing volumes](#rebalancing-volumes). | 
| Network throughput is imbalanced, but your file server disk throughput, file server disk IOPS, or disk IOPS are not imbalanced  | Your data is evenly-distributed across HA pairs, but your clients are not. For the file servers which have more network throughput utilization than others, check the top clients for each file server, then rebalance those clients by unmounting any volumes from those clients and remounting them using a different endpoint on a different HA pair. For more information on this procedure, see [Rebalancing clients](#rebalancing-clients).  | 

## Mapping CloudWatch dimensions to ONTAP CLI and REST API resources
<a name="map-dimensions-to-resources"></a>

Your second-generation file system has Amazon CloudWatch metrics with the `FileServer` or `Aggregate` dimension. In order to further diagnose cases of imbalance, you need to map these dimension values to specific file servers (or *nodes*) and aggregates in the ONTAP CLI or REST API.
+ For file servers, each file server name maps to a file server (or node) name in ONTAP (for example, `FsxId01234567890abcdef-01`). Odd-numbered file servers are preferred file servers (that is, they service traffic unless the file system has failed over to the secondary file server), while even-numbered file servers are secondary file servers (that is, they serve traffic only when their partner is unavailable). Because of this, secondary file servers will typically show less utilization than preferred file servers.
+ For aggregates, each aggregate name maps to an aggregate in ONTAP (for example, `aggr1`). There is one aggregate for every HA pair, meaning aggregate `aggr1` is shared by file servers `FsxId01234567890abcdef-01` (the active file server) and `FsxId01234567890abcdef-02` (the secondary file server) in an HA pair, aggregate `aggr2` is shared by file servers `FsxId01234567890abcdef-03` and `FsxId01234567890abcdef-04`, and so on.

You can view the mappings between all aggregates and file servers using the ONTAP CLI.

1.  To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. Use the [storage aggregate show](https://docs.netapp.com/us-en/ontap-cli-9131/storage-aggregate-show.html) command, specifying the `-fields node` parameter.

   ```
   ::> storage aggregate show -fields node
   aggregate                       node                      
   ------------------------------- ------------------------- 
   aggr1                           FsxId01234567890abcdef-01
   aggr2                           FsxId01234567890abcdef-03
   aggr3                           FsxId01234567890abcdef-05 
   aggr4                           FsxId01234567890abcdef-07
   aggr5                           FsxId01234567890abcdef-09
   aggr6                           FsxId01234567890abcdef-11 
   6 entries were displayed.
   ```

## Rebalancing clients
<a name="rebalancing-clients"></a>

After adding HA pairs or if you’re experiencing I/O imbalance across file servers (specifically with network throughput utilization), you can rebalance your clients. If you’re rebalancing clients after adding HA pairs, you can skip to [Remounting clients](#remounting-clients). Otherwise, you should first identify high-traffic clients you want to move to rebalance your workload I/O. 

If you're experiencing I/O imbalance across file servers (specifically with Network throughput utilization), high I/O clients may be the cause. To identify high-traffic clients, use the ONTAP CLI.

**Identify high-traffic clients**

1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. To view the highest-traffic clients, use the [statistics top client show](https://docs.netapp.com/us-en/ontap-cli-9131/statistics-top-client-show.html) ONTAP CLI command. You can optionally specify the `-node` parameter to only view the top clients for a specific file server. If you are diagnosing imbalance for a specific file server, use the `-node` parameter, replacing `node_name` with the name of the file server (for example, `FsxId01234567890abcdef-01`).

   You can optionally add the `-interval` parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample for the amount of traffic driven to each volume. The default is `5` (seconds).

   ```
   ::> statistics top client show -node FsxId01234567890abcdef-01 [-interval [5,300]]
   ```

   In the output, the top clients are shown by their IP address and port.

   ```
                                                          *Total     Total
               Client   Vserver          Node                Ops     (Bps)
   ------------------ --------- ------------------------- ------ ---------
    172.17.236.53:938 svm01     FsxId01234567890abcdef-01   2143 140443648
   172.17.236.160:898 svm02     FsxId01234567890abcdef-01    812  53215232
   ```<a name="remounting-clients"></a>

**Remounting clients**
+ You can rebalance clients to other HA pairs. To do so, unmount the volume from the client and remount it using the DNS name for the SVM’s NFS/SMB endpoint—this returns a random endpoint corresponding to a random HA pair.

  We recommend you re-use the DNS name, but you have the option to explicitly choose which HA pair a given client mounts. To guarantee that you are mounting a client to a different endpoint, you can instead specify a different endpoint IP address than the one that corresponds to the file server that is experiencing high traffic. You can do so by running the following command:

  ```
  ::> network interface show -vserver svm_name -lif nfs_smb_management* -fields address,curr-node
  vserver   lif                  address      curr-node                 
  --------- -------------------- ------------ ------------------------- 
  svm01 nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01 
  svm01 nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03 
  2 entries were displayed.
  ```

  According to the example output for the `statistics top client show` command, client `172.17.236.53` is driving high traffic to `FsxId01234567890abcdef-01`. The output of the `network interface show` command indicates this is the address `172.31.15.89`. To mount to a different endpoint, select any other address (in this example, the only other address is `172.31.8.112`, corresponding to `FsxId01234567890abcdef-03`).

## Rebalancing volumes
<a name="rebalancing-volumes"></a>

If you're experiencing I/O imbalance across your volumes or aggregates, you can rebalance volumes in order to redistribute your I/O traffic across your volumes.

**Note**  
If you're experiencing storage utilization imbalance across your aggregates, there is generally not any performance impact unless the high utilization is coupled with I/O imbalance. While you can move volumes between aggregates to balance storage utilization, we recommend only moving volumes if you are seeing a performance impact, as moving volumes can have adverse impact on performance if you don't also consider the I/O driven to each volume you're considering moving.

1. To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the [Using the NetApp ONTAP CLI](managing-resources-ontap-apps.md#netapp-ontap-cli) section of the *Amazon FSx for NetApp ONTAP User Guide*.

   ```
   ssh fsxadmin@file-system-management-endpoint-ip-address
   ```

1. Use the [statistics volume show](https://docs.netapp.com/us-en/ontap-cli-9131/statistics-volume-show.html) ONTAP CLI command to view the highest-traffic volumes for a given aggregate, with the following changes:
   + Replace *aggregate\$1name* with the aggregate’s name (for example, `aggr1`).
   + You can optionally add the `-interval` parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample for the amount of traffic driven to each volume. The default is `5` (seconds).

   ```
   ::> statistics volume show -aggregate aggregate_name -sort-key total_ops [-interval [5,300]]
   ```

   Depending on the interval you chose, it can take up to 5 minutes to show data. The command shows all volumes in the aggregate, along with the amount of traffic being driven to each aggregate.

   ```
                                *Total Read Write Other      Read Write Latency 
       Volume Vserver Aggregate    Ops  Ops   Ops   Ops     (Bps) (Bps)    (us) 
   ---------- ------- --------- ------ ---- ----- ----- --------- ----- ------- 
   vol1__0007    svm1     aggr1   4078 4078     0     0 267255808     0    1092 
   vol1__0005    svm1     aggr1   4078 4078     0     0 267255808     0    1086 
   vol1__0003    svm1     aggr1   4077 4077     0     0 267223040     0    1086 
   vol1__0001    svm1     aggr1   4077 4077     0     0 267239424     0    1087 
   vol1__0008    svm1     aggr2   2314 2314     0     0 151650304     0    1112 
   vol1__0006    svm1     aggr2   2144 2144     0     0 140509184     0    1104 
   vol1__0002    svm1     aggr2   2183 2183     0     0 143065088     0    1106 
   vol1__0004    svm1     aggr2   2183 2183     0     0 143065088     0    1103
   ```

   The volume statistics are shown on a per-constituent basis (for example, `vol1__0015` is the 15th constituent for FlexGroup `vol1`). You can see from the example output, the constituents for `aggr1` are more highly-utilized than the constituents for `aggr2`. To balance traffic between aggregates, you can move the constituent volumes between aggregates so that traffic is more evenly distributed.

1. If you have added new HA pairs, then you should move existing volumes to new aggregates. For more information, see [Moving volumes between aggregates](moving-fg-volumes.md).