Balancing workloads
If you have a file system with multiple high-availability (HA) pairs, its throughput and storage are spread across each of your HA pairs. FSx for ONTAP automatically balances your files as they are written to your file system, but your workload data and I/O are no longer balanced once you add HA pairs. Additionally, in rare cases, your workload data or I/O can become unbalanced across your file system's existing HA pairs, which can impact your workload's overall performance. If your workload is ever imbalanced, you can rebalance it across each of your file system's HA pairs, along with their corresponding file servers and aggregates (the storage pools that make up your primary storage tier).
Primary storage utilization balance
Your file system’s primary storage capacity is divided evenly among each of your HA pairs in storage pools called aggregates. Each HA pair has one aggregate. We recommend that you maintain an average utilization no higher than 80% for your primary storage tier on an ongoing basis. For file systems with multiple HA pairs, we recommend that you maintain an average utilization of up to 80% for every aggregate.
Maintaining 80% utilization ensures there is free space for new incoming data, and maintains a healthy overhead for maintenance operations which can temporarily claim free space on your aggregates.
If you notice that your aggregates are imbalanced, you can either increase your file system’s primary storage capacity (commensurately increasing the storage capacity of each aggregate), or you can move your volumes between aggregates. For more information, see Moving volumes between aggregates.
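The 80% guidance above can be checked with simple arithmetic. The following is a minimal sketch, not an official tool: the function name, the input shape, and the sample numbers are all hypothetical, and you would need to supply real used and total capacity figures per aggregate yourself.

```python
# Hypothetical helper: flag aggregates above the recommended 80% utilization.
RECOMMENDED_MAX_UTILIZATION = 0.80  # threshold taken from the guidance above


def overutilized_aggregates(usage_bytes, threshold=RECOMMENDED_MAX_UTILIZATION):
    """Return {aggregate: utilization} for aggregates above the threshold.

    usage_bytes maps an aggregate name to a (used_bytes, total_bytes) tuple.
    """
    flagged = {}
    for name, (used, total) in usage_bytes.items():
        utilization = used / total
        if utilization > threshold:
            flagged[name] = round(utilization, 3)
    return flagged


# Made-up sample numbers, for illustration only.
sample = {
    "aggr1": (900 * 2**30, 1024 * 2**30),  # about 88% used
    "aggr2": (400 * 2**30, 1024 * 2**30),  # about 39% used
}
print(overutilized_aggregates(sample))
```

In this sample only aggr1 is flagged, which would suggest either adding primary storage capacity or moving a volume off aggr1.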
File server and disk performance utilization imbalance
Your file system’s total performance capabilities (such as network throughput, file server to disk throughput and IOPS, and disk IOPS) are divided evenly among your file system’s HA pairs. We recommend that you maintain an average utilization below 50% (and a maximum peak utilization below 80%) for all performance limits on an ongoing basis. This applies both to the overall utilization of your file system’s file server resources across all HA pairs and to each file server individually.
If you notice that your file server performance utilization is imbalanced, and the file servers on which your workload is imbalanced have an ongoing utilization of over 80%, you can use the ONTAP CLI and REST API to further diagnose the cause of the imbalance and remediate it. The following table lists possible imbalance indicators and next steps for further diagnosis.
If your file system's... | Then... |
---|---|
File server disk throughput or file server disk IOPS are imbalanced | You may be experiencing I/O hotspotting on a subset of HA pairs (a subset of your volumes containing an outsized amount of data being accessed) which can limit your workload's overall performance because it's bottlenecked against a subset of HA pairs. For each highly-utilized file server, check the most-utilized volumes to see which volumes have the most activity within an aggregate. For more information on this procedure, see Rebalancing volumes. |
Network throughput is imbalanced, but your file server disk throughput, file server disk IOPS, or disk IOPS are not imbalanced | Your data is evenly-distributed across HA pairs, but your clients are not. For the file servers which have more network throughput utilization than others, check the top clients for each file server, then rebalance those clients by unmounting any volumes from those clients and remounting them using a different endpoint on a different HA pair. For more information on this procedure, see Rebalancing clients. |
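The decision logic in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 20% deviation cutoff used to call a metric "imbalanced" is an arbitrary, hypothetical choice (the guide does not define one), and the input dictionaries stand in for per-file-server utilization percentages that you would read from CloudWatch yourself.

```python
from statistics import mean


def is_imbalanced(values, tolerance=0.20):
    """Return True if any file server deviates from the mean by more than
    `tolerance` (a hypothetical cutoff, not a value from the guide)."""
    avg = mean(values.values())
    if avg == 0:
        return False
    return any(abs(v - avg) / avg > tolerance for v in values.values())


def suggest_next_step(network, disk_throughput, disk_iops):
    """Map per-file-server utilization (percent) to a row of the table."""
    if is_imbalanced(disk_throughput) or is_imbalanced(disk_iops):
        return "check most-utilized volumes (see Rebalancing volumes)"
    if is_imbalanced(network):
        return "check top clients (see Rebalancing clients)"
    return "no imbalance detected"


# Made-up utilization percentages for two preferred file servers.
network = {"FsxId01234567890abcdef-01": 85.0, "FsxId01234567890abcdef-03": 30.0}
disk_throughput = {"FsxId01234567890abcdef-01": 40.0, "FsxId01234567890abcdef-03": 42.0}
disk_iops = {"FsxId01234567890abcdef-01": 35.0, "FsxId01234567890abcdef-03": 36.0}
print(suggest_next_step(network, disk_throughput, disk_iops))
```

Here only network throughput is skewed, so the sketch points to the client rebalancing procedure, matching the second row of the table.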
Mapping CloudWatch dimensions to ONTAP CLI and REST API resources
Your second-generation file system has Amazon CloudWatch metrics with the FileServer or Aggregate dimension. In order to further diagnose cases of imbalance, you need to map these dimension values to specific file servers (or nodes) and aggregates in the ONTAP CLI or REST API.
For file servers, each file server name maps to a file server (or node) name in ONTAP (for example, FsxId01234567890abcdef-01). Odd-numbered file servers are preferred file servers (that is, they service traffic unless the file system has failed over to the secondary file server), while even-numbered file servers are secondary file servers (that is, they serve traffic only when their partner is unavailable). Because of this, secondary file servers will typically show less utilization than preferred file servers.

For aggregates, each aggregate name maps to an aggregate in ONTAP (for example, aggr1). There is one aggregate for every HA pair, meaning aggregate aggr1 is shared by file servers FsxId01234567890abcdef-01 (the active file server) and FsxId01234567890abcdef-02 (the secondary file server) in an HA pair, aggregate aggr2 is shared by file servers FsxId01234567890abcdef-03 and FsxId01234567890abcdef-04, and so on.
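The naming pattern described above can be expressed as a small calculation. The sketch below assumes that every file server name ends in a numeric suffix and that aggregates are named aggr1, aggr2, and so on, as in the examples; the function name is hypothetical, and you should confirm the real mapping with the ONTAP CLI rather than rely on this arithmetic.

```python
def describe_file_server(name):
    """Derive HA pair number, role, and aggregate from a file server name.

    Assumes the 'FsxId...-NN' pattern used in the examples above.
    """
    _, _, suffix = name.rpartition("-")
    index = int(suffix)          # "03" -> 3
    ha_pair = (index + 1) // 2   # servers 1 and 2 -> pair 1, 3 and 4 -> pair 2
    role = "preferred" if index % 2 == 1 else "secondary"
    return {"ha_pair": ha_pair, "role": role, "aggregate": f"aggr{ha_pair}"}


print(describe_file_server("FsxId01234567890abcdef-03"))
print(describe_file_server("FsxId01234567890abcdef-04"))
```

Both calls resolve to HA pair 2 and aggr2, with -03 as the preferred file server and -04 as the secondary.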
You can view the mappings between all aggregates and file servers using the ONTAP CLI.
- To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address

- Use the storage aggregate show command, specifying the -fields node parameter.

    ::> storage aggregate show -fields node
    aggregate                       node
    ------------------------------- -------------------------
    aggr1                           FsxId01234567890abcdef-01
    aggr2                           FsxId01234567890abcdef-03
    aggr3                           FsxId01234567890abcdef-05
    aggr4                           FsxId01234567890abcdef-07
    aggr5                           FsxId01234567890abcdef-09
    aggr6                           FsxId01234567890abcdef-11
    6 entries were displayed.
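If you capture the command output as text, you can turn it into a lookup table for matching CloudWatch dimension values to ONTAP names. This is a rough sketch that assumes the output looks like the example above (a header line, a dashed separator, two whitespace-separated columns, and a trailing count line); the parser is hypothetical and real output may need adjustments.

```python
SAMPLE_OUTPUT = """\
aggregate                       node
------------------------------- -------------------------
aggr1                           FsxId01234567890abcdef-01
aggr2                           FsxId01234567890abcdef-03
aggr3                           FsxId01234567890abcdef-05
aggr4                           FsxId01234567890abcdef-07
aggr5                           FsxId01234567890abcdef-09
aggr6                           FsxId01234567890abcdef-11
6 entries were displayed.
"""


def parse_aggregate_nodes(text):
    """Return {aggregate_name: node_name} from captured CLI output."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed separator, and the summary line.
        if len(parts) != 2 or parts[0] == "aggregate" or set(parts[0]) == {"-"}:
            continue
        aggregate, node = parts
        mapping[aggregate] = node
    return mapping


aggregate_to_node = parse_aggregate_nodes(SAMPLE_OUTPUT)
print(aggregate_to_node["aggr3"])
```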
Rebalancing clients
After adding HA pairs or if you’re experiencing I/O imbalance across file servers (specifically with network throughput utilization), you can rebalance your clients. If you’re rebalancing clients after adding HA pairs, you can skip to Remounting clients. Otherwise, you should first identify high-traffic clients you want to move to rebalance your workload I/O.
High-I/O clients may be the cause of an I/O imbalance across file servers, particularly in network throughput utilization. To identify high-traffic clients, use the ONTAP CLI.
Identify high-traffic clients
- To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address

- To view the highest-traffic clients, use the statistics top client show ONTAP CLI command. You can optionally specify the -node parameter to only view the top clients for a specific file server. If you are diagnosing imbalance for a specific file server, use the -node parameter, replacing node_name with the name of the file server (for example, FsxId01234567890abcdef-01).

  You can optionally add the -interval parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample of the amount of traffic driven by each client. The default is 5 (seconds).

    ::> statistics top client show -node FsxId01234567890abcdef-01 [-interval [5,300]]

  In the output, the top clients are shown by their IP address and port.

                                                           *Total Total
    Client             Vserver   Node                      Ops    (Bps)
    ------------------ --------- ------------------------- ------ ---------
    172.17.236.53:938  svm01     FsxId01234567890abcdef-01 2143   140443648
    172.17.236.160:898 svm02     FsxId01234567890abcdef-01 812    53215232
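To compare clients programmatically, you could parse the captured output of statistics top client show and sort by throughput. The sketch below assumes the five-column layout shown in the example output; the parser and its field names are hypothetical and not part of ONTAP.

```python
SAMPLE_OUTPUT = """\
                                                       *Total Total
Client             Vserver   Node                      Ops    (Bps)
------------------ --------- ------------------------- ------ ---------
172.17.236.53:938  svm01     FsxId01234567890abcdef-01 2143   140443648
172.17.236.160:898 svm02     FsxId01234567890abcdef-01 812    53215232
"""


def parse_top_clients(text):
    """Return client rows sorted by bytes per second, highest first."""
    clients = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have five columns with numeric ops and Bps values.
        if len(parts) != 5 or not (parts[3].isdigit() and parts[4].isdigit()):
            continue
        endpoint, vserver, node, ops, bps = parts
        ip, _, port = endpoint.rpartition(":")
        clients.append({
            "ip": ip,
            "port": int(port),
            "vserver": vserver,
            "node": node,
            "ops": int(ops),
            "bps": int(bps),
        })
    return sorted(clients, key=lambda c: c["bps"], reverse=True)


top_clients = parse_top_clients(SAMPLE_OUTPUT)
for client in top_clients:
    print(client["ip"], client["node"], client["bps"])
```

The first entry in the sorted list is the client driving the most traffic to the file server, which makes it the first candidate for remounting.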
Remounting clients
- You can rebalance clients to other HA pairs. To do so, unmount the volume from the client and remount it using the DNS name for the SVM’s NFS/SMB endpoint, which returns a random endpoint corresponding to a random HA pair.

  We recommend you re-use the DNS name, but you have the option to explicitly choose which HA pair a given client mounts. To guarantee that you are mounting a client to a different endpoint, you can instead specify a different endpoint IP address than the one that corresponds to the file server that is experiencing high traffic. You can do so by running the following command:

    ::> network interface show -vserver svm_name -lif nfs_smb_management* -fields address,curr-node
    vserver   lif                  address      curr-node
    --------- -------------------- ------------ -------------------------
    svm01     nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01
    svm01     nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03
    2 entries were displayed.

  According to the example output for the statistics top client show command, client 172.17.236.53 is driving high traffic to FsxId01234567890abcdef-01. The output of the network interface show command indicates that the endpoint on this file server has the address 172.31.15.89. To mount to a different endpoint, select any other address (in this example, the only other address is 172.31.8.112, corresponding to FsxId01234567890abcdef-03).
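Choosing a different endpoint amounts to filtering the interface list by node. The following sketch assumes the four-column network interface show output from the example above; the helper names are hypothetical, and the sketch only selects candidate addresses without mounting anything.

```python
import ipaddress

SAMPLE_OUTPUT = """\
vserver   lif                  address      curr-node
--------- -------------------- ------------ -------------------------
svm01     nfs_smb_management_1 172.31.15.89 FsxId01234567890abcdef-01
svm01     nfs_smb_management_3 172.31.8.112 FsxId01234567890abcdef-03
2 entries were displayed.
"""


def _is_ip(value):
    """Return True if value parses as an IP address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def parse_endpoints(text):
    """Return a list of (address, node) tuples from captured CLI output."""
    endpoints = []
    for line in text.splitlines():
        parts = line.split()
        # Only data rows have a valid IP address in the third column.
        if len(parts) == 4 and _is_ip(parts[2]):
            endpoints.append((parts[2], parts[3]))
    return endpoints


def alternative_addresses(endpoints, busy_node):
    """Return endpoint addresses that are not hosted on the busy node."""
    return [address for address, node in endpoints if node != busy_node]


endpoints = parse_endpoints(SAMPLE_OUTPUT)
print(alternative_addresses(endpoints, "FsxId01234567890abcdef-01"))
```

With the example data, the only remaining candidate is the address on FsxId01234567890abcdef-03.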
Rebalancing volumes
If you're experiencing I/O imbalance across your volumes or aggregates, you can rebalance volumes in order to redistribute your I/O traffic across your aggregates.
Note
If you're experiencing storage utilization imbalance across your aggregates, there is generally not any performance impact unless the high utilization is coupled with I/O imbalance. While you can move volumes between aggregates to balance storage utilization, we recommend only moving volumes if you are seeing a performance impact, as moving volumes can have adverse impact on performance if you don't also consider the I/O driven to each volume you're considering moving.
- To SSH into the NetApp ONTAP CLI of your file system, follow the steps documented in the Using the NetApp ONTAP CLI section of the Amazon FSx for NetApp ONTAP User Guide.

    ssh fsxadmin@file-system-management-endpoint-ip-address

- Use the statistics volume show ONTAP CLI command to view the highest-traffic volumes for a given aggregate, with the following changes:

  Replace aggregate_name with the aggregate’s name (for example, aggr1).

  You can optionally add the -interval parameter, providing the interval over which to measure (in seconds) before each report is output. Increasing the interval (for example, to the maximum 300 seconds) provides a longer-term sample of the amount of traffic driven to each volume. The default is 5 (seconds).

    ::> statistics volume show -aggregate aggregate_name -sort-key total_ops [-interval [5,300]]

  Depending on the interval you chose, it can take up to 5 minutes to show data. The command shows all volumes in the aggregate, along with the amount of traffic being driven to each volume.

                                 *Total Read Write Other Read      Write Latency
    Volume     Vserver Aggregate Ops    Ops  Ops   Ops   (Bps)     (Bps) (us)
    ---------- ------- --------- ------ ---- ----- ----- --------- ----- -------
    vol1__0007 svm1    aggr1     4078   4078 0     0     267255808 0     1092
    vol1__0005 svm1    aggr1     4078   4078 0     0     267255808 0     1086
    vol1__0003 svm1    aggr1     4077   4077 0     0     267223040 0     1086
    vol1__0001 svm1    aggr1     4077   4077 0     0     267239424 0     1087
    vol1__0008 svm1    aggr2     2314   2314 0     0     151650304 0     1112
    vol1__0006 svm1    aggr2     2144   2144 0     0     140509184 0     1104
    vol1__0002 svm1    aggr2     2183   2183 0     0     143065088 0     1106
    vol1__0004 svm1    aggr2     2183   2183 0     0     143065088 0     1103

  The volume statistics are shown on a per-constituent basis (for example, vol1__0015 is the 15th constituent for FlexGroup vol1). In the example output, the constituents on aggr1 are more highly utilized than the constituents on aggr2. To balance traffic between aggregates, you can move the constituent volumes between aggregates so that traffic is more evenly distributed.

- If you have added new HA pairs, then you should move existing volumes to new aggregates. For more information, see Moving volumes between aggregates.
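To decide which constituent to move, it can help to total the operations per aggregate and estimate the effect of a single move. The following is a rough sketch using the total ops figures from the example output above. The selection heuristic (pick the one volume on the busiest aggregate whose move leaves the smallest gap) is an illustrative assumption, not the guide's method, and it ignores storage capacity and throughput, which you should also weigh before moving a volume.

```python
from collections import defaultdict

# (volume, aggregate, total_ops) taken from the example output above.
ROWS = [
    ("vol1__0007", "aggr1", 4078),
    ("vol1__0005", "aggr1", 4078),
    ("vol1__0003", "aggr1", 4077),
    ("vol1__0001", "aggr1", 4077),
    ("vol1__0008", "aggr2", 2314),
    ("vol1__0006", "aggr2", 2144),
    ("vol1__0002", "aggr2", 2183),
    ("vol1__0004", "aggr2", 2183),
]


def ops_per_aggregate(rows):
    """Sum total ops for each aggregate."""
    totals = defaultdict(int)
    for _, aggregate, total_ops in rows:
        totals[aggregate] += total_ops
    return dict(totals)


def best_single_move(rows):
    """Pick the volume on the busiest aggregate whose move to the quietest
    aggregate leaves the smallest ops gap between the two.

    Moving a volume with x ops changes the gap from g to |g - 2x|.
    """
    totals = ops_per_aggregate(rows)
    busiest = max(totals, key=totals.get)
    quietest = min(totals, key=totals.get)
    gap = totals[busiest] - totals[quietest]
    candidates = [
        (abs(gap - 2 * ops), volume)
        for volume, aggregate, ops in rows
        if aggregate == busiest
    ]
    new_gap, volume = min(candidates)
    return volume, busiest, quietest, new_gap


print(ops_per_aggregate(ROWS))
print(best_single_move(ROWS))
```

With the example data, aggr1 serves roughly twice the operations of aggr2, and moving a single constituent from aggr1 to aggr2 would bring the two aggregates close to parity, assuming the I/O pattern follows the volume.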