

# Troubleshooting Amazon DocumentDB
<a name="troubleshooting"></a>

The following sections provide information about how to troubleshoot problems that you might encounter when using Amazon DocumentDB (with MongoDB compatibility).

**Topics**
+ [Connectivity issues](troubleshooting.connecting.md)
+ [Indexes](troubleshooting.index-creation.md)
+ [Performance and resource utilization](user_diagnostics.md)
+ [Garbage collection](garbage-collection.md)

# Troubleshooting connectivity issues
<a name="troubleshooting.connecting"></a>

Having trouble connecting? Here are some common scenarios and how to resolve them.

**Topics**
+ [Can't connect to an Amazon DocumentDB endpoint](#troubleshooting-connecting)
+ [Testing a connection to an Amazon DocumentDB instance](#troubleshooting.testing-connection)
+ [Connecting to an invalid endpoint](#troubleshooting.invalid-endpoint)
+ [Driver configuration impacting number of connections](#troubleshooting.driver.config)

## Can't connect to an Amazon DocumentDB endpoint
<a name="troubleshooting-connecting"></a>

When you try to connect to Amazon DocumentDB, the following is one of the most common error messages that you might receive.

```
connecting to: mongodb://docdb-2018-11-08-21-47-27.cluster-ccuszbx3pn5e.us-east-
1.docdb.amazonaws.com:27017/
2018-11-14T14:33:46.451-0800 W NETWORK [thread1] Failed to connect to
172.31.91.193:27017 after 5000ms milliseconds, giving up.
2018-11-14T14:33:46.452-0800 E QUERY [thread1] Error: couldn't connect to server
docdb-2018-11-08-21-47-27.cluster-ccuszbx3pn5e.us-east-1.docdb.amazonaws.com:27017,
connection attempt failed :
connect@src/mongo/shell/mongo.js:237:13
@(connect):1:6
exception: connect failed
```

This error message typically means that your client (the `mongo` shell in this example) cannot access the Amazon DocumentDB endpoint. This can happen for several reasons:

**Topics**
+ [Connecting from public endpoints](#troubleshooting.cannot-connect.public-endpoints)
+ [Cross region connections](#troubleshooting.cannot-connect.different-regions)
+ [Connecting from different Amazon VPCs](#troubleshooting.cannot-connect.different-vpcs)
+ [Security group blocks inbound connections](#troubleshooting.cannot-connect.inbound-not-allowed)
+ [Java Mongo driver read preference issue](#troubleshooting-cannot-connect-java-mongo-issue)

### Connecting from public endpoints
<a name="troubleshooting.cannot-connect.public-endpoints"></a>

**You are trying to connect to an Amazon DocumentDB cluster directly from your laptop or local development machine.**

Trying to connect to an Amazon DocumentDB cluster directly from a public endpoint, such as your laptop or local development machine, will fail. Amazon DocumentDB is virtual private cloud (VPC)-only and does not currently support public endpoints. Thus, you can't connect directly to your Amazon DocumentDB cluster from your laptop or local development environment outside of your VPC.

To connect to an Amazon DocumentDB cluster from outside an Amazon VPC, you can use an SSH tunnel. For more information, see [Connecting to an Amazon DocumentDB cluster from outside an Amazon VPC](connect-from-outside-a-vpc.md). Additionally, if your development environment is in a different Amazon VPC, you can also use VPC Peering and connect to your Amazon DocumentDB cluster from another Amazon VPC in the same region or a different region.
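As a sketch, an SSH tunnel through an Amazon EC2 instance in the cluster's VPC might look like the following (the key file, cluster endpoint, and EC2 host name are placeholders):

```
ssh -i "ec2-key.pem" -N -L 27017:cluster-endpoint:27017 ec2-user@ec2-public-hostname
```

With the tunnel open, clients on the local machine connect to `localhost:27017`, and traffic is forwarded to the cluster through the EC2 instance.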

### Cross region connections
<a name="troubleshooting.cannot-connect.different-regions"></a>

**You are trying to connect to an Amazon DocumentDB cluster in another region.**

If you try to connect to an Amazon DocumentDB cluster from an Amazon EC2 instance in a Region other than the cluster's Region—for example, trying to connect to a cluster in US East (N. Virginia) Region (us-east-1) from US West (Oregon) Region (us-west-2)—the connection will fail.

To verify the Region of your Amazon DocumentDB cluster, run the following command. The Region is in the endpoint.

```
aws docdb describe-db-clusters \
   --db-cluster-identifier sample-cluster \
   --query 'DBClusters[*].Endpoint'
```

Output from this operation looks something like the following. 

```
[
    "sample-cluster.node.us-east-1.docdb.amazonaws.com"
]
```

To verify the Region of your EC2 instance, run the following command.

```
aws ec2 describe-instances \
   --query 'Reservations[*].Instances[*].Placement.AvailabilityZone'
```

Output from this operation looks something like the following. 

```
[
    [
        "us-east-1a"
    ]
]
```

### Connecting from different Amazon VPCs
<a name="troubleshooting.cannot-connect.different-vpcs"></a>

**You are trying to connect to an Amazon DocumentDB cluster from an Amazon VPC that is different from the one your cluster is deployed in.**

If both your Amazon DocumentDB cluster and Amazon EC2 instance are in the same AWS Region, but not in the same Amazon VPC, you cannot connect directly to your Amazon DocumentDB cluster unless VPC Peering is enabled between the two Amazon VPCs.

To verify the Amazon VPC of your Amazon DocumentDB instance, run the following command.

```
aws docdb describe-db-instances \
   --db-instance-identifier sample-instance \
   --query 'DBInstances[*].DBSubnetGroup.VpcId'
```

To verify the Amazon VPC of your Amazon EC2 instance, run the following command.

```
aws ec2 describe-instances \
   --query 'Reservations[*].Instances[*].VpcId'
```

### Security group blocks inbound connections
<a name="troubleshooting.cannot-connect.inbound-not-allowed"></a>

**You are trying to connect to an Amazon DocumentDB cluster, and the cluster’s security group does not allow inbound connections on the cluster’s port (default port: 27017).** 

Suppose that your Amazon DocumentDB cluster and Amazon EC2 instance are both in the same Region and Amazon VPC and use the same Amazon VPC security group. If you can't connect to your Amazon DocumentDB cluster, the likely cause is that your security group (that is, firewall) for your cluster doesn't allow inbound connections on the port you chose for your Amazon DocumentDB cluster (default port is 27017).

To verify the port for your Amazon DocumentDB cluster, run the following command.

```
aws docdb describe-db-clusters \
   --db-cluster-identifier sample-cluster \
   --query 'DBClusters[*].[DBClusterIdentifier,Port]'
```

To get your Amazon DocumentDB security group for your cluster, run the following command.

```
aws docdb describe-db-clusters \
   --db-cluster-identifier sample-cluster \
   --query 'DBClusters[*].VpcSecurityGroups[*].VpcSecurityGroupId'
```

To check the inbound rules for your security group, see the following topics in the Amazon EC2 documentation:
+ [Authorizing Inbound Traffic for Your Linux Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html)
+ [Authorizing Inbound Traffic for Your Windows Instances](https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/authorizing-access-to-an-instance.html)
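If the inbound rule is missing, a self-referencing rule like the following allows instances in the same security group to reach the cluster on its port (a sketch; the group ID is a placeholder):

```
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 27017 \
    --source-group sg-0123456789abcdef0
```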

### Java Mongo driver read preference issue
<a name="troubleshooting-cannot-connect-java-mongo-issue"></a>

**Client read preferences are not honored and some clients cannot write to Amazon DocumentDB after failover unless they reboot.**

This issue, first discovered in Java Mongo Driver 3.7.x, occurs when a client establishes a connection to Amazon DocumentDB using `MongoClientSettings` and, specifically, when chaining the `applyToClusterSettings` method. The MongoClient Cluster Settings can be defined using a few different methods, such as `hosts()`, `requiredReplicaSetName()`, and `mode()`.

When the client specifies only one host in the `hosts()` method, the mode is set to `ClusterConnectionMode.SINGLE` instead of `ClusterConnectionMode.MULTIPLE`. This causes the client to disregard the read preference and connect only to the server configured in `hosts()`. So even if the client settings are initialized as follows, all reads still go to the primary instead of a secondary.

```
final ServerAddress serverAddress0 = new ServerAddress("cluster-endpoint", 27317);
final MongoCredential credential = MongoCredential.createCredential("xxx",
        "admin", "xxxx".toCharArray());
final MongoClientSettings settings = MongoClientSettings.builder()
        .credential(credential)
        .readPreference(ReadPreference.secondaryPreferred())
        .retryWrites(false)
        .applyToSslSettings(builder -> builder
                .enabled(false))
        .applyToClusterSettings(builder -> builder
                .hosts(Arrays.asList(serverAddress0))
                .requiredReplicaSetName("rs0"))
        .build();
MongoClient mongoClient = MongoClients.create(settings);
```

**Failover case**

Using the preceding client connection settings, if there is a failover and a delayed DNS record update for the cluster writer endpoint, the client still tries to issue writes to the old writer (now a reader after failover). This results in a server-side error (not primary) that is not handled appropriately by the Java driver (this is still under investigation). As a result, the client can be left in a bad state until, for example, the application server is rebooted.

There are two workarounds for this:
+ Clients that connect to Amazon DocumentDB via a connection string will not have this issue, since `ClusterConnectionMode` will be set to `MULTIPLE` when setting read preference.

  ```
  MongoClientURI mongoClientURI = new MongoClientURI("mongodb://usr:pass@cluster-endpoint:27317/test?ssl=false&replicaSet=rs0&readPreference=secondaryPreferred");
  MongoClient mongoClient = MongoClients.create(mongoClientURI.getURI());
  ```

  Or using `MongoClientSettings` builder with the `applyConnectionString` method.

  ```
  final MongoClientSettings settings = MongoClientSettings.builder()
          .credential(credential)
          .applyConnectionString(new ConnectionString("mongodb://usr:pass@cluster-endpoint:27317/test?ssl=false&replicaSet=rs0&readPreference=secondaryPreferred"))
          .retryWrites(false)
          .applyToSslSettings(builder -> builder
                  .enabled(false))
          .build();
  MongoClient mongoClient = MongoClients.create(settings);
  ```
+ Explicitly set `ClusterConnectionMode` to `MULTIPLE`. This is only needed when using `applyToClusterSettings` and `hosts().size() == 1`.

  ```
  final ServerAddress serverAddress0 = new ServerAddress("cluster-endpoint", 27317);
  final MongoCredential credential = MongoCredential.createCredential("xxx", "admin", "xxxx".toCharArray());
  final MongoClientSettings settings = MongoClientSettings.builder()
      .credential(credential)
      .readPreference(ReadPreference.secondaryPreferred())
      .retryWrites(false)
      .applyToSslSettings(builder -> builder
          .enabled(false))
      .applyToClusterSettings(builder -> builder
          .hosts(Arrays.asList(serverAddress0))
          .requiredReplicaSetName("rs0")
          .mode(ClusterConnectionMode.MULTIPLE))
      .build();
  MongoClient mongoClient = MongoClients.create(settings);
  ```

## Testing a connection to an Amazon DocumentDB instance
<a name="troubleshooting.testing-connection"></a>

You can test your connection to a cluster using common Linux or Windows tools.

From a Linux or Unix terminal, test the connection by entering the following (replace `cluster-endpoint` with your cluster's endpoint, and replace `port` with the port of your instance):

```
nc -zv cluster-endpoint port 
```

The following is an example operation and its return value:

```
nc -zv docdbTest.d4c7nm7stsfc0.us-west-2.docdb.amazonaws.com 27017
   
Connection to docdbTest.d4c7nm7stsfc0.us-west-2.docdb.amazonaws.com 27017 port [tcp/*] succeeded!
```

## Connecting to an invalid endpoint
<a name="troubleshooting.invalid-endpoint"></a>

If you try to connect to an Amazon DocumentDB cluster using a cluster endpoint that is not valid, an error similar to the following appears.

```
mongo --ssl \
   --host sample-cluster.node.us-east-1.docdb.amazonaws.com:27017 \
   --sslCAFile global-bundle.pem \
   --username <user-name> \
   --password <password>
```

The output looks like this:

```
MongoDB shell version v3.6
connecting to: mongodb://sample-cluster.node.us-east-1.docdb.amazonaws.com:27017/
2018-11-14T17:21:18.516-0800 I NETWORK [thread1] getaddrinfo("sample-cluster.node.us-east-1.docdb.amazonaws.com") failed: 
nodename nor servname provided, or not known 2018-11-14T17:21:18.537-0800 E QUERY [thread1] Error: couldn't initialize
connection to host sample-cluster.node.us-east-1.docdb.amazonaws.com, address is invalid :
connect@src/mongo/shell/mongo.js:237:13@(connect):1:6
exception: connect failed
```

To get the valid endpoint for a cluster, run the following command: 

```
aws docdb describe-db-clusters \
   --db-cluster-identifier sample-cluster \
   --query 'DBClusters[*].[Endpoint,Port]'
```

To get the valid endpoint for an instance, run the following command:

```
aws docdb describe-db-instances \
   --db-instance-identifier sample-instance \
   --query 'DBInstances[*].[Endpoint.Address,Endpoint.Port]'
```

For more information, see [Understanding Amazon DocumentDB endpoints](endpoints.md). 

## Driver configuration impacting number of connections
<a name="troubleshooting.driver.config"></a>

When using the client driver to connect to an Amazon DocumentDB cluster, it's important to consider the `maxPoolSize` configuration parameter. The `maxPoolSize` setting determines the maximum number of connections that the client driver will maintain in its connection pool.
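For example, in a connection string, the pool size might be capped as follows (a sketch; the endpoint, credentials, and the pool size of 50 are placeholders — size the pool to your application's actual concurrency):

```
mongodb://user:password@cluster-endpoint:27017/?replicaSet=rs0&readPreference=secondaryPreferred&maxPoolSize=50
```

Note that each application instance maintains its own pool, so the total number of connections to the cluster can approach `maxPoolSize` multiplied by the number of application instances.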

# Troubleshooting indexes
<a name="troubleshooting.index-creation"></a>

The following topics address what to do if your index or background index build fails.

**Topics**
+ [Index build fails](#troubleshooting.index-build-fails)
+ [Background index build latency issues and fails](#troubleshooting.background-index-build-fails)
+ [Database index bloat](#troubleshooting-database-bloat)

## Index build fails
<a name="troubleshooting.index-build-fails"></a>

Amazon DocumentDB utilizes local storage on an instance as part of the index creation process. You can monitor this disk usage using the **FreeLocalStorage** CloudWatch metric (`CloudWatch -> Metrics -> DocDB -> Instance Metrics`). When an index build consumes all of the local disk and fails, you will receive an error. When migrating data to Amazon DocumentDB, we encourage you to create indexes first and then insert the data. For more information on migration strategies and creating indexes, see [Migrating to Amazon DocumentDB](docdb-migration.md) in the Amazon DocumentDB documentation and the blog: [ Migrate from MongoDB to Amazon DocumentDB using the offline method](https://aws.amazon.com/blogs/database/migrate-from-mongodb-to-amazon-documentdb-using-the-offline-method/). 

When creating indexes on an existing cluster, if the index build is taking longer than expected or failing, we recommend that you scale up the instance to create the index and then, after the index is created, scale back down. Amazon DocumentDB enables you to scale instance sizes in minutes using the AWS Management Console or the AWS CLI. For more information, see [Managing instance classes](db-instance-classes.md). With per-second instance pricing, you pay only for the resources you use, to the second.
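For example, an instance could be temporarily scaled up with the AWS CLI before a large index build (the instance identifier and target instance class are placeholders):

```
aws docdb modify-db-instance \
    --db-instance-identifier sample-instance \
    --db-instance-class db.r5.4xlarge \
    --apply-immediately
```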

## Background index build latency issues and fails
<a name="troubleshooting.background-index-build-fails"></a>

Background index builds in Amazon DocumentDB do not start until all queries that were running on the primary instance when the index build was initiated have finished executing. If there is a long running query, the background index build blocks until that query finishes and thus can take longer than expected to complete. This is true even if the collection is empty.

Foreground index builds do not exhibit the same blocking behavior. Instead, a foreground index build takes an exclusive lock on the collection until the build is completed. Thus, to create indexes on an empty collection and avoid blocking on any long running queries, we suggest using foreground index builds.
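For example, assuming a collection named `sample`, the two build modes are selected through the `background` option of `createIndex` (a sketch; the collection and field names are illustrative):

```
// Foreground build (the default): takes an exclusive lock on the
// collection, but does not wait for long running queries.
db.sample.createIndex({x: 1})

// Background build: waits for queries that were already running
// when the build was initiated.
db.sample.createIndex({x: 1}, {background: true})
```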

**Note**  
Amazon DocumentDB allows only one background index build to occur on a collection at any given time. If DDL (Data Definition Language) operations such as `createIndex()` or `dropIndex()` occur on the same collection during a background index build, the background index build fails.

## Database index bloat
<a name="troubleshooting-database-bloat"></a>

Amazon DocumentDB uses Multi-Version Concurrency Control (MVCC) to manage concurrent transactions. When documents are deleted or updated, their previous versions remain in collections and indexes as "dead" versions. The garbage collection process automatically reclaims space from these dead versions for future operations.

Index bloat occurs when a collection's indexes become larger due to the accumulation of dead or obsolete index entries or fragmentation within the pages. The percentage reported represents the amount of index space that can be used by future index entries. This bloat consumes space in both the buffer cache and storage. If you want to remove the bloat, you will need to rebuild indexes.

**Example**  
Run the following command to determine unused storage for your index:  

```
db.coll.aggregate([{$indexStats:{}}]);
```
The result looks similar to this:  

```
{ 
    "name" : "_id_",
    "key" : { 
        "_id" : 1 
    },
    "host" : "devbox-test.localhost.a2z.com:27317",
    "size" : NumberLong(827392),
    "accesses" : {
        "ops" : NumberLong(40000),
        "docsRead" : NumberLong(46049),
        "since" : ISODate("2025-04-03T21:44:51.251Z") 
    },
    "cacheStats" : {
        "blksRead" : NumberLong(264),
        "blksHit" : NumberLong(140190),
        "hitRatio" : 99.8121
    }, 
    "unusedStorageSize" : {
        "unusedSizeBytes" : 409600,
        "unusedSizePercent" : 49.51
    }
}
```
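As a rough check, the reported `unusedSizePercent` corresponds to the unused bytes as a fraction of the index `size` (a plain JavaScript sketch using the sample values above; the exact reported figure can differ slightly due to internal accounting):

```
// Sample values from the $indexStats output above
const sizeBytes = 827392;        // "size"
const unusedSizeBytes = 409600;  // "unusedSizeBytes"

// Percentage of index space that future index entries can reuse
const unusedSizePercent = 100 * unusedSizeBytes / sizeBytes;
console.log(unusedSizePercent.toFixed(1)); // ≈ 49.5
```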

You can rebuild indexes without downtime using the `reIndex` command, which requires a scan of the entire collection. See [Index maintenance using `reIndex`](managing-indexes.md#reIndex).

# Troubleshooting performance and resource utilization
<a name="user_diagnostics"></a>

This section provides questions and solutions for common diagnostics issues in Amazon DocumentDB deployments. The examples provided use the *mongo shell* and are scoped to an individual instance. To find an instance endpoint, see [Understanding Amazon DocumentDB endpoints](endpoints.md). 

**Topics**
+ [How do I determine the number of insert, update, and delete operations performed on my collection through the Mongo API?](#user-diag-performed-operations)
+ [How do I analyze cache performance?](#user-diag-cache-perf)
+ [How do I find and terminate long running or blocked queries?](#user_diagnostics-query_terminating)
+ [How can I see a query plan and optimize a query?](#user_diagnostics-query_plan)
+ [How can I see a query plan in elastic clusters?](#user-diagnostics-ec-query-plan)
+ [How do I list all running operations on an instance?](#user_diagnostics-list_queries)
+ [How do I know when a query is making progress?](#user_diagnostics-query_progressing)
+ [How do I determine why a system suddenly runs slowly?](#user_diagnostics-speed_change)
+ [How do I determine the cause of high CPU utilization on one or more cluster instances?](#user_diagnostics-cpu_utilization)
+ [How do I determine the open cursors on an instance?](#user_diagnostics-open_cursors)
+ [How do I determine the current Amazon DocumentDB engine version?](#user_diagnostics-engine_version)
+ [How do I analyze index usage and identify unused indexes?](#user-diag-index-usage)
+ [How do I identify missing indexes?](#user_diagnostics-identify_missing_indexes)
+ [How do I determine database collection bloat?](#performance-collection-bloat)
+ [Summary of useful queries](#user_diagnostics-useful_queries)

## How do I determine the number of insert, update, and delete operations performed on my collection through the Mongo API?
<a name="user-diag-performed-operations"></a>

To view the number of insert, update, and delete operations performed on a certain collection, run the following command on that collection:

```
db.collection.stats()
```

The output from this command describes the following under its `opCounter` field:
+ **numDocsIns** - The number of documents inserted into this collection. This includes documents inserted using the `insert` and `insertMany` commands, as well as documents inserted by an upsert.
+ **numDocsUpd** - The number of documents updated in this collection. This includes documents updated using the `update` and `findAndModify` commands.
+ **numDocsDel** - The number of documents deleted from this collection. This includes documents deleted using the `deleteOne`, `deleteMany`, `remove`, and `findAndModify` commands.
+ **lastReset** - The time these counters were last reset. The statistics provided by this command are reset when the cluster is started or stopped, or when the instance is scaled up or down.

An example output from running `db.collection.stats()` is shown below.

```
{
    "ns" : "db.test",
    "count" : ...,
    "size" : ...,
    "avgObjSize" : ...,
    "storageSize" : ...,
    "capped" : false,
    "nindexes" : ...,
    "totalIndexSize" : ...,
    "indexSizes" : {
        "_id_" : ...,
        "x_1" : ...
    },
    "collScans" : ...,
    "idxScans" : ...,
    "opCounter" : {
        "numDocsIns" : ...,
        "numDocsUpd" : ...,
        "numDocsDel" : ...
    },
    "cacheStats" : {
        "collBlksHit" : ...,
        "collBlksRead" : ..,
        "collHitRatio" : ...,
        "idxBlksHit" : ...,
        "idxBlksRead" : ...,
        "idxHitRatio" : ...
    },
    "lastReset" : "2022-09-02 19:41:40.471473+00",
    "ok" : 1,
    "operationTime" : Timestamp(1662159707, 1)
}
```

Use this stats command to view collection-specific counters for insert, update, and delete operations through the Mongo API. Another way to view collection-specific operation counters is by enabling DML auditing. The number of insert, update, and delete operations over all collections during one-minute intervals can be viewed in [Monitoring Amazon DocumentDB with CloudWatch](cloud_watch.md).

## How do I analyze cache performance?
<a name="user-diag-cache-perf"></a>

Analyzing cache performance can give insight into the efficiency of data retrieval and system performance, and is based on how much data is read from the disk versus the cache. We provide cache statistics about the number of cache hits (data read from the cache) and cache misses (data not found in the cache and read from the disk) to give insight into cache performance. The cache statistics for a specific collection can be found by running the following command on that collection:

```
db.collection.stats()
```

The values in the `cacheStats` field in the output of this command provide cache statistics for the collection as well as the total cache statistics for the indexes created on the collection. These statistics are listed below:
+ **`collBlksHit`** - The number of blocks read from the cache during operations on this collection.
+ **`collBlksRead`** - The number of blocks read from the disk (cache misses) during operations on this collection.
+ **`collHitRatio`** - The cache hit ratio for this collection (`100 * [collBlksHit / (collBlksHit + collBlksRead)]`).
+ **`idxBlksHit`** - The number of blocks read from the cache for any index created on this collection.
+ **`idxBlksRead`** - The number of blocks read from the disk (cache misses) for any index created on this collection.
+ **`idxHitRatio`** - The cache hit ratio for the indexes created on this collection (`100 * [idxBlksHit / (idxBlksHit + idxBlksRead)]`).
+ **`lastReset`** - The time these statistics were last reset. The statistics provided by `db.collection.stats()` are reset when the cluster is started or stopped, or when the instance is scaled up or down. 

A breakdown of the `idxBlksHit` and `idxBlksRead` fields for each index can also be found using the `indexStats` command. Index specific cache statistics can be found by running the following command: 

```
db.collection.aggregate([{$indexStats:{}}]).pretty()
```

For each index, the following cache statistics can be found under the `cacheStats` field:
+ **`blksHit`** - The number of blocks read from cache for this index.
+ **`blksRead`** - The number of blocks read from the disk for this index.
+ **`hitRatio`** - The cache hit ratio for this index, rounded to four decimal places and calculated as `100 * [blksHit / (blksHit + blksRead)]`.
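As an illustration, the cache hit ratio formula can be checked with plain JavaScript (the values below are illustrative, taken from the sample `cacheStats` output earlier in this guide):

```
// Cache hit ratio: 100 * blksHit / (blksHit + blksRead)
const blksHit = 140190;  // blocks served from the cache
const blksRead = 264;    // blocks read from disk (cache misses)

const hitRatio = 100 * blksHit / (blksHit + blksRead);
console.log(hitRatio.toFixed(2)); // ≈ 99.81
```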

## How do I find and terminate long running or blocked queries?
<a name="user_diagnostics-query_terminating"></a>

User queries can run slowly because of a suboptimal query plan or can be blocked due to resource contention.

To find long running queries that are slow because of a suboptimal query plan, or queries that are blocked by resource contention, use the `currentOp` command. You can filter its output to narrow down the list of relevant queries to terminate. You need the `opid` associated with a long running query to be able to terminate it.

The following query uses the `currentOp` command to list all queries that are either blocked or running for more than 10 seconds.

```
db.adminCommand({
    aggregate: 1,
    pipeline: [
        {$currentOp: {}},
        {$match: 
            {$or: [
                {secs_running: {$gt: 10}},
                {WaitState: {$exists: true}}]}},
        {$project: {_id:0, opid: 1, secs_running: 1}}],
    cursor: {}
});
```

Next, you can narrow down the query to find the `opid` of a query running for more than 10 seconds and terminate it.

**To find and terminate a query running for more than 10 seconds**

1. Find the `opid` of the query.

   ```
   db.adminCommand({
       aggregate: 1,
       pipeline: [
           {$currentOp: {}},
           {$match: 
               {$or: 
                   [{secs_running: {$gt: 10}},
                    {WaitState: {$exists: true}}]}}],
       cursor: {}
   });
   ```

   Output from this operation looks something like the following (JSON format).

   ```
   {
       "waitedMS" : NumberLong(0),
       "cursor" : {
           "firstBatch" : [
               {
                   "opid" : 24646,
                   "secs_running" : 12 
               }
           ],
           "id" : NumberLong(0),
           "ns" : "admin.$cmd"
       },
       "ok" : 1
   }
   ```

1. Terminate the query using the `killOp` operation.

   ```
   db.adminCommand({killOp: 1, op: 24646});
   ```

## How can I see a query plan and optimize a query?
<a name="user_diagnostics-query_plan"></a>

If a query runs slow, it could be because the query execution requires a full scan of the collection to choose the relevant documents. Sometimes creating appropriate indexes enables the query to run faster. To detect this scenario and decide the fields on which to create the indexes, use the `explain` command. 

**Note**  
Amazon DocumentDB emulates the MongoDB 3.6 API on a purpose-built database engine that utilizes a distributed, fault-tolerant, self-healing storage system. As a result, query plans and the output of `explain()` may differ between Amazon DocumentDB and MongoDB. Customers who want control over their query plan can use the `$hint` operator to enforce selection of a preferred index. 

Run the query that you want to improve under the `explain` command as follows.

```
db.runCommand({explain: {<query document>}})
```

The following is an example operation.

```
db.runCommand({explain:{
    aggregate: "sample-document",
    pipeline: [{$match: {x: {$eq: 1}}}],
    cursor: {batchSize: 1}}
});
```

Output from this operation looks something like the following (JSON format).

```
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "db.test",
        "winningPlan" : {
            "stage" : "COLLSCAN"
        }
    },
    "serverInfo" : {
        "host" : "...",
        "port" : ...,
        "version" : "..."
    },
    "ok" : 1
}
```

The preceding output indicates that the `$match` stage requires scanning the whole collection and checking if the field `"x"` in each document is equal to 1. If there are many documents in the collection, the collection scan (and therefore the overall query performance) is very slow. Thus the presence of the `"COLLSCAN"` in the output of the `explain` command indicates that the query performance can be improved by creating appropriate indexes.

In this example, the query checks whether the field `"x"` equals 1 in all documents. So creating an index on field `"x"` enables the query to avoid the complete collection scan and use the index to return the relevant documents sooner.
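In this example, an index on `"x"` could be created with the standard `createIndex` command (the collection name matches the sample namespace `db.test`):

```
db.test.createIndex({x: 1})
```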

After creating an index on field `"x"`, the `explain` output is as follows.

```
{
    "queryPlanner" : {
         "plannerVersion" : 1,
         "namespace" : "db.test",
         "winningPlan" : {
             "stage" : "IXSCAN",
             "indexName" : "x_1",
             "direction" : "forward"
         }
    },
    "serverInfo" : {
        "host" : "...",
        "port" : ...,
        "version" : "..."
    },
    "ok" : 1
}
```

Creating an index on field `"x"` enables the `$match` stage to use an index scan to reduce the number of documents on which the predicate `"x = 1"` must be evaluated.

For small collections, the Amazon DocumentDB query processor can choose not to use an index if the performance gains are negligible.

## How can I see a query plan in elastic clusters?
<a name="user-diagnostics-ec-query-plan"></a>

To examine a query plan in elastic clusters, use the `explain` command. The following is an example `explain` operation on a find query targeting a sharded collection:

```
db.runCommand(
   {
     explain: { find: "cities", filter: {"name": "Seoul"}}
   }
)
```

**Note**  
Amazon DocumentDB emulates MongoDB on a purpose-built database engine. As a result, query plans and the output of `explain()` may differ between Amazon DocumentDB and MongoDB. You can control query plan with the use of the `$hint` operator to enforce selection of a preferred index.

Output from this operation may look something like the following (JSON format):

```
{
  "queryPlanner" : {
    "elasticPlannerVersion" : 1,
    "winningPlan" : {
      "stage" : "SINGLE_SHARD",
      "shards" : [
        {
          "plannerVersion" : 1,
          "namespace" : "population.cities",
          "winningPlan" : {
            "stage" : "SHARD_MERGE",
            "shards" : [
              {
                "shardName" : "f2cf5cfd-fe9c-40ca-b4e5-298ca0d11111",
                "plannerVersion" : 1,
                "namespace" : "population.cities",
                "winningPlan" : {
                  "stage" : "PARTITION_MERGE",
                  "inputStages" : [
                    {
                      "stage" : "COLLSCAN",
                      "partitionCount" : 21
                    }
                  ]
                }
              },
              {
                "shardName" : "8f3f80e2-f96c-446e-8e9d-aab8c7f22222",
                "plannerVersion" : 1,
                "namespace" : "population.cities",
                "winningPlan" : {
                  "stage" : "PARTITION_MERGE",
                  "inputStages" : [
                    {
                      "stage" : "COLLSCAN",
                      "partitionCount" : 21
                    }
                  ]
                }
              },
              {
                "shardName" : "32c5a06f-1b2b-4af1-8849-d7c4a033333",
                "plannerVersion" : 1,
                "namespace" : "population.cities",
                "winningPlan" : {
                  "stage" : "PARTITION_MERGE",
                  "inputStages" : [
                    {
                      "stage" : "COLLSCAN",
                      "partitionCount" : 22
                    }
                  ]
                }
              }
            ]
          },
          "shardName" : "32c5a06f-1b2b-4af1-8849-d7c4a0f3fb58"
        }
      ]
    }
  },
  "serverInfo" : {
    "host" : "example-4788267630.us-east-1.docdb-elastic.amazonaws.com:27017",
    "version" : "5.0.0"
  },
  "ok" : 1,
  "operationTime" : Timestamp(1695097923, 1)
}
```

The preceding output shows the query plan for the `find` query on a three-shard cluster. Each shard has multiple data partitions, which can have different input stages. In this example, a `COLLSCAN` (a collection scan) is run on all partitions before the results are merged at the `PARTITION_MERGE` stage within each shard. The results across the shards are then merged together at the `SHARD_MERGE` stage before being sent back to the client.
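As an illustration, you can extract the stage hierarchy from such an explain document on the client side. The following sketch walks the nested `winningPlan`, `shards`, and `inputStages` fields shown in the output above; the `collectStages` helper and the pared-down sample plan are hypothetical, not part of Amazon DocumentDB.

```javascript
// Hypothetical helper: recursively collect the stage names from an
// elastic-cluster explain() document. Field names (winningPlan, shards,
// inputStages) follow the sample output shown above.
function collectStages(plan, stages = []) {
  if (plan == null || typeof plan !== "object") return stages;
  if (plan.stage) stages.push(plan.stage);
  for (const key of ["queryPlanner", "winningPlan"]) {
    if (plan[key]) collectStages(plan[key], stages);
  }
  for (const key of ["shards", "inputStages"]) {
    if (Array.isArray(plan[key])) plan[key].forEach(p => collectStages(p, stages));
  }
  return stages;
}

// A pared-down version of the sample plan above.
const sample = {
  queryPlanner: {
    winningPlan: {
      stage: "SINGLE_SHARD",
      shards: [{
        winningPlan: {
          stage: "SHARD_MERGE",
          shards: [{
            winningPlan: {
              stage: "PARTITION_MERGE",
              inputStages: [{ stage: "COLLSCAN", partitionCount: 21 }]
            }
          }]
        }
      }]
    }
  }
};

console.log(collectStages(sample));
// [ 'SINGLE_SHARD', 'SHARD_MERGE', 'PARTITION_MERGE', 'COLLSCAN' ]
```

Seeing a `COLLSCAN` at the bottom of the hierarchy for a selective filter is usually the cue to check for a missing index.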

## How do I list all running operations on an instance?
<a name="user_diagnostics-list_queries"></a>

As a user or primary user, you often want to list all the current operations running on an instance for diagnostics and troubleshooting purposes. (For information about managing users, see [Managing Amazon DocumentDB users](security.managing-users.md).)

With the `mongo` shell, you can use the following query to list all the running operations on an Amazon DocumentDB instance.

```
db.adminCommand({currentOp: 1, $all: 1});
```

The query returns the complete list of all user queries and internal system tasks currently operating on the instance.

Output from this operation looks something like the following (JSON format).

```
{
    "inprog" : [
        {
            "desc" : "INTERNAL"
        },
        {
            "desc" : "TTLMonitor",
            "active" : false
        },
        {
            "client" : ...,
            "desc" : "Conn",
            "active" : true,
            "killPending" : false,
            "opid" : 195,
            "ns" : "admin.$cmd",
            "command" : {
                "currentOp" : 1,
                "$all" : 1
            },
            "op" : "command",
            "$db" : "admin",
            "secs_running" : 0,
            "microsecs_running" : NumberLong(68),
            "clientMetaData" : {
            "application" : {
                "name" : "MongoDB Shell"
            },
            "driver" : {
                ...
            },
            "os" : {
                ...
            }
          }
       },
       {
          "desc": "GARBAGE_COLLECTION",
          "garbageCollection": {
             "databaseName": "testdb",
             "collectionName": "testCollectionA"
          },
          "secs_running": 3,
          "microsecs_running": NumberLong(3123456)
       },
       {
          "desc": "GARBAGE_COLLECTION",
          "garbageCollection": {
             "databaseName": "testdb",
             "collectionName": "testCollectionB"
          },
          "secs_running": 4,
          "microsecs_running": NumberLong(4123456)
       }
    ],
    "ok" : 1
}
```

The following are valid values for the `"desc"` field:
+ **INTERNAL** — Internal system tasks like the cursor cleanup or stale user cleanup tasks.
+ **TTLMonitor** — The Time to Live (TTL) monitor thread. Its running status is reflected in the `"active"` field.
+ **GARBAGE\_COLLECTION** — The internal garbage collector thread.
+ **CONN** — A user query.
+ **CURSOR** — The operation is an idle cursor waiting on the user to call the "getMore" command to get the next batch of results. In this state, the cursor is consuming memory, but is not consuming any compute.

The preceding output also lists all user queries running in the system. Each user query runs in the context of a database and collection, and the union of these two is called a *namespace*. The namespace of each user query is available in the `"ns"` field.

Sometimes you need to list only the user queries that are running in a particular namespace, which means filtering the previous output on the `"ns"` field. The following example query lists all user queries that are currently running in the database `"db"` and collection `"test"` (that is, the `"db.test"` namespace).

```
db.adminCommand({aggregate: 1,
    pipeline: [{$currentOp: {allUsers: true, idleConnections: true}},
               {$match: {ns: {$eq: "db.test"}}}],
    cursor: {}
});
```

As the primary user of the system, you can see queries of all users and also all internal system tasks. All other users can see only their respective queries.

If the total number of queries and internal system tasks exceeds the default batch cursor size, the `mongo` shell automatically generates an iterator object `'it'` to view the rest of the results. Keep executing the `'it'` command until all results have been exhausted.

## How do I know when a query is making progress?
<a name="user_diagnostics-query_progressing"></a>

User queries can run slowly due to a suboptimal query plan, or they can be blocked due to resource contention. Debugging such queries is a multi-step process that can require executing the same step multiple times.

The first step of debugging is to list all queries that are long running or blocked. The following query lists all user queries that have been running for more than 10 seconds or that are waiting for resources.

```
db.adminCommand({aggregate: 1,
                 pipeline: [{$currentOp: {}},
                            {$match: {$or: [{secs_running: {$gt: 10}},
                                            {WaitState: {$exists: true}}]}},
                            {$project: {_id:0,
                                        opid: 1,
                                        secs_running: 1,
                                        WaitState: 1,
                                        blockedOn: 1,
                                        command: 1}}],
                 cursor: {}
                });
```

Repeat the preceding query periodically to determine whether the list of queries changes and to identify the long running or blocked queries.

If the output document for the query of interest has a `WaitState` field, it indicates that resource contention is why the query is running slow or is blocked. The resource contention could either be due to I/O, internal system tasks, or other user queries.

Output from this operation looks something like the following (JSON format).

```
{
    "waitedMS" : NumberLong(0),
    "cursor" : {
        "firstBatch" : [
            {
                "opid" : 201,
                "command" : {
                    "aggregate" : ...
                },
                "secs_running" : 208,
                "WaitState" : "IO"
            }
        ],
        "id" : NumberLong(0),
        "ns" : "admin.$cmd"
    },
    "ok" : 1
}
```

I/O can be a bottleneck if many queries across different collections run concurrently on the same instance, or if the instance size is too small for the dataset that the query runs on. If the queries are read-only, you can mitigate the former situation by separating the queries for each collection across separate replicas. For concurrent updates across different collections, or when the instance size is too small for the dataset, you can mitigate by scaling up the instance.

If the resource contention is due to other user queries, the `"blockedOn"` field in the output document will have the `"opid"` of the query that is affecting this query. Using the `"opid"`, follow the chain of `"WaitState"` and `"blockedOn"` fields across all the queries to find the query at the head of the chain.

If the task at the head of the chain is an internal task, the only mitigation in this case would be to terminate the query and rerun it later.
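Following the chain by hand works for a handful of queries; as a sketch, it can also be automated against the `currentOp` output format shown in this section. The `headOfChain` helper and its sample data below are hypothetical illustrations, not a documented API.

```javascript
// Hypothetical sketch: given the "inprog" array from currentOp, follow the
// blockedOn chain from one opid to the task at the head of the chain.
// Assumes blocked entries carry "opid", "WaitState", and "blockedOn",
// as in the sample output in this section.
function headOfChain(inprog, opid) {
  const byId = new Map(inprog.map(op => [op.opid, op]));
  let current = byId.get(opid);
  while (current && current.blockedOn !== undefined) {
    // A query blocked on an internal task ends the chain there.
    if (current.blockedOn === "INTERNAL") return { desc: "INTERNAL" };
    current = byId.get(current.blockedOn);
  }
  return current;
}

const inprog = [
  { opid: 75, WaitState: "CollectionLock", blockedOn: 80 },
  { opid: 80, WaitState: "IO" },  // head of the chain
  { opid: 91 }                    // unrelated query
];

console.log(headOfChain(inprog, 75).opid); // 75 is blocked by 80
```

If the head of the chain is a user query, it is the one to investigate or terminate first.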

The following is sample output in which the find query is blocked on a collection lock that is owned by another task.

```
{
    "inprog" : [
        {
            "client" : "...",
            "desc" : "Conn",
            "active" : true,
            "killPending" : false,
            "opid" : 75,
            "ns" : "...",
            "command" : {
                "find" : "...",
                "filter" : {

                }
            },
            "op" : "query",
            "$db" : "test",
            "secs_running" : 9,
            "microsecs_running" : NumberLong(9449440),
            "threadId" : 24773,
            "clientMetaData" : {
                "application" : {
                   "name" : "MongoDB Shell"
                },
                "driver" : {
                    ...
                },
                "os" : {
                    ...
                }
            },
            "WaitState" : "CollectionLock",
            "blockedOn" : "INTERNAL"
        },
        {
            "desc" : "INTERNAL"
        },
        {
            "client" : "...",
            ...
            "command" : {
                "currentOp" : 1
            },
            ...
        }
    ],
    "ok" : 1
}
```

If the `"WaitState"` has values `"Latch"`, `"SystemLock"`, `"BufferLock"`, `"BackgroundActivity"`, or `"Other"`, the source of resource contention is internal system tasks. If the situation continues for a long time, the only mitigation would be to terminate the query and rerun it later.

## How do I determine why a system suddenly runs slowly?
<a name="user_diagnostics-speed_change"></a>

The following are some common reasons for a system slowing down:
+ Excessive resource contention between concurrent queries 
+ The number of active concurrent queries increasing over time
+ Internal system tasks such as `"GARBAGE_COLLECTION"`

To monitor the system usage over time, run the following `"currentOp"` query periodically and output the results to an external store. The query counts the number of queries and operations in each namespace in the system. You can then analyze the system usage results to understand the load on the system and take appropriate action.

```
db.adminCommand({aggregate: 1,
                 pipeline: [{$currentOp: {allUsers: true, idleConnections: true}},
                            {$group: {_id: {desc: "$desc", ns: "$ns", WaitState: "$WaitState"}, count: {$sum: 1}}}],
                 cursor: {}
                });
```

This query returns an aggregate of all queries running in each namespace, all the internal system tasks, and the unique number of wait states (if any) per namespace.

Output from this operation looks something like the following (JSON format).

```
{
    "waitedMS" : NumberLong(0),
    "cursor" : {
        "firstBatch" : [
            {
                "_id" : {
                    "desc" : "Conn",
                    "ns" : "db.test",
                    "WaitState" : "CollectionLock"
                },
               "count" : 2
            },
            {
                "_id" : {
                    "desc" : "Conn",
                    "ns" : "admin.$cmd"
                },
                "count" : 1
            },
            {
                "_id" : {
                    "desc" : "TTLMonitor"
                },
                "count" : 1
            }
        ],
        "id" : NumberLong(0),
        "ns" : "admin.$cmd"
    },
    "ok" : 1
}
```

In the preceding output, two user queries in the `"db.test"` namespace are blocked on a collection lock, one query is running in the `"admin.$cmd"` namespace, and one internal `"TTLMonitor"` task is running.
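When you export these grouped results to an external store, a small post-processing step can surface the namespaces with blocked operations. The following sketch (the `blockedByNamespace` helper and its sample data are hypothetical) tallies the `firstBatch` rows shown above:

```javascript
// Hypothetical sketch: tally the grouped currentOp output to find how many
// operations are blocked on a wait state, per namespace.
function blockedByNamespace(firstBatch) {
  const tally = {};
  for (const row of firstBatch) {
    const { ns, WaitState } = row._id;
    // Only rows that have both a namespace and a wait state are blocked queries.
    if (ns && WaitState) tally[ns] = (tally[ns] || 0) + row.count;
  }
  return tally;
}

const firstBatch = [
  { _id: { desc: "Conn", ns: "db.test", WaitState: "CollectionLock" }, count: 2 },
  { _id: { desc: "Conn", ns: "admin.$cmd" }, count: 1 },
  { _id: { desc: "TTLMonitor" }, count: 1 }
];

console.log(blockedByNamespace(firstBatch)); // { 'db.test': 2 }
```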

If the output indicates many queries with blocking wait states, see [How do I find and terminate long running or blocked queries?](#user_diagnostics-query_terminating)

## How do I determine the cause of high CPU utilization on one or more cluster instances?
<a name="user_diagnostics-cpu_utilization"></a>

The following sections might help you identify the cause of high instance CPU utilization. Your results can vary depending on the workload.
+ To determine why an instance is suddenly running slowly, see [How do I determine why a system suddenly runs slowly?](#user_diagnostics-speed_change) 
+ To identify and terminate long running queries on a particular instance, see [How do I find and terminate long running or blocked queries?](#user_diagnostics-query_terminating) 
+ To understand whether a query is progressing, see [How do I know when a query is making progress?](#user_diagnostics-query_progressing) 
+ To determine why a query takes a long time to run, see [How can I see a query plan and optimize a query?](#user_diagnostics-query_plan)
+ To track long-running queries over time, see [Profiling Amazon DocumentDB operations](profiling.md).

Depending on the reason for your high instance CPU utilization, doing one or more of the following can help.
+ If the primary instance exhibits high CPU utilization, but the replica instances don't, consider distributing read traffic across replicas via client read preference settings (for example, `secondaryPreferred`). For more information, see [Connecting to Amazon DocumentDB as a replica set](connect-to-replica-set.md). 

  Using replicas for reads can make better use of the cluster’s resources by allowing the primary instance to process more write traffic. Reads from replicas are eventually consistent. 
+ If the high CPU utilization is a result of your write workload, changing the size of the cluster’s instances to a larger instance type increases the number of CPU cores available to service the workload. For more information, see [Instances](what-is.md#what-is-db-instances) and [Instance class specifications](db-instance-classes.md#db-instance-class-specs). 
+ If all cluster instances exhibit high CPU utilization, and the workload is using replicas for reads, adding more replicas to the cluster increases the resources available for read traffic. For more information, see [Adding an Amazon DocumentDB instance to a cluster](db-instance-add.md). 

## How do I determine the open cursors on an instance?
<a name="user_diagnostics-open_cursors"></a>

When connected to an Amazon DocumentDB instance, you can use the `db.runCommand("listCursors")` command to list the open cursors on that instance. Depending on the instance type, up to 4,560 cursors can be open at any given time on a given Amazon DocumentDB instance. Because cursors consume resources on an instance and have an upper limit, it is a best practice to close cursors that are no longer in use. See [Amazon DocumentDB Quotas and limits](limits.md) for specific limits.

```
db.runCommand("listCursors") 
```

## How do I determine the current Amazon DocumentDB engine version?
<a name="user_diagnostics-engine_version"></a>

To determine your current Amazon DocumentDB engine version, run the following command.

```
db.runCommand({getEngineVersion: 1})
```

Output from this operation looks something like the following (JSON format).

```
{ "engineVersion" : "2.x.x", "ok" : 1 }
```

**Note**  
The engine version for Amazon DocumentDB 3.6 is 1.x.x, the engine version for Amazon DocumentDB 4.0 is 2.x.x, the engine version for Amazon DocumentDB 5.0 is 3.x.x, and the engine version for Amazon DocumentDB 8.0 is 4.x.x.
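The note above amounts to a simple mapping from the major engine version to the Amazon DocumentDB release, which can be sketched client-side (the `docdbVersion` helper is hypothetical, not part of any driver):

```javascript
// Hypothetical helper: map the major engine version returned by
// getEngineVersion to the Amazon DocumentDB release, per the note above.
function docdbVersion(engineVersion) {
  const major = engineVersion.split(".")[0];
  return { "1": "3.6", "2": "4.0", "3": "5.0", "4": "8.0" }[major] || "unknown";
}

console.log(docdbVersion("2.0.1")); // 4.0
```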

## How do I analyze index usage and identify unused indexes?
<a name="user-diag-index-usage"></a>

To identify the indexes for a given collection, run the following command:

```
db.collection.getIndexes()
```

To analyze how heavily indexes are used during operations on a collection, use the `collStats` and `indexStats` commands. To view the total number of scans performed using an index (index scans) compared to the number of scans performed without an index (collection scans), run the following command:

```
db.collection.stats()
```

The output for this command includes the following values:
+ **`idxScans`** - The number of scans performed on this collection using an index.
+ **`collScans`** - The number of scans performed on this collection without using an index. These scans would have involved looking over the documents in the collection one at a time.
+ **`lastReset`** - The time when these counters were last reset. The statistics provided by this command are reset when the cluster is started or stopped, or when an instance is scaled up or down.
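From these two counters you can derive the fraction of scans that used an index, which is a quick health signal for the collection. The following sketch is a hypothetical client-side helper, not a database command:

```javascript
// Hypothetical helper: compute the fraction of scans on a collection that
// used an index, from the idxScans and collScans counters returned by
// db.collection.stats().
function indexScanRatio(stats) {
  const total = stats.idxScans + stats.collScans;
  return total === 0 ? null : stats.idxScans / total;
}

console.log(indexScanRatio({ idxScans: 10000, collScans: 4 }));
// ≈ 0.9996 — nearly all scans used an index
```

A ratio well below 1 on a large collection suggests that frequent queries are falling back to collection scans and may benefit from an index.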

A breakdown of how often each index is used is available in the output of the following command. It is a best practice to regularly identify and remove unused indexes to improve performance and reduce cost, because doing so eliminates unnecessary compute, storage, and I/O used to maintain the indexes.

```
db.collection.aggregate([{$indexStats:{}}]).pretty()
```

The output from this command gives the following values for each index created on the collection:
+ **`ops`** - The number of operations that used the index. If your workload has been running for a sufficiently long time and you are confident that your workload is in a steady state, an `ops` value of zero would indicate that the index is not used at all.
+ **`numDocsRead`** - The number of documents read during operations using this index.
+ **`since`** - The time since Amazon DocumentDB started collecting stats on index usage, which is typically the time of the last database restart or maintenance action.
+ **`size`** - The size of this index in bytes.

The following example is a sample output from running the above command:

```
{
    "name" : "_id_",
    "key" : {
        "_id" : 1
    },
    "host" : "example-host.com:12345",
    "size" : NumberLong(...),
    "accesses" : {
        "ops" : NumberLong(...),
        "docsRead" : NumberLong(...),
        "since" : ISODate("...")
    },
    "cacheStats" : {
        "blksRead" : NumberLong(...),
        "blksHit" : NumberLong(...),
        "hitRatio" : ...
    }
}
{
    "name" : "x_1",
    "key" : {
        "x" : 1
    },
    "host" : "example-host.com:12345",
    "size" : NumberLong(...),
    "accesses" : {
        "ops" : NumberLong(...),
        "docsRead" : NumberLong(...),
        "since" : ISODate("...")
    },
    "cacheStats" : {
        "blksRead" : NumberLong(...),
        "blksHit" : NumberLong(...),
        "hitRatio" : ...
    }
}
```
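Given output in this shape, identifying removal candidates is a matter of filtering on the `accesses.ops` counter. The following sketch (the `unusedIndexes` helper and its sample data are hypothetical) illustrates the check; remember that a zero counter is only meaningful once the workload has run long enough to be in a steady state:

```javascript
// Hypothetical sketch: from $indexStats output, list indexes whose "ops"
// counter is zero. The _id_ index is excluded because it cannot be dropped.
function unusedIndexes(indexStats) {
  return indexStats
    .filter(ix => ix.name !== "_id_" && Number(ix.accesses.ops) === 0)
    .map(ix => ix.name);
}

const stats = [
  { name: "_id_", accesses: { ops: 120 } },
  { name: "x_1",  accesses: { ops: 0 } },   // never used
  { name: "y_1",  accesses: { ops: 57 } }
];

console.log(unusedIndexes(stats)); // [ 'x_1' ]
```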

To determine the overall index size for a collection, run the following command:

```
db.collection.stats()
```

To drop an unused index, run the following command:

```
db.collection.dropIndex("indexName")
```

## How do I identify missing indexes?
<a name="user_diagnostics-identify_missing_indexes"></a>

You can use the [Amazon DocumentDB profiler to log slow queries](https://docs.aws.amazon.com//documentdb/latest/developerguide/profiling.html). A query that appears repeatedly in the slow query log may indicate that an additional index is required to improve that query's performance. 

You can identify opportunities for helpful indexes by looking for long running queries that include one or more `COLLSCAN` stages, meaning that the query stage has to read every document in the collection in order to provide a response to the query.

The following example shows a query on a collection of taxi rides that ran on a large collection.

```
db.rides.count({"fare.totalAmount": {$gt: 10.0}})
```

To execute this example, the query had to perform a collection scan (that is, read every document in the collection) because there is no index on the `fare.totalAmount` field. Output from the Amazon DocumentDB profiler for this query looks something like the following:

```
{
    ...
    "cursorExhausted": true,
    "nreturned": 0, 
    "responseLength": 0,
    "protocol": "op_query",
    "millis": 300679,
    "planSummary": "COLLSCAN",
    "execStats": {
        "stage": "COLLSCAN",
        "nReturned": "0",
        "executionTimeMillisEstimate": "300678.042"
    },
    "client": "172.31.5.63:53878",
    "appName": "MongoDB Shell",
    "user": "example"
}
```
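When you export profiler entries in this shape, a small filter can shortlist the slow collection scans worth indexing. The following sketch (the `collscanCandidates` helper, its 1,000 ms default threshold, and the sample entries are hypothetical) keys off the `planSummary` and `millis` fields shown above:

```javascript
// Hypothetical sketch: from a batch of profiler entries, keep slow
// operations whose plan summary was a collection scan. These are
// candidates for a new index.
function collscanCandidates(entries, minMillis = 1000) {
  return entries.filter(e =>
    e.planSummary === "COLLSCAN" && e.millis >= minMillis);
}

const entries = [
  { planSummary: "COLLSCAN", millis: 300679, ns: "test.rides" },
  { planSummary: "IXSCAN",   millis: 2500,   ns: "test.rides" }
];

console.log(collscanCandidates(entries).length); // 1
```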

To speed up the query in this example, you want to create an index on `fare.totalAmount`, as shown below.

```
db.rides.createIndex( {"fare.totalAmount": 1}, {background: true} )
```

**Note**  
Indexes created in the foreground (that is, when the `{background: true}` option is not supplied at index creation) take an exclusive write lock, which prevents applications from writing data to the collection until the index build completes. Be aware of this potential impact when creating indexes on production clusters. When creating indexes, we recommend setting `{background: true}`.

In general, you want to create indexes on fields that have high cardinality (for example, a large number of unique values). Creating an index on a field with low cardinality can result in a large index that is not used. The Amazon DocumentDB query optimizer considers the overall size of the collection and selectivity of the indexes when creating a query plan. There are times where you will see the query processor select a `COLLSCAN` even when an index is present. This happens when the query processor estimates that utilizing the index will not yield a performance advantage over scanning the entire collection. If you want to force the query processor to utilize a particular index, you can use the `hint()` operator as shown below.

```
db.collection.find().hint("indexName")
```

## How do I determine database collection bloat?
<a name="performance-collection-bloat"></a>

Collection bloat occurs when a collection grows larger due to the accumulation of dead or obsolete documents or fragmentation within database pages. The reported percentage represents the amount of document space that can be reused by future documents. This bloat consumes space in both the buffer cache and storage. To remove the bloat, reload the collection, either through a dump and restore or by using a migration loop-back and switchover during a maintenance window.

**Example**  
Run the following command to determine unused storage for your collection:  

```
db.runCommand({collStats:'coll'})
```
The result looks similar to this:  

```
{
        "ns" : "test.coll",
        "count" : 7500,
        "size" : 23250,
        "avgObjSize" : 31,
        "storageSize" : 106496,
        "unusedStorageSize" : {
                "unusedBytes" : 16384,
                "unusedPercent" : 25.12
        },
        "compression" : {
                "enable" : false
        },
        "capped" : false,
        "nindexes" : 1,
        "totalIndexSize" : 57344,
        "indexSizes" : {
                "_id_" : 57344
        },
        "collScans" : 4,
        "idxScans" : 10000,
        "opCounter" : {
                "numDocsIns" : 1000,
                "numDocsUpd" : 0,
                "numDocsDel" : 250
        },
        "cacheStats" : {
                "collBlksHit" : 3570,
                "collBlksRead" : 8,
                "collHitRatio" : 99.7765,
                "idxBlksHit" : 12293,
                "idxBlksRead" : 6,
                "idxHitRatio" : 99.9513
        },
        "lastReset" : "2024-12-18 00:30:21.552019+00",
        "ok" : 1,
        "operationTime" : Timestamp(1734632375, 1)
}
```
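Given output like the above, a monitoring script can flag collections whose unused-space percentage crosses a threshold. The following sketch is hypothetical: the `isBloated` helper and its 20% default threshold are illustrative choices, not documented limits.

```javascript
// Hypothetical sketch: flag a collection as bloated when the
// unusedStorageSize.unusedPercent value reported by collStats crosses
// a chosen threshold (20% is an assumed example value).
function isBloated(collStats, thresholdPercent = 20) {
  return collStats.unusedStorageSize.unusedPercent > thresholdPercent;
}

console.log(isBloated({ unusedStorageSize: { unusedPercent: 25.12 } })); // true
```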

## Summary of useful queries
<a name="user_diagnostics-useful_queries"></a>

The following queries can be useful for monitoring performance and resource utilization in Amazon DocumentDB.
+ Use the following command to view statistics about a specific collection, including operation counters, cache statistics, accesses statistics, and size statistics:

  ```
  db.collection.stats()
  ```
+ Use the following command to view statistics about each index created on a collection including the size of the index, index-specific cache statistics, and index usage statistics:

  ```
  db.collection.aggregate([{$indexStats:{}}]).pretty()
  ```
+ Use the following query to list all activity.

  ```
  db.adminCommand({currentOp: 1, $all: 1});
  ```
+ The following code lists all long running or blocked queries.

  ```
  db.adminCommand({aggregate: 1,
                   pipeline: [{$currentOp: {}},
                              {$match: {$or: [{secs_running: {$gt: 10}},
                                              {WaitState: {$exists: true}}]}},
                              {$project: {_id:0,
                                          opid: 1,
                                          secs_running: 1,
                                          WaitState: 1,
                                          blockedOn: 1,
                                          command: 1}}],
                   cursor: {}
                  });
  ```
+ The following code terminates a query.

  ```
  db.adminCommand({killOp: 1, op: <opid of running or blocked query>});
  ```
+ Use the following code to get an aggregated view of the system state.

  ```
  db.adminCommand({aggregate: 1,
                   pipeline: [{$currentOp: {allUsers: true, idleConnections: true}},
                              {$group: {_id: {desc: "$desc", ns: "$ns", WaitState: "$WaitState"}, count: {$sum: 1}}}],
                   cursor: {}
                  });
  ```

# Garbage collection in Amazon DocumentDB
<a name="garbage-collection"></a>

Amazon DocumentDB implements a multi-version concurrency control (MVCC) database architecture that creates new versions of document and index entries for every update operation. This architecture enables transaction isolation, preventing one transaction's changes from appearing in another.

**Topics**
+ [Understanding Garbage collection in Amazon DocumentDB](#understanding-garbage-collection)
+ [Garbage collection process](#garbage-collection-process)
+ [Storage architecture and extended storage](#storage-architecture)
+ [Monitoring garbage collection](#monitoring-garbage-collection)
+ [Example collStats output](#example-collstats-output)
+ [Frequently asked questions](#garbage-collection-faq)

## Understanding Garbage collection in Amazon DocumentDB
<a name="understanding-garbage-collection"></a>

Garbage collection (GC) is an automated background process that maintains optimal system performance and availability in Amazon DocumentDB. Like many modern databases, Amazon DocumentDB's MVCC architecture creates new document and index versions with each update. Each write operation consumes a unique MVCC ID from a finite counter. These IDs identify which transaction a document version belongs to and whether it has been committed or rolled back. Over time, these old versions and their MVCC IDs accumulate, requiring cleanup to prevent performance degradation.

### Functions of garbage collection
<a name="w2aac55c15b7b5"></a>

The garbage collector serves three essential functions:
+ **Reclaims storage space** — It removes obsolete document and index versions that are no longer needed by active queries, freeing space for future write operations.
+ **Prevents MVCC ID overflow** — It prevents MVCC ID overflow by managing the finite counter of MVCC IDs. Without this management, the counter would eventually reach its limit, forcing the database into a temporary read-only mode until IDs are recycled.
+ **Maintains query performance** — It maintains optimal query performance by eliminating dead document versions that would otherwise accumulate and slow down query processing.

## Garbage collection process
<a name="garbage-collection-process"></a>

The GC process operates per collection and can have multiple processes running concurrently on different collections. The process consists of four sequential phases:

1. **Identification** — The system identifies document and index versions no longer referenced by active transactions or queries.

1. **Memory loading** — Old documents and index entries are loaded into memory if not already present.

1. **Deletion** — Obsolete versions are permanently deleted to reclaim storage space.

1. **MVCC ID recycling** — The system recycles MVCC IDs from deleted versions for new operations.

When garbage collection completes processing old document versions, it removes the oldest MVCC IDs from the system. This cleanup is crucial for preventing MVCC ID overflow by recycling MVCC IDs, making them available for new write operations across the cluster. Without this recycling process, the system would eventually exhaust its finite MVCC ID counter and enter a read-only state.
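The interaction between writes, recycling, and the read-only state can be pictured with a toy counter model. This is an illustrative sketch, not the engine's implementation: the `MvccCounter` class and its tiny limit exist only to show the mechanics described above.

```javascript
// Illustrative model of the finite MVCC ID counter: writes consume IDs,
// garbage collection recycles them, and the cluster goes read-only when
// none remain. Not the actual engine implementation.
class MvccCounter {
  constructor(max) { this.max = max; this.available = max; }
  write() {
    if (this.available === 0) throw new Error("cluster is read-only");
    this.available--;  // each write operation consumes one MVCC ID
  }
  recycle(n) {
    // GC returns IDs from deleted versions, capped at the counter's maximum.
    this.available = Math.min(this.available + n, this.max);
  }
}

const c = new MvccCounter(3);
c.write(); c.write(); c.write();   // counter exhausted: further writes fail
c.recycle(2);                      // GC reclaims two IDs
console.log(c.available); // 2
```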

### Garbage collection scheduling
<a name="w2aac55c15b9b9"></a>

Garbage collection runs automatically in the background at periodic intervals. The timing and frequency adjust dynamically based on system load, available resources, write volume, and MVCC ID consumption levels. During high write activity, the GC process executes more frequently to manage the increased number of document versions.

## Storage architecture and extended storage
<a name="storage-architecture"></a>

Amazon DocumentDB uses a sophisticated storage architecture that separates document storage into two distinct segments:

### Base storage segment
<a name="w2aac55c15c11b5"></a>

The base storage segment contains the primary document data and metadata. This segment stores:
+ Document content that fits within the standard page size (8 KB). 
+ Document metadata and structure information.
+ Primary indexes and their entries.
+ Collection-level statistics and configuration.

### Extended storage segment
<a name="w2aac55c15c11b7"></a>

The extended storage segment utilizes a specialized large document object store designed to handle documents that exceed the standard storage page size. This segment provides:
+ **Efficient Large Document Handling** — Documents larger than the base storage threshold are automatically moved to the extended storage segment.
+ **Optimized Storage Layout** — The segment uses a different storage format optimized for large objects, reducing fragmentation and improving access patterns.
+ **Independent Garbage Collection** — The extended storage segment has its own garbage collection process that can run independently of base storage cleanup.
+ **Transparent Access** — Applications access large documents seamlessly without needing to know which storage segment contains the data.

The extended storage segment is particularly beneficial for:
+ Collections with documents containing large embedded arrays.
+ Documents with extensive nested structures.
+ Collections storing binary data or large text fields.
+ Applications with mixed document sizes where some documents significantly exceed average size.

## Monitoring garbage collection
<a name="monitoring-garbage-collection"></a>

### Cluster level metrics
<a name="w2aac55c15c13b3"></a>

**`AvailableMVCCIds`**
+ **Location** — Amazon CloudWatch
+ **Description** — A counter that shows the number of remaining write operations available from a maximum limit of 1.8 billion. When this counter reaches zero, your cluster enters read-only mode until IDs are reclaimed and recycled. The counter decreases with each write operation and increases as garbage collection recycles old MVCC IDs.
+ **Recommendation** — Set an alarm when the value falls below 1.3 billion. This early warning allows you to take recommended steps discussed later.

**`LongestActiveGCRuntime`**
+ **Location** — Amazon CloudWatch
+ **Description** — Duration in seconds of the longest active garbage collection process. Updates every minute and tracks only active operations, excluding processes that complete within the one-minute window.
+ **Recommendation** — Compare with `gcRuntimeStats` historical data to identify abnormal garbage collection behavior, such as extended runtimes during bulk deletions.

### Collection level metrics
<a name="w2aac55c15c13b5"></a>

**`MVCCIDStats: MVCCIdScale`**
+ **Location** — Database collStats command
+ **Description** — Measures MVCC ID age on a scale of 0 to 1, where 1 indicates the maximum age before a cluster enters a read-only state. Use this metric alongside `AvailableMVCCIds` to identify collections containing the oldest MVCC IDs that are aging the cluster.
+ **Recommendation** — Maintain values below 0.3 for each collection.

**`gcRuntimeStats`**
+ **Location** — Database collStats command
+ **Description** — Provides a two-month history of garbage collection metrics, including total runs, average duration, and maximum duration. Only includes garbage collection operations lasting more than five minutes to ensure meaningful statistics.

**Important**  
`gcRuntimeStats`, `documentFragmentStats`, and the breakdown of collection-level metrics into `storageSegmentBase` and `storageSegmentExtended` are available only in Amazon DocumentDB 8.0.

**`storageSizeStats`**
+ **Location** — Database collStats command
+ **Description** — Provides a detailed breakdown of storage utilization across the different storage segments:
  + `storageSegmentBase` — Storage used by the base storage segment for standard documents
  + `storageSegmentExtended` — Storage used by the extended storage segment for large documents
+ **Usage** — Helps identify collections with significant large document storage and understand storage distribution patterns.

**`unusedStorageSize`** (collection level)
+ **Location** — Database collStats command
+ **Description** — Estimates unused storage space in a collection based on sampled statistics. It includes space from deleted documents and empty segments. The metric provides both combined totals and per-segment breakdowns:
  + Combined `unusedBytes` and `unusedPercent` across all storage segments
  + `storageSegmentBase` — Unused space specifically in the base storage segment
  + `storageSegmentExtended` — Unused space specifically in the extended storage segment

**`documentFragmentStats`**
+ **Location** — Database collStats command
+ **Description** — Provides detailed information about document fragments and dead data within collections. Document fragments represent the internal storage units used by the database engine, and dead fragments indicate data that is no longer accessible but hasn't been reclaimed yet. This metric includes:
  + `totalDocFragmentsCount` — Total number of document fragments in the collection
  + `deadDocFragmentsCount` — Number of fragments containing dead (inaccessible) data
  + `deadDocFragmentsPercent` — Percentage of fragments that contain dead data
  + `deadDocFragmentBytes` — Estimated bytes consumed by dead document fragments
  + Per-segment breakdown for `storageSegmentBase` and `storageSegmentExtended`
+ **Usage** — Monitor this metric to understand the effectiveness of garbage collection and identify collections that may benefit from maintenance operations. High percentages of dead fragments indicate that garbage collection may be falling behind or that the collection would benefit from optimization.
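As a quick sanity check, `deadDocFragmentsPercent` can be recomputed from the raw fragment counts. The following sketch is illustrative (the function is not a DocumentDB API); the field names and sample values match the example `collStats` output later in this topic.

```python
# Illustrative helper: recompute the dead-fragment percentage from the raw
# counts reported in documentFragmentStats. Field names match collStats.
def dead_fragment_percent(frag_stats: dict) -> float:
    """Percentage of document fragments holding dead (inaccessible) data."""
    total = frag_stats["totalDocFragmentsCount"]
    dead = frag_stats["deadDocFragmentsCount"]
    return 100.0 * dead / total if total else 0.0

# Sample values taken from the example collStats output in this topic.
sample = {
    "totalDocFragmentsCount": 45_000_000,
    "deadDocFragmentsCount": 2_250_000,
}
pct = dead_fragment_percent(sample)
print(f"{pct:.1f}% dead")  # 5.0% dead — below the 10-15% warning range
```

A result persistently above the 10-15% range discussed in the FAQ suggests garbage collection is falling behind for that collection.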

### Index-level metrics
<a name="w2aac55c15c13b7"></a>

**`unusedStorageSize`** (index level)
+ **Location** — Database indexStats command
+ **Description** — Estimates unused storage space in an index based on sampled statistics. It includes space from obsolete index entries and empty segments.
+ **Recommendation** — Use the `reIndex` command to rebuild indexes without downtime and reclaim unused space. Refer to Managing Indexes for more details.

## Example collStats output
<a name="example-collstats-output"></a>

The following example shows a typical `collStats` output with garbage collection and storage metrics:

```
{
    "ns" : "Mvcc_consumption_test_db.mvcc_test_collection",
    "MVCCIdStats" : {
        "MVCCIdScale" : 0.03
    },
    "gcRuntimeStats" : {
        "numRuns" : 1,
        "historicalAvgRuntime" : 3295,
        "historicalMaxRuntime" : 3295,
        "lastRuntime" : 3295,
        "lastRuntimeStart" : ISODate("2025-06-24T08:47:14Z")
    },
    "documentFragmentStats" : {
        "totalDocFragmentsCount" : 45000000,
        "deadDocFragmentsCount" : 2250000,
        "deadDocFragmentsPercent" : 5.0,
        "deadDocFragmentBytes" : 98304000,
        "storageSegmentBase" : {
            "totalDocFragmentsCount" : 30000000,
            "deadDocFragmentsCount" : 1500000,
            "deadDocFragmentsPercent" : 5.0,
            "deadDocFragmentBytes" : 65536000
        },
        "storageSegmentExtended" : {
            "totalDocFragmentsCount" : 15000000,
            "deadDocFragmentsCount" : 750000,
            "deadDocFragmentsPercent" : 5.0,
            "deadDocFragmentBytes" : 32768000
        }
    },
    "collScans" : 14,
    "count" : 30000000,
    "size" : 1320000000,
    "avgObjSize" : 44,
    "storageSize" : 6461497344,
    "storageSizeStats" : {
        "storageSegmentBase" : 4307664896,
        "storageSegmentExtended" : 2153832448
    },
    "capped" : false,
    "nindexes" : 2,
    "totalIndexSize" : 9649553408,
    "indexSizes" : {
        "_id_" : 1910661120,
        "c_1" : 7738892288
    },
    "unusedStorageSize" : {
        "unusedBytes" : 4201881600,
        "unusedPercent" : 65.05,
        "storageSegmentBase" : {
            "unusedBytes" : 2801254400,
            "unusedPercent" : 65.05
        },
        "storageSegmentExtended" : {
            "unusedBytes" : 1400627200,
            "unusedPercent" : 65.05
        }
    },
    "cacheStats" : {
        "collBlksHit" : 171659016,
        "collBlksRead" : 754061,
        "collHitRatio" : 99.5627,
        "idxBlksHit" : 692563636,
        "idxBlksRead" : 1177921,
        "idxHitRatio" : 99.8303
    },
    "idxScans" : 41823984,
    "opCounter" : {
        "numDocsIns" : 0,
        "numDocsUpd" : 20911992,
        "numDocsDel" : 0
    },
    "lastReset" : "2025-06-24 05:57:08.219711+00",
    "ok" : 1,
    "operationTime" : Timestamp(1750968826, 1)
}
```
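A script can scan output like the above for the collection-level warning signs discussed in this topic. The sketch below is hypothetical (not a DocumentDB API); its thresholds follow the recommendations in this section (`MVCCIdScale` below 0.3, dead fragments below roughly 10%), and the sample values are taken from the example output above.

```python
# Illustrative only: flag the collection-level warning signs described in
# this topic, given a collStats-style document. Thresholds follow the
# guide's recommendations; the function itself is a hypothetical helper.
def gc_warning_signs(coll_stats: dict,
                     mvcc_scale_max: float = 0.3,
                     dead_pct_max: float = 10.0) -> list:
    warnings = []
    if coll_stats["MVCCIdStats"]["MVCCIdScale"] > mvcc_scale_max:
        warnings.append("MVCCIdScale above 0.3: collection holds aging MVCC IDs")
    if coll_stats["documentFragmentStats"]["deadDocFragmentsPercent"] > dead_pct_max:
        warnings.append("high dead fragment percentage: GC may be falling behind")
    return warnings

# Values taken from the example collStats output above.
sample = {
    "MVCCIdStats": {"MVCCIdScale": 0.03},
    "documentFragmentStats": {"deadDocFragmentsPercent": 5.0},
}
print(gc_warning_signs(sample))  # [] -- no warning signs in this example
```

In practice you would build `sample` from the result of running the `collStats` command against each collection you monitor.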

## Frequently asked questions
<a name="garbage-collection-faq"></a>

### How do I identify if garbage collection is not working efficiently?
<a name="w2aac55c15c17b3"></a>

Monitor these warning signs that indicate inefficient garbage collection:
+ **Excessive Collection Bloat** — Steadily increasing `unusedStorageSize` metrics during heavy writes or bulk deletions, especially with large indexes.
+ **High Dead Fragment Percentage** — `documentFragmentStats` showing consistently high `deadDocFragmentsPercent` values (above 10-15%).
+ **Degraded Query Latency** — Increased query latency due to accumulated dead documents.
+ **Extended GC Duration** — Garbage collection operations taking longer than historical averages in `gcRuntimeStats`.
+ **Elevated GC Processing** — High `LongestActiveGCRuntime` indicating the garbage collector cannot keep up with system demands.

### Does garbage collection affect my database performance?
<a name="w2aac55c15c17b5"></a>

Under normal conditions, garbage collection has minimal performance impact. However, when garbage collection falls behind, you may experience:
+ Increased storage costs from accumulated dead documents.
+ Slower query performance due to obsolete index entries.
+ Temporary read-only mode if MVCC IDs are depleted.
+ Higher resource usage during intensive collection runs, especially on smaller instances.
+ Reduced efficiency in extended storage segment operations for large documents.

### Can I manually trigger garbage collection?
<a name="w2aac55c15c17b7"></a>

No, garbage collection in Amazon DocumentDB cannot be manually triggered. The system manages garbage collection automatically as part of its internal maintenance operations.

### What alarms should I set as an operational best practice?
<a name="w2aac55c15c17b9"></a>

We recommend setting up monitoring at both the cluster and collection levels to ensure optimal performance of your Amazon DocumentDB system.

For cluster-level monitoring, start by creating an Amazon CloudWatch alarm for the `AvailableMVCCIds` metric with a threshold of 1.3 billion. This gives you adequate time to take action before the metric reaches zero, at which point your cluster would enter read-only mode. Keep in mind that this metric can fluctuate with your usage patterns; some customers see it drop below 1.3 billion and then recover above 1.5 billion as garbage collection completes its work.

It's also important to monitor the `LongestActiveGCRuntime` metric through Amazon CloudWatch. This metric, along with `gcRuntimeStats`, helps you understand how efficiently garbage collection is performing across your system.

For collection-level monitoring, focus on these key metrics:
+ `MVCCIdScale` — Watch for increasing values that suggest MVCC IDs are aging and may need attention.
+ `gcRuntimeStats` — Identify garbage collection processes taking unusually long or extending over multiple days.
+ `documentFragmentStats` — Monitor `deadDocFragmentsPercent` values; consistently high percentages (above 10-15%) may indicate that garbage collection is falling behind.
+ `storageSizeStats` and `unusedStorageSize` — Track storage utilization patterns and identify collections with significant unused space in either storage segment.

Collections with frequent write operations need extra attention, as they generate more work for the garbage collector. We recommend checking these metrics more frequently for collections with heavy write activity to ensure garbage collection keeps up with your workload.

Note that these monitoring recommendations serve as a starting point. As you become more familiar with your system's behavior, you may want to adjust these thresholds to better match your specific usage patterns and requirements.

### What should I do if my `AvailableMVCCIds` falls below 1.3 billion?
<a name="w2aac55c15c17c11"></a>

If your `AvailableMVCCIds` metric drops below 1.3 billion, take immediate action to prevent your cluster from entering read-only mode. Start by scaling up your instance size to give the garbage collector more computing resources. This is our primary recommendation because it allows your application to continue normal operations while the garbage collector catches up.

If scaling up alone doesn't improve the situation, consider reducing your write operations. Use the `MVCCIdScale` metric to identify which collections contain the oldest MVCC IDs, and monitor `documentFragmentStats` to find collections with high dead fragment percentages that may be contributing to garbage collection inefficiency.

Once you've identified these collections, you may need to temporarily reduce write operations to them to allow garbage collection to catch up. During the recovery period, we recommend closely monitoring the `AvailableMVCCIds` metric to ensure your actions are having the desired effect. Your cluster is considered healthy once the `AvailableMVCCIds` value returns to 1.5 billion or higher.

Remember that these steps are preventive measures to help your system recover before it reaches a critical state. The sooner you take action after seeing the metric drop below 1.3 billion, the more likely you are to avoid any impact to your write operations.