Handle host replacement and connection stalling - Amazon Neptune

When Neptune replaces a host (for example, during maintenance or failover), existing connections to that host become invalid. In containerized environments, this can stall all threads in a container if the client doesn't handle the replacement gracefully.

Use current client versions

If you use the Gremlin query language, use a TinkerPop driver version that is compatible with your Neptune engine version (see Accessing a Neptune graph with Gremlin for the compatibility table). If you use the Java driver, consider neptune-gremlin-client — a wrapper around the TinkerPop Java driver that adds connection management features like endpoint health checking and failover handling. It follows the same version compatibility rules as the underlying TinkerPop driver.

Use neptune-gremlin-client version 3.x (or at minimum version 2.0.7), depending on what your Neptune version allows. These newer versions improve resiliency and connection handling.

For openCypher users with the Neo4j driver, close and recreate the Driver object when you detect a connection failure during failover. Neptune supports Bolt protocol versions 1 through 4.0. For more information, see Neptune Best Practices Using openCypher and Bolt.
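The close-and-recreate pattern can be sketched as follows. This is a minimal illustration, not part of any driver: ReconnectingDriver, driver_factory, and connection_errors are hypothetical names, and the only thing assumed about the driver object is that it has a close() method. With the Neo4j Python driver, for example, the factory could wrap GraphDatabase.driver(...) and connection_errors could be neo4j.exceptions.ServiceUnavailable.

```python
class ReconnectingDriver:
    """Hypothetical wrapper that replaces a driver after a connection failure."""

    def __init__(self, driver_factory, connection_errors=(ConnectionError,)):
        # driver_factory: zero-argument callable that returns a new driver.
        # connection_errors: exception types treated as "connection failed".
        self._factory = driver_factory
        self._errors = tuple(connection_errors)
        self._driver = driver_factory()

    def run(self, work):
        """Call work(driver); on a connection failure, swap in a fresh driver."""
        try:
            return work(self._driver)
        except self._errors:
            self._recreate()
            # Re-raise so the caller's own retry policy decides what happens next.
            raise

    def _recreate(self):
        try:
            self._driver.close()
        except Exception:
            # The old driver may already be unusable; closing is best effort.
            pass
        self._driver = self._factory()

    def close(self):
        self._driver.close()
```

Because the wrapper re-raises after swapping the driver, the next call to run() uses the new Driver object, and the decision to retry stays with the caller.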

Use cluster or reader endpoints

Don't connect to instance endpoints directly. Use the cluster endpoint for writes and the reader endpoint for reads. If you must use instance endpoints with neptune-gremlin-client, enable endpoint health-check filtering through the /status API.
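To illustrate what health-check filtering through /status amounts to, here is a rough sketch using only the Python standard library. It is not the neptune-gremlin-client API (that library is Java and performs this filtering itself). The function names are hypothetical, port 8182 is assumed, and the code assumes that the /status response is a JSON document whose status field is "healthy" for a usable instance. If IAM authentication is enabled, the request would also need to be signed, which this sketch omits.

```python
import json
import urllib.request


def fetch_status(host, port=8182, timeout=2.0):
    """Return the parsed /status document for one instance, or None on failure."""
    url = f"https://{host}:{port}/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError):
        # Covers connection errors, timeouts, and unparseable responses.
        return None


def healthy_endpoints(hosts, fetch=fetch_status):
    """Keep only the hosts whose status document reports 'healthy'."""
    healthy = []
    for host in hosts:
        status = fetch(host)
        if status is not None and status.get("status") == "healthy":
            healthy.append(host)
    return healthy
```

The fetch parameter is injectable so that the filter can be exercised without a live cluster. A client would call healthy_endpoints periodically and route requests only to the hosts it returns.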

Configure liveness probes with tolerance

Set your Kubernetes liveness probe's failureThreshold to at least 30 with a periodSeconds of 10, so that the probe tolerates 30 × 10 = 300 seconds of consecutive failures before Kubernetes restarts the pod. This covers the approximately 5-minute window during which Neptune completes a host replacement.
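The arithmetic behind this recommendation can be checked with a small sketch. The dictionary keys mirror the Kubernetes probe fields periodSeconds and failureThreshold. The helper names are hypothetical, and the estimate deliberately ignores timeoutSeconds and initialDelaySeconds, so treat it as an approximation.

```python
REPLACEMENT_WINDOW_SECONDS = 300  # the roughly 5-minute window described above


def tolerated_seconds(probe):
    """Approximate time a probe can keep failing before the pod is restarted."""
    return probe["failureThreshold"] * probe["periodSeconds"]


def covers_replacement(probe, window=REPLACEMENT_WINDOW_SECONDS):
    """True if the probe tolerates failures for at least the whole window."""
    return tolerated_seconds(probe) >= window


recommended = {"periodSeconds": 10, "failureThreshold": 30}
kubernetes_default = {"periodSeconds": 10, "failureThreshold": 3}

print(tolerated_seconds(recommended), covers_replacement(recommended))
print(tolerated_seconds(kubernetes_default), covers_replacement(kubernetes_default))
```

With the Kubernetes default failureThreshold of 3, a pod would be restarted after about 30 seconds of failed probes, well before a host replacement finishes.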

Implement retry with backoff

A single failed request during host replacement shouldn't crash the container. Implement retry logic with exponential backoff on connection failures so that transient errors during replacement resolve without intervention. For guidance on retryable exceptions, see Neptune transaction exceptions.
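A minimal sketch of retry with exponential backoff follows. The function name, the default limits, and the choice of full jitter are illustrative assumptions rather than Neptune recommendations. The retryable tuple should be replaced with the exceptions that your driver raises for transient connection failures.

```python
import random
import time


def retry_with_backoff(operation, retryable=(ConnectionError, TimeoutError),
                       max_attempts=6, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients don't reconnect in lockstep.
            sleep(random.uniform(0, ceiling))
```

Exceptions that are not listed in retryable propagate immediately, so query errors that a retry can't fix are not masked. The sleep parameter exists only so that the delay can be replaced in tests.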