App Mesh best practices
Important
End of support notice: On September 30, 2026, AWS will discontinue support for AWS App Mesh. After September 30, 2026, you will no longer be able to access the AWS App Mesh console or AWS App Mesh resources. For more information, see the blog post Migrating from AWS App Mesh to Amazon ECS Service Connect.
To achieve the goal of zero failed requests during planned deployments and during the unplanned loss of some hosts, the best practices in this topic implement the following strategy:
- Increase the likelihood that a request will succeed from the perspective of the application by using a safe default retry strategy. For more information, see Instrument all routes with retries.
- Increase the likelihood that a retried request succeeds by maximizing the likelihood that the retried request is sent to an actual destination. For more information, see Adjust deployment velocity, Scale out before scale in, and Implement container health checks.
To significantly reduce or eliminate failures, we recommend that you implement the recommendations in all of the following practices.
Instrument all routes with retries
Configure all virtual services to use a virtual router and set a default retry policy for all routes.
This will mitigate failed requests by reselecting a host and sending a new request.
For retry policies, we recommend a value of at least two for maxRetries, and specifying the following options for each type of retry event in each route type that supports the retry event type:
- TCP – connection-error
- HTTP and HTTP/2 – stream-error and gateway-error
- gRPC – cancelled and unavailable
Other retry events need to be considered on a case-by-case basis because they may not be safe, such as when the request isn't idempotent. You will need to consider and test values for maxRetries and perRetryTimeout that make the appropriate trade-off between the maximum latency of a request (maxRetries * perRetryTimeout) and the increased success rate of more retries. Additionally, when Envoy attempts to connect to an endpoint that is no longer present, you should expect that request to consume the full perRetryTimeout. To configure a retry policy, see Creating a route and then select the protocol that you want to route.
Note
If you created a route on or after July 29, 2020 and didn't specify a retry policy, then App Mesh may have automatically created a default retry policy for that route, similar to the policy recommended above. For more information, see Default route retry policy.
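The retry guidance above can be sketched as plain data plus a small latency calculation. This is a minimal sketch, not a definitive configuration: the field names (maxRetries, perRetryTimeout, httpRetryEvents, tcpRetryEvents, grpcRetryEvents) are assumed to match the retry policy shape in the App Mesh route specification, and the timeout values and helper function names are hypothetical choices for illustration. Verify the shape against the route reference before using it.

```python
# Illustrative retry policies following the recommendations above.
# Field names are assumed to match the App Mesh route spec; the numeric
# values are placeholders that you should tune and test for your service.

HTTP_RETRY_POLICY = {
    "maxRetries": 2,  # at least two, per the recommendation
    "perRetryTimeout": {"unit": "ms", "value": 2000},  # assumed value
    "httpRetryEvents": ["stream-error", "gateway-error"],
    "tcpRetryEvents": ["connection-error"],
}

GRPC_RETRY_POLICY = {
    "maxRetries": 3,
    "perRetryTimeout": {"unit": "s", "value": 1},  # assumed value
    "grpcRetryEvents": ["cancelled", "unavailable"],
    "httpRetryEvents": ["stream-error", "gateway-error"],
    "tcpRetryEvents": ["connection-error"],
}


def per_retry_timeout_ms(policy):
    """Return perRetryTimeout in milliseconds for a policy dict."""
    timeout = policy["perRetryTimeout"]
    factor = {"ms": 1, "s": 1000}[timeout["unit"]]
    return timeout["value"] * factor


def retry_latency_budget_ms(policy):
    """Latency bound from the text: maxRetries * perRetryTimeout."""
    return policy["maxRetries"] * per_retry_timeout_ms(policy)


if __name__ == "__main__":
    for name, policy in (("http", HTTP_RETRY_POLICY), ("grpc", GRPC_RETRY_POLICY)):
        print(name, retry_latency_budget_ms(policy), "ms")
```

Comparing the computed budget with the latency your callers can tolerate is one way to pick between more retries and a shorter perRetryTimeout.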
Adjust deployment velocity
When using rolling deployments, reduce the overall deployment velocity. By default, Amazon ECS configures a deployment strategy of a minimum of 100 percent healthy tasks and 200 percent total tasks. On deployment, this results in two points of high drift:
- The 100 percent fleet size of new tasks may be visible to Envoys prior to being ready to complete requests (see Implement container health checks for mitigations).
- The 100 percent fleet size of old tasks may be visible to Envoys while the tasks are being terminated.
When configured with these deployment constraints, container orchestrators may enter a state where they are simultaneously hiding all old destinations and making all new destinations visible. Because your Envoy dataplane is eventually consistent, this can result in periods where the set of destinations visible in your dataplane has diverged from the orchestrator's point of view. To mitigate this, we recommend maintaining a minimum of 100 percent healthy tasks, but lowering total tasks to 125 percent. This will reduce divergence and improve the reliability of retries. We recommend the following settings for different container runtimes:
Amazon ECS
If your service has a desired count of two or three, set maximumPercent to 150 percent. Otherwise, set maximumPercent to 125 percent.
Kubernetes
Configure your deployment's update strategy, setting maxUnavailable to 0 percent and maxSurge to 25 percent. For more information on deployments, see Kubernetes Deployments.
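The settings above can be sketched as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the task ceiling calculation assumes that Amazon ECS rounds the maximumPercent limit down to the nearest whole task. Under that assumption, 125 percent leaves no room for an extra task when the desired count is two or three, which is one way to read the 150 percent exception.

```python
# Sketch of the deployment settings recommended above.
# Function names are hypothetical; the rounding rule is an assumption
# (the upper task limit is rounded down to a whole task).

def recommended_maximum_percent(desired_count):
    """Return maximumPercent per the guidance: 150 for two or three tasks, else 125."""
    return 150 if desired_count in (2, 3) else 125


def max_tasks_during_deployment(desired_count, maximum_percent):
    """Upper bound on running tasks, assuming the limit is rounded down."""
    return desired_count * maximum_percent // 100


def ecs_deployment_configuration(desired_count):
    """Deployment configuration that keeps 100 percent of tasks healthy."""
    return {
        "minimumHealthyPercent": 100,
        "maximumPercent": recommended_maximum_percent(desired_count),
    }


# Equivalent intent for a Kubernetes Deployment update strategy.
K8S_UPDATE_STRATEGY = {
    "type": "RollingUpdate",
    "rollingUpdate": {"maxUnavailable": "0%", "maxSurge": "25%"},
}

if __name__ == "__main__":
    for count in (2, 3, 4, 8):
        config = ecs_deployment_configuration(count)
        ceiling = max_tasks_during_deployment(count, config["maximumPercent"])
        print(count, config["maximumPercent"], ceiling)
```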
Scale out before scale in
Scale out and scale in can both result in some probability of failed requests in retries. While there are task recommendations that mitigate scale out, the only recommendation for scale in is to minimize the percentage of scaled in tasks at any one time. We recommend that you use a deployment strategy that scales out new Amazon ECS tasks or Kubernetes deployments prior to scaling in old tasks or deployments. This scaling strategy keeps your percentage of scaled in tasks or deployments lower, while maintaining the same velocity. This practice applies to both Amazon ECS tasks and Kubernetes deployments.
Implement container health checks
In the scale-out scenario, containers in an Amazon ECS task may come up out of order and may not be initially responsive. We recommend the following for different container runtimes:
Amazon ECS
To mitigate this, we recommend using container health checks and container dependency ordering to ensure that Envoy is running and healthy before any container that requires outbound network connectivity starts. To correctly configure an application container and Envoy container in a task definition, see Container dependency.
Kubernetes
None, because Kubernetes liveness and readiness
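For the Amazon ECS recommendation, the dependency ordering can be sketched as two container definitions plus a check that every other container waits for a healthy Envoy. This is an illustrative sketch only: the container names, the health check timings, and the health check command (which assumes the Envoy admin interface listens on port 9901 and that curl is available in the image) are assumptions, and required fields such as the container image are omitted. The helper function is hypothetical.

```python
# Sketch of Amazon ECS container definitions with dependency ordering.
# Names, timings, and the health check command are assumptions; image
# and other required task definition fields are intentionally omitted.

ENVOY_CONTAINER = {
    "name": "envoy",
    "essential": True,
    "healthCheck": {
        # Assumes the Envoy admin interface is reachable on port 9901.
        "command": [
            "CMD-SHELL",
            "curl -s http://localhost:9901/server_info | grep state | grep -q LIVE",
        ],
        "interval": 5,
        "timeout": 2,
        "retries": 3,
        "startPeriod": 10,
    },
}

APP_CONTAINER = {
    "name": "app",
    "essential": True,
    # Start the application only after Envoy reports healthy.
    "dependsOn": [{"containerName": "envoy", "condition": "HEALTHY"}],
}


def containers_missing_envoy_dependency(containers, envoy_name="envoy"):
    """Return names of containers that don't wait for a healthy Envoy."""
    missing = []
    for container in containers:
        if container["name"] == envoy_name:
            continue
        waits_for_envoy = any(
            dep.get("containerName") == envoy_name and dep.get("condition") == "HEALTHY"
            for dep in container.get("dependsOn", [])
        )
        if not waits_for_envoy:
            missing.append(container["name"])
    return missing


if __name__ == "__main__":
    print(containers_missing_envoy_dependency([ENVOY_CONTAINER, APP_CONTAINER]))
```

A check like this could run in a deployment pipeline before a task definition is registered, so that a container without the dependency is caught early.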
Optimize DNS resolution
If you're using DNS for service discovery, it's essential to select the appropriate IP protocol to optimize DNS resolution when configuring your meshes. App Mesh supports both IPv4 and IPv6, and your choice can impact your service's performance and compatibility. If your infrastructure doesn't support IPv6, we recommend that you specify an IP setting that aligns with your infrastructure rather than relying on the default IPv6_PREFERRED behavior. The default IPv6_PREFERRED behavior can degrade service performance.
- IPv6_PREFERRED – This is the default setting. Envoy performs a DNS lookup for IPv6 addresses first and falls back to IPv4 if no IPv6 addresses are found. This is beneficial if your infrastructure primarily supports IPv6 but needs IPv4 compatibility.
- IPv4_PREFERRED – Envoy first looks up IPv4 addresses and falls back to IPv6 if no IPv4 addresses are available. Use this setting if your infrastructure primarily supports IPv4 but has some IPv6 compatibility.
- IPv6_ONLY – Choose this option if your services exclusively support IPv6 traffic. Envoy only performs DNS lookups for IPv6 addresses, ensuring all traffic is routed through IPv6.
- IPv4_ONLY – Choose this setting if your services exclusively support IPv4 traffic. Envoy only performs DNS lookups for IPv4 addresses, ensuring all traffic is routed through IPv4.
You can set IP version preferences at both the mesh level and the virtual node level, with virtual node settings overriding those at the mesh level.
For more information, see Service Meshes and Virtual Nodes.
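The override behavior described above can be sketched as a small lookup. This is a minimal sketch, assuming the IP preference is stored under serviceDiscovery.ipPreference in the mesh spec and under serviceDiscovery.dns.ipPreference in the virtual node spec. The function name and the example hostname are hypothetical.

```python
# Sketch of how a virtual node setting overrides the mesh-level setting.
# The spec key paths are assumptions; the function name is hypothetical.

VALID_IP_PREFERENCES = ("IPv6_PREFERRED", "IPv4_PREFERRED", "IPv6_ONLY", "IPv4_ONLY")
DEFAULT_IP_PREFERENCE = "IPv6_PREFERRED"


def effective_ip_preference(mesh_spec, virtual_node_spec):
    """Resolve the preference: virtual node first, then mesh, then the default."""
    node_preference = (
        virtual_node_spec.get("serviceDiscovery", {}).get("dns", {}).get("ipPreference")
    )
    mesh_preference = mesh_spec.get("serviceDiscovery", {}).get("ipPreference")
    preference = node_preference or mesh_preference or DEFAULT_IP_PREFERENCE
    if preference not in VALID_IP_PREFERENCES:
        raise ValueError(f"unknown IP preference: {preference}")
    return preference


if __name__ == "__main__":
    # An IPv4-only infrastructure sets the preference once at the mesh level.
    mesh = {"serviceDiscovery": {"ipPreference": "IPv4_ONLY"}}
    node = {"serviceDiscovery": {"dns": {"hostname": "service.example.internal"}}}
    print(effective_ip_preference(mesh, node))
```

Setting the preference at the mesh level and overriding it only on the virtual nodes that differ keeps the configuration small when most services share the same network stack.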