A general approach to debugging Lambda performance issues and errors
With CloudWatch Logs, CloudWatch Logs Insights, and X-Ray, you can monitor the performance of Lambda-based applications and drill into specific errors. Instrument production applications with these services, or with third-party alternatives, so that you can quickly gain insight when troubleshooting production problems.
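As an illustration of what these logs can tell you, the sketch below parses the `REPORT` line that Lambda writes to CloudWatch Logs at the end of each invocation and estimates how close a function is to its memory limit. The helper names and the 90% threshold are assumptions chosen for this example, not AWS recommendations, and the sample log line is made up.

```python
import re

# Pattern for the REPORT line Lambda emits after each invocation.
# Fields are separated by tabs in the real log; \s+ accepts tabs or spaces.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_used_mb>\d+) MB"
)


def parse_report_line(line):
    """Return the key fields of a REPORT line as a dict, or None if no match."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_used_mb": int(match["max_used_mb"]),
    }


def is_memory_bound(report, threshold=0.9):
    """Hypothetical heuristic: flag invocations using >= threshold of memory."""
    return report["max_used_mb"] / report["memory_mb"] >= threshold


if __name__ == "__main__":
    # Made-up sample line in the documented REPORT format.
    sample = (
        "REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3\t"
        "Duration: 12.34 ms\tBilled Duration: 13 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 120 MB"
    )
    report = parse_report_line(sample)
    print(report, is_memory_bound(report))
```

In practice, CloudWatch Logs Insights can aggregate these same fields across many invocations; the parser above is only meant to show which values are available.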
Generally, once a problem is reported:
- Reproduce the issue, if possible, or identify the reported errors from logs. Try to identify specific environmental or configuration settings that indicate whether the error may be caused by a downstream outage or a specific set of circumstances.
- Use X-Ray to find all the services involved in a request. For larger serverless applications, this is the fastest way to locate all of the interactions involved in the request. From X-Ray, isolate the service where latency or errors are occurring, then drill down further.
- If a Lambda function is the source of the problem, ensure that the function is not memory-bound or CPU-bound, and that there is enough unreserved concurrency available for it to scale up as needed. Load testing after making changes can help confirm whether the issue is resolved and can simulate typical performance at real-world traffic levels.
- Use CloudWatch metrics to identify any throttling in the services during the lifetime of the request. You can also set alarms that fire whenever throttling metrics rise above zero, since any level of throttling is likely to result in errors or slow performance in your application.
- Update your application once the problem is identified. This includes updating error handling and tests in custom code and requesting Service Quotas increases, if needed. After these changes, attempt to recreate the problem with load testing to confirm that the new quotas and custom code have resolved the issue or, if the problem occurs again, that they provide more metrics and instrumentation.
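The throttling alarm mentioned in the steps above could be set up along the following lines. This is a minimal sketch, assuming the third-party `boto3` SDK is installed and credentials are configured. The alarm name pattern, the function name `my-function`, the 60-second period, and the choice to treat missing data as not breaching are all assumptions made for illustration.

```python
def build_throttle_alarm(function_name, period_seconds=60):
    """Build put_metric_alarm parameters for an alarm on any Lambda throttling."""
    return {
        # Hypothetical naming convention for this example.
        "AlarmName": f"{function_name}-throttles",
        "AlarmDescription": "Fires when the function is throttled at all.",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        # Alarm as soon as the throttle count rises above zero.
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: periods with no data points are not treated as throttling.
        "TreatMissingData": "notBreaching",
    }


def create_throttle_alarm(function_name):
    """Create the alarm in CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_throttle_alarm(function_name))


if __name__ == "__main__":
    # Print the parameters only; call create_throttle_alarm to apply them.
    print(build_throttle_alarm("my-function"))
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and test before anything is created in an account.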