Review and monitoring
SM_PERF3: How do you use caching to improve content delivery performance? |
---|
SM_PBP5 – Use a content delivery network and monitor your cache-hit-ratio |
SM_PBP6 – Ensure that cache-control headers for your content are optimized |
SM_PBP7 – Have a cache invalidation runbook |
SM_PBP8 – Minimize negative (error) caching |
A Content Delivery Network (CDN) scales video delivery by serving content from caches nearest the user and by providing optimized routes back to origin services. Caching improves time-to-first-byte for clients and reduces load on origin services. CDNs use a multi-tier architecture, with two or three tiers of cache hierarchy before requests reach the origin server. These tiers are usually referred to as the edge tier and the mid-tier caches. The edge tier is the first to receive a request from a client and responds fastest in the case of a cache match. The mid-tier has a larger cache depth, but is located only in select locations. Cache misses at the edge tier fall back to the mid-tier for another chance at a cache match.
A Cache Hit Ratio (CHR) is the ratio of requests served from cache (matches) to total requests (matches plus misses) over a period of time. Cache matches improve the client experience, while cache misses result in requests going directly to your origin layer, which increases response latency and cost. Monitoring CHR will help you improve delivery and origin layer performance over time. You can enable CHR, origin latency, and HTTP error rate metrics from your Amazon CloudFront monitoring settings.
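As an illustration, the CHR calculation described above can be sketched in a few lines (a minimal example, independent of any particular CDN; the sample counts are invented):

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Cache Hit Ratio: fraction of total requests served from cache."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total

# Example: 9,200 requests served from cache and 800 origin fetches
# over a monitoring window:
print(f"CHR: {cache_hit_ratio(9200, 800):.1%}")  # CHR: 92.0%
```

A falling CHR over time is the signal to investigate: it usually means cache keys are fragmenting (for example, through unnecessary query strings) or TTLs are too short.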
CDNs typically employ a least recently used (LRU) caching strategy on each tier. This means that data is retained in caches based on the amount of traffic an object receives and the available cache size. Though you can’t guarantee caches will hold content for the next request, you can set Cache-Control headers on the origin to indicate the preferred duration for an object to be kept in a CDN cache. Your CDN should be configured to respect caching headers from your origin server to ensure that live content and manifests are only cached for the appropriate amount of time.
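The interaction between LRU eviction and Cache-Control freshness can be sketched with a toy in-memory cache (an illustrative model only, not how any specific CDN is implemented):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Minimal sketch of an LRU cache that also honors a per-object
    max-age, loosely mimicking how a CDN cache tier evicts content."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()  # key -> (value, expiry timestamp)

    def put(self, key, value, max_age: float):
        # Cache-Control: max-age determines how long the object stays fresh.
        self._store[key] = (value, time.monotonic() + max_age)
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # stale object: treat as a miss
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value
```

Note that either mechanism can remove an object: popular content can still be evicted when capacity runs out, and unexpired space does not keep a stale object alive. This is why Cache-Control expresses a *preferred* duration, not a guarantee.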
Live streaming manifests are frequently updated to represent the next media object in the stream and should not be cached for longer than half of your segment duration. Caching manifests for longer could result in stale manifests being served, a delay for clients retrieving the next media segment, and client buffer exhaustion, all of which negatively impact the user experience. Live media segments and VOD content (both segments and manifests) should be cached for as long as possible to retain them in delivery caches for the maximum amount of time.
Scenario | Segment Size | Manifest Update Frequency | Segment Cache-Control Header or Cache Behavior | Manifest Cache-Control Header or Cache Behavior |
---|---|---|---|---|
Live | 10 seconds | 10 seconds | 21,600 seconds or max DVR window | 5 seconds or less |
VOD | 10 seconds | Static | 86,400 seconds or longest possible | 86,400 seconds or longest possible |
Recommended cache behaviors for live and Video-on-Demand (VOD) scenarios
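The recommendations in the table above can be expressed as a small helper that picks a Cache-Control value per scenario (a sketch of the guidance only; the specific values should be tuned to your DVR window and workload):

```python
def cache_control(scenario: str, kind: str, segment_duration: int = 10) -> str:
    """Choose a Cache-Control max-age per the live/VOD guidance above.

    scenario: "live" or "vod"; kind: "manifest" or "segment".
    """
    if scenario == "live":
        if kind == "manifest":
            # No more than half the segment duration, to avoid stale manifests.
            return f"max-age={segment_duration // 2}"
        # Live segments: cache as long as possible (here, 6 hours),
        # or your maximum DVR window.
        return "max-age=21600"
    # VOD: cache both segments and manifests as long as possible.
    return "max-age=86400"

print(cache_control("live", "manifest"))  # max-age=5
print(cache_control("vod", "segment"))    # max-age=86400
```

Setting these headers at the origin, and configuring the CDN to respect them, keeps the caching policy in one place.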
There will be times when cached content needs to be modified or invalidated. Have a cache invalidation runbook in place so that you can modify cached objects and invalidate the previous content. This can be achieved by invalidating content with a CDN feature, using variable file names, or using query string parameters to “break” the cache when content changes.
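The query-string approach can be sketched as follows: deriving a version parameter from the object’s content means any change automatically produces a new cache key, without a CDN-side invalidation call (an illustrative helper, not part of any CDN API):

```python
import hashlib

def versioned_url(base_url: str, content: bytes) -> str:
    """Append a short content hash as a query parameter so that any
    change to the object yields a new cache key ("cache busting")."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{base_url}?v={digest}"

# Unchanged content keeps the same URL (and stays cacheable);
# changed content gets a new URL that bypasses stale cached copies.
```

For this to work, the CDN must include the query string in its cache key, and your manifests or pages must reference the versioned URLs.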
Caching of error responses from the origin, also known as negative caching, should be minimized as some streaming clients might proactively request future segments before they are published to minimize latency. For live streaming, it should be disabled completely for manifest and segment files. At a minimum, the negative caching duration should not exceed one segment length. Amazon CloudFront caches origin errors for five minutes by default, but you can configure it to suit your needs.
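The negative caching rule above reduces to a simple bound, sketched here (the function name and workflow labels are illustrative):

```python
# CloudFront caches origin errors for 5 minutes by default.
CLOUDFRONT_DEFAULT_ERROR_TTL = 300

def negative_cache_ttl(workflow: str, segment_duration: int,
                       configured: int = CLOUDFRONT_DEFAULT_ERROR_TTL) -> int:
    """Upper bound, in seconds, for caching origin error responses.

    Live manifests and segments: disable negative caching entirely.
    Otherwise: never cache errors for longer than one segment length.
    """
    if workflow == "live":
        return 0
    return min(configured, segment_duration)
```

With a 10-second segment duration, the CloudFront default of 300 seconds would be clamped to 10 seconds for VOD, and to 0 for live.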
SM_PERF4: How do you monitor viewer experience? |
---|
SM_PBP9 – Collect and analyze real user logs and metrics |
SM_PBP10 – Recognize and respond to playback anomalies |
Infrastructure logging and monitoring provide only part of the picture. We recommend that you design a client that sends real-user data directly to monitoring and logging systems. This allows you to benchmark normal behavior, identify anomalies, and correlate events with content delivery systems. For example, session initialization information, such as the playback URL, user agent, and network connection status, could help you identify issues with a specific origin, client device type, or network environment.
For streaming media, it’s especially important to monitor the health of the video decoder to determine how changes in network topology, video encoding settings, or mobile operating systems impact the end user experience. For example, video buffering events, which directly impact customer satisfaction, should be captured as a key indicator of streaming health.
We recommend that you capture client metrics from streaming sessions with services like Amazon Kinesis and monitor for anomalies with Amazon CloudWatch. Equipped with this data, you can uncover patterns from real users, create alerting systems, and automate remediation tasks. The AWS Partner Network provides another avenue for video-specific monitoring tools that can give you actionable data from playback sessions.
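One way to sketch the publishing side is to shape a session record into an Amazon CloudWatch `PutMetricData` payload (the namespace and metric names here are assumptions for illustration, not a fixed schema):

```python
def build_metric_payload(session: dict) -> dict:
    """Shape a playback-session record into the arguments for a
    CloudWatch PutMetricData call. Namespace and metric names are
    illustrative; adapt them to your own taxonomy."""
    return {
        "Namespace": "StreamingMedia/Playback",  # hypothetical namespace
        "MetricData": [
            {"MetricName": "TimeToFirstFrame",
             "Value": session["ttff_ms"], "Unit": "Milliseconds"},
            {"MetricName": "BufferingEvents",
             "Value": session["buffer_events"], "Unit": "Count"},
        ],
    }

# To actually publish (requires AWS credentials and boto3):
# import boto3
# boto3.client("cloudwatch").put_metric_data(**build_metric_payload(session))
```

In practice the client would emit raw events to a stream (for example, Amazon Kinesis), with aggregation into CloudWatch metrics happening server-side.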
Amazon Prime Video, a streaming media service by Amazon, has many ways of monitoring customer experience. One key metric, Zero-Impact-Rate, measures the percentage of streaming sessions that completed without any buffering or errors. This is used to baseline customer experience and alert when there are deviations from normal behavior. Here are other client metrics that provide valuable insights into viewer playback experience:
Metric | Description |
---|---|
Time-to-First-Frame | Time between the client request for content and the first frame being displayed on the client |
Playback Frames Per Second | Client displayed frame rate |
Session Resolution | Client displayed resolution |
Session Duration | Duration the client spent watching content |
Buffering Events | Client buffering events |
Zero-Buffer-Rate | Percentage of total sessions that had zero buffering events |
Client Error Events | Client HTTP or application errors |
Zero-Error-Rate | Percentage of total sessions that had zero error events |
Zero-Impact-Rate | Percentage of total sessions that had zero buffering or error events |
Suggested metrics for measuring quality of service
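The three rate metrics in the table can be derived from per-session records in one pass (a sketch; the field names `buffer_events` and `error_events` are assumptions about your session schema):

```python
def session_quality_rates(sessions: list[dict]) -> dict:
    """Aggregate per-session records into the rate metrics above.
    Each record is assumed to carry 'buffer_events' and 'error_events'."""
    total = len(sessions)
    if total == 0:
        return {"zero_buffer_rate": 0.0, "zero_error_rate": 0.0,
                "zero_impact_rate": 0.0}
    zero_buffer = sum(1 for s in sessions if s["buffer_events"] == 0)
    zero_error = sum(1 for s in sessions if s["error_events"] == 0)
    zero_impact = sum(1 for s in sessions
                      if s["buffer_events"] == 0 and s["error_events"] == 0)
    return {
        "zero_buffer_rate": zero_buffer / total,
        "zero_error_rate": zero_error / total,
        "zero_impact_rate": zero_impact / total,
    }
```

Note that Zero-Impact-Rate is always less than or equal to both Zero-Buffer-Rate and Zero-Error-Rate, since a zero-impact session must satisfy both conditions.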