Prepare

CMOPS_1: How do you define the meaningful monitoring KPIs and metrics of your connected mobility platform?

Connected mobility platforms serve millions of vehicles and end users and generate billions of messages. The scale and distributed nature of these platforms create unique challenges for monitoring. Defining the critical KPIs and metrics is a key step of your monitoring strategy, and should cover end user mobile applications, vehicles, mobile cellular networks, and cloud infrastructure. Consider starting with the benefits that the connected mobility platform provides to its end users and work backwards to identify the KPIs and metrics that serve as leading and lagging indicators of success.

[CMOPS_BP1.1] Define end-to-end KPIs and metrics for a connected mobility platform.

Collecting key metrics, logs, and trace information from all parts of your connected mobility platform gives you end-to-end visibility across the solution. It can decrease the Mean Time to Detect (MTTD), Mean Time to Repair (MTTR), and Mean Time to Restore Service (MTRS) by allowing you to detect issues quickly, and to troubleshoot and debug with precision. It can also help you recognize trends before critical issues occur.
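As a worked example of how these time-based indicators can be derived, the following minimal sketch computes MTTD and MTRS from a list of incident records. The `Incident` record and its field names are hypothetical, and the definitions used (fault start to first alert for MTTD, fault start to service restored for MTRS) are assumptions; your organization may measure them differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical incident record; field names are illustrative only.
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring raised the first alert
    restored: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents):
    # Assumed definition: fault start to first detection.
    return mean_minutes([i.detected - i.started for i in incidents])


def mtrs_minutes(incidents):
    # Assumed definition: fault start to service restoration.
    return mean_minutes([i.restored - i.started for i in incidents])
```

Tracking these values per release or per month shows whether added telemetry is actually shortening detection and restoration times.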

Define KPIs that cover business, application, and infrastructure across the solution. Begin by creating a working team that includes stakeholders from various teams, including business teams. Define the business KPIs first, and then the supporting KPIs for the application and infrastructure. Map these KPIs to the metrics collected from tools such as Amazon CloudWatch, which helps you observe resources in AWS, on premises, and in other clouds. Some sample KPIs are as follows:

Business KPIs:

Metrics from end user mobile app including the number of remote commands issued, errors running remote commands, remote command latency, errors from user portals, and errors from infotainment systems.
Mobile network Quality of Service (QoS) metrics such as packet loss, latency, jitter, and interference.
Vehicle health indicators such as diagnostic codes, battery health, fuel efficiency, and repair and maintenance data.
Vehicle security metrics such as the number of unauthorized requests to high-risk ECUs and vehicle disconnected duration metric.

Application KPIs:

Number of transaction failures, application errors, and number of retries for key services.
Response times to and from vehicle, and latency across the application and in critical functions.
Health checks for critical features and functions of the application.
Database metrics such as query response times, number of connections, and IOPS metrics.

Infrastructure KPIs:

Usage metrics that include CPU, memory, storage, and network for critical connected mobility infrastructure. These metrics play a crucial role in proactive infrastructure management, optimizing resource allocation, and enhancing the overall user experience of connected mobility.
Service quota metrics to enable proactive service quota management and avoid downtime due to reaching a service quota. Effective quota management prevents connected mobility service disruption.
Security metrics including unpatched instances, security vulnerabilities, security events, and non-compliant resources. These metrics help you assess security status, make informed decisions, and prioritize actions to mitigate risks.
Cloud specific metrics for key services. As an example, AWS Lambda is commonly used as part of connected mobility implementations on AWS. Key Lambda metrics include number of invocations, number of errors, number of retries, number of invocations that were throttled, and duration of function processing. These metrics play a vital role in preventing and addressing various issues that can affect the reliability and performance of your connected mobility platform.
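The Lambda and service quota metrics above can be turned into ratio-style KPIs. The following is a minimal sketch, not a definitive implementation: the function names and the 80% warning level are assumptions. It relies on the documented behavior that the Lambda `Invocations` metric does not count throttled requests, so attempted requests are taken as invocations plus throttles.

```python
def lambda_kpis(invocations, errors, throttles):
    """Derive error and throttle rates from summed Lambda metric values.

    Inputs are the sums of the Invocations, Errors, and Throttles metrics
    over the same period. Throttled requests are not counted as invocations,
    so they are added back to get the number of attempted requests.
    """
    attempted = invocations + throttles
    return {
        "error_rate": errors / invocations if invocations else 0.0,
        "throttle_rate": throttles / attempted if attempted else 0.0,
    }


def quota_needs_attention(usage, quota, warn_at=0.8):
    """Return True when usage reaches the warning fraction of a service quota.

    warn_at=0.8 is an illustrative default, not a recommended value.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    return usage / quota >= warn_at
```

For example, `lambda_kpis(1000, 20, 250)` reports a 2% error rate and a 20% throttle rate, which could each be mapped to an application or infrastructure KPI.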

CMOPS_2: How do you observe the health of your connected mobility platform and proactively identify anomalies?

The connected mobility platform includes the vehicle edge, connectivity, and cloud-based infrastructure and applications, and enables services such as fleet management, remote diagnostics, and predictive maintenance. Implementing observability provides insight into end-to-end system health, availability, performance, and scalability, which in turn helps reduce the time to restore the service after a disruption.

By monitoring metrics such as response times, resource utilization, and error rates, you can identify potential bottlenecks or issues that may impact the overall performance of the system. This allows you to take proactive measures, such as scaling up resources or optimizing the platform's architecture, to provide uninterrupted service delivery.

Monitoring the health of edge devices is crucial because these devices form the backbone of connected mobility. Edge devices, such as vehicle telematics units or sensors, collect and transmit data to the cloud for analysis and are typically used for decision-making. By continuously monitoring their health, you can detect any anomalies, performance degradation, or malfunctions in real-time and provide timely interventions.

Observability of connected mobility platforms requires that the data from all subsystems and microservices is aggregated in a data lake, and that you have the capability to correlate the records and dive deep to identify the root cause of an issue.

[CMOPS_BP2.1] Implement an observability data lake that aggregates telemetry from all connected mobility components.

Validate that all components of the connected mobility platform are able to send the telemetry data (logs, traces, and metrics) to the observability data lake. Implement transaction correlation IDs in those records to trace the transaction across multiple services.
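One way to attach a transaction correlation ID to every log record is sketched below using only the Python standard library. The helper names (`start_transaction`, `build_logger`), the JSON field names, and the idea of reusing an ID received from an upstream service are assumptions for illustration, not a prescribed framework.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the transaction currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so records are easy to correlate later."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def start_transaction(incoming_id=None):
    """Reuse the caller's ID if one was passed along, else create a new one."""
    cid = incoming_id or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def build_logger(service_name, stream):
    """Hypothetical helper: a per-service logger writing JSON lines to stream."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Each service would call `start_transaction` with the ID it received (for example, from a message attribute or request header) so that records from different services can be joined on `correlation_id` in the data lake.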

Investigating intermittent errors in a set of vehicles requires the capability to activate debug-level logging inside the vehicle. These logs can help the operations team replay transactions in a test environment and debug the root cause of errors in production. Because data loggers generate a large volume of data, enable this capability only during an event. Implement a log and trace framework based on the AUTOSAR 4 Diagnostic Log and Trace (DLT) standard. The end user should also be able to enable debug logging through a diagnostic app on the infotainment screen to assist the service engineer with troubleshooting.

The logs can then be transferred to the cloud using AWS IoT Core, where the AWS IoT rules engine can upload in-vehicle log records to Amazon CloudWatch. If a log payload exceeds the AWS IoT Core MQTT publish payload limit of 128 KB, you can instead upload the log records using Amazon S3 pre-signed URLs. The connected mobility platform should be able to import these logs into vehicle simulators in the cloud to replay the transactions in non-production environments.
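The choice between the two upload paths can be expressed as a simple size check on the vehicle side. This is a minimal sketch under stated assumptions: `UploadPath` and `choose_upload_path` are hypothetical names, and the publish size limit is passed in as `max_publish_bytes` rather than hard-coded, so you supply the limit that applies to your broker configuration.

```python
from enum import Enum


class UploadPath(Enum):
    # Hypothetical labels for the two transfer options described above.
    MQTT_PUBLISH = "mqtt_publish"
    S3_PRESIGNED_URL = "s3_presigned_url"


def choose_upload_path(payload: bytes, max_publish_bytes: int) -> UploadPath:
    """Pick MQTT for payloads within the publish limit, S3 otherwise."""
    if max_publish_bytes <= 0:
        raise ValueError("max_publish_bytes must be positive")
    if len(payload) <= max_publish_bytes:
        return UploadPath.MQTT_PUBLISH
    return UploadPath.S3_PRESIGNED_URL
```

Keeping the decision in one function makes it straightforward to change the limit or add a third path, such as chunked uploads, without touching the logging code.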

Implement an observability data pipeline to orchestrate and automate data processing workflows, ensuring seamless ingestion and transformation of observability data.

[CMOPS_BP2.2] Set up real-time monitoring and alerting capabilities to detect anomalies promptly.

Use Amazon CloudWatch to store logs and to monitor key metrics such as vehicle status, telemetry, or network connectivity. Use AWS Glue to transform the log data, and analyze it using services like Amazon Athena or Amazon OpenSearch Service. Implement custom log analysis solutions or use AWS Partner Network offerings for advanced log analytics. When metrics breach predefined thresholds or anomalies are detected, CloudWatch alarms can send notifications through Amazon Simple Notification Service (Amazon SNS) or start automated actions using AWS Lambda functions.
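To reason about how a threshold alarm behaves before configuring one, it can help to model the "M out of N" evaluation that CloudWatch alarms support through the `DatapointsToAlarm` and `EvaluationPeriods` settings. The function below is a simplified local sketch of that idea, not the CloudWatch implementation: it only handles a greater-than threshold and ignores missing-data handling.

```python
def would_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True if enough of the most recent datapoints breach the threshold.

    datapoints          -- metric values, oldest first
    evaluation_periods  -- N: how many recent datapoints to look at
    datapoints_to_alarm -- M: how many of those must exceed the threshold
    """
    if datapoints_to_alarm > evaluation_periods:
        raise ValueError("datapoints_to_alarm cannot exceed evaluation_periods")
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        # Not enough data yet; this sketch treats that as "no alarm".
        return False
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm
```

Replaying historical metric values through a function like this can show whether a proposed threshold would have been too noisy or too slow to react.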

Implement AWS Lambda functions to process and analyze streaming data from the edge devices in near real-time. With Lambda, you can apply custom logic or machine learning models to the incoming observability data, which enables near real-time anomaly detection and triggers alerts or notifications when predefined thresholds or specific conditions are met.
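The custom logic inside such a function can be as simple as a rolling z-score. The sketch below is one possible approach, not the method the platform must use: the event shape (`{"readings": [{"vin": ..., "value": ...}]}`), the window size, and the threshold of 3 standard deviations are all assumptions.

```python
from statistics import mean, pstdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: a z-score is undefined, so this sketch skips it.
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


def handler(event, context):
    """Lambda-style entry point; the event shape is hypothetical."""
    readings = event["readings"]
    indexes = detect_anomalies([r["value"] for r in readings])
    # A real function would publish an alert here (for example, to a topic).
    return {"anomalies": [readings[i] for i in indexes]}
```

Because the detection logic is a plain function, it can be unit tested with recorded telemetry before it is deployed.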

Perform synthetic transaction monitoring that simulates the typical end user journey. Deploy the synthetic transaction agents in various geographical locations, and have them periodically send test transactions to the connected mobility services deployed in the cloud. Implement a vehicle health monitoring agent solution as per AUTOSAR to monitor vehicle-to-cloud communication. If the monitoring detects an issue, such as a breach of a performance threshold or an error, it alerts the operations team.
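A synthetic transaction agent boils down to timing a test transaction and comparing the result against a threshold. The following minimal sketch assumes a caller-supplied `send_transaction` callable (hypothetical) that performs the test request and returns a truthy value on success; the result dictionary keys are also illustrative.

```python
import time


def run_synthetic_check(send_transaction, max_latency_s, clock=time.monotonic):
    """Run one test transaction and report whether it met the latency target.

    send_transaction -- hypothetical callable that performs the test request
    max_latency_s    -- latency threshold in seconds
    clock            -- injectable time source, which simplifies testing
    """
    start = clock()
    try:
        ok = send_transaction()
        error = None
    except Exception as exc:  # any failure counts as an unhealthy check
        ok, error = False, str(exc)
    latency = clock() - start
    healthy = bool(ok) and latency <= max_latency_s
    return {"healthy": healthy, "latency_s": latency, "error": error}
```

An agent in each location could run this on a schedule and raise an alert whenever `healthy` is false.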

[CMOPS_BP2.3] Implement predictive analytics and proactive operations.

Predictive analytics and proactive operations play a significant role in avoiding costly downtime. Predictive analytics examines historical data to identify patterns and trends that can indicate potential component failures. This enables proactive maintenance scheduling and reduces the risk of unexpected downtime. Machine learning (ML) models can be trained on historical data to improve prediction accuracy. Applying predictive analytics algorithms helps generate insights and optimize maintenance planning.
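As a small illustration of trend-based prediction, the sketch below fits a straight line to historical health readings and estimates how long until a threshold is crossed. A least-squares line stands in here for a trained ML model, and the `(day, value)` sample format, function name, and threshold are assumptions chosen for the example.

```python
def days_until_threshold(samples, threshold):
    """Estimate days from the last sample until the value falls to threshold.

    samples -- list of (day, value) pairs, oldest first
    Returns None when there is too little data or no downward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])
```

For instance, battery health readings of 100, 98, and 96 percent on days 0, 10, and 20 give an estimate of about 80 days until the value reaches 80 percent, which could feed a maintenance schedule.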

For the application stack in the cloud, use an ML service like Amazon DevOps Guru to detect abnormal operating patterns and identify operational issues. You can extend this predictive capability by developing ML algorithms using a service like Amazon SageMaker AI and leveraging an observability data lake. The solution can use this insight to automatically recommend remedial actions and runbooks to restore the service.

In a connected mobility platform, the most challenging aspect is debugging vehicle-to-cloud communications. The following best practices can help you remotely debug the in-vehicle components of the connected vehicle.

[CMOPS_BP2.4] Implement robust remote diagnostic capabilities.

Robust remote diagnostic capabilities are essential for troubleshooting. This entails developing comprehensive diagnostic algorithms that can accurately identify potential issues. Real-time data collected from in-vehicle systems is continuously analyzed to detect anomalies and raise timely alerts. Advanced ML techniques and cloud-based analytics platforms can improve diagnostic accuracy and enable efficient identification of problems.

Use services such as AWS IoT Core for near real-time diagnostic data ingestion and processing. Use serverless functions such as AWS Lambda to develop diagnostic algorithms that can analyze the events in real-time. This enables the diagnosis of complex patterns and the generation of actionable insights.

[CMOPS_BP2.5] Provide remote access for troubleshooting purposes.

Authorized support personnel can remotely access in-vehicle systems, which expedites the troubleshooting process. However, it's crucial to implement secure communication protocols to protect data privacy and prevent unauthorized access. Requiring end user consent and using virtual private networks (VPNs) or encrypted remote access tools helps establish a secure connection. Regularly auditing and updating access permissions further enhances security. Follow the security best practices (CMSEC_6) for remote troubleshooting.

CMOPS_3: Have you set up a validation environment that has feature equality with the production environment of the connected mobility platform?

[CMOPS_BP3.1] Validate the system in a non-production environment that has feature equality with production.

The connected mobility platform requires additional validation to account for the variability of network connectivity, external systems, in-vehicle components, vehicle model years, and the diverse environments in which the vehicles are operated.

Use digital twin concepts to create a vehicle simulator that introduces common failure scenarios and tests subsystem behavior. Modularizing the digital twin model enables additional subsystems to be added in the future. Leverage vehicle observability data collected over time to generate machine learning models and perform predictive testing. Create curated test datasets to test application logic, regression, and performance. Use simulation tools (such as virtual ECUs) to test scenarios throughout the application and subsystem development lifecycle and to validate subsystem behavior.
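The modular simulator idea can be sketched as a twin object that holds pluggable subsystems and allows faults to be injected into them. Everything below is illustrative: the class names, the two example subsystems, and the telemetry fields are assumptions, and a real simulator would model far more behavior.

```python
class BatterySubsystem:
    """Hypothetical subsystem model reporting a 12 V battery voltage."""
    name = "battery"

    def __init__(self, voltage=12.6):
        self.voltage = voltage

    def telemetry(self):
        return {"voltage": self.voltage}


class ConnectivitySubsystem:
    """Hypothetical subsystem model reporting network reachability."""
    name = "connectivity"

    def __init__(self):
        self.online = True

    def telemetry(self):
        return {"online": self.online}


class VehicleTwin:
    """Minimal digital twin: a VIN plus a set of pluggable subsystems."""

    def __init__(self, vin):
        self.vin = vin
        self.subsystems = {}

    def add_subsystem(self, subsystem):
        # New subsystems can be registered later without changing the twin.
        self.subsystems[subsystem.name] = subsystem

    def inject_fault(self, subsystem_name, **overrides):
        """Force subsystem attributes to faulty values for a test scenario."""
        subsystem = self.subsystems[subsystem_name]
        for attr, value in overrides.items():
            if not hasattr(subsystem, attr):
                raise AttributeError(f"{subsystem_name} has no attribute {attr}")
            setattr(subsystem, attr, value)

    def snapshot(self):
        """Return the telemetry the simulated vehicle would report."""
        data = {name: s.telemetry() for name, s in self.subsystems.items()}
        return {"vin": self.vin, **data}
```

A test could register both subsystems, call `inject_fault("battery", voltage=9.1)`, and then check that the cloud application reacts correctly to the resulting snapshot.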

Perform real-world tests for connectivity (with different hardware and network equipment and environments), security, navigation, entertainment, and data collection and analytics. Perform these tests with hardware-in-the-loop using modular test benches.

CMOPS_4: How do you do agile development of the connected mobility platform, with a focus on minimizing disruptions?

[CMOPS_BP4.1] Leverage a microservices architecture for the connected mobility platform.

The microservices architecture streamlines the development, deployment, scaling, and maintenance of the entire connected mobility platform. For example, the remote commands microservice, which is responsible for sending lock, unlock, and remote start commands to the vehicle, can be independently scaled and deployed. This operation can be performed separately from the vehicle diagnostic microservices, which send periodic vehicle diagnostic reports to the end user. This architecture also enables each microservice to have different disaster recovery (DR) strategies based on their individual recovery time objective (RTO) and recovery point objective (RPO).

[CMOPS_BP4.2] Implement an API-first architecture to facilitate the exchange of data and services while developing the connected mobility platform.

API operations support agile development and a faster feature-to-market cycle for connected mobility software. They provide clear guidelines, enable modular development, and support collaboration with external stakeholders. For example, automotive Original Equipment Manufacturers (OEMs) deploy developer portals to crowdsource app development on the connected mobility platform. These portals give secure access to the connected vehicle platform through public API operations. Connected vehicle API operations can also be used to securely share connected vehicle data with authorized repairers for compliance with Right to Repair regulations. The API-first approach also makes it easier to extend the platform to support multiple model years of vehicles with backward compatibility, as long as the API contracts are maintained. It also reduces the impact of changes and shortens time to market for new features without causing disruption to the connected mobility platform.
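Maintaining API contracts across model years can be checked mechanically. The sketch below compares two versions of a response contract and lists changes that would break existing clients. Representing a contract as a flat mapping of field name to type name is a simplifying assumption, and the field names in the usage note are hypothetical.

```python
def breaking_changes(old_contract, new_contract):
    """List changes in new_contract that would break clients of old_contract.

    Contracts are modeled as {field_name: type_name} mappings. Added fields
    are treated as backward compatible; removed or retyped fields are not.
    """
    problems = []
    for field, old_type in old_contract.items():
        if field not in new_contract:
            problems.append(f"removed field: {field}")
        elif new_contract[field] != old_type:
            problems.append(
                f"type changed: {field} ({old_type} -> {new_contract[field]})"
            )
    return problems
```

For example, adding an `odometer_km` field to a contract with `vin` and `locked` yields no problems, while dropping `locked` yields `["removed field: locked"]`. A check like this could run in a CI/CD pipeline before an API change is released.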

[CMOPS_BP4.3] Implement DevOps automation.

Incorporate DevOps automation to improve productivity, enhance the developer experience, and improve the quality of connected mobility feature releases. Implement a self-service developer portal that provisions pattern-based software templates with preapproved resources and configurations and requires limited manual intervention. Create connected mobility platform templates in the developer portal that include opinionated, pre-built modules for observability and other connected mobility best practices. Adopt Continuous Integration/Continuous Deployment (CI/CD) pipelines to automate the testing and deployment of software updates for rapid, reliable releases. Automated testing that simulates scenarios and real-world conditions helps validate the connected mobility platform for safety and performance.
