SEC07-BP03 Automate identification and classification - AWS Well-Architected Framework

Automating the identification and classification of data can help you implement the correct controls. Using automation to augment manual determination reduces the risk of human error and exposure.

Desired outcome: You are able to verify whether the proper controls are in place based on your classification and handling policy. Automated tools and services help you identify and classify the sensitivity level of your data. Automation also helps you continually monitor your environments to detect and alert when data is being stored or handled in unauthorized ways, so that corrective action can be taken quickly.

Common anti-patterns:

  • Relying solely on manual processes for data identification and classification, which are error-prone and time-consuming. As data volumes grow, manual classification becomes increasingly inefficient and inconsistent.

  • Not having mechanisms to track and manage data assets across the organization.

  • Overlooking the need for continuous monitoring and classification of data as it moves and evolves within the organization.

Benefits of establishing this best practice: Automating data identification and classification leads to more consistent and accurate application of data protection controls, reducing the risk of human error. Automation also provides visibility into sensitive data access and movement, helping you detect unauthorized handling and take corrective action.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

While human judgment is often used to classify data during the initial design phases of a workload, consider having systems in place that automate identification and classification on test data as a preventive control. For example, developers can be provided a tool or service to scan representative data to determine its sensitivity. Within AWS, you can upload data sets into Amazon S3 and scan them using Amazon Macie, Amazon Comprehend, or Amazon Comprehend Medical. Likewise, consider scanning data as part of unit and integration testing to detect sensitive data where it is not expected. Alerting on sensitive data at this stage can highlight gaps in protections before deployment to production. Other capabilities, such as sensitive data detection in AWS Glue, message data protection in Amazon SNS, and data protection policies in Amazon CloudWatch Logs, can also be used to detect PII and take mitigating action. For any automated tool or service, understand how it defines sensitive data, and augment it with other human or automated solutions to close any gaps as needed.
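As one way to give developers such a tool, a thin wrapper around Amazon Comprehend's `DetectPiiEntities` API can scan representative text samples before they become test fixtures. This is a minimal sketch assuming boto3 is installed and configured with credentials and a Region; the function names and the 0.8 confidence threshold are illustrative, not an official pattern.

```python
def significant_pii(entities, min_score=0.8):
    """Keep only entity types Comprehend reported above a confidence threshold."""
    return sorted({e["Type"] for e in entities if e["Score"] >= min_score})

def scan_text_for_pii(text, region="us-east-1"):
    """Scan a representative text sample for PII using Amazon Comprehend."""
    import boto3  # imported here so the pure helper above stays usable offline
    comprehend = boto3.client("comprehend", region_name=region)
    resp = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    return significant_pii(resp["Entities"])
```

A developer could run `scan_text_for_pii` against sample records before committing them; non-empty output indicates the data needs de-identification or redaction first.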

As a detective control, use ongoing monitoring of your environments to detect if sensitive data is being stored in non-compliant ways. This can help detect situations such as sensitive data being emitted into log files, or being copied to a data analytics environment without proper de-identification or redaction. Data stored in Amazon S3 can be continually monitored for sensitive data using Amazon Macie.
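Turning on that ongoing monitoring can itself be automated. A hedged sketch using Macie's account-level APIs, assuming boto3 credentials are configured and Macie is not yet enabled in the account (the `EnableMacie` call returns an error if it already is):

```python
def is_discovery_enabled(config):
    """Interpret a GetAutomatedDiscoveryConfiguration response."""
    return config.get("status") == "ENABLED"

def enable_automated_discovery(region="us-east-1"):
    """Turn on Macie and its automated sensitive data discovery for this account."""
    import boto3  # lazy import keeps the pure helper above usable offline
    macie = boto3.client("macie2", region_name=region)
    macie.enable_macie()  # errors if Macie is already enabled in this account
    macie.update_automated_discovery_configuration(status="ENABLED")
    # Read the setting back to confirm it took effect
    return is_discovery_enabled(macie.get_automated_discovery_configuration())
```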

Implementation steps

  1. Review the data classification scheme within your organization described in SEC07-BP01.

    1. With an understanding of your organization's data classification scheme, you can establish accurate processes for automated identification and classification that align with your company's policies.

  2. Perform an initial scan of your environments for automated identification and classification.

    1. An initial full scan of your data can help produce a comprehensive understanding of where sensitive data resides in your environments. When a full scan is not initially required, or cannot be completed up front due to cost, evaluate whether data sampling techniques are suitable to achieve your outcomes. For example, Amazon Macie can be configured to perform a broad automated sensitive data discovery operation across your S3 buckets. This capability uses sampling techniques to cost-efficiently perform a preliminary analysis of where sensitive data resides. A deeper analysis of S3 buckets can then be performed using a sensitive data discovery job. Other data stores can also be exported to S3 to be scanned by Macie.

    2. Establish the access controls defined in SEC07-BP02 for the data storage resources identified in your scan.
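The deeper analysis in step 2 can be started programmatically with Macie's `CreateClassificationJob` API. A sketch assuming boto3 credentials; the one-time job type, job-name scheme, and helper names are illustrative choices:

```python
import uuid

def bucket_definitions(account_id, bucket_names):
    """Build the s3JobDefinition bucket scope for a classification job."""
    return {"bucketDefinitions": [{"accountId": account_id,
                                   "buckets": sorted(bucket_names)}]}

def start_deep_scan(account_id, bucket_names, region="us-east-1"):
    """Run a one-time Macie sensitive data discovery job over specific buckets."""
    import boto3  # lazy import keeps bucket_definitions usable offline
    macie = boto3.client("macie2", region_name=region)
    resp = macie.create_classification_job(
        clientToken=str(uuid.uuid4()),             # idempotency token
        jobType="ONE_TIME",
        name=f"deep-scan-{uuid.uuid4().hex[:8]}",  # illustrative naming scheme
        s3JobDefinition=bucket_definitions(account_id, bucket_names),
    )
    return resp["jobId"]
```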

  3. Configure ongoing scans of your environments.

    1. The automated sensitive data discovery capability of Macie can be used to perform ongoing scans of your environments. Known S3 buckets that are authorized to store sensitive data can be excluded using an allow list in Macie.
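The exclusion step above can be automated through Macie's classification scope, which removes named buckets from automated discovery. A sketch assuming a single classification scope exists in the account; the 50-bucket batch size is an assumption for illustration, not a documented limit:

```python
def batched(items, size):
    """Split a list into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def exclude_authorized_buckets(bucket_names, region="us-east-1"):
    """Exclude buckets authorized to hold sensitive data from automated discovery."""
    import boto3  # lazy import keeps the batched() helper usable offline
    macie = boto3.client("macie2", region_name=region)
    # Assumes the account has one classification scope (the usual case)
    scope_id = macie.list_classification_scopes()["classificationScopes"][0]["id"]
    for batch in batched(sorted(bucket_names), 50):  # batch size is an assumption
        macie.update_classification_scope(
            id=scope_id,
            s3={"bucket": {"operation": "ADD", "bucketNames": batch}},
        )
```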

  4. Incorporate identification and classification into your build and test processes.

    1. Identify tools that developers can use to scan data for sensitivity while workloads are in development. Use these tools as part of integration testing to alert when unexpected sensitive data is found, and to prevent further deployment.
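As a self-contained illustration of such an integration-test gate, the check below uses hand-rolled regular expressions. A production gate would instead call a managed detector such as Amazon Comprehend or Macie, since regexes only cover the patterns you thought to write; the pattern names here are illustrative:

```python
import re

# Illustrative patterns only; real detection should use a managed service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return the names of PII patterns that match anywhere in the text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

def assert_no_pii(fixture_text):
    """Fail an integration test (raise) if unexpected PII appears in fixtures."""
    hits = find_pii(fixture_text)
    if hits:
        raise AssertionError(f"Unexpected sensitive data detected: {hits}")
```

Calling `assert_no_pii` on each test fixture inside the suite makes the build fail fast when sensitive data leaks into places it is not expected.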

  5. Implement a system or runbook to take action when sensitive data is found in unauthorized locations.

    1. Restrict access to data using auto-remediation. For example, you can move this data to an S3 bucket with restricted access or tag the object if you use attribute-based access control (ABAC). Additionally, consider masking the data when it is detected.

    2. Alert your data protection and incident response teams to investigate the root cause of the incident. Any learnings they identify can help prevent future incidents.
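The remediation in step 5 can be wired to Macie findings delivered through Amazon EventBridge. A sketch: the parsing helper follows the Macie finding schema (`resourcesAffected.s3Object`), while the quarantine-bucket approach and the `classification=restricted` ABAC tag are illustrative assumptions:

```python
def finding_s3_object(finding):
    """Extract (bucket, key) from a Macie finding's resourcesAffected block."""
    obj = finding["resourcesAffected"]["s3Object"]
    return obj["bucketArn"].split(":::")[-1], obj["key"]

def quarantine_object(bucket, key, quarantine_bucket, region="us-east-1"):
    """Move a flagged object into a restricted bucket and tag it for ABAC."""
    import boto3  # lazy import keeps finding_s3_object usable offline
    s3 = boto3.client("s3", region_name=region)
    s3.copy_object(
        Bucket=quarantine_bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        TaggingDirective="REPLACE",
        Tagging="classification=restricted",  # illustrative ABAC tag
    )
    s3.delete_object(Bucket=bucket, Key=key)  # remove the non-compliant copy
```

A Lambda function subscribed to Macie findings could call `finding_s3_object` on the event payload and pass the result to `quarantine_object`, then notify the data protection team for root-cause analysis.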
