AWS DMS data validation
Topics
- Replication task statistics
- Replication task statistics with Amazon CloudWatch
- Revalidating tables during a task
- Using JSON editor to modify validation rules
- Validation only tasks
- Troubleshooting
- Redshift Validation Performance
- Enhanced data validation for AWS Database Migration Service
- Limitations
- Amazon S3 target data validation
AWS DMS provides support for data validation to ensure that your data was migrated accurately from the source to the target. If validation is enabled, it begins immediately after a full load is performed for a table. For a CDC-enabled task, validation also compares incremental changes as they occur.
During data validation, AWS DMS compares each row in the source with its corresponding row at the target, verifies that the rows contain the same data, and reports any mismatches. To accomplish this, AWS DMS issues appropriate queries to retrieve the data. Note that these queries consume additional resources at the source and target, as well as additional network resources.
For a CDC only task with validation enabled, all pre-existing data in a table is validated before starting validation of new data.
Data validation works with the following source databases wherever AWS DMS supports them as source endpoints:
- Oracle
- PostgreSQL-compatible database (PostgreSQL, Aurora PostgreSQL, or Aurora Serverless for PostgreSQL)
- MySQL-compatible database (MySQL, MariaDB, Aurora MySQL, or Aurora Serverless for MySQL)
- Microsoft SQL Server
- IBM Db2 LUW
Data validation works with the following target databases wherever AWS DMS supports them as target endpoints:
- Oracle
- PostgreSQL-compatible database (PostgreSQL, Aurora PostgreSQL, or Aurora Serverless for PostgreSQL)
- MySQL-compatible database (MySQL, MariaDB, Aurora MySQL, or Aurora Serverless for MySQL)
- Microsoft SQL Server
- IBM Db2 LUW
- Amazon Redshift
- Amazon S3. For information about validating Amazon S3 target data, see Amazon S3 target data validation.
For more information about the supported endpoints, see Working with AWS DMS endpoints.
Data validation requires additional time, beyond the amount required for the migration itself. The extra time required depends on how much data was migrated.
For more information about these settings, see Data validation task settings.
For an example of ValidationSettings task settings in a JSON file, see Task settings example.
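As a minimal sketch, the following AWS CLI command enables validation on an existing (stopped) task by modifying its ValidationSettings. The task ARN is a placeholder, and the ThreadCount and FailureMaxCount values are illustrative rather than recommendations:
```
# Enable validation on an existing task. Stop the task before modifying it.
# The task ARN is a placeholder; the setting values are illustrative only.
aws dms modify-replication-task \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
    --replication-task-settings '{"ValidationSettings": {"EnableValidation": true, "ThreadCount": 5, "FailureMaxCount": 10000}}'
```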
Replication task statistics
When data validation is enabled, AWS DMS provides the following statistics at the table level:
- ValidationState—The validation state of the table. The parameter can have the following values:
  - Not enabled—Validation isn't enabled for the table in the migration task.
  - Pending records—Some records in the table are waiting for validation.
  - Mismatched records—Some records in the table don't match between the source and target. A mismatch might occur for a number of reasons. For more information, check the awsdms_control.awsdms_validation_failures_v1 table on the target endpoint.
  - Suspended records—Some records in the table can't be validated.
  - No primary key—The table can't be validated because it has no primary key.
  - Table error—The table wasn't validated because it was in an error state and some data wasn't migrated.
  - Validated—All rows in the table are validated. If the table is updated, the status can change from Validated.
  - Error—The table can't be validated because of an unexpected error.
  - Pending validation—The table is waiting for validation.
  - Preparing table—The table enabled in the migration task is being prepared for validation.
  - Pending revalidation—All rows in the table are pending validation after the table was updated.
- ValidationPending—The number of records that have been migrated to the target, but that haven't yet been validated.
- ValidationSuspended—The number of records that AWS DMS can't compare. For example, if a record at the source is constantly being updated, AWS DMS can't compare the source and the target.
- ValidationFailed—The number of records that didn't pass the data validation phase.
For an example of ValidationSettings task settings in a JSON file, see Task settings example.
You can view the data validation information using the console, the AWS CLI, or the AWS DMS API.
On the console, you can choose to validate a task when you create or modify the task. To view the data validation report using the console, choose the task on the Tasks page and choose the Table statistics tab in the details section.
Using the CLI, set the EnableValidation parameter to true when creating or modifying a task to begin data validation. The following example creates a task and enables data validation.
```
create-replication-task
    --replication-task-settings '{"ValidationSettings":{"EnableValidation":true}}'
    --replication-instance-arn arn:aws:dms:us-east-1:5731014:rep:36KWVMB7Q
    --source-endpoint-arn arn:aws:dms:us-east-1:5731014:endpoint:CSZAEFQURFYMM
    --target-endpoint-arn arn:aws:dms:us-east-1:5731014:endpoint:CGPP7MF6WT4JQ
    --migration-type full-load-and-cdc
    --table-mappings '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": {"schema-name": "data_types", "table-name": "%"}, "rule-action": "include"}]}'
```
Use the describe-table-statistics command to receive the data validation report in JSON format. The following command shows the data validation report.
```
aws dms describe-table-statistics --replication-task-arn arn:aws:dms:us-east-1:5731014:rep:36KWVMB7Q
```
The report would be similar to the following.
```
{
    "ReplicationTaskArn": "arn:aws:dms:us-west-2:5731014:task:VFPFTYKK2RYSI",
    "TableStatistics": [
        {
            "ValidationPendingRecords": 2,
            "Inserts": 25,
            "ValidationState": "Pending records",
            "ValidationSuspendedRecords": 0,
            "LastUpdateTime": 1510181065.349,
            "FullLoadErrorRows": 0,
            "FullLoadCondtnlChkFailedRows": 0,
            "Ddls": 0,
            "TableName": "t_binary",
            "ValidationFailedRecords": 0,
            "Updates": 0,
            "FullLoadRows": 10,
            "TableState": "Table completed",
            "SchemaName": "d_types_s_sqlserver",
            "Deletes": 0
        }
    ]
}
```
Using the AWS DMS API, create a task using the CreateReplicationTask action and set the EnableValidation parameter to true to validate the data migrated by the task. Use the DescribeTableStatistics action to receive the data validation report in JSON format.
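For example, to list only the tables that currently report validation failures, you can filter the describe-table-statistics output with the CLI's JMESPath --query option. This is a sketch; the task ARN is a placeholder:
```
# Show schema, table, validation state, and failed-record count for tables
# with at least one validation failure. The task ARN is a placeholder.
aws dms describe-table-statistics \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
    --query 'TableStatistics[?ValidationFailedRecords > `0`].[SchemaName, TableName, ValidationState, ValidationFailedRecords]' \
    --output table
```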
Replication task statistics with Amazon CloudWatch
When Amazon CloudWatch is enabled, AWS DMS provides the following replication task statistics:
- ValidationSucceededRecordCount—Number of rows that AWS DMS validated, per minute.
- ValidationAttemptedRecordCount—Number of rows for which validation was attempted, per minute.
- ValidationFailedOverallCount—Number of rows where validation failed.
- ValidationSuspendedOverallCount—Number of rows where validation was suspended.
- ValidationPendingOverallCount—Number of rows where validation is still pending.
- ValidationBulkQuerySourceLatency—AWS DMS can perform data validation in bulk, especially in certain scenarios during a full load or ongoing replication when there are many changes. This metric indicates the latency required to read a bulk set of data from the source endpoint.
- ValidationBulkQueryTargetLatency—AWS DMS can perform data validation in bulk, especially in certain scenarios during a full load or ongoing replication when there are many changes. This metric indicates the latency required to read a bulk set of data from the target endpoint.
- ValidationItemQuerySourceLatency—During ongoing replication, data validation can identify ongoing changes and validate those changes. This metric indicates the latency in reading those changes from the source. Validation can run more queries than required, based on the number of changes, if there are errors during validation.
- ValidationItemQueryTargetLatency—During ongoing replication, data validation can identify ongoing changes and validate them row by row. This metric indicates the latency in reading those changes from the target. Validation can run more queries than required, based on the number of changes, if there are errors during validation.
To collect data validation information from CloudWatch enabled statistics, select Enable CloudWatch logs when you create or modify a task using the console. Then, to view the data validation information and ensure that your data was migrated accurately from source to target, do the following.
- Choose the task on the Database migration tasks page.
- Choose the CloudWatch metrics tab.
- Select Validation from the drop-down menu.
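You can also retrieve these metrics with the CloudWatch CLI. The following is a sketch; the identifiers are placeholders, and the ReplicationTaskIdentifier and ReplicationInstanceIdentifier dimensions are assumptions about how AWS DMS publishes task-level metrics in the AWS/DMS namespace:
```
# Pull the ValidationFailedOverallCount metric for one task over an hour.
# Identifiers are placeholders; the dimension names are assumptions.
aws cloudwatch get-metric-statistics \
    --namespace "AWS/DMS" \
    --metric-name ValidationFailedOverallCount \
    --dimensions Name=ReplicationTaskIdentifier,Value=example-task \
                 Name=ReplicationInstanceIdentifier,Value=example-instance \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-01T01:00:00Z \
    --period 300 \
    --statistics Maximum
```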
Revalidating tables during a task
While a task is running, you can request AWS DMS to perform data validation.
AWS Management Console
- Sign in to the AWS Management Console and open the AWS DMS console at https://console.aws.amazon.com/dms/v2/. If you're signed in as an AWS Identity and Access Management (IAM) user, make sure that you have the appropriate permissions to access AWS DMS. For information about the permissions required, see IAM permissions needed to use AWS DMS.
- Choose Tasks from the navigation pane.
- Choose the running task that has the table you want to revalidate.
- Choose the Table Statistics tab.
- Choose the table you want to revalidate (you can choose up to 10 tables at one time). If the task is no longer running, you can't revalidate the tables.
- Choose Revalidate.
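You can also trigger revalidation from the AWS CLI with the reload-tables command and its validate-only reload option. The following is a sketch; the task ARN, schema name, and table name are placeholders:
```
# Revalidate a single table without reloading its data.
# The task ARN, schema name, and table name are placeholders.
aws dms reload-tables \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
    --tables-to-reload SchemaName=data_types,TableName=t_binary \
    --reload-option validate-only
```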
Using JSON editor to modify validation rules
To add a validation rule to a task using the JSON editor from the AWS DMS Console, do the following:
- Select Database migration tasks.
- Select your task from the list of migration tasks.
- If your task is running, select Stop from the Actions drop-down menu.
- Once the task has stopped, select Modify from the Actions drop-down menu to modify your task.
- In the Table mappings section, select JSON editor and add your validation rule to your table mappings.
For example, you can add the following validation rule to run a replace function on the source. In this case, if the validation rule encounters a null byte, it validates it as a space.
{ "rule-type": "validation", "rule-id": "1", "rule-name": "1", "rule-target": "column", "object-locator": { "schema-name": "Test-Schema", "table-name": "Test-Table", "column-name": "Test-Column" }, "rule-action": "override-validation-function", "source-function": "REPLACE(${column-name}, chr(0), chr(32))", "target-function": "${column-name}" }
Validation only tasks
You can create validation only tasks to preview and validate data without performing any migration or data replication. To create a validation only task, set the EnableValidation and ValidationOnly settings to true. When you enable ValidationOnly, additional requirements apply. For more information, see Data validation task settings.
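For example, the following is a minimal sketch of creating a full load validation only task with the AWS CLI. The ARNs are placeholders, the table mapping includes all tables, and note that ValidationOnly imposes additional requirements on other task settings, as described in Data validation task settings:
```
# Create a full load validation only task. All ARNs are placeholders.
aws dms create-replication-task \
    --replication-task-identifier example-validation-only \
    --migration-type full-load \
    --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
    --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE \
    --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:TARGET \
    --table-mappings '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": {"schema-name": "%", "table-name": "%"}, "rule-action": "include"}]}' \
    --replication-task-settings '{"ValidationSettings": {"EnableValidation": true, "ValidationOnly": true}}'
```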
For the full load only migration type, a validation only task completes much faster than its CDC equivalent when many failures are reported. However, in full load mode, changes to the source or target endpoint are reported as failures, which can be a disadvantage.
A CDC validation only task delays validation based on average latency, and retries failures multiple times before reporting them. If the majority of data comparisons result in failures, a validation only task in CDC mode can be very slow, a potential drawback.
A validation only task must be set up in the same direction as the replication task, especially for CDC. This is because a CDC validation only task detects which rows have changed, and therefore need to be revalidated, based on the change log on the source. If the target is specified as the source, the task only knows about changes sent to the target by AWS DMS and isn't guaranteed to catch replication errors.
Full load validation only
Beginning with AWS DMS version 3.4.6, a full load validation only task quickly compares all rows from the source and target tables in a single pass, immediately reports any failures, and then shuts down. Validation is never suspended due to failures in this mode; it's optimized for speed. However, changes to the source or target endpoint are reported as failures.
Note
Beginning with AWS DMS version 3.4.6, this validation behavior also applies to a full load migration task with validation enabled.
CDC validation only
A CDC validation only task validates all existing rows between the source and target tables on a fresh start. In addition, a CDC validation only task runs continuously, re-validates ongoing replication changes, limits the number of failures reported each pass, and retries mismatched rows before failing them. It is optimized to prevent false positives.
Validation for a table (or the entire task) is suspended if the FailureMaxCount or TableFailureMaxCount thresholds are breached. This also applies to a CDC or full-load-and-CDC migration task with validation enabled. In addition, a CDC task with validation enabled delays revalidation for each changed row based on average source and target latency.
A CDC validation only task, however, doesn't migrate data and has no latency. It sets ValidationQueryCdcDelaySeconds to 180 by default; you can increase this value to account for high-latency environments and help prevent false positives.
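For example, a sketch of raising the delay for a high-latency environment (the task ARN is a placeholder, and 600 seconds is an illustrative value, not a recommendation):
```
# Raise the CDC validation delay to reduce false positives under high latency.
# The task ARN is a placeholder; 600 seconds is illustrative only.
aws dms modify-replication-task \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK \
    --replication-task-settings '{"ValidationSettings": {"EnableValidation": true, "ValidationOnly": true, "ValidationQueryCdcDelaySeconds": 600}}'
```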
Validation only use cases
Use cases for splitting the data validation portion of a migration or replication task into a separate validation only task include, but are not limited to, the following:
- Control exactly when validation occurs—Validation queries add load to both the source and target endpoints. So, migrating or replicating data in one task first, then validating the results in another task, can be beneficial.
- Reduce load on the replication instance—Splitting data validation to run on its own instance can be advantageous.
- Quickly determine how many rows don't match at a given moment in time—For example, just before or during a maintenance window for a production cutover to a target endpoint, you can create a full load validation only task to get an answer.
- When validation failures are expected for a migration task with a CDC component—For example, if you migrate Oracle varchar2 to PostgreSQL jsonb, CDC validation keeps retrying these failed rows and limits the number of failures reported each time. Instead, you can create a full load validation only task and obtain a quicker answer.
- You've developed a data repair script or utility that reads the validation failure table—(See also Troubleshooting.) A full load validation only task quickly reports failures for the data repair script to act upon.
For an example of ValidationSettings task settings in a JSON file, see Task settings example.
Troubleshooting
During validation, AWS DMS creates a new table at the target endpoint: awsdms_control.awsdms_validation_failures_v1. If any record enters the ValidationSuspended or the ValidationFailed state, AWS DMS writes diagnostic information to awsdms_control.awsdms_validation_failures_v1. You can query this table to help troubleshoot validation errors.
For information about changing the default schema the table is created in on the target, see Control table task settings.
Following is a description of the awsdms_control.awsdms_validation_failures_v1 table:
| Column name | Data type | Description |
| --- | --- | --- |
| TASK_NAME | VARCHAR(128) NOT NULL | AWS DMS task identifier. |
| TABLE_OWNER | VARCHAR(128) NOT NULL | Schema (owner) of the table. |
| TABLE_NAME | VARCHAR(128) NOT NULL | Table name. |
| FAILURE_TIME | DATETIME(3) NOT NULL | Time when the failure occurred. |
| KEY_TYPE | VARCHAR(128) NOT NULL | Reserved for future use (value is always 'Row'). |
| KEY | TEXT NOT NULL | The primary key for the row record type. |
| FAILURE_TYPE | VARCHAR(128) NOT NULL | Severity of the validation error. Can be RECORD_DIFF, MISSING_SOURCE, or MISSING_TARGET. |
| DETAILS | VARCHAR(8000) NOT NULL | JSON-formatted string of all source/target column values that don't match for the given key. |
The following is a sample query for a MySQL target that shows all the failures for a task by querying the awsdms_control.awsdms_validation_failures_v1 table. Note that the schema name and query syntax vary across target engines. The task name should be the external resource ID of the task, which is the last value in the task ARN. For example, for a task with an ARN value of arn:aws:dms:us-west-2:5599:task:VFPFKH4FJR3FTYKK2RYSI, the external resource ID of the task would be VFPFKH4FJR3FTYKK2RYSI.
```
select * from awsdms_validation_failures_v1 where TASK_NAME = 'VFPFKH4FJR3FTYKK2RYSI'

TASK_NAME     VFPFKH4FJR3FTYKK2RYSI
TABLE_OWNER   DB2PERF
TABLE_NAME    PERFTEST
FAILURE_TIME  2020-06-11 21:58:44
KEY_TYPE      Row
KEY           {"key": ["3451491"]}
FAILURE_TYPE  RECORD_DIFF
DETAILS       [[{'MYREAL': '+1.10106036e-01'}, {'MYREAL': '+1.10106044e-01'}],]
```
You can look at the DETAILS field to determine which columns don't match. Since you have the primary key of the failed record, you can query the source and target endpoints to see what part of the record doesn't match.
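For example, using the sample failure above, you might fetch the row from the MySQL target by its key and compare it with the corresponding source row. This is a hypothetical sketch; the host, user, and key column name (ID) are placeholders inferred from the sample record:
```
# Hypothetical follow-up based on the sample failure record: inspect the
# MYREAL column of the mismatched row on the target. The host, user, and
# key column name (ID) are placeholders.
mysql -h target-host -u dms_user -p \
    -e "SELECT MYREAL FROM DB2PERF.PERFTEST WHERE ID = 3451491;"
```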
Redshift Validation Performance
Amazon Redshift differs from relational databases in several ways, including columnar storage, massively parallel processing (MPP), data compression, and other factors. These differences give Redshift a performance profile that's different from row-based relational databases.
During the full-load replication phase, validation uses range queries, with the data size governed by the PartitionSize setting. These range-based queries select all the records from the source table.
For ongoing replication, queries switch between range-based and individual-record fetches. The query type is determined dynamically based on multiple factors, such as the following:
- Query volume
- Types of DML queries on the source table
- Task latency
- Total number of records
- Validation settings such as PartitionSize
You may see additional load on your Amazon Redshift cluster due to validation queries. Because these factors vary across use cases, you must review your validation query performance and tune your cluster and tables accordingly. Some options to mitigate performance issues include the following:
- Reduce the PartitionSize and ThreadCount settings to help reduce the workload during full-load validation (see the sketch after this list). Note that this slows down data validation.
- While Redshift doesn't enforce primary keys, AWS DMS relies on primary keys to uniquely identify records on the target for data validation. If possible, set the primary key to mirror the sort key so that full load validation queries execute more quickly.
Enhanced data validation for AWS Database Migration Service
Enhanced data validation is now available in version 3.5.4 of the replication engine for both full load and full load with CDC migration tasks. Currently, this enhancement supports migration paths from Oracle to PostgreSQL, SQL Server to PostgreSQL, Oracle to Oracle, and SQL Server to SQL Server.
Prerequisites
- Oracle: grant the EXECUTE permission on SYS.DBMS_CRYPTO to the user account that accesses the Oracle endpoint:
  ```
  GRANT EXECUTE ON SYS.DBMS_CRYPTO TO dms_endpoint_user;
  ```
- Install the pgcrypto extension on the PostgreSQL database.
  Note: For Amazon RDS for PostgreSQL instances, the pgcrypto extension is already enabled.
  For self-managed PostgreSQL instances, you need to install the contrib module libraries and create the extension:
  - Install the contrib module libraries. For example, on an Amazon EC2 instance with Amazon Linux and PostgreSQL 15:
    ```
    sudo dnf install postgresql15-contrib
    ```
  - Create the pgcrypto extension:
    ```
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    ```
- For Amazon RDS for PostgreSQL instances, configure the SSL mode for the AWS DMS endpoint:
  - By default, Amazon RDS forces an SSL connection. When you create an AWS DMS endpoint for an Amazon RDS for PostgreSQL instance, set the "SSL mode" option to "required".
  - If you want to set the "SSL mode" option to "none", set the rds.force_ssl parameter to 0 in the RDS parameter group.
- For PostgreSQL 12 and 13, create the BIT_XOR aggregate:
  ```
  CREATE OR REPLACE AGGREGATE BIT_XOR(IN v bit) (SFUNC = bitxor, STYPE = bit);
  ```
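After completing these prerequisites, you can confirm that pgcrypto is in place with a quick check. This is a sketch; the host, database, and user are placeholders:
```
# Verify that the pgcrypto extension is installed on the PostgreSQL endpoint.
# Host, database, and user are placeholders.
psql -h target-host -d target_db -U dms_endpoint_user \
    -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'pgcrypto';"
```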
Limitations
This enhanced data validation feature has the following limitations:
- Database endpoint requirements: This improvement is enabled only for database endpoints that meet the following criteria:
  - Use AWS Secrets Manager to store credentials.
  - For Microsoft SQL Server, Kerberos authentication is also supported.
- Database version support:
  - PostgreSQL 12 and higher
  - Oracle 12.1 and higher
  - For Microsoft SQL Server versions lower than 2019, validation of NCHAR and NVARCHAR data types isn't supported.
Limitations
- Data validation requires that the table has a primary key or unique index.
  - Primary key columns can't be of type CLOB, BLOB, or BYTE.
  - For primary key columns of type VARCHAR or CHAR, the length must be less than 1024. You must specify the length in the data type. You can't use unbounded data types as a primary key for data validation.
  - An Oracle key created with the NOVALIDATE clause isn't considered a primary key or unique index.
  - For an Oracle table with no primary key and only a unique key, the columns with the unique constraint must also have a NOT NULL constraint.
- Validation of NULL PK/UK values isn't supported.
- If the collation of the primary key column in the target PostgreSQL instance isn't set to "C", the sort order of the primary key is different compared to the sort order in Oracle. If the sort order is different between PostgreSQL and Oracle, data validation fails to validate the records.
- Data validation generates additional queries against the source and target databases. You must ensure that both databases have enough resources to handle this additional load. This is especially true for Redshift targets. For more information, see Redshift Validation Performance following.
- Data validation isn't supported when consolidating several databases into one.
- For a source or target Oracle endpoint, AWS DMS uses DBMS_CRYPTO to validate LOBs. If your Oracle endpoint uses LOBs, then you must grant the execute permission on dbms_crypto to the user account used to access the Oracle endpoint. You can do this by running the following statement:
  ```
  grant execute on sys.dbms_crypto to dms_endpoint_user;
  ```
- If the target database is modified outside of AWS DMS during validation, then discrepancies might not be reported accurately. This result can occur if one of your applications writes data to the target table while AWS DMS is performing validation on that same table.
- If one or more rows are being continuously modified during validation, then AWS DMS can't validate those rows.
- If AWS DMS detects more than 10,000 failed or suspended records, it stops the validation. Before you proceed further, resolve any underlying problems with the data.
- AWS DMS doesn't support data validation of views.
- AWS DMS doesn't support data validation when character substitution task settings are used.
- AWS DMS doesn't support validating the Oracle LONG type.
- AWS DMS doesn't support validating the Oracle Spatial type during heterogeneous migration.
- Data validation ignores columns in tables for which data masking transformations exist in the table mapping.
- Data validation skips an entire table if there is a data masking transformation rule for its PK/UK column. The validation state shows as No primary key for such tables.
For limitations when using S3 target validation, see Limitations for using S3 target validation.