
Synthetic dataset

SageMaker Clarify uses the Kernel SHAP algorithm. Given a record (also called a sample or an instance) and the SHAP configuration, the explainer first generates a synthetic dataset. SageMaker Clarify then queries the model container for predictions on the synthetic dataset, and computes and returns the feature attributions. The size of the synthetic dataset affects the runtime of the Clarify explainer: the larger the dataset, the longer it takes to obtain model predictions.
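
Conceptually, each synthetic record is the input record with some subset (coalition) of its features replaced by values from a baseline record. The following Python sketch is illustrative only, not Clarify's implementation (Kernel SHAP samples coalitions rather than enumerating all of them when that is infeasible), but it shows where the 2^n_features - 2 bound in the formulas below comes from: that is the number of non-trivial feature coalitions.

import itertools

def coalition_rows(record, baseline_row):
    # One synthetic row per feature coalition, skipping the empty coalition
    # (every feature taken from the baseline) and the full coalition (the
    # record itself): 2^n - 2 rows for n features.
    n = len(record)
    rows = []
    for mask in itertools.product([0, 1], repeat=n):
        if not any(mask) or all(mask):
            continue
        rows.append([record[i] if keep else baseline_row[i]
                     for i, keep in enumerate(mask)])
    return rows

print(len(coalition_rows([1, 2, 3], [0, 0, 0])))  # 2^3 - 2 = 6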

The synthetic dataset size is determined by the following formula:

Synthetic dataset size = SHAP baseline size * n_samples

The SHAP baseline size is the number of records in the SHAP baseline data. This information is taken from the ShapBaselineConfig.

The value of n_samples is determined by the NumberOfSamples parameter in the explainer configuration and by the number of features. If the number of features is n_features, then n_samples is the following:

n_samples = MIN(NumberOfSamples, 2^n_features - 2)

If NumberOfSamples is not provided, n_samples defaults to the following:

n_samples = MIN(2*n_features + 2^11, 2^n_features - 2)

For example, consider a tabular record with 10 features and a SHAP baseline of size 1. If NumberOfSamples is not provided, the synthetic dataset contains 1022 records. If the record instead has 20 features, the synthetic dataset contains 2088 records.
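
These formulas can be checked with a short Python sketch (the helper functions here are illustrative, not part of the SageMaker API):

def n_samples(n_features, number_of_samples=None):
    # 2^n_features - 2 non-trivial coalitions is an upper bound on the
    # number of samples generated per baseline record.
    upper_bound = 2**n_features - 2
    if number_of_samples is None:
        # Default when NumberOfSamples is not provided
        return min(2 * n_features + 2**11, upper_bound)
    return min(number_of_samples, upper_bound)

def synthetic_dataset_size(shap_baseline_size, n_features, number_of_samples=None):
    return shap_baseline_size * n_samples(n_features, number_of_samples)

print(synthetic_dataset_size(1, 10))  # 1022
print(synthetic_dataset_size(1, 20))  # 2088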

For NLP problems, n_features is equal to the number of non-text features plus the number of text units.
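
For example, if a record has 2 non-text features and one text feature that Clarify splits into 8 text units (hypothetical counts; the granularity of a text unit is set by the SHAP TextConfig), then n_features = 2 + 8 = 10, and the formulas above apply unchanged.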

Note

The InvokeEndpoint API has a request timeout limit. If the synthetic dataset is too large, the explainer may not be able to complete the computation within this limit. If necessary, use the preceding formulas to reduce the SHAP baseline size and NumberOfSamples. If your model container can handle batch requests, you can also adjust the value of MaxRecordCount.
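
For reference, these parameters live in the endpoint configuration. The following is a minimal sketch assuming boto3 and a tabular CSV model; the names and values are placeholders, not recommendations:

import boto3

# Placeholder explainer configuration; pass it as the ExplainerConfig
# parameter of CreateEndpointConfig alongside your ProductionVariants.
explainer_config = {
    "ClarifyExplainerConfig": {
        "ShapConfig": {
            # One baseline record keeps the SHAP baseline size at 1
            "ShapBaselineConfig": {
                "MimeType": "text/csv",
                "ShapBaseline": "1,2,3,4,5,6,7,8,9,10",
            },
            # Caps n_samples, as described above
            "NumberOfSamples": 1022,
        },
        "InferenceConfig": {
            # Records per request when the container supports batch requests
            "MaxRecordCount": 100,
        },
    },
}

sagemaker = boto3.client("sagemaker")
# sagemaker.create_endpoint_config(
#     EndpointConfigName="my-explainer-config",  # placeholder name
#     ProductionVariants=[...],                  # your model variant(s)
#     ExplainerConfig=explainer_config,
# )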
