Create Regression or Classification Jobs for Tabular Data Using the AutoML API
You can create an Autopilot regression or classification job for tabular data programmatically by calling the CreateAutoMLJobV2 API action in any language supported by Autopilot or the AWS CLI. The following sections describe the mandatory and optional input request parameters for the CreateAutoMLJobV2 API action. Equivalent information is available for the previous version of this action, CreateAutoMLJob. However, we recommend using CreateAutoMLJobV2.
For information on how this API action translates into a function in the language of your choice, see the See Also section of CreateAutoMLJobV2 and choose an SDK. As an example, for Python users, see the full request syntax of create_auto_ml_job_v2 in AWS SDK for Python (Boto3).
Note
CreateAutoMLJobV2 and DescribeAutoMLJobV2 are new versions of CreateAutoMLJob and DescribeAutoMLJob that offer backward compatibility.
We recommend using CreateAutoMLJobV2. CreateAutoMLJobV2 can manage the same tabular problem types as its previous version, CreateAutoMLJob, as well as non-tabular problem types such as image or text classification, or time-series forecasting.
At a minimum, every experiment on tabular data requires an experiment name, locations for the input and output data, and the target column to predict. Optionally, you can also specify the type of problem that you want to solve (regression, binary classification, or multiclass classification), choose your modeling strategy (stacked ensembles or hyperparameter optimization), select the list of algorithms that the Autopilot job uses to train on the data, and more.
After the experiment runs, you can compare trials and delve into the details of the pre-processing steps, algorithms, and hyperparameter ranges of each model. You also have the option to download their explainability and performance reports. Use the provided notebooks to see the results of the automated data exploration or the candidate model definitions.
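Before you can compare trials, the job has to reach a terminal state. The following is a minimal sketch of polling DescribeAutoMLJobV2 from Python. The helper name wait_for_job, the polling interval, and the set of terminal statuses chosen here are illustrative assumptions, not part of the API; the client argument is assumed to be a Boto3 SageMaker client.

```python
import time

# Statuses treated as terminal in this sketch (an assumption; adjust as needed).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_name, poll_seconds=60, max_polls=120):
    """Poll DescribeAutoMLJobV2 until the job reaches a terminal status.

    `client` is expected to expose describe_auto_ml_job_v2, for example
    boto3.client("sagemaker"). Returns the last describe response.
    """
    for _ in range(max_polls):
        response = client.describe_auto_ml_job_v2(AutoMLJobName=job_name)
        if response["AutoMLJobStatus"] in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name!r} did not finish after {max_polls} polls")


# Usage (requires AWS credentials; the job name is hypothetical):
#   import boto3
#   result = wait_for_job(boto3.client("sagemaker"), "my-tabular-job")
#   best = result.get("BestCandidate")
```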
Find guidelines on how to migrate a CreateAutoMLJob to CreateAutoMLJobV2 in Migrate a CreateAutoMLJob to CreateAutoMLJobV2.
Required parameters
All other parameters are optional.
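As an illustration of the minimum described earlier (a job name, input and output locations, and a target column), here is a sketch of a request built in Python. The helper name build_minimal_request, the CSV content type, and all argument values are assumptions made for the example; the top-level keys reflect our reading of the CreateAutoMLJobV2 request syntax and should be checked against the API reference.

```python
def build_minimal_request(job_name, input_s3_uri, output_s3_uri,
                          target_column, role_arn):
    """Return a dict with only the fields assumed to be required for tabular data."""
    return {
        "AutoMLJobName": job_name,
        "AutoMLJobInputDataConfig": [
            {
                "ChannelType": "training",
                # Assumes a CSV dataset with a header row.
                "ContentType": "text/csv;header=present",
                "CompressionType": "None",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "AutoMLProblemTypeConfig": {
            "TabularJobConfig": {"TargetAttributeName": target_column}
        },
        "RoleArn": role_arn,
    }


# Usage (requires AWS credentials; every value below is a placeholder):
#   import boto3
#   request = build_minimal_request(
#       "my-tabular-job", "s3://amzn-s3-demo-bucket/input/",
#       "s3://amzn-s3-demo-bucket/output/", "label",
#       "arn:aws:iam::111122223333:role/MySageMakerRole")
#   boto3.client("sagemaker").create_auto_ml_job_v2(**request)
```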
Optional parameters
The following sections provide details of some optional parameters that you can pass to your CreateAutoMLJobV2 API action when using tabular data. You can find the alternative information for the previous version of this action, CreateAutoMLJob. However, we recommend using CreateAutoMLJobV2.
For tabular data, the set of algorithms run on your data to train your model candidates depends on your modeling strategy (ENSEMBLING or HYPERPARAMETER_TUNING). The following details how to set this training mode.
If you leave Mode blank (or null), the Mode is inferred based on the size of your dataset.
For information on Autopilot's stacked ensembles and hyperparameter optimization training methods, see Training modes and algorithm support.
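A short sketch of setting the training mode follows. It assumes that Mode is a field of TabularJobConfig and that AUTO is accepted alongside the two modes named above; the helper with_training_mode is hypothetical.

```python
# Mode values assumed from the API reference; AUTO is not named in this section.
VALID_MODES = {"AUTO", "ENSEMBLING", "HYPERPARAMETER_TUNING"}


def with_training_mode(tabular_job_config, mode=None):
    """Return a copy of the config with Mode set, or omitted when mode is None.

    Omitting Mode lets Autopilot infer it from the dataset size.
    """
    config = dict(tabular_job_config)
    if mode is None:
        config.pop("Mode", None)
        return config
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported Mode: {mode!r}")
    config["Mode"] = mode
    return config


base = {"TargetAttributeName": "label"}  # placeholder target column
ensembling_config = with_training_mode(base, "ENSEMBLING")
inferred_config = with_training_mode(ensembling_config)  # Mode left to Autopilot
```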
Features selection
Autopilot provides automatic data-preprocessing steps including feature selection and feature extraction. However, you can manually provide the features to be used in training with the FeatureSpecificationS3Uri attribute.
Selected features should be contained within a JSON file in the following format:
{ "FeatureAttributeNames":["col1", "col2", ...] }
The values listed in ["col1", "col2", ...] are case sensitive. They should be a list of unique strings that form a subset of the column names in the input data.
Note
The list of columns provided as features cannot include the target column.
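The constraints above (unique names, case-sensitive match against the input columns, target excluded) can be checked locally before uploading the file. This is a sketch only; build_feature_spec and the column names are hypothetical, and uploading the result to Amazon S3 is left out.

```python
import json


def build_feature_spec(selected, input_columns, target_column):
    """Validate the selected features and return the JSON document as a string."""
    if len(set(selected)) != len(selected):
        raise ValueError("Feature names must be unique")
    # Membership test is case sensitive, matching the rule stated above.
    unknown = [name for name in selected if name not in input_columns]
    if unknown:
        raise ValueError(f"Not columns of the input data: {unknown}")
    if target_column in selected:
        raise ValueError("The target column cannot be listed as a feature")
    return json.dumps({"FeatureAttributeNames": list(selected)})


columns = ["age", "income", "city", "label"]  # placeholder dataset columns
spec = build_feature_spec(["age", "income"], columns, target_column="label")
```

After uploading the resulting file, its S3 URI would be passed as FeatureSpecificationS3Uri in the request.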
Algorithms selection
By default, your Autopilot job runs a pre-defined list of algorithms on your dataset to train model candidates. The list of algorithms depends on the training mode (ENSEMBLING or HYPERPARAMETER_TUNING) used by the job.
You can provide a subset of the default selection of algorithms.
For the list of available algorithms per training Mode, see AutoMLAlgorithms. For details on each algorithm, see Training modes and algorithm support.
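The sketch below shows one way to build a CandidateGenerationConfig that restricts the algorithms. The algorithm names per mode are transcribed from our reading of the AutoMLAlgorithms reference and are an assumption here; verify them against the current API reference. The helper candidate_generation_config is hypothetical.

```python
# Assumed algorithm names per training mode; check the AutoMLAlgorithms reference.
ALGORITHMS_BY_MODE = {
    "ENSEMBLING": {
        "catboost", "extra-trees", "fastai", "lightgbm",
        "linear-learner", "nn-torch", "randomforest", "xgboost",
    },
    "HYPERPARAMETER_TUNING": {"linear-learner", "mlp", "xgboost"},
}


def candidate_generation_config(mode, algorithms):
    """Return a CandidateGenerationConfig limited to the given algorithms."""
    unsupported = sorted(set(algorithms) - ALGORITHMS_BY_MODE[mode])
    if unsupported:
        raise ValueError(f"Not available in {mode} mode: {unsupported}")
    return {"AlgorithmsConfig": [{"AutoMLAlgorithms": list(algorithms)}]}


tabular_job_config = {
    "TargetAttributeName": "label",  # placeholder target column
    "Mode": "ENSEMBLING",
    "CandidateGenerationConfig": candidate_generation_config(
        "ENSEMBLING", ["xgboost", "lightgbm"]
    ),
}
```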
You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.
For information on split and cross-validation in Autopilot, see Cross-validation in Autopilot.
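The two ways of controlling the split can be sketched as follows. The field names DataSplitConfig, ValidationFraction, and the "validation" channel type reflect our understanding of the request syntax; the helper names, S3 URIs, and CSV content type are assumptions for illustration.

```python
def split_by_fraction(validation_fraction):
    """Request fragment asking Autopilot to hold out a fraction for validation."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    return {"DataSplitConfig": {"ValidationFraction": validation_fraction}}


def _channel(s3_uri, channel_type):
    # Assumes CSV data with a header row stored under an S3 prefix.
    return {
        "ChannelType": channel_type,
        "ContentType": "text/csv;header=present",
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}
        },
    }


def split_by_channels(training_uri, validation_uri):
    """Request fragment that supplies your own validation dataset."""
    return {
        "AutoMLJobInputDataConfig": [
            _channel(training_uri, "training"),
            _channel(validation_uri, "validation"),
        ]
    }


fraction_fragment = split_by_fraction(0.2)
channel_fragment = split_by_channels(
    "s3://amzn-s3-demo-bucket/train/", "s3://amzn-s3-demo-bucket/validation/"
)
```

Either fragment would be merged into the top level of the CreateAutoMLJobV2 request.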
Note
In some cases, Autopilot is unable to infer the ProblemType with high enough confidence, in which case you must provide the value for the job to succeed.
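When the problem type has to be given explicitly, it can be set next to the target column. The sketch assumes that ProblemType belongs to TabularJobConfig and accepts the three values listed in the code; the helper with_problem_type is hypothetical.

```python
# Values assumed from the API reference for tabular problem types.
PROBLEM_TYPES = {"BinaryClassification", "MulticlassClassification", "Regression"}


def with_problem_type(tabular_job_config, problem_type):
    """Return a copy of the config with an explicit ProblemType."""
    if problem_type not in PROBLEM_TYPES:
        raise ValueError(f"Unknown ProblemType: {problem_type!r}")
    config = dict(tabular_job_config)
    config["ProblemType"] = problem_type
    return config


explicit_config = with_problem_type(
    {"TargetAttributeName": "price"},  # placeholder target column
    "Regression",
)
```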
You can add a sample weights column to your tabular dataset and then pass it to your AutoML job to request dataset rows to be weighted during training and evaluation.
Support for sample weights is available in ensembling mode only. Your weights should be numeric and non-negative. Data points with invalid or no weight value are excluded. For more information on the available objective metrics, see Autopilot weighted metrics.
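The following sketch shows a local pre-check for weight values and how the weights column might be referenced in the request. Treating infinite values as invalid is our assumption, as are the helper names and column names; SampleWeightAttributeName is assumed to sit in TabularJobConfig.

```python
import math


def is_valid_weight(value):
    """Return True for a numeric, finite, non-negative weight."""
    try:
        weight = float(value)
    except (TypeError, ValueError):
        return False  # missing or non-numeric values
    return math.isfinite(weight) and weight >= 0


def with_sample_weights(tabular_job_config, weight_column):
    """Return a copy of the config that uses sample weights in ensembling mode."""
    config = dict(tabular_job_config)
    config["Mode"] = "ENSEMBLING"  # sample weights are supported in this mode only
    config["SampleWeightAttributeName"] = weight_column
    return config


weighted_config = with_sample_weights(
    {"TargetAttributeName": "label"},  # placeholder target column
    "row_weight",                      # placeholder weights column
)
```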
You can configure your AutoML job V2 to automatically initiate a remote job on Amazon EMR Serverless when additional compute resources are needed to process large datasets. By seamlessly transitioning to EMR Serverless when required, the AutoML job can handle datasets that would otherwise exceed the initially provisioned resources, without any manual intervention from you. EMR Serverless is available for the tabular and time series problem types. We recommend setting up this option for tabular datasets larger than 5 GB.
To allow your AutoML job V2 to automatically transition to EMR Serverless for large datasets, you need to provide an EmrServerlessComputeConfig object, which includes an ExecutionRoleARN field, to the AutoMLComputeConfig of the AutoML job V2 input request.
The ExecutionRoleARN is the ARN of the IAM role granting the AutoML job V2 the necessary permissions to run EMR Serverless jobs.
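In request terms, the nesting described above looks like the following sketch. The helper with_emr_serverless and the role ARN are placeholders chosen for the example.

```python
def with_emr_serverless(request, execution_role_arn):
    """Return a copy of the request that allows a transition to EMR Serverless."""
    updated = dict(request)
    updated["AutoMLComputeConfig"] = {
        "EmrServerlessComputeConfig": {"ExecutionRoleARN": execution_role_arn}
    }
    return updated


emr_request = with_emr_serverless(
    {"AutoMLJobName": "my-tabular-job"},  # other request fields omitted
    "arn:aws:iam::111122223333:role/MyAutoMLEmrExecutionRole",
)
```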
This role should have the following trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
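One possible way to create such a role with Boto3 is sketched below. The helper create_execution_role, the role name, and the inline policy name are hypothetical. The iam argument is assumed to be a Boto3 IAM client, and the permissions policy document is passed in by the caller rather than reproduced here.

```python
import json

# Trust relationship from the document above.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "emr-serverless.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_execution_role(iam, role_name, permissions_policy,
                          policy_name="AutoMLEmrServerlessAccess"):
    """Create the role with the trust policy and attach an inline policy.

    Returns the ARN to use as ExecutionRoleARN.
    """
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(permissions_policy),
    )
    return role["Role"]["Arn"]


# Usage (requires AWS credentials and IAM permissions):
#   import boto3
#   arn = create_execution_role(boto3.client("iam"), "MyAutoMLEmrExecutionRole",
#                               permissions_policy)
```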
The role should also grant permissions to:
- Create, list, and update EMR Serverless applications.
- Start, list, get, or cancel job runs on an EMR Serverless application.
- Tag EMR Serverless resources.
- Pass an IAM role to the EMR Serverless service for execution.
Granting the iam:PassRole permission lets the AutoML job V2 temporarily assume the EMRServerlessRuntimeRole-* role and pass it to the EMR Serverless service. These are the IAM roles used by the EMR Serverless job execution environments to access other AWS services and resources needed during runtime, such as Amazon S3 for data access, CloudWatch for logging, and the AWS Glue Data Catalog or other services, depending on your workload requirements. See Job runtime roles for Amazon EMR Serverless for details on the permissions of this role.
The following IAM policy grants those permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EMRServerlessCreateApplicationOperation",
      "Effect": "Allow",
      "Action": "emr-serverless:CreateApplication",
      "Resource": "arn:aws:emr-serverless:*:*:/*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessListApplicationOperation",
      "Effect": "Allow",
      "Action": "emr-serverless:ListApplications",
      "Resource": "arn:aws:emr-serverless:*:*:/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessApplicationOperations",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:UpdateApplication",
        "emr-serverless:GetApplication"
      ],
      "Resource": "arn:aws:emr-serverless:*:*:/applications/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessStartJobRunOperation",
      "Effect": "Allow",
      "Action": "emr-serverless:StartJobRun",
      "Resource": "arn:aws:emr-serverless:*:*:/applications/*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessListJobRunOperation",
      "Effect": "Allow",
      "Action": "emr-serverless:ListJobRuns",
      "Resource": "arn:aws:emr-serverless:*:*:/applications/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessJobRunOperations",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:GetJobRun",
        "emr-serverless:CancelJobRun"
      ],
      "Resource": "arn:aws:emr-serverless:*:*:/applications/*/jobruns/*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "EMRServerlessTagResourceOperation",
      "Effect": "Allow",
      "Action": "emr-serverless:TagResource",
      "Resource": "arn:aws:emr-serverless:*:*:/*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/sagemaker:is-canvas-resource": "True",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    },
    {
      "Sid": "IAMPassOperationForEMRServerless",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/EMRServerlessRuntimeRole-*",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "emr-serverless.amazonaws.com",
          "aws:ResourceAccount": "${aws:PrincipalAccount}"
        }
      }
    }
  ]
}
Migrate a CreateAutoMLJob to CreateAutoMLJobV2
We recommend that users of CreateAutoMLJob migrate to CreateAutoMLJobV2.
This section explains the differences in the input parameters between CreateAutoMLJob and CreateAutoMLJobV2 by highlighting the changes in the position, name, or structure of the objects and attributes of the input request between the two versions.
- Request attributes that did not change between versions.
  {
    "AutoMLJobName": "string",
    "AutoMLJobObjective": { "MetricName": "string" },
    "ModelDeployConfig": { "AutoGenerateEndpointName": boolean, "EndpointName": "string" },
    "OutputDataConfig": { "KmsKeyId": "string", "S3OutputPath": "string" },
    "RoleArn": "string",
    "Tags": [ { "Key": "string", "Value": "string" } ]
  }
- Request attributes that changed position and structure between versions.
  - The following attributes changed position: DataSplitConfig, SecurityConfig, CompletionCriteria, Mode, FeatureSpecificationS3Uri, SampleWeightAttributeName, TargetAttributeName.
  - The following attributes changed position and structure between versions.
    AutoMLJobConfig.CandidateGenerationConfig of type AutoMLCandidateGenerationConfig moved to AutoMLProblemTypeConfig.TabularJobConfig.CandidateGenerationConfig of type CandidateGenerationConfig in V2.
- Request attributes that changed name and structure.
  InputDataConfig (an array of AutoMLChannel) changed to AutoMLJobInputDataConfig (an array of AutoMLJobChannel) in V2. Note that the attributes SampleWeightAttributeName and TargetAttributeName move out of InputDataConfig and into AutoMLProblemTypeConfig.
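The moves listed in this section can be expressed as a small dictionary transformation. The function below is a sketch under stated assumptions, not an official migration tool: the destination of each attribute (top level of the V2 request or TabularJobConfig) follows our reading of the two request syntaxes, and the handling of ProblemType and GenerateCandidateDefinitionsOnly, which this section does not list, is an additional assumption. Verify the result against the CreateAutoMLJobV2 request syntax before use.

```python
import copy

UNCHANGED = ("AutoMLJobName", "AutoMLJobObjective", "ModelDeployConfig",
             "OutputDataConfig", "RoleArn", "Tags")


def migrate_request(v1_request):
    """Map a CreateAutoMLJob request dict to a CreateAutoMLJobV2 request dict."""
    v1 = copy.deepcopy(v1_request)
    v2 = {key: v1[key] for key in UNCHANGED if key in v1}
    job_config = v1.get("AutoMLJobConfig", {})
    tabular = {}

    # Assumed to move to the top level of the V2 request.
    for key in ("DataSplitConfig", "SecurityConfig"):
        if key in job_config:
            v2[key] = job_config[key]
    # Assumed to move into TabularJobConfig.
    for key in ("CompletionCriteria", "Mode"):
        if key in job_config:
            tabular[key] = job_config[key]

    # FeatureSpecificationS3Uri leaves CandidateGenerationConfig in V2.
    candidate = dict(job_config.get("CandidateGenerationConfig", {}))
    if "FeatureSpecificationS3Uri" in candidate:
        tabular["FeatureSpecificationS3Uri"] = candidate.pop("FeatureSpecificationS3Uri")
    if candidate:
        tabular["CandidateGenerationConfig"] = candidate

    # Not listed in this section; assumed from the API reference.
    for key in ("ProblemType", "GenerateCandidateDefinitionsOnly"):
        if key in v1:
            tabular[key] = v1[key]

    # InputDataConfig becomes AutoMLJobInputDataConfig; the target and sample
    # weight attributes move out of the channels.
    channels = []
    for channel in v1.get("InputDataConfig", []):
        channel = dict(channel)
        for key in ("TargetAttributeName", "SampleWeightAttributeName"):
            if key in channel:
                tabular.setdefault(key, channel.pop(key))
        channels.append(channel)

    v2["AutoMLJobInputDataConfig"] = channels
    v2["AutoMLProblemTypeConfig"] = {"TabularJobConfig": tabular}
    return v2
```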