Troubleshooting Amazon SageMaker Pipelines
When using Amazon SageMaker Pipelines, you might run into issues for various reasons. This topic provides information about common errors and how to resolve them.
Pipeline Definition Issues
Your pipeline definition might not be formatted correctly. This can cause your execution to fail or your job to run incorrectly. These errors can be caught when the pipeline is created or when an execution occurs. If your definition doesn't validate, Pipelines returns an error message identifying the character where the JSON file is malformed. To fix this problem, review the steps created with the SageMaker AI Python SDK for accuracy.
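You can also inspect the generated definition locally before creating the pipeline. The following is a minimal sketch that assumes you have already built a pipeline object with the SageMaker AI Python SDK; parsing the JSON yourself surfaces malformed definitions early.
import json

# Assumes `pipeline` is a sagemaker.workflow.pipeline.Pipeline you have already defined.
# pipeline.definition() returns the pipeline definition as a JSON string;
# json.loads raises an error if that string is not valid JSON.
definition = json.loads(pipeline.definition())
print(json.dumps(definition["Steps"], indent=2))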
You can only include a step in a pipeline definition once. Because of this, the same step cannot appear both inside a condition step and at the top level of the same pipeline.
Examining Pipeline Logs
You can view the status of your steps using the following command:
execution.list_steps()
Each step includes the following information:
- The ARN of the entity launched by the pipeline, such as a SageMaker AI job ARN, model ARN, or model package ARN.
- The failure reason, which includes a brief explanation of the step failure.
- If the step is a condition step, whether the condition evaluated to true or false.
- If the execution reuses a previous job execution, the CacheHit field lists the source execution.
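The following is a minimal sketch of reading these fields programmatically. It assumes you started the run with pipeline.start(); the field names shown (StepName, StepStatus, FailureReason) come from the ListPipelineExecutionSteps response, and not every field is present for every step.
execution = pipeline.start()

# Inspect each step; FailureReason is only present for failed steps.
for step in execution.list_steps():
    name = step["StepName"]
    status = step["StepStatus"]
    reason = step.get("FailureReason", "")
    print(f"{name}: {status} {reason}")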
You can also view the error messages and logs in the Amazon SageMaker Studio interface. For information about how to see the logs in Studio, see View the details of a pipeline run.
Missing Permissions
Correct permissions are required both for the role that creates the pipeline execution and for the steps that create each of the jobs in your pipeline execution. Without these permissions, you may not be able to submit your pipeline execution or run your SageMaker AI jobs as expected. To ensure that your permissions are set up properly, see IAM Access Management.
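A minimal sketch of supplying the role is shown below. It assumes you are running in an environment where sagemaker.get_execution_role() resolves to a role with the required permissions, and that pipeline is the Pipeline object you defined.
import sagemaker

# The role must allow SageMaker AI to create the jobs defined by your pipeline steps.
role = sagemaker.get_execution_role()

# Registering (upserting) and starting the pipeline both run under this role.
pipeline.upsert(role_arn=role)
execution = pipeline.start()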
Job Execution Errors
You may run into issues when executing your steps because of issues in the scripts that define the functionality of your SageMaker AI jobs. Each job has a set of CloudWatch logs. To view these logs from Studio, see View the details of a pipeline run. For information about using CloudWatch logs with SageMaker AI, see Log groups and streams that Amazon SageMaker AI sends to Amazon CloudWatch Logs.
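If you want to read these logs outside of Studio, the following is a minimal sketch using boto3. The log group name assumes a processing job; other job types write to their own log groups, such as /aws/sagemaker/TrainingJobs.
import boto3

logs = boto3.client("logs")

# Processing jobs write to this log group; training jobs use /aws/sagemaker/TrainingJobs.
log_group = "/aws/sagemaker/ProcessingJobs"

# List the most recent log streams and print their events.
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=5,
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        limit=20,
    )
    for event in events["events"]:
        print(event["message"])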
Property File Errors
You may have issues when incorrectly implementing property files with your pipeline. To ensure that your implementation of property files works as expected, see Pass Data Between Steps.
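As a brief sketch of property file usage (the file name, output name, step name, and JSON path below are illustrative assumptions), the property file is declared, attached through the property_files argument of the step that produces it, and then read with JsonGet:
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.functions import JsonGet

# Declare a property file produced by a processing step.
# output_name must match a ProcessingOutput of that step, and path is the
# file written under that output's source directory (assumed here).
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

# After attaching it to the producing step (assumed here to be `step_eval`)
# via property_files=[evaluation_report], read a value from it, for example
# to drive a condition step.
accuracy = JsonGet(
    step_name=step_eval.name,
    property_file=evaluation_report,
    json_path="metrics.accuracy.value",
)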
Issues copying the script to the container in the Dockerfile
You can either copy the script to the container or pass it through the entry_point argument (of your estimator entity) or the code argument (of your processor entity), as demonstrated in the following code sample.
import os

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="PreprocessAbaloneData",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(
            input_name='dataset',
            source=...,
            destination="/opt/ml/processing/code",
        )
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train", destination=processed_data_path),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation", destination=processed_data_path),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test", destination=processed_data_path),
    ],
    code=os.path.join(BASE_DIR, "process.py"),  ## Code is passed through an argument
    cache_config=cache_config,
    job_arguments=['--input', 'arg1'],
)

sklearn_estimator = SKLearn(
    entry_point=os.path.join(BASE_DIR, "train.py"),  ## Code is passed through the entry_point
    framework_version="0.23-1",
    instance_type=training_instance_type,
    role=role,
    output_path=model_path,  # New
    sagemaker_session=sagemaker_session,  # New
    instance_count=1,  # New
    base_job_name=f"{base_job_prefix}/pilot-train",
    metric_definitions=[
        {'Name': 'train:accuracy', 'Regex': 'accuracy_train=(.*?);'},
        {'Name': 'validation:accuracy', 'Regex': 'accuracy_validation=(.*?);'},
    ],
)
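When you pass the script through the code or entry_point argument, the SageMaker AI Python SDK uploads it and makes it available to the job at run time, so your Dockerfile does not need a COPY instruction for the script.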