Logs for Built-in Algorithms
Amazon SageMaker algorithms produce Amazon CloudWatch logs, which provide detailed information on
the training process. To see the logs, in the AWS Management Console, choose CloudWatch, choose
Logs, and then choose the /aws/sagemaker/TrainingJobs log group. Each training job has one log
stream per node on which it was trained. The log stream’s name begins with the value specified in
the TrainingJobName parameter when the job was created.
Note
If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the start of training. Reasons include specifying the wrong training image or S3 location.
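The same log streams can also be read programmatically instead of through the console. The following is a minimal sketch, not an official utility: the helper name iter_training_log_lines is hypothetical, and logs_client is assumed to behave like boto3.client("logs"), using its describe_log_streams and get_log_events calls. It looks up every stream whose name begins with the training job name and yields the messages in order.

```python
def iter_training_log_lines(logs_client, job_name,
                            log_group="/aws/sagemaker/TrainingJobs"):
    """Yield (stream_name, message) for every log event of one training job.

    logs_client is assumed to behave like boto3.client("logs").
    """
    stream_kwargs = {"logGroupName": log_group, "logStreamNamePrefix": job_name}
    while True:
        page = logs_client.describe_log_streams(**stream_kwargs)
        for stream in page.get("logStreams", []):
            name = stream["logStreamName"]
            event_kwargs = {
                "logGroupName": log_group,
                "logStreamName": name,
                "startFromHead": True,  # read oldest events first
            }
            previous_token = None
            while True:
                events = logs_client.get_log_events(**event_kwargs)
                for event in events.get("events", []):
                    yield name, event["message"]
                token = events.get("nextForwardToken")
                # The end of a stream is reached when the returned token
                # is the same one that was just passed in.
                if token is None or token == previous_token:
                    break
                previous_token = token
                event_kwargs["nextToken"] = token
        # Continue only while there are more pages of streams.
        if "nextToken" not in page:
            break
        stream_kwargs["nextToken"] = page["nextToken"]


# Usage (assumes boto3 is installed and AWS credentials are configured;
# "my-training-job" is a placeholder name):
#   import boto3
#   for stream, message in iter_training_log_lines(boto3.client("logs"), "my-training-job"):
#       print(stream, message)
```

Passing the client in as an argument keeps the helper independent of how credentials and the region are configured. If the generator yields nothing for a failed job, that matches the situation described in the note above: the error probably occurred before training started.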
The contents of the logs vary by algorithm. However, you can typically find the following information:
- Confirmation of arguments provided at the beginning of the log
- Errors that occurred during training
- Measurement of an algorithm's accuracy or numerical performance
- Timings for the algorithm and any major stages within the algorithm
Common Errors
If a training job fails, some details about the failure are provided by the FailureReason
field of the training job description, as follows:
import boto3
sage = boto3.client('sagemaker')
# job_name is the TrainingJobName given when the job was created.
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']
Others are reported only in the CloudWatch logs. Common errors include the following:
- Misspelling a hyperparameter name or specifying a hyperparameter that is invalid for the algorithm.
  From the CloudWatch log:
  [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48] Additional properties are not allowed (u'mini_batch_siz' was unexpected)
- Specifying an invalid value for a hyperparameter.
  From the FailureReason:
  AlgorithmError: u'abc' is not valid under any of the given schemas\n\nFailed validating u'oneOf' in schema[u'properties'][u'feature_dim']:\n {u'oneOf': [{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n {u'minimum': 1, u'type': u'integer'}]}\
  From the CloudWatch log:
  [10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc' is not valid under any of the given schemas
- Incorrect protobuf file format.
  From the CloudWatch log:
  [10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot copy sequence with size 785 to array axis with dimension 784
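Because the sample log lines above share one layout (timestamp, level, a numeric identifier, source location, message), errors can be filtered out of a stream of messages with a small parser. The following is a sketch under that assumption only: the pattern is inferred from the three samples shown here, other algorithms may log in a different layout, and the names parse_log_line and error_messages are hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines above, for example:
# [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48] message text
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<ident>\d+) "          # numeric identifier; its meaning is not documented here
    r"(?P<source>[^\]]+)\] "    # file name and line number, such as train.py:48
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%m/%d/%Y %H:%M:%S")
    return fields


def error_messages(lines):
    """Return the message text of every line logged at ERROR level."""
    parsed = (parse_log_line(line) for line in lines)
    return [entry["message"] for entry in parsed if entry and entry["level"] == "ERROR"]
```

Lines that do not match the assumed layout are skipped rather than raising an exception, so the parser can run over a mixed log stream without failing on unrelated output.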