Model Parallel Troubleshooting - Amazon SageMaker AI

Model Parallel Troubleshooting

If you run into an error, you can use the following list to try to troubleshoot your training job. If the problem persists, contact AWS Support.

Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library

SageMaker Debugger is not available for the SageMaker model parallelism library. Debugger is enabled by default for all SageMaker TensorFlow and PyTorch training jobs, and you might see an error that looks like the following:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading

To fix this issue, disable Debugger by passing debugger_hook_config=False when creating a framework estimator as shown in the following example.

bucket=sagemaker.Session().default_bucket() base_job_name="sagemaker-checkpoint-test" checkpoint_in_bucket="checkpoints" # The S3 URI to store the checkpoints checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket) estimator = TensorFlow( ... distribution={"smdistributed": {"modelparallel": { "enabled": True }}}, checkpoint_s3_uri=checkpoint_s3_bucket, checkpoint_local_path="/opt/ml/checkpoints", debugger_hook_config=False )

Saving Checkpoints

You might run into the following error when saving checkpoints of a large model on SageMaker AI:

InternalServerError: We encountered an internal error. Please try again

This could be caused by a SageMaker AI limitation while uploading the local checkpoint to Amazon S3 during training. To disable checkpointing in SageMaker AI, use the following example to explicitly upload the checkpoints.

If you run into the preceding error, do not use checkpoint_s3_uri with the SageMaker estimator call. While saving checkpoints for larger models, we recommend saving checkpoints to a custom directory and passing the same to the helper function (as a local_path argument).

import os def aws_s3_sync(source, destination): """aws s3 sync in quiet mode and time profile""" import time, subprocess cmd = ["aws", "s3", "sync", "--quiet", source, destination] print(f"Syncing files from {source} to {destination}") start_time = time.time() p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) p.wait() end_time = time.time() print("Time Taken to Sync: ", (end_time-start_time)) return def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'): """ sample function to sync checkpoints from local path to s3 """ import boto3 #check if local path exists if not os.path.exists(local_path): raise RuntimeError("Provided local path {local_path} does not exist. Please check") #check if s3 bucket exists s3 = boto3.resource('s3') if not s3_uri.startswith("s3://"): raise ValueError(f"Provided s3 uri {s3_uri} is not valid.") s3_bucket = s3_uri.replace('s3://','').split('/')[0] print(f"S3 Bucket: {s3_bucket}") try: s3.meta.client.head_bucket(Bucket=s3_bucket) except Exception as e: raise e aws_s3_sync(local_path, s3_uri) return def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'): """ sample function to sync checkpoints from s3 to local path """ import boto3 #try to create local path if it does not exist if not os.path.exists(local_path): print(f"Provided local path {local_path} does not exist. Creating...") try: os.makedirs(local_path) except Exception as e: raise RuntimeError(f"Failed to create {local_path}") #check if s3 bucket exists s3 = boto3.resource('s3') if not s3_uri.startswith("s3://"): raise ValueError(f"Provided s3 uri {s3_uri} is not valid.") s3_bucket = s3_uri.replace('s3://','').split('/')[0] print(f"S3 Bucket: {s3_bucket}") try: s3.meta.client.head_bucket(Bucket=s3_bucket) except Exception as e: raise e aws_s3_sync(s3_uri, local_path) return

Usage of helper functions:

#base_s3_uri - user input s3 uri or save to model directory (default) #curr_host - to save checkpoints of current host #iteration - current step/epoch during which checkpoint is saved # save checkpoints on every node using local_rank if smp.local_rank() == 0: base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', ''))) curr_host = os.environ['SM_CURRENT_HOST'] full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}' sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)

Convergence Using Model Parallel and TensorFlow

When you use SageMaker AI multi-node training with TensorFlow and the model parallelism library, the loss may not converge as expected because the order of training input files may be different on each node. This may cause different ranks in the same model parallel group to work on different input files, causing inconsistencies. To prevent this, ensure the input files are ordered the same way in all the ranks before they get converted to TensorFlow datasets. One way to achieve this is to sort the input file names in the training script.

Stalling or Crashing Distributed Training Jobs

If your training job has stalling, crashing, or not responding issues, read the following troubleshooting items to identify what's the cause of the issue. If you need any further support, reach out to the SageMaker distributed training team through AWS Support.

  • If you see a distributed training job stalling at the NCCL initialization step, consider the following:

    • If you are using one of the EFA-enabled instances ( ml.p4d or ml.p3dn instances) with a custom VPC and its subnet, ensure that the security group used has inbound and outbound connections for all ports to and from the same SG. You also generally need outbound connections to any IP as a separate rule (for internet access). To find instructions on how to add inbound and outbound rules for EFA communication, refer to SageMaker AI distributed training job stalling during initialization.

  • If you see a distributed training job stalling when checkpointing the full model, this might be because the state_dict() call on the model or optimizer was not made on all ranks with rdp_rank()==0 (when using tensor parallelism) or dp_rank()==0 (when using only pipeline parallelism). These ranks need to communicate to construct the checkpoint to be saved. Similar stalling issues can also happen when checkpointing partial optimizer if shard_optimizer_state is enabled.

    For more information about checkpointing a model with model parallelism, see General Instruction for Saving and Loading and Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0).

  • If the training job crashes with a CUDA Out of Memory error, this means that the distributed training configuration needs to be adjusted to fit the model on the GPU cluster. For more information and best practices, see Setting Up the Right Configuration for a Given Model.

  • If the training job crashes with an uncorrectable ECC error, this means that one of the GPUs in the cluster has gone bad. If you need technical support, share the job ARN with the AWS team and restart your training job from a checkpoint if possible.

  • In rare cases, a job configuration that worked previously but is close to the limits of GPU memory might fail later with a different cluster due to a CUDA Out of Memory error. This could be because some GPU has lower available memory than usual due to ECC errors.

  • Network timeout crash might happen when running a multinode job which doesn’t use all GPUs in the node. To get around this, use all GPUs on the node by ensuring that the processes_per_host parameter is set to the number of GPUs in each instance. For example, this is processes_per_host=8 for ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge instances.

  • If you find that your training job takes a long time during the data downloading stage, make sure the Amazon S3 path you provided to checkpoint_s3_uri for the SageMaker Estimator class is unique for the current training job. If this path is reused across multiple training jobs running simultaneously, all those checkpoints are uploaded and downloaded to the same Amazon S3 path and might significantly increase checkpoint loading time.

  • Use FSx for Lustre when you deal with large data and models.

    • If your dataset is large and fetching it takes a long time, we recommend keeping your dataset in FSx for Lustre.

    • When training models are beyond 10 billion parameters, we recommend using FSx for Lustre for checkpointing.

    • After you create a file system, make sure to wait for the status to become available before starting a training job using it.

Receiving NCCL Error for a PyTorch Training Job

If you encountered the following error, it might be due to a process running out of GPU memory.

NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.

You can resolve this by reducing the batch size or active_microbatches. If auto partitioning is not resulting in a well-balanced partitioning, you might have to consider manual partitioning. For more information, see Pipeline parallelism across nodes.

Receiving RecursionError for a PyTorch Training Job

The library does not support calling super.forward() inside a module's forward call. If you use super.forward(), you might receive the following error message.

RecursionError: maximum recursion depth exceeded

To fix the error, instead of calling super.forward(), you should call super()._orig_forward().