Use the SageMaker model parallelism library v2

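On this page, you'll learn how to use the SageMaker model parallelism library v2 (SMP v2) APIs and get started with running a PyTorch Fully Sharded Data Parallel (FSDP) training job on the SageMaker Training platform or on a SageMaker HyperPod cluster.

There are several scenarios for running a PyTorch training job with SMP v2.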

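For SageMaker Training, use one of the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later, which are pre-packaged with SMP v2.
Use the SMP v2 binary file to set up a Conda environment for running a distributed training workload on a SageMaker HyperPod cluster.
Extend the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later to install any additional functional requirements for your use case. To learn how to extend a pre-built container, see Extend a Pre-built Container.
Bring your own Docker container, manually set up the entire SageMaker Training environment using the SageMaker Training toolkit, and install the SMP v2 binary file. This is the least recommended option because of the complexity of the dependencies. To learn how to run your own Docker container, see Adapting Your Own Training Container.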

This getting started guide covers the first two scenarios.

Step 1: Adapt your PyTorch FSDP training script

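To activate and configure the SMP v2 library, start by importing the torch.sagemaker module and calling torch.sagemaker.init() at the top of the script. This call takes the SMP configuration dictionary of SMP v2 core feature configuration parameters that you prepare in Step 2: Launch a training job. To use the various core features offered by SMP v2, you might also need to make a few more changes to your training script. For detailed instructions on adapting your training script to use the SMP v2 core features, see Core features of the SageMaker model parallelism library v2.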
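As a minimal sketch of the change described above, the following helper calls torch.sagemaker.init() near the top of a training script. The torch.sagemaker.init() call and its acceptance of a configuration dictionary or a path come from this page; the `init_smp` wrapper and the fallback for environments where SMP v2 is not installed are illustrative assumptions, not part of the SMP API.

```python
from typing import Optional, Union


def init_smp(config: Optional[Union[dict, str]] = None) -> bool:
    """Initialize SMP v2 when the library is available (hypothetical helper).

    `config` is either an SMP configuration dictionary or a path to a JSON
    file such as smp_config.json. Returns True when init() was called.
    """
    try:
        # torch.sagemaker ships with SMP v2; it is absent from stock PyTorch.
        import torch.sagemaker as tsm
    except ImportError:
        return False

    if config is None:
        # On SageMaker Training, the configuration is supplied by the job launcher.
        tsm.init()
    else:
        tsm.init(config)
    return True


# Call this before you build the model and wrap it with PyTorch FSDP.
smp_enabled = init_smp()
```

Keeping the import inside the function lets the same script start on a machine without SMP v2, which can be convenient for local smoke tests.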

Step 2: Launch a training job

Learn how to configure SMP distribution options for launching a PyTorch FSDP training job with SMP core features.

SageMaker Training

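When you set up a training job launcher object of the PyTorch framework estimator class in the SageMaker Python SDK, configure the SMP v2 core feature configuration parameters through the distribution argument as follows.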

Note

The distribution configuration for SMP v2 is integrated into the SageMaker Python SDK starting from v2.200. Make sure that you use SageMaker Python SDK v2.200 or later.
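The version requirement in the note can be checked before launching a job. The following is a minimal sketch that assumes the SDK is installed under the distribution name `sagemaker` and that its version string starts with numeric major and minor components; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

MIN_SDK_VERSION = (2, 200)  # minimum version stated in the note


def parse_major_minor(version_string: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.200.1'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def sdk_supports_smp_v2(version_string: str) -> bool:
    """Return True if the given SDK version meets the stated minimum."""
    return parse_major_minor(version_string) >= MIN_SDK_VERSION


def installed_sdk_version() -> Optional[str]:
    """Return the installed sagemaker version string, or None if absent."""
    try:
        return version("sagemaker")
    except PackageNotFoundError:
        return None
```

For example, `sdk_supports_smp_v2(installed_sdk_version())` can guard the estimator setup once you have confirmed that the SDK is installed.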

Note

In SMP v2, you must configure smdistributed together with torch_distributed in the distribution argument of the SageMaker PyTorch estimator. With torch_distributed, SageMaker runs torchrun, which is the default multi-node job launcher of PyTorch Distributed.


from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.2.0",
    py_version="py310",
    # image_uri="<smp-docker-image-uri>", # For using prior versions, specify the SMP image URI directly.
    entry_point="your-training-script.py", # Pass the training script you adapted with SMP from Step 1.
    # ... Configure other required and optional parameters
    distribution={
        "torch_distributed": { "enabled": True },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "hybrid_shard_degree": Integer,
                    "sm_activation_offloading": Boolean,
                    "activation_loading_horizon": Integer,
                    "fsdp_cache_flush_warnings": Boolean,
                    "allow_empty_shards": Boolean,
                    "tensor_parallel_degree": Integer,
                    "expert_parallel_degree": Integer,
                    "random_seed": Integer
                }
            }
        }
    }
)
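The template above uses Integer and Boolean as type placeholders. The following is a minimal sketch of building the distribution dictionary from concrete values and checking them against those placeholder types. The parameter names and the dictionary layout come from the template; the helper names and the example values are hypothetical and are not recommendations.

```python
# Expected Python types, mirroring the Integer/Boolean placeholders above.
EXPECTED_TYPES = {
    "hybrid_shard_degree": int,
    "sm_activation_offloading": bool,
    "activation_loading_horizon": int,
    "fsdp_cache_flush_warnings": bool,
    "allow_empty_shards": bool,
    "tensor_parallel_degree": int,
    "expert_parallel_degree": int,
    "random_seed": int,
}


def validate_smp_parameters(params: dict) -> None:
    """Raise if a parameter name is unknown or its value has the wrong type."""
    for name, value in params.items():
        if name not in EXPECTED_TYPES:
            raise KeyError(f"Unknown SMP parameter: {name}")
        expected = EXPECTED_TYPES[name]
        # bool is a subclass of int, so reject True/False for Integer fields.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{name} must be an integer, got a boolean")
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")


def build_smp_distribution(params: dict) -> dict:
    """Wrap validated SMP parameters in the layout shown in the template."""
    validate_smp_parameters(params)
    return {
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {"enabled": True, "parameters": dict(params)}
        },
    }


# Hypothetical example values, for illustration only.
distribution = build_smp_distribution(
    {"hybrid_shard_degree": 8, "sm_activation_offloading": True, "random_seed": 12345}
)
```

The resulting `distribution` dictionary can then be passed as the distribution argument of the estimator.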

Important

To use a previous version of PyTorch or SMP instead of the latest, specify the SMP Docker image directly with the image_uri argument instead of the framework_version and py_version pair. The following is an example.


estimator = PyTorch(
    ...,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)

To find SMP Docker image URIs, see Supported frameworks.
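Because the image URI encodes the AWS account, Region, repository, and tag, it can help to split it before launching a job, for example to confirm that the Region matches the one you train in. The following is a minimal sketch with a hypothetical helper; it assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout used by the example URI above and does not cover other domain suffixes.

```python
import re

# Pattern for an Amazon ECR image URI with an explicit tag.
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI with a tag: {uri}")
    return match.groupdict()


parts = parse_ecr_image_uri(
    "658645717510.dkr.ecr.us-west-2.amazonaws.com/"
    "smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)
```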

SageMaker HyperPod

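Before you start, make sure that the following prerequisites are met.

An Amazon FSx shared directory (/fsx) is mounted to your HyperPod cluster.
Conda is installed in the FSx shared directory. To learn how to install Conda, follow the instructions at Installing on Linux in the Conda User Guide.
cuda11.8 or cuda12.1 is installed on the head and compute nodes of your HyperPod cluster.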
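The prerequisites above can be checked from a node with a short script. The following is a minimal sketch: the /fsx mount point and the /usr/local/cuda-<version> layout follow the rest of this page, while the function name and the default Miniconda location (/fsx/miniconda3) are assumptions that you should replace with your own paths.

```python
import os


def check_hyperpod_prerequisites(
    fsx_dir: str = "/fsx",
    miniconda_dir: str = "/fsx/miniconda3",  # hypothetical install location
    cuda_root: str = "/usr/local",
    cuda_versions: tuple = ("11.8", "12.1"),
) -> dict:
    """Report which of the listed prerequisites appear to be satisfied."""
    conda_binary = os.path.join(miniconda_dir, "bin", "conda")
    cuda_found = [
        v for v in cuda_versions
        if os.path.isdir(os.path.join(cuda_root, f"cuda-{v}"))
    ]
    return {
        # True only if fsx_dir is a mount point, not merely a directory.
        "fsx_mounted": os.path.ismount(fsx_dir),
        "conda_installed": os.path.isfile(conda_binary),
        "cuda_versions_found": cuda_found,
    }


if __name__ == "__main__":
    for name, value in check_hyperpod_prerequisites().items():
        print(f"{name}: {value}")
```

Run it on the head node and on a compute node, because CUDA must be present on both.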

If all of the prerequisites are met, follow these instructions to launch a workload with SMP v2 on a HyperPod cluster.

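Prepare an smp_config.json file that contains a dictionary of SMP v2 core feature configuration parameters. Upload this JSON file to the location where you store your training script, or to the path that you specified for the torch.sagemaker.init() module in Step 1. If you already passed the configuration dictionary to the torch.sagemaker.init() module in the training script in Step 1, you can skip this step.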
```
// smp_config.json
{
    "hybrid_shard_degree": Integer,
    "sm_activation_offloading": Boolean,
    "activation_loading_horizon": Integer,
    "fsdp_cache_flush_warnings": Boolean,
    "allow_empty_shards": Boolean,
    "tensor_parallel_degree": Integer,
    "expert_parallel_degree": Integer,
    "random_seed": Integer
}
```
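Upload the smp_config.json file to a directory in your file system. The directory path must match the path that you specified in Step 1. If you already passed the configuration dictionary to the torch.sagemaker.init() module in the training script, you can skip this step.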
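One way to produce the file is to write it from Python, which guarantees valid JSON. Strict JSON parsers reject `//` comments, so this sketch writes the dictionary only, without a filename comment. The helper name and the example values are hypothetical; the parameter names come from the template above.

```python
import json
from pathlib import Path


def write_smp_config(params: dict, directory: str) -> Path:
    """Write params to <directory>/smp_config.json and return the file path."""
    target = Path(directory) / "smp_config.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        json.dump(params, f, indent=4)
    return target


# Hypothetical example values, for illustration only.
example_params = {
    "hybrid_shard_degree": 8,
    "sm_activation_offloading": True,
    "random_seed": 12345,
}
```

Calling `write_smp_config(example_params, "<directory you specified in Step 1>")` places the file where the training script expects it.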
On the compute nodes of your cluster, start a terminal session with the following command.
```
sudo su -l ubuntu
```

Create a Conda environment on the compute nodes. The following code is an example script that creates a Conda environment and installs SMP, SMDDP, CUDA, and other dependencies.


# Run on compute nodes
SMP_CUDA_VER=<11.8 or 12.1>

source /fsx/<path_to_miniconda>/miniconda3/bin/activate

export ENV_PATH=/fsx/<path_to_miniconda>/miniconda3/envs/<ENV_NAME>
conda create -p ${ENV_PATH} python=3.10

conda activate ${ENV_PATH}

# Verify aws-cli is installed: Expect something like "aws-cli/2.15.0*"
aws --version
# Install aws-cli if not already installed
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#cliv2-linux-install

# Install the SMP library
conda install pytorch="2.0.1=sm_py3.10_cuda${SMP_CUDA_VER}*" packaging --override-channels \
  -c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-2.0.0-pt-2.0.1/2023-12-11/smp-v2/ \
  -c pytorch -c numba/label/dev \
  -c nvidia -c conda-forge

# Install dependencies of the script as below
python -m pip install packaging transformers==4.31.0 accelerate ninja tensorboard h5py datasets \
    && python -m pip install expecttest hypothesis \
    && python -m pip install "flash-attn>=2.0.4" --no-build-isolation

# Install the SMDDP wheel
SMDDP_WHL="smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl" \
  && wget -q https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/${SMDDP_WHL} \
  && pip install --force-reinstall ${SMDDP_WHL} \
  && rm ${SMDDP_WHL}

# cuDNN installation for CUDA 11.8, required to build Transformer Engine.
# Run this block only if SMP_CUDA_VER is 11.8.
# Download the archive from the link below first; you need to agree to the terms.
# https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.5/local_installers/11.x/cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz

tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
    && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
    && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
    && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
    && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
    && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/

# cuDNN installation for CUDA 12.1, required to build Transformer Engine.
# Run this block only if SMP_CUDA_VER is 12.1.
# Download the archive from the link below first; you need to agree to the terms.
# https://developer.download.nvidia.com/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
    && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
    && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
    && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
    && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
    && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
    
# TransformerEngine installation
export CUDA_HOME=/usr/local/cuda-$SMP_CUDA_VER
export CUDNN_PATH=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_LIBRARY=/usr/local/cuda-$SMP_CUDA_VER/lib
export CUDNN_INCLUDE_DIR=/usr/local/cuda-$SMP_CUDA_VER/include
export PATH=/usr/local/cuda-$SMP_CUDA_VER/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-$SMP_CUDA_VER/lib

python -m pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v1.0
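After the script finishes, a quick import check can confirm that the main packages landed in the active Conda environment. The following is a minimal sketch with a hypothetical helper; the module names (`torch`, `transformers`, `flash_attn`, `transformer_engine`) are the usual import names for the packages installed above and are assumptions about your environment.

```python
import importlib.util
from typing import Iterable, List

# Top-level import names for the packages installed by the script above.
EXPECTED_MODULES = ("torch", "transformers", "flash_attn", "transformer_engine")


def find_missing_modules(names: Iterable[str] = EXPECTED_MODULES) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules()
    print("All expected modules found." if not missing else f"Missing: {missing}")
```

This only checks that the modules can be located; it does not verify versions or GPU support.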

Run a test training job.
1. In the shared file system (/fsx), clone the Awsome Distributed Training GitHub repository and go to the 3.test_cases/11.modelparallel folder.
```
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/11.modelparallel
```
2. Submit a job using sbatch as follows.
```
conda activate <ENV_PATH>
sbatch -N 16 conda_launch.sh
```
  If the job submission succeeds, the output message of this sbatch command should be similar to Submitted batch job ABCDEF.
3. Check the log file in the logs/ folder under the current directory.
```
tail -f ./logs/fsdp_smp_ABCDEF.out
```
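The job ID printed by sbatch determines the log file name used in the last step. The following is a minimal sketch that extracts the ID from the submission message and builds the matching log path. The message format and the fsdp_smp_<job ID>.out naming follow the example above; the helper names are hypothetical.

```python
import re
from typing import Optional

_SBATCH_PATTERN = re.compile(r"Submitted batch job (\w+)")


def parse_job_id(sbatch_output: str) -> Optional[str]:
    """Return the job ID from sbatch output, or None if it is not present."""
    match = _SBATCH_PATTERN.search(sbatch_output)
    return match.group(1) if match else None


def log_path_for(job_id: str, log_dir: str = "./logs") -> str:
    """Build the log file path that corresponds to a job ID."""
    return f"{log_dir}/fsdp_smp_{job_id}.out"
```

For example, passing the text printed by sbatch to `parse_job_id` and the result to `log_path_for` yields the path to give to tail -f.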
