Trainium SageMaker training jobs pre-training tutorial


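This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.

- Set up your environment
- Launch a training job

Before you begin, make sure you have the following prerequisites.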

Prerequisites

Before you start setting up your environment, make sure you have:

- An Amazon FSx file system or S3 bucket where you can load the data and output the training artifacts.
- A service quota for the ml.trn1.32xlarge instance on Amazon SageMaker AI. To request a service quota increase, do the following:

  To request a service quota increase for the ml.trn1.32xlarge instance
  1. Navigate to the AWS Service Quotas console.
  2. Choose AWS services.
  3. Select JupyterLab.
  4. Specify one instance for ml.trn1.32xlarge.
- An AWS Identity and Access Management (IAM) role with the AmazonSageMakerFullAccess and AmazonEC2FullAccess managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
- Data in one of the following formats:
  - JSON
  - JSONGZ (Compressed JSON)
  - ARROW
- (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see User access tokens.
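The data-format prerequisite can be checked locally before you upload anything. The following is a minimal sketch, assuming the three formats are identified by the file suffixes .json, .json.gz, and .arrow. That suffix mapping and the function name detect_data_format are illustrative assumptions, not part of the SageMaker tooling.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed suffix-to-format mapping (hypothetical; adjust to your dataset layout).
_SUFFIX_TO_FORMAT = {
    ".json.gz": "JSONGZ",
    ".json": "JSON",
    ".arrow": "ARROW",
}


def detect_data_format(path: str) -> Optional[str]:
    """Return the format label for a file name, or None if it is not recognized."""
    # Compare case-insensitively against the final path component only.
    name = PurePosixPath(path).name.lower()
    for suffix, label in _SUFFIX_TO_FORMAT.items():
        if name.endswith(suffix):
            return label
    return None
```

A file for which this returns None would need converting to one of the listed formats before it is used as training input.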

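Set up your environment for Trainium SageMaker training jobs

Before you run a SageMaker training job, use the aws configure command to configure your AWS credentials and default Region. Alternatively, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see SageMaker AI Python SDK.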
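If you rely on environment variables, a quick check can show whether they are present before a job is submitted. The following is a minimal sketch. The helper name missing_credential_vars is hypothetical, and treating the access key ID and secret access key as the required pair (with the session token needed only for temporary credentials) is an assumption of this example.

```python
import os
from typing import List, Mapping

# Variable names taken from the text above. AWS_SESSION_TOKEN is left out of the
# required set here (an assumption): it is typically needed only with temporary credentials.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required credential variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credential_vars(os.environ)
    if missing:
        print("Not set in the environment:", ", ".join(missing))
        print("Credentials may still be available through `aws configure`.")
    else:
        print("Credential environment variables are set.")
```

An empty result only shows that the variables exist. It does not confirm that the credentials are valid.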

We recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see SageMaker JupyterLab.

(Optional) If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip running the following commands. Make sure to use Python 3.9 or later.


```
# set up a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository, then install its dependencies
git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
```

Install SageMaker AI Python SDK
```
pip3 install --upgrade sagemaker
```
- If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.
  - Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK.
  - If you use HyperPod recipes to launch with sm_jobs as the cluster type, you don't have to specify the transformers version.
- Container: The Neuron container is set automatically by the SageMaker AI Python SDK.
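The step of appending the transformers pin to requirements.txt can be scripted. The following is a minimal sketch that works on the file's text. The helper name ensure_transformers_pin is hypothetical, and the choice to leave an existing transformers entry untouched is an assumption of this example.

```python
import re

PIN = "transformers==4.45.2"  # version taken from the note above


def ensure_transformers_pin(requirements_text: str) -> str:
    """Append the transformers pin unless the text already names transformers."""
    for line in requirements_text.splitlines():
        # Drop comments and whitespace, then isolate the package name.
        spec = line.split("#", 1)[0].strip()
        name = re.split(r"[<>=!~\[;\s]", spec, maxsplit=1)[0]
        if name.lower() == "transformers":
            return requirements_text  # leave an existing entry untouched
    # Add a newline first only if the text is non-empty and does not end with one.
    separator = "" if requirements_text.endswith("\n") or not requirements_text else "\n"
    return f"{requirements_text}{separator}{PIN}\n"
```

This sketch does not compare versions, so an existing entry that allows a release older than 4.45.2 still needs to be edited by hand.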

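Launch the training job with a Jupyter Notebook

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches the Llama 3 recipe named in training_recipe as a SageMaker AI training job.

compiler_cache_url: The cache used to save the compiled artifacts, for example an Amazon S3 URL.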


```
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 location for training output; replace the placeholder with your own URI.
output_path = "<s3_output_path>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>",
}

# Sync the TensorBoard logs written inside the container to S3.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f"{output_path}/tensorboard",
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"],
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
    tensorboard_output_config=tensorboard_output_config,
)

# Replace "your-inputs" with the location of your training data.
estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the fit() method. Use the training_recipe parameter to specify the recipe you want to use for training.
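One way to picture recipe_overrides is as a nested dictionary laid over the recipe's default values. The following sketch assumes a key-by-key recursive merge. The helper apply_overrides and the sample default values are hypothetical, and the SDK's actual merging logic may differ.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical recipe defaults, for illustration only.
recipe_defaults = {
    "run": {"name": "example-run", "results_dir": "/tmp/results"},
    "data": {"train_dir": None},
}
overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "data": {"train_dir": "/opt/ml/input/data/train"},
}
effective = apply_overrides(recipe_defaults, overrides)
```

Under this assumption, keys that the overrides do not mention (such as run.name here) keep their default values, so an override dictionary only needs to list the settings you want to change.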

Launch the training job with the recipes launcher

Update ./recipes_collection/cluster/sm_jobs.yaml

compiler_cache_url: The URL used to save the artifacts. It can be an Amazon S3 URL.


```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    image_uri: <your_image_uri>
    enable_remote_debug: True
    py_version: py39
  recipe_overrides:
    model:
      exp_manager:
        exp_dir: <exp_dir>
      data:
        train_dir: /opt/ml/input/data/train
        val_dir: /opt/ml/input/data/val
```
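The comments in sm_jobs.yaml state two constraints: inputs should use s3 or file_system but not both, and additional_estimator_kwargs values must be int, float, or string. The following is a minimal sketch of checking them on the parsed sm_jobs_config mapping. The function validate_sm_jobs_config is hypothetical and is not part of the recipes launcher.

```python
from typing import Any, Dict, List


def validate_sm_jobs_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an sm_jobs_config mapping."""
    problems = []

    # "Set either s3 or file_system, not both."
    inputs = cfg.get("inputs") or {}
    if inputs.get("s3") is not None and inputs.get("file_system") is not None:
        problems.append("inputs: set either 's3' or 'file_system', not both")

    # "Must be int, float or string." Booleans pass because bool subclasses int.
    kwargs = cfg.get("additional_estimator_kwargs") or {}
    for key, value in kwargs.items():
        if not isinstance(value, (int, float, str)):
            problems.append(
                f"additional_estimator_kwargs.{key}: unsupported type {type(value).__name__}"
            )
    return problems
```

An empty list means neither constraint is violated. The sketch does not check that the S3 paths or the image URI exist.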

Update ./recipes_collection/config.yaml


```
defaults:
  - _self_
  - cluster: sm_jobs
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.

instance_type: ml.trn1.32xlarge
base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
```
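The cluster_type comment says that for bcm, k8s, or sm_jobs the value must match the cluster entry under defaults. The following is a minimal sketch of that check on the parsed configuration. The helper names cluster_default and cluster_settings_match are hypothetical, and the dictionary mirrors config.yaml as a YAML parser would typically load it.

```python
from typing import Any, Dict, Optional


def cluster_default(config: Dict[str, Any]) -> Optional[str]:
    """Return the value of the `- cluster: ...` entry in `defaults`, if any."""
    for entry in config.get("defaults", []):
        if isinstance(entry, dict) and "cluster" in entry:
            return entry["cluster"]
    return None


def cluster_settings_match(config: Dict[str, Any]) -> bool:
    """Apply the rule stated in the cluster_type comment of config.yaml."""
    cluster_type = config.get("cluster_type")
    if cluster_type in ("bcm", "k8s", "sm_jobs"):
        return cluster_default(config) == cluster_type
    return True  # the comment states no matching rule for other values, such as bcp


# config.yaml written as the dictionary a YAML parser would produce.
config = {
    "defaults": [
        "_self_",
        {"cluster": "sm_jobs"},
        {"recipes": "training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain"},
    ],
    "cluster_type": "sm_jobs",
    "instance_type": "ml.trn1.32xlarge",
}
```

For the configuration shown, cluster_settings_match(config) returns True. Changing only cluster_type to k8s would make it return False.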

Launch the job with main.py


```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see SageMaker training jobs pre-training tutorial (GPU).
