SageMaker training jobs pre-training tutorial (GPU)

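This tutorial guides you through setting up and running a pre-training job using SageMaker training jobs with GPU instances.

Set up your environment
Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure that you have the following prerequisites.

Prerequisites

Before you start setting up your environment, make sure you have:

An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
A service quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
1. On the AWS Service Quotas console, navigate to AWS services.
2. Choose Amazon SageMaker AI.
3. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
An AWS Identity and Access Management (IAM) role with the following managed policies, which give SageMaker AI permissions to run the examples.
- AmazonSageMakerFullAccess
- AmazonEC2FullAccess
Data in one of the following formats:
- JSON
- JSONGZ (compressed JSON)
- ARROW
(Optional) A HuggingFace token, which you must get if you use model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see User access tokens.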

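GPU SageMaker training jobs environment setup

Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the aws configure command. As an alternative to the configure command, you can provide your credentials through environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. For more information, see SageMaker AI Python SDK.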
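
If you take the environment-variable route, it can help to confirm that the variables are actually set before launching anything. The following is a minimal sketch using only the Python standard library. The helper name missing_credential_vars is hypothetical and not part of any AWS SDK. The sketch assumes that AWS_SESSION_TOKEN is only needed for temporary credentials and therefore treats it as optional.

```python
import os

# Environment variables named in the paragraph above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
# Assumption: the session token is only present for temporary credentials.
OPTIONAL_VARS = ("AWS_SESSION_TOKEN",)


def missing_credential_vars(env=None):
    """Return the required credential variables that are unset or empty.

    `env` defaults to the current process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    print("Missing credential variables:", ", ".join(missing))
else:
    print("Required credential variables are set.")
```

This check only looks at the environment. Credentials configured through aws configure live in a separate file, so an empty result here is not required when you used that command.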

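We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see SageMaker JupyterLab.

(Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure that you're using Python 3.9 or later.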


# Set up a virtual environment.
python3 -m venv ${PWD}/venv
source venv/bin/activate

# Clone the recipes repository and install its dependencies.
git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt

# Set the AWS region.
aws configure set region <your_region>
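
The Python version requirement stated for this step can be checked from inside the virtual environment. The following is a small sketch, not part of the recipes repository. The function name python_is_supported is hypothetical, and the minimum version is the 3.9 figure given in this tutorial.

```python
import sys

# Minimum version stated in this tutorial.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=None):
    """Return True if the interpreter version is at least MIN_PYTHON.

    `version_info` defaults to the running interpreter; pass a tuple to test.
    """
    version = sys.version_info if version_info is None else version_info
    return tuple(version[:2]) >= MIN_PYTHON


print("Python version supported:", python_is_supported())
```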

Install the SageMaker AI Python SDK
```
pip3 install --upgrade sagemaker
```
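Container: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.

Note
If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.

Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK. For example, append it if you're using the SDK in a notebook in SageMaker AI JupyterLab.

If you use HyperPod recipes to launch the job with cluster type sm_jobs, this is done automatically.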
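
When you do need to add the pin yourself, the append step described in the note can be scripted. The following is a minimal sketch that assumes source_dir is an existing local directory. The helper ensure_transformers_pin is hypothetical and not part of the SageMaker AI Python SDK. It leaves the file alone if a transformers entry is already present, so it does not verify that an existing entry meets the 4.45.2 minimum.

```python
import re
from pathlib import Path

# Version pin quoted in the note above.
PIN = "transformers==4.45.2"

# Matches a requirements line for the transformers package itself
# (not packages that merely start with the same letters).
_TRANSFORMERS_LINE = re.compile(r"^\s*transformers\s*($|[=<>!~;\[])", re.IGNORECASE)


def ensure_transformers_pin(source_dir, pin=PIN):
    """Append `pin` to <source_dir>/requirements.txt if no transformers entry exists.

    Returns True if the file was modified, False otherwise.
    """
    req_file = Path(source_dir) / "requirements.txt"
    text = req_file.read_text() if req_file.exists() else ""
    if any(_TRANSFORMERS_LINE.match(line) for line in text.splitlines()):
        return False
    # Make sure the pin starts on its own line.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    req_file.write_text(text + prefix + pin + "\n")
    return True
```

Calling ensure_transformers_pin("<your_source_dir>") twice is safe because the second call finds the existing entry and returns False.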

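Launch the training job using a Jupyter notebook

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the SageMaker AI Python SDK to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI training platform.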


import os

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# TensorBoard logs are written under the session's default bucket.
bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
# Replace with the S3 URI where the training job should write its output.
output_path = "<s3-URI>"

# Recipe overrides that point the recipe at paths inside the training container.
overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

# Sync the container's TensorBoard log directory to Amazon S3.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, "tensorboard"),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"],
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,  # the dictionary defined above
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

# Replace the placeholder strings with your S3 or FSx inputs.
estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)

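The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the fit() method. Use the training_recipe parameter to specify the recipe that you want to use for training.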
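
The train_dir and val_dir overrides in the example line up with the channel names passed to fit(): SageMaker makes each input channel available in the container under /opt/ml/input/data/<channel_name>. The following sketch derives the override paths from the channel names so the two stay in sync. The helper names channel_dir and data_overrides are hypothetical and not part of the SDK.

```python
import posixpath

# Directory under which SageMaker exposes input channels inside the container.
INPUT_ROOT = "/opt/ml/input/data"


def channel_dir(channel_name):
    """Return the container path for a given input channel name."""
    return posixpath.join(INPUT_ROOT, channel_name)


def data_overrides(train_channel="train", val_channel="val"):
    """Build the model.data part of the recipe overrides from channel names."""
    return {
        "model": {
            "data": {
                "train_dir": channel_dir(train_channel),
                "val_dir": channel_dir(val_channel),
            },
        },
    }


print(data_overrides())
```

If you rename a channel in the inputs dictionary, passing the same name to data_overrides keeps the recipe pointing at the right directory.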

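Note

If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or later.

Append transformers==4.45.2 to requirements.txt in source_dir only when you're using the SageMaker AI Python SDK directly. For example, you must append the version to the text file when you're using a Jupyter notebook.

When you deploy an endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following is an example of how an inference image can be used for deployment: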


from sagemaker import image_uris

# Look up a PyTorch inference image instead of reusing the training image.
container = image_uris.retrieve(
    framework="pytorch",
    region="us-west-2",
    version="2.0",
    py_version="py310",
    image_scope="inference",
    instance_type="ml.p4d.24xlarge",
)
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    image_uri=container,
)

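Note

Running the preceding code on a SageMaker notebook instance might require more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run out of space, create a new notebook instance with increased storage.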
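
Before running the example, you can check how much disk space is free in the notebook environment. The following is a minimal sketch using the standard library's shutil.disk_usage. The function name has_free_space and the default threshold are illustrative assumptions: the 5 GB figure above is the default storage size, not a measured requirement of the example.

```python
import shutil


def has_free_space(path=".", required_gb=5.0):
    """Return True if the filesystem containing `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


usage = shutil.disk_usage(".")
print(f"Free space: {usage.free / 1024**3:.1f} GiB of {usage.total / 1024**3:.1f} GiB")
```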

Launch the training job with the recipes launcher

Update the ./recipes_collection/cluster/sm_jobs.yaml file to look like the following:


sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
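
Two constraints are stated in the comments of this file: inputs should set either s3 or file_system but not both, and values under additional_estimator_kwargs must be int, float, or string. The following sketch checks both on the sm_jobs_config mapping after it has been loaded into a Python dictionary by whatever YAML loader you use. The function validate_sm_jobs_config is hypothetical and is not part of the recipes launcher.

```python
def validate_sm_jobs_config(cfg):
    """Return a list of problems found in an sm_jobs_config mapping (empty if none)."""
    problems = []

    # Constraint from the YAML comment: set either s3 or file_system, not both.
    inputs = cfg.get("inputs") or {}
    if inputs.get("s3") and inputs.get("file_system"):
        problems.append("inputs: set either s3 or file_system, not both")

    # Constraint from the YAML comment: values must be int, float or string.
    # Note: Python treats bool as a subclass of int, so True/False pass this check.
    kwargs = cfg.get("additional_estimator_kwargs") or {}
    for key, value in kwargs.items():
        if not isinstance(value, (int, float, str)):
            problems.append(
                f"additional_estimator_kwargs.{key}: must be int, float or string"
            )
    return problems


# Example mirroring the file above, with placeholders left as-is.
example = {
    "inputs": {"s3": {"train": "<s3_train_data_path>", "val": None}},
    "additional_estimator_kwargs": {"max_run": 180000, "enable_remote_debug": True},
}
print(validate_sm_jobs_config(example))
```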

Update ./recipes_collection/config.yaml to specify sm_jobs in both cluster and cluster_type.


defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
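
Because sm_jobs has to appear in two places in this file, a quick consistency check can catch a half-edited config. The following sketch works on values already loaded from config.yaml into Python objects: the defaults list and the cluster_type string. The function uses_sm_jobs is hypothetical and only covers the sm_jobs case described in this tutorial.

```python
def uses_sm_jobs(defaults, cluster_type):
    """Return True if both the `cluster` default and `cluster_type` are set to sm_jobs.

    `defaults` is the list under the `defaults` key, for example
    ["_self_", {"cluster": "sm_jobs"}, {"recipes": "..."}].
    """
    cluster = None
    for entry in defaults:
        # Plain strings such as "_self_" are skipped; only mappings carry a cluster key.
        if isinstance(entry, dict) and "cluster" in entry:
            cluster = entry["cluster"]
    return cluster == "sm_jobs" and cluster_type == "sm_jobs"


defaults = ["_self_", {"cluster": "sm_jobs"}, {"recipes": "<recipe_path>"}]
print(uses_sm_jobs(defaults, "sm_jobs"))
```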

Launch the job with the following command:


python3 main.py --config-path recipes_collection --config-name config

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.
