
Deploy large models for inference with TorchServe

This tutorial demonstrates how to deploy large models and serve inference in Amazon SageMaker AI with TorchServe on GPUs. This example deploys the OPT-30b model to an ml.g5 instance. You can modify it to work with other models and instance types. Replace the placeholder text in the examples with your own information.

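TorchServe is a powerful open platform for large distributed model inference. By supporting popular libraries such as PyTorch, native PiPPy, DeepSpeed, and HuggingFace Accelerate, it offers uniform handler APIs that remain consistent across distributed large model and non-distributed model inference scenarios. For more information, see TorchServe's large model inference documentation.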

Deep learning containers with TorchServe

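To deploy a large model with TorchServe on SageMaker AI, you can use one of the SageMaker AI deep learning containers (DLCs). By default, TorchServe is installed in all AWS PyTorch DLCs. During model loading, TorchServe can install specialized libraries tailored for large models, such as PiPPy, DeepSpeed, and Accelerate.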

The following table lists all of the SageMaker AI DLCs with TorchServe.

DLC category	Framework	Hardware	Example URL
SageMaker AI Framework Containers	PyTorch 2.0.0+	CPU, GPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
SageMaker AI Framework Graviton Containers	PyTorch 2.0.0+	CPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.1-cpu-py310-ubuntu20.04-sagemaker
StabilityAI Inference Containers	PyTorch 2.0.0+	GPU	763104351884.dkr.ecr.us-east-1.amazonaws.com/stabilityai-pytorch-inference:2.0.1-sgm0.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker
Neuron Containers	PyTorch 1.13.1	Neuronx	763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04
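
The example URLs in the table share one shape: registry account, Region, repository, and tag. The following is a minimal sketch of assembling such a URI for another Region. The helper name dlc_image_uri is hypothetical, and the account ID is simply copied from the table; it is an assumption that the same account serves your Region, so verify the URI before using it.

```python
# Account ID copied from the example URLs in the table (assumed valid for your Region).
DLC_ACCOUNT_ID = "763104351884"


def dlc_image_uri(region: str, repository: str, tag: str) -> str:
    """Assemble an ECR image URI in the same shape as the table's example URLs."""
    return f"{DLC_ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Rebuild the first example URL from the table.
uri = dlc_image_uri(
    "us-east-1",
    "pytorch-inference",
    "2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker",
)
print(uri)
```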

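Getting started

Before you deploy your model, complete the prerequisites. You can also configure the model parameters and customize the handler code.

Prerequisites

To get started, make sure that you have the following prerequisites:

Ensure that you have access to an AWS account. Set up your environment so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For testing in your personal account, you can attach managed permissions policies to the IAM role.
For more information about attaching IAM policies to a role, see Adding and removing IAM identity permissions in the AWS IAM User Guide.

Configure your dependencies locally, as shown in the following examples.

Install version 2 of the AWS CLI.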


# Install the latest AWS CLI v2 if it is not installed
!curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
!unzip awscliv2.zip
# Follow the instructions to install v2 on the terminal
!cat aws/README.md

Install the SageMaker AI and Boto3 clients.


# If already installed, update your client
#%pip install sagemaker pip --upgrade --quiet
!pip install -U sagemaker
!pip install -U boto
!pip install -U botocore
!pip install -U boto3

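Configure model settings and parameters

TorchServe uses torchrun to set up the distributed environment for model parallel processing. TorchServe can support multiple workers for a large model. By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. For large model inference, the number of GPUs assigned to each worker is calculated automatically from the number of GPUs specified in the model_config.yaml file. The environment variable CUDA_VISIBLE_DEVICES, which specifies the GPU device IDs that are visible at a given time, is set based on this number.

For example, suppose that a node has 8 GPUs and one worker needs 4 GPUs on the node (nproc_per_node=4). In this case, TorchServe assigns four GPUs to the first worker (CUDA_VISIBLE_DEVICES="0,1,2,3") and four GPUs to the second worker (CUDA_VISIBLE_DEVICES="4,5,6,7").

In addition to this default behavior, TorchServe gives you the flexibility to specify GPUs for a worker. For example, if you set the variable deviceIds: [2,3,4,5] in the model config YAML file and set nproc_per_node=2, then TorchServe assigns CUDA_VISIBLE_DEVICES="2,3" to the first worker and CUDA_VISIBLE_DEVICES="4,5" to the second worker.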
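
The two assignment examples above can be reproduced with a short sketch. This is only an illustration of the resulting mapping, not TorchServe's implementation; the function name assign_gpus and its parameters are hypothetical.

```python
from typing import List, Optional


def assign_gpus(
    num_gpus: int,
    nproc_per_node: int,
    device_ids: Optional[List[int]] = None,
) -> List[str]:
    """Return one CUDA_VISIBLE_DEVICES string per worker.

    Uses device_ids when given; otherwise uses GPUs 0..num_gpus-1.
    Each worker receives nproc_per_node consecutive device IDs.
    """
    if nproc_per_node <= 0:
        raise ValueError("nproc_per_node must be positive")
    ids = list(device_ids) if device_ids is not None else list(range(num_gpus))
    num_workers = len(ids) // nproc_per_node
    return [
        ",".join(str(i) for i in ids[w * nproc_per_node:(w + 1) * nproc_per_node])
        for w in range(num_workers)
    ]


# Default behavior: 8 GPUs on the node, 4 GPUs per worker.
print(assign_gpus(8, 4))
# User-specified deviceIds: [2,3,4,5] with 2 GPUs per worker.
print(assign_gpus(8, 2, device_ids=[2, 3, 4, 5]))
```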

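The following model_config.yaml example configures both front-end and back-end parameters for the OPT-30b model. The configured front-end parameters are parallelType, deviceType, deviceIds, and torchrun. For more information about the front-end parameters that you can configure, see the PyTorch GitHub documentation. The back-end configuration is based on a YAML map that allows free-style customization. For the back-end parameters, the example defines the DeepSpeed configuration and additional parameters used by the custom handler code.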


# TorchServe front-end parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # sets CUDA_VISIBLE_DEVICES

torchrun:
    nproc-per-node: 4

# TorchServe back-end parameters
deepspeed:
    config: ds-config.json
    checkpoint: checkpoints.json

handler: # parameters for custom handler code
    model_name: "facebook/opt-30b"
    model_path: "model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    max_length: 50
    max_new_tokens: 10
    manual_seed: 40

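Customize handlers

TorchServe offers base handlers and handler utilities for large model inference built with popular libraries. The following example demonstrates how the custom handler class TransformersSeqClassifierHandler extends BaseDeepSpeedHandler and uses the handler utilities. For a full code example, see the custom_handler.py code in the PyTorch GitHub documentation.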


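import logging
from abc import ABC

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ts.context import Context
from ts.handler_utils.distributed.deepspeed import get_ds_engine
from ts.torch_handler.distributed.base_deepspeed_handler import BaseDeepSpeedHandler

logger = logging.getLogger(__name__)


class TransformersSeqClassifierHandler(BaseDeepSpeedHandler, ABC):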
    """
    Transformers handler class for sequence, token classification and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.max_length = None
        self.max_new_tokens = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the HF large model is loaded and
        partitioned using DeepSpeed.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        super().initialize(ctx)
        model_dir = ctx.system_properties.get("model_dir")
        self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        logger.info("Model %s loading tokenizer", ctx.model_name)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            self.model = AutoModelForCausalLM.from_config(
                config, torch_dtype=torch.float16
            )
        self.model = self.model.eval()

        ds_engine = get_ds_engine(self.model, ctx)
        self.model = ds_engine.module
        logger.info("Model %s loaded successfully", ctx.model_name)
        self.initialized = True

    def preprocess(self, requests):
        """
        Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (list): A list of dictionaries with a "data" or "body" field, each
                            containing the input text to be processed.
        Returns:
            tuple: A tuple with two tensors: the batch of input ids and the batch of
                attention masks.
        """

    def inference(self, input_batch):
        """
        Predicts the class (or classes) of the received text using the serialized transformers
        checkpoint.
        Args:
            input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
                                of attention masks, as returned by the preprocess function.
        Returns:
            list: A list of strings with the predicted values for each input text in the batch.
        """
        
    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the Predictions and Explanations.
        """
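
The preprocess docstring above says that each request is a dictionary with a "data" or "body" field. As a minimal sketch of that first step only, the following helper pulls the input text out of each request and decodes bytes to a string. The function name extract_texts is hypothetical, and tokenization into input IDs and attention masks is left out; see the full custom_handler.py for the real method body.

```python
from typing import Any, Dict, List


def extract_texts(requests: List[Dict[str, Any]]) -> List[str]:
    """Collect the input text from each request's "data" or "body" field."""
    texts = []
    for request in requests:
        payload = request.get("data")
        if payload is None:
            payload = request.get("body")
        # Payloads can arrive as raw bytes; decode them to text.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(payload)
    return texts


batch = extract_texts([
    {"data": b"Today the weather is really nice"},
    {"body": "and I am planning on"},
])
print(batch)
```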

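Prepare your model artifacts

Before you deploy your model on SageMaker AI, you must package your model artifacts. For large models, we recommend that you use the PyTorch torch-model-archiver tool with the argument --archive-format no-archive, which skips compressing the model artifacts. The following example saves all of the model artifacts to a new folder named opt/.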


torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format no-archive

After the opt/ folder is created, download the OPT-30b model to the folder by using the PyTorch Download_model tool.


cd opt
python path_to/Download_model.py --model_path model --model_name facebook/opt-30b --revision main

Lastly, upload the model artifacts to an Amazon S3 bucket.


aws s3 cp opt {your_s3_bucket}/opt --recursive

You should now have model artifacts stored in Amazon S3 that are ready to deploy to a SageMaker AI endpoint.

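Deploy the model using the SageMaker Python SDK

After preparing your model artifacts, you can deploy your model to a SageMaker AI hosting endpoint. This section describes how to deploy a single large model to an endpoint and make streaming response predictions. For more information about streaming responses from endpoints, see Invoke real-time endpoints.

To deploy your model, complete the following steps:

Create a SageMaker AI session, as shown in the following example.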


import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

boto3_session = boto3.session.Session(region_name="us-west-2")
smr = boto3_session.client("sagemaker-runtime")  # runtime client used to invoke the endpoint
sm = boto3_session.client("sagemaker")
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # SageMaker AI session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio Classic environment
account = sess.account_id()  # account_id of the current SageMaker Studio Classic environment

# Configuration:
bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

Create an uncompressed model in SageMaker AI, as shown in the following example.


from datetime import datetime

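instance_type = "ml.g5.24xlarge"
endpoint_name = sagemaker.utils.name_from_base("ts-opt-30b")
s3_uri = "{your_s3_bucket}/opt"  # replace {your_s3_bucket} with the S3 location you uploaded to
# PyTorch DLC image from the table above, in the session's Region; adjust the tag as needed
container = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker"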

model = Model(
    name="torchserve-opt-30b" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # Enable SageMaker uncompressed model artifacts
    model_data={
        "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
        }
    },
    image_uri=container,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true"},
)
print(model)

Deploy the model to an Amazon EC2 instance, as shown in the following example.


model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    volume_size=512, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)

Initialize a class to process the streaming response, as shown in the following example.


import io

class Parser:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read 
    position to ensure that previous bytes are not exposed again. 
    """
    
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0
        
    def write(self, content):
        # Append new bytes at the end of the buffer
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Yield only complete lines; a partial line stays buffered until its newline arrives
            if line.endswith(b'\n'):
                self.read_pos += len(line)
                yield line[:-1]
                
    def reset(self):
        self.read_pos = 0

Test a streaming response prediction, as shown in the following example.


import json

body = "Today the weather is really nice and I am planning on".encode('utf-8')
resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=body, ContentType="application/json")
event_stream = resp['Body']
parser = Parser()
for event in event_stream:
    parser.write(event['PayloadPart']['Bytes'])
    for line in parser.scan_lines():
        print(line.decode("utf-8"), end=' ')
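
To try the buffering logic without a live endpoint, the following sketch replays simulated events in the same PayloadPart shape described in the Parser docstring, including one JSON object split across two events. The function name iter_tokens and the simulated_events list are hypothetical, and the sketch assumes that each complete line is a JSON object with an "outputs" list, as in the docstring's sample output.

```python
import json


def iter_tokens(events):
    """Yield generated text pieces from newline-delimited JSON payload parts."""
    buffer = b""
    for event in events:
        buffer += event["PayloadPart"]["Bytes"]
        # Consume every complete line; keep any trailing partial line buffered.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield from json.loads(line)["outputs"]


# Simulated stream: the second JSON object is split across two events.
simulated_events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

tokens = list(iter_tokens(simulated_events))
print("".join(tokens))
```

With a real endpoint, the same loop could consume resp['Body'] in place of simulated_events.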

You have now deployed your model to a SageMaker AI endpoint and should be able to invoke it for responses. For more information about SageMaker AI real-time endpoints, see Single-model endpoints.
