모델 파라미터를 디버깅하기 위한 Debugger 규칙을 구성하려면,시스템 및 프레임워크 지표를 프로파일링하기 위한 디버거 내장 규칙을 구성하려면 UpdateTrainingJob API를 사용하여 Debugger 프로파일링 구성 업데이트 CreateTrainingJob API에 Debugger 사용자 지정 규칙 구성 추가

JSON(AWS CLI)

Amazon SageMaker Debugger 기본 제공 규칙은 SageMaker AI CreateTrainingJob API 작업을 통해 DebugHookConfig, DebugRuleConfiguration, ProfilerConfig 및 ProfilerRuleConfiguration 객체를 사용하여 훈련 작업에 대해 구성할 수 있습니다. RuleEvaluatorImage 파라미터에 올바른 이미지 URI를 지정해야 하며, 다음 예제는 CreateTrainingJob을 요청하기 위해 JSON 문자열을 설정하는 방법을 안내합니다.

다음 코드는 필수 설정 및 Debugger 구성으로 훈련 작업을 실행하기 위한 완전한 JSON 템플릿을 보여줍니다. 템플릿을 작업 디렉터리에 JSON 파일로 저장하고 AWS CLI를 사용하여 훈련 작업을 실행합니다. 예를 들어, 다음 코드를 debugger-training-job-cli.json으로 저장합니다.

참고

올바른 Docker 컨테이너 이미지를 사용해야 합니다. AWS 딥 러닝 컨테이너 이미지를 찾으려면 사용 가능한 딥 러닝 컨테이너 이미지를 참조하세요. Debugger 규칙을 사용하는 데 사용할 수 있는 도커 이미지의 전체 목록을 찾으려면 Debugger 규칙용 도커 이미지을 참고하세요.


{
   "TrainingJobName": "debugger-aws-cli-test",
   "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
   "AlgorithmSpecification": {
      // Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
      "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
      "TrainingInputMode": "File",
      "EnableSageMakerMetricsTimeSeries": false
   },
   "HyperParameters": {
      "sagemaker_program": "entry_point/tf-hvd-train.py",
      "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
   },
   "OutputDataConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
   },
   "DebugHookConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
      "CollectionConfigurations": [
         {
            "CollectionName": "losses",
            "CollectionParameters" : {
                "train.save_interval": "50"
            }
         }
      ]
   },
   "DebugRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
      }
   ],
   "ProfilerConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
      "ProfilingIntervalInMilliseconds": 500,
      "ProfilingParameters": {
          "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\", }",
          "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, }",
          "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}",
          "LocalPath": "/opt/ml/output/profiler/" 
      }
   },
   "ProfilerRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "ProfilerReport",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
      }
   ],
   "ResourceConfig": { 
      "InstanceType": "ml.p3.8xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 30
   },
   
   "StoppingCondition": { 
      "MaxRuntimeInSeconds": 86400
   }
}

JSON 파일을 저장한 후, 터미널에서 다음 명령을 실행합니다. (Jupyter notebook을 사용하는 경우 줄의 시작 부분에 !를 사용하세요.)


aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json

모델 파라미터를 디버깅하기 위한 Debugger 규칙을 구성하려면,

다음 코드 샘플은 이러한 SageMaker API를 사용하여 기본 제공되는 VanishingGradient 규칙을 구성하는 방법을 보여 줍니다.

Debugger가 출력 텐서를 수집할 수 있도록 하려면

Debugger 후크 구성을 다음과 같이 지정하세요.


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters" : {
                "save_interval": "500"
            }
        }
    ]
}

이렇게 지정하면 500단계의 gradients마다 훈련 작업에 텐서 모음, save_interval가 저장됩니다. 사용 가능한 CollectionName 값을 찾으려면 SMDebug 클라이언트 라이브러리 설명서의 Debugger 기본 제공 모음을 참고하세요. 사용 가능한 CollectionParameters 파라미터 키와 값을 찾으려면 SageMaker Python SDK 설명서에서 sagemaker.debugger.CollectionConfig 클래스를 참고하세요.

출력 텐서를 디버깅하기 위한 Debugger 규칙을 활성화하려면

다음 DebugRuleConfigurations API 예제는 저장된 gradients 모음에서 기본 제공된 VanishingGradient 규칙을 실행하는 방법을 보여줍니다.


"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]

이 샘플 구성과 동일한 구성을 갖춘 Debugger는 gradients 텐서 모음에서 VanishingGradient 규칙을 사용하여 훈련 작업에 대한 규칙 평가 작업을 시작합니다. Debugger 규칙을 사용하는 데 사용할 수 있는 도커 이미지의 전체 목록을 찾으려면 Debugger 규칙용 도커 이미지을 참고하세요. RuleParameters에 대한 키-값 쌍을 찾으려면 Debugger 기본 제공 규칙 목록을 참고하세요.

시스템 및 프레임워크 지표를 프로파일링하기 위한 디버거 내장 규칙을 구성하려면

다음 예제 코드는 ProfilerConfig API 작업을 지정하여 시스템 및 프레임워크 지표를 수집할 수 있도록 하는 방법을 보여줍니다.

Debugger 프로파일링을 활성화하여 시스템 및 프레임워크 지표를 수집하도록 하려면

Target Step


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500, 
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/" 
    }
}

Target Time Duration


"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"  
    }
}

지표를 프로파일링하기 위한 Debugger 규칙을 활성화하려면

다음 예제 코드에서는 ProfilerReport 규칙을 구성하는 방법을 보여줍니다.


"ProfilerRuleConfigurations": [ 
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]

Debugger 규칙을 사용하는 데 사용할 수 있는 도커 이미지의 전체 목록을 찾으려면 Debugger 규칙용 도커 이미지을 참고하세요. RuleParameters에 대한 키-값 쌍을 찾으려면 Debugger 기본 제공 규칙 목록을 참고하세요.

`UpdateTrainingJob` API를 사용하여 Debugger 프로파일링 구성 업데이트

UpdateTrainingJob API 작업을 사용하여 훈련 작업이 실행되는 동안 Debugger 프로파일링 구성을 업데이트할 수 있습니다. 새 ProfilerConfig 및 ProfilerRuleConfiguration 객체를 구성하고 TrainingJobName 파라미터에 훈련 작업 이름을 지정합니다.


{
    "ProfilerConfig": { 
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": { 
            "string" : "string" 
        }
    },
    "ProfilerRuleConfigurations": [ 
        { 
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": { 
                "string" : "string" 
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}

`CreateTrainingJob` API에 Debugger 사용자 지정 규칙 구성 추가

CreateTrainingJob API 작업에서 DebugHookConfig 및 DebugRuleConfiguration 객체를 사용하여 훈련 작업에 대한 사용자 지정 규칙을 구성할 수 있습니다. 다음 코드 샘플에서는 이 SageMaker API 작업을 사용하여 smdebug 라이브러리로 작성된 사용자 지정 ImproperActivation 규칙의 구성 방법을 보여 줍니다. 이 예제에서는 custom_rules.py 파일에 사용자 지정 규칙을 작성하여 Amazon S3 버킷에 업로드했다고 가정합니다. 그리고 사용자가 사용자 지정 규칙을 실행하는 데 사용할 수 있는 사전 빌드된 도커 이미지를 보여줍니다. 이는 사용자 지정 규칙 평가기를 위한 Amazon SageMaker Debugger 이미지 URI에 나열되어 있습니다. RuleEvaluatorImage 파라미터에 사전 구축된 도커 이미지의 URL 레지스트리 주소를 지정합니다.


"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRulesConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
           "source_s3_uri": "s3://bucket/custom_rules.py",
           "rule_to_invoke": "ImproperActivation",
           "collection_names": "relu_activations"
        }
    }
]

javascript가 브라우저에서 비활성화되거나 사용이 불가합니다.

AWS 설명서를 사용하려면 Javascript가 활성화되어야 합니다. 지침을 보려면 브라우저의 도움말 페이지를 참조하십시오.

문서 규칙

SageMaker API를 사용한 Debugger 구성

SDK for Python(Boto3)