Troubleshoot Neo Inference Errors - Amazon SageMaker

Troubleshoot Neo Inference Errors

This section contains information on how to prevent and resolve some of the common errors you may encounter when deploying and/or invoking an endpoint. It applies to PyTorch 1.4.0 or later and MXNet v1.7.0 or later.

  • If you defined a model_fn in your inference script, be sure to complete a first (warm-up) inference through model_fn() with valid input data; otherwise, you may see the following error message on the terminal when you call the predict API:

    An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from <users-sagemaker-endpoint> with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."
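A minimal inference.py sketch of this warm-up pattern is shown below. The stand-in model class, the compiled model file name, and the dummy input shape are hypothetical placeholders; in a real script the loader would be, for example, torch.jit.load() for a Neo-compiled PyTorch model, and the warm-up input would match your model's expected shape.

```python
# inference.py -- minimal sketch of model_fn with a warm-up inference.
import os


class _DummyModel:
    """Stand-in for a Neo-compiled model; replace with the object returned
    by your framework's loader (e.g. torch.jit.load() for PyTorch)."""

    def __call__(self, data):
        return [0.0]


def _load_compiled_model(path):
    # Hypothetical placeholder: in a real script this would load the
    # Neo-compiled artifact from `path` (PyTorch or MXNet).
    return _DummyModel()


def model_fn(model_dir):
    """Load the compiled model and run one warm-up inference so the first
    real InvokeEndpoint call does not time out during initialization."""
    model = _load_compiled_model(os.path.join(model_dir, "compiled_model.pt"))
    warm_up_input = [[0.0] * 224]  # shape is an assumption; use valid input for your model
    model(warm_up_input)
    return model
```

The key point is that the warm-up call happens inside model_fn(), before the endpoint serves its first request, so one-time initialization cost is not paid during a live invocation.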
  • Be sure to set the environment variables as shown in the table below. If these variables are not set, you may see the following error messages:

    On the terminal:

    An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from <users-sagemaker-endpoint> with message "{ "code": 503, "type": "InternalServerException", "message": "Prediction failed" } ".

    In CloudWatch:

    W-9001-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - AttributeError: 'NoneType' object has no attribute 'transform'
    Key                            Value
    SAGEMAKER_PROGRAM              inference.py
    SAGEMAKER_SUBMIT_DIRECTORY     /opt/ml/model/code
    SAGEMAKER_CONTAINER_LOG_LEVEL  20
    SAGEMAKER_REGION               <your region>
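One way to supply these variables is through the Environment map of the container definition when creating the model. A sketch is shown below; the region value, model name, image URI, model data URL, and role ARN are placeholders for your own values, and the boto3 call is shown only as a comment since it requires live AWS credentials.

```python
# Environment variables the Neo inference container needs.
# The region value is an example placeholder.
neo_environment = {
    "SAGEMAKER_PROGRAM": "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
    "SAGEMAKER_REGION": "us-west-2",  # replace with your region
}

# Sketch of passing them when creating the model with boto3 (not executed here):
#
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_model(
#     ModelName="my-neo-model",            # hypothetical name
#     PrimaryContainer={
#         "Image": image_uri,              # your Neo inference image URI
#         "ModelDataUrl": model_data_url,  # S3 path to the compiled model
#         "Environment": neo_environment,
#     },
#     ExecutionRoleArn=role_arn,
# )
```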
  • Make sure that the MMS_DEFAULT_RESPONSE_TIMEOUT environment variable is set to 500 or higher when you create the Amazon SageMaker model; otherwise, you may see the following error message on the terminal:

    An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from <users-sagemaker-endpoint> with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again."
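The timeout can be supplied alongside the other required variables in the same environment map (for example via the Environment map of create_model, or the env argument of the SageMaker Python SDK's Model class). A sketch of the resulting map follows; the region value is a placeholder.

```python
# Container environment including the Multi Model Server response timeout.
# The region value is an example placeholder.
container_environment = {
    "SAGEMAKER_PROGRAM": "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
    "SAGEMAKER_REGION": "us-west-2",  # replace with your region
    # 500 or higher, so a long-running first inference does not hit
    # the model server's default response timeout.
    "MMS_DEFAULT_RESPONSE_TIMEOUT": "500",
}
```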