


# Processing the graph data exported from Neptune for training
<a name="machine-learning-on-graphs-processing"></a>

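The data-processing step takes the Neptune graph data created by the export process and creates the information that the [Deep Graph Library (DGL)](https://www.dgl.ai/) uses during training. This includes performing various data mappings and transformations:
+ Parsing nodes and edges to construct the graph-mapping and ID-mapping files that DGL requires.
+ Converting node and edge properties into the node and edge features that DGL requires.
+ Splitting the data into training, validation, and test sets.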
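
To make the ID mapping and the data split more concrete, here is a minimal Python sketch. It is an illustration only and not the code that Neptune ML runs. The function names, the sample node IDs, and the 0.8/0.1/0.1 split fractions are all assumptions chosen for the example.

```python
import random


def build_id_map(node_ids):
    """Map each original node ID to a consecutive integer index."""
    return {node_id: index for index, node_id in enumerate(node_ids)}


def split_indices(count, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle range(count) and cut it into train, validation, and test lists."""
    indices = list(range(count))
    random.Random(seed).shuffle(indices)
    train_end = int(count * train_frac)
    valid_end = train_end + int(count * valid_frac)
    # Whatever remains after the train and validation slices is the test set.
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


# Hypothetical node IDs standing in for exported graph data.
node_ids = [f"person-{n}" for n in range(10)]
id_map = build_id_map(node_ids)
train, valid, test = split_indices(len(id_map))
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes it easier to compare models trained on the same data.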

## Managing the data-processing step for Neptune ML
<a name="machine-learning-on-graphs-processing-managing"></a>

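After you have exported the data from Neptune that you want to use for model training, you can start a data-processing job using a command like the following: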

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-data-processing-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --input-data-s3-location "s3://(S3 bucket name)/(path to your input folder)" \
  --id "(a job ID for the new job)" \
  --processed-data-s3-location "s3://(S3 bucket name)/(path to your output folder)" \
  --config-file-name "training-job-configuration.json"
```

For more information, see [start-ml-data-processing-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-data-processing-job.html) in the *AWS CLI Command Reference*.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

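# read_timeout=None disables the client-side read timeout, and
# total_max_attempts=1 means the request is sent once with no automatic retries.
client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)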

response = client.start_ml_data_processing_job(
    inputDataS3Location='s3://(S3 bucket name)/(path to your input folder)',
    id='(a job ID for the new job)',
    processedDataS3Location='s3://(S3 bucket name)/(path to your output folder)',
    configFileName='training-job-configuration.json'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
        "configFileName" : "training-job-configuration.json"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
        "configFileName" : "training-job-configuration.json"
      }'
```

------

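For details about how to use this command, see [The dataprocessing command](machine-learning-api-dataprocessing.md). That page also explains how to get the status of a running job, how to stop a running job, and how to list all running jobs.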
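
As an illustration of checking job status, the following Python sketch polls a data-processing job until it finishes. It assumes a `neptunedata` client created as in the SDK example above and calls its `get_ml_data_processing_job` operation. The helper name, the polling interval, and the set of terminal status strings are assumptions, so confirm the actual status values against the dataprocessing command reference.

```python
import time

# Assumed terminal status strings; verify them against the API reference.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_data_processing_job(client, job_id, poll_seconds=60, max_polls=120):
    """Poll the job until it reports a terminal status or max_polls is reached.

    Returns the last status seen, which may be non-terminal if polling gave up.
    """
    status = None
    for _ in range(max_polls):
        response = client.get_ml_data_processing_job(id=job_id)
        status = response.get("status")
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return status


# Example call, given a boto3 'neptunedata' client named client:
# final_status = wait_for_data_processing_job(client, "(your job ID)")
```

Bounding the number of polls keeps a script from waiting indefinitely if a job never reaches one of the assumed terminal statuses.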

## Processing updated graph data for Neptune ML
<a name="machine-learning-on-graphs-processing-updated"></a>

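You can also supply a `previousDataProcessingJobId` to the API to ensure that the new data-processing job uses the same processing method as a previous job. This is required when you want to get predictions for updated graph data in Neptune, either by retraining the old model on the new data or by recomputing the model artifacts on the new data.

You can do this using a command like the following: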

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-data-processing-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --input-data-s3-location "s3://(Amazon S3 bucket name)/(path to your input folder)" \
  --id "(a job ID for the new job)" \
  --processed-data-s3-location "s3://(Amazon S3 bucket name)/(path to your output folder)" \
  --previous-data-processing-job-id "(the job ID of the previous data-processing job)"
```

For more information, see [start-ml-data-processing-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-data-processing-job.html) in the *AWS CLI Command Reference*.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

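# read_timeout=None disables the client-side read timeout, and
# total_max_attempts=1 means the request is sent once with no automatic retries.
client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)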

response = client.start_ml_data_processing_job(
    inputDataS3Location='s3://(Amazon S3 bucket name)/(path to your input folder)',
    id='(a job ID for the new job)',
    processedDataS3Location='s3://(Amazon S3 bucket name)/(path to your output folder)',
    previousDataProcessingJobId='(the job ID of the previous data-processing job)'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your output folder)",
        "previousDataProcessingJobId" : "(the job ID of the previous data-processing job)"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your output folder)",
        "previousDataProcessingJobId" : "(the job ID of the previous data-processing job)"
      }'
```

------

Set the value of the `previousDataProcessingJobId` parameter to the job ID of the previous data-processing job that corresponds to the trained model.

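**Note**  
Node deletions in the updated graph are currently not supported. If nodes have been removed in an updated graph, you must start a completely new data-processing job rather than use `previousDataProcessingJobId`.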
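
One way to apply the node-deletion restriction in a script is sketched below. The helper compares the node IDs from the previous export with those from the new export and includes `previousDataProcessingJobId` only when no nodes were removed. The function name and the idea of keeping node ID lists from each export are assumptions made for illustration. The returned dictionary uses the same parameter names as the SDK example above.

```python
def build_job_request(job_id, input_s3, output_s3,
                      previous_job_id, old_node_ids, new_node_ids):
    """Return keyword arguments for start_ml_data_processing_job.

    previousDataProcessingJobId is included only if every node from the
    previous export is still present in the new export.
    """
    request = {
        "id": job_id,
        "inputDataS3Location": input_s3,
        "processedDataS3Location": output_s3,
    }
    removed = set(old_node_ids) - set(new_node_ids)
    if not removed:
        # No deletions detected, so the previous job's processing can be reused.
        request["previousDataProcessingJobId"] = previous_job_id
    return request


# Hypothetical IDs: node "b" was deleted, so a fresh job is requested.
request = build_job_request(
    "job-2", "s3://bucket/in", "s3://bucket/out",
    "job-1", old_node_ids=["a", "b"], new_node_ids=["a", "c"],
)
print(sorted(request))
```

The resulting dictionary can be passed to the client as `client.start_ml_data_processing_job(**request)`.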