Using the neptune-export tool or the Neptune-Export service to export data from Neptune for Neptune ML
Neptune ML requires that you provide training data for the Deep Graph Library (DGL). You can export that data from Neptune using either the Neptune-Export service or the neptune-export command-line utility. Both the service and the command-line tool publish data to Amazon Simple Storage Service (Amazon S3), in a CSV format for property-graph data or as N-Triples for RDF data, encrypted using Amazon S3 server-side encryption (SSE-S3). See Files exported by Neptune-Export and neptune-export.
In addition, when you configure an export of training data for Neptune ML, the export job creates and publishes an encrypted model-training configuration file along with the exported data. By default, this file is named training-data-configuration.json.
Examples of using the Neptune-Export service to export training data for Neptune ML
This request exports property-graph training data for a node classification task:
```shell
curl \
  (your NeptuneExportApiUri) \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "command": "export-pg",
        "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
        "params": {
          "endpoint": "(your Neptune endpoint DNS name)",
          "profile": "neptune_ml"
        },
        "additionalParams": {
          "neptune_ml": {
            "version": "v2.0",
            "targets": [
              {
                "node": "Movie",
                "property": "genre",
                "type": "classification"
              }
            ]
          }
        }
      }'
```
This request exports RDF training data for a node classification task:
```shell
curl \
  (your NeptuneExportApiUri) \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "command": "export-rdf",
        "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
        "params": {
          "endpoint": "(your Neptune endpoint DNS name)",
          "profile": "neptune_ml"
        },
        "additionalParams": {
          "neptune_ml": {
            "version": "v2.0",
            "targets": [
              {
                "node": "http://aws.amazon.com/neptune/csv2rdf/class/Movie",
                "predicate": "http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/genre",
                "type": "classification"
              }
            ]
          }
        }
      }'
```
Fields to set in the params object when exporting training data
The params object in an export request can contain various fields, as described in the params documentation. The following are the most relevant for exporting machine-learning training data:
- endpoint – Use endpoint to specify an endpoint of a Neptune instance in your DB cluster that the export process can query to extract data.
- profile – The profile field in the params object must be set to neptune_ml. This causes the export process to format the exported data appropriately for Neptune ML model training, in a CSV format for property-graph data or as N-Triples for RDF data. It also causes a training-data-configuration.json file to be created and written to the same Amazon S3 location as the exported training data.
- cloneCluster – If set to true, the export process clones your DB cluster, exports from the clone, and then deletes the clone when it is finished.
- useIamAuth – If your DB cluster has IAM authentication enabled, you must include this field set to true.
The export process also provides several ways to filter the data you export (see these examples).
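Putting the fields above together, a params object for a Neptune ML export might look like the following sketch. Whether you include cloneCluster and useIamAuth depends on your cluster configuration; the endpoint value is a placeholder:

```json
{
  "endpoint": "(your Neptune endpoint DNS name)",
  "profile": "neptune_ml",
  "cloneCluster": true,
  "useIamAuth": true
}
```

Cloning the cluster avoids putting export query load on your production instances, at the cost of a longer overall export time.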
Using the additionalParams object to tune the export of model-training information
The additionalParams object contains fields that you can use to specify machine-learning class labels and features for training purposes, and to guide the creation of a training data configuration file.
The export process cannot automatically infer which node and edge properties should serve as the machine-learning class labels for training. Nor can it infer the best feature encoding for numeric, categorical, and text properties. You therefore need to supply hints in the additionalParams object to specify these things, or to override the default encodings.
For property-graph data, the top-level structure of additionalParams in an export request might look like this:
```json
{
  "command": "export-pg",
  "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
  "params": {
    "endpoint": "(your Neptune endpoint DNS name)",
    "profile": "neptune_ml"
  },
  "additionalParams": {
    "neptune_ml": {
      "version": "v2.0",
      "targets": [ (an array of node and edge class label targets) ],
      "features": [ (an array of node feature hints) ]
    }
  }
}
```
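Filled in, a neptune_ml object for property-graph data might look like the sketch below. The Movie, genre, and year names are illustrative, and the split_rate and feature-encoding values shown are assumptions drawn from the Neptune ML export options; check the targets and features documentation for the full set of supported fields:

```json
{
  "neptune_ml": {
    "version": "v2.0",
    "targets": [
      {
        "node": "Movie",
        "property": "genre",
        "type": "classification",
        "split_rate": [0.8, 0.1, 0.1]
      }
    ],
    "features": [
      {
        "node": "Movie",
        "property": "year",
        "type": "numerical",
        "norm": "min-max"
      }
    ]
  }
}
```

Here the target marks the genre property of Movie nodes as the class label for a node classification task, and the feature hint asks for min-max normalized numerical encoding of the year property.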
For RDF data, its top-level structure might look like this:
```json
{
  "command": "export-rdf",
  "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
  "params": {
    "endpoint": "(your Neptune endpoint DNS name)",
    "profile": "neptune_ml"
  },
  "additionalParams": {
    "neptune_ml": {
      "version": "v2.0",
      "targets": [ (an array of node and edge class label targets) ]
    }
  }
}
```
You can also supply multiple export configurations using the jobs field:
```json
{
  "command": "export-pg",
  "outputS3Path": "s3://(your Amazon S3 bucket)/neptune-export",
  "params": {
    "endpoint": "(your Neptune endpoint DNS name)",
    "profile": "neptune_ml"
  },
  "additionalParams": {
    "neptune_ml": {
      "version": "v2.0",
      "jobs": [
        {
          "name": "(training data configuration name)",
          "targets": [ (an array of node and edge class label targets) ],
          "features": [ (an array of node feature hints) ]
        },
        {
          "name": "(another training data configuration name)",
          "targets": [ (an array of node and edge class label targets) ],
          "features": [ (an array of node feature hints) ]
        }
      ]
    }
  }
}
```
Top-level elements in the neptune_ml field in additionalParams
The version element in neptune_ml
Specifies the version of training data configuration to generate.
(Optional) Type: string. Default: "v2.0".
If you do include version, set it to v2.0.
The jobs field in neptune_ml
Contains an array of training-data configuration objects, each of which defines a data-processing job and contains:
- name – The name of the training data configuration to be created. For example, a training data configuration with the name "job-number-1" results in a training data configuration file named job-number-1.json.
- targets – A JSON array of node and edge class label targets that represent the machine-learning class labels for training purposes. See The targets field in a neptune_ml object.
- features – A JSON array of node property features. See The features field in neptune_ml.