

# Processing the graph data exported from Neptune for training
<a name="machine-learning-on-graphs-processing"></a>

The data-processing step takes the Neptune graph data created by the export process and creates the information that is used by the [Deep Graph Library (DGL)](https://www.dgl.ai/) during training. This includes performing various data mappings and transformations:
+ Parsing nodes and edges to construct the graph- and ID-mapping files required by DGL.
+ Converting node and edge properties into the node and edge features required by DGL.
+ Splitting the data into training, validation, and test sets.

## Managing the data-processing step for Neptune ML
<a name="machine-learning-on-graphs-processing-managing"></a>

After you have exported the data from Neptune that you want to use for model training, you can start a data-processing job using a command like the following:

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-data-processing-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --input-data-s3-location "s3://(S3 bucket name)/(path to your input folder)" \
  --id "(a job ID for the new job)" \
  --processed-data-s3-location "s3://(S3 bucket name)/(path to your output folder)" \
  --config-file-name "training-job-configuration.json"
```

For more information, see [start-ml-data-processing-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-data-processing-job.html) in the AWS CLI Command Reference.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)

response = client.start_ml_data_processing_job(
    inputDataS3Location='s3://(S3 bucket name)/(path to your input folder)',
    id='(a job ID for the new job)',
    processedDataS3Location='s3://(S3 bucket name)/(path to your output folder)',
    configFileName='training-job-configuration.json'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
        "configFileName" : "training-job-configuration.json"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
        "configFileName" : "training-job-configuration.json"
      }'
```

------

The details of how to use this command are explained in [The dataprocessing command](machine-learning-api-dataprocessing.md), along with information about how to get the status of a running job, how to stop a running job, and how to list all running jobs.

## Processing updated graph data for Neptune ML
<a name="machine-learning-on-graphs-processing-updated"></a>

You can also supply a `previousDataProcessingJobId` to the API to ensure that the new data processing job uses the same processing method as a previous job. This is required when you want to get predictions for updated graph data in Neptune, either by retraining the old model on the new data, or by recomputing the model artifacts on the new data.

You do this by using a command like this:

------
#### [ AWS CLI ]

```
aws neptunedata start-ml-data-processing-job \
  --endpoint-url https://your-neptune-endpoint:port \
  --input-data-s3-location "s3://(Amazon S3 bucket name)/(path to your input folder)" \
  --id "(a job ID for the new job)" \
  --processed-data-s3-location "s3://(Amazon S3 bucket name)/(path to your output folder)" \
  --previous-data-processing-job-id "(the job ID of the previous data-processing job)"
```

For more information, see [start-ml-data-processing-job](https://docs.aws.amazon.com/cli/latest/reference/neptunedata/start-ml-data-processing-job.html) in the AWS CLI Command Reference.

------
#### [ SDK ]

```
import boto3
from botocore.config import Config

client = boto3.client(
    'neptunedata',
    endpoint_url='https://your-neptune-endpoint:port',
    config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)

response = client.start_ml_data_processing_job(
    inputDataS3Location='s3://(Amazon S3 bucket name)/(path to your input folder)',
    id='(a job ID for the new job)',
    processedDataS3Location='s3://(Amazon S3 bucket name)/(path to your output folder)',
    previousDataProcessingJobId='(the job ID of the previous data-processing job)'
)

print(response)
```

------
#### [ awscurl ]

```
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
  --region us-east-1 \
  --service neptune-db \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your output folder)",
        "previousDataProcessingJobId" : "(the job ID of the previous data-processing job)"
      }'
```

**Note**  
This example assumes that your AWS credentials are configured in your environment. Replace *us-east-1* with the Region of your Neptune cluster.

------
#### [ curl ]

```
curl \
  -X POST https://your-neptune-endpoint:port/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your input folder)",
        "id" : "(a job ID for the new job)",
        "processedDataS3Location" : "s3://(Amazon S3 bucket name)/(path to your output folder)",
        "previousDataProcessingJobId" : "(the job ID of the previous data-processing job)"
      }'
```

------

Set the value of the `previousDataProcessingJobId` parameter to the job ID of the previous data-processing job that corresponds to the trained model.

**Note**  
Node deletions in the updated graph are currently not supported. If nodes have been removed in an updated graph, you have to start a completely new data processing job rather than use `previousDataProcessingJobId`.

# Feature encoding in Neptune ML
<a name="machine-learning-feature-encoding"></a>

Property values come in different formats and data types. To achieve good performance in machine learning, it is essential to convert those values to numerical encodings known as *features*.

Neptune ML performs feature extraction and encoding as part of the data-export and data-processing steps, using feature-encoding techniques described here.

**Note**  
If you plan to implement your own feature encoding in a custom model implementation, you can disable the automatic feature encoding in the data preprocessing stage by selecting `none` as the feature encoding type. No feature encoding then occurs on that node or edge property, and instead the raw property values are parsed and saved in a dictionary. Data preprocessing still creates the DGL graph from the exported dataset, but the constructed DGL graph doesn't have the pre-processed features for training.  
You should use this option only if you plan to perform your custom feature encoding as part of custom model training. See [Custom models in Neptune ML](machine-learning-custom-models.md) for details.

## Categorical features in Neptune ML
<a name="machine-learning-categorical-features"></a>

A property that can take one or more distinct values from a fixed list of possible values is a categorical feature. In Neptune ML, categorical features are encoded using [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). The following example shows how the names of different foods are one-hot encoded according to their category:

```
    Food        Veg.   Meat   Fruit    Encoding
   ---------    ----   ----   -----    --------
    Apple         0      0      1         001
    Chicken       0      1      0         010
    Broccoli      1      0      0         100
```

**Note**  
The maximum number of categories in any categorical feature is 100. If a property has more than 100 distinct values, only the 99 most common of them are placed in distinct categories, and the rest are placed in a special category named `OTHER`.
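As an illustrative sketch of this scheme (not Neptune ML's actual implementation), the following Python one-hot encodes a list of categorical values, keeping the most common categories and folding the rest into `OTHER`:

```python
from collections import Counter

def one_hot_encode(values, max_categories=100):
    # Sketch of the scheme above: keep the (max_categories - 1) most common
    # categories and map the remaining values to a special OTHER category.
    counts = Counter(values)
    top = [cat for cat, _ in counts.most_common(max_categories - 1)]
    categories = top + ["OTHER"] if len(counts) > len(top) else top
    index = {cat: i for i, cat in enumerate(categories)}

    def encode(value):
        vec = [0] * len(categories)
        # Uncommon or unseen values land in OTHER (index 0 as a fallback).
        vec[index.get(value, index.get("OTHER", 0))] = 1
        return vec

    return categories, encode
```

For the food example above, with values `["Apple", "Chicken", "Broccoli"]`, `encode("Chicken")` returns a vector that is hot only in the Chicken position.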

## Numerical features in Neptune ML
<a name="machine-learning-numerical-features"></a>

Any property whose values are real numbers can be encoded as a numerical feature in Neptune ML. Numerical features are encoded using floating-point numbers.

You can specify a data-normalization method to use when encoding numerical features, like this: `"norm": "normalization technique"`. The following normalization techniques are supported:
+ **"none"**   –   Don't normalize the numerical values during encoding.
+ **"min-max"**   –   Normalize each value by subtracting the minimum value from it and then dividing it by the difference between the maximum value and the minimum.
+ **"standard"**   –   Normalize each value by dividing it by the sum of all the values.

## Bucket-numerical features in Neptune ML
<a name="machine-learning-bucket_numerical-features"></a>

Rather than representing a numerical property using raw numbers, you can condense numerical values into categories. For example, you could divide people's ages into categories such as kids (0-20), young adults (20-40), middle-aged people (40-60) and elders (from 60 on). Using these numerical buckets, you would be transforming a numerical property into a kind of categorical feature.

To have Neptune ML encode a numerical property as a bucket-numerical feature, you must provide two things:
+ A numerical range in the form ` "range": [a, b] `, where `a` and `b` are integers.
+ A bucket count in the form ` "bucket_cnt": c `, where `c` is the number of buckets, also an integer.

Neptune ML then calculates the size of each bucket as ` ( b - a ) / c `, and encodes each numeric value as the number of the bucket it falls into. Any value less than `a` is considered to belong in the first bucket, and any value greater than `b` is considered to belong in the last bucket.
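The bucket assignment can be sketched as follows (illustrative Python, not Neptune ML's code); here `a`, `b`, and `c` correspond to the `range` and `bucket_cnt` settings:

```python
def bucket_index(v, a, b, c):
    # Bucket size is (b - a) / c; values below a land in the first bucket,
    # values above b in the last bucket.
    size = (b - a) / c
    if v <= a:
        return 0
    if v >= b:
        return c - 1
    return int((v - a) // size)
```

With `"range": [0, 80]` and `"bucket_cnt": 4` (bucket size 20), the value 30 lands in bucket 1, and the value 95 lands in the last bucket, 3.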

You can also, optionally, make numeric values fall into more than one bucket, by specifying a slide-window size, like this: ` "slide_window_size": s `, where `s` is a number. Neptune ML then transforms each numeric value `v` of the property into a range from ` v - s/2 ` through ` v + s/2 `, and assigns the value `v` to every bucket that the range covers.
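As an illustrative sketch of slide windows (again, not Neptune ML's actual code), the following returns every bucket covered by the range from ` v - s/2 ` through ` v + s/2 `:

```python
def buckets_for_value(v, a, b, c, s=0):
    # Slide-window assignment: the value v covers [v - s/2, v + s/2] and
    # lands in every bucket that range touches. s = 0 gives one bucket.
    size = (b - a) / c

    def clamp(x):
        # Bucket index for x, clamped to the first and last buckets.
        return max(0, min(c - 1, int((x - a) // size)))

    return list(range(clamp(v - s / 2), clamp(v + s / 2) + 1))
```

With `"range": [0, 80]`, `"bucket_cnt": 4`, and `"slide_window_size": 10`, the value 38 covers the range 33 through 43 and so lands in both bucket 1 and bucket 2.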

Finally, you can also optionally provide a way of filling in missing values for numerical features and bucket-numerical features. You do this using ` "imputer": "imputation technique" `, where the imputation technique is one of `"mean"`, `"median"`, or `"most-frequent"`. If you don't specify an imputer, a missing value can cause processing to halt.

## Text feature encoding in Neptune ML
<a name="machine-learning-text-features"></a>

For free-form text, Neptune ML can use several different models to convert the sequence of tokens in a property value string into a fixed-size real-value vector:
+ [`text_fasttext`](#machine-learning-fasttext-features)   –   Uses [fastText](https://fasttext.cc/) encoding. This is the recommended encoding for features that use one and only one of the five languages that fastText supports.
+ [`text_sbert`](#machine-learning-sbert-features)   –   Uses the [Sentence BERT](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) (SBERT) encoding models. This is the recommended encoding for text that `text_fasttext` does not support.
+ [`text_word2vec`](#machine-learning-word2vec-features)   –   Uses the [Word2Vec](https://wikipedia.org/wiki/Word2vec) algorithms originally published by [Google](https://code.google.com/archive/p/word2vec/) to encode text. Word2Vec only supports English.
+ [`text_tfidf`](#machine-learning-tfidf-features)   –   Uses a [term frequency–inverse document frequency](https://wikipedia.org/wiki/Tf-idf) (TF-IDF) vectorizer for encoding text. TF-IDF encoding supports statistical features that the other encodings do not.

### *fastText* encoding of text property values in Neptune ML
<a name="machine-learning-fasttext-features"></a>

Neptune ML can use the [fastText](https://fasttext.cc/) models to convert text property values into fixed-size real-value vectors. This is the recommended encoding method for text property values in any one of the five languages that fastText supports:
+ `en`   (English)
+ `zh`   (Chinese)
+ `hi`   (Hindi)
+ `es`   (Spanish)
+ `fr`   (French)

Note that fastText cannot handle sentences containing words in more than one language.

The `text_fasttext` method can optionally take a `max_length` field that specifies the maximum number of tokens in a text property value that will be encoded, after which the string is truncated. This can improve performance when text property values contain long strings, because if `max_length` is not specified, fastText encodes all the tokens regardless of the string length.

This example specifies that French movie titles are encoded using fastText:

```
{
    "file_name" : "nodes/movie.csv",
    "separator" : ",",
    "node" : ["~id", "movie"],
    "features" : [
      {
        "feature": ["title", "title", "text_fasttext"],
        "language": "fr",
        "max_length": 1024
      }
    ]
  }
```

### Sentence BERT (SBERT) encoding of text features in Neptune ML
<a name="machine-learning-sbert-features"></a>

Neptune ML can convert the sequence of tokens in a string property value into a fixed-size real-value vector using [Sentence BERT](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) (SBERT) models. Neptune supports two SBERT methods: `text_sbert128`, which is the default if you just specify `text_sbert`, and `text_sbert512`. The difference between the two is the maximum length of a text property value string that is encoded. The `text_sbert128` encoding truncates text strings after encoding 128 tokens, while `text_sbert512` truncates text strings after encoding 512 tokens. As a result, `text_sbert512` requires more processing time than `text_sbert128`. Both methods are slower than `text_fasttext`.

SBERT encoding is multilingual, so there is no need to specify a language for the property value text you are encoding. SBERT supports many languages, and can encode a sentence that contains more than one language. If you are encoding property values containing text in a language or languages that fastText does not support, SBERT is the recommended encoding method.

The following example specifies that movie titles are encoded as SBERT up to a maximum of 128 tokens:

```
{
    "file_name" : "nodes/movie.csv",
    "separator" : ",",
    "node" : ["~id", "movie"],
    "features" : [
      { "feature": ["title", "title", "text_sbert128"] }
    ]
  }
```

### Word2Vec encoding of text features in Neptune ML
<a name="machine-learning-word2vec-features"></a>

Neptune ML can encode string property values as a Word2Vec feature ([Word2Vec algorithms](https://wikipedia.org/wiki/Word2vec) were originally published by [Google](https://code.google.com/archive/p/word2vec/)). The `text_word2vec` method encodes the tokens in a string as a dense vector using one of the [spaCy trained models](https://spacy.io/models). This only supports the English language, using the [en_core_web_lg](https://spacy.io/models/en#en_core_web_lg) model.

The following example specifies that movie titles are encoded using Word2Vec:

```
{
    "file_name" : "nodes/movie.csv",
    "separator" : ",",
    "node" : ["~id", "movie"],
    "features" : [
      {
        "feature": ["title", "title", "text_word2vec"],
        "language": "en_core_web_lg"
      }
    ]
  }
```

Note that the language field is optional, since the English `en_core_web_lg` model is the only one that Neptune supports.

### TF-IDF encoding of text features in Neptune ML
<a name="machine-learning-tfidf-features"></a>

Neptune ML can encode text property values as `text_tfidf` features. This encoding converts the sequence of words in the text into a numeric vector using a [term frequency–inverse document frequency](https://wikipedia.org/wiki/Tf-idf) (TF-IDF) vectorizer, followed by a dimensionality-reduction operation.

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency – inverse document frequency) is a numerical value intended to measure how important a word is in a document set. It is calculated by dividing the number of times a word appears in a given property value by the total number of such property values that it appears in.

For example, if the word "kiss" appears twice in a given movie title (say, "kiss kiss bang bang"), and "kiss" appears in the title of 4 movies in all, then the TF-IDF value of "kiss" in the "kiss kiss bang bang" title would be ` 2 / 4 `.
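That worked example can be checked with a short sketch of this simplified TF-IDF calculation (illustrative Python; the actual vectorizer also performs dimensionality reduction):

```python
def tfidf_value(word, value, all_values):
    # Simplified TF-IDF as described above: occurrences of the word in this
    # property value, divided by the number of property values containing it.
    tf = value.split().count(word)
    df = sum(1 for v in all_values if word in v.split())
    return tf / df
```

If "kiss" appears in four titles in all, `tfidf_value("kiss", "kiss kiss bang bang", titles)` evaluates to ` 2 / 4 `, or 0.5.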

The vector that is initially created has ***d*** dimensions, where ***d*** is the number of unique terms in all property values of that type. The dimensionality-reduction operation uses a random sparse projection to reduce that number to a maximum of 100. The vocabulary of a graph is then generated by merging all the `text_tfidf` features in it.

You can control the TF-IDF vectorizer in several ways:
+ **`max_features`**   –   Using the `max_features` parameter, you can limit the number of terms in `text_tfidf` features to the most common ones. For example, if you set `max_features` to 100, only the top 100 most commonly used terms are included. The default value for `max_features` if you don't explicitly set it is 5,000.
+ **`min_df`**   –   Using the `min_df` parameter, you can limit the number of terms in `text_tfidf` features to ones having at least a specified document frequency. For example, if you set `min_df` to 5, only terms that appear in at least 5 different property values are used. The default value for `min_df` if you don't explicitly set it is 2.
+ **`ngram_range`**   –   The `ngram_range` parameter determines what combinations of words are treated as terms. For example, if you set `ngram_range` to `[2, 4]`, the following 6 terms would be found in the "kiss kiss bang bang" title:
  + *2-word terms*:  "kiss kiss", "kiss bang", and "bang bang".
  + *3-word terms*:  "kiss kiss bang" and "kiss bang bang".
  + *4-word terms*:  "kiss kiss bang bang".

  The default setting for `ngram_range` is `[1, 1]`.
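To illustrate, the following sketch (not the actual vectorizer) enumerates the word n-grams that an `ngram_range` setting produces:

```python
def ngrams(text, ngram_range):
    # Enumerate word n-grams for every n from lo through hi, as the
    # ngram_range parameter does.
    words = text.split()
    lo, hi = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(words) - n + 1)]
```

`ngrams("kiss kiss bang bang", [2, 4])` produces exactly the six terms listed above.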

## Datetime features in Neptune ML
<a name="machine-learning-datetime-features"></a>

Neptune ML can convert parts of `datetime` property values into categorical features by encoding them as [one-hot arrays](https://en.wikipedia.org/wiki/One-hot). Use the `datetime_parts` parameter to specify one or more of the following parts to encode: `["year", "month", "weekday", "hour"]`. If you don't set `datetime_parts`, by default all four parts are encoded.

For example, if the range of datetime values spans the years 2010 through 2012, the four parts of the datetime entry `2011-04-22 01:16:34` are as follows:
+ **year**   –   `[0, 1, 0]`.

  Since there are only 3 years in the span (2010, 2011, and 2012), the one-hot array has three entries, one for each year.
+ **month**   –   `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]`.

  Here, the one-hot array has an entry for each month of the year.
+ **weekday**   –   `[0, 0, 0, 0, 1, 0, 0]`.

  The ISO 8601 standard states that Monday is the first day of the week, and since April 22, 2011 was a Friday, the corresponding one-hot weekday array is hot in the fifth position. 
+ **hour**   –   `[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`.

  The hour 1 AM is set in a 24-member one-hot array.

Day of the month, minute, and second are not encoded categorically.

If the total `datetime` range in question only includes dates within a single year, no `year` array is encoded.

You can specify an imputation strategy to fill in missing `datetime` values, using the `imputer` parameter and one of the strategies available for numerical features.
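The worked example above can be reproduced with a short sketch (illustrative Python; the `year_range` argument stands in for the year span of your data, and this is not Neptune ML's actual encoder):

```python
from datetime import datetime

def encode_datetime(value, year_range):
    # One-hot encode the four datetime parts as in the worked example.
    # year_range is the (first, last) year spanned by the data.
    dt = datetime.fromisoformat(value)

    def one_hot(i, length):
        vec = [0] * length
        vec[i] = 1
        return vec

    first, last = year_range
    return {
        "year": one_hot(dt.year - first, last - first + 1),
        "month": one_hot(dt.month - 1, 12),
        "weekday": one_hot(dt.weekday(), 7),  # Monday is 0, per ISO 8601
        "hour": one_hot(dt.hour, 24),
    }
```

For `2011-04-22 01:16:34` with a 2010 through 2012 span, `encode_datetime` reproduces the four one-hot arrays shown above.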

## Auto feature encoding in Neptune ML
<a name="machine-learning-auto-encoding"></a>

Instead of manually specifying the feature encoding methods to use for the properties in your graph, you can set `auto` as a feature encoding method. Neptune ML then attempts to infer the best feature encoding for each property based on its underlying data type.

Here are some of the heuristics that Neptune ML uses in selecting the appropriate feature encodings:
+ If the property has only numeric values and can be cast into numeric data types, then Neptune ML generally encodes it as a numeric value. However, if the number of unique values for the property is less than 10% of the total number of values and the cardinality of those unique values is less than 100, then Neptune ML uses a categorical encoding.
+ If the property values can be cast to a `datetime` type, then Neptune ML encodes them as a `datetime` feature.
+ If the property values can be coerced to booleans (1/0 or True/False), then Neptune ML uses category encoding.
+ If the property is a string with more than 10% of its values unique, and the average number of tokens per value is greater than or equal to 3, then Neptune ML infers the property type to be text and automatically detects the language being used. If the language detected is one of the five supported by [fastText](#machine-learning-fasttext-features), namely English, Chinese, Hindi, Spanish, and French, then Neptune ML uses `text_fasttext` to encode the text. Otherwise, Neptune ML uses [`text_sbert`](#machine-learning-sbert-features).
+ If the property is a string that is not classified as a text feature, then Neptune ML presumes it to be a categorical feature and uses category encoding.
+ If each node has its own unique value for a property that is inferred to be a category feature, Neptune ML drops the property from the training graph because it is probably an ID that would not be informative for learning.
+ If the property is known to contain valid Neptune separators such as semicolons (";"), then Neptune ML can only treat the property as `MultiNumerical` or `MultiCategorical`.
  + Neptune ML first tries to encode the values as numeric features. If this succeeds, Neptune ML uses numerical encoding to create numeric vector features.
  + Otherwise, Neptune ML encodes the values as multi-categorical.
+ If Neptune ML cannot infer the data type of a property's values, Neptune ML drops the property from the training graph.
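As an illustration of the first heuristic above (a sketch, not Neptune ML's implementation), a numeric property becomes categorical when its unique values are fewer than 10% of the total and number fewer than 100:

```python
def infer_numeric_feature_type(values):
    # Numeric heuristic sketch: use categorical encoding when unique values
    # are under 10% of the total and there are fewer than 100 of them;
    # otherwise keep the property numerical.
    unique = set(values)
    if len(unique) < 0.1 * len(values) and len(unique) < 100:
        return "category"
    return "numerical"
```

For example, a column of 300 values drawn from only 3 distinct numbers would be inferred as categorical, while a column of 50 all-distinct values stays numerical.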

# Editing a training data configuration file
<a name="machine-learning-processing-training-config-file"></a>

The Neptune export process exports Neptune ML data from a Neptune DB cluster into an S3 bucket. It exports nodes and edges separately into a `nodes/` and an `edges/` folder. It also creates a JSON training data configuration file, named `training-data-configuration.json` by default. This file contains information about the schema of the graph, the types of its features, feature transformation and normalization operations, and the target feature for a classification or regression task.

There might be cases when you want to modify the configuration file directly. One such case is when you want to change the way features are processed or how the graph is constructed, without needing to rerun the export every time you want to modify the specification for the machine learning task you're solving.

**To edit the training data configuration file**

1. **Download the file to your local machine.**

   Unless you specified one or more named jobs in the `additionalParams/neptune_ml` parameter passed to the export process, the file will have the default name, which is `training-data-configuration.json`. You can use an AWS CLI command like this to download the file:

   ```
   aws s3 cp \
     s3://(your Amazon S3 bucket)/(path to your export folder)/training-data-configuration.json \
     ./
   ```

1. **Edit the file using a text editor.**

1. **Upload the modified file.** Upload the modified file back to the same location in Amazon S3 from which you downloaded it, using an AWS CLI command like this:

   ```
   aws s3 cp \
     training-data-configuration.json \
     s3://(your Amazon S3 bucket)/(path to your export folder)/training-data-configuration.json
   ```

# Example of a JSON training data configuration file
<a name="machine-learning-processing-training-config-file-example"></a>

Here is a sample training data configuration file that describes a graph for a node-classification task:

```
{
  "version" : "v2.0",
  "query_engine" : "gremlin",
  "graph" : [
    {
      "edges" : [
        {
          "file_name" : "edges/(movie)-included_in-(genre).csv",
          "separator" : ",",
          "source" : ["~from", "movie"],
          "relation" : ["", "included_in"],
          "dest" : [ "~to", "genre" ]
        },
        {
          "file_name" : "edges/(user)-rated-(movie).csv",
          "separator" : ",",
          "source" : ["~from", "user"],
          "relation" : ["rating", "prefixname"],
          "dest" : ["~to", "movie"],
          "features" : [
            {
              "feature" : ["rating", "rating", "numerical"],
              "norm" : "min-max"
            }
          ]
        }
      ],
      "nodes" : [
        {
          "file_name" : "nodes/genre.csv",
          "separator" : ",",
          "node" : ["~id", "genre"],
          "features" : [
            {
              "feature": ["name", "genre", "category"],
              "separator": ";"
            }
          ]
        },
        {
          "file_name" : "nodes/movie.csv",
          "separator" : ",",
          "node" : ["~id", "movie"],
          "features" : [
            {
              "feature": ["title", "title", "word2vec"],
              "language": ["en_core_web_lg"]
            }
          ]
        },
        {
          "file_name" : "nodes/user.csv",
          "separator" : ",",
          "node" : ["~id", "user"],
          "features" : [
            {
              "feature": ["age", "age", "numerical"],
              "norm" : "min-max",
              "imputation": "median",
            },
            {
              "feature": ["occupation", "occupation", "category"],
            }
          ],
          "labels" : [
            {
              "label": ["gender", "classification"],
              "split_rate" : [0.8, 0.2, 0.0]
            }
          ]
        }
      ],
      "warnings" : [ ]
    }
  ]
}
```

# The structure of JSON training data configuration files
<a name="machine-learning-processing-training-config-file-structure"></a>

The training configuration file refers to CSV files saved by the export process in the `nodes/` and `edges/` folders.

Each file under `nodes/` stores information about nodes that have the same property-graph node label. Each column in a node file stores either the node ID or a node property value. The first line of the file contains a header that specifies the `~id` or property name for each column.

Each file under `edges/` stores information about edges that have the same property-graph edge label. Each column in an edge file stores either the source node ID, the destination node ID, or an edge property value. The first line of the file contains a header specifying the `~from`, `~to`, or property name for each column.

The training data configuration file has three top-level elements:

```
{
  "version" : "v2.0",
  "query_engine" : "gremlin",
  "graph" : [ ... ]
}
```
+ `version`   –   (String) The version of configuration file being used.
+ `query_engine`   –   (String) The query language used for exporting the graph data. Currently, only "gremlin" is valid.
+ `graph`   –   (JSON array) Lists one or more configuration objects that contain model parameters for each of the nodes and edges that will be used.

  The configuration objects in the graph array have the structure described in the next section.

## Contents of a configuration object listed in the `graph` array
<a name="machine-learning-graph-training-config-object"></a>

A configuration object in the `graph` array can contain three top-level fields:

```
    {
      "edges"    : [ ... ],
      "nodes"    : [ ... ],
      "warnings" : [ ... ],
    }
```
+ `edges`   –   (array of JSON objects) Each JSON object specifies a set of parameters to define how an edge in the graph will be treated during the model processing and training. This is only used with the Gremlin engine.
+ `nodes`   –   (array of JSON objects) Each JSON object specifies a set of parameters to define how a node in the graph will be treated during the model processing and training. This is only used with the Gremlin engine.
+ `warnings`   –   (array of JSON objects) Each object contains a warning generated during the data export process.

## Contents of an edge configuration object listed in an `edges` array
<a name="machine-learning-graph-edges-config"></a>

An edge configuration object listed in an `edges` array can contain the following top-level fields:

```
      {
        "file_name" : "(path to a CSV file)",
        "separator" : "(separator character)",
        "source"    : ["(column label for starting node ID)", "(starting node type)"],
        "relation"  : ["(column label for the relationship name)", "(the prefix name for the relationship name)"],
        "dest"      : ["(column label for ending node ID)", "(ending node type)"],
        "features"  : [(array of feature objects)],
        "labels"    : [(array of label objects)]
      }
```
+ **`file_name`**   –   A string specifying the path to a CSV file that stores information about edges having the same property-graph label.

  The first line of that file contains a header line of column labels.

  The first two column labels are `~from` and `~to`. The first column (the `~from` column) stores the ID of the edge's starting node, and the second (the `~to` column) stores the ID of the edge's ending node.

  The remaining column labels in the header line specify, for each remaining column, the name of the edge property whose values have been exported into that column.
+ **`separator`**   –   A string containing the delimiter that separates columns in that CSV file.
+ **`source`**   –   A JSON array containing two strings that specify the starting node of the edge. The first string contains the header name of the column that the starting node ID is stored in. The second string specifies the node type.
+ **`relation`**   –   A JSON array containing two strings that specify the edge's relation type. The first string contains the header name of the column that the relation name (`relname`) is stored in. The second string contains the prefix for the relation name (`prefixname`).

  The full relation type consists of the two strings combined, with a hyphen character between them, like this: `prefixname-relname`.

  If the first string is empty, all edges have the same relation type, namely the `prefixname` string.
+ **`dest`**   –   A JSON array containing two strings that specify the ending node of the edge. The first string contains the header name of the column that the node ID is stored in. The second string specifies the node type.
+ **`features`**   –   A JSON array of property-value feature objects. Each property-value feature object contains the following fields:
  + **`feature`**   –   A JSON array of three strings. The first string contains the header name of the column that contains the property value. The second string contains the feature name. The third string contains the feature type.
  + **`norm`**   –   (*Optional*) Specifies a normalization method to apply to the property values.
+ **`labels`**   –   A JSON array of objects. Each object defines a target feature of the edges and specifies the proportions of edges that the training, validation, and test stages should use. Each object contains the following fields:
  + **`label`**   –   A JSON array of two strings. The first string contains the header name of the column that contains the target feature property value. The second string specifies one of the following target task types:
    + `"classification"`   –   An edge classification task. The property values provided in the column identified by the first string in the `label` array are treated as categorical values. For an edge classification task, the first string in the `label` array can't be empty.
    + `"regression"`   –   An edge regression task. The property values provided in the column identified by the first string in the `label` array are treated as numerical values. For an edge regression task, the first string in the `label` array can't be empty.
    + `"link_prediction"`   –   A link prediction task. No property values are required. For a link prediction task, the first string in the `label` array is ignored.
  + **`split_rate`**   –   A JSON array containing three numbers between zero and one that add up to one and that represent an estimate of the proportions of edges that the training, validation, and test stages will use, respectively. Either this field or `custom_split_filenames` can be defined, but not both. See [`split_rate`](machine-learning-neptune_ml-targets.md#machine-learning-property-graph-neptune_ml-targets-split_rate).
  + **`custom_split_filenames`**   –   A JSON object that specifies the file names for the files that define the training, validation and test populations. Either this field or `split_rate` can be defined, but not both. See [Custom train-validation-test proportions](#machine-learning-custom-stages-splits) for more information.
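For illustration, here is a hypothetical edge configuration object for `rated` edges from `user` to `movie` nodes; the file name, property names, and feature types are examples only, not part of the export format:

```
      {
        "file_name" : "Edges/user-rated-movie.csv",
        "separator" : ",",
        "source"    : ["~from", "user"],
        "relation"  : ["", "rated"],
        "dest"      : ["~to", "movie"],
        "features"  : [
          {
            "feature" : ["timestamp", "timestamp", "datetime"],
            "datetime_parts" : ["month", "weekday", "hour"]
          }
        ],
        "labels" : [
          {
            "label"      : ["rating", "regression"],
            "split_rate" : [0.8, 0.1, 0.1]
          }
        ]
      }
```

Because the first string in the `relation` array is empty, every edge in this file gets the same relation type, namely `rated`.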

## Contents of a node configuration object listed in a `nodes` array
<a name="machine-learning-graph-nodes-config"></a>

A node configuration object listed in a `nodes` array can contain the following fields:

```
      {
        "file_name" : "(path to a CSV file)",
        "separator" : "(separator character)",
        "node"      : ["(column label for the node ID)", "(node type)"],
        "features"  : [(feature array)],
        "labels"    : [(label array)]
      }
```
+ **`file_name`**   –   A string specifying the path to a CSV file that stores information about nodes having the same property-graph label.

  The first line of that file contains a header line of column labels.

  The first column label is `~id`, and the first column (the `~id` column) stores the node ID.

  The remaining column labels in the header line specify, for each remaining column, the name of the node property whose values have been exported into that column.
+ **`separator`**   –   A string containing the delimiter that separates columns in that CSV file.
+ **`node`**   –   A JSON array containing two strings. The first string contains the header name of the column that stores node IDs. The second string specifies the node type in the graph, which corresponds to a property-graph label of the node.
+ **`features`**   –   A JSON array of node feature objects. See [Contents of a feature object listed in a `features` array for a node or edge](#machine-learning-graph-node-features-config).
+ **`labels`**   –   A JSON array of node label objects. See [Contents of a node label object listed in a node `labels` array](#machine-learning-graph-node-labels-config).
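For example, a minimal node configuration object for `movie` nodes might look like the following (the file name, node type, and `genre` property are hypothetical):

```
      {
        "file_name" : "Nodes/movie.csv",
        "separator" : ",",
        "node"      : ["~id", "movie"],
        "features"  : [
          {
            "feature"   : ["genre", "genre", "category"],
            "separator" : ";"
          }
        ],
        "labels"    : []
      }
```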

## Contents of a feature object listed in a `features` array for a node or edge
<a name="machine-learning-graph-node-features-config"></a>

A feature object listed in a node or edge `features` array can contain the following top-level fields:
+ **`feature`**   –   A JSON array of three strings. The first string contains the header name of the column that contains the property value for the feature. The second string contains the feature name.

  The third string contains the feature type. Valid feature types are listed in [Possible values of the type field for features](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-feature-types). 
+ **`norm`**   –   This field is required for numerical features. It specifies a normalization method to use on numeric values. Valid values are `"none"`, `"min-max"`, and `"standard"`. See [The norm field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-norm) for details.
+ **`language`**   –   Specifies the language used in text property values. Its usage depends on the text encoding method:
  + For [`text_fasttext`](machine-learning-feature-encoding.md#machine-learning-fasttext-features) encoding, this field is required, and must specify one of the following languages:
    + `en`   (English)
    + `zh`   (Chinese)
    + `hi`   (Hindi)
    + `es`   (Spanish)
    + `fr`   (French)

    However, `text_fasttext` cannot handle more than one language at a time.
  + For [`text_sbert`](machine-learning-feature-encoding.md#machine-learning-fasttext-features) encoding, this field is not used, since SBERT encoding is multilingual.
  + For [`text_word2vec`](machine-learning-feature-encoding.md#machine-learning-word2vec-features) encoding, this field is optional, since `text_word2vec` only supports English. If present, it must specify the name of the English language model:

    ```
    "language" : "en_core_web_lg"
    ```
  + For [`tfidf`](machine-learning-feature-encoding.md#machine-learning-tfidf-features) encoding, this field is not used.
+ **`max_length`**   –   This field is optional for [`text_fasttext`](machine-learning-feature-encoding.md#machine-learning-fasttext-features) features, where it specifies the maximum number of tokens in an input text feature that will be encoded. Input text after `max_length` is reached is ignored. For example, setting `max_length` to 128 indicates that any tokens after the 128th in a text sequence are ignored.
+ **`separator`**  –   This field is used optionally with `category`, `numerical` and `auto` features. It specifies a character that can be used to split a property value into multiple categorical values or numerical values.

  See [The separator field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-separator).
+ **`range`**  –   This field is required for `bucket_numerical` features. It specifies the range of numerical values that are to be divided into buckets.

  See [The range field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-range).
+ **`bucket_cnt`**  –   This field is required for `bucket_numerical` features. It specifies the number of buckets that the numerical range defined by the `range` parameter should be divided into.

  See [Bucket-numerical features in Neptune ML](machine-learning-feature-encoding.md#machine-learning-bucket_numerical-features).
+ **`slide_window_size`**  –   This field is used optionally with `bucket_numerical` features to assign values to more than one bucket.

  See [The `slide_window_size` field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-slide_window_size).
+ **`imputer`**  –   This field is used optionally with `numerical`, `bucket_numerical`, and `datetime` features to provide an imputation technique for filling in missing values. The supported imputation techniques are `"mean"`, `"median"`, and `"most_frequent"`.

  See [The imputer field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-imputer).
+ **`max_features`**  –   This field is used optionally by `text_tfidf` features to specify the maximum number of terms to encode.

  See [The `max_features` field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-max_features).
+ **`min_df`**  –   This field is used optionally by `text_tfidf` features to specify the minimum document frequency of terms to encode.

  See [The `min_df` field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-min_df).
+ **`ngram_range`**  –   This field is used optionally by `text_tfidf` features to specify a range of numbers of words or tokens to be considered as potential individual terms to encode.

  See [The `ngram_range` field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-ngram_range).
+ **`datetime_parts`**  –   This field is used optionally by `datetime` features to specify which parts of the datetime value to encode categorically.

  See [The `datetime_parts` field](machine-learning-neptune_ml-features.md#machine-learning-neptune_ml-features-datetime_parts).
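As a sketch of how these fields combine, the following hypothetical `features` array encodes a numerical property, a free-text property, and a bucketed numerical property (all column and feature names are invented for illustration):

```
      [
        {
          "feature" : ["age", "age", "numerical"],
          "norm"    : "min-max",
          "imputer" : "median"
        },
        {
          "feature"    : ["overview", "overview", "text_fasttext"],
          "language"   : "en",
          "max_length" : 128
        },
        {
          "feature"    : ["price", "price", "bucket_numerical"],
          "range"      : [0, 500],
          "bucket_cnt" : 10,
          "slide_window_size" : 5
        }
      ]
```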

## Contents of a node label object listed in a node `labels` array
<a name="machine-learning-graph-node-labels-config"></a>

A label object listed in a node `labels` array defines a node target feature and specifies the proportions of nodes that the training, validation, and test stages will use. Each object can contain the following fields:

```
      {
        "label"      : ["(column label for the target feature property value)", "(task type)"],
        "split_rate" : [(training proportion), (validation proportion), (test proportion)],
        "custom_split_filenames" : {"train": "(training file name)", "valid": "(validation file name)", "test": "(test file name)"},
        "separator"  : "(separator character for node-classification category values)"
      }
```
+ **`label`**   –   A JSON array containing two strings. The first string contains the header name of the column that stores the property values for the feature. The second string specifies the target task type, which can be:
  + `"classification"`   –   A node classification task. The property values in the specified column are used to create a categorical feature.
  + `"regression"`   –   A node regression task. The property values in the specified column are used to create a numerical feature.
+ **`split_rate`**   –   A JSON array containing three numbers between zero and one that add up to one and represent an estimate of the proportions of nodes that the training, validation, and test stages will use, respectively. See [`split_rate`](machine-learning-neptune_ml-targets.md#machine-learning-property-graph-neptune_ml-targets-split_rate).
+ **`custom_split_filenames`**   –   A JSON object that specifies the file names for the files that define the training, validation and test populations. Either this field or `split_rate` can be defined, but not both. See [Custom train-validation-test proportions](#machine-learning-custom-stages-splits) for more information.
+ **`separator`**   –   A string containing the delimiter that separates categorical feature values for a classification task.
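For example, a hypothetical label object for a multi-category node classification task, where each node's `genre` property holds semicolon-separated category values, might look like this:

```
      {
        "label"      : ["genre", "classification"],
        "split_rate" : [0.7, 0.1, 0.2],
        "separator"  : ";"
      }
```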

**Note**  
If no label object is provided for either edges or nodes, the task is automatically assumed to be link prediction, and edges are randomly split into 90% for training and 10% for validation.

## Custom train-validation-test proportions
<a name="machine-learning-custom-stages-splits"></a>

By default, Neptune ML uses the `split_rate` parameter to split the graph randomly into training, validation, and test populations in the proportions that the parameter defines. To have more precise control over which entities are used in these different populations, you can create files that explicitly define them, and then [edit the training data configuration file](machine-learning-processing-training-config-file.md) to map these index files to the populations. This mapping is specified by a JSON object for the [`custom_split_filenames`](#custom_split_filenames) key in the training configuration file. If this option is used, filenames must be provided for the `train` and `valid` keys; a filename for the `test` key is optional.

The formatting of these files should match the [Gremlin data format](bulk-load-tutorial-format-gremlin.md#bulk-load-tutorial-format-gremlin-systemheaders). Specifically, for node-level tasks, each file should contain a column with the `~id` header that lists the node IDs, and for edge-level tasks, the files should contain `~from` and `~to` columns that identify the source and destination nodes of the edges, respectively. These files must be placed in the same Amazon S3 location as the exported data that is used for data processing (see [`outputS3Path`](export-parameters.md#export-parameters-outputS3Path)).

For property classification or regression tasks, these files can optionally define the labels for the machine-learning task. In that case the files need to have a property column with the same header name as is [defined in the training data configuration file](#machine-learning-graph-node-labels-config). If property labels are defined in both the exported node and edge files and the custom-split files, priority is given to the custom-split files.
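As an illustration, for a node classification task, a hypothetical custom-split file such as `train.csv` would list node IDs under an `~id` header, optionally with a label column whose header matches the property name in the training data configuration file:

```
~id,genre
movie_1,comedy
movie_2,drama
```

The corresponding entry in the label object would then map each population to its file (file names are examples only):

```
      "custom_split_filenames" : {
        "train" : "train.csv",
        "valid" : "valid.csv",
        "test"  : "test.csv"
      }
```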