Neptune Graph Data Model - Amazon Neptune

Neptune Graph Data Model

The basic unit of Amazon Neptune graph data is a four-position (quad) element, which is similar to a Resource Description Framework (RDF) quad. The following are the four positions of a Neptune quad:

  • subject    (S)

  • predicate  (P)

  • object     (O)

  • graph      (G)

Each quad is a statement that makes an assertion about one or more resources. A statement can assert the existence of a relationship between two resources, or it can attach a property (key-value pair) to a resource. You can think of the quad predicate value generally as the verb of the statement. It describes the type of relationship or property that's being defined. The object is the target of the relationship, or the value of the property. The following are examples:

  • A relationship between two vertices can be represented by storing the source vertex identifier in the S position, the target vertex identifier in the O position, and the edge label in the P position.

  • A property can be represented by storing the element identifier in the S position, the property key in the P position, and the property value in the O position.

The graph position G is used differently by the RDF and property-graph stacks. For RDF data in Neptune, the G position contains a named graph identifier. For property graphs in Gremlin, it stores the edge ID when the statement represents an edge. In all other cases, it defaults to a fixed value.

A set of quad statements with shared resource identifiers creates a graph.
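As a rough illustration of the two examples above, the following Python sketch models quads as plain tuples. The `Quad` type, the helper functions, the identifiers (`v1`, `e1`, `knows`), and the `DEFAULT_GRAPH` placeholder are hypothetical names chosen for this example. They are not Neptune's internal representation, and the placeholder is not the actual default value.

```python
from typing import NamedTuple

# Hypothetical placeholder for the fixed value that G defaults to.
DEFAULT_GRAPH = "<default>"


class Quad(NamedTuple):
    s: str     # subject
    p: str     # predicate
    o: object  # object: a resource identifier or a property value
    g: str     # graph


def edge(edge_id, source, label, target):
    """Property-graph edge: S = source, P = label, O = target, G = edge ID."""
    return Quad(source, label, target, edge_id)


def prop(element_id, key, value):
    """Property: S = element, P = key, O = value, G = fixed default."""
    return Quad(element_id, key, value, DEFAULT_GRAPH)


graph = [
    edge("e1", "v1", "knows", "v2"),
    prop("v1", "name", "Alice"),
    prop("e1", "since", 2019),
]

# Statements that share the identifier "v1" in S describe the same vertex.
about_v1 = [q for q in graph if q.s == "v1"]
```

Here the edge ID `e1` appears both in the G position of the edge statement and in the S position of a property statement, which is how shared identifiers tie individual statements into a graph.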

Dictionary of user-facing values

Neptune does not store most user-facing values directly in the various indexes it maintains. Instead, it stores them separately in a dictionary and replaces them in the indexes with 8-byte identifiers.

  • All user-facing values that would go in S, P, or G indexes are stored in the dictionary in this way.

  • In the O index, numeric values are stored directly in the index (inlined). This includes date and datetime values (represented as milliseconds from the epoch).

  • All other user-facing values that would go in the O index are stored in the dictionary and represented in the index by IDs.

The dictionary contains a forward mapping of user-facing values to 8-byte IDs in a value_to_id index.

It stores the reverse mapping of 8-byte IDs to values in one of two indexes, depending on the size of the values:

  • An id_to_value index maps IDs to user-facing values that are smaller than 767 bytes after internal encoding.

  • An id_to_blob index maps IDs to user-facing values that are larger.
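The dictionary scheme described above can be sketched in a few lines of Python. This is a simplified model, not Neptune's implementation: the `Dictionary` class and `encode_object` function are hypothetical, UTF-8 stands in for the unspecified internal encoding, IDs are plain sequential integers rather than real 8-byte identifiers, and values of exactly 767 bytes are assumed to go to `id_to_blob`.

```python
from datetime import datetime, timezone

BLOB_THRESHOLD = 767  # bytes after encoding, as stated in the text


class Dictionary:
    def __init__(self):
        self.value_to_id = {}  # forward mapping
        self.id_to_value = {}  # reverse mapping for small values
        self.id_to_blob = {}   # reverse mapping for large values
        self._next_id = 1

    def encode(self, value: str) -> int:
        """Return the ID for a value, assigning a new one if needed."""
        if value in self.value_to_id:
            return self.value_to_id[value]
        new_id = self._next_id
        self._next_id += 1
        self.value_to_id[value] = new_id
        # Assumption: UTF-8 length approximates the internally encoded size.
        size = len(value.encode("utf-8"))
        reverse = self.id_to_value if size < BLOB_THRESHOLD else self.id_to_blob
        reverse[new_id] = value
        return new_id

    def decode(self, value_id: int) -> str:
        if value_id in self.id_to_value:
            return self.id_to_value[value_id]
        return self.id_to_blob[value_id]


def encode_object(dictionary, value):
    """O position: inline numeric and date values, dictionary-encode the rest."""
    if isinstance(value, datetime):
        # Dates are represented as milliseconds from the epoch.
        return ("inline", int(value.timestamp() * 1000))
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return ("inline", value)
    return ("id", dictionary.encode(str(value)))
```

For example, `encode_object(d, 42)` keeps the number inline, while `encode_object(d, "Alice")` returns an ID that `d.decode` can map back to the string.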

How Statements Are Indexed in Neptune

When you query a graph of quads, for each quad position, you can either specify a value constraint, or not. The query returns all the quads that match the value constraints that you specified.

Neptune uses indexes to resolve queries. In the 2005 paper, Optimized Index Structures for Querying RDF from the Web, Andreas Harth and Stefan Decker observed that there are 16 (2^4) possible access patterns for the four quad positions. You can query all 16 patterns efficiently without having to scan and filter by using six quad statement indexes. Each quad statement index uses a key that is composed of the four position values concatenated in a different order.

Access Pattern                                          Index key order
------------------------------------------------------  ---------------
 1.  ????  (No constraints; returns every quad)         SPOG
 2.  SPOG  (Every position is constrained)              SPOG
 3.  SPO?  (S, P, and O are constrained; G is not)      SPOG
 4.  SP??  (S and P are constrained; O and G are not)   SPOG
 5.  S???  (S is constrained; P, O, and G are not)      SPOG
 6.  S??G  (S and G are constrained; P and O are not)   SPOG
 7.  ?POG  (P, O, and G are constrained; S is not)      POGS
 8.  ?PO?  (P and O are constrained; S and G are not)   POGS
 9.  ?P??  (P is constrained; S, O, and G are not)      POGS
10.  ?P?G  (P and G are constrained; S and O are not)   GPSO
11.  SP?G  (S, P, and G are constrained; O is not)      GPSO
12.  ???G  (G is constrained; S, P, and O are not)      GPSO
13.  S?OG  (S, O, and G are constrained; P is not)      OGSP
14.  ??OG  (O and G are constrained; S and P are not)   OGSP
15.  ??O?  (O is constrained; S, P, and G are not)      OGSP
16.  S?O?  (S and O are constrained; P and G are not)   OSGP
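The count of 16 follows from each of the four positions being either constrained or unconstrained. A short Python sketch (the `access_patterns` helper is a hypothetical name used only for this illustration) enumerates the patterns in the same notation as the table, with `?` marking an unconstrained position:

```python
from itertools import product

POSITIONS = "SPOG"


def access_patterns():
    """Yield each pattern as a string such as 'SP??'."""
    for bound in product((True, False), repeat=len(POSITIONS)):
        yield "".join(c if b else "?" for c, b in zip(POSITIONS, bound))


patterns = list(access_patterns())
```

The list contains every combination from `SPOG` (all positions constrained) to `????` (none constrained).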

Neptune creates and maintains only three out of those six indexes by default:

  • SPOG –   Uses a key composed of Subject + Predicate + Object + Graph.

  • POGS –   Uses a key composed of Predicate + Object + Graph + Subject.

  • GPSO –   Uses a key composed of Graph + Predicate + Subject + Object.

These three indexes handle many of the most common access patterns. Maintaining only three full statement indexes instead of six greatly reduces the resources that you need to support rapid access without scanning and filtering. For example, the SPOG index allows efficient lookup whenever a prefix of the positions, such as the vertex or vertex and property identifier, is bound. The POGS index allows efficient access when only the edge or property label stored in P position is bound.

The low-level API for finding statements takes a statement pattern in which some positions are known and the rest are left for discovery by index search. By composing the known positions into a key prefix according to the index key order for one of the statement indexes, Neptune performs a range scan to retrieve all the statements matching the known positions.
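The prefix range scan can be sketched as follows, assuming a sorted in-memory list of tuples stands in for a statement index. The functions `build_index` and `prefix_scan`, the sample data, and the `<default>` graph value are hypothetical and only illustrate the idea. Neptune's low-level API and storage layout are not shown in this section.

```python
from bisect import bisect_left

POSITION = {"S": 0, "P": 1, "O": 2, "G": 3}


def build_index(quads, order):
    """Sort (s, p, o, g) quads by a key rearranged into `order`, e.g. 'POGS'."""
    return sorted(tuple(q[POSITION[c]] for c in order) for q in quads)


def prefix_scan(index, prefix):
    """Return every key in the sorted index that starts with `prefix`."""
    prefix = tuple(prefix)
    start = bisect_left(index, prefix)  # first key >= prefix
    matches = []
    for i in range(start, len(index)):
        if index[i][:len(prefix)] != prefix:
            break  # keys are sorted, so no later key can match
        matches.append(index[i])
    return matches


quads = [
    ("v1", "knows", "v2", "e1"),
    ("v1", "knows", "v3", "e2"),
    ("v1", "name", "Alice", "<default>"),
    ("v2", "name", "Bob", "<default>"),
]

spog = build_index(quads, "SPOG")
pogs = build_index(quads, "POGS")

out_edges = prefix_scan(spog, ("v1", "knows"))  # pattern SP??
names = prefix_scan(pogs, ("name",))            # pattern ?P??
```

Because the keys are sorted, all statements that share the known positions are contiguous, so the scan stops at the first key that no longer matches the prefix.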

However, one of the statement indexes that Neptune does not create by default is a reverse traversal OSGP index, which can gather predicates across objects and subjects. Instead, Neptune by default tracks distinct predicates in a separate index that it uses to do a union scan of {all P x POGS}. When you are working with Gremlin, a predicate corresponds to a property or an edge label.

If the number of distinct predicates in a graph becomes large, the default Neptune access strategy can become inefficient. In Gremlin, for example, an in() step where no edge labels are given, or any step that uses in() internally such as both() or drop(), may become quite inefficient.
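To see why the cost grows with the number of distinct predicates, consider this simplified Python model of the union scan. It assumes a sorted list of tuples in POGS order as the index and a plain set as the distinct-predicate index. The `incoming` function and the sample data are hypothetical and are not Neptune's implementation.

```python
from bisect import bisect_left


def prefix_scan(index, prefix):
    """Return every key in the sorted index that starts with `prefix`."""
    prefix = tuple(prefix)
    start = bisect_left(index, prefix)
    matches = []
    for i in range(start, len(index)):
        if index[i][:len(prefix)] != prefix:
            break
        matches.append(index[i])
    return matches


def incoming(pogs_index, predicates, target):
    """Find statements whose object is `target` without an OSGP index.

    One POGS prefix scan is issued per distinct predicate, so the number
    of scans grows with the number of predicates in the graph.
    """
    results = []
    scans = 0
    for p in sorted(predicates):
        scans += 1
        results.extend(prefix_scan(pogs_index, (p, target)))
    return results, scans


# Keys are in POGS order: (predicate, object, graph, subject).
pogs = sorted([
    ("knows", "v2", "e1", "v1"),
    ("knows", "v3", "e2", "v1"),
    ("likes", "v2", "e3", "v3"),
    ("name", "Bob", "<default>", "v2"),
])
predicates = {key[0] for key in pogs}

edges_in, scans = incoming(pogs, predicates, "v2")
```

In this toy graph, three scans are needed to find two matching statements. With an index keyed on O first, a single prefix scan on the object would suffice regardless of how many predicates exist.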

Enabling OSGP Index Creation Using Lab Mode

If your data model creates a large number of distinct predicates, you may experience reduced performance and higher operational costs. You can often improve both dramatically by using Lab Mode to enable the OSGP index in addition to the three indexes that Neptune maintains by default.

Note

This feature is available starting in Neptune engine release 1.0.1.0.200463.0.

Enabling the OSGP index can have a few downsides:

  • The insert rate may slow by up to 23%.

  • Storage increases by up to 20%.

  • Read queries that touch all indexes equally (which is quite rare) may have increased latencies.

In general, however, it is worth enabling the OSGP index for DB clusters with a large number of distinct predicates. Object-based searches become highly efficient (for example, finding all incoming edges to a vertex, or all subjects connected to a given object), and as a result, dropping vertices becomes much more efficient too.
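This section does not name the setting that controls the feature. The sketch below assumes that Lab Mode is configured through the `neptune_lab_mode` DB cluster parameter, that the OSGP feature is switched on with the value `ObjectIndex=enabled`, and that the parameter takes effect after a reboot. Verify all three against the Lab Mode documentation before relying on them. The helper names and the parameter group name passed in are hypothetical.

```python
def lab_mode_parameter(features):
    """Build a cluster parameter entry such as 'ObjectIndex=enabled'.

    Assumption: Lab Mode features are given as comma-separated
    name=state pairs in the neptune_lab_mode parameter.
    """
    value = ", ".join(f"{name}={state}" for name, state in features.items())
    return {
        "ParameterName": "neptune_lab_mode",
        "ParameterValue": value,
        "ApplyMethod": "pending-reboot",  # assumed to be a static parameter
    }


def enable_osgp(parameter_group_name):
    """Apply the setting to a custom DB cluster parameter group."""
    import boto3  # imported here so the helper above runs without boto3

    client = boto3.client("neptune")
    return client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=parameter_group_name,
        Parameters=[lab_mode_parameter({"ObjectIndex": "enabled"})],
    )
```

A call such as `enable_osgp("my-cluster-params")` would need AWS credentials and an existing custom parameter group attached to an empty cluster.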

Important

You can enable the OSGP index only in an empty DB cluster, before you load any data into it.

Gremlin statements in the Neptune data model

Gremlin property-graph data is expressed in the SPOG model using three classes of statements, namely:

  • Vertex label statements

  • Edge statements

  • Property statements

For an explanation of how these are used in Gremlin queries, see Understanding how Gremlin queries work in Neptune.