openCypher query best practices - Neptune Analytics

Use the SET clause to remove multiple properties at once

When using the openCypher language, REMOVE removes properties from an entity. In Neptune Analytics, each property removed this way requires a separate operation, which adds query latency. Instead, use SET with a map that sets the property values to null, which in Neptune Analytics is equivalent to removing them. This improves performance when multiple properties must be removed from a single entity.

Use:

WITH {prop1: null, prop2: null, prop3: null} as propertiesToRemove MATCH (n) SET n += propertiesToRemove

Instead of:

MATCH (n) REMOVE n.prop1, n.prop2, n.prop3
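The null map can also be built on the client and passed as a parameter. A minimal sketch in Python (the parameter name propertiesToRemove is illustrative; in openCypher parameters, Python's None serializes as null):

```python
# Build a {property name: None} map on the client side.
props_to_remove = ["prop1", "prop2", "prop3"]
parameters = {"propertiesToRemove": {p: None for p in props_to_remove}}

# Then run: MATCH (n) SET n += $propertiesToRemove
print(parameters["propertiesToRemove"])
# → {'prop1': None, 'prop2': None, 'prop3': None}
```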

Use parameterized queries

It is recommended to always use parameterized queries with openCypher. The query engine can cache the plan for a parameterized query, so repeated invocations of the same query structure with different parameter values can reuse the cached plan. A query plan is cached and reused only when the query completes within 100ms and the parameter types are NUMBER, BOOLEAN, or STRING.

Use:

MATCH (n:foo) WHERE id(n) = $id RETURN n

With parameters:

parameters={"id": "first"} parameters={"id": "second"} parameters={"id": "third"}

Instead of:

MATCH (n:foo) WHERE id(n) = "first" RETURN n MATCH (n:foo) WHERE id(n) = "second" RETURN n MATCH (n:foo) WHERE id(n) = "third" RETURN n
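Client-side, this means keeping a single query string and varying only the parameter map. A sketch (the execute call depends on your client library and is omitted):

```python
# One fixed query string: the engine can cache and reuse its plan.
query = "MATCH (n:foo) WHERE id(n) = $id RETURN n"

# Only the parameter values change between invocations.
runs = [{"id": value} for value in ("first", "second", "third")]
for params in runs:
    # execute(query, parameters=params) with your Neptune Analytics client
    pass
```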

You can determine whether a query is using a cached plan by observing the plan cache hits: value in the output of the openCypher explain endpoint.
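As a sketch, assuming the explain output is plain text containing a line such as plan cache hits: 1 (the exact output format may vary by engine version), you could check for plan reuse like this:

```python
def plan_cache_hits(explain_output: str) -> int:
    """Return the 'plan cache hits' value from openCypher explain output.

    Assumes the output contains a line like 'plan cache hits: 1';
    returns 0 if no such line is found.
    """
    for line in explain_output.splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "plan cache hits":
            return int(value.strip())
    return 0

# Hypothetical explain output fragment:
sample = "plan cache hits: 1\nplan cache misses: 0"
print(plan_cache_hits(sample))  # 1
```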

Use flattened maps instead of nested maps in UNWIND clause

Deeply nested structures can restrict the query engine's ability to generate an optimal query plan. To partially alleviate this issue, the engine generates optimal plans for the following patterns:

  • Scenario 1: UNWIND with a list of cypher literals, which includes NUMBER, STRING and BOOLEAN.

  • Scenario 2: UNWIND with a list of flattened maps, which includes only cypher literals (NUMBER, STRING, BOOLEAN) as values.

When writing a query that contains an UNWIND clause, use these patterns to improve performance.

Scenario 1 example:

UNWIND $ids as x MATCH (t:ticket {`~id`: x}) RETURN t

With parameters:

parameters={ "ids": [1, 2, 3] }

An example for Scenario 2 is to generate a list of nodes to CREATE or MERGE. Instead of issuing multiple statements, use the following pattern to define the properties as a set of flattened maps:

UNWIND $props as p CREATE(t:ticket {title: p.title, severity:p.severity})

With parameters:

parameters={ "props": [ {"title": "food poisoning", "severity": "2"}, {"title": "Simone is in office", "severity": "3"} ] }

Instead of nested node objects like:

UNWIND $nodes as n CREATE(t:ticket n.properties)

With parameters:

parameters={ "nodes": [ {"id": "ticket1", "properties": {"title": "food poisoning", "severity": "2"}}, {"id": "ticket2", "properties": {"title": "Simone is in office", "severity": "3"}} ] }
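If your data naturally arrives in the nested shape, it can be flattened on the client before sending. A sketch (field names follow the example above):

```python
def flatten_nodes(nodes):
    """Turn nested {'id': ..., 'properties': {...}} records into flat maps
    whose values are only Cypher literals (Scenario 2)."""
    return [{"id": n["id"], **n["properties"]} for n in nodes]

nodes = [
    {"id": "ticket1", "properties": {"title": "food poisoning", "severity": "2"}},
    {"id": "ticket2", "properties": {"title": "Simone is in office", "severity": "3"}},
]
parameters = {"props": flatten_nodes(nodes)}
# parameters["props"][0] == {"id": "ticket1", "title": "food poisoning", "severity": "2"}
```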

Place more restrictive nodes on the left side in Variable-Length Path (VLP) expressions

In Variable-Length Path (VLP) queries, the query engine optimizes the evaluation by choosing to start the traversal on the left or right side of the expression. The decision is based on the cardinality of the patterns on the left and right side. Cardinality is the number of nodes matching the specified pattern.

  • If the right pattern has a cardinality of one, then the right side will be the starting point.

  • If both the left and the right side have a cardinality of one, the expansion is checked on both sides and the traversal starts on the side with the smaller expansion. Expansion is the number of outgoing or incoming edges for the node on the left or right side of the VLP expression. This part of the optimization is only used if the VLP relationship is unidirectional and the relationship type is provided.

  • Otherwise, the left side will be the starting point.

For a chain of VLP expressions, this optimization can only be applied to the first expression; the other VLPs are evaluated starting with the left side. As an example, let the cardinality of (a) and (b) be one, and the cardinality of (c) be greater than one.

  • (a)-[*1..]->(c): Evaluation starts with (a).

  • (c)-[*1..]->(a): Evaluation starts with (a).

  • (a)-[*1..]-(c): Evaluation starts with (a).

  • (c)-[*1..]-(a): Evaluation starts with (a).

Now let the incoming edges of (a) be two, and the outgoing edges of (a) be three, the incoming edges of (b) be four, and the outgoing edges of (b) be five.

  • (a)-[*1..]->(b): Evaluation starts with (a) as the outgoing edges of (a) are less than the incoming edges of (b).

  • (a)<-[*1..]-(b): Evaluation starts with (a) as the incoming edges of (a) are less than the outgoing edges of (b).

As a general rule, place the more restrictive pattern on the left side of a VLP expression.

Avoid redundant node label checks by using granular relationship names

When optimizing for performance, using relationship labels that are exclusive to specific node patterns allows label filtering on nodes to be removed. Consider a graph model where the likes relationship is only used between two person nodes. We could write the following query to find this pattern:

MATCH (n:person)-[:likes]->(m:person) RETURN n, m

The person label check on n and m is redundant, because we defined the relationship to appear only when both endpoints are of type person. To optimize performance, we can write the query as follows:

MATCH (n)-[:likes]->(m) RETURN n, m

This pattern can also apply when properties are exclusive to a single node label. Assume that only person nodes have the property email, therefore verifying the node label matches person is redundant. Writing this query as:

MATCH (n:person) WHERE n.email = 'xxx@gmail.com' RETURN n

is less efficient than writing it as:

MATCH (n) WHERE n.email = 'xxx@gmail.com' RETURN n

You should only adopt this pattern when performance is important and your modeling process has checks to ensure these edge labels and properties are not reused in patterns involving other node labels. If you later introduce an email property on another node label, such as company, the results of these two versions of the query will differ.

Specify edge labels where possible

It is recommended to provide an edge label where possible when specifying an edge in a pattern. Consider the following example query, which is used to link all of the people living in a city with all of the people who visited that city.

MATCH (person)-->(city {country: "US"})-->(anotherPerson) RETURN person, anotherPerson

If your graph model links people to nodes other than cities using multiple edge labels, then without an edge label in the pattern, Neptune Analytics must evaluate additional paths that are later discarded. In the query above, because no edge label was given, the engine does more work first and then filters out values to obtain the correct result. A better version of the query might be:

MATCH (person)-[:livesIn]->(city {country: "US"})-[:visitedBy]->(anotherPerson) RETURN person, anotherPerson

This not only helps in evaluation, but enables the query planner to create better plans. You could even combine this best practice with redundant node label checks to remove the city label check and write the query as:

MATCH (person)-[:livesIn]->({country: "US"})-[:visitedBy]->(anotherPerson) RETURN person, anotherPerson

Avoid using the WITH clause when possible

The WITH clause in openCypher acts as a boundary: everything before it executes first, and the resulting values are passed to the remainder of the query. The WITH clause is needed when you require interim aggregation or want to limit the number of results, but otherwise you should try to avoid it. The general guidance is to remove simple WITH clauses (those without aggregation, ORDER BY, or LIMIT) so the query planner can work on the entire query and create a globally optimal plan. As an example, assume you wrote a query to return all people living in India:

MATCH (person)-[:lives_in]->(city) WITH person, city MATCH (city)-[:part_of]->(country {name: 'India'}) RETURN collect(person) AS result

In the above version, the WITH clause prevents the more restrictive pattern (city)-[:part_of]->(country {name: 'India'}) from being placed before (person)-[:lives_in]->(city), which makes the plan sub-optimal. To optimize this query, remove the WITH clause and let the planner compute the best plan:

MATCH (person)-[:lives_in]->(city) MATCH (city)-[:part_of]->(country {name: 'India'}) RETURN collect(person) AS result

Place restrictive filters as early in the query as possible

In all scenarios, early placement of filters in the query helps in reducing the intermediate solutions a query plan must consider. This means less memory and fewer compute resources are needed to execute the query.

The following example helps you understand these impacts. Suppose you write a query to return all of the people who live in India. One version of the query could be:

MATCH (n)-[:lives_in]->(city)-[:part_of]->(country) WITH country, collect(n.firstName + " " + n.lastName) AS result WHERE country.name = 'India' RETURN result

The above version of the query is not the most optimal way to achieve this use case. The filter country.name = 'India' appears late in the query: the engine first collects all persons and where they live, groups them by country, and only then filters for the group where country.name = 'India'. The optimal way is to filter for only the people living in India and then perform the collect aggregation:

MATCH (n)-[:lives_in]->(city)-[:part_of]->(country) WHERE country.name = 'India' RETURN collect(n.firstName + " " + n.lastName) AS result

A general rule is to place a filter as soon as possible after the variable is introduced.

Explicitly check whether properties exist

Based on openCypher semantics, accessing a property is equivalent to an optional join: all rows must be retained even if the property does not exist. If you know from your graph schema that a particular property always exists for that entity, explicitly checking that property for existence allows the query engine to create optimal plans and improves performance.

Consider a graph model where nodes of type person always have a property name. Instead of doing this:

MATCH (n:person) RETURN n.name

Explicitly verify the property existence in the query with an IS NOT NULL check:

MATCH (n:person) WHERE n.name IS NOT NULL RETURN n.name

Do not use named paths (unless required)

A named path in a query always comes at an additional cost, which can mean higher latency and memory usage. Consider the following query:

MATCH p = (n)-[:commentedOn]->(m) WITH p, m, n, n.score + m.score as total WHERE total > 100 MATCH (m)-[:commentedOn]->(o) WITH DISTINCT p, m, n, o RETURN p, m.name, n.name, o.name

In the above query, assuming we only want the properties of the nodes, the named path "p" is unnecessary. Carrying the path as a variable makes the DISTINCT aggregation expensive in both time and memory. A more optimized version of the query could be:

MATCH (n)-[:commentedOn]->(m) WITH m, n, n.score + m.score as total WHERE total > 100 MATCH (m)-[:commentedOn]->(o) WITH DISTINCT m, n, o RETURN m.name, n.name, o.name

Avoid COLLECT(DISTINCT())

COLLECT(DISTINCT()) is used to build a list of distinct values. COLLECT is an aggregation function, and grouping is done based on the other keys projected in the same statement. When DISTINCT is used inside COLLECT, the input is split into multiple chunks, each denoting one group for reduction, and performance degrades as the number of groups increases. In Neptune Analytics, it is much more efficient to apply DISTINCT before collecting the list, which allows grouping to be done directly on the grouping keys for the whole chunk.

Consider the following query:

MATCH (n:Person)-[:commented_on]->(p:Post) WITH n, collect(distinct(p.post_id)) as post_list RETURN n, post_list

A more optimal way of writing this query is:

MATCH (n:Person)-[:commented_on]->(p:Post) WITH DISTINCT n, p.post_id as postId WITH n, collect(postId) as post_list RETURN n, post_list

Prefer the properties function over individual property lookup when retrieving all property values

The properties() function is used to return a map containing all properties for an entity, and is much more efficient than returning properties individually.

Assuming your Person nodes contain five properties (firstName, lastName, age, dept, and company), the following query is preferred:

MATCH (n:Person) WHERE n.dept = 'AWS' RETURN properties(n) as personDetails

Rather than using:

MATCH (n:Person) WHERE n.dept = 'AWS' RETURN n.firstName, n.lastName, n.age, n.dept, n.company

or:

MATCH (n:Person) WHERE n.dept = 'AWS' RETURN {firstName: n.firstName, lastName: n.lastName, age: n.age, department: n.dept, company: n.company} as personDetails

Perform static computations outside of the query

It is recommended to resolve static computations (simple mathematical or string operations) on the client side. Consider this example, where you want to find all people who are no more than one year older than the author:

MATCH (m:Message)-[:HAS_CREATOR]->(p:person) WHERE p.age <= ($age + 1) RETURN m

Here, $age is injected into the query via parameters and then added to a fixed value, and the result is compared with p.age. A better approach is to do the addition on the client side and pass the calculated value as a parameter, $ageplusone. This helps the query engine create optimized plans and avoids the static computation for each incoming row. Following this guideline, a more efficient version of the query would be:

MATCH (m:Message)-[:HAS_CREATOR]->(p:person) WHERE p.age <= $ageplusone RETURN m
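The client-side computation is a one-liner. A sketch:

```python
author_age = 30  # the value you would otherwise have sent as $age

# Compute once on the client instead of once per incoming row in the engine.
parameters = {"ageplusone": author_age + 1}

query = (
    "MATCH (m:Message)-[:HAS_CREATOR]->(p:person) "
    "WHERE p.age <= $ageplusone RETURN m"
)
```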

Batch inputs using UNWIND instead of individual statements

Whenever the same query needs to be executed for different inputs, instead of executing one query per input, it would be much more performant to run a query for a batch of inputs.

If you want to merge on a set of nodes, one option is to run a merge query per input:

MERGE (n:Person {`~id`: $id}) SET n.name = $name, n.age = $age, n.employer = $employer

With parameters:

params = {id: '1', name: 'john', age: 25, employer: 'Amazon'}

The above query needs to be executed for every input. While this approach works, it may require many queries to be executed for a large set of input. In this scenario, batching may help reduce the number of queries executed on the server, as well as improve the overall throughput.

Use the following pattern:

UNWIND $persons as person MERGE (n:Person {`~id`: person.id}) SET n += person

With parameters:

params = {persons: [{id: '1', name: 'john', age: 25, employer: 'Amazon'}, {id: '2', name: 'jack', age: 28, employer: 'Amazon'}, {id: '3', name: 'alice', age: 24, employer: 'Amazon'}...]}

Experimentation with different batch sizes is recommended to determine what works best for your workload.
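A simple client-side batching helper can be sketched as follows (the batch size of 100 is illustrative; tune it for your workload):

```python
def batches(items, batch_size):
    """Yield successive slices of items, one per UNWIND query."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

persons = [{"id": str(i), "name": f"user{i}", "employer": "Amazon"} for i in range(250)]
for chunk in batches(persons, 100):
    params = {"persons": chunk}
    # execute: UNWIND $persons as person
    #          MERGE (n:Person {`~id`: person.id}) SET n += person
```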

Prefer using custom IDs for nodes

Neptune Analytics allows users to explicitly assign IDs to nodes. An ID must be globally unique in the dataset and deterministic to be useful. A deterministic ID can be used as a lookup or filtering mechanism just like a property; however, using an ID is much more optimized from a query execution perspective than using properties. There are several benefits to using custom IDs:

  • Properties can be null for an existing entity, but the ID must exist. This allows the query engine to use an optimized join during execution.

  • When concurrent mutation queries are executed, the chance of concurrent modification exceptions (CMEs) is reduced significantly when IDs are used to access nodes, because fewer locks are taken on IDs than on properties, due to their enforced uniqueness.

  • Using IDs avoids the chance of creating duplicate data as Neptune enforces uniqueness on IDs, unlike properties.

The following query example uses a custom ID:

Note

The property ~id is used to specify the ID, whereas id is stored like any other property.

CREATE (n:Person {`~id`: '1', name: 'alice'})

Without using a custom ID:

CREATE (n:Person {id: '1', name: 'alice'})

If using the latter mechanism, there is no uniqueness enforcement and you could later execute the query:

CREATE (n:Person {id: '1', name: 'john'})

This creates a second node with id=1 named john. You would then have two nodes with id=1, each with a different name (alice and john).

Avoid doing ~id computations in the query

When using custom IDs in the queries, always perform static computations outside the queries and provide these values in the parameters. When static values are provided, the engine is better able to optimize lookups and avoid scanning and filtering these values.

If you want to create edges between nodes that already exist in the database, one option could be:

UNWIND $sections as section MATCH (s:Section {`~id`: 'Sec-' + section.id}) MERGE (s)-[:IS_PART_OF]->(g:Group {`~id`: 'g1'})

With parameters:

parameters={sections: [{id: '1'}, {id: '2'}]}

In the query above, the id of the section is computed inside the query. Because the computation is dynamic, the engine cannot statically inline the ids, so it scans all Section nodes and then post-filters for the required ones. This can be costly if there are many Section nodes in the database.

A better way to achieve this is to prepend Sec- to the ids on the client side before passing them to the database:

UNWIND $sections as section MATCH (s:Section {`~id`: section.id}) MERGE (s)-[:IS_PART_OF]->(g:Group {`~id`: 'g1'})

With parameters:

parameters={sections: [{id: 'Sec-1'}, {id: 'Sec-2'}]}
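Prepending the prefix is cheap to do on the client. A sketch:

```python
section_ids = ["1", "2"]  # raw ids as known to the client

# Prepend the 'Sec-' prefix before sending, so the query sees final ids.
parameters = {"sections": [{"id": f"Sec-{sid}"} for sid in section_ids]}
# parameters == {"sections": [{"id": "Sec-1"}, {"id": "Sec-2"}]}
```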