PageRank centrality algorithm - Neptune Analytics

PageRank centrality algorithm

PageRank is an algorithm originally developed by Larry Page and Sergey Brin, co-founders of Google. It was originally developed to rank web pages in search engine results. The PageRank score for a given node is calculated based on the number and quality of the edges pointing to that node, as well as the importance of the nodes that are connected to it. The PageRank algorithm assigns a higher score to nodes that are linked to other high-scoring nodes, and a lower score to nodes that are linked to low-scoring nodes.

The output of PageRank can be visualized as a ranking metric for the importance of a node within a given graph, with the most important nodes having the highest score, and the least important node having the lowest score. PageRank is used in search engines to rank web pages based on their importance and influence, in citation networks to identify highly cited scientific papers, and in recommendation systems to suggest popular and relevant content to users.

The space complexity of the .pageRank algorithm is:

O(|V|) + [storage of data O(|V|+|E|)]

In more detail, this is:

|V| * (3 * sizeof(local id type) + 2 * sizeof(double) + sizeof(global id type)) + O(|1DData Parts|)

.pageRank  syntax

CALL neptune.algo.pageRank( [list of source nodes (required)], { numOfIterations: a small positive integer like 20 (optional), dampingFactor: a positive float less than 1.0, like 0.85 (optional) edgeLabels: [a list of edge labels for filtering (optional)], vertexLabel: a node label for filtering (optional), concurrency: number of threads to use (optional) } ) YIELD node, rank RETURN node, rank

.pageRank  inputs

  • a source vertex list   (required)   –   type: Node[] or NodeId[];   default: none.

    The node or nodes to use as the starting locations for the algorithm. Each node in the list triggers an execution of the algorithm. If an empty list is provided, the query result is also empty.

    If the algorithm is called following a MATCH clause (query algo integration), the source query list is the result returned by the MATCH clause.

  • a configuration object that contains:
    • edgeLabels   (optional)   –   type: a list of edge label strings;   example: ["route", ...];   default: no edge filtering.

      To filter on one more edge labels, provide a list of the ones to filter on. If no edgeLabels field is provided then all edge labels are processed during traversal.

    • vertexLabel (optional)   –   type: string;   default: none.

      A vertex label for vertex filtering. If a vertex label is provided, vertices matching the label are the only vertices that are included, including vertices in the input list.

    • traversalDirection (optional)   –   type: string;   default: "outbound".

      The direction of edge to follow. Must be one of: "oubound" or "inbound".

    • numOfIterations (optional)   –   type: a positive integer greater than zero;   default: 20.

      The number of iterations to perform to reach convergence. A number between 10 and 20 is recommended.

    • dampingFactor (optional)   –   type: a positive floating-point number less than 1.0;   default: 0.85.

      A positive floating-point damping factor between 0.0 and 1.0 that expresses the probability, at any step, that the node will continue.

    • concurrency   (optional)   –   type: 0 or 1;   default: 0.

      Controls the number of concurrent threads used to run the algorithm.

      If set to 0, uses all available threads to complete execution of the individual algorithm invocation. If set to 1, uses a single thread. This can be useful when requiring the invocation of many algorithms concurrently.

Outputs for the .pageRank algorithm

  • node   –   A key column of the input nodes.

  • rank   –   A key column of the corresponding page-rank scores for those nodes.

If the input nodes list is empty, the output is empty.

Query examples for .pageRank

This is a standalone example, where the source vertex list is explicitly specified in the query.

CALL neptune.algo.pageRank( ["101"], { numOfIterations: 1, dampingFactor: 0.85, edgeLabels: ["route"] } )

This is a query integration examples, where .pageRank follows a MATCH clause and uses frontier injection to take the output of the MATCH clause as its list of source nodes:

MATCH (n) CALL neptune.algo.pageRank( n, { dampingFactor: 0.85, numOfIterations: 1, edgeLabels: ["route"] } ) YIELD rank RETURN n, rank

This query is an example of constraining the results of .pageRank based on the PageRank values, and returning them in ascending order:

MATCH (n) CALL neptune.algo.pageRank( n, { numOfIterations: 10, dampingFactor: 0.85, vertexLabel: "airport", edgeLabels: ["route"] } ) YIELD rank WHERE rank > 0.004 RETURN n, rank ORDER BY rank

Sample   .pageRank   output

Here is an example of the output returned by .pageRank when run against the sample air-routes dataset [nodes], and sample air-routes dataset [edges], when using the following query:

aws neptune-graph execute-query \ --graph-identifier ${graphIdentifier} \ --query-string "CALL neptune.algo.pageRank(n) YIELD node, rank RETURN node, rank LIMIT" \ --language open_cypher \ /tmp/out.txt cat /tmp/out.txt { "results": [ { "node": { "~id": "2709", "~entityType": "node", "~labels": ["airport"], "~properties": { "lat": 65.4809036254883, "elev": 49, "longest": 8711, "city": "Nadym", "type": "airport", "region": "RU-YAN", "desc": "Nadym Airport", "code": "NYM", "lon": 72.6988983154297, "country": "RU", "icao": "USMM", "runways": 1 } }, "rank": 0.00016044313088059425 }, { "node": { "~id": "3747", "~entityType": "node", "~labels": ["continent"], "~properties": { "code": "AN", "type": "continent", "desc": "Antarctica" } }, "rank": 0.0000404242 } ] }