Neo4J Cypher query lookup for long but (almost) unique paths

We have a Neo4J database representing an evolutionary process with about 100K nodes and 200K relationships. Nodes are people in generations, and edges are parent-child relationships. The main goal is to be able to take one or the nodes that are of interest in the final generation and explore their evolutionary history (roughly, "how did we get here?").

The “obvious” first query to search for all of its ancestors does not work, because there are too many possible ancestors and paths in this space:

match (a)-[:PARENT_OF*]->(c {is_interesting: true}) return distinct a; 

So, we pre-processed the data so that some edges are marked as “special”, so that almost every node has at most one “special” parent edge, although sometimes both parent edges are marked as “special”, Then I was hoping this query (efficiently) create a (almost) unique path along the "special" edges:

 match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true}) return distinct a; 

This, however, still does not work slowly.

This is disappointing because “as a person” the logic is simple: start with a small number of “interesting” nodes (often 1, no more than a few dozen) and chase almost always unique “special” edges. Assuming a very small number of nodes with two “special” parents, this should be something like O (N), where N is the number of generations ago.

In Neo4J, however, going back 25 steps from a unique “interesting” node, where each step is unique, however, takes 30 seconds, and after there is one bifurcation (where both parents are “special”), it gets much faster than step function. 28 steps (which leads us to the first bifurcation) takes 2 minutes, 30 (where there is still only one bifurcation) takes 6 minutes, and I did not even think to try 100 steps to start the simulation.

Some similar work last year seemed to work better, but we used a lot of border labels (for example, (a)-[:SPECIAL_PARENT_OF*]->(c) , as well as (a)-[:PARENT_OF*]->(c) ) instead of using edge data fields. Is querying the values ​​of relationship fields just not a good idea? We have quite a few different values ​​related to the relationships in this model (some logical, some numerical), and we hoped / suggested that we could use them to effectively limit the search, but maybe it wasn’t.

Suggestions for customizing our model or queries are welcome.

Update I would say this is all with Neo4J 2.1.7. I'm going to give 2.2 a try according to Brian Underwood's suggestion and report back.

+5
source share
2 answers

After learning things with the profiling tools in Neo4J 2.2 (thanks to Brian Underwood for the tip), it’s pretty clear that (at the moment) Neo4J does not pre-filter on the edge properties, which leads to unpleasant combinatorial explosions with long tracks.

For example, the original request:

 match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true}) return distinct a; 

finds all paths from a to c , and then removes those that have edges that are not special . Since there are many millions of paths from a to c , this is completely impossible.

If I instead add the IS_SPECIAL edge wherever there is an edge PARENT_OF with {special: true} , then the queries become very fast, which allows me to drop about 100 generations per second.

This query creates all new edges:

 match (a)-[r:PARENT_OF {special: true}]->(b) create (a)-[:IS_SPECIAL]->(b); 

and takes less than a second to add a 91K relationship to our schedule.

Then

 match (c {is_interesting: true}) with c match (a)-[:IS_SPECIAL*]->(c) return distinct a; 

it takes less than a second to find 112 nodes along a “special” path back from the unique node c target. Matching c first and restricting the set of nodes using with c also important, since Neo4J doesn't seem to pre-filter node properties, and if there are several “interesting” target nodes, it’s much slower.

+2
source

I was lucky to specify a limit on the length of the path. Therefore, if you know that this is not more than 30 jumps, you can try:

 MATCH (c {is_interesting: true}) WITH c MATCH (a)-[:PARENT_OF*1..30]->c RETURN DISTINCT a 

Also, is there an index in the is_interesting property? This can also cause slowness.

What version of Neo4j are you using? If you are using or upgrading to 2.2.0, you can use the new query profiling tools:

http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html

Also, if you use them in the web console, you get a beautiful graphical tree (technical term) showing each step.

+2
source

All Articles