We have a Neo4J database representing an evolutionary process with about 100K nodes and 200K relationships. Nodes are people in generations, and edges are parent-child relationships. The main goal is to be able to take one or the nodes that are of interest in the final generation and explore their evolutionary history (roughly, "how did we get here?").
The “obvious” first query to search for all of its ancestors does not work, because there are too many possible ancestors and paths in this space:
match (a)-[:PARENT_OF*]->(c {is_interesting: true}) return distinct a;
So, we pre-processed the data so that some edges are marked as “special”, so that almost every node has at most one “special” parent edge, although sometimes both parent edges are marked as “special”, Then I was hoping this query (efficiently) create a (almost) unique path along the "special" edges:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true}) return distinct a;
This, however, still does not work slowly.
This is disappointing because “as a person” the logic is simple: start with a small number of “interesting” nodes (often 1, no more than a few dozen) and chase almost always unique “special” edges. Assuming a very small number of nodes with two “special” parents, this should be something like O (N), where N is the number of generations ago.
In Neo4J, however, going back 25 steps from a unique “interesting” node, where each step is unique, however, takes 30 seconds, and after there is one bifurcation (where both parents are “special”), it gets much faster than step function. 28 steps (which leads us to the first bifurcation) takes 2 minutes, 30 (where there is still only one bifurcation) takes 6 minutes, and I did not even think to try 100 steps to start the simulation.
Some similar work last year seemed to work better, but we used a lot of border labels (for example, (a)-[:SPECIAL_PARENT_OF*]->(c) , as well as (a)-[:PARENT_OF*]->(c) ) instead of using edge data fields. Is querying the values of relationship fields just not a good idea? We have quite a few different values related to the relationships in this model (some logical, some numerical), and we hoped / suggested that we could use them to effectively limit the search, but maybe it wasn’t.
Suggestions for customizing our model or queries are welcome.
Update I would say this is all with Neo4J 2.1.7. I'm going to give 2.2 a try according to Brian Underwood's suggestion and report back.