High performance hierarchical text search

Now I am at the final stage of modernizing the design of the hierarchy in a large transactional system, and for a while I looked at this 150-line query (which I will spare you all the boredom of reading) and thinking that there should be a better way.

A summary of the question is as follows:

How to implement a hierarchical search that matches several search conditions at different levels of the hierarchy, optimized for quick search?


I found a somewhat related question , but this is really only about 20% of the answer that I really need. Here is the full script / specification:

  • The ultimate goal is to search for one or more arbitrary elements in arbitrary positions in the hierarchy.
  • The complete hierarchy is about 80,000 nodes, which in a few years will grow to 1 million.
  • The full text of the entire path down the hierarchy is unique and descriptive; however, the text of the individual node may not be. This is a business reality, not a decision that was made frivolously.
  • Example: a node may have a name such as “Door”, which is meaningless in itself, but the full context of “Aaron”> “Home”> “Living room”> “Liquor cabinet”> “Door” has a clear meaning, it describes a specific door in a specific place. (Note that this is just an example , the actual design is much less trivial)
  • To find this particular door, the user can enter the “Aaron's lemon door”, which is likely to have only one result. The request is translated as a sequence: an element containing the text "door", below the element containing the text "liquor", under another element containing the text "aaron".
  • Or the user can simply type “home liquor” to list all the cabinets for alcohol in people's homes (that would be nice). I mention this example explicitly to indicate that the search should not match a specific root or leaf level. This user knows exactly what door he is looking for, but cannot remember who owns it, and remembers whether a name has appeared in front of him.
  • All terms must be matched in the indicated sequence, but, as the above examples show, the levels in the hierarchy can be “skipped”. The term "aaron booze cabinet" will not correspond to this node.
  • The platform is SQL Server 2008, but I believe this is a platform-specific issue, and would prefer not to limit the responses to that platform.
  • The hierarchy itself is based on a hierarchyid (materialized path) indexed both in width and in depth. Each node / record hierarchy has a Name column that should be requested. Node-based hierarchy queries are extremely fast, so don't worry about them.
  • There is no strict hierarchy — the root may not contain nodes at all, or it may contain 30 subtrees branching up to 10,000 leaf nodes.
  • The maximum investment is arbitrary, but in practice it tends to be no more than 4-8 levels.
  • The hierarchy can and does change, albeit infrequently. Any node can be moved to any other node with obvious exceptions (a parent cannot be moved to its own child, etc.).
  • If this was not already implied: I have control over the design and you can add indexes, fields, tables, which may be required to get the best results.

My “dream” is to provide instant user feedback, as in a progressive search / filter, but I understand that this can be impossible or extremely difficult. I would be pleased with any significant improvement over the current method, which usually takes from 0.5 to 1 s depending on the number of results.

For completeness, an existing query (stored procedure) begins with the collection of all leaf nodes containing the final term, then joins up and excludes any paths that do not meet the previous conditions. If it seems to someone else, be sure that it is much more effective than starting from the roots and bloating. This was the "old" way and could easily take a few seconds to search.

So my question is again: Is there a better (more efficient) way to do this search?

I'm not necessarily looking for code, just fit. I looked at several possibilities, but all of them have some problems:

  • Create a separator column "text path" and index it using full-text search. The problem is that a search in this column would also return all child nodes; "aaron house" also matches "aaron house kitchen" and "aaron house cellar".
  • Created a NamePath column, which is actually a nested string sequence using a CLR type similar to hierarchyid itself. The problem is that I have no idea how Microsoft can "translate" queries of this type to indexing operations, and I'm not even sure that this can be done on UDT. If the net result is just a full index scan, I won nothing with this approach.

It is not the end of the world if I cannot do better than what I already have; Searching "pretty fast," no one complained about it. But I bet someone solved this problem earlier and has some ideas. Please share them!

+7
performance database search database-design hierarchy
source share
2 answers

take a look at Apache Lucene. You can implement very flexible but effective search queries using Lucene. This may be helpful.

Also take a look at the search patterns. What you describe can fit into a faceted search pattern.

It is very simple to implement the trivial Aaron House Live Door algorithm, but I’m not sure that conventional SVM / classification / entropy algorithms will scale to a large data set. You can also take a look at the Motwani and Raghavan “approximate search” concepts.

Please write what you find, if possible :-)

+3
source share

Hi Aarron, I have the following idea:
From your description, I have the following image:

  Aaron / \ / \ / \ House Cars | / \ Living Room Ferrari Mercedes | Liquor Cabinet / | \ Table Door Window 

Here's what your search tree looks like. Now I would sort the nodes at each level:

  Aaron / \ / \ / \ Cars House / \ / / \ / / \ / / \ / / X / / \ / / \ / / \ / / \ | / \ | / \ Ferrari Living Room Mercedes | Liquor Cabinet / | \ Door Table Window 

Now to process the request should be easy and fast:

  • Start with the last word in the query and the lowest level node (sheets)
  • Since all nodes are sorted at the same level, you can use binary search and therefore find a match in O (log N), where N is the node counter.
  • Do it for each level. There are O (log N) levels in the tree.
  • Once you find a match, process all the parent nodes to see if the path matches your query. The path has a length of O (log N). If it matches, save it in the results that should be shown to the user.

Let M be the number of common matches (the number of nodes corresponding to the last word in the query). Then our processing time: O ((log N) ^ 2 + M * (log N)):
Binary search takes O (log N) times per level and O (log N) levels, so we should spend at least O ((log N) ^ 2) time. Now for each match, we need to check whether the full path from our node match to the root matches the full query. The path has a length of O (log N). Thus, given that M is the same as a whole, we spend different time M * O (log N), so the result of the execution is O ((log N) ^ 2 + M * (log N)).

When you have few matches, the processing time approaches O ((log N) ^ 2), which is pretty good. On the contrary, if the worst case occurs (each single path corresponds to a query (M = N)), the processing time approaches O (N log N), which is not too good, but also not too likely.

Implementation: You said that you only need an idea. Further, my knowledge of databases is very limited, so I won’t write much here, just sketch some ideas.
The node table might look like this:

  • ID: int
  • Text: string
  • Parent: int -> node ID
  • Level: int // I do not expect this to change too often, so you can save it and update as the database changes.

This table should be sorted by the "Text" column. Using the algorithm described above, the sql query inside the loop may look like this:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
Hope I can understand.

You can accelerate even more by not only sorting the table by the "Text" column, but also the combined "Text" and "Level" columns, that is, all the entries in the "Sort Level" = "20" section, all records within the Level = 19 sorted etc. (but general sorting across the entire table is not required). However, the node PER LEVEL counter is in O (N), so there is no asymptotic improvement in runtime, but I think it's worth a try, given the lower constants that you get in reality.

Edit: Improvement

I just noticed that an iterative algorithm is completely unnecessary (while level information can be left). This is quite enough:

  • Save all nodes sorted by their text value
  • Find all matches for the last query word at once using a binary search on all nodes.
  • In each match, trace the path to the root and check if the path matches the entire query.

This improves the runtime to O (log N + M * (log N)).

+3
source share

All Articles