Search and ranking of short phrases (e.g. movie titles)

Question

I'm trying to improve our search capabilities for short phrases (in our case, movie titles), and I'm currently looking at SQL Server 2008 full text search, which provides some of the features we would like:

Stemming (for example, "saw" also matches "see", "seen", etc.)
Synonyms (for example, "6" is a synonym for "VI")

However, the ranking algorithm seems problematic. I'm using FREETEXTTABLE with the search term and pulling out the RANK field. For example, when the user enters "saw", these are the results we get back, in no particular order:

 RANK | Title
 -----|---------------------------------------------
 180  | The Exorcist: The version you've never seen
 180  | Saw IV
 180  | Saw V
 180  | Anybody Here Seen Jeannie?
 180  | Seeing Red

All of them have the same rank, although it would be clear to a person that the second and third entries are better matches than the others.

Similarly, entering “moon” gives the following results:

 RANK | Title
 -----|----------------------------------------
 144  | Pink Floyd - The Dark Side of the Moon
 144  | Fly Me To The Moon 3D
 144  | Twilight: New Moon
 144  | Moon

And here, although there are no stemmed matches, it would be clear to a person that the best match for “moon” is “Moon”, not the longer titles that contain it only as part of the name, yet FTS ranks them all identically.

I suppose this is probably because SQL Server gives stemmed words and synonyms equal weight to the original term and takes word density into account when ranking, which is fine for long passages of text but does not really apply to short phrases like these. So I'm starting to think that, unfortunately, FTS is not suitable for this job.

I really don't want to reinvent the wheel, so are there any existing search solutions that work well for titles and give good ranking plus stemming / thesaurus functionality? It would also be nice if it had spell checking to implement "did you mean ..." functionality like Google, so that "saww" would be corrected to "saw", "mon" to "moon", etc.

+6

sql-server-2008 full-text-search

Greg beech Nov 18 '09 at 16:57


3 answers

I know that you are not interested in reinventing the wheel, but I wanted to contribute something that might at least get your wheels turning.

"How to Strike a Match" is one of my favorite posts on this topic. In it, the author compares strings based on the similarity of adjacent letter pairs (doublets) within words.

For example, “search” and “smirch” are broken into letter doublets: se, ea, ar, rc, ch for search and sm, mi, ir, rc, ch for smirch. The number of matching doublets (rc and ch, so 2) is multiplied by two and divided by the total number of doublets (5 + 5 = 10 in this case). 2 * 2 / 10 = 40% match between search and smirch.

This penalizes long, unrelated strings because they increase the denominator without increasing the numerator.
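A minimal Python sketch of the doublet comparison described above might look like the following. The function names `bigrams` and `similarity` are made up for illustration, and splitting on whitespace and lower-casing are assumptions on my part rather than details taken from the post.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent letter pairs, taken word by word.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    pairs = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            pairs[word[i:i + 2]] += 1
    return pairs


def similarity(a, b):
    """Score = 2 * shared doublets / total doublets in both strings."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0  # neither string has a doublet (e.g. single letters)
    shared = sum((pairs_a & pairs_b).values())  # multiset intersection
    return 2.0 * shared / total


titles = [
    "Pink Floyd - The Dark Side of the Moon",
    "Fly Me To The Moon 3D",
    "Twilight: New Moon",
    "Moon",
]
# Sort the candidate titles by similarity to the query, best first.
ranked = sorted(titles, key=lambda t: similarity("moon", t), reverse=True)
```

With the "moon" titles from the question, the exact title scores 1.0 and the longer titles score lower because their extra doublets only grow the denominator.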

In your second example, this algorithm would pick out "Moon" as the best match, but it would not discard "The Dark Side of the Moon" and the others; they would just rank lower. In the first example, you would need to apply some lexical transformation before calling this algorithm, because it cannot find similarity between irregular forms of a word (for example, saw / see / seen), although it works well with regular variations (France / French).

I have not considered how to implement this directly in an SQL application.

+2

carbocation Nov 26 '09 at 17:22


Having worked with both SQL Server (2005) full-text search and Lucene (.NET) in a production environment, I really think Lucene is the better choice:

SQL Server FTS is nice and fast, but you cannot control how the indexes are generated. In addition, you cannot just “see” the index tables. The entire implementation is hidden, so the tool is great as a ready-made, general-purpose FTS but harder to adapt to specific applications.

Lucene, on the other hand, has been used and tested in many scenarios (I highly recommend Lucene in Action if you decide to go down this route). Even if the existing implementations do not meet your needs, you can always create a “new” specific implementation (write your own analyzer / tokenizer / filter / stemmer!! - 1), although Lucene already offers a great deal of parameterization (2), and you can always inspect the contents of the index using Luke (3). You also get a search application that is independent of the data store (4), and it works equally well for Java and .NET (5). In addition, if that is your thing, there is also integration with Hibernate and NHibernate (Hibernate Search - 6).

0

Jaguar Nov 30 '09 at 8:22


Justin grant · Accepted Answer · 2009-11-23T18:49:59+0000

It sounds like SQL FTS ranking is close to, but not quite, what you are looking for, and that you have narrowed the “not quite” cases down to three:

inflected forms are ranked identically to the non-inflected form of the word
words are ranked the same as their synonyms
exact matches (or shorter titles) are ranked the same as single-word matches in longer titles

What all three have in common is that a very simple automated post-processor over the results can use these rules to break ties between identically ranked results: if there is an exact match, rank it above the inexact matches, and rank shorter titles above longer ones. You might want to consider keeping FTS and just putting some code (either in a stored procedure or in your application) on top of FTS that sorts groups of tied results according to your criteria. This is likely to be easier than switching to Lucene or another non-Microsoft full-text search implementation.
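As a rough sketch of such a post-processor, assuming the FTS results arrive in the application as (rank, title) pairs: the function name `rerank` and the extra "title contains the literal search word" tie-breaker are assumptions added for illustration, not part of FTS.

```python
def rerank(results, query):
    """Break ties between equally ranked FTS results.

    results: list of (rank, title) tuples, e.g. from a FREETEXTTABLE query.
    Ordering: higher RANK first, then exact title matches, then titles that
    contain the literal search word (assumed rule for inflections), then
    shorter titles.
    """
    q = query.strip().lower()

    def sort_key(row):
        rank, title = row
        t = title.strip().lower()
        return (
            -rank,                # keep the FTS rank as the primary criterion
            t != q,               # False sorts first: exact match wins
            q not in t.split(),   # literal word beats inflected form
            len(title),           # shorter titles before longer ones
        )

    return sorted(results, key=sort_key)


saw_results = [
    (180, "The Exorcist: The version you've never seen"),
    (180, "Saw IV"),
    (180, "Saw V"),
    (180, "Anybody Here Seen Jeannie?"),
    (180, "Seeing Red"),
]
reranked = rerank(saw_results, "saw")
```

For the "saw" results from the question, this puts the two Saw titles ahead of the titles that only match through "seen" or "seeing". The same ordering could also be expressed in an ORDER BY clause inside a stored procedure.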

If I were in your place, since you already have something that works with FTS, I would try hacking together the post-processing described above and see if it is enough to satisfy your needs, as that will probably be the easiest option.

If this is not good enough, I would start by looking at Lucene.NET (free), Solr (free), and dtSearch ($$$). Please note that none of them will be as simple as FTS, especially Lucene.NET, which is AFAIK the most popular and very full-featured, but requires a fair amount of coding, configuration, maintenance, etc. You can see this SO thread for some other opinions; there are probably more such threads on SO and elsewhere if you want more opinions.

If you are looking for a "did you mean ..." feature that offers spelling suggestions, here is an example of building this kind of functionality on top of FTS, from Pro Full-Text Search in SQL Server 2008 (the link contains some excerpts from Google Books). Will this fit your needs? If not, there are many other options, including free ones.
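This is not the approach from the book, but as a rough illustration of the idea, a "did you mean ..." suggestion can be sketched in application code by comparing the search term against the words that actually occur in the titles. The function name `suggest` and the 0.6 cutoff are assumptions; `difflib.get_close_matches` is from the Python standard library.

```python
import difflib
import re


def suggest(term, titles, cutoff=0.6):
    """Return the closest known title word to `term`, or None.

    Assumption: the vocabulary is simply every word appearing in the titles.
    Returns None if the term is already a known word or nothing is close.
    """
    vocabulary = set()
    for title in titles:
        vocabulary.update(re.findall(r"[a-z0-9']+", title.lower()))

    term = term.strip().lower()
    if term in vocabulary:
        return None  # spelled like a known word, nothing to suggest

    # sorted() only makes the candidate order deterministic.
    matches = difflib.get_close_matches(term, sorted(vocabulary), n=1, cutoff=cutoff)
    return matches[0] if matches else None


titles = ["Saw IV", "Saw V", "Seeing Red", "Twilight: New Moon", "Moon"]
did_you_mean_saw = suggest("saww", titles)
did_you_mean_moon = suggest("mon", titles)
```

A real deployment would probably want to weight suggestions by how often each word occurs, which this sketch ignores.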
