Question

I m trying to improve our search capabilities for short phrases (in our case movie titles) and am currently looking at SQL Server 2008 Full Text Search, which provides some of the functionality we would like:

Word stemming (e.g. "saw" also means "see", "seen", etc.)
Synonyms (e.g. "6" is synonymous with "VI")

However the ranking algorithm seems to be proving problematic, using FREETEXTTABLE with the search term and extracting the RANK field. For example when the user enters "saw" then the results we get with out catalogue are:

RANK | Title
---------------------------------------------------------------------
180  | The Exorcist: The version you ve never seen
180  | Saw IV
180  | Saw V
180  | Anybody Here Seen Jeannie?
180  | Seeing Red

All of these have the same rank, even though it would be clear to a person that the second and third entries are a better match than the other stemmed terms.

Similarly entering "moon" gives the following results:

RANK | Title
---------------------------------------------------------------------
144  | Pink Floyd - The Dark Side of the Moon
144  | Fly Me To The Moon 3D
144  | Twilight: New Moon
144  | Moon

And here although there are no stemming matches, it would be clear to a person that the best match for "moon" is "Moon" rather than longer titles which contain it only as part of the title, yet FTS ranks them equally.

I m guessing that it s probably something to do with the way SQL Server ranks results, which treats stemmed words and synonyms with equal weight to the original term, and takes into account word density for ranking which would be good with long passages of text, but doesn t really apply with short phrases like these. So I m starting to thing that FTS isn t suitable for this job, unfortunately.

I don t really want to re-invent the wheel, so are there any existing search solutions that would work for titles and give good rankings plus the stemming/thesaurus functionality? It would also be nice if it had a spell checker to implement "did you mean..." functionality like Google, so "saww" would be corrected to "saw" and "mon" to "moon", etc.

Answer 1

Sounds like SQL FTS s ranking is close, but not exactly, what you re looking for, and that you ve narrowed down the "not exactly" cases to three:

inflections are ranked identically to non-inflected forms
words are ranked identically to their synonyms
exact matches (or short titles) are ranked identically as one-word matches within longer titles

What all three of these have in common is that a very simple, automated post-processor on results could use these rules to break ties between identically-ranked results: if there s an exact match, rank it above a non-exact match, and rank shorter titles ahead of longer ones. You may want to consider keeping FTS, and simply putting some code (either in a stored proc or in your app) on top of FTS which sorts groups of tied results by the criteria you mentioned. This would probably be easier than switching to Lucene or another non-Microsoft full-text search implementation.

If I were in your shoes, since you ve already got something working with FTS, I d try the post-processing hack above and see if it s "good enough" to meet your needs, since it d probably be the easiest thing to do.

If it s not good enough, I d start by looking at Lucene.NET (free), Solr (free), and dtSearch ($$$). Note that none will be as easy as FTS is, though-- especially Lucene.NET which is AFAIK the most popular and is very full-featured but requires a fair amount of coding, configuration, maintenance, etc. You can see this SO thread for some other opinions, there are probabaly more threads like this on SO and elsewhere if you want more opinions.

If you re looking for a "did you mean..." spelling-suggestion feature. There s an example of building this kind of feature on top of FTS in Pro Full-Text Search in SQL Server 2008 (link contains some excerpts from Google Books). Would this meet your needs? If not, there are lots of other options both free and not.

Answer 2

I know you re not interested in reinventing the wheel, but I wanted to contribute something that might at least get your wheels turning.

How to Strike a Match is one of my favorite posts on this topic. In it, the author matches strings based on the similarity of sequential doublets between words.

For example, search and smirch are both broken into doublets by letter: se, ea, ar, rc, ch for search and sm, mi, ir, rc, ch for smirch. Then the number of matching doublets is multiplied by two ( rc and ch match, so 2 * 2) and divided by the total number of doublets (5 + 5 = 10 in this case). 4/10 = 40% match between search and smirch.

This penalizes long, unrelated strings because they increase the denominator without increasing the numerator.

In your second example, this algorithm would highlight moon as the best example, but would not fail to retain Dark Side of the Moon, etc - they would just rank lower. In your first example, you would have to apply some sort of lexical conversion before invoking this algorithm, because it will fail to find similar words that stem-change (like see / saw / seen) although it will do well with non-stem-changers (France / French).

I have not contemplated how to implement this directly in an SQL application.

Answer 3

Having worked with both SQL Server (2005) FullText and Lucene(.NET) in a production environment, i really think that Lucene is the better choice:

SQL Server FTS is nice and quick; but you can t really manipulate the way that the indexes are generated. Also, you can t just "see" the index tables. All the implementation is hidden away and as such this tool is great for out-of-the-box generic FTS but harder for specific applications.

Lucene on the other hand has been used and tested in a multitude of scenarios (I highly recommend Lucene in Action if you decide to take that route). Even if the existing implementations do not fit exactly your needs you can always create a "new" specific implementation(write your own analyzer/tokenizer/filter - stemmer!! - 1), although the amount of parametrization of lucene is plenty (2) and you can always check the contents of the index using Luke (3). You also get an search application that is independent of the datastore (4), and it works equally well for Java && .NET (5). Additionally, and if that makes you tick, there is also Hibernate && NHibernate implementation (Hibernate Search - 6).

友情链接