Question

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well, but I'm having a few issues revolving around acronyms in the data...

As an example of what's going on: if I had "N.A.S.A." in the field I indexed, I'm able to match it with n.a.s.a. or nasa, but n.a.s.a doesn't match it, not even with a fuzzy search (n.a.s.a~).

The first thought that comes to mind is to strip out all the periods before indexing and searching, but that seems more like a workaround than a solution, and I was hoping for something cleaner.

Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?

Answer 1

The StandardAnalyzer uses the StandardTokenizer, which tokenizes N.A.S.A. (with the trailing period) as nasa, but won't do this to N.A.S.A (without it). That's why your original query matches for both the input n.a.s.a., which is processed into nasa, and the input nasa, which matches the already tokenized value. This also explains why n.a.s.a won't match anything: it is kept as n.a.s.a, while the index only contains the token nasa.

This can be seen when outputting the value from the token stream directly.

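using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version; // avoids a clash with System.Version

public static class Program {
    public static void Main(string[] args) {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));

        // Print each term the analyzer emits for the input.
        var termAttr = stream.GetAttribute<ITermAttribute>();
        while (stream.IncrementToken()) {
            Console.WriteLine(termAttr.Term);
        }

        Console.ReadLine();
    }
}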

Outputs:

nasa
n.a.s.a

You would probably need to write a custom analyzer to handle this scenario. One solution would be to keep the original token as well, so that a prefix query such as n.a* would work, but you would also need to build better detection of acronyms.
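As a rough illustration of the acronym-detection part, here is a minimal sketch that collapses dotted acronyms with a regular expression. AcronymNormalizer and its Normalize method are hypothetical names for this example, not part of Lucene.NET, and the pattern is only an assumption about what counts as an acronym (single letters separated by periods, with an optional trailing period). It is closer to the "strip the periods" workaround than to a full custom analyzer, but it only touches text that looks like an acronym.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymNormalizer
{
    // Two or more single letters separated by periods, with an optional
    // trailing period, e.g. "N.A.S.A." or "n.a.s.a". The final lookahead
    // rejects matches that run straight into more letters, digits, or periods.
    private static readonly Regex DottedAcronym = new Regex(
        @"\b(?:[A-Za-z]\.)+[A-Za-z]\b\.?(?![A-Za-z0-9.])",
        RegexOptions.Compiled);

    // Removes the periods from every dotted acronym found in the text
    // and leaves everything else unchanged.
    public static string Normalize(string text)
    {
        return DottedAcronym.Replace(text, m => m.Value.Replace(".", ""));
    }
}
```

The idea would be to run the same Normalize call on the field text before indexing and on the query string before parsing, so that the StandardAnalyzer sees the same undotted form on both sides whether or not the trailing period was typed. Note that this also collapses abbreviations such as "e.g.", and it swallows a sentence-ending period that directly follows an acronym.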
