Trouble searching for acronyms in Lucene.NET

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well, but I'm having a few issues with acronyms in the data.

As an example of what's going on: if I have "N.A.S.A." in the indexed field, I'm able to match it with n.a.s.a. or nasa, but n.a.s.a (no trailing dot) doesn't match it, not even with a fuzzy search (n.a.s.a~).

My first thought is to strip out all the "."s before indexing and searching, but that feels more like a workaround than a solution, and I was hoping for something cleaner.

Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?

Answer

The StandardAnalyzer uses the StandardTokenizer, and the resulting analysis chain reduces N.A.S.A. (with the trailing dot) to nasa, but does not do the same for N.A.S.A (without the trailing dot). That's why your original query matches with both n.a.s.a., which is processed into nasa, and nasa, which matches the already tokenized value directly. It also explains why n.a.s.a won't match anything: the query term stays n.a.s.a, while the index only contains the token nasa.

This can be seen when outputting the value from the token stream directly.

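using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

public static void Main(string[] args) {
    // Analyze both spellings: with and without the trailing dot.
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));

    // Print each token exactly as it would be written to the index.
    var termAttr = stream.GetAttribute<ITermAttribute>();
    while (stream.IncrementToken()) {
        Console.WriteLine(termAttr.Term);
    }

    Console.ReadLine();
}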

Outputs:

nasa
n.a.s.a

You would probably need to write a custom analyzer to handle this scenario. One option is to keep the original token in the index as well, so that a prefix query such as n.a* would work, but you would also need to build better acronym detection.
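
As a rough illustration of what "better acronym detection" could look like, here is a minimal sketch in plain C#. It does not use Lucene.NET at all; AcronymNormalizer, IsAcronym and Normalize are hypothetical names invented for this example, and the rule it applies (single letters separated by dots, trailing dot optional) is an assumption about what should count as an acronym, not something StandardAnalyzer does.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene.NET): treats a token as an acronym
// when it consists of single letters separated by dots, with or without a
// trailing dot, and strips the dots from such tokens.
public static class AcronymNormalizer
{
    // Matches "N.A.S.A." and "n.a.s.a"; rejects "example.com", "3.14", "a.".
    private static readonly Regex DottedAcronym =
        new Regex(@"^\p{L}(?:\.\p{L})+\.?$", RegexOptions.CultureInvariant);

    public static bool IsAcronym(string token)
    {
        return !string.IsNullOrEmpty(token) && DottedAcronym.IsMatch(token);
    }

    // Returns the token without dots if it looks like an acronym;
    // any other token is returned unchanged. Case is left as-is.
    public static string Normalize(string token)
    {
        return IsAcronym(token) ? token.Replace(".", "") : token;
    }
}
```

Something like this could be called from a custom token filter inside the analyzer, so that the same rule is applied at both index time and query time. Note that it would also collapse abbreviations such as e.g. into eg, which may or may not be what you want.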
