Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:

the quick "brown fox" jumps over the "lazy dog"

I would like to have a string array with the following tokens:

the
quick
brown fox
jumps
over
the
lazy dog

As you can see, the tokens preserve the spaces within double quotes.

I'm looking for some examples of how I could do this in C#, preferably not using regular expressions; however, if that makes the most sense and would be the most performant, then so be it.

Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.

Best answer

So far, this looks like a good candidate for regexes. If it gets significantly more complicated, a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary, as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)

This regex should solve your problem:

("[^"]+"|\w+)\s*

Here is a C# example of its usage:

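string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] is the whole match,
    // which also includes the trailing whitespace.
    string token = m.Groups[1].Value;
}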

The real benefit of this method is that it can be easily extended to include your "-" requirement, like so:

string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] is the whole match,
    // which also includes the trailing whitespace.
    string token = m.Groups[1].Value;
}

Now, I hate reading regexes as much as the next guy, but if you split it up, this one is quite easy to read:

(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*

Explanation

  1. If possible match a minus sign, followed by a " followed by everything until the next "
  2. Otherwise match a " followed by everything until the next "
  3. Otherwise match a - followed by any word characters
  4. Otherwise match as many word characters as you can
  5. Put the result in a group
  6. Swallow up any following space characters

Other answers

I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.

To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.

Edit: I don't think this would handle your "-" requirement, so maybe the regex solution is better.

Go through the string character by character, like this (sort of pseudocode):

array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word

// Rest
if not empty word:
    append word to words
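
The loop above could be turned into a small runnable method. Here is a minimal sketch in Java (another answer in this thread uses Java as well, and a C# version would look nearly identical). The class name QueryTokenizer and the method name tokenize are made up for illustration. As in the pseudocode, an unclosed quote simply runs to the end of the string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for illustration only.
final class QueryTokenizer {

    // Splits on spaces while keeping double-quoted phrases together.
    static List<String> tokenize(String query) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        boolean inQuotes = false;

        for (char c : query.toCharArray()) {
            if (inQuotes) {
                if (c == '"') {
                    // Closing quote: the phrase collected so far is one token.
                    words.add(word.toString());
                    word.setLength(0);
                    inQuotes = false;
                } else {
                    word.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ' ') {
                // A space outside quotes ends the current word, if there is one.
                if (word.length() > 0) {
                    words.add(word.toString());
                    word.setLength(0);
                }
            } else {
                word.append(c);
            }
        }

        // Flush whatever is left after the last character.
        if (word.length() > 0) {
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [the, quick, brown fox, jumps, over, the, lazy dog]
        System.out.println(tokenize("the quick \"brown fox\" jumps over the \"lazy dog\""));
    }
}
```

Like the pseudocode, this adds an empty token for an empty pair of quotes (""); skipping those would take one extra length check before the add.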

I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's answer. Thought I would share it here despite the question being asked for C#. Hope that's okay.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static final List<String> convertQueryToWords(String q) {
    List<String> words = new ArrayList<>();
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        MatchResult result = matcher.toMatchResult();
        if (result != null && result.group() != null) {
            if (result.group().contains("\"")) {
                // Quoted phrase: strip the surrounding quotes.
                words.add(result.group().trim().replaceAll("\"", "").trim());
            } else {
                words.add(result.group().trim());
            }
        }
    }
    return words;
}
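
The same idea might be extended to cover the "-" exclusion prefix from the question. The following is only a sketch: the class SearchQuery, its included/excluded fields, and the parse method are hypothetical names, and the pattern is a variation on the regex from the best answer that captures the optional minus sign in its own group instead of listing four alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical container for the parsed terms, named for illustration only.
final class SearchQuery {

    // Group 1: optional "-" prefix; group 2: body of a quoted phrase; group 3: bare word.
    private static final Pattern TOKEN =
            Pattern.compile("(-?)(?:\"([^\"]+)\"|(\\w+))");

    final List<String> included = new ArrayList<>();
    final List<String> excluded = new ArrayList<>();

    static SearchQuery parse(String q) {
        SearchQuery result = new SearchQuery();
        Matcher m = TOKEN.matcher(q);
        while (m.find()) {
            // Exactly one of group 2 (phrase) or group 3 (word) is set per match.
            String term = m.group(2) != null ? m.group(2) : m.group(3);
            if (m.group(1).isEmpty()) {
                result.included.add(term);
            } else {
                result.excluded.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SearchQuery sq = parse("the quick \"brown fox\" -\"lazy cat\" -energetic");
        System.out.println(sq.included); // [the, quick, brown fox]
        System.out.println(sq.excluded); // [lazy cat, energetic]
    }
}
```

One caveat: a hyphen inside a word, as in well-known, is read as an exclusion ("well" included, "known" excluded). The regex in the best answer behaves the same way, so hyphenated words would need extra handling in either version.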



