So far, this looks like a good candidate for a regex. If the syntax gets significantly more complicated, a more complex tokenizing scheme may be necessary, but you should avoid that route unless you need it, as it is significantly more work. (On the other hand, for complex schemas a regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
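using System.Text.RegularExpressions;

string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] is the whole match, trailing whitespace included.
    string group = m.Groups[1].Value;
}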
The real benefit of this method is that it can easily be extended to include your "-" requirement, like so:
string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] is the whole match, trailing whitespace included.
    string group = m.Groups[1].Value;
}
Now I hate reading regexes as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
- If possible match a minus sign, followed by a " followed by everything until the next "
- Otherwise match a " followed by everything until the next "
- Otherwise match a - followed by any word characters
- Otherwise match as many word characters as you can
- Put the result in a capture group (group 1)
- Swallow up any following whitespace characters
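Putting the steps above together, here is a minimal sketch of how the matches might be turned into include/exclude terms. The class name QueryTokenizer, the method Tokenize, and the choice to return (text, excluded) pairs are assumptions made for illustration and are not part of the original question; only the pattern itself comes from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the name and return shape are illustrative assumptions.
public static class QueryTokenizer
{
    // Same pattern as above: an optional leading '-', then a quoted phrase or a bare word.
    private static readonly Regex TokenPattern =
        new Regex(@"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*");

    // Returns one pair per token: Key is the text with any '-' prefix and
    // surrounding quotes removed, Value is true when the token was negated.
    public static List<KeyValuePair<string, bool>> Tokenize(string query)
    {
        var result = new List<KeyValuePair<string, bool>>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            // Group 1 holds the token without the trailing whitespace.
            string token = m.Groups[1].Value;

            bool excluded = token.StartsWith("-", StringComparison.Ordinal);
            if (excluded)
            {
                token = token.Substring(1);
            }

            // Strip the quotes around a phrase; bare words are unaffected.
            token = token.Trim('"');

            result.Add(new KeyValuePair<string, bool>(token, excluded));
        }
        return result;
    }
}
```

Calling QueryTokenizer.Tokenize on the second data string above should yield nine pairs, with "lazy cat" and "energetic" flagged as excluded. One caveat of the pattern as written: a hyphen inside a word (for example well-known) is read as the word well followed by the excluded term known, so hyphenated words would need extra handling.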