Lucene: how to preserve whitespaces etc when tokenizing stream?

I am trying to "translate" a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is

Term1: Term2 Stopword! Term3 Term4

then I would like the output to look like

Term1 : Term2 Stopword! Term3 Term4

(with each Termi replaced by its translation), and not simply

Term1 Term2 Term3 Term4

Currently I am doing the following:

PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                             PatternAnalyzer.WHITESPACE_PATTERN,
                             false, 
                             WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

while (ts.incrementToken()) { // loop over tokens
     String termIn = charTermAttribute.toString(); 
     ...
}

But this of course loses all the whitespace, etc. How can I modify it so that I can re-insert them into the output? Thanks!

============ UPDATE!

I tried splitting the original text into "words" and "non-words". It seems to work just fine. Not sure whether it is the most efficient way, though:

public ArrayList<Token> splitToWords(String sIn) {

    if (sIn == null || sIn.length() == 0) {
        return null;
    }

    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        // the letter/non-letter state flipped: emit the token accumulated so far
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    // emit the trailing token
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));

    return list;
}
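To illustrate the scan above as a self-contained program (the `Token` and `TokenType` types here are minimal stand-ins for the ones assumed in the snippet, and this sketch returns an empty list instead of null for empty input):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    enum TokenType { WORD, NONWORD }

    record Token(String text, TokenType type) {}

    // Same word/non-word scan as above: a token boundary occurs wherever
    // Character.isLetter flips between consecutive characters.
    static List<Token> splitToWords(String sIn) {
        List<Token> list = new ArrayList<>();
        if (sIn == null || sIn.isEmpty()) return list;
        int tokenStart = 0;
        boolean curIsLetter = Character.isLetter(sIn.charAt(0));
        for (int pos = 1; pos < sIn.length(); pos++) {
            boolean newIsLetter = Character.isLetter(sIn.charAt(pos));
            if (newIsLetter == curIsLetter) continue;
            list.add(new Token(sIn.substring(tokenStart, pos),
                               curIsLetter ? TokenType.WORD : TokenType.NONWORD));
            tokenStart = pos;
            curIsLetter = newIsLetter;
        }
        list.add(new Token(sIn.substring(tokenStart),
                           curIsLetter ? TokenType.WORD : TokenType.NONWORD));
        return list;
    }

    public static void main(String[] args) {
        // "Hello, world!" splits into [Hello][, ][world][!]
        for (Token t : splitToWords("Hello, world!")) {
            System.out.println(t.type() + " [" + t.text() + "]");
        }
    }
}
```

Note that digits count as non-letters under `Character.isLetter`, so a term like `Term1` splits into `Term` and `1`; you may want `Character.isLetterOrDigit` depending on your dictionary keys.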

Best answer

Well, it doesn't really lose the whitespace; you still have your original text. :)

So I think you should make use of OffsetAttribute, which contains startOffset() and endOffset() of each term into your original text. This is what lucene uses, for example, to highlight snippets of search results from the original text.

I wrote up a quick test (using EnglishAnalyzer) to demonstrate. The input is:

Just a test of some ideas. Let's see if it works.

And the output is:

just a test of some idea. let see if it work.

// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
  String input = "Just a test of some ideas. Let's see if it works.";
  EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
  StringBuilder output = new StringBuilder(input);
  // in some cases, the analyzer will make terms longer or shorter.
  // because of this we must track how much we have adjusted the text so far
  // so that the offsets returned will still work for us via replace()
  int delta = 0;

  TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    String term = termAtt.toString();
    int start = offsetAtt.startOffset();
    int end = offsetAtt.endOffset();
    output.replace(delta + start, delta + end, term);
    delta += (term.length() - (end - start));
  }
  ts.close();

  System.out.println(output.toString());
}
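The delta bookkeeping in that loop generalizes to any left-to-right span replacement, not just Lucene token streams. A minimal Lucene-free sketch of the same technique (the `Span` record is a hypothetical stand-in for what `OffsetAttribute` plus `CharTermAttribute` yield per token):

```java
import java.util.List;

public class SpanReplaceDemo {
    // A (startOffset, endOffset, replacement) triple, standing in for
    // what OffsetAttribute and CharTermAttribute yield per token.
    record Span(int start, int end, String replacement) {}

    // Applies replacements left to right, shifting later offsets by the
    // accumulated length difference, exactly as in the answer's loop.
    static String replaceSpans(String input, List<Span> spans) {
        StringBuilder output = new StringBuilder(input);
        int delta = 0;
        for (Span s : spans) {
            output.replace(delta + s.start(), delta + s.end(), s.replacement());
            delta += s.replacement().length() - (s.end() - s.start());
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String in = "ideas works";
        // "ideas" -> "idea" (shorter), "works" -> "working" (longer);
        // delta keeps the second span's offsets valid after the first edit
        String out = replaceSpans(in, List.of(
                new Span(0, 5, "idea"), new Span(6, 11, "working")));
        System.out.println(out); // idea working
    }
}
```

The spans must be non-overlapping and sorted by start offset, which is exactly the order a `TokenStream` emits them in.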

Other answers

No other answers yet.
