What are the best practices for combining analyzers in Lucene?

My situation is that I am using a StandardAnalyzer in Lucene to index text as follows:

public void indexText(String suffix, boolean includeStopWords)  {        
    StandardAnalyzer analyzer = null;


    if (includeStopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_30);
    }
    else {

        // Get Stop_Words to exclude them.
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();      
        analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
    }

    try {

        // Index text.
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);            
        this.addTextToIndex(w, this.getTextToIndex());
        w.close();

        // Read index.
        IndexReader ir = IndexReader.open(index);
        Text_TermVectorMapper ttvm = new Text_TermVectorMapper();

        int docId = 0;

        ir.getTermFreqVector(docId, PropertiesFile.getProperty(text), ttvm);

        // Set output.
        this.setWordFrequencies(ttvm.getWordFrequencies());
        // The writer was already closed above; close the reader instead.
        ir.close();
    }
    catch(Exception ex) {
        logger.error("Error message
", ex);
    }
}

private void addTextToIndex(IndexWriter w, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field(text, value, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    w.addDocument(doc);
}

This works well, but I would now like to combine it with stemming using the SnowballAnalyzer as well.

The class also has the two instance variables shown in the constructor below:

public Text_Indexer(String textToIndex) {
    this.textToIndex = textToIndex;
    this.wordFrequencies = new HashMap<String, Integer>();
}

Can anyone advise how I can achieve this with the code above?

Thanks in advance,

Mr Morgan.

Answers

Lucene provides the org.apache.lucene.analysis.Analyzer base class, which you can extend to write your own Analyzer.
As an example, have a look at the org.apache.lucene.analysis.standard.StandardAnalyzer class, which extends Analyzer.

Then, inside your Analyzer, chain the token filters that StandardAnalyzer and SnowballAnalyzer use, for example:

TokenStream result = new StandardFilter(tokenStream);
result = new StopFilter(true, result, stopSet);   // remove stop words
result = new SnowballFilter(result, "English");   // then apply Snowball stemming

Then, in your existing code, you'll be able to construct the IndexWriter with your own Analyzer implementation that chains the Standard and Snowball filters, as in the sketch below.
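For example, a minimal sketch of such an Analyzer for Lucene 3.0 might look like the following. The class name StandardSnowballAnalyzer, the hard-coded "English" stemmer and the optional stop-word handling are illustrative assumptions, not part of the original code:

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StandardSnowballAnalyzer extends Analyzer {

    private final Set<?> stopWords;   // may be null if no stop-word removal is wanted

    public StandardSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The same chain StandardAnalyzer builds, with Snowball stemming appended.
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        if (stopWords != null) {
            result = new StopFilter(true, result, stopWords);
        }
        return new SnowballFilter(result, "English");
    }
}

An instance of this class can then be passed to the IndexWriter constructor in indexText() in place of the StandardAnalyzer.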

Totally off-topic:
I suppose you'll eventually need to set up your own custom way of handling requests. That is already implemented inside Solr.

First, define your search component in SolrConfig.xml, for example:

<searchComponent name="yourQueryComponent" class="org.apache.solr.handler.component.YourQueryComponent"/>

Then write your request handler by extending SearchHandler, and register it in SolrConfig.xml:

  <requestHandler name="YourRequestHandlerName" class="org.apache.solr.handler.component.YourRequestHandler" default="true">
    <!-- default values for query parameters -->
        <lst name="defaults">
            <str name="echoParams">explicit</str>       
            <int name="rows">1000</int>
            <str name="fl">*</str>
            <str name="version">2.1</str>
        </lst>

        <arr name="components">
            <str>yourQueryComponent</str>
            <str>facet</str>
            <str>mlt</str>
            <str>highlight</str>            
            <str>stats</str>
            <str>debug</str>

        </arr>

  </requestHandler>

Then, when you send a request URL to Solr, simply include the additional parameter qt=YourRequestHandlerName, and your request handler will be used for that request.
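For example, a request such as the following (host, port and query are placeholders) would be routed through the custom handler:

  http://localhost:8983/solr/select?q=*:*&qt=YourRequestHandlerName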

More about SearchComponents.
More about RequestHandlers.

The SnowballAnalyzer provided by Lucene already uses the StandardTokenizer, StandardFilter, LowerCaseFilter, StopFilter, and SnowballFilter. So it sounds like it does exactly what you want (everything StandardAnalyzer does, plus the snowball stemming).

If it does not, you can very easily build your own analyzer by combining whatever tokenizer and TokenStreams (token filters) you wish.
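If the stock SnowballAnalyzer is enough, a minimal sketch of the change to the question's indexText() method could look like this, assuming the Lucene 3.0 contrib snowball jar is on the classpath; the "English" stemmer name and the Set-based stop-word overload are assumptions:

import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// Inside indexText(), replace the StandardAnalyzer construction with:
Analyzer analyzer = null;

if (includeStopWords) {
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
}
else {
    // Exclude the custom stop words, mirroring the question's else-branch.
    Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
    analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
}

Everything else in indexText() (RAMDirectory, IndexWriter, term-vector reading) stays exactly as in the question.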

In the end I rearranged the program code to call the SnowBallAnalyzer as an option. The output is then indexed via the StandardAnalyzer.

It runs quickly, but if I can do everything with just one analyzer I will revisit my code.

Thanks to mbonaci and Avi.




