Finding the position of search hits from Lucene
  • Asked: 2009-08-21 10:36:40

Using Lucene, what is the recommended approach for locating the matches within the search results?

More specifically, suppose the indexed documents have a "full text" field which stores the plain-text content of some document. Furthermore, assume that the content of one of these documents is "The quick brown fox jumps over the lazy dog". A search is then performed for "fox dog". Obviously, the document will be a hit.

In this scenario, can Lucene be used to provide something like the matching regions within the found document? So, for this scenario, I would like to produce something like:

[{match: "fox", startIndex: 10, length: 3},
 {match: "dog", startIndex: 34, length: 3}]

I suspect this could be implemented using what is provided in the org.apache.lucene.search.highlight package. I am not sure about the overall approach, though...
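
For reference, the highlight package can indeed expose match offsets: a Highlighter can be given a custom Formatter whose TokenGroup callback sees the start and end offset of every scored token. The following is an untested sketch of that idea; it assumes Lucene 3.x-era highlighter APIs, a StandardAnalyzer, and a field named "contents" (all of which are assumptions, not part of the question):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenGroup;
import org.apache.lucene.util.Version;

public class MatchRegionPrinter {

    // Prints the character offsets of every token the query scored, by using the
    // Formatter callback of the Highlighter purely for its offset information.
    public static void printMatchRegions(Query query, String fieldText)
            throws IOException, InvalidTokenOffsetsException {
        Formatter offsetRecorder = new Formatter() {
            public String highlightTerm(String originalText, TokenGroup group) {
                if (group.getTotalScore() > 0) {
                    System.out.println("match: " + originalText
                            + ", startIndex: " + group.getStartOffset()
                            + ", length: " + (group.getEndOffset() - group.getStartOffset()));
                }
                return originalText; // no markup added; only the offsets are of interest
            }
        };
        Highlighter highlighter = new Highlighter(offsetRecorder, new QueryScorer(query, "contents"));
        highlighter.setTextFragmenter(new NullFragmenter()); // keep the whole text as one fragment
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(fieldText));
        highlighter.getBestFragment(stream, fieldText);
    }
}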

Answers

TermFreqVector is what I used. Here is a working demo that prints both the term positions and the starting and ending term offsets:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {
    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();  
        s.doSearch(args[0], args[1]);  
    }  

    Search() {
    }  

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.  
        //    The same analyzer should be used as was used for indexing  
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);  

        Directory index = FSDirectory.open(new File(db));  

        // 2. query  
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);  

        // 3. search  
        int hitsPerPage = 10;  
        IndexSearcher searcher = new IndexSearcher(index, true);  
        IndexReader reader = IndexReader.open(index, true);  
        searcher.setDefaultFieldSortScoring(true, false);  
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);  
        searcher.search(q, collector);  
        ScoreDoc[] hits = collector.topDocs().scoreDocs;  

        // 4. display term positions, and term indexes   
        System.out.println("Found " + hits.length + " hits.");  
        for(int i=0;i<hits.length;++i) {  

            int docId = hits[i].doc;  
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");  
            TermPositionVector tpvector = (TermPositionVector)tfvector;  
            // this part works only if there is one term in the query string,  
            // otherwise you will have to iterate this section over the query terms.  
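            // (Hedged sketch, not in the original answer: for multi-term queries one
            //  could extract the terms from the query and repeat the lookup per term,
            //  e.g. with a hypothetical loop such as
            //      Set<Term> terms = new HashSet<Term>();
            //      q.extractTerms(terms);              // only works on primitive/rewritten queries
            //      for (Term t : terms) { int idx = tfvector.indexOf(t.text()); ... }
            //  keeping in mind that StandardAnalyzer lowercases terms at index time.)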
            int termidx = tfvector.indexOf(querystr);  
            int[] termposx = tpvector.getTermPositions(termidx);  
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);  

            for (int j=0;j<termposx.length;j++) {  
                System.out.println("termpos : "+termposx[j]);  
            }  
            for (int j=0;j<tvoffsetinfo.length;j++) {  
                int offsetStart = tvoffsetinfo[j].getStartOffset();  
                int offsetEnd = tvoffsetinfo[j].getEndOffset();  
                System.out.println("offsets : "+offsetStart+" "+offsetEnd);  
            }  

            // print some info about where the hit was found...  
            Document d = searcher.doc(docId);  
            System.out.println((i + 1) + ". " + d.get("path"));  
        }  

        // searcher can only be closed when there  
        // is no need to access the documents any more.   
        searcher.close();
        reader.close();
    }      
}
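
Note that the demo above relies on the "contents" field having been indexed with term vectors that store positions and offsets; otherwise getTermFreqVector() returns null. A minimal indexing sketch under that assumption (same Lucene 3.x APIs; the "path" field and the addDoc helper are illustrative, not from the original answer):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    // Adds one document whose "contents" field stores term vectors with
    // positions and offsets, which the search code above depends on.
    public static void addDoc(IndexWriter writer, String path, String text) throws IOException {
        Document doc = new Document();
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT,
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File(args[0])), config);
        addDoc(writer, "doc1.txt", "The quick brown fox jumps over the lazy dog");
        writer.close();
    }
}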

Here is a solution for Lucene 5.2.1. It only works for single-word queries, but it should demonstrate the basic principles.

The basic idea is:

  1. Get a TokenStream for each document, which matches your query.
  2. Create a QueryScorer and initialize it with the retrieved tokenStream.
  3. Loop over each token of the stream (done by tokenStream.incrementToken()) and check if the token matches the search criteria (done by queryScorer.getTokenScore()).

The code is:

import java.io.IOException;
import java.util.List;
import java.util.Vector;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;

public class OffsetSearcher {

    private IndexReader reader;

    public OffsetSearcher(IndexWriter indexWriter) throws IOException { 
        reader = DirectoryReader.open(indexWriter, true); 
    }

    public OffsetData[] getTermOffsets(Query query) throws IOException, InvalidTokenOffsetsException 
    {
        List<OffsetData> result = new Vector<>();

        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs topDocs = searcher.search(query, 1000);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;   

        Document doc;
        TokenStream tokenStream;
        CharTermAttribute termAtt;
        OffsetAttribute offsetAtt;
        QueryScorer queryScorer;
        OffsetData offsetData;
        String txt, tokenText;
        for (int i = 0; i < scoreDocs.length; i++) 
        {
            int docId = scoreDocs[i].doc;
            doc = reader.document(docId);

            txt = doc.get(RunSearch.CONTENT);
            tokenStream = TokenSources.getTokenStream(RunSearch.CONTENT, reader.getTermVectors(docId), txt, new GermanAnalyzer(), -1);

            termAtt = (CharTermAttribute)tokenStream.addAttribute(CharTermAttribute.class);
            offsetAtt = (OffsetAttribute)tokenStream.addAttribute(OffsetAttribute.class);

            queryScorer = new QueryScorer(query);
            queryScorer.setMaxDocCharsToAnalyze(RunSearch.MAX_DOC_CHARS);
            TokenStream newStream  = queryScorer.init(tokenStream);
            if (newStream != null) {
                tokenStream = newStream;
            }
            queryScorer.startFragment(null);

            tokenStream.reset();

            int startOffset, endOffset;
            for (boolean next = tokenStream.incrementToken(); next && (offsetAtt.startOffset() < RunSearch.MAX_DOC_CHARS); next = tokenStream.incrementToken())
            {
                startOffset = offsetAtt.startOffset();
                endOffset = offsetAtt.endOffset();

                if ((endOffset > txt.length()) || (startOffset > txt.length()))
                {
                    throw new InvalidTokenOffsetsException("Token " + termAtt.toString() + " exceeds length of provided text sized " + txt.length());
                }

                float res = queryScorer.getTokenScore();
                if (res > 0.0F && startOffset <= endOffset) {
                    tokenText = txt.substring(startOffset, endOffset);
                    offsetData = new OffsetData(tokenText, startOffset, endOffset, docId);
                    result.add(offsetData);
                }           
            }   
        }

        return result.toArray(new OffsetData[result.size()]);
    }


    public void close() throws IOException {
        reader.close();
    }


    public static class OffsetData {

        public String phrase;
        public int startOffset;
        public int endOffset;
        public int docId;

        public OffsetData(String phrase, int startOffset, int endOffset, int docId) {
            super();
            this.phrase = phrase;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.docId = docId;
        }

    }

}
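
A hypothetical usage sketch for the class above; RunSearch.CONTENT is the answer's own field-name constant, while the IndexWriter, the term "fox", and the demo class name are assumptions added for illustration:

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;

public class OffsetSearcherDemo {
    // indexWriter must point at the index built with RunSearch.CONTENT as the
    // text field; the term "fox" is only a placeholder.
    public static void run(IndexWriter indexWriter) throws IOException, InvalidTokenOffsetsException {
        Query query = new TermQuery(new Term(RunSearch.CONTENT, "fox"));
        OffsetSearcher searcher = new OffsetSearcher(indexWriter);
        for (OffsetSearcher.OffsetData hit : searcher.getTermOffsets(query)) {
            System.out.println(hit.phrase + " [" + hit.startOffset + ", " + hit.endOffset
                    + ") in doc " + hit.docId);
        }
        searcher.close();
    }
}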



