I am using Lucene to index and search a small number of large documents. Using the demo from the Lucene site, I have indexed the documents and am able to search them. However, the search results are not very helpful, because each hit simply points to the file containing the document. With very large documents this isn't particularly useful.
I am wondering if Lucene can index these very large documents and create an abstraction over them which provides much more fine-grained results.
An example might better explain what I mean. Consider a very large book, such as the Bible. One file contains the entire text of the Bible, so with the demo, the result of searching for, say, Damascus would point to the file. What I would like to do is retain the large document but have searches return results pointing to a Book, a Chapter, or even something as precise as a Verse. So a search for Damascus could return (among others) Book 23, Chapter 7, Verse 8.
Is this possible (and best-practice in Lucene usage), or should I instead attempt to section the large document into many small files to index?
If it makes any difference, I am using Java Lucene 2.9.0 and am indexing HTML files approximately 1MB to 4MB in size. In terms of file size that is not large, but it is large relative to a person reading it.
I don't think I've explained this as well as I could, so here is another example.
Say I take my large HTML file, and (for argument's sake) the search term Damascus appears 3 times: once on line 100 within a <div> tag, once on line 2000 within a <p> tag, and once on line 5000 within an <h1> tag. Is it possible to index with Lucene such that there will be 3 results, each pointing to the specific element the term was found in?
I don't think I want a separate result for every occurrence of the term. So if the term Damascus appeared twice within a specific <div>, there would be only one match.
It appears from a comment from Kragen that what I want to do is parse the HTML myself during the indexing phase. Then I can decide, from what the parser reads in, which chunk to treat as one document. So if I see a div with a certain class, I can begin a new Lucene document, and it will be returned as a separate hit when a word within the div's content is searched for.
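To make the idea concrete, here is a rough sketch of the chunking step in plain Java, with no Lucene calls at all. The names ChunkSketch, Chunk, split and search are made up for illustration, and the regex-based splitting assumes the <div>, <p> and <h1> elements are not nested; real HTML would need a proper parser. My assumption is that each Chunk would become its own Lucene document, with the tag and line number kept alongside it so a hit can point back into the large file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkSketch {

    /** One searchable unit; the idea is that Lucene would see each of these as one document. */
    public static final class Chunk {
        public final String tag;   // element name, e.g. "div"
        public final int line;     // 1-based line on which the element starts
        public final String text;  // element content with inner tags stripped

        Chunk(String tag, int line, String text) {
            this.tag = tag;
            this.line = line;
            this.text = text;
        }
    }

    // Assumption: only non-nested <div>, <p> and <h1> elements are chunk boundaries.
    private static final Pattern ELEMENT = Pattern.compile(
            "<(div|p|h1)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Splits one large HTML string into a chunk per matching element. */
    public static List<Chunk> split(String html) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        Matcher m = ELEMENT.matcher(html);
        int line = 1;
        int scanned = 0;
        while (m.find()) {
            // Count the newlines between the previous position and this element's start.
            for (; scanned < m.start(); scanned++) {
                if (html.charAt(scanned) == '\n') {
                    line++;
                }
            }
            String text = m.group(2).replaceAll("<[^>]*>", " ").trim();
            chunks.add(new Chunk(m.group(1).toLowerCase(Locale.ROOT), line, text));
        }
        return chunks;
    }

    /** Returns each chunk containing the term at most once, however often the term occurs in it. */
    public static List<Chunk> search(List<Chunk> chunks, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Chunk> hits = new ArrayList<Chunk>();
        for (Chunk c : chunks) {
            if (c.text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<div>Damascus and again Damascus</div>\n"
                + "<p>Nothing relevant here</p>\n"
                + "<h1>Road to Damascus</h1>\n";
        for (Chunk hit : search(split(html), "Damascus")) {
            System.out.println(hit.tag + " at line " + hit.line);
        }
    }
}
```

With this toy input the <div> containing Damascus twice is reported only once, and the <h1> is reported as a second, separate hit, which is the behaviour I am hoping to get from a real Lucene index.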
Does this sound like what I want to do, and is it possible?