Question

I am trying to find the best way to search/parse a set of large PDF files. I am currently using PDFBox to convert my PDF files to text files, and then using Lucene to index these text files and search for information. I am facing some problems with this approach. (Note that I am using both technologies at a very basic level, just to see what they can do.)

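Consider the following line in my PDF file, which gives the grand total of the various columns. Each column contains a dollar value, and the totals are as shown below.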

    Grand Total  $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00
                 $10.00    $5.15   $25.60 $0.00 $2.69 $0.00  $0.00 $0.00 $3,768.17

When I convert my PDF file to a text file using TextStripper from PDFBox, the above line from the PDF file is converted to the following text in the text file:

    58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17

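As you can see from the text file above, the data is scattered around the Grand Total label. Because the indentation of the PDF file is not preserved in the text file, it is hard to retrieve all of the Grand Total data.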
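
To make the problem concrete, here is a minimal sketch of what can still be recovered from the converted line using only the Java standard library. The class name `DollarAmounts` and the regular expression are illustrative assumptions, not part of PDFBox or Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls dollar amounts out of one line of extracted text.
class DollarAmounts {
    // Matches amounts such as $3,148.06 or $0.00 (assumed format: two decimals,
    // optional thousands separators).
    private static final Pattern AMOUNT =
            Pattern.compile("\\$\\d{1,3}(?:,\\d{3})*\\.\\d{2}");

    // Returns the amounts in the order they appear in the text,
    // which is not necessarily the column order of the PDF.
    static List<String> extract(String line) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(line);
        while (m.find()) {
            amounts.add(m.group());
        }
        return amounts;
    }

    public static void main(String[] args) {
        String line = "58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total "
                + "$8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17";
        for (String amount : extract(line)) {
            System.out.println(amount);
        }
    }
}
```

On the sample line this finds all 17 amounts, the same number as in the PDF (8 on the first row, 9 on the second), but in a different order: `$8.90` is the fourth value in the PDF and the seventh in the text. So the values themselves survive the conversion, while the mapping from value to column does not.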

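So I would like to know whether there is a way to convert PDF files to text files such that the text file keeps the indentation/formatting of the PDF file. I would also like to know whether Lucene is a good choice for my goal, or whether there is an easier, faster way to retrieve information from a set of large PDF files.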

Answer 1

Take a look at Tika (http://tika.apache.org/). It is what people typically use when they extract data from PDFs.

As for whether there is an easier way: Solr has strong integration with Tika, which should make it easier to index PDF documents. (Solr is a wrapper around Lucene.)
