Using Apache Lucene to parse large PDF files

I am trying to find the best way to search and parse a set of large PDF files. I am currently using PDFBox to convert my PDF files to text files, then using Lucene to index those text files and search them for information. I am running into some problems with this approach. (Note that I am using both technologies at a very basic level, just to see what they can do.)

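Consider the following line from my PDF file, which gives the grand total for each of several columns. Each column contains a dollar value, and the totals appear as shown below.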

    Grand Total  $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00
                 $10.00    $5.15   $25.60 $0.00 $2.69 $0.00  $0.00 $0.00 $3,768.17

When I convert my PDF file to a text file using PDFTextStripper from PDFBox, the line above is converted to the following text in the text file.

    58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17

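As the text file above shows, the data ends up scattered around the Grand Total label. Because the indentation of the PDF file is not preserved in the text file, it is hard to retrieve all of the grand total values.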
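As a minimal sketch of one way to cope with the scrambled line as it stands, the dollar amounts can be pulled out with a regular expression, independent of where the Grand Total label landed. The class and method names here are hypothetical, and the sketch assumes the final amount on the line is the overall total; the arithmetic for this particular line is consistent with that. Note that this recovers the values but not which column each one belongs to, since the extracted order differs from the PDF.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or Lucene.
class GrandTotalAmounts {
    // Matches amounts such as $3,148.06 or $0.00. The leading "$" is required,
    // so the stray "58.20" fused onto the start of the line is skipped.
    private static final Pattern AMOUNT =
            Pattern.compile("\\$(\\d{1,3}(?:,\\d{3})*\\.\\d{2})");

    // Returns every dollar amount on the line, in the order it appears.
    static List<BigDecimal> extractAmounts(String line) {
        List<BigDecimal> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(line);
        while (m.find()) {
            amounts.add(new BigDecimal(m.group(1).replace(",", "")));
        }
        return amounts;
    }

    // Sums every amount except the last one (assumed to be the overall total).
    static BigDecimal sumExcludingLast(List<BigDecimal> amounts) {
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal a : amounts.subList(0, amounts.size() - 1)) {
            sum = sum.add(a);
        }
        return sum;
    }

    public static void main(String[] args) {
        // The scrambled line exactly as it appears in the extracted text file.
        String line = "58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total "
                + "$8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17";

        List<BigDecimal> amounts = extractAmounts(line);
        BigDecimal overall = amounts.get(amounts.size() - 1);

        System.out.println("amounts found: " + amounts.size());            // 17
        System.out.println("sum of columns: " + sumExcludingLast(amounts)); // 3768.17
        System.out.println("last amount:    " + overall);                   // 3768.17
    }
}
```

Using `BigDecimal` instead of `double` keeps the cent values exact, so the column sum can be compared with the final amount directly.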

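So I would like to know whether there is a way to convert PDF files to text files such that the text file preserves the indentation/formatting of the PDF file. I would also like to know whether Lucene is a good idea for achieving my goal, or whether there is a simpler, faster way to retrieve information from a set of large PDF files.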

Best answer

Take a look at Apache Tika: http://tika.apache.org/ (Tika is what people generally use when they need to extract data from PDFs.)

Is there an easier way? Solr has strong integration with Tika, which should make it easier to index PDF documents. (Solr is a wrapper around Lucene.)
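On the layout question itself: Tika relies on PDFBox for PDF parsing, and PDFBox's `PDFTextStripper` has a `setSortByPosition(true)` setter that asks it to emit text in page-position order instead of the order in which it is stored in the PDF. Whether that is enough for this particular file is not something the thread confirms. The idea behind position sorting can be sketched in plain Java as follows; the `Fragment` type, the coordinates, and the tolerance value are all hypothetical and only illustrate the technique, not PDFBox's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: rebuilds reading order from positioned text fragments.
class LayoutOrder {
    // Hypothetical fragment: a piece of text plus its page coordinates
    // (x grows to the right, y grows downwards).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups fragments into lines by y (within yTolerance of the first
    // fragment on the line), then orders each line left to right by x.
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : byY) {
            if (!current.isEmpty() && f.y - current.get(0).y > yTolerance) {
                lines.add(joinByX(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    // Sorts one line's fragments by x and joins them with single spaces.
    private static String joinByX(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments listed out of reading order, with made-up coordinates.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("$484.80", 200, 100.2),
                new Fragment("Grand Total", 20, 100.0),
                new Fragment("$3,148.06", 120, 100.1),
                new Fragment("$5.15", 200, 112.1),
                new Fragment("$10.00", 120, 112.0));

        for (String line : toLines(fragments, 2.0)) {
            System.out.println(line);
        }
        // Prints:
        // Grand Total $3,148.06 $484.80
        // $10.00 $5.15
    }
}
```

This restores reading order only; it does not reproduce the original column spacing, so matching a value to its column would still need the x coordinates.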




