I am trying to find the best way to search and parse a set of large PDF files. I am currently using PDFBox to convert my PDF files to text files, and then using Lucene to index these text files and search for information. I am running into some problems with this approach. (Note that I am using both technologies at a very basic level, just to see what they can do.)
Consider the following line in my PDF file, which gives the grand total of various columns. Each column contains a dollar value, and the totals are given as follows.
Grand Total $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00
$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17
When I convert my PDF file to a text file using PDFTextStripper from PDFBox, the above line from the PDF file is converted to the following text in the text file.
58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17
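As the text file output above shows, the data is scattered around the Grand Total label. Because the tabbing from the PDF file is not preserved in the text file, it becomes difficult to retrieve all of the Grand Total values.
So I would like to know whether there is a way to convert PDF files to text files such that the text files keep the tabbing/formatting of the PDF file. I would also like to know whether Lucene is a good way to achieve my goal, or whether there is a simpler, quicker way to retrieve information from a set of large PDF files.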
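To make the problem concrete, here is a minimal sketch in plain Java (no PDFBox or Lucene involved, and not my actual code) that pulls the dollar amounts out of both versions of the line with a regular expression. The class name GrandTotalDemo, the method names, and the regex are made up for this illustration, and the regex assumes every amount looks like $1,234.56.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrandTotalDemo {
    // Assumed format: a dollar sign, digits with optional thousands separators, two decimals.
    private static final Pattern DOLLAR =
            Pattern.compile("\\$\\d{1,3}(?:,\\d{3})*\\.\\d{2}");

    // Returns every dollar amount in the line, in the order it appears in the text.
    static List<String> dollarAmounts(String line) {
        List<String> amounts = new ArrayList<>();
        Matcher m = DOLLAR.matcher(line);
        while (m.find()) {
            amounts.add(m.group());
        }
        return amounts;
    }

    // Counts the amounts that appear before the label; 0 if the label is missing.
    static int amountsBeforeLabel(String line, String label) {
        int labelPos = line.indexOf(label);
        if (labelPos < 0) {
            return 0;
        }
        return dollarAmounts(line.substring(0, labelPos)).size();
    }

    public static void main(String[] args) {
        // The line as it reads in the PDF (the two halves joined with a space).
        String pdfLine = "Grand Total $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00 "
                + "$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17";
        // The same line as it comes out of the text conversion.
        String extracted = "58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total "
                + "$8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17";

        System.out.println("PDF order:       " + dollarAmounts(pdfLine));
        System.out.println("Extracted order: " + dollarAmounts(extracted));
        System.out.println("Amounts before the label in the extracted text: "
                + amountsBeforeLabel(extracted, "Grand Total"));
    }
}
```

Both versions contain the same 17 amounts, but in the extracted text 6 of them land before the Grand Total label and the column order is changed, so I cannot map a value back to its column by position.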