How to index pdf, ppt, xl files in lucene (java based or python or php any of these is fine)?

I would also like to know how to add metadata while indexing, so that I can boost on some parameters.

Best answer

Lucene indexes text, not files. You need some other program to extract the text from the files, and then pass that text to Lucene.
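The split described above (extract first, index second) can be sketched in plain Python. This is a toy in-memory index, not Lucene's API: `TinyIndex`, `index_file`, and `fake_extract` are hypothetical names made up for illustration, and the scoring is a deliberately crude stand-in for field boosting. It only shows where extracted text and metadata fields would meet.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index; a stand-in for Lucene, not its API."""

    def __init__(self):
        # field name -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, fields):
        """Index already-extracted text, one entry per field."""
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[field][term].add(doc_id)

    def search(self, term, boosts=None):
        """Score = sum of the boosts of the fields that contain the term."""
        boosts = boosts or {}
        scores = defaultdict(float)
        for field, terms in self.postings.items():
            for doc_id in terms.get(term.lower(), ()):
                scores[doc_id] += boosts.get(field, 1.0)
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def index_file(index, path, extract_text, metadata):
    """Run the extractor first, then hand plain text plus metadata to the index."""
    fields = dict(metadata)
    fields["body"] = extract_text(path)
    index.add(path, fields)


# Hypothetical extractor: a real one would call an extraction library.
def fake_extract(path):
    return {"a.pdf": "lucene indexes text", "b.ppt": "slides about search"}[path]


index = TinyIndex()
index_file(index, "a.pdf", fake_extract, {"title": "Intro"})
index_file(index, "b.ppt", fake_extract, {"title": "Lucene overview"})
results = index.search("lucene", boosts={"title": 3.0})
```

With the title field boosted, `b.ppt` (match in the title) ranks above `a.pdf` (match in the body only). Keeping metadata in separate fields is what makes this kind of per-field weighting possible.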

Answers

There are several frameworks for extracting text from rich-text documents (pdf, ppt, etc.) in a form suitable for Lucene indexing.

  • One of them is Apache Tika, which began as a sub-project of Lucene and is now a top-level Apache project.
  • Apache POI is a more general document-handling project at Apache, aimed at Microsoft Office formats such as ppt and xls.
  • There are also some commercial alternatives.

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
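The page-by-page idea can be illustrated with a small stdlib-only Python sketch. It is not the pdfindexer code: the function names are hypothetical, the page texts are assumed to be already extracted, and the `#page=N` URL fragment is assumed to be the open parameter meant above.

```python
from collections import defaultdict
from html import escape


def build_page_index(pages_by_file):
    """Map each term to the sorted (file, page number) pairs where it occurs."""
    index = defaultdict(set)
    for pdf_name, pages in pages_by_file.items():
        for page_no, text in enumerate(pages, start=1):
            for term in text.lower().split():
                index[term].add((pdf_name, page_no))
    return {term: sorted(hits) for term, hits in index.items()}


def render_html(index, terms):
    """Render one list item per term, linking to pages via '#page=N'."""
    items = []
    for term in terms:
        links = ", ".join(
            f'<a href="{escape(name)}#page={page}">{escape(name)} p.{page}</a>'
            for name, page in index.get(term, [])
        )
        items.append(f"<li>{escape(term)}: {links}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical input: one list of page texts per PDF, extracted beforehand.
pages = {"manual.pdf": ["install lucene", "configure the index", "lucene queries"]}
page_index = build_page_index(pages)
html_index = render_html(page_index, ["lucene"])
```

Treating each page as its own indexed unit is what lets a hit link straight to the page rather than to the start of the file.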




