Parsing RTF documents using Java/JavaCC
  • Time: 2009-05-12 18:55:32

Is anybody familiar with the RTF document format and with parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:

Swing RTFEditorKit API
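
For reference, here is a minimal sketch of the RTFEditorKit route mentioned above. The class name RtfTextExtractor and the sample RTF string are made up for illustration; RTFEditorKit, createDefaultDocument, read and getText are the standard Swing text API. It only recovers plain text, and how faithfully it does so depends on the input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, not part of the JDK.
final class RtfTextExtractor {

    /** Lets RTFEditorKit parse the stream, then returns the text it recovered. */
    static String extractText(InputStream rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(rtf, doc, 0);                      // insert parsed content at offset 0
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written RTF sample (an assumption, not taken from a real file).
        String sample = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}"
                + "\\f0\\pard Hello, RTF!\\par}";
        try (InputStream in = new ByteArrayInputStream(
                sample.getBytes(StandardCharsets.US_ASCII))) {
            System.out.println(extractText(in));
        }
    }
}
```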

However, it is not that accurate when it comes to parsing RTF documents. In fact, there is a comment in the API:

The RTF support was not written by the Swing team. In the future we hope to improve the support provided.

I don't think I'm going to wait for this to happen :)

The other approach is to use JavaCC to define a grammar and generate a parser. This works better, but I can't find a complete grammar. I have tried:

PMD Applied JavaCC Grammar

which is OK, and the following (which is the best so far):

Koders RTFParserDelegate and ETranslate Grammar

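There are various implementations of the ETranslate grammar (I know the Nutch API might use it). Does anybody know which is the most accurate grammar, or whether there is a better approach to this?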

I could start ploughing through the JavaCC docs to understand the .jj files and test the grammar against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated.

Answers

Does anybody know which is the most accurate grammar or whether there is a better approach to this?

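Many years ago I read RTF (Wikipedia) using C#. I say "read" because, if you understand RTF in detail and use it the way it was designed, you will realize that RTF is not meant to be read in as a whole and parsed repeatedly. In the documentation you will find a grammar for RTF, but don't be misled into thinking you should use a lexer/parser. The documentation also provides a sample RTF reader.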

Remember that RTF was created ages ago, when memory was measured in KB rather than MB, and editing long documents of several hundred pages in a conventional way would tax system resources. So RTF has the ability to be edited in smaller subsections without loading or modifying the entire document. This is what gives it the ability to work on such large documents with limited memory. It is also why the syntax may seem odd at first.
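
To make the "reader rather than lexer/parser" idea concrete, here is a rough sketch of a streaming reader in Java. It is not the sample reader from the RTF documentation and not a complete implementation. The class name StreamingRtfReader, the SKIPPED list of destinations and the sample input are assumptions for illustration. It consumes one character at a time, keeps only a small stack of per-group state, and emits plain text as it goes. It assumes Java 9+ (for Set.of), ignores \uN Unicode escapes, and maps \'hh bytes to Latin-1 as an approximation.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical class name; a sketch under the assumptions stated above.
final class StreamingRtfReader {
    // Destinations whose contents are not body text (illustrative, incomplete).
    private static final Set<String> SKIPPED =
            Set.of("fonttbl", "colortbl", "stylesheet", "info", "pict");

    static String toPlainText(Reader source) throws IOException {
        PushbackReader in = new PushbackReader(source, 1);
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // skip flag of each enclosing group
        boolean skip = false;                      // true inside a skipped destination
        int c;
        while ((c = in.read()) != -1) {
            switch (c) {
                case '{': saved.push(skip); break;                        // group start
                case '}': if (!saved.isEmpty()) skip = saved.pop(); break; // group end
                case '\\': skip = handleControl(in, out, skip); break;
                case '\r': case '\n': break;       // raw line breaks are not content
                default: if (!skip) out.append((char) c);
            }
        }
        return out.toString();
    }

    /** Handles what follows a backslash; returns the new skip flag for this group. */
    private static boolean handleControl(PushbackReader in, StringBuilder out, boolean skip)
            throws IOException {
        int c = in.read();
        if (c == -1) return skip;
        if (!isLetter(c)) {                        // control symbol
            if (c == '*') return true;             // \* marks an ignorable destination
            if (c == '\'') {                       // \'hh: one byte written in hex
                int hi = Character.digit(in.read(), 16);
                int lo = Character.digit(in.read(), 16);
                if (!skip && hi >= 0 && lo >= 0) out.append((char) (hi * 16 + lo));
            } else if (!skip && (c == '\\' || c == '{' || c == '}')) {
                out.append((char) c);              // escaped literal character
            }
            return skip;
        }
        StringBuilder word = new StringBuilder();  // control word: a run of letters
        while (isLetter(c)) { word.append((char) c); c = in.read(); }
        if (c == '-') c = in.read();               // a hyphen introduces a negative parameter
        while (isDigit(c)) c = in.read();          // numeric parameter, ignored here
        if (c != ' ' && c != -1) in.unread(c);     // one space delimiter is consumed
        String w = word.toString();
        if (SKIPPED.contains(w)) return true;
        if (!skip) {
            if (w.equals("par") || w.equals("line")) out.append('\n');
            else if (w.equals("tab")) out.append('\t');
        }
        return skip;
    }

    private static boolean isLetter(int c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
    private static boolean isDigit(int c) { return c >= '0' && c <= '9'; }

    public static void main(String[] args) throws IOException {
        String sample = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0 Hello \\b bold\\b0  world\\par}";
        System.out.print(toPlainText(new StringReader(sample)));
    }
}
```

For the sample in main this yields "Hello bold world" followed by a line break. The only state carried between characters is the group stack, which is why memory use stays small regardless of document size.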

Presumably, the OpenOffice source code contains what you are looking for.




