Question

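I want to organize the PDF documents I have downloaded from the internet. Many of them are clearly not named accurately. I want to extract the real title from each file. Many of them were compiled from LaTeX, and I think that in a PDF compiled from LaTeX we should be able to find keywords or something similar. I then want to use that to rename the files.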

I can read the metadata using pyPdf. However, most PDFs do not have the title in their metadata. I tried it on all my files and found nothing!

Two questions: 1. Is it possible to read the title of a PDF that was compiled from LaTeX? 2. Which library (mainly in C/C++, Java, or Python) can I use to get that information?

Thanks in advance.

Answer 1

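I think this is not really possible. The LaTeX source is no longer in the PDF. If the title is not in the metadata, then, if the file is a "tagged PDF", you might be able to get the title from the structure information. However, most PDFs are not tagged, and the ones without usable metadata are unlikely to be.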

This leaves you with layout analysis: try to determine what is the title from the document by looking at layout characteristics. For python, you might want to have a look at pdfminer. The following example uses pdfminer to determine the title using a rather simplistic approach:

we assume that the title is somewhere on the first page
we leave it to pdfminer to recognize "blocks of text" on the first page
we assume that the title is printed "bigger" than the rest of the page. Looking at the height of each line in the text blocks, we determine which block contains the "tallest" line, and assume that that block contains the title
we let pdfminer extract the text from the block,
the text will probably contain newlines (placed by pdfminer) because the title might span more than one line, and other needless whitespace, so we do some simple whitespace normalization (replace consecutive whitespace by a single space, and strip leading and trailing whitespace), and that's it!

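As I said: this approach is rather simplistic and may or may not give good results for your documents, but it might point you in the right direction. Here it is: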

# Written for Python 2 and the original pdfminer API.
import sys
import re
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

filename = sys.argv[1]
fp = open(filename, 'rb')

# Wire the parser and the document object together.
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

# The aggregator collects the layout objects of each processed page.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interp = PDFPageInterpreter(rsrcmgr, device)

# Run layout analysis on the first page only.
pages = doc.get_pages()
first_page = pages.next()
interp.process_page(first_page)
layout = device.get_result()

# Keep the text boxes and pick the one that contains the tallest line.
textboxes = [i for i in layout if isinstance(i, LTTextBox)]
box_with_tallest_line = max(textboxes, key=lambda x: max(i.height for i in x))

# Collapse runs of whitespace and print the candidate title.
text = box_with_tallest_line.get_text()
print re.sub(r'\s+', ' ', text).strip()
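
The selection and normalization steps can also be tried without a PDF at hand. The following is only a sketch for Python 3: it assumes each text block has already been reduced to a list of (line height, line text) pairs. That representation and the name pick_title are made up for illustration and are not part of pdfminer.

```python
import re


def pick_title(blocks):
    """Return the normalized text of the block containing the tallest line.

    `blocks` is a hypothetical stand-in for pdfminer's text boxes: a list of
    blocks, each a non-empty list of (line_height, line_text) pairs.
    """
    if not blocks:
        return None
    # Rank each block by its tallest line and keep the winner.
    tallest = max(blocks, key=lambda block: max(h for h, _ in block))
    text = "\n".join(line for _, line in tallest)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up first page: a running header, a two-line title, and body text.
page = [
    [(9.0, "Journal of Examples, vol. 1")],
    [(17.2, "A Study of  "), (17.2, " Title Extraction")],
    [(10.0, "Abstract. We study titles."), (10.0, "More body text.")],
]
print(pick_title(page))
```

On real documents the tallest line is not always the title (a large journal logo or a drop cap can win), so the result is best treated as a candidate to review.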

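I'll leave renaming the file up to you (note that the title may contain characters you do not want in a file name, or that are not even valid in one). The documentation of pdfminer is rather sparse at the moment, so if you want to know more, you might want to ask on its mailing list. (I know little about pdfminer myself, but couldn't resist ;-) Alternatively, you could try a similar approach using another PDF library or another language.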
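
For the renaming step, a minimal sketch in Python 3 might look like the following. The helper names safe_filename and rename_to_title, the set of characters that gets replaced, and the 100-character limit are all assumptions chosen for illustration; adjust them to your file system.

```python
import os
import re


def safe_filename(title, max_length=100):
    """Turn an extracted title into a conservative file name stem."""
    # Replace characters that are invalid on Windows or awkward elsewhere,
    # as well as control characters.
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", title)
    # Collapse whitespace and drop leading/trailing spaces and dots.
    name = re.sub(r"\s+", " ", name).strip(" .")
    # Truncate, then fall back to a placeholder if nothing is left.
    return name[:max_length].rstrip(" .") or "untitled"


def rename_to_title(path, title):
    """Rename `path` to '<title>.pdf' in the same directory."""
    directory = os.path.dirname(path)
    target = os.path.join(directory, safe_filename(title) + ".pdf")
    # Refuse to overwrite: two documents may yield the same title.
    if os.path.exists(target):
        raise FileExistsError(target)
    os.rename(path, target)
    return target


print(safe_filename("Graphs: theory/practice?"))
```

Raising on an existing target is a deliberate choice here, since a wrongly guessed title should not silently replace another file.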

Answer 2

In Python there is pyPdf (Debian package: python-pypdf). Here is some code:

import sys
import pyPdf

filename = sys.argv[1]
# Read the document information dictionary and print its title entry.
i = pyPdf.PdfFileReader(open(filename, "rb"))
d = i.getDocumentInfo()
print d["/Title"]

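In my experience, few PDFs have the "/Title" attribute, but your mileage may vary. When it is missing, you have to infer the title from the content, which is bound to be error-prone. pyPdf may help you with that as well.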
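
Because the "/Title" entry is so often missing or empty, it may help to guard the lookup before using it. The sketch below is for Python 3 and uses a plain dict in place of the object a PDF library would return; the names title_or_none and choose_name, and the list of placeholder titles, are assumptions made for illustration.

```python
import os


def title_or_none(info):
    """Return a usable title from a metadata mapping, or None.

    `info` is any dict-like object keyed like a PDF document information
    dictionary (for example "/Title"); it may also be None.
    """
    if not info:
        return None
    title = info.get("/Title")
    if not isinstance(title, str):
        return None
    title = title.strip()
    # Assumption: some tools write placeholder titles; treat them as missing.
    if not title or title.lower() in {"untitled", "unknown"}:
        return None
    return title


def choose_name(path, info):
    """Prefer the metadata title; otherwise keep the current file stem."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return title_or_none(info) or stem


print(choose_name("/tmp/1234.pdf", {"/Title": " A Real Title "}))
```

When title_or_none returns None, that is the point at which a content-based guess would take over instead of the file stem.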

Answer 3

Try iText (Java). I found this example (you can add generics if they are supported):

PdfReader reader = new PdfReader("yourpdf.pdf");
HashMap map= reader.getInfo();
Set keys = map.keySet();
Iterator i = keys.iterator();

while(i.hasNext()) {
    String thiskey = (String)i.next();
    System.out.println(thiskey + ":" + (String)map.get(thiskey));
}

Answer 4

Another option for C++ is Poppler.

I tried to do something similar in the past (and was asking advice here: Extracting text from PDF with Poppler (C++) ) but never really got it working. At the end of the day I realised that at least for my use, it was easier to manually rename the files.

Answer 5

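The best solution I found for renaming PDF files using not the metadata but any text you want from inside the PDF file is A-PDF Rename. It worked very well for all the files I tried.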
