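I don't think this is really possible. The LaTeX source is no longer in the PDF. If there is no title in the metadata, you might still be able to extract the title from the structure information if the PDF is a "tagged PDF". However, most PDFs aren't, especially those that probably don't provide metadata either.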
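Checking the metadata first is cheap, so here is a minimal sketch of that step. It assumes the current pdfminer.six package (where `PDFDocument` lives in `pdfminer.pdfdocument`); the function names `decode_pdf_string` and `metadata_title` are made up for this illustration, and the Latin-1 fallback is only an approximation of the PDF default text encoding:

```python
def decode_pdf_string(raw):
    """Decode a PDF text string: UTF-16BE if it starts with a BOM, else Latin-1."""
    if isinstance(raw, str):
        return raw
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 as a rough stand-in for PDFDocEncoding.
    return raw.decode("latin-1")


def metadata_title(path):
    """Return the Title entry of the document info dictionary, or None."""
    # Imported lazily so the decoding helper works without pdfminer.six.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        for info in doc.info:  # usually zero or one info dictionary
            raw = resolve1(info.get("Title"))
            if isinstance(raw, (bytes, str)):
                title = decode_pdf_string(raw).strip()
                if title:
                    return title
    return None
```

If `metadata_title` returns `None` (or something useless like the name of the tool that produced the file), you are back to guessing from the layout.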
This leaves you with layout analysis: try to work out which text is the title by looking at layout characteristics of the document. For Python, you might want to have a look at pdfminer.
The following example uses pdfminer to determine the title using a rather simplistic approach:
- we assume that the title is somewhere on the first page
- we leave it to pdfminer to recognize "blocks of text" on the first page
- we assume that the title is printed "bigger" than the rest of the page. Looking at the height of each line in the text blocks, we determine which block contains the "tallest" line, and assume that that block contains the title
- we let pdfminer extract the text from the block
- the text will probably contain newlines (placed by pdfminer), because the title might span more than one line, and other needless whitespace, so we do some simple whitespace normalization (replace consecutive whitespace with a single space, and strip leading and trailing whitespace), and that's it!
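As I said: this approach is rather simplistic and may or may not give good results for your documents, but it might point you in the right direction. Here it is: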
import sys
import re

# These imports match the pdfminer release this example was written for;
# the module layout may differ in newer releases.
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox

filename = sys.argv[1]
fp = open(filename, 'rb')

# Parse the file and connect the parser and the document to each other.
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

# Set up an interpreter whose device performs layout analysis.
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interp = PDFPageInterpreter(rsrcmgr, device)

# Analyze the first page only.
pages = doc.get_pages()
first_page = next(pages)
interp.process_page(first_page)
layout = device.get_result()

# Pick the text block that contains the tallest line.
textboxes = [i for i in layout if isinstance(i, LTTextBox)]
box_with_tallest_line = max(textboxes, key=lambda x: max(i.height for i in x))

# Normalize whitespace and print the guessed title.
text = box_with_tallest_line.get_text()
print(re.sub(r'\s+', ' ', text).strip())
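In the current pdfminer.six package, `PDFDocument` is no longer importable from `pdfminer.pdfparser`, so the script above needs adapting there. Here is a rough sketch of the same heuristic, assuming pdfminer.six and its `extract_pages` helper; `pick_title` and `guess_title` are names I made up for this sketch, not part of the library:

```python
import re
import sys


def pick_title(textboxes):
    """Return the normalized text of the box that contains the tallest line."""
    textboxes = list(textboxes)
    if not textboxes:
        return None
    tallest = max(
        textboxes,
        key=lambda box: max((line.height for line in box), default=0),
    )
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", tallest.get_text()).strip()


def guess_title(path):
    """Guess the title from the first page of the PDF at `path`."""
    # Imported lazily so pick_title can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox

    for layout in extract_pages(path, maxpages=1):
        return pick_title(item for item in layout if isinstance(item, LTTextBox))
    return None  # the document has no pages


if __name__ == "__main__" and len(sys.argv) > 1:
    print(guess_title(sys.argv[1]))
```

Keeping the selection logic in `pick_title` separate from the parsing makes it easy to try other heuristics (for example, only looking at the top half of the page) without touching the pdfminer part.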
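I'll leave the renaming of the file to you (note that the title might contain characters you don't want, or that aren't even valid in filenames). The documentation of pdfminer is rather sparse at the moment, so if you need to know more you might want to post on its mailing list. (I've only just got to know it myself, but couldn't resist ;-) Alternatively, you might try a similar approach using other PDF libraries or other languages.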
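For the renaming part, one possible starting point is to turn the guessed title into something safe before using it as a filename. This is only a sketch: `title_to_filename`, the set of replaced characters, and the length limit are my own choices, not anything pdfminer provides.

```python
import re


def title_to_filename(title, extension=".pdf", max_length=100):
    """Turn a guessed title into a conservative filename."""
    # Collapse whitespace first so newlines and tabs become plain spaces.
    cleaned = re.sub(r"\s+", " ", title)
    # Replace characters that are invalid on common filesystems, plus control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", cleaned)
    # Leading/trailing spaces and dots cause trouble on some systems.
    cleaned = cleaned.strip(" .")[:max_length].rstrip(" .")
    return (cleaned or "untitled") + extension
```

You would then pass the result to `os.rename`, ideally after checking that the target name doesn't already exist.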