Question

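I am running a loop over several multi-page PDFs to extract multiple tables. The problem is that when a PDF has image-based content on page 1 or 2 and the tables only start on page 2 or 3, the loop stops and I get the messages below. This happens when I run the loop over multiple PDFs.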

Anaconda3\lib\site-packages\camelot\parsers\lattice.py:411: UserWarning: page-1 is image-based, camelot only works on text-based pages.

Anaconda3\lib\site-packages\camelot\parsers\lattice.py:411: UserWarning: page-2 is image-based, camelot only works on text-based pages.

I would like a solution that lets me skip any image-based page and continue the loop with the next page.

Answer 1

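The only approach that worked for me was to try extracting text from each page of the PDF with PyMuPDF. If a page yields no text, I treat it as image-based and skip it.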

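import fitz  # PyMuPDF

def is_page_image_based(pdf_path, page_number):
    # page_number is 0-based here (PyMuPDF), while camelot counts pages from 1.
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    text = page.get_text()
    doc.close()
    # A page with no extractable text is treated as image-based.
    return not text.strip()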
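To connect this back to the loop in the question, here is a rough sketch of how the same text check could drive camelot so that it only sees text-based pages. The names `text_page_spec` and `extract_tables` are my own and hypothetical. I am assuming the `lattice` flavor because the warnings come from `lattice.py`. Treat it as a starting point, not a tested solution for your files.

```python
def text_page_spec(page_has_text):
    """Build a camelot `pages` string (1-based) from per-page text flags.

    Hypothetical helper: page_has_text[i] is True when page i+1 has
    extractable text.
    """
    return ",".join(
        str(number)
        for number, has_text in enumerate(page_has_text, start=1)
        if has_text
    )


def extract_tables(pdf_paths):
    """Hypothetical driver: run camelot only on pages that contain text."""
    # Imported here so the helper above works without these packages.
    import camelot
    import fitz  # PyMuPDF

    tables_by_pdf = {}
    for path in pdf_paths:
        # Open each PDF once and record which pages have extractable text.
        with fitz.open(path) as doc:
            flags = [bool(page.get_text().strip()) for page in doc]

        pages = text_page_spec(flags)
        if not pages:
            # Every page is image-based, so there is nothing to parse.
            continue

        # Assumption: lattice flavor, as suggested by the warning's source file.
        tables_by_pdf[path] = camelot.read_pdf(path, pages=pages, flavor="lattice")
    return tables_by_pdf
```

Something like `extract_tables(["a.pdf", "b.pdf"])` would then return a dict mapping each path to the tables camelot found. Opening each PDF once per file, instead of once per page as in the function above, should also be a bit faster on long documents.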
