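I tried the tabula, PyPDF2, and tika modules, but none of them seems able to detect the background color of a table cell in a PDF file.
These colored cells carry important information for my problem. For example, I know that tabula-py is a wrapper around tabula-java, which does not provide cell color information. Is there an easy solution?
Thanks in advance.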
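Disclaimer: I am the author of the library borb, which is used in this answer.
About PDF: PDF is not so much a "what you see is what you get" format as a container of rendering instructions. This means that a table is in fact just a collection of rendering instructions that draw something we humans perceive as a table. Something like: move to a point, set the stroke color, draw a line to another point, and so on.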
Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on some assumptions, such as "tables tend to have a large number of lines that intersect at 90-degree angles".
I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.
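To make the "lines that intersect at 90-degree angles" heuristic concrete, here is a minimal sketch of how cell boxes could be derived from a set of axis-aligned line segments. It is not borb's implementation; the names `split_segments` and `grid_cells` are made up for illustration, and it assumes every horizontal ruling crosses every vertical one (no merged cells).

```python
def split_segments(segments, tol=0.5):
    """Separate axis-aligned segments (x0, y0, x1, y1) into horizontal and vertical."""
    horizontal = [s for s in segments if abs(s[1] - s[3]) <= tol]
    vertical = [s for s in segments if abs(s[0] - s[2]) <= tol]
    return horizontal, vertical


def grid_cells(segments, tol=0.5):
    """Return cell boxes (x0, y0, x1, y1) spanned by the distinct line positions.

    Simplification (an assumption of this sketch): every horizontal line is
    taken to cross every vertical line, so merged cells are not handled.
    """
    horizontal, vertical = split_segments(segments, tol)
    ys = sorted({round(s[1], 1) for s in horizontal})
    xs = sorted({round(s[0], 1) for s in vertical})
    return [
        (x0, y0, x1, y1)
        for (y0, y1) in zip(ys, ys[1:])
        for (x0, x1) in zip(xs, xs[1:])
    ]


# a 2 x 2 grid: three horizontal and three vertical rulings
segments = [
    (0.0, 0.0, 20.0, 0.0), (0.0, 10.0, 20.0, 10.0), (0.0, 20.0, 20.0, 20.0),
    (0.0, 0.0, 0.0, 20.0), (10.0, 0.0, 10.0, 20.0), (20.0, 0.0, 20.0, 20.0),
]
cells = grid_cells(segments)
```

A real detector also has to cope with rulings drawn as thin filled rectangles, partial lines and tolerance issues, which is why the result should be treated as a heuristic.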
You would use it like this:
import typing

from borb.pdf.canvas.layout.table.table import Table
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

# placeholder: set input_file to the path of the PDF you want to inspect
input_file = "/path/to/target.pdf"

doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    # the listener gathers the rendering instructions while the PDF is parsed
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None

# tables detected on the first page (page index 0)
tables: typing.List[Table] = l.get_tables_for_page(0)
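Currently, this class does not keep track of the stroke/fill color. But you could easily subclass it and modify it accordingly; that is where I would start.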
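To illustrate what "keeping track of the fill color" means at the level of rendering instructions, here is a toy sketch that walks a hand-written content stream and records the fill color in effect for each filled rectangle. It does not use borb's listener API; `filled_rectangles` is a hypothetical helper, and it only understands the PDF operators `rg`, `g`, `re`, `f`, `f*`, `S` and `n`.

```python
def filled_rectangles(content_stream: str):
    """Collect (x, y, width, height, fill_rgb) for each filled rectangle.

    Toy interpreter: tokens are split on whitespace, numbers are operands,
    everything else is treated as an operator.
    """
    fill = (0.0, 0.0, 0.0)  # the initial fill color in PDF is black
    operands = []
    pending = []  # rectangles appended to the current path
    results = []
    for token in content_stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so it is an operator
        if token == "rg":  # set fill color (RGB)
            fill = tuple(operands[-3:])
        elif token == "g":  # set fill color (gray)
            fill = (operands[-1],) * 3
        elif token == "re":  # append rectangle x y w h to the path
            pending.append(tuple(operands[-4:]))
        elif token in ("f", "f*"):  # fill the current path
            results.extend((*rect, fill) for rect in pending)
            pending = []
        elif token in ("S", "n"):  # path stroked or discarded, not filled
            pending = []
        operands = []
    return results


stream = """
1 1 0 rg
50 700 100 20 re
f
0 0 0 RG
50 700 100 20 re
S
"""
rects = filled_rectangles(stream)
```

Here the yellow fill and the black outline are two separate instructions for the same rectangle, which is why a detector that only looks at lines never sees the color.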
I found a solution using pdfplumber. Here is rough sample code.
from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


def cmyk_to_rgb(cmyk: tuple[float, float, float, float]) -> tuple[float, float, float]:
    # naive CMYK -> RGB conversion; components are expected in the range 0..1
    c, m, y, k = cmyk
    r = 255 * (1.0 - c) * (1.0 - k)
    g = 255 * (1.0 - m) * (1.0 - k)
    b = 255 * (1.0 - y) * (1.0 - k)
    return r, g, b


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]) -> bool:
    # True if the cell lies completely inside the rectangle
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    # assumes the fill color is stored as CMYK; cells without a rect are treated as white
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    tables: list[Table] = page.find_tables()
    # get the RGB color of the first (top-left) cell of the first table
    print(get_cell_color(tables[0].rows[0].cells[0], page))  # => (r, g, b)
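One caveat with the code above: it treats the fill color as CMYK, but PDF fill colors can also be grayscale or RGB. A hedged sketch of a more forgiving conversion follows. The helper `to_rgb` is hypothetical, and it assumes the color arrives as a single number or a tuple of 1, 3 or 4 components in the range 0..1, which may not hold for every PDF or pdfplumber version.

```python
def to_rgb(color):
    """Convert a gray, RGB or CMYK color (components in 0..1) to 0..255 RGB.

    Returns None when no color is given. The component count is used to
    guess the color space, which is an assumption of this sketch.
    """
    if color is None:
        return None
    if isinstance(color, (int, float)):
        color = (color,)
    if len(color) == 1:  # grayscale
        r = g = b = color[0]
    elif len(color) == 3:  # RGB
        r, g, b = color
    elif len(color) == 4:  # CMYK, naive conversion
        c, m, y, k = color
        r, g, b = (1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        raise ValueError(f"unsupported color: {color!r}")
    return tuple(round(255 * v) for v in (r, g, b))


yellow = to_rgb((1, 1, 0))
black_gray = to_rgb((0,))
red_cmyk = to_rgb((0, 1, 1, 0))
```

Swapping this in for cmyk_to_rgb would let get_cell_color handle the other color spaces as well; patterns and spot colors would still need separate handling.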
A kind user reported that my previous solution did not work well.
That's true, because pdfplumber's page.rects does not always detect the cells of a table correctly. Sometimes it only detects lines, rows, or columns.
So I propose another solution here.
from collections import Counter

import pdfplumber


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most common RGB pixel value inside the cell's bounding box."""
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert("RGB").getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """example method: print colored cells information"""
    # render at 72 dpi so that one pixel corresponds to one PDF point
    page_image = page.to_image(resolution=72).original
    tables = page.find_tables()
    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:  # no cell at this grid position
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    demo(page)
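The pixel-sampling approach relies on the cell boxes (in PDF points) lining up with image pixels, which holds when the page image is rendered at 72 dpi (as far as I know, pdfplumber's default). If you render at a higher resolution for better color sampling, the boxes need to be scaled first. A small sketch, with `scale_bbox` and `dominant_color` as hypothetical helpers that work on plain tuples so they can be tried without a PDF:

```python
from collections import Counter


def scale_bbox(bbox, resolution, pdf_dpi=72):
    """Convert a bbox in PDF points to pixel coordinates at `resolution` dpi."""
    factor = resolution / pdf_dpi
    return tuple(round(v * factor) for v in bbox)


def dominant_color(pixels):
    """Return the most frequent pixel value in an iterable of RGB tuples."""
    return Counter(pixels).most_common(1)[0][0]


# a cell box in points, mapped onto an image rendered at 144 dpi
pixel_box = scale_bbox((10, 20, 30, 40), resolution=144)

# a mostly yellow cell with a few dark text pixels
sample = [(255, 255, 0)] * 3 + [(0, 0, 0)]
color = dominant_color(sample)
```

The scaled box can then be passed to image.crop in place of the raw cell. Note that taking the most common pixel assumes the background covers more area than the text inside the cell.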