Question

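I have tried the tabula, PyPDF2 and tika modules, but none of them seems able to detect the background color of a table cell in a PDF file.

In my problem, these colored cells carry important information. For instance, I know that tabula-py is a wrapper around tabula-java, which does not provide cell color information. Is there an easy solution out there?

Thanks in advance.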

Answer 1

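Disclaimer: I am the author of the library borb, which is used in this answer.

About PDF: PDF is not a "what you see is what you get" format, because it is a container of rendering instructions. This means that a table is really just a collection of content-rendering instructions that draw what we humans perceive as a table. Something like: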

go to location x, y
set the current stroke colour to black
set the current fill colour to blue
set the font to Helvetica, size 12
draw a line to x, y
move the pen up
go to x, y
render the string "Hello World"

Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
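
To make the "container of instructions" idea concrete, here is a toy sketch in plain Python. It is not borb's API or any real PDF parser: the instruction names (set_fill, rect, fill) and the function filled_rectangles are made up for illustration. It only shows why a cell's background colour has to be recovered by tracking the current fill colour while replaying the instructions.

```python
from typing import List, Tuple

Color = Tuple[float, ...]
Rect = Tuple[float, ...]  # x, y, width, height


def filled_rectangles(instructions) -> List[Tuple[Rect, Color]]:
    """Replay toy drawing instructions and pair each filled
    rectangle with the fill colour that was active when it was painted."""
    fill: Color = (0.0, 0.0, 0.0)  # assumption: start with black, like PDF's default
    pending: List[Rect] = []       # rectangles added to the path but not yet painted
    result: List[Tuple[Rect, Color]] = []
    for op, *args in instructions:
        if op == "set_fill":
            fill = tuple(args)
        elif op == "rect":
            pending.append(tuple(args))
        elif op == "fill":
            result.extend((rect, fill) for rect in pending)
            pending.clear()
    return result


# hypothetical instruction stream: one blue cell, then one white cell
instructions = [
    ("set_fill", 0.0, 0.0, 1.0),
    ("rect", 10, 10, 100, 20),
    ("fill",),
    ("set_fill", 1.0, 1.0, 1.0),
    ("rect", 10, 30, 100, 20),
    ("fill",),
]
cells = filled_rectangles(instructions)
for rect, colour in cells:
    print(rect, colour)
```

The rectangles themselves carry no colour; it only exists as state at the moment the fill instruction runs, which is why a table detector has to record it explicitly.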

I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.

You would use it like this:

import typing

from borb.pdf.canvas.layout.table.table import Table, TableCell
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

# placeholder path; point this at your own PDF
input_file: str = "/path/to/target.pdf"

# pass the listener to the parser so it receives the rendering instructions
doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None

# tables detected on the first page
tables: typing.List[Table] = l.get_tables_for_page(0)

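Currently, this class does not keep track of the stroke/fill colour. However, you could easily subclass it and modify it accordingly.

That is where I would start.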

Answer 2

I found a solution using pdfplumber. Here is some rough sample code.

from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


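def cmyk_to_rgb(cmyk: tuple[float, float, float, float]):
    # CMYK components are in 0..1; clamp each sum to 1.0 so that dark
    # colors (e.g. c + k > 1) do not produce negative RGB values
    r = 255 * (1.0 - min(1.0, cmyk[0] + cmyk[3]))
    g = 255 * (1.0 - min(1.0, cmyk[1] + cmyk[3]))
    b = 255 * (1.0 - min(1.0, cmyk[2] + cmyk[3]))
    return r, g, b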


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
tables: list[Table] = page.find_tables()

# get RGB color of first(= top-left) cell of first table
print(get_cell_color(tables[0].rows[0].cells[0], page)) # => (r, g, b)
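
One caveat with the sample above: it treats every fill color as CMYK, but a PDF can just as well use grayscale or RGB fills. Below is a hedged sketch of a more tolerant converter. The function name pdf_color_to_rgb is mine, and it assumes the color arrives as None or as a sequence of 1, 3 or 4 floats in the range 0..1, which is how I understand non_stroking_color to be reported; check this against your pdfplumber version.

```python
from typing import Optional, Sequence, Tuple


def pdf_color_to_rgb(color: Optional[Sequence[float]]) -> Tuple[int, int, int]:
    """Convert a gray, RGB or CMYK color (components in 0..1) to 8-bit RGB.

    Assumption: a missing color is treated as white, i.e. "no background".
    """
    if not color:
        return (255, 255, 255)
    if len(color) == 1:            # grayscale
        rgb = (color[0], color[0], color[0])
    elif len(color) == 3:          # RGB
        rgb = tuple(color)
    elif len(color) == 4:          # CMYK, simple clamp-based conversion
        c, m, y, k = color
        rgb = (1.0 - min(1.0, c + k),
               1.0 - min(1.0, m + k),
               1.0 - min(1.0, y + k))
    else:
        raise ValueError(f"unexpected number of color components: {len(color)}")
    # clamp to 0..1 before scaling, in case of slightly out-of-range input
    return tuple(round(255 * max(0.0, min(1.0, v))) for v in rgb)


print(pdf_color_to_rgb((1.0, 0.0, 0.0)))       # an RGB fill
print(pdf_color_to_rgb((0.0, 0.0, 0.0, 1.0)))  # a CMYK fill
```

In get_cell_color you could then call pdf_color_to_rgb(rect["non_stroking_color"]) instead of cmyk_to_rgb(...).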

Answer 3

A kind user reported that my previous solution did not work well.
That's true, because pdfplumber's page.rects does not always detect the cells of a table correctly.
Sometimes it only detects lines, rows or columns.
So I propose another solution here.

from collections import Counter

import pdfplumber


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most frequent RGB pixel value inside the cell's bounding box."""
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert("RGB").getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """Example: print color, location and content of every non-white cell."""
    # rendered at pdfplumber's default resolution, so pixels line up with PDF points
    page_image = page.to_image().original
    tables = page.find_tables()

    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:  # no cell detected at this position
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
demo(page)
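
Two things can skew the most-common-pixel approach: the border lines around a cell, and rendering the page at a resolution other than the default 72 DPI, in which case pixel coordinates no longer equal PDF points. Here is a small sketch that handles both. The helper name cell_to_pixel_box and its inset parameter are my own, and it assumes the page's bounding box starts at (0, 0).

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # x0, top, x1, bottom in PDF points


def cell_to_pixel_box(cell: BBox, resolution: int = 72,
                      inset: float = 2.0) -> Tuple[int, int, int, int]:
    """Shrink a cell box by `inset` points on every side, then convert
    it to pixel coordinates for an image rendered at `resolution` DPI."""
    x0, top, x1, bottom = cell
    # never shrink past the center of a very small cell
    dx = min(inset, (x1 - x0) / 2)
    dy = min(inset, (bottom - top) / 2)
    scale = resolution / 72  # PDF points are 1/72 inch
    return (
        round((x0 + dx) * scale),
        round((top + dy) * scale),
        round((x1 - dx) * scale),
        round((bottom - dy) * scale),
    )


print(cell_to_pixel_box((10, 20, 110, 40)))
print(cell_to_pixel_box((10, 20, 110, 40), resolution=144))
```

With this, the crop in get_cell_color could become image.crop(cell_to_pixel_box(cell, resolution)), where the image comes from page.to_image(resolution=resolution).original.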
