How can I extract the background color of a table cell within a PDF file using Python?

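I first tried the tabula, PyPDF2 and tika modules, but none of them seems able to detect the background color of a table cell in a PDF file.

These colored cells carry important information in my problem. As for tabula, I know that tabula-py is a wrapper around tabula-java, which does not provide cell color information. Is there an easily accessible solution out there?

Thanks in advance.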

Answers

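Disclaimer: I am the author of the library borb, which is used in this answer.

About PDF: PDF is not a "what you see is what you get" format; it is a container of rendering instructions. In practice, a table is just a collection of content-rendering instructions that draw something we humans perceive as a table. Something like: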

  • go to location x, y
  • set the current stroke colour to black
  • set the current fill colour to blue
  • set the font to Helvetica, size 12
  • draw a line to x, y
  • move the pen up
  • go to x, y
  • render the string "Hello World"
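To make the instruction list above concrete, here is a minimal sketch of how a reader could replay such instructions while tracking the current fill colour. The operator names (rg, re, f, S, n) are real PDF operators, but the (operator, operands) tuple encoding, the FilledRect class and the collect_filled_rects function are made up for illustration and are not part of borb or any other library.

```python
from dataclasses import dataclass


@dataclass
class FilledRect:
    x: float
    y: float
    width: float
    height: float
    fill_rgb: tuple  # fill colour that was active when the rectangle was painted


def collect_filled_rects(instructions):
    """Replay (operator, operands) pairs and record every filled rectangle."""
    current_fill = (0.0, 0.0, 0.0)  # the initial fill colour in PDF is black
    pending = []  # rectangles added to the current path, not yet painted
    filled = []
    for op, operands in instructions:
        if op == "rg":  # set the non-stroking (fill) colour, RGB in 0..1
            current_fill = tuple(operands)
        elif op == "re":  # append a rectangle (x, y, width, height) to the path
            pending.append(tuple(operands))
        elif op == "f":  # fill the current path with the current fill colour
            filled.extend(FilledRect(*r, current_fill) for r in pending)
            pending = []
        elif op in ("S", "n"):  # stroke only, or discard the path: nothing is filled
            pending = []
    return filled


stream = [
    ("rg", (0.0, 0.0, 1.0)),      # fill colour: blue
    ("re", (50, 700, 100, 20)),   # a cell-sized rectangle
    ("f", ()),                    # paint it -> a blue background
    ("re", (150, 700, 100, 20)),  # another rectangle
    ("S", ()),                    # only its outline is drawn
]
rects = collect_filled_rects(stream)
```

The point of the sketch is that a background colour is never stored "on the cell". It only exists as the fill colour that happened to be active when a rectangle was painted, so any extraction tool has to reconstruct it from state like this.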

Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
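As a rough illustration of that kind of heuristic (not the algorithm any particular library uses), the sketch below counts crossings between horizontal and vertical line segments and flags a region as table-like once there are enough of them. The segment encoding, the function names, the tolerance and the min_crossings threshold are all assumptions chosen for the example.

```python
def count_right_angle_crossings(horizontals, verticals, tol=0.5):
    """Count crossings between horizontal (x0, x1, y) and vertical (x, y0, y1) segments."""
    count = 0
    for hx0, hx1, hy in horizontals:
        for vx, vy0, vy1 in verticals:
            # The segments cross if the vertical's x lies within the horizontal's
            # span and the horizontal's y lies within the vertical's span.
            if hx0 - tol <= vx <= hx1 + tol and vy0 - tol <= hy <= vy1 + tol:
                count += 1
    return count


def looks_like_table(horizontals, verticals, min_crossings=4):
    """Hypothetical rule of thumb: many perpendicular crossings suggest a table."""
    return count_right_angle_crossings(horizontals, verticals) >= min_crossings


# A 2x2 grid: three horizontal and three vertical rules.
grid_h = [(0, 20, 0), (0, 20, 10), (0, 20, 20)]
grid_v = [(0, 0, 20), (10, 0, 20), (20, 0, 20)]
crossings = count_right_angle_crossings(grid_h, grid_v)
```

A rule like this will miss tables drawn without ruling lines and may fire on unrelated line art, which is why extracted tables should be treated as a best guess.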

I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.

You would use it like this:

import typing

from borb.pdf.canvas.layout.table.table import Table, TableCell
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

# placeholder path: point this at your own PDF
input_file = "/path/to/target.pdf"

doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None
tables: typing.List[Table] = l.get_tables_for_page(0)

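Currently, this class does not keep track of stroke/fill colours. But you could easily subclass it and modify it accordingly.

To do that, I would start by looking at …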

I found a solution using pdfplumber. Here is some rough sample code.

from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


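def cmyk_to_rgb(cmyk: tuple[float, float, float, float]):
    # Naive (profile-free) CMYK -> RGB: each channel is
    # 255 * (1 - ink) * (1 - black), which stays within 0..255 for inputs in 0..1.
    c, m, y, k = cmyk
    r = 255 * (1.0 - c) * (1.0 - k)
    g = 255 * (1.0 - m) * (1.0 - k)
    b = 255 * (1.0 - y) * (1.0 - k)
    return r, g, b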


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
tables: list[Table] = page.find_tables()

# get RGB color of first(= top-left) cell of first table
print(get_cell_color(tables[0].rows[0].cells[0], page)) # => (r, g, b)
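One caveat with the sample above: it treats every fill colour as CMYK. As far as I know, pdfplumber reports non_stroking_color in whatever colour space the PDF used, so the value may have one component (gray), three (RGB) or four (CMYK), and may be a bare number or None in some versions. The helper below is a minimal sketch under that assumption; color_to_rgb is a hypothetical name, not a pdfplumber function.

```python
def color_to_rgb(color):
    """Normalise a 1-, 3- or 4-component colour (values in 0..1) to 0..255 RGB."""
    if color is None:
        return None
    if isinstance(color, (int, float)):  # a bare gray level
        color = (color,)
    parts = tuple(float(v) for v in color)
    if len(parts) == 1:  # DeviceGray: same value on every channel
        r = g = b = parts[0]
    elif len(parts) == 3:  # DeviceRGB: already what we want
        r, g, b = parts
    elif len(parts) == 4:  # DeviceCMYK: naive, profile-free conversion
        c, m, y, k = parts
        r, g, b = (1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        raise ValueError(f"unexpected number of colour components: {len(parts)}")
    # Clamp to 0..1 before scaling, in case the PDF contains out-of-range values.
    return tuple(round(255 * min(1.0, max(0.0, v))) for v in (r, g, b))
```

It could be used in place of cmyk_to_rgb, for example color_to_rgb(rect["non_stroking_color"]). The CMYK branch ignores ICC profiles, so the result is only an approximation of what a viewer displays.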

A kind user reported that my previous solution did not work well.
That's true, because pdfplumber's page.rects does not always detect table cells correctly.
Sometimes it only detects lines, rows or columns.
So I propose another solution here.

from collections import Counter

import pdfplumber


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most frequent RGB pixel value inside the cell's bounding box."""
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert("RGB").getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """example method: print colored cells information"""
    page_image = page.to_image().original
    tables = page.find_tables()
    
    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:  # pdfplumber uses None for a missing cell in a row
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
demo(page)
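Two things can skew the pixel count in the approach above: the cell's border lines, and a mismatch between PDF points and image pixels when the page is rendered at a resolution other than the default 72. The sketch below shrinks the cell box slightly and scales it to pixel coordinates. cell_to_pixel_box and its inset parameter are hypothetical, and the scale factor assumes that page.to_image(resolution=r) renders at r / 72 pixels per point.

```python
import math


def cell_to_pixel_box(cell, resolution=72, inset=2.0):
    """Convert a cell bbox in PDF points to an inset pixel box, or None if too small."""
    scale = resolution / 72  # assumed pixels per PDF point
    x0, top, x1, bottom = cell
    # Move every edge inward so border strokes are not counted as background.
    left = (x0 + inset) * scale
    upper = (top + inset) * scale
    right = (x1 - inset) * scale
    lower = (bottom - inset) * scale
    # Round toward the inside of the cell so the box never grows past the inset.
    box = (math.ceil(left), math.ceil(upper), math.floor(right), math.floor(lower))
    if box[2] <= box[0] or box[3] <= box[1]:
        return None  # cell too small to sample after insetting
    return box


example_box = cell_to_pixel_box((10, 20, 110, 40), resolution=144)
```

The resulting box could then be passed to something like page.to_image(resolution=144).original.crop(example_box) before counting pixels. A higher resolution gives more pixels per cell, which makes the most-common-colour vote less sensitive to text inside the cell.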



