Question

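I am looking for a way to classify scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within them. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details: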

I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
It's unclear to me how best to extract features from these images for training. E.g., what are those features?
The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
I'd like a cross-platform solution, ideally in pure Python or using common libraries.
I've thought about using OpenCV, but this seems like a "heavyweight" solution.

EDIT:

The "A" and "B" pages differ in that the "B" pages have forms on them with the same general structure, including the presence of a bar code. The "A" pages are free text.

Answer 1

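I will answer in three parts since your problem is clearly a large one. If the collection does not exceed 1,000 pages, I would strongly recommend the manual method with cheap labour instead.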

Part 1: Feature Extraction - You have a very large array of features to choose from in the object detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners etc. suitable. Deciding which features to use can require expert knowledge, so if you have the computing power I would recommend building a melting pot of features and letting a trained classifier estimate the importance of each one.

Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is very simple to grasp and it is highly flexible and non-parametric. Tuning requires very few parameters and you can also run it in a parameter selection mode during supervised training.

Part 3: Implementation - Python in essence is a glue language. Pure python implementations for image processing are never going to be very fast. I recommend using a combination of OpenCV for feature detection and R for statistical work and classifiers.

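The solution may seem over-engineered, but machine learning has never been a simple task, even when the only difference between the pages is that they are the left-hand and right-hand pages of a book.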

Answer 2

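First, I would like to say that, in my opinion, OpenCV is a very good tool for this kind of manipulation. Moreover, it has a well-documented Python interface.

OpenCV is highly optimized, and your problem is not an easy one.

[GLOBAL EDIT: reorganization of my ideas]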

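Here are a few features that could be used:

To detect the bar code, you should maybe try a distance transform (DistTransform in OpenCV, cv2.distanceTransform in the Python bindings) if the bar code is isolated. Maybe you can then find interest points with match or matchShapes. I think this is feasible because the bar codes should all have the same shape (size, etc.). The score of the interest points could be used as a feature.
The moments of the image could be useful here because you have different kinds of global structure. They may be sufficient to distinguish A and B pages (OpenCV has a function for this). By the way, you get invariant descriptors this way :)
You should maybe try computing the vertical gradient and the horizontal gradient. A bar code is a specific place where the vertical gradient == 0 and the horizontal gradient != 0. The main advantage is the low cost of these operations, since your goal is only to check whether such a zone exists on the page. You can find the zone of interest and use its score as a feature.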
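The gradient idea above can be sketched in pure Python. This is a minimal illustration, not a tuned detector: it assumes the page is already loaded as a list of rows of grayscale values (0 to 255) and that the bars are upright. The function names and the window size are hypothetical choices made for this sketch.

```python
def gradient_energies(img):
    """Return (horizontal, vertical) mean absolute gradients of a grayscale image.

    img is a list of rows of pixel intensities.
    """
    h, w = len(img), len(img[0])
    # Horizontal gradient: difference between neighbouring columns.
    gx = sum(abs(img[y][x + 1] - img[y][x]) for y in range(h) for x in range(w - 1))
    # Vertical gradient: difference between neighbouring rows.
    gy = sum(abs(img[y + 1][x] - img[y][x]) for y in range(h - 1) for x in range(w))
    return gx / (h * (w - 1)), gy / ((h - 1) * w)


def barcode_score(img, win=8):
    """Best window score: strong horizontal gradient, weak vertical gradient."""
    h, w = len(img), len(img[0])
    best = 0.0
    for top in range(0, h - win + 1, win):
        for left in range(0, w - win + 1, win):
            patch = [row[left:left + win] for row in img[top:top + win]]
            gx, gy = gradient_energies(patch)
            best = max(best, gx - gy)
    return best


# Synthetic examples: vertical stripes (bar-code-like) versus a checkerboard.
stripes = [[255 if x % 2 == 0 else 0 for x in range(16)] for _ in range(16)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(16)] for y in range(16)]
print(barcode_score(stripes), barcode_score(checker))  # 255.0 0.0
```

On a slightly rotated scan the vertical gradient inside a bar code is no longer exactly zero, so in practice the score would be compared against a threshold learned from the labelled pages rather than tested for equality.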

Once you have your features, you can try supervised learning and test how well it generalizes. Your problem requires very few false negatives (because you are going to throw away some pages), so you should evaluate performance with ROC curves and look carefully at the sensitivity (which should be high). For the classification, you could use regression with a lasso penalty to find the best features. whatnick's post also gives good ideas and other descriptors (maybe more general).
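To make the ROC and sensitivity point concrete, here is a small pure-Python sketch. It assumes label 1 marks the class you cannot afford to miss and that a higher score means "more likely positive". The function names, the example scores, and the 0.95 target are all made up for illustration.

```python
def roc_points(labels, scores):
    """Return (threshold, false_positive_rate, sensitivity) tuples.

    Thresholds are visited from highest to lowest, so sensitivity
    never decreases along the returned list.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def threshold_for_sensitivity(labels, scores, target=0.95):
    """Highest threshold whose sensitivity reaches the target, or None."""
    for threshold, fpr, sens in roc_points(labels, scores):
        if sens >= target:
            return threshold, fpr, sens
    return None


# Illustrative data only: four positive pages, four negative pages.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.3, 0.2, 0.1]
print(threshold_for_sensitivity(labels, scores))  # (0.35, 0.25, 1.0)
```

Picking the operating point this way fixes the sensitivity first and then accepts whatever false positive rate comes with it, which matches a setting where missed pages are costly.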

Answer 3

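So you want to distinguish two kinds of pages using specific elements: basically, the presence of a bar code. There are two steps:

Feature extraction (computer vision): find interest points or lines that are characteristic of bar codes and not of text.
Binary classification (statistical learning): determine whether a bar code is present, based on the extracted features.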

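For the first step, you should definitely take a look at the Hough transform. It is well suited to identifying lines in an image and could be useful for bar code detection; OpenCV provides an implementation.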
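For readers who want to see the mechanics, below is a minimal pure-Python sketch of a Hough accumulator. It assumes the foreground (dark) pixels have already been extracted as (x, y) coordinates. The function names are hypothetical, and the line count only looks at exactly vertical lines (theta = 0), which is a simplification.

```python
import math


def hough_accumulator(points, n_theta=180):
    """Vote in (rho, theta) space for every foreground pixel.

    A line is parameterised as rho = x*cos(theta) + y*sin(theta).
    Returns a dict mapping (rho, theta_index) to a vote count, where
    theta = pi * theta_index / n_theta.
    """
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return acc


def count_vertical_lines(acc, min_votes):
    """Count accumulator cells at theta index 0 with enough votes."""
    return sum(1 for (rho, t), votes in acc.items() if t == 0 and votes >= min_votes)


# Three vertical bars, each 20 pixels tall, as in a crude bar code.
bars = [(x, y) for x in (3, 6, 9) for y in range(20)]
n_lines = count_vertical_lines(hough_accumulator(bars), min_votes=20)
print(n_lines)  # 3
```

The number of strong parallel lines could then serve as one feature for the classifier. For slightly rotated pages you would accept a small range of theta values around 0 instead of a single index, and a library routine such as OpenCV's cv2.HoughLines is likely to be much faster than this loop.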

For the second step, the most useful classifiers would be:

k nearest neighbours
logistic regression
random forest (really well implemented in R, but I do not know about Python)
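As an illustration of the simplest option in this list, here is a pure-Python k-nearest-neighbours sketch. The two-component feature vectors (think of a normalised bar code score and a normalised line count) and the training values are invented for the example. It assumes Python 3.8 or later for math.dist.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training vectors."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features per page, scaled to [0, 1].
train_X = [(0.10, 0.00), (0.20, 0.10), (0.15, 0.05),
           (0.90, 0.80), (0.85, 0.90), (0.95, 0.70)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (0.80, 0.75)))  # B
print(knn_predict(train_X, train_y, (0.05, 0.10)))  # A
```

Because the vote depends on Euclidean distance, the features should be on comparable scales before they are passed in.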

Answer 4

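You can try training a model by uploading your training data for A and B to demo.nanonets.ai (free to use).

1) Upload your training data here:

demo.nanonets.ai

2) Then query the API using the following Python code: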

import urllib.request

import requests

model_name = "Enter-Your-Model-Name-Here"
image_url = "https://cdn.pixabay.com/photo/2012/04/24/12/13/letter-39694_960_720.png"
# Download the page image and send its bytes as the "uploadfile" form field.
files = {"uploadfile": urllib.request.urlopen(image_url).read()}
api_url = "http://demo.nanonets.ai/classify/?appId=" + model_name
r = requests.post(api_url, files=files)
# The service answers with JSON; decode and print it.
print(r.json())

3) The response looks like:

{
  "message": "Model trained",
  "result": [
    {
      "label": "A",
      "probability": 0.97
    },
    {
      "label": "B",
      "probability": 0.03
    }
  ]
}
