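I am looking for a way to classify scanned pages that consist largely of text.
Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.
More details: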
- I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
- It's unclear to me how best to extract features from these images for training. E.g., what are those features?
- The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
- I'd like a cross-platform solution, ideally in pure Python or using common libraries.
- I've thought about using OpenCV, but this seems like a "heavyweight" solution.
Edit:
- The "A" and "B" pages differ in that the "B" pages have forms on them with the same general structure, including the presence of a bar code. The "A" pages are free text.
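To make the setup concrete, here is a rough sketch of the shape of pipeline I have in mind. Everything in it is an assumption for illustration: the function names are made up, pages are assumed to be already decoded into 2D lists of grayscale values, and the "feature" (fraction of dark pixels per cell of a coarse grid) and the nearest-centroid classifier are placeholders, not something I know to work well. It also does nothing about rotation or scaling, which is part of what I'm asking about.

```python
# Hypothetical sketch: each page is a 2D list of grayscale values
# (0 = black, 255 = white). Decoding the scans is out of scope here.

def ink_density_grid(page, rows=8, cols=8, threshold=128):
    """Return the fraction of dark pixels in each cell of a rows x cols grid."""
    height, width = len(page), len(page[0])
    features = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [page[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            dark = sum(1 for value in cell if value < threshold)
            features.append(dark / len(cell) if cell else 0.0)
    return features


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train(pages_a, pages_b, **grid_options):
    """Build one mean feature vector per class from labeled example pages."""
    return {
        "A": centroid([ink_density_grid(p, **grid_options) for p in pages_a]),
        "B": centroid([ink_density_grid(p, **grid_options) for p in pages_b]),
    }


def classify(model, page, **grid_options):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    features = ink_density_grid(page, **grid_options)

    def squared_distance(label):
        return sum((f - m) ** 2 for f, m in zip(features, model[label]))

    return min(model, key=squared_distance)
```

Usage would be along the lines of `model = train(a_pages, b_pages, rows=8, cols=8)` followed by `classify(model, new_page, rows=8, cols=8)`. What I don't know is which features should replace `ink_density_grid` so that the fixed form layout and bar code on "B" pages are picked up reliably.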