OCR: How to improve accuracy - existing libraries for removing non-text furniture, shapes, etc. to avoid confusing OCR?

I want to remove rectangles and similar shapes that enclose text in a screenshot image, so that I can perform optical character recognition to get accurate text from the screenshot.

Background:

I am doing this to extract data from a legacy application for use with other applications. This is the only way to get at this data, as the associated files are in a closed, proprietary, binary format.

I will be using AutoItScript to drive the application to show data in its UI, then I will screenshot this and feed this to tesseract.

I've already had some success in automating the UI, and have been able to use tesseract to get plain ASCII text out of the bitmap.

There are several AutoItScript forum articles discussing its use with tesseract/OCR, but none that specifically address my question. http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2

What I need to do

There are thin, 1-pixel-wide rectangles that closely enclose some of the text. When the image is fed to tesseract, it reads the rectangle edges as characters: a vertical edge, for example, is recognised as the letter I.

Any thoughts on how to remove the rectangles, or best practices?

I'm asking if there is a generic command-line toolset to overwrite rectangles in, for example, .png files. I could then pass the .png through this, then pass it to tesseract.

Details on the tesseract release/setup I've used are as follows:

Go here: http://code.google.com/p/tesseract-ocr/downloads/list - For the basic English generic character set to get Tesseract up and running and recognising your bitmapped text into ASCII text, use tesseract-2.00.eng.tar.gz (current version at time of writing is: "English language data for Tesseract (2.00 and up) Jul 2007 989 KB 84845")

Related questions I have already looked at on Stack Overflow

In these, my question is not completely answered or a commercial solution is being sold. I do not want to consider a commercial solution at this stage.

Best answer

There's probably not going to be a free off-the-shelf solution for this, but coding your own shouldn't be too hard, since it's probably safe to assume that a rectangle will never be a valid character in your font's alphabet and can therefore be removed safely. It also helps that all your rectangle borders are exactly one pixel wide.

So search for a contiguous horizontal line that is joined to another, parallel line of the same length by exactly two vertical lines. Repeat the search until you have found all the rectangles in the image, then render them all transparent with Graphics.DrawRectangle and Pens.Transparent. (Under the default CompositingMode.SourceOver a transparent pen leaves the existing pixels unchanged, so either set Graphics.CompositingMode to SourceCopy first or draw in the background colour instead.) Don't render a rectangle transparent until you've finished searching, or you risk wiping out parts of overlapping rectangles before you've found them. This is just a starter suggestion; I haven't implemented or debugged this algorithm.
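
The search described above could be sketched roughly as follows. This is an untested-in-production illustration in Python rather than the .NET Graphics calls mentioned in the answer, and it works on a plain grid of 0/1 values (1 = dark pixel) instead of a real bitmap. The names find_rectangles, _find_bottom and erase_rectangles, and the min_w / min_h size thresholds, are hypothetical and only exist for this example. The thresholds are an assumption added so that box-shaped glyphs in small bitmap fonts are not mistaken for rectangles.

```python
def find_rectangles(grid, min_w=8, min_h=8):
    """Return (x1, y1, x2, y2) for each 1-pixel-wide rectangle outline.

    grid is a list of rows; each cell is 1 for a dark pixel, 0 for background.
    min_w and min_h are assumed thresholds, tune them for your screenshots.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    found = []
    for y in range(height):
        x = 0
        while x < width:
            if not grid[y][x]:
                x += 1
                continue
            # Measure the maximal horizontal run of dark pixels starting at x.
            x2 = x
            while x2 + 1 < width and grid[y][x2 + 1]:
                x2 += 1
            if x2 - x + 1 >= min_w:
                bottom = _find_bottom(grid, x, x2, y, min_h)
                if bottom is not None:
                    found.append((x, y, x2, bottom))
            x = x2 + 1
    return found


def _find_bottom(grid, x1, x2, top, min_h):
    """Follow both vertical edges down from a top edge to a closing row."""
    if top + 1 >= len(grid) or all(grid[top + 1][x1:x2 + 1]):
        return None  # no room below, or the border is thicker than one pixel
    for y in range(top + 1, len(grid)):
        if not (grid[y][x1] and grid[y][x2]):
            return None  # a vertical edge ended before the box closed
        if y - top + 1 >= min_h and all(grid[y][x1:x2 + 1]):
            return y  # parallel line of the same length: rectangle closed
    return None


def erase_rectangles(grid, rects, background=0):
    """Overwrite the border pixels of every rectangle, after the full search."""
    for x1, y1, x2, y2 in rects:
        for x in range(x1, x2 + 1):
            grid[y1][x] = background
            grid[y2][x] = background
        for y in range(y1, y2 + 1):
            grid[y][x1] = background
            grid[y][x2] = background
```

All rectangles are collected before any are erased, as the answer recommends. Converting a .png to and from such a grid is not shown; it would need an image library and a threshold for what counts as a dark pixel, both of which depend on the screenshots. The sketch also assumes that nothing touches the rectangle's top edge horizontally, so borders that run into other lines would need extra handling.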
