Question

I want to get a small image of every word in a large number of scanned books, which are in Persian (Arabic script). I have no experience in image processing.
How can I do that in the most efficient way?

Answer 1

I suggest writing a MATLAB script along these lines. First define two parameters:
a : half of the maximum distance between letters within a word (in pixels)
b : half of the minimum distance between words (in pixels)
(The method only works if a < b.)
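To see why a < b matters, here is a minimal sketch on a made-up one-row image. Dilating by a pixels on each side closes gaps of up to 2a pixels, so letter gaps are bridged while the wider word gaps survive. The row contents, the value of a, and the flat horizontal neighborhood ones(1, 2*a+1) are illustrative assumptions, not part of the method above.

```matlab
% Toy row: two "letters" 2 pixels apart, then a third mark 6 pixels away.
row = logical([1 0 0 1 0 0 0 0 0 0 1]);

a = 1;                                   % assumed half letter gap, in pixels
merged = imdilate(row, ones(1, 2*a+1));  % grow every mark by a pixels per side

CC = bwconncomp(merged);
disp(CC.NumObjects)                      % should report 2: one merged "word" plus the far mark
```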

Keep a copy of the scanned image of the page, then threshold it.

I = I < Th;   % assumes dark text on a light page: ink becomes 1, paper becomes 0

Choose Th by experimenting. You should get a binary image I with 1s where the letters are. Dilate the image.

I = imdilate(I, strel('disk', a));   % imdilate needs a structuring element, and its result must be assigned

This will connect the letters together.
Remove noise.

I = bwareaopen(I,n);

This removes all connected components with fewer than n pixels.
Do connected component analysis.

CC = bwconncomp(I);
Rect = regionprops(CC, 'BoundingBox');

This returns a list of bounding rectangles, each in the form [x y width height] and each containing a single word. For each rectangle, extract the corresponding sub-matrix from the original copy and write it to disk using imwrite().
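Putting the steps together, a minimal end-to-end sketch might look like the following. The file names and the values of Th, a and n are placeholders to be tuned on your own scans, and the code assumes an 8-bit scan with dark text on a light page. It uses imcrop for the sub-matrix extraction, which is one convenient option rather than the only one.

```matlab
% Hypothetical input file and parameter values; tune them by experiment.
orig = imread('page.png');              % untouched copy of the scanned page
if size(orig, 3) == 3
    gray = rgb2gray(orig);              % work on a single channel
else
    gray = orig;
end

Th = 128;   % threshold, assuming 8-bit pixel values
a  = 3;     % dilation radius in pixels (half the largest letter gap)
n  = 50;    % components smaller than this many pixels count as noise

I = gray < Th;                          % 1 where the ink is
I = imdilate(I, strel('disk', a));      % join the letters of each word
I = bwareaopen(I, n);                   % remove small specks

CC   = bwconncomp(I);
Rect = regionprops(CC, 'BoundingBox');  % one [x y width height] box per word

for k = 1:numel(Rect)
    word = imcrop(orig, Rect(k).BoundingBox);      % cut the word out of the original
    imwrite(word, sprintf('word_%05d.png', k));    % hypothetical naming scheme
end
```

For a whole book, the same steps can be wrapped in a loop over the page files.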
