Using Ruby And Ubuntu With Optical Character Recognition

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.

I can navigate to the site. How can I save each image to a file on my computer (running Ubuntu), convert the image with GOCR, and finally save the result to a file so I can access it again from my Ruby script?


GOCR seems to be a good choice at first, but from what I can tell from my own "research", its quality isn't quite sufficient for daily use. This could lead to problems, depending on the input images. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using some Google API (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
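Whichever OCR engine you end up with, the ISBN check digit gives you a cheap way to catch misreads before they reach Amazon. Below is a minimal sketch using only core Ruby; the method name `valid_isbn?` is my own invention for illustration, and it assumes the OCR output has already been trimmed down to one ISBN per string.

```ruby
# Return true if the string is a well-formed ISBN-10 or ISBN-13
# with a correct check digit. Hyphens and whitespace are ignored.
def valid_isbn?(raw)
  isbn = raw.upcase.gsub(/[\s-]/, "")
  case isbn
  when /\A\d{13}\z/
    # ISBN-13: weights alternate 1, 3; the total must be divisible by 10.
    sum = isbn.chars.each_with_index.sum { |c, i| c.to_i * (i.even? ? 1 : 3) }
    (sum % 10).zero?
  when /\A\d{9}[\dX]\z/
    # ISBN-10: weights run from 10 down to 1; a trailing X counts as 10;
    # the total must be divisible by 11.
    sum = isbn.chars.each_with_index.sum { |c, i| (c == "X" ? 10 : c.to_i) * (10 - i) }
    (sum % 11).zero?
  else
    false
  end
end
```

Anything that fails the check can be flagged for retyping by hand instead of being silently passed along.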

You could also use tesseract-ocr for the OCR part; it's also open source and in active development.
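Calling tesseract from Ruby could look roughly like this. It is only a sketch: it assumes the `tesseract` binary is installed and on your PATH, and that your version accepts `stdout` as the output base name (recent versions do, as far as I know). The method names are made up for the example.

```ruby
require "open3"

# Build the command line as an array so the image path is never
# interpreted by a shell. "stdout" asks tesseract to print the text
# instead of writing it to a file (an assumption about your version).
def tesseract_command(image_path)
  ["tesseract", image_path, "stdout"]
end

# Run tesseract on one image and return the recognized text.
def ocr_with_tesseract(image_path)
  output, status = Open3.capture2(*tesseract_command(image_path))
  raise "tesseract failed on #{image_path}" unless status.success?
  output.strip
end
```

With an image saved as, say, `isbn.png`, `ocr_with_tesseract("isbn.png")` would return whatever text tesseract recognized, which you can then clean up in Ruby.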

For the retrieval part, I would stick with hpricot as well; it's super-powerful and flexible.

Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.

This all can be run in the background:

  • download web page (net/http)
  • save metadata + image file for each book (paperclip)
  • run GOCR on all the images
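The steps above could be sketched along these lines. Treat it as a starting point, not a finished script: it assumes the `gocr` binary is installed and can read the image format your school serves (GOCR may need the netpbm tools for anything other than PNM), it writes files with plain `File.binwrite` rather than paperclip, and the file names (`isbn_0.png`, `isbns.txt`) are placeholders I picked.

```ruby
require "net/http"
require "open3"
require "uri"

# Download one image and write it to disk in binary mode.
def download_image(url, path)
  response = Net::HTTP.get_response(URI(url))
  raise "download failed (#{response.code}): #{url}" unless response.is_a?(Net::HTTPSuccess)
  File.binwrite(path, response.body)
  path
end

# Run GOCR on an image file and return the raw recognized text.
def run_gocr(path)
  output, status = Open3.capture2("gocr", "-i", path)
  raise "gocr failed on #{path}" unless status.success?
  output
end

# Keep only the characters that can appear in an ISBN (digits and X).
def clean_isbn(text)
  text.upcase.gsub(/[^0-9X]/, "")
end

# Usage: ruby isbn_ocr.rb IMAGE_URL [IMAGE_URL ...]
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  File.open("isbns.txt", "w") do |out|
    ARGV.each_with_index do |url, i|
      path = download_image(url, "isbn_#{i}.png")
      out.puts clean_isbn(run_gocr(path))
    end
  end
end
```

The result is one ISBN per line in `isbns.txt`, which a later Ruby script can read back with `File.readlines`.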

All you need is a list of URLs or a crawler (mechanize), and then you probably need to spend a few minutes writing a parser (see Joe's post) for the university HTML pages.
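To give an idea of what the parser has to do, here is a dependency-free sketch that pulls image URLs out of a page. It uses a regular expression as a crude stand-in for a real HTML parser such as hpricot, so it will break on unusual markup. The `/isbn/i` filter and the example host are pure guesses, since I have not seen how your school's site names its images.

```ruby
require "uri"

# Return absolute URLs for every <img> tag in the page whose src
# matches the given pattern. Relative paths are resolved against
# the URL the page was fetched from.
def image_urls(html, base_url, pattern = /isbn/i)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i)
      .flatten
      .select { |src| src.match?(pattern) }
      .map { |src| URI.join(base_url, src).to_s }
end

# Example with made-up markup and a made-up host:
page = '<td><img src="/images/isbn/123.png" alt=""></td><img src="logo.gif">'
puts image_urls(page, "http://books.example.edu/course/1")
```

The URLs it returns are what you would feed into the download and OCR steps.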

