Introduction to OCR

Someone handed me a large pile of data: 200 MB of .tiff images of scanned ads dating back to the 40s. I want to digitize it, but I don't know anything about OCR. Some of the earliest material is barely human-readable, let alone machine-readable. It is also in Hebrew.

I am looking for advice on how to approach this. Good suggestions for books, articles, code libraries, or software are welcome (all of which should be freely available online). I know C++ and Python and can pick up a different language if necessary.

Thanks.

+5
1 answer

This sounds like a great task for Python with an OCR library. A quick Google search turns up pytesser:

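PyTesser is an Optical Character Recognition module for Python. It takes an image or image file as input and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided so that Python programmers get OCR working out of the box.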

...

>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
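Since the scans in the question are in Hebrew, here is a rough sketch of the same idea without pytesser: call the Tesseract command-line tool directly and pass a language code. This assumes the `tesseract` executable is on your PATH and that its Hebrew language data (`heb`) is installed. The helper names `build_tesseract_command` and `ocr_image` are made up for illustration and are not part of pytesser or Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="heb"):
    """Return the argument list for one Tesseract CLI run.

    Tesseract is invoked as: tesseract <image> <output base> -l <language>.
    "heb" is assumed here because the scans are in Hebrew.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="heb"):
    """OCR one image and return the recognized text.

    Returns None when the tesseract executable cannot be found.
    """
    if shutil.which("tesseract") is None:
        return None

    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base on its own.
    output_base = image_path.with_suffix("")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,  # raise if Tesseract reports an error
    )
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image('fnord.tif')` would write `fnord.txt` next to the image and return its contents. For the barely readable early scans, expect to check the output by hand; this sketch does no image cleanup.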
+5
