Python OCR: Convert Scanned Image to Text for Processing

Question

I am trying to build an application in Python for grading answer sheets (multiple-choice questions). Each answer sheet will be scanned into an image file (gif, png, or jpg, whichever format is needed).

My application has access to a database where all answers are stored.

So all I need is to extract some data from the scanned image, so that the application can compare the answers and calculate the marks.

The answer sheet has a fixed size and a tabular format like this (candidates mark their chosen answers with an “X”):

[Image: sample answer sheet table]

After searching the Internet, I found that there are several OCR APIs.

The first is Pytesser. It is very easy to use, and the results are acceptable, but it only works on images with clean, clear text. So I don't think it is suitable.

The second one I found is OCRopus. It seems powerful, but its documentation says:

Windows
OCRopus relies heavily on POSIX path names and file systems. You may be able to install OCRopus on Windows. An easier way is to install VirtualBox and run OCRopus on Ubuntu under VirtualBox.

So I think it is mainly for Linux. I could not find a detailed installation guide for the Windows platform. (I'm new to this, so I might be wrong.)

The third one I found is python-tesseract, a wrapper for Tesseract OCR. An installation guide is provided on its page. Basically, I need:

python-tesseract-win32.deb
python opencv
Numpy

but I do not know how to install .deb files on Windows. I already have opencv and numpy.

So my questions are:

(1) How can I convert an image of a table into data I can process (is this even possible)?

(2) Are there any other OCR APIs that I haven't mentioned here that might be useful?

(3) Finally (my stupid idea): is it possible to split the image into small pieces (based on the size of the table cells, since the table dimensions are known) using PIL, then use Pytesser to convert each small image to text, and then process the data accordingly?

FYI: I only need this for the Windows platform, probably Windows XP 32-bit. I am using Python 2.7.5.

+7

python python-2.7 ocr tesseract python-imaging-library

Chris aung Nov 20 '13 at 12:15

1 answer

Paul · Accepted Answer · 2013-11-20T13:02:31+0000

My answers are numbered to match your questions.

1) OCR in general is very difficult, but (good news for you) for processing test scores, I think it is almost a solved problem. Accordingly, there are tried and true solutions to this kind of problem. School systems have been automating the grading of Scantron tests for years, so if you have access to such resources, that route may be your best bet. At the very least, you should look into how they do it.

2) I'm sure there are others, but those are the main free ones that I know about.

3a) If you are trying to do this on a budget and time is short, I think your “stupid” idea is not really stupid. It may be the best way to do this, and it is likely that Scantron graders use a similar method. You know the exact dimensions of the test form, so you can work out a direct pixel mapping of where to look. You can use pytesser very easily. Keep in mind that pytesser sometimes requires you to resize the image (sometimes up, sometimes down) to get maximum accuracy.
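As a rough sketch of that pixel mapping, the snippet below computes one crop box per table cell from the known geometry. The function names, the offsets, and the cell sizes are all illustrative assumptions, not values from the question. The boxes use the (left, upper, right, lower) layout that PIL's `Image.crop` accepts.

```python
def cell_boxes(left, top, cell_w, cell_h, n_rows, n_cols):
    """Return one (left, upper, right, lower) pixel box per table cell.

    left/top are the pixel offsets of the table's top-left corner and
    cell_w/cell_h are the cell dimensions; all are assumed to be known
    in advance because the answer sheet has a fixed layout.
    """
    boxes = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            x0 = left + c * cell_w
            y0 = top + r * cell_h
            row.append((x0, y0, x0 + cell_w, y0 + cell_h))
        boxes.append(row)
    return boxes


def crop_cells(image, boxes, scale=1):
    """Crop every box out of a PIL image, optionally resizing each cell.

    `image` is expected to be a PIL Image; `scale` is a hypothetical
    integer factor for enlarging cells before handing them to OCR.
    """
    cells = []
    for row in boxes:
        cropped_row = []
        for box in row:
            cell = image.crop(box)
            if scale != 1:
                w, h = cell.size
                cell = cell.resize((w * scale, h * scale))
            cropped_row.append(cell)
        cells.append(cropped_row)
    return cells
```

With PIL installed, this might be used as `crop_cells(Image.open("sheet.png"), cell_boxes(40, 120, 60, 30, 20, 5), scale=2)`, where the file name and every number are placeholders to be measured from a real scan. In practice a scanned sheet may be slightly shifted or rotated, so some alignment step would likely be needed first.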

3b) You might want to consider rolling your own solution. You can use morphological operations (numpy and other image libraries can do this almost out of the box). You may not even need those operators: simply apply a binary threshold to each table row (assuming you have already cropped the image into table rows), look at the blobs, and take the marked answer to be the column with the most blob pixels.
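A minimal sketch of the threshold idea, assuming each table row has already been cropped and converted to a 2D grid of grayscale values (0 = black, 255 = white). It uses plain nested lists so it runs without extra libraries; the same logic applies to a numpy array. The function names, the threshold of 128, and the `min_dark` cutoff are assumptions chosen for illustration.

```python
def dark_pixel_counts(row_pixels, n_cols, threshold=128):
    """Count dark (below-threshold) pixels in each of n_cols equal-width columns."""
    width = len(row_pixels[0])
    counts = [0] * n_cols
    for line in row_pixels:
        for x, value in enumerate(line):
            if value < threshold:
                # Map the pixel's x position to the column it falls in.
                counts[x * n_cols // width] += 1
    return counts


def marked_column(row_pixels, n_cols, threshold=128, min_dark=1):
    """Return the index of the column with the most dark pixels.

    Returns None when no column reaches min_dark dark pixels, which is
    treated here as an unanswered row.
    """
    counts = dark_pixel_counts(row_pixels, n_cols, threshold)
    best = max(range(n_cols), key=lambda c: counts[c])
    return best if counts[best] >= min_dark else None
```

On a real scan, the printed grid lines and scanner noise also count as dark pixels, so `min_dark` would need to be set well above that baseline, or the cell borders trimmed before counting.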
