Character recognition with Tesseract

I am trying to work with the Tesseract API. I am also new to image processing and have been struggling with this for the last few days. With simple algorithms I achieved almost 70% accuracy.

I want to reach 90% accuracy. The problem with the images is that they are only 72 DPI. I also tried increasing the resolution, but did not get good results. The images I am trying to recognize are attached.

Any help would be appreciated, and I apologize if I am asking something very simple.

Image 1

Image 2

Image 3

EDIT

I forgot to mention that I need to do all the processing and recognition within 2-2.5 seconds on Linux, and the text-detection method mentioned in this answer takes too long. I would also prefer not to use a command-line solution; I would rather use Leptonica or OpenCV.

You can download more sample images here.

I tried the following to binarize the tickets, but with no luck.

The tickets have:

  • somewhat poor lighting
  • non-text areas
  • low resolution

Passing the image directly to the Tesseract API gives me about 70% accuracy in an average of 1 second. But I want to increase accuracy while keeping the time constraint in mind. So far I have tried:

  • edge detection
  • blob analysis
  • adaptive-threshold binarization of the ticket

I then passed these binarized images to Tesseract, but accuracy dropped to 50-60%, even though the binarized images look perfect.
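For reference, this is roughly how I feed an image to the API (a minimal sketch assuming Tesseract 3.x with Leptonica; the file name and language are placeholders):

```cpp
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

int main() {
    tesseract::TessBaseAPI api;
    // Initialize with English trained data; Init() returns non-zero on failure.
    if (api.Init(nullptr, "eng")) {
        std::cerr << "Could not initialize Tesseract.\n";
        return 1;
    }
    // "ticket.png" is a placeholder for one of the attached images.
    Pix *image = pixRead("ticket.png");
    if (!image) {
        std::cerr << "Could not read image.\n";
        return 1;
    }
    api.SetImage(image);
    char *text = api.GetUTF8Text();  // caller owns the returned buffer
    std::cout << text << std::endl;
    delete[] text;
    pixDestroy(&image);
    api.End();
    return 0;
}
```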

+2
c++ image-processing opencv ocr tesseract
4 answers

There are a few things you could try:

  • To improve accuracy, you need to improve the image quality for the OCR engine, which means preprocessing the images before submitting them to Tesseract. I suggest exploring OpenCV for this purpose.

  • The main problem with OCR engines is that they are not as good at character recognition as we are, so even things that are not text are sometimes mistakenly recognized as if they were. To prevent this, it is better to detect the text areas and send only those to Tesseract instead of the full image, as happens with image #2 (see the sketch after this list).

  • Another way to extract the text areas of an image is to apply this technique before isolating them.

  • Once you get results from Tesseract, you can improve them by comparing the recognized text against a dictionary.
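To illustrate the region-detection idea, here is a rough OpenCV sketch (my own illustration, not the exact method from the linked answer; the size filters are guesses you would have to tune per image): highlight character edges with a morphological gradient, binarize, merge characters into line blobs, and keep boxes that look like text.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Returns candidate text regions of a grayscale image as bounding boxes.
std::vector<cv::Rect> findTextRegions(const cv::Mat &gray) {
    cv::Mat grad, bw, connected;
    // A morphological gradient highlights character edges.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));
    cv::morphologyEx(gray, grad, cv::MORPH_GRADIENT, kernel);
    // Otsu binarization separates strokes from background.
    cv::threshold(grad, bw, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    // Horizontal closing merges characters on the same line into blobs.
    kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(9, 1));
    cv::morphologyEx(bw, connected, cv::MORPH_CLOSE, kernel);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(connected, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> regions;
    for (const auto &c : contours) {
        cv::Rect r = cv::boundingRect(c);
        // Heuristic filters: wider than tall and not tiny.
        if (r.width > 20 && r.height > 8 && r.width > r.height)
            regions.push_back(r);
    }
    return regions;
}
```

Each returned rectangle can then be handed to Tesseract by calling SetImage() followed by SetRectangle(r.x, r.y, r.width, r.height), so that only the text areas are recognized.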

+3

Some possible improvements:

  • Resolution must be at least 300 dpi.
  • Make the lighting more uniform. There are several dark areas that may affect the results.
  • Try enlarging the characters a little. Currently they vary in size, and some letters are even distorted.
  • Pre-process the image with thresholding and binarization.

You can do this with your own code, or Fred's ImageMagick Scripts can help.
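For the dark areas specifically, one possible pre-processing step (a sketch with OpenCV, assuming the background is lighter than the text) is to estimate the illumination with a heavy blur and divide it out before binarizing:

```cpp
#include <opencv2/opencv.hpp>

// Flatten uneven illumination in a grayscale image, then binarize.
cv::Mat normalizeAndBinarize(const cv::Mat &gray) {
    cv::Mat background, normalized, binary;
    // A large median blur wipes out the text and keeps the illumination field.
    cv::medianBlur(gray, background, 51);
    // Dividing by the estimated background evens out dark and bright regions.
    cv::divide(gray, background, normalized, 255.0);
    // Otsu now works on a roughly uniform image.
    cv::threshold(normalized, binary, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    return binary;
}
```

The kernel size (51 here) is a guess that should scale with the image resolution.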

+2

I'm not sure my post will be useful to you, since my answer is not about Tesseract. But it is about high accuracy, so I thought it might be interesting for you to see what a paid OCR SDK solution can do.

These are recognition results using the ABBYY Cloud OCR SDK without any additional settings.

[two images of the recognition results]

Disclaimer: I work for ABBYY.

0

You can try using ScanTailor ( http://scantailor.sourceforge.net/ ; there is also a CLI interface) to binarize, process, and deskew the images. Upscaling the images can also improve recognition, because Tesseract's recognition profiles are optimized for at least 300 DPI.

Another possibility is to train Tesseract on the font that is characteristic of your material (more on this here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 ).

I don't think a dictionary search will help here, because you have mostly numbers.
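If the content really is mostly digits (an assumption based on your description), a cheaper alternative to a dictionary is to restrict Tesseract's character set via the tessedit_char_whitelist variable:

```cpp
#include <tesseract/baseapi.h>

// Call after Init(); restricts recognition to digits and common separators.
// The exact separator set is a guess to adapt to your tickets.
void restrictToDigits(tesseract::TessBaseAPI &api) {
    api.SetVariable("tessedit_char_whitelist", "0123456789.,-/");
}
```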

0
