Character recognition with Tesseract

I am trying to work with the Tesseract API. I am also new to image processing and have been struggling with this for the last few days. With simple algorithms I achieved almost 70% accuracy.

I want to reach 90% accuracy. The problem with the images is that they are only 72 DPI. I also tried increasing the resolution, but did not get good results. The images I am trying to recognize are attached.

Any help would be appreciated, and I apologize if I am asking something very simple.

Image 1

Image 2

Image 3

EDIT

I forgot to mention that I need to do all the processing and recognition within 2-2.5 seconds on Linux, and the text-detection method mentioned in this answer takes too long. I would also prefer not to use a command-line solution; I would rather use Leptonica or OpenCV.

You can download more sample images here.

I tried the following to binarize the tickets, but with no luck.

The tickets have:

  • somewhat poor lighting
  • non-text areas
  • low resolution

Passing the image directly to the Tesseract API gives me about 70% accuracy in an average of 1 second. But I want to increase accuracy while keeping the time constraint in mind. So far I have tried:

  • edge detection
  • blob analysis
  • adaptive-threshold binarization of the ticket

I then passed these binarized images to Tesseract, but accuracy dropped to 50-60%, even though the binarized images look perfect.
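For reference, this is roughly how I feed an image to the API (a minimal sketch assuming Tesseract 3.x with Leptonica; the file name and language are placeholders):

```cpp
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

int main() {
    tesseract::TessBaseAPI api;
    // Initialize with English trained data; Init() returns non-zero on failure.
    if (api.Init(nullptr, "eng")) {
        std::cerr << "Could not initialize Tesseract.\n";
        return 1;
    }
    // "ticket.png" is a placeholder for one of the attached images.
    Pix *image = pixRead("ticket.png");
    if (!image) {
        std::cerr << "Could not read image.\n";
        return 1;
    }
    api.SetImage(image);
    char *text = api.GetUTF8Text();  // caller owns the returned buffer
    std::cout << text << std::endl;
    delete[] text;
    pixDestroy(&image);
    api.End();
    return 0;
}
```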

+2
c++ image-processing opencv ocr tesseract
4 answers

There are a few things you could try:

  • To improve accuracy, you need to improve the image quality for the OCR engine, which means preprocessing the images before submitting them to Tesseract. I suggest exploring OpenCV for this purpose.

  • The main problem with OCR engines is that they are not as good at character recognition as we are, so even things that are not text are sometimes mistakenly recognized as if they were. To prevent this, it is better to detect the text areas and send only those to Tesseract instead of the full image, as happens with image #2 (see the sketch after this list).

  • Another way to extract the text areas of an image is to apply this technique before isolating them.

  • Once you get results from Tesseract, you can improve them by comparing the recognized text against a dictionary.
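To illustrate the region-detection idea, here is a rough OpenCV sketch (my own illustration, not the exact method from the linked answer; the size filters are guesses you would have to tune per image): highlight character edges with a morphological gradient, binarize, merge characters into line blobs, and keep boxes that look like text.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Returns candidate text regions of a grayscale image as bounding boxes.
std::vector<cv::Rect> findTextRegions(const cv::Mat &gray) {
    cv::Mat grad, bw, connected;
    // A morphological gradient highlights character edges.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));
    cv::morphologyEx(gray, grad, cv::MORPH_GRADIENT, kernel);
    // Otsu binarization separates strokes from background.
    cv::threshold(grad, bw, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    // Horizontal closing merges characters on the same line into blobs.
    kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(9, 1));
    cv::morphologyEx(bw, connected, cv::MORPH_CLOSE, kernel);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(connected, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> regions;
    for (const auto &c : contours) {
        cv::Rect r = cv::boundingRect(c);
        // Heuristic filters: wider than tall and not tiny.
        if (r.width > 20 && r.height > 8 && r.width > r.height)
            regions.push_back(r);
    }
    return regions;
}
```

Each returned rectangle can then be handed to Tesseract by calling SetImage() followed by SetRectangle(r.x, r.y, r.width, r.height), so that only the text areas are recognized.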

+3

Some possible improvements:

  • Resolution must be at least 300 dpi.
  • Make the lighting more uniform. There are several dark areas that may affect the results.
  • Try enlarging the characters a little. Currently they vary in size, and some letters are even distorted.
  • Pre-process the image with thresholding and binarization.

You can do this with your own code, or Fred's ImageMagick Scripts can help.
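For the dark areas specifically, one possible pre-processing step (a sketch with OpenCV, assuming the background is lighter than the text) is to estimate the illumination with a heavy blur and divide it out before binarizing:

```cpp
#include <opencv2/opencv.hpp>

// Flatten uneven illumination in a grayscale image, then binarize.
cv::Mat normalizeAndBinarize(const cv::Mat &gray) {
    cv::Mat background, normalized, binary;
    // A large median blur wipes out the text and keeps the illumination field.
    cv::medianBlur(gray, background, 51);
    // Dividing by the estimated background evens out dark and bright regions.
    cv::divide(gray, background, normalized, 255.0);
    // Otsu now works on a roughly uniform image.
    cv::threshold(normalized, binary, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    return binary;
}
```

The kernel size (51 here) is a guess that should scale with the image resolution.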

+2

I'm not sure my post will be useful to you, since my answer is not about Tesseract. But it is about high accuracy, so I thought it might be interesting for you to see what a paid OCR SDK solution can do.

These are recognition results using the ABBYY Cloud OCR SDK without any additional settings.

[two images of the recognition results]

Disclaimer: I work for ABBYY.

0

You can try using ScanTailor ( http://scantailor.sourceforge.net/ ; there is also a CLI interface) to binarize, process, and deskew the images. Upscaling the images can also improve recognition, because Tesseract's recognition profiles are optimized for at least 300 DPI.

Another possibility is to train Tesseract on the font that is characteristic of your material (more on this here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 ).

I don't think a dictionary search will help here, because you have mostly numbers.
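If the content really is mostly digits (an assumption based on your description), a cheaper alternative to a dictionary is to restrict Tesseract's character set via the tessedit_char_whitelist variable:

```cpp
#include <tesseract/baseapi.h>

// Call after Init(); restricts recognition to digits and common separators.
// The exact separator set is a guess to adapt to your tickets.
void restrictToDigits(tesseract::TessBaseAPI &api) {
    api.SetVariable("tessedit_char_whitelist", "0123456789.,-/");
}
```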

0
