Create a training image for Tesseract OCR

I am writing a training image generator for Tesseract OCR.

When creating a training image for a new font for Tesseract OCR, what are the best values for:

  • DPI
  • Font size in points
  • Whether the font should be anti-aliased (smoothed) or not
  • Whether the bounding rectangles should fit the characters snugly or not
3 answers

I found the answer to the fourth question: "If the bounding rectangles fit snugly."

Fitting the rectangles as tightly as possible gives much better results.

For the rest, 12 pt and 300 dpi will be enough, as suggested by @Yaroslav. I think it is better to turn anti-aliasing off.
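As a rough illustration of "fitting the rectangles tightly", the sketch below computes the smallest rectangle that encloses the ink pixels of one glyph and formats it as a box-file line (symbol, left, bottom, right, top, page, with y measured from the bottom of the image). The helper names `tight_box` and `box_file_line` and the tiny sample bitmap are made up for this example; a real generator would take the pixels from its rendered page.

```python
def tight_box(bitmap):
    """Return (left, top, right, bottom) around the ink pixels.

    `bitmap` is a list of rows, each a list of 0 (background) or 1 (ink).
    `right` and `bottom` are exclusive. Returns None if there is no ink.
    """
    if not bitmap:
        return None
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    cols = [x for x in range(len(bitmap[0])) if any(row[x] for row in bitmap)]
    if not rows or not cols:
        return None
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def box_file_line(char, box, image_height, page=0):
    """Format one box-file line: symbol left bottom right top page.

    Box files measure y from the bottom of the image, so the top-down
    row indices are flipped here.
    """
    left, top, right, bottom = box
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


# Hypothetical 6x5 glyph with a one-pixel empty border around the ink.
glyph = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = tight_box(glyph)
line = box_file_line("c", box, image_height=len(glyph))
print(line)
```

Deriving the box from the pixels themselves, rather than from font metrics, keeps the rectangle snug regardless of side bearings or padding added by the renderer.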


Question 2 is more or less answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images There is no need to train with multiple sizes; 10 point will do. (The exception to this is very small text. If you want to recognize text with an x-height of less than about 15 pixels, you should either train for it specifically or scale the images up before trying to recognize them.)
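To relate point sizes to the 15-pixel x-height threshold, here is a small arithmetic sketch. It relies on one point being 1/72 inch; the x-height ratio of 0.5 is only an assumed placeholder, since the real ratio depends on the font, and the function names are invented for this example.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_size_px(points, dpi):
    """Height of the font's em square in pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def estimated_x_height_px(points, dpi, x_height_ratio=0.5):
    """Rough x-height in pixels; x_height_ratio is an assumption, not a font fact."""
    return font_size_px(points, dpi) * x_height_ratio


def upscale_factor(x_height_px, minimum_px=15):
    """Factor by which to scale an image so its x-height reaches minimum_px."""
    return max(1.0, minimum_px / x_height_px)


for points, dpi in [(10, 300), (12, 300), (10, 150), (10, 72)]:
    x_height = estimated_x_height_px(points, dpi)
    print(f"{points} pt at {dpi} dpi: em {font_size_px(points, dpi):.1f} px, "
          f"x-height about {x_height:.1f} px, "
          f"upscale x{upscale_factor(x_height):.2f}")
```

Under that assumed ratio, 10 pt at 300 dpi gives an x-height of roughly 21 pixels, comfortably above the threshold, while the same text rendered at 72 dpi would need to be scaled up before recognition.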

Questions 1 and 3: from experience, I have had success with fonts rendered at 300 dpi without anti-aliasing. In particular, I used the following convert options on the training PDF, which produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif 

However, when I later tried to add a font to Tesseract, it recognized the characters correctly when I used an image with a resolution of 150 dpi. So I don't think there is a general answer; it depends on the kind of font you are trying to add.
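Since the best resolution seems to depend on the font, a generator could render the same training PDF at more than one density and compare the results. The sketch below only builds the argument list for the convert command quoted above, with the density made a parameter; the file names are hypothetical, and `render` assumes ImageMagick's `convert` is on the PATH.

```python
import subprocess


def convert_args(input_pdf, output_tif, density=300):
    """Argument list for the convert call above, with a configurable density."""
    return [
        "convert",
        "-density", str(density),
        "-depth", "8",
        input_pdf,
        "-background", "white",
        "-flatten",
        "+matte",
        "-compress", "none",
        "-monochrome",
        output_tif,
    ]


def render(input_pdf, output_tif, density=300):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(convert_args(input_pdf, output_tif, density), check=True)


# Print the commands for the two resolutions mentioned in this answer.
for density in (300, 150):
    print(" ".join(convert_args("font.pdf", f"font.{density}dpi.tif", density)))
```

Passing the arguments as a list avoids shell quoting problems if the file names contain spaces.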


A good tool for training Tesseract: http://vietocr.sourceforge.net/training.html

It has several advantages:

  • the bounding box around each letter can be edited through a graphical interface
  • it automatically creates all the required files
  • it automatically merges all the files (freq-dawg, word-dawg, user-words (may be an empty file), inttemp, normproto, pffmtable, unicharset, DangAmbigs (may be an empty file)) into a single eng.traineddata file
  • new training data can be used with an existing Tesseract eng.traineddata file
