Create a training image for Tesseract OCR

Question

I am writing an image training generator for Tesseract OCR.

When creating a training image for a new font for Tesseract OCR, what are the best values for:

DPI
Font size in points
Whether the font is anti-aliased (smoothed) or not
Whether the bounding rectangles should fit snugly around each glyph or not

+6

ocr tesseract

sashoalm Nov 16 '12 at 10:04

3 answers

Question 2 is partly answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images The wiki says there is no need to train with multiple sizes; 10 point will do. (The exception is very small text. If you want to recognize text with an x-height of less than about 15 pixels, you should either train it specifically or scale the images before trying to recognize them.)
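As a rough check against that 15-pixel guideline, here is a minimal sketch that converts a point size and DPI into pixels. The only hard fact used is that one point is 1/72 inch; the `x_height_ratio` of 0.5 is an assumption (real fonts vary), and the function names are made up for illustration.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def em_height_px(point_size, dpi):
    """Height of the font's em square in pixels at the given resolution."""
    return point_size * dpi / POINTS_PER_INCH


def x_height_px(point_size, dpi, x_height_ratio=0.5):
    """Approximate x-height in pixels.

    x_height_ratio is an assumed, font-dependent fraction of the em height.
    """
    return em_height_px(point_size, dpi) * x_height_ratio


for dpi in (150, 300):
    px = x_height_px(10, dpi)
    verdict = "ok" if px >= 15 else "below the ~15 px guideline"
    print(f"10 pt at {dpi} dpi: x-height ~ {px:.1f} px ({verdict})")
```

Under that assumed ratio, 10 point text clears the guideline at 300 dpi but not at 150 dpi, so measure the actual x-height of your font before relying on these numbers.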

Questions 1 and 3: from experience, I have successfully trained fonts using 300 dpi images without anti-aliasing. In particular, I used the following ImageMagick convert options on the training PDF, which produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif

However, when I later tried adding a font to Tesseract, it recognized the characters correctly when I used a 150 dpi image. So I don’t think there is a general solution; it depends on the kind of font you are trying to add.
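Since the right resolution seems to depend on the font, it can help to make the density a parameter. The following is only a sketch that wraps the same convert options in Python; the function names and the file names in the usage comment are hypothetical, and it assumes ImageMagick's `convert` is on the PATH.

```python
import subprocess


def build_convert_command(input_pdf, output_tif, density=300):
    """Return the argument list for the convert call above, with a configurable density."""
    return [
        "convert",
        "-density", str(density),   # rendering resolution in dpi
        "-depth", "8",
        input_pdf,
        "-background", "white",
        "-flatten",
        "+matte",
        "-compress", "none",
        "-monochrome",              # pure black and white output
        output_tif,
    ]


def render_training_image(input_pdf, output_tif, density=300):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.run(build_convert_command(input_pdf, output_tif, density), check=True)


# Hypothetical usage, trying both resolutions mentioned above:
# render_training_image("training.pdf", "training_300.tif", density=300)
# render_training_image("training.pdf", "training_150.tif", density=150)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.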

+2

Luiza Utsch May 09 '13 at 22:24

A good tool for training Tesseract: http://vietocr.sourceforge.net/training.html

It has several advantages:

The bounding box around each letter can be edited through a graphical interface
It automatically creates all the required files
It automatically merges all the files (freq-dawg, word-dawg, user-words (may be an empty file), inttemp, normproto, pffmtable, unicharset, DangAmbigs (may be an empty file)) into a single eng.traineddata file
The new training data can be used alongside the existing Tesseract eng.traineddata file

-1

N. Singh Sep 05 '16 at 10:06

sashoalm · Accepted Answer · 2012-11-21T15:12:44+0000

I found the answer to the fourth question: "If the bounding rectangles fit snugly."

Fitting the rectangles as snugly as possible gives much better results.

For the rest, 12 point and 300 dpi will be enough, as suggested by @Yaroslav. I think it is better to leave anti-aliasing off.
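For a generator, snug rectangles can be computed directly from the rendered glyph. Below is a minimal sketch, not taken from any Tesseract tool: it finds the tight bounding box of the ink pixels in a small bitmap and formats it as a box-file line (character, left, bottom, right, top, page, with the origin at the bottom-left of the image). The helper names are hypothetical, and treating the right and top edges as exclusive is an assumption.

```python
def tight_bbox(bitmap):
    """Return (left, top, right, bottom) of the ink pixels, right/bottom exclusive.

    bitmap is a list of rows, top row first; a truthy value marks an ink pixel.
    Returns None when the bitmap contains no ink.
    """
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [x for row in bitmap for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols) + 1, rows[-1] + 1


def box_file_line(char, bbox, image_height, page=0):
    """Format one box-file line, flipping y so the origin is at the bottom-left."""
    left, top, right, bottom = bbox
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


# A tiny 5x5 "c" with a one-pixel empty margin on the left, top and bottom.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(box_file_line("c", tight_bbox(glyph), image_height=len(glyph)))
# prints: c 1 1 3 4 0
```

In a real generator the bitmap would come from the rendered page, and `image_height` would be the height of the whole training image rather than of a single glyph.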
