How to configure Tesseract to ignore noise?

I have an image like this (white background, black text). As you can see, there is a lot of noise above and below the line of digits. When there is no noise, Tesseract recognizes the number very well.

But when there is noise, Tesseract tries to read it as digits and adds extra digits to the result. This is really bad. How can I make Tesseract ignore the noise? I can't fix this by pre-processing the image for more contrast or sharpness; it doesn't help at all.

If any tool could pick out only the line of text, it would be a really good contribution to Tesseract. Please help me. Thanks, everyone.


+4
5 answers

You should try eroding and dilating:

The most basic morphological operations are two: Erosion and Dilation. They have a wide range of applications, such as:

Noise removal

...

+3

You can try downsampling the binary image and upsampling it again (pyrDown and pyrUp), or you can smooth your image with a Gaussian blur. And, as already mentioned, erode and dilate your image.
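To show why eroding and then dilating removes specks, here is a minimal, dependency-free sketch on a binary image stored as a list of rows (1 = ink, 0 = background). The helper names and the fixed 3x3 neighbourhood are assumptions made for illustration; in practice you would call OpenCV's erode and dilate on the real image instead.

```python
# Morphological opening (erode, then dilate) with a 3x3 square neighbourhood.
# Pixels outside the image are treated as background (0).

def _apply(img, combine):
    """Set each pixel to 1 if combine() is true over its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0
                for ny in range(y - 1, y + 2)
                for nx in range(x - 1, x + 2)
            ]
            out[y][x] = 1 if combine(window) else 0
    return out


def erode(img):
    # A pixel survives only if its whole neighbourhood is ink.
    return _apply(img, all)


def dilate(img):
    # A pixel becomes ink if any neighbour is ink.
    return _apply(img, any)


def opening(img):
    # Erosion deletes small specks; dilation grows the survivors back.
    return dilate(erode(img))
```

Anything thinner than the neighbourhood is erased along with the noise, so thin digit strokes may need a smaller structuring element or one of the contour-based approaches instead.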

+2

I see 3 solutions to your problem:

  • As already mentioned, try erode and dilate or some kind of blur. This is the simplest solution.
  • Find all the contours (findContours function), then delete every contour whose area is below some threshold (try different values; you should find a suitable one quickly). The threshold need not be a constant: for example, you could use 80% of the mean contour area (sum all the contour areas, divide by the number of contours, and multiply by 0.8).
  • Find all the contours. Create a one-dimensional array of integers whose length equals the height of your image, filled with zeros. Then, for each contour:
    I. Find the top and bottom points (the points with the largest and smallest y coordinates). Call these points T and B.
    II. Add one to every element of the array whose index is between B.y and T.y (so if B = (1, 4) and T = (3, 11), add one to array[4], array[5], array[6], ..., array[11]).
    Then find the largest element in the array and call its index v. All contours for which B.y <= v <= T.y must be letters; the other contours are noise.
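The third option above can be sketched roughly as follows. This is only an illustration: it assumes each contour has already been reduced to a bounding box (x, y, w, h), for example with OpenCV's boundingRect, and the function name split_text_and_noise is made up here. It also takes v to be the index of the largest array element, which is my reading of the final comparison against B.y and T.y.

```python
def split_text_and_noise(boxes, image_height):
    """Separate contours on the main text row from the rest.

    boxes: list of (x, y, w, h) bounding boxes, one per contour.
    Returns (text_boxes, noise_boxes).
    """
    # One counter per image row: how many contours cover that row.
    votes = [0] * image_height
    for _, y, _, h in boxes:
        for row in range(y, y + h):
            votes[row] += 1

    # v: the row covered by the most contours (first one on ties).
    v = votes.index(max(votes))

    text, noise = [], []
    for box in boxes:
        _, y, _, h = box
        # A contour counts as text if it spans row v.
        if y <= v < y + h:
            text.append(box)
        else:
            noise.append(box)
    return text, noise
```

Noise that happens to overlap the text row vertically would still be classified as text, so this probably works best combined with the area filter from the second option.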
+1

You can easily remove this noise with image processing methods (morphological operations such as erosion and dilation); you can use OpenCV for this.

+1

Connected-component labeling, i.e. blob counting. The noise blobs can never match the size of the digits, whereas with morphological methods the digits get altered too. Label the image, count the number of pixels in each labeled region, and set a threshold (which you can choose easily, since you only have digits and noise). cvBlob is a C++ library available on Google Code.
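A rough, dependency-free sketch of that idea, assuming a binary image stored as a list of rows (1 = ink, 0 = background). The function name remove_small_blobs and the min_area parameter are hypothetical; on real images a library routine such as OpenCV's connectedComponentsWithStats, or cvBlob, would do the labeling.

```python
from collections import deque


def remove_small_blobs(img, min_area):
    """Return a copy of img keeping only blobs with at least min_area pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]

    for sy in range(h):
        for sx in range(w):
            if not img[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill collects one 8-connected blob.
            blob = []
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny in range(y - 1, y + 2):
                    for nx in range(x - 1, x + 2):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Pixel count is the blob's area; small blobs are dropped.
            if len(blob) >= min_area:
                for y, x in blob:
                    out[y][x] = 1
    return out
```

Unlike erosion and dilation, the blobs that are kept stay pixel-for-pixel unchanged, which is the advantage this answer points out.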

0
