I have been working with Microsoft OCR for a while. Compared to Tesseract, it has very basic features.
For example, Microsoft OCR returns words and strings, but the lines are nonsense. Two or three words are grouped together at random as a "string" that is not a real line, and these "lines" come back completely out of order. In this respect it is worse than Tesseract: you must take the coordinates of each word and order them yourself.
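As a rough sketch of that ordering step: the snippet below groups word rectangles into lines by their vertical centre and then sorts each line left to right. The `Word` class, `group_into_lines` and the `tolerance` parameter are hypothetical names made up for illustration, not part of the Microsoft OCR API. It assumes you have already copied each word's text and bounding box out of the OCR result, and that the text is not rotated.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    w: float  # width
    h: float  # height


def group_into_lines(words, tolerance=0.5):
    """Rebuild text lines from unordered word boxes.

    Words are visited from top to bottom. A word joins the current line
    when its vertical centre lies within tolerance * line height of the
    line's average centre; otherwise it starts a new line.
    """
    lines = []
    for word in sorted(words, key=lambda w: w.y + w.h / 2):
        centre = word.y + word.h / 2
        if lines:
            current = lines[-1]
            line_centre = sum(w.y + w.h / 2 for w in current) / len(current)
            line_height = max(w.h for w in current)
            if abs(centre - line_centre) <= tolerance * line_height:
                current.append(word)
                continue
        lines.append([word])
    # Inside each line, order the words left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


if __name__ == "__main__":
    scrambled = [
        Word("world", 60, 11, 40, 10),
        Word("Hello", 10, 10, 40, 10),
        Word("line", 70, 30, 30, 10),
        Word("Second", 10, 31, 50, 10),
    ]
    for line in group_into_lines(scrambled):
        print(line)
```

The tolerance of half a line height is only a starting guess. Skewed scans or mixed font sizes will probably need a different value or a deskew step first.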
Microsoft does not return character rectangles, and there is absolutely no way to customize or train Microsoft OCR in any way. You can add languages with Windows Update for "Basic Typing" = OCR (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10 ), but you cannot train your own language data.
MSDN says the following 25 languages are supported with varying degrees of accuracy:
- Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
- Very good: Simplified Chinese, Greek, Japanese, Russian and Turkish.
- Good: Traditional Chinese and Korean.
Recognition quality is very similar to Tesseract. It even has the same problems as Tesseract. Some isolated characters are not recognized (for example a single '$' standing alone), and it has the same huge problem with asterisks as Tesseract. It also inserts spaces in the wrong places, as Tesseract does. So I ask myself whether Microsoft uses Tesseract under the hood.
However, Microsoft OCR has one advantage over Tesseract: its image preprocessing is much better. It doesn't matter if you have red text on a yellow background or white text on black. This is tricky for Tesseract, which needs a good-quality black-and-white image as input.
For both OCR libraries, the following applies: if you have recognition problems, try enhancing the image. Even blurring an image can help a lot, because it eliminates image noise.
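To illustrate why blurring helps, here is a minimal pure-Python sketch that applies a 3x3 mean blur and then a fixed threshold to a tiny grayscale image stored as a list of rows. `box_blur`, `binarize` and the threshold of 128 are assumptions chosen for the example. In real code you would use an imaging library and pick the filter and threshold to suit your scans.

```python
def box_blur(pixels):
    """3x3 mean filter on a 2D list of grayscale values (0-255).

    Border pixels are averaged over the neighbours that exist.
    """
    height, width = len(pixels), len(pixels[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) // len(neighbours))
        blurred.append(row)
    return blurred


def binarize(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]


if __name__ == "__main__":
    # White background with one dark speck of noise in the middle.
    noisy = [
        [255, 255, 255],
        [255, 0, 255],
        [255, 255, 255],
    ]
    print(binarize(noisy))            # the speck survives as a black pixel
    print(binarize(box_blur(noisy)))  # the speck is averaged away
```

Thresholding the raw image keeps the speck, while blurring first spreads it into its white neighbours so that it falls on the white side of the threshold. The same blur also softens real strokes, so a radius that is too large will start erasing thin characters.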
Elmue