How does Google Books find text regions?

Processing document scans is one of the harder problems in computer vision. It usually involves several steps: noise removal, color analysis, binarization, text block identification, OCR, and then possibly some contextual analysis and correction.

Does anyone know, or can anyone point me to literature on, how Google identifies text blocks before the OCR stage? Any ideas?

2 answers

I believe that Google uses the Tesseract OCR engine in combination with another tool called Ocropus, both of which are open source. I don't know anything about how they work internally, but you may be interested in looking through the source code of both projects.


This is secondhand information from someone who does digitization at my library, but it looks like Google's approach is to just push everything through an automated process, OCR everything that looks like text, and not worry too much about cropping individual images or doing a lot of semantic analysis to find image captions, etc. They may do subtle things that are not obvious, but on the surface they definitely favor quantity over quality, which is reasonable for their purposes, IMO.
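
To make "everything that looks like text" a bit more concrete, here is a minimal sketch of one classic way to locate candidate text lines: binarize the page, count dark pixels per row (a horizontal projection profile), and group consecutive inked rows into bands. This is only an illustration of the general idea, not Google's method and not code from Tesseract or Ocropus. The function name `find_text_rows`, the `threshold` and `min_ink` parameters, and the toy page are all made up for this example.

```python
def find_text_rows(image, threshold=128, min_ink=1):
    """Return (start, end) row ranges, end exclusive, that contain ink.

    image: list of rows, each a list of grayscale values in 0..255.
    threshold and min_ink are illustrative assumptions, not tuned values.
    """
    # Binarize and project: count pixels darker than the threshold in each row.
    ink_per_row = [sum(1 for px in row if px < threshold) for row in image]

    bands = []
    start = None
    for y, ink in enumerate(ink_per_row):
        if ink >= min_ink and start is None:
            start = y                  # first inked row of a new band
        elif ink < min_ink and start is not None:
            bands.append((start, y))   # blank row closes the current band
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands


# Toy 6x6 "page": white background (255) with two dark (0) text-like bands.
W, B = 255, 0
page = [
    [W, W, W, W, W, W],
    [W, B, B, W, B, W],
    [W, B, W, B, B, W],
    [W, W, W, W, W, W],
    [W, W, W, W, W, W],
    [W, B, B, B, W, W],
]
print(find_text_rows(page))
```

A real pipeline would need deskewing, noise filtering, and column detection before something like this is useful; the sketch only shows the row-grouping step on a clean, axis-aligned image.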

