Python text layout recognition

I am trying to sort several thousand scanned files and sort them into folders based on type (i.e.: if one of the files is a scanned copy of formA, then it should go to the formA folder, if it is scanned a copy of form B, then it should go in the form B folder, etc.). I feel that the best way to match files and types is based on their text paths, but I am completely new to image processing, so if there is a better solution, then I'm all ears.

I am working on python. Any ideas on a better way to do this? Pil? Opencv? ImageMagick?

Thanks in advance...

+7
source share
2 answers

This library probably interests you -
http://code.google.com/p/ocropus/
It was made by googlers and allows you to do OCR and mock analysis with python.
I had some problems with the installation, but it was a long time ago, so maybe everything is fixed now.

+3
source

I donโ€™t know in what format you have scanned documents, but pdfminer can perform mock analysis for pdf. I suppose this will put an invoice for your purpose if you receive the documents in a somewhat decent pdf format (if you just received โ€œclean imagesโ€, this will not do you any good)

+1
source

All Articles