Python text layout recognition

Question

Python text layout recognition

I am trying to sort several thousand scanned files and sort them into folders based on type (i.e.: if one of the files is a scanned copy of formA, then it should go to the formA folder, if it is scanned a copy of form B, then it should go in the form B folder, etc.). I feel that the best way to match files and types is based on their text paths, but I am completely new to image processing, so if there is a better solution, then I'm all ears.

I am working on python. Any ideas on a better way to do this? Pil? Opencv? ImageMagick?

Thanks in advance...

+7

python image-processing

danwoods Jul 11 '11 at 20:18

source share

2 answers

I don’t know in what format you have scanned documents, but pdfminer can perform mock analysis for pdf. I suppose this will put an invoice for your purpose if you receive the documents in a somewhat decent pdf format (if you just received “clean images”, this will not do you any good)

+1

Steven Jul 11 '11 at 10:29

source share

Aditya mukherji · Accepted Answer · 2011-07-11T20:23:04+0000

This library probably interests you -
http://code.google.com/p/ocropus/
It was made by googlers and allows you to do OCR and mock analysis with python.
I had some problems with the installation, but it was a long time ago, so maybe everything is fixed now.

Python text layout recognition

More articles: