My setup: Python 2.6 64-bit (with pyPdf-1.13.win32.exe installed), Wing IDE, Windows 7 64-bit.
I got the following error:
NotImplementedError: unsupported filter /LZWDecode
When I ran the following code:
    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = 'C:\\Users\\Homer\\Documents\\'
Here is some background. The path is full of PDFs. Some of them were saved from text documents using an Adobe PDF printer (at least I think they were), and some were scanned as images. I want to separate the two groups and OCR the ones that are images (ideally; the image scans are not perfect and should not be mixed in with the others).
I asked here a few days ago how to do this:
OCR Batch Software for PDF Files
The only answer I got was in VB, and I only speak Python. So I decided to try to write an answer to my own question. My strategy (reflected in the code above) is as follows. If a file is just an image, the regular expression will return an empty list. If it has text, the regular expression (which matches any word with 2 or more alphanumeric characters) will return a list filled with things like u'word' (in Python, I think this is a Unicode string).
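To make the strategy concrete, here is a minimal sketch of the check on its own, separate from pyPdf. The names WORD_RE and has_text are made up for illustration; the pattern is the same r'\w{2,}' I use in my code.

```python
import re

# Any run of 2 or more word characters counts as "text".
WORD_RE = re.compile(r'\w{2,}', re.UNICODE)

def has_text(content):
    """Return True if the extracted page content contains at least one word."""
    return bool(WORD_RE.findall(content))

print(has_text(u'word another'))  # content from a page with extractable text
print(has_text(u''))              # content from a scanned page (nothing extracted)
```

Requiring at least 2 characters is only a rough guard against stray single characters being counted as text; it is a guess at a reasonable threshold, not a tested one.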
So the code should work, and it would be a first step toward solving that other thread with open source software (separating the OCR'd PDFs from the image-only ones). But I do not know how to handle this filter error, and searching Google did not help. If anyone knows, it would be very helpful.
I really don't know how to use this stuff. I'm not sure what "filter" means in pyPdf. I think the error suggests that pyPdf cannot really read the PDF file or something, although it does work in some cases. Funny thing: I put one of the non-OCR'd files and one of the OCR'd files in the same folder as the Python file, and the code worked on them one at a time without the for loop, so I don't know why running it with the for loop produces the filter error. Below is the single-file code. Thanks.
    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    pdf = pyPdf.PdfFileReader(open('my_ocrd_file.pdf', 'rb'))

    has_text_list = []
    does_not_have_text_list = []

    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText()
        # any word with 2 or more alphanumeric characters
        does_it_have_text = re.findall(r'\w{2,}', content)
        print does_it_have_text
and it prints the words, so I don't know why I get the filter error in one case and not in the other. When I run this code against another file in the directory (one that is NOT OCR'd), the output is an empty list on one line and an empty list on the next, for example:
[]
[]
Therefore, I do not think this is a filter problem with the non-OCR'd files. This is over my head and I need help here.
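For what it's worth, here is a rough sketch of how the folder loop could keep going when one file raises the error, so that a single bad file does not stop the whole run. It does not decode LZW data; it only sets the offending file aside. The function classify_pdfs and its two callable parameters are hypothetical names for illustration: extract_text would wrap whatever pyPdf calls read the file, and has_text would be the regular expression check.

```python
import os

def classify_pdfs(folder, extract_text, has_text):
    """Sort the PDFs in `folder` into three lists of file names.

    extract_text(path) -> text of the file (hypothetical helper, e.g. a pyPdf wrapper)
    has_text(text)     -> True if the text contains real words (hypothetical helper)
    """
    with_text, without_text, unreadable = [], [], []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith('.pdf'):
            continue  # ignore anything that is not a PDF
        full_path = os.path.join(folder, name)
        try:
            content = extract_text(full_path)
        except NotImplementedError:
            # e.g. "unsupported filter /LZWDecode": note the file and move on
            unreadable.append(name)
            continue
        if has_text(content):
            with_text.append(name)
        else:
            without_text.append(name)
    return with_text, without_text, unreadable
```

The third list at least tells me which files trigger the error, so they can be looked at by hand or run through some other tool.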
Edit:
A Google search found this, but I don’t know what to do with it:
http://vaitls.com/treas/pdf/pyPdf/filters.py
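One thing that might help narrow this down is checking which files in the folder mention the filter at all. The sketch below just scans the raw bytes of a file for the name /LZWDecode. The function name mentions_lzw is made up, and this is only a crude check: I am assuming the filter name shows up uncompressed in the file, which may not hold for every PDF.

```python
def mentions_lzw(pdf_path):
    """Return True if the raw bytes of the file contain the name /LZWDecode."""
    with open(pdf_path, 'rb') as handle:
        return b'/LZWDecode' in handle.read()
```

Running this over each file in the folder would show whether the error comes from one particular PDF rather than from the for loop itself.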