Python, pyPdf, Adobe PDF OCR error: unsupported filter /LZWDecode

My setup: Python 2.6 64-bit (with pyPdf-1.13.win32.exe installed), Wing IDE, Windows 7 64-bit.

I got the following error:

NotImplementedError: unsupported filter /LZWDecode

When I ran the following code:

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    path = 'C:\\Users\\Homer\\Documents\\'  # This is where I put my pdfs

    filelist = os.listdir(path)

    has_text_list = []
    does_not_have_text_list = []

    for pdf_name in filelist:
        pdf_file_with_directory = os.path.join(path, pdf_name)
        pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))

        for i in range(0, pdf.getNumPages()):
            content = pdf.getPage(i).extractText()  # this is the line that raises the error
            does_it_have_text = re.findall(r'\w{2,}', content)
            if does_it_have_text == []:
                does_not_have_text_list.append(pdf_name)
                print pdf_name
            else:
                has_text_list.append(pdf_name)

    print does_not_have_text_list

Here is some background. The path is full of PDFs. Some of them were saved from text documents using an Adobe PDF printer (at least I think that's how they were made), and some were scanned as images. I want to separate the two kinds and OCR the ones that are images (the ones that already contain text are fine and should not be touched).

I asked here a few days ago how to do this:

OCR Batch Software for PDF Files

The only answer I got was in VB, and I only speak Python. So I decided to try writing an answer to my own question. My strategy (reflected in the code above) is as follows. If a page is just an image, the regular expression will return an empty list. If it has text, the regular expression (which matches any run of 2 or more alphanumeric characters) will return a list filled with things like u'word' (in Python, I think that is a Unicode string).
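
To make the strategy concrete, here is a minimal sketch of the regex test on its own, without pyPdf. The `has_text` helper and the sample strings are made up for illustration; the assumption is that a scanned page with no text layer extracts as an empty (or nearly empty) string.

```python
import re

def has_text(content):
    """Return True if the text contains any run of 2 or more word characters."""
    return re.findall(r'\w{2,}', content) != []

# Text from a page with a real text layer yields a list of matches.
# The single-character word 'a' is skipped because of the {2,} quantifier.
words = re.findall(r'\w{2,}', u'Hello world, a PDF')

# An empty extraction result yields an empty list.
empty = re.findall(r'\w{2,}', u'')
```

Here `words` comes out as the three matches 'Hello', 'world' and 'PDF', and `empty` is `[]`, which is the case the loop treats as "no text".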

So the code should work, and it would be the first step toward answering that other thread using open source software (separating the OCR'd PDFs from the image-only ones). But I do not know how to handle this filter error, and searching Google did not help. If anyone knows, it would be very helpful.

I really don't know how to interpret this. I'm not sure what "filter" means in pyPdf. I think it suggests that pyPdf cannot actually read the PDF file or something like that, even though it does work. Funny thing: I put one of the non-OCR'd files and one of the OCR'd files in the same folder as the Python file, and the code worked on each of them without the outer for loop, so I don't know why running them inside the for loop produces the filter error. I'll put the single-file code below. Thanks.

    from pyPdf import PdfFileWriter, PdfFileReader
    import sys, os, pyPdf, re

    pdf = pyPdf.PdfFileReader(open('my_ocrd_file.pdf', 'rb'))

    has_text_list = []
    does_not_have_text_list = []

    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText()
        does_it_have_text = re.findall(r'\w{2,}', content)
        print does_it_have_text

and it prints the matched words, so I don't know why I get the filter error with one script and not the other. When I run this code against another file in the directory (one that is NOT OCR'd), the output is an empty list on one line and an empty list on the next, like so:

[]
[]

So I do not think this is a filter problem with the non-OCR'd files. This is over my head and I need help here.
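
One way to find out which file in the folder actually triggers the error is to catch the exception per file instead of letting it stop the whole run. This is only a sketch: `classify` and its `extract_text` argument are hypothetical names, and `extract_text` stands in for whatever opens a file and returns its text (with pyPdf, something that calls `extractText()` on each page and joins the results).

```python
import re

def classify(names, extract_text):
    """Sort names into (has_text, no_text, failed).

    extract_text is a caller-supplied (hypothetical) callable that takes a
    file name and returns its text, or raises NotImplementedError when the
    PDF uses a filter the library does not support.
    """
    has_text, no_text, failed = [], [], []
    for name in names:
        try:
            content = extract_text(name)
        except NotImplementedError as err:
            # Remember the file and the message instead of aborting the loop.
            failed.append((name, str(err)))
            continue
        if re.findall(r'\w{2,}', content):
            has_text.append(name)
        else:
            no_text.append(name)
    return has_text, no_text, failed
```

The `failed` list then shows exactly which files raise the unsupported-filter error. Note also that `os.listdir` returns every entry in the folder, so the loop sees files that the two single-file tests never touched.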

Edit:

A Google search found this, but I don’t know what to do with it:

http://vaitls.com/treas/pdf/pyPdf/filters.py

2 answers

Replace pyPdf's filters.py, in the pyPdf source folder, with the version at http://vaitls.com/treas/pdf/pyPdf/filters.py. It worked for me.


LZW is a compression format used in GIFs, and sometimes in PDF files. If you look at the filters available in pyPdf.filters, you will see that LZW is not among them, hence the NotImplementedError. The link you posted points to code in a Subversion repository where someone has implemented the LZW filter.
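
To give an idea of what such a filter has to do, here is a rough sketch of PDF-style LZW decoding. It is not the code from that repository and not part of pyPdf; `lzw_decode` is a made-up name. It assumes the usual PDF conventions: codes packed most significant bit first, 9 to 12 bits wide, 256 as the clear-table code, 257 as end-of-data, and the default early change of 1. It does not handle predictors.

```python
def lzw_decode(data, early_change=1):
    """Decode PDF-style LZW data (MSB-first codes, 9 to 12 bits wide)."""
    CLEAR, EOD = 256, 257

    def fresh_table():
        # Codes 0..255 stand for the single bytes themselves.
        return dict((i, bytearray([i])) for i in range(256))

    data = bytearray(data)
    out = bytearray()
    table = fresh_table()
    width, next_code, prev = 9, 258, None
    bit_buf, bit_count, pos = 0, 0, 0

    while True:
        # Pull in bytes until a full code is available.
        while bit_count < width:
            if pos >= len(data):
                return bytes(out)  # input ended without an end-of-data code
            bit_buf = (bit_buf << 8) | data[pos]
            pos += 1
            bit_count += 8
        bit_count -= width
        code = (bit_buf >> bit_count) & ((1 << width) - 1)
        bit_buf &= (1 << bit_count) - 1

        if code == EOD:
            return bytes(out)
        if code == CLEAR:
            table = fresh_table()
            width, next_code, prev = 9, 258, None
            continue

        if code in table:
            entry = table[code]
        elif prev is not None and code == next_code:
            # The code refers to the entry that is about to be created.
            entry = prev + prev[:1]
        else:
            raise ValueError('invalid LZW code %d' % code)

        out += entry
        if prev is not None and next_code < 4096:
            table[next_code] = prev + entry[:1]
            next_code += 1
            # Widen the codes once the table is about to outgrow the width.
            if next_code + early_change >= (1 << width) and width < 12:
                width += 1
        prev = entry

# Nine bytes that pack the 9-bit codes 256, 45, 258, 258, 65, 259, 66, 257.
sample = bytearray([0x80, 0x0B, 0x60, 0x50, 0x22, 0x0C, 0x0C, 0x85, 0x01])
decoded = lzw_decode(sample)  # decodes to b'-----A---B'
```

A real filter also has to be registered with the library so that streams marked /LZWDecode are routed to it, which is the part the replacement filters.py takes care of.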
