struct.error: unpack requires a string argument of length 16

When processing a PDF file (2.pdf) using pdfminer (pdf2txt.py) I got the following error:

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
        self.init_resources(resources)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = self.get_font(None, subspec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
        font = PDFCIDFont(self, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
        StringIO(self.fontfile.get_data()))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

A similar file (1.pdf) does not cause any problem.

I cannot find any information about this error. I opened an issue in the pdfminer GitHub repository, but it went unanswered. Can someone explain why this is happening? What can I do to parse 2.pdf?


Update: I get a similar error, with BytesIO instead of StringIO, after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
        interpreter.process_page(page)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
        self.init_resources(resources)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
        font = self.get_font(None, subspec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = PDFCIDFont(self, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
        BytesIO(self.fontfile.get_data()))
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16
+7
python pdf pdf-parsing pdfminer pdftotext
6 answers

TL;DR

Thanks to @mkl and @hynecker for the extra information. With it I can confirm that this is a bug in both pdfminer and the PDF. Whenever pdfminer tries to fetch an embedded file stream (for example, a font definition), it picks up the last one in the file before an endobj. Unfortunately, not all PDF files rigorously add that end tag, so pdfminer should be resilient to this.

Quick fix for this problem

I created a patch and submitted it as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159 .

Detailed diagnosis

As mentioned in the other answers, you see this error because pdfminer does not get the expected number of bytes from the stream while unpacking the data. But why?

As you can see in the stack trace, pdfminer (correctly) detects that it has a CID font to process. It then treats the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading a set of binary tables.

This does not work for 2.pdf because that stream contains text. You can see this by running dumppdf -b -i 18 2.pdf . Here is how the output starts:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo
    << /Registry (Adobe)
    /Ordering (UCS)
    /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    ...

So, garbage in, garbage out... Is this a bug in your file or in pdfminer? The fact that other readers can handle the file made me suspicious.
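A quick way to tell a binary font program from a text stream is to look at its first four bytes. The sketch below is only an illustration, not part of pdfminer: the helper name looks_like_sfnt and the sample bytes are assumptions made for the example, and the tags listed are the usual sfnt version tags.

```python
# Version tags that commonly open a TrueType/OpenType (sfnt) font file.
SFNT_TAGS = (
    b'\x00\x01\x00\x00',  # TrueType outlines
    b'true',              # Apple TrueType
    b'OTTO',              # OpenType with CFF outlines
    b'ttcf',              # TrueType collection
)


def looks_like_sfnt(data):
    """Return True if data starts with a known sfnt version tag (hypothetical helper)."""
    return data[:4] in SFNT_TAGS


# The start of the stream dumped above is PostScript-like CMap text, not a font.
cmap_text = b'/CIDInit /ProcSet findresource begin 12 dict begin begincmap'
is_cmap_a_font = looks_like_sfnt(cmap_text)
is_header_a_font = looks_like_sfnt(b'\x00\x01\x00\x00' + b'\x00' * 8)
print(is_cmap_a_font, is_header_a_font)
```

A check like this only confirms that the data handed to the TrueType parser is not a font; it does not explain why the wrong stream was returned.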

Digging a bit more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF specification shows that these two cannot be the same.

Digging further into the code, I see that all streams receive the same data. Oops! That is the bug. The cause seems to be that some end tags are missing from this PDF document, as @hynecker noted.

The fix is to return the correct data for each stream. Any other fix that merely catches the error will result in bad data being used for all streams and, for example, in incorrect font definitions.

I believe that the attached patch will fix your problem and should be safe for use in general.

+5

I fixed your problem in the source code and tried it on your 2.pdf file to make sure it works.

In the file pdffont.py, I replaced:

    class TrueTypeFont(object):

        class CMapNotFound(Exception):
            pass

        def __init__(self, name, fp):
            self.name = name
            self.fp = fp
            self.tables = {}
            self.fonttype = fp.read(4)
            (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
            for _ in xrange(ntables):
                (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
                self.tables[name] = (offset, length)
            return

with:

    class TrueTypeFont(object):

        class CMapNotFound(Exception):
            pass

        def __init__(self, name, fp):
            self.name = name
            self.fp = fp
            self.tables = {}
            self.fonttype = fp.read(4)
            (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
            for _ in xrange(ntables):
                fp_bytes = fp.read(16)
                # Stop on a short read (for example at EOF) instead of crashing.
                if len(fp_bytes) < 16:
                    break
                (name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
                self.tables[name] = (offset, length)
            return

Explanation

@Nabeul Ahmed was right:

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So the problem can only lie in the stream being read, that is, in the content of your particular PDF file.

In the code, we see that fp.read(16) is executed in a loop without any check, so we cannot be sure that it actually read everything. It could, for example, have reached EOF.

To avoid this problem, I simply break out of the for loop when that happens.

    for _ in xrange(ntables):
        fp_bytes = fp.read(16)
        if len(fp_bytes) < 16:
            break

In the normal case, this check changes nothing.
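To see the guard in isolation, here is a self-contained sketch that runs without pdfminer. The function name read_table_directory and the hand-built sample bytes are assumptions made for this illustration; only the format strings and the length check come from the code above.

```python
import struct
from io import BytesIO

ENTRY_FORMAT = '>4sLLL'
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 16 bytes per table entry


def read_table_directory(fp):
    """Read a TrueType-style table directory, stopping at the first short read."""
    fp.read(4)  # font type tag, not needed in this sketch
    (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _ in range(ntables):
        entry = fp.read(ENTRY_SIZE)
        if len(entry) < ENTRY_SIZE:
            # Truncated directory: give up instead of raising struct.error.
            break
        (name, tsum, offset, length) = struct.unpack(ENTRY_FORMAT, entry)
        tables[name] = (offset, length)
    return tables


# The header announces two tables, but the second entry is cut off after 6 bytes.
header = b'\x00\x01\x00\x00' + struct.pack('>HHHH', 2, 0, 0, 0)
first_entry = struct.pack(ENTRY_FORMAT, b'cmap', 0, 44, 10)
truncated = BytesIO(header + first_entry + b'glyf\x00\x00')

tables = read_table_directory(truncated)
print(tables)
```

The guard only prevents the crash. If the stream is not a font at all, whatever tables were read before the break are still meaningless.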

I will try to open a pull request on GitHub, but I am not sure it will be accepted, so I suggest you monkey-patch it for now by editing your file /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py .

+4

This really is an invalid PDF: the endobj keyword is missing after three indirect objects (objects 5, 18 and 22).

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj (section 7.3.10 of the PDF reference).

The example 2.pdf is a simple PDF 1.3 file that uses a plain uncompressed cross-reference table and uncompressed object delimiters. The failure is easy to detect with grep and an ordinary file viewer: the PDF has 22 indirect objects, and the pattern " obj" is matched exactly 22 times (fortunately for simplicity, never by accident inside a string object or a stream), but the endobj keyword is missing three times.

    $ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
    ...
    18 0 obj
    << /Length 451967/Length1 451967/Filter [/FlateDecode] >>
    stream
    ...
    endstream
    % # see the missing "endobj" here
    17 0 obj
    << /Length 12743 /Filter [/FlateDecode] >>
    stream
    ...
    endstream
    endobj
    ...

Similarly, object 5 has no endobj before object 1, and object 22 has no endobj before object 21.

It is known that broken cross-references in a PDF can, and usually should, be reconstructed from the obj/endobj keywords (see the PDF reference, section C.2). Conversely, some applications probably repair missing endobj keywords when the cross-references are correct, but that is not documented advice.
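The same grep-style check can be sketched in Python. This is a rough heuristic under the same assumption as the grep run, namely that the keywords never occur by accident inside string or stream data. The function name find_unterminated_objects and the sample bytes are made up for the illustration.

```python
import re

# "<number> <generation> obj" opens an indirect object; "endobj" closes it.
OBJ_START = re.compile(br'(\d+)\s+(\d+)\s+obj\b')
OBJ_END = re.compile(br'\bendobj\b')


def find_unterminated_objects(data):
    """Return numbers of objects that are not closed by endobj before the next obj."""
    tokens = [(m.start(), 'obj', int(m.group(1))) for m in OBJ_START.finditer(data)]
    tokens += [(m.start(), 'endobj', None) for m in OBJ_END.finditer(data)]
    tokens.sort(key=lambda token: token[0])  # order keywords by file position

    missing = []
    open_obj = None
    for _pos, kind, number in tokens:
        if kind == 'obj':
            if open_obj is not None:
                missing.append(open_obj)  # previous object was never closed
            open_obj = number
        else:
            open_obj = None
    if open_obj is not None:
        missing.append(open_obj)
    return missing


# Synthetic data mimicking the excerpt above: object 18 lacks its endobj.
sample = (b'18 0 obj\n<< /Length 4 >>\nstream\nabcd\nendstream\n'
          b'17 0 obj\n<< /Length 4 >>\nstream\nefgh\nendstream\nendobj\n')
unterminated = find_unterminated_objects(sample)
print(unterminated)
```

For a real file, the bytes would be read in binary mode, for example with open('2.pdf', 'rb'). Compressed stream data can produce false matches in other files.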

+4

The last error message tells you a lot:

    File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
      (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

You can easily debug what is happening, for example by adding debug statements at that point in pdffont.py. My guess is that there is something special about the content of your PDF. Judging by the name of the class that raises the error, TrueTypeFont, there is some incompatibility with the font type.

+2

Let's start by explaining the statement that raises the exception:

 struct.unpack('>4sLLL', fp.read(16)) 

Its synopsis is:

struct.unpack(fmt, buffer)

unpack unpacks from the buffer buffer (presumably packed earlier by pack(fmt, ...)) according to the format string fmt. The result is a tuple even if it contains exactly one item. The buffer's size in bytes must match the size required by the format, as reflected by calcsize().
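As a small illustration of that description, the round trip below packs one entry with the same format string and unpacks it again. The field values are arbitrary and chosen only for the example.

```python
import struct

# Pack one table-directory entry, then unpack it with the same format.
packed = struct.pack('>4sLLL', b'cmap', 0, 44, 10)
entry = struct.unpack('>4sLLL', packed)

# Even a single-field format yields a tuple, hence the one-element unpacking.
(count,) = struct.unpack('>H', struct.pack('>H', 7))

print(len(packed), entry, count)
```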

The most common cause is a wrong number of bytes for the format in use. In your case the format >4sLLL needs 16 bytes. As a smaller example, for a format that expects 4 bytes, suppose you supply only 3:

 (flag, value) = struct.unpack('BH', fp.read(3))

For this you get:

 struct.error: unpack requires a string argument of length 4

The reason is that the struct format 'BH' expects 4 bytes; that is, when we pack something with the 'BH' format it occupies 4 bytes of memory, because native alignment inserts a pad byte between the 1-byte B and the 2-byte H. A good explanation is here .
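The sizes involved can be checked directly with calcsize. The sketch below compares the native-aligned 'BH' format with its big-endian counterpart and then triggers the error with a buffer that is one byte short. The variable names are only for illustration, and the exact wording of the error message differs between Python versions.

```python
import struct

# Native mode (no prefix) aligns the 2-byte H, so a pad byte follows the 1-byte B.
native_size = struct.calcsize('BH')
# Standard-size modes such as big-endian '>' never insert padding.
packed_size = struct.calcsize('>BH')
# The format pdfminer uses: a 4-byte string plus three 4-byte unsigned longs.
table_entry_size = struct.calcsize('>4sLLL')

print(native_size, packed_size, table_entry_size)

try:
    struct.unpack('>4sLLL', b'\x00' * 15)  # one byte short of the required 16
    short_read_failed = False
except struct.error as exc:
    short_read_failed = True
    print('struct.error:', exc)
```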


To make this clearer, look at the format string >4sLLL and check the buffer size that unpack expects for it (the bytes read from the PDF file). Quoting the docs:

The buffer's size in bytes must match the size required by the format, as reflected by calcsize().

    >>> import struct
    >>> struct.calcsize('>4sLLL')
    16
    >>>

At this point, we can say that there is nothing wrong with the statement:

 (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16)) 

The format string >4sLLL requires a 16-byte buffer, and fp.read is correctly asked to read 16 bytes at a time.

So the problem can only lie in the stream being read, that is, in the content of your particular PDF file.


It could be a bug, according to this comment:

This is a bug in upstream PDFminer by @euske. There seem to be patches for this, so it should be an easy fix. Beyond this, I also need to strengthen PDF parsing so that we never error out on a failed parse.

I will edit this post if I find something useful to add here, such as a solution or a patch.

+2

If you still get some struct errors after applying Peter's patch, especially when parsing many files in a single script run (using os.listdir), try setting the resource manager's caching to False.

 rsrcmgr = PDFResourceManager(caching=False) 

This helped me get rid of the rest of the errors after applying the above solutions.

0