Why is my code not correctly separating each page in a scanned pdf?

Question

Update: Thanks to stardt, whose script works! I tried the script on a different PDF file and it also split each PDF page correctly, but the order of the resulting pages is sometimes right and sometimes wrong. For example, on PDF pages 25-28 of the output, the printed page numbers are 14, 15, 17, 16. I was wondering why? The entire PDF file can be downloaded from http://download304.mediafire.com/u6ewhjt77lzg/bgf8uzvxatckycn/3.pdf

Original: I have a scanned PDF file in which two paper pages sit side by side on each PDF page. I would like to split each PDF page into two, with the original left half becoming the first of the two new PDF pages.

Here is my Python script, called un2up, inspired by Gilles:

    #!/usr/bin/env python
    import copy, sys
    from pyPdf import PdfFileWriter, PdfFileReader

    input = PdfFileReader(sys.stdin)
    output = PdfFileWriter()
    for p in [input.getPage(i) for i in range(0,input.getNumPages())]:
        q = copy.copy(p)
        (w, h) = p.mediaBox.upperRight
        p.mediaBox.upperLeft = (0, h/2)
        p.mediaBox.upperRight = (w, h/2)
        p.mediaBox.lowerRight = (w, 0)
        p.mediaBox.lowerLeft = (0, 0)
        q.mediaBox.upperLeft = (0, h)
        q.mediaBox.upperRight = (w, h)
        q.mediaBox.lowerRight = (w, h/2)
        q.mediaBox.lowerLeft = (0, h/2)
        output.addPage(q)
        output.addPage(p)
    output.write(sys.stdout)

I ran the script on the PDF in a terminal with the command un2up < page.pdf > out.pdf, but the output file out.pdf is not split correctly.

I also checked the values of the variables w and h, which come from p.mediaBox.upperRight. They are 514 and 1224, which does not look right given the actual proportions of the page.

The file can be downloaded from http://download851.mediafire.com/bdr4sv7v5nzg/raci13ct5w4c86j/page.pdf .

+8

python pdf pypdf

Tim Aug 13 '11 at 0:20


3 answers

The code by @stardt was very useful, but I had problems splitting a batch of PDF files with different orientations. Here's a more general function that will work regardless of page orientation:

    import copy
    import math
    import pyPdf

    def split_pages(src, dst):
        src_f = file(src, 'r+b')
        dst_f = file(dst, 'w+b')

        input = pyPdf.PdfFileReader(src_f)
        output = pyPdf.PdfFileWriter()

        for i in range(input.getNumPages()):
            p = input.getPage(i)
            q = copy.copy(p)
            q.mediaBox = copy.copy(p.mediaBox)

            x1, x2 = p.mediaBox.lowerLeft
            x3, x4 = p.mediaBox.upperRight

            x1, x2 = math.floor(x1), math.floor(x2)
            x3, x4 = math.floor(x3), math.floor(x4)
            x5, x6 = math.floor(x3/2), math.floor(x4/2)

            if x3 > x4:
                # horizontal
                p.mediaBox.upperRight = (x5, x4)
                p.mediaBox.lowerLeft = (x1, x2)

                q.mediaBox.upperRight = (x3, x4)
                q.mediaBox.lowerLeft = (x5, x2)
            else:
                # vertical
                p.mediaBox.upperRight = (x3, x4)
                p.mediaBox.lowerLeft = (x1, x6)

                q.mediaBox.upperRight = (x3, x6)
                q.mediaBox.lowerLeft = (x1, x2)

            output.addPage(p)
            output.addPage(q)

        output.write(dst_f)
        src_f.close()
        dst_f.close()
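The orientation test in the function above can also be written as a small pure function, which makes it easy to check without a PDF file. The sketch below is only an illustration: the name split_box is hypothetical, and unlike the function above it cuts at the midpoint between the two corners instead of halving the upper-right coordinate, so it does not assume the box starts at (0, 0). It also compares width and height rather than the raw upper-right coordinates.

```python
def split_box(lower_left, upper_right):
    """Return (first, second) half boxes, each as ((x1, y1), (x2, y2)).

    Hypothetical helper, not part of pyPdf.
    """
    x1, y1 = lower_left
    x2, y2 = upper_right
    width, height = x2 - x1, y2 - y1
    if width > height:
        # Landscape box: cut along x, the left half comes first.
        mid = (x1 + x2) / 2.0
        return ((x1, y1), (mid, y2)), ((mid, y1), (x2, y2))
    # Portrait box: cut along y, the top half comes first.
    mid = (y1 + y2) / 2.0
    return ((x1, mid), (x2, y2)), ((x1, y1), (x2, mid))


landscape = split_box((0, 0), (800, 600))
portrait = split_box((10, 20), (110, 220))
```

Each returned pair could then be assigned to lowerLeft and upperRight of the two page copies, in the same way the function above does.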

+1

moraes Apr 1 '13 at 10:37


I would like to add that you have to make sure the mediaBox objects are not shared between the copies p and q. This can easily happen if you read from p.mediaBox before taking the copy.

In that case, writing to, for example, p.mediaBox.upperRight may also change q.mediaBox, and vice versa.

The solution by @moraes takes care of this by explicitly copying the mediaBox.
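The sharing effect is ordinary shallow-copy behaviour in Python and can be reproduced without pyPdf. The sketch below uses two hypothetical stand-in classes, Box and Page, which are not the pyPdf classes. It only illustrates that copy.copy duplicates the outer object while still pointing at the same inner object.

```python
import copy


class Box(object):
    """Hypothetical stand-in for a media box (not the pyPdf class)."""
    def __init__(self, lower_left, upper_right):
        self.lower_left = lower_left
        self.upper_right = upper_right


class Page(object):
    """Hypothetical stand-in for a PDF page object."""
    def __init__(self, box):
        self.mediaBox = box


p = Page(Box((0, 497), (514, 1224)))

# Shallow copy: q is a new Page, but q.mediaBox is the very same Box.
q = copy.copy(p)
shared = q.mediaBox is p.mediaBox
p.mediaBox.upper_right = (514, 860)
leaked = q.mediaBox.upper_right  # changed through p

# Copying the box too gives r a Box of its own.
r = copy.copy(p)
r.mediaBox = copy.copy(p.mediaBox)
p.mediaBox.upper_right = (514, 1224)
independent = r.mediaBox.upper_right  # unaffected by the last write
```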

0

florian Aug 14 '13 at 10:18


stardt · Accepted Answer · 2011-08-13T00:43:34+0000

Your code assumes that p.mediaBox.lowerLeft is (0, 0), but it is actually (0, 497).

This works for the file you provided:

    #!/usr/bin/env python
    import copy, sys
    from pyPdf import PdfFileWriter, PdfFileReader

    input = PdfFileReader(sys.stdin)
    output = PdfFileWriter()
    for i in range(input.getNumPages()):
        p = input.getPage(i)
        q = copy.copy(p)

        bl = p.mediaBox.lowerLeft
        ur = p.mediaBox.upperRight

        print >> sys.stderr, 'splitting page',i
        print >> sys.stderr, '\tlowerLeft:',p.mediaBox.lowerLeft
        print >> sys.stderr, '\tupperRight:',p.mediaBox.upperRight

        p.mediaBox.upperRight = (ur[0], (bl[1]+ur[1])/2)
        p.mediaBox.lowerLeft = bl

        q.mediaBox.upperRight = ur
        q.mediaBox.lowerLeft = (bl[0], (bl[1]+ur[1])/2)

        if i%2==0:
            output.addPage(q)
            output.addPage(p)
        else:
            output.addPage(p)
            output.addPage(q)

    output.write(sys.stdout)
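To see why the lower-left corner matters, the arithmetic can be worked through with the numbers reported above (upper right at 514, 1224 and lower left at 0, 497). This is only an illustrative calculation; the helper visible_heights is hypothetical and no PDF library is involved.

```python
# Corner values reported in the question and in this answer.
lower_left = (0, 497)
upper_right = (514, 1224)

bottom, top = lower_left[1], upper_right[1]

# The question's script cuts at h/2, measured from y = 0.
naive_cut = top / 2.0
# Cutting at the midpoint of the box accounts for the offset.
midpoint_cut = (bottom + top) / 2.0


def visible_heights(cut):
    """Heights of the visible content below and above the cut."""
    return (cut - bottom, top - cut)


naive = visible_heights(naive_cut)
balanced = visible_heights(midpoint_cut)
```

With these values the cut at h/2 falls at y = 612, leaving 115 units of visible content below it and 612 above, while the midpoint cut at y = 860.5 gives two halves of 363.5 units each.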
