PyPdf ignores newlines in a PDF file

I am trying to extract each pdf page as a string:

    import pyPdf

    pages = []
    pdf = pyPdf.PdfFileReader(file('g-reg-101.pdf', 'rb'))
    for i in range(0, pdf.getNumPages()):
        this_page = pdf.getPage(i).extractText() + "\n"
        this_page = " ".join(this_page.replace(u"\xa0", " ").strip().split())
        pages.append(this_page.encode("ascii", "xmlcharrefreplace"))

    for page in pages:
        print '*' * 80
        print page

But this script ignores newline characters, leaving me with messy strings like "information concerning an individual which, because of name, identifyingnumber, mark or description" (i.e. this should read "identifying number", not "identifyingnumber").

Here is an example of the type of PDF I'm trying to parse.

2 answers

I don't know much about PDF encoding, but I think you can solve your specific problem by modifying pdf.py. In the PageObject.extractText method, you can see what happens:

    def extractText(self):
        [...]
        for operands, operator in content.operations:
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == "T*":
                text += "\n"
            elif operator == "'":
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == '"':
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i

If the operator is Tj or TJ (it is Tj in your example PDF), the text is simply appended and no newline is added. Now, you would not necessarily want to add a newline, at least if I am reading the PDF reference correctly: Tj and TJ are simply the single and multiple show-string operators, and a separator between the strings is not mandatory.

Anyway, if you change this code to something like

    def extractText(self, Tj_sep="", TJ_sep=""):

[...]

            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += Tj_sep
                    text += _text

[...]

            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += TJ_sep
                        text += i

then the default behavior should be the same:

    In [1]: pdf.getPage(1).extractText()[1120:1250]
    Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it if you want:

    In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
    Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

    In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
    Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

Alternatively, you can simply add the delimiters yourself by changing the operands themselves in place, but that might break something else (methods like get_original_bytes make me nervous).

Finally, you do not need to edit pdf.py itself if you would rather not: you could just pull this method out into a standalone function.
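As a rough illustration of pulling the method out into a function, here is a minimal sketch that works on a plain list of (operands, operator) pairs rather than on a real page. The name extract_text, the text_type parameter, and the hand-written ops list are assumptions made for this example and are not part of pyPdf. With pyPdf itself, the operations would presumably come from a content stream obtained the same way extractText obtains it, with TextStringObject passed as text_type; that wiring is not shown here.

```python
def extract_text(operations, Tj_sep="", TJ_sep="", text_type=str):
    """Concatenate the strings shown by (operands, operator) pairs.

    Hypothetical standalone version of the patched method: it mirrors the
    operator handling above, but takes the operations as an argument.
    """
    parts = []
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: prepend the optional separator.
            if isinstance(operands[0], text_type):
                parts.append(Tj_sep + operands[0])
        elif operator == "T*":
            # Move to the next line.
            parts.append("\n")
        elif operator == "'":
            # Move to the next line, then show the string.
            parts.append("\n")
            if isinstance(operands[0], text_type):
                parts.append(operands[0])
        elif operator == '"':
            # The string is the third operand for this operator.
            if isinstance(operands[2], text_type):
                parts.append("\n" + operands[2])
        elif operator == "TJ":
            # Array of strings and numeric spacing adjustments;
            # only the strings are kept.
            for item in operands[0]:
                if isinstance(item, text_type):
                    parts.append(TJ_sep + item)
    return "".join(parts)


# Hand-written stand-in for a page whose line break falls between two
# Tj strings with no whitespace on either side.
ops = [
    (["because of name, identifying"], "Tj"),
    (["number, mark or description"], "Tj"),
]
```

Calling extract_text(ops) reproduces the fused "identifyingnumber", while extract_text(ops, Tj_sep=" ") puts a space before every Tj string, including the first one, so the result may need a strip().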


pyPdf is not really made for this kind of text extraction. Try pdfminer, or use pdftotext or something similar if you don't mind creating another process.
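For the pdftotext route, a minimal sketch might look like the following. It assumes the pdftotext command-line tool (as shipped with Poppler) is installed and on the PATH, and that its output is UTF-8. The helper names pdftotext_command and pdf_to_text are made up for this example.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        # Ask pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    cmd.extend([pdf_path, "-"])
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext in a child process and return the decoded text.

    Assumes the tool is installed and emits UTF-8; undecodable bytes
    are replaced rather than raising an error.
    """
    raw = subprocess.check_output(pdftotext_command(pdf_path, layout))
    return raw.decode("utf-8", "replace")
```

Poppler's pdftotext separates pages with a form feed by default, so pdf_to_text('g-reg-101.pdf').split("\f") should give roughly one string per page, which is what the question was after.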
