I don't know much about PDF encoding, but I think you can solve your specific problem by modifying pdf.py. In the PageObject.extractText method, you can see what happens:
    def extractText(self):
        [...]
        for operands, operator in content.operations:
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == "T*":
                text += "\n"
            elif operator == "'":
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == '"':
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
If the operator is Tj or TJ (it is Tj in your PDF example), the text is simply appended and no newline is added. Now, you don't necessarily want a newline there, at least if I read the PDF reference correctly: Tj and TJ are just the single and multiple show-string operators, and a separator between the strings is optional.
Anyway, if you change this code to something like
    def extractText(self, Tj_sep="", TJ_sep=""):
        [...]
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += Tj_sep
                    text += _text
        [...]
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += TJ_sep
                        text += i
then the default behavior should be the same:
    In [1]: pdf.getPage(1).extractText()[1120:1250]
    Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'
but you can change it if you want:
    In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
    Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '
or
    In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
    Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '
Alternatively, you could add the separators yourself by modifying the operands in place, but that might break something else (methods like get_original_bytes make me nervous).
Finally, you don't have to edit pdf.py itself if you'd rather not: you could just pull this method out into a standalone function.
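To make that last option a bit more concrete, here is a rough sketch of what such a standalone function might look like. The name extract_text_from_operations and the sample data are my own inventions for illustration. I am also assuming that the text operands are str subclasses (which is how I understand TextStringObject to behave on Python 3) and that operators compare equal to plain strings, as in the snippet above. Depending on your pyPdf/PyPDF2 version the operators may be byte strings instead, in which case the comparisons would need adjusting.

```python
def extract_text_from_operations(operations, Tj_sep="", TJ_sep=""):
    """Join the text-showing operands of a content stream.

    `operations` is any iterable of (operands, operator) pairs, shaped like
    the content.operations list that extractText loops over.
    Assumption: text operands are str instances; anything else is skipped.
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: prepend the optional separator.
            if isinstance(operands[0], str):
                text += Tj_sep + operands[0]
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            # Move to next line, then show the string.
            text += "\n"
            if isinstance(operands[0], str):
                text += operands[0]
        elif operator == '"':
            # The string is the third operand for this operator.
            if isinstance(operands[2], str):
                text += "\n" + operands[2]
        elif operator == "TJ":
            # Array of strings and numeric offsets: keep only the strings.
            for item in operands[0]:
                if isinstance(item, str):
                    text += TJ_sep + item
    return text


# Hand-written stand-in for content.operations (made-up data, not from a real PDF).
sample_operations = [
    (["identifying"], "Tj"),
    (["number,"], "Tj"),
    ([["ma", -20, "rk"]], "TJ"),
]

print(repr(extract_text_from_operations(sample_operations)))
print(repr(extract_text_from_operations(sample_operations, Tj_sep=" ")))
```

Note that, just like the modified method, this puts the separator before each string, so a non-empty Tj_sep also shows up at the very start of the result. With a real page you would pass in the operations of its content stream rather than the sample list.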