Programmatically change font color of text in PDF

I am not at all familiar with the PDF specification. Is it possible to manipulate a PDF file directly so that certain blocks of text that I have identified as important are highlighted in a color of my choice? The language of choice is Python.

+7
python fonts pdf
2 answers

It is possible, but not necessarily easy, because the PDF format is so rich. You can find a document describing it in detail here. The first elementary example it gives shows how PDF files display text:

BT /F13 12 Tf 288 720 Td (ABC) Tj ET 

BT and ET are the commands that begin and end a text object; Tf is the command to use the external font resource F13 (which happens to be Helvetica) at size 12; Td is the command to position the cursor at the given coordinates; Tj is the command to write the glyphs for the preceding string. The flavor is somewhat “reverse Polish” (operands first, operator last), and indeed very close to PostScript, one of Adobe’s other contributions to typesetting.
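To make the operand-first layout concrete, here is a minimal sketch that splits the example stream into (operator, operands) pairs and then inserts an `rg` operator (which sets the non-stroking fill color in RGB) right after each BT. The names `parse_ops` and `color_text` are made up for illustration, and the tokenizer assumes a simple, uncompressed stream without nested parentheses, arrays, or dictionaries.

```python
import re

# Tokens: literal strings "(...)", names "/F13", numbers, then operators.
# Assumption: no nested parentheses, arrays, dictionaries, or hex strings.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s()/\[\]<>]+|[-+]?\d*\.?\d+|[A-Za-z*]+")

def parse_ops(stream):
    """Return a list of (operator, operands) pairs from a simple content stream."""
    ops, operands = [], []
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)          # operands pile up first...
        else:
            ops.append((tok, operands))   # ...and the operator consumes them
            operands = []
    return ops

def color_text(stream, r, g, b):
    """Insert an `rg` (non-stroking RGB fill) operator right after each BT."""
    out = []
    for op, operands in parse_ops(stream):
        out.extend(operands + [op])
        if op == "BT":
            out.extend([str(r), str(g), str(b), "rg"])
    return " ".join(out)

stream = "BT /F13 12 Tf 288 720 Td (ABC) Tj ET"
print(parse_ops(stream))
print(color_text(stream, 1, 0, 0))
# BT 1 0 0 rg /F13 12 Tf 288 720 Td (ABC) Tj ET
```

In a real file the content stream is usually compressed and has to be decoded first, and the color set this way stays in effect after ET unless the text object is wrapped in a `q` ... `Q` pair that saves and restores the graphics state.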

The problem is that nothing in the PDF specification says that text which “looks” like it belongs together on the rendered page has to be stored “together” in the file. Because exact coordinates can always be set, a PDF produced by a sophisticated typesetting system may position its text precisely, character by character, by coordinates. Reconstructing the text as words and sentences is therefore not so easy. It is almost as hard as optical character recognition, except that you are given the characters exactly (well, almost ... some alleged “images” may actually be displayed as characters ... ;-).
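A rough sketch of what that reconstruction involves, assuming the glyphs have already been pulled out as hypothetical `(x, y, width, char)` tuples: group them into lines by baseline, sort each line left to right, and start a new word wherever the horizontal gap is large. The function name and both thresholds are illustrative guesses, not values from any library.

```python
def group_glyphs(glyphs, line_tol=2.0, gap=3.0):
    """Group (x, y, width, char) glyphs into lines of words.

    line_tol: max baseline difference for two glyphs to share a line.
    gap: horizontal space beyond which a new word starts.
    Both thresholds are illustrative guesses, in PDF points.
    """
    lines = []  # each entry: [baseline_y, [glyphs]]
    # Top of the page first: PDF y coordinates grow upward.
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - g[1]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g[1], [g]])
    result = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])  # left to right within the line
        words, current, prev_end = [], "", None
        for x, _, w, ch in row:
            if prev_end is not None and x - prev_end > gap:
                words.append(current)
                current = ""
            current += ch
            prev_end = x + w
        words.append(current)
        result.append(" ".join(words))
    return result

glyphs = [
    (110, 700, 6, "i"), (100, 700, 8, "H"),  # "Hi", stored out of order
    (130, 700, 7, "a"), (137, 700, 7, "l"), (144, 700, 7, "l"),
    (100, 686, 7, "o"), (107, 686, 7, "k"),
]
print(group_glyphs(glyphs))
# ['Hi all', 'ok']
```

Fixed thresholds like these break down with mixed font sizes, rotated text, or multi-column layouts, which is part of why the problem is harder than it looks.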

pyPdf is a very simple pure-Python library that is a good starting point for playing around with PDF files. Its “text extraction” function is fairly basic and does nothing but concatenate the arguments of a few text-drawing commands; you will see that this is sufficient for some documents and completely unsuitable for others, but at least it is a start. As distributed, pyPdf does almost nothing with colors, but with some hacking that could be remedied.
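The “concatenate the arguments” approach can be sketched in a few lines. This is not pyPdf's actual code, only an illustration of the idea: find the string operands of the `Tj` and `TJ` text-showing operators and join them in stream order. It assumes an already decompressed stream and ignores font encodings, hex strings, and nested parentheses; `naive_extract` is a made-up name.

```python
import re

# A literal string operand such as (ABC); backslash escapes are tolerated
# but not decoded. Nested parentheses and hex strings <...> are ignored.
STRING = r"\((?:[^()\\]|\\.)*\)"
# Either `(text) Tj` or `[ (te) -20 (xt) ] TJ`.
SHOW = re.compile(r"(%s)\s*Tj|\[((?:[^\]\\]|\\.)*)\]\s*TJ" % STRING)

def naive_extract(stream):
    """Concatenate the string operands of Tj and TJ in stream order."""
    parts = []
    for single, array in SHOW.findall(stream):
        source = single or array
        for s in re.findall(STRING, source):
            parts.append(s[1:-1])  # drop the surrounding parentheses
    return "".join(parts)

stream = (
    "BT /F13 12 Tf 288 720 Td (ABC) Tj ET "
    "BT /F13 12 Tf 288 700 Td [(He) -15 (llo)] TJ ET"
)
print(naive_extract(stream))
# ABCHello
```

Note that the two text objects run together as "ABCHello": nothing in the operands says where a space or line break belongs, which is exactly why this works for some documents and fails for others.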

reportlab, a powerful Python library, is entirely focused on creating new PDF files rather than interpreting or modifying existing ones. At the other extreme, the pure-Python pdfminer library focuses entirely on parsing PDF files; it does some clustering to try to reconstruct text in cases where simpler libraries would be stumped.

I don’t know of an existing library that performs the transformation you want, but it should be feasible to mix and match some of these existing ones to get most of the way there ... good luck!

+11

Highlighting is possible in a PDF file using PDF annotations, but this is not so easy to do from scratch. It is worth checking whether any of the libraries mentioned provides such a facility.
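For orientation, a highlight is stored as an annotation dictionary with subtype `/Highlight`, a bounding `/Rect`, a `/QuadPoints` array outlining the marked text, and a color `/C`. The sketch below only builds the text of such a dictionary; the function name is made up, and the corner order used for `/QuadPoints` (top-left, top-right, bottom-left, bottom-right) is an assumption based on what viewers commonly accept, so check it against the specification and your target viewer.

```python
def highlight_annotation(x0, y0, x1, y1, rgb=(1, 1, 0)):
    """Return the text of a /Highlight annotation dictionary.

    (x0, y0) is the lower-left and (x1, y1) the upper-right corner of the
    text to mark, in default user space units (points).
    """
    # Assumed corner order: top-left, top-right, bottom-left, bottom-right.
    quad = [x0, y1, x1, y1, x0, y0, x1, y0]

    def fmt(nums):
        return " ".join("%g" % n for n in nums)

    return (
        "<< /Type /Annot /Subtype /Highlight"
        " /Rect [%s] /QuadPoints [%s] /C [%s] >>"
        % (fmt([x0, y0, x1, y1]), fmt(quad), fmt(rgb))
    )

print(highlight_annotation(288, 718, 312, 730))
```

The hard part is everything around this: the dictionary has to be written into the file as an object and referenced from the page's `/Annots` array, and you need the coordinates of the text in the first place, which is where a library earns its keep.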

0