It's possible, but not necessarily easy, because the PDF format is so rich. You can find a document describing it in detail here. The first elementary example it gives is how PDF files display text:
BT /F13 12 Tf 288 720 Td (ABC) Tj ET
BT and ET are commands to begin and end a text object; Tf is a command to use the external font resource F13 (which happens to be Helvetica) at size 12; Td is a command to position the cursor at the given coordinates; Tj is a command to write the glyphs for the preceding string. The flavor is somewhat “reverse Polish notation” and, indeed, quite close to PostScript, one of Adobe's other contributions to typesetting.
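To make the postfix flavor concrete, here is a minimal sketch in Python that walks the example line the way a stack machine would: operands are pushed until an operator arrives and consumes them. The names `TOKEN` and `interpret` are my own, and the tokenizer is deliberately simplified (no escaped or nested parentheses, no arrays or dictionaries), so treat it as an illustration and not as a real content-stream parser.

```python
import re

NUMBER = r"[-+]?\d*\.?\d+"
# Simplified token grammar: /Name, (string), number, operator keyword.
# Assumption: no escaped or nested parentheses, no arrays or dictionaries.
TOKEN = re.compile(r"/[^\s()/]+|\([^()]*\)|" + NUMBER + r"|[A-Za-z*']+")


def interpret(stream):
    """Return a list of (operator, operands) pairs from a content stream."""
    stack, events = [], []
    for tok in TOKEN.findall(stream):
        if tok.startswith("/"):
            stack.append(tok[1:])          # name operand, e.g. a font resource
        elif tok.startswith("("):
            stack.append(tok[1:-1])        # string operand
        elif re.fullmatch(NUMBER, tok):
            stack.append(float(tok))       # numeric operand
        else:
            # An operator consumes everything pushed since the last operator.
            events.append((tok, tuple(stack)))
            stack.clear()
    return events


events = interpret("BT /F13 12 Tf 288 720 Td (ABC) Tj ET")
for op, args in events:
    print(op, args)
```

For the example line this yields five events; Tf receives the font name and size, Td the two coordinates, and Tj the string to draw.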
The problem is that nothing in the PDF specification says that text which “looks” like it belongs together on the page, as displayed, must actually “be” together in the file. Since exact coordinates can always be given, a PDF created by a sophisticated typographic layout system may position the text precisely, character by character, by coordinates. Reconstructing the text as words and sentences is therefore not necessarily easy: it is almost as hard as optical character recognition, except that you are given the characters exactly (well, almost... some alleged “images” might actually display as characters... ;-).
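As a rough sketch of what that reconstruction involves, the Python below takes individually positioned glyphs and rebuilds lines from coordinates alone. The tuple layout `(x, y, width, char)`, the function name `rebuild_lines`, and the two tolerances are assumptions chosen for illustration; real documents also need handling for rotation, columns, and varying font sizes.

```python
def rebuild_lines(glyphs, line_tol=2.0, gap=3.0):
    """Group (x, y, width, char) glyphs into text lines, top to bottom.

    line_tol: maximum baseline difference for glyphs on the same line.
    gap: horizontal distance beyond which a space is assumed.
    Both thresholds are illustrative guesses, not values from the PDF spec.
    """
    rows = []
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if rows and abs(rows[-1][0][1] - g[1]) <= line_tol:
            rows[-1].append(g)
        else:
            rows.append([g])

    lines = []
    for row in rows:
        row.sort(key=lambda g: g[0])       # left to right within the line
        text, right = "", None
        for x, _y, w, ch in row:
            if right is not None and x - right > gap:
                text += " "                # a wide gap is read as a word break
            text += ch
            right = x + w
        lines.append(text)
    return lines


# Glyphs in arbitrary drawing order, with slightly jittered baselines.
glyphs = [
    (106.0, 700.0, 6.0, "o"),
    (100.0, 686.0, 6.0, "o"),
    (124.0, 699.9, 6.0, "e"),
    (100.0, 700.2, 6.0, "t"),
    (118.0, 700.0, 6.0, "b"),
    (106.0, 686.0, 6.0, "r"),
]
lines = rebuild_lines(glyphs)
print(lines)
```

The word break between “to” and “be” exists only as a wider horizontal gap, which is why the thresholds matter so much in practice.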
pyPdf is a very simple pure-Python library that is a good starting point for playing around with PDF files. Its “text extraction” function is quite elementary and does nothing but concatenate the arguments of a few text-drawing commands; you will see that this suffices for some documents and is quite unusable for others, but at least it's a start. As distributed, pyPdf does just about nothing with colors, but with some hacking that could be remedied.
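To show why plain concatenation works on some documents and fails on others, here is a small Python sketch of that naive strategy. It is my own approximation of the idea, not pyPdf's actual code: the names `SHOW`, `STRING`, and `naive_extract` are hypothetical, and it only looks at the Tj and TJ operators in an already decoded content stream.

```python
import re

# A (string) followed by Tj, or an [array] followed by TJ.
SHOW = re.compile(r"\(((?:[^()\\]|\\.)*)\)\s*Tj|\[(.*?)\]\s*TJ", re.S)
# Individual (string) pieces inside a TJ array.
STRING = re.compile(r"\(((?:[^()\\]|\\.)*)\)")


def naive_extract(stream):
    """Concatenate the string arguments of Tj/TJ, ignoring all positioning."""
    parts = []
    for single, array in SHOW.findall(stream):
        if array:
            # TJ mixes strings with kerning numbers; keep only the strings.
            parts.append("".join(STRING.findall(array)))
        else:
            parts.append(single)
    return "".join(parts)


# Text drawn in whole strings comes out intact.
easy = "BT /F13 12 Tf 288 720 Td (Hello, ) Tj [(wor) -20 (ld)] TJ ET"
# Text positioned glyph by glyph loses its word break, which lived in Td.
hard = "BT 100 700 Td (t) Tj 6 0 Td (o) Tj 12 0 Td (b) Tj 6 0 Td (e) Tj ET"

easy_text = naive_extract(easy)
hard_text = naive_extract(hard)
print(repr(easy_text), repr(hard_text))
```

In the second stream the space between the words was never a character at all, only a larger Td offset, so a concatenating extractor has no way to recover it.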
reportlab's powerful Python library is focused entirely on creating new PDF files, not on interpreting or modifying existing ones. At the other extreme, the pure-Python library pdfminer focuses entirely on parsing PDF files; it does some clustering to try to reconstruct text in cases where simpler libraries would be stumped.
I don't know of an existing library that performs the transformations you want, but it should be feasible to mix and match some of these existing ones to get most of it done... good luck!
Alex Martelli