PDF text string encoding

I am working on a parser for PDF (text extraction).

When a page's content stream is Flate-encoded (zlib compression), my code is able to unpack it, and I get output (a stream object) like the one shown below:

BT 56.8 721.3 Td /F2 12 Tf [<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ ET 

I'm interested in the string array (the TJ operand).

It seems that this array contains several hexadecimal-encoded strings, but the hexadecimal values do not make sense. Instead, sequences like 010203 appear, as if the data were still LZ77-compressed.

  • Do PDF files have several compression levels?
  • How to get plain text from a string array?

Abhishek

This is far from an easy question, and unfortunately, it shows that you have not read the PDF specification. You have to do it.

You can download the Acrobat SDK here: http://www.adobe.com/devnet/acrobat/sdk/eula.html

Part of it is the PDF specification, a very impressive document that explains the ins and outs of PDF (including the answer to your question).

In short, and not as a substitute for reading the documentation: what you are looking for is the character encoding of the font selected by the /F2 12 Tf command, which sets the font used when text is written later on.


Before starting such an ambitious project, you should familiarize yourself with the full official PDF-1.7 specification. Be warned: it is a 756-page document, and it refers to 90 other documents, which it also declares “normative” for PDF.

You will learn that in order to turn PDF source code back into text content, you must reverse the encoding used by the font. There are 5 standard encodings:

  • StandardEncoding
  • MacRomanEncoding
  • WinAnsiEncoding
  • PDFDocEncoding
  • MacExpertEncoding
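
For two of these encodings, reversing the encoding can be sketched with codecs from the Python standard library. This is only an approximation under stated assumptions: the helper name decode_simple is made up for illustration, cp1252 and mac_roman are close to (not identical with) the tables in the PDF specification, any /Differences array in the font is ignored, and the other three encodings have no standard-library codec, so they would need hand-written tables.

```python
# Approximate mapping from PDF encoding names to Python codecs.
# Assumption: cp1252 ~ WinAnsiEncoding and mac_roman ~ MacRomanEncoding;
# the PDF tables differ from these codecs in a few code points.
_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}


def decode_simple(raw: bytes, pdf_encoding: str) -> str:
    """Decode one-byte character codes using an approximate codec."""
    codec = _CODECS[pdf_encoding]  # KeyError for encodings not covered here
    # Undefined bytes become U+FFFD instead of raising an exception.
    return raw.decode(codec, errors="replace")


print(decode_simple(b"caf\xe9", "WinAnsiEncoding"))
print(decode_simple(b"caf\x8e", "MacRomanEncoding"))
```

Both calls print the same word although the byte for the accented letter differs, which is exactly why the font's encoding has to be known before the codes mean anything.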

In addition, there may also be a custom encoding (which comes into play when the embedded font is a subset that contains only the glyphs required by the document rather than all the glyphs defined by the font). You can only reverse custom-encoded text if a /ToUnicode table is defined for it inside the PDF. Only then can you turn the encoded characters back into character names.

You will also learn that there is not just one operator for showing text strings, but four:

  • Tj : "Show text"
  • TJ : "Show text, allowing individual glyph positioning"
  • ' : "Move to the next line and show text"
  • " : "Set word and character spacing, move to the next line, and show text"
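
To make the role of these operators concrete, here is a rough sketch of scanning an already decompressed content stream for text-showing operators and their operands. It is not a complete content-stream parser: the function name text_showing_ops and the regular expression are my own, nested arrays, nested parentheses, dictionaries and inline images are not handled, and keywords such as true or null would be mistaken for operators.

```python
import re

# The four text-showing operators.
TEXT_SHOWING = {"Tj", "TJ", "'", '"'}

_TOKEN = re.compile(r"""
    \[.*?\]                      # array, e.g. a TJ operand (no nesting)
  | <[0-9A-Fa-f\s]*>             # hexadecimal string
  | \((?:\\.|[^\\()])*\)         # literal string without nested parentheses
  | /[^\s/\[\]<>()]*             # name such as /F2
  | [+-]?(?:\d+\.?\d*|\.\d+)     # number
  | [A-Za-z'"*]+                 # operator
""", re.VERBOSE | re.DOTALL)


def text_showing_ops(stream: str):
    """Return (operator, operands) pairs for text-showing operators only."""
    found, operands = [], []
    for match in _TOKEN.finditer(stream):
        token = match.group()
        if token[0] in "[<(/+-.0123456789":
            operands.append(token)      # operands precede their operator
        else:
            if token in TEXT_SHOWING:
                found.append((token, operands))
            operands = []               # every operator consumes its operands
    return found


stream = ("BT 56.8 721.3 Td /F2 12 Tf "
          "[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ ET")
for op, args in text_showing_ops(stream):
    print(op, args)
```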

In addition, there are three different ways to represent text strings. Here are examples for the word "string":

  • (string) . Standard printable ASCII characters inside parentheses (only possible for the Latin / ASCII parts of a text).
  • (\163\164\162\151\156\147) . Octal character codes (also inside parentheses), as listed in "Appendix D (normative) character sets and encodings" of the specification document.
  • <737472696E67> . Hexadecimal-encoded character codes inside angle brackets.

The problems for the text extractor are as follows:

  • Printable ASCII characters ( 1. above) and octal character codes ( 2. ) can be mixed. All of the following are also "legal" representations of the string "string" (the list is not complete!):

      (\163tring) Tj
      (\163\164\162\151\156g) Tj
      (st\162i\156g) Tj
      ...
  • Hexadecimal character codes ( 3. ) are not straightforward either, since all of the following representations are equivalent, because whitespace inside the angle brackets is ignored:

      <73 74 72 69 6E 67> Tj
      <73 7472 696E67> Tj
      <7 374 7 269 6E 67> Tj
      <73 74 72696E 67> Tj
      <73 74 7 2 69 6E 67> Tj
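
A minimal sketch of normalizing both string forms to raw bytes could look like this. The function names decode_literal and decode_hex are hypothetical, both expect only the text between the delimiters, and the literal decoder assumes the input has already been read as Latin-1 so that every character fits into one byte.

```python
import re

_OCTAL = re.compile(r"[0-7]{1,3}")
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
            "(": "(", ")": ")", "\\": "\\"}


def decode_literal(body: str) -> bytes:
    """Decode the inside of a (...) string, resolving backslash escapes."""
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(ord(body[i]))
            i += 1
            continue
        i += 1                                  # skip the backslash
        if i >= len(body):
            break
        octal = _OCTAL.match(body, i)
        if octal:                               # \ddd with 1 to 3 octal digits
            out.append(int(octal.group(), 8) & 0xFF)
            i = octal.end()
        elif body[i] == "\n":                   # backslash-newline: continuation
            i += 1
        else:                                   # named escape, else drop backslash
            out.append(ord(_ESCAPES.get(body[i], body[i])))
            i += 1
    return bytes(out)


def decode_hex(body: str) -> bytes:
    """Decode the inside of a <...> string; whitespace is ignored."""
    digits = "".join(body.split())
    if len(digits) % 2:                         # odd count: a trailing 0 is assumed
        digits += "0"
    return bytes.fromhex(digits)


for body in (r"\163tring", r"\163\164\162\151\156g", r"st\162i\156g"):
    print(decode_literal(body))
for body in ("73 7472 696E67", "7 374 7 269 6E 67", "73 74 7 2 69 6E 67"):
    print(decode_hex(body))
```

Every line printed should be the same byte string, which is the point: the extractor has to normalize the notation before it can even start looking up character codes.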

For more strangeness allowed by the PDF specification (or tolerated by Adobe viewers), see also:

I myself recently created a small series of hand-coded PDF files that demonstrate the effect of a faulty, incorrect, manipulated, or correct /ToUnicode table on the result of any PDF-to-text conversion:


Finally, looking at the small piece of PDF source code the OP provided:

 BT 56.8 721.3 Td /F2 12 Tf [<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ ET 
  • BT and ET mark the beginning and end of a text-showing section.

  • 56.8 721.3 Td positions the current point at the coordinates "56.8 points in the horizontal direction, 721.3 points in the vertical direction".

  • 12 Tf sets the font size to 12 points.

  • /F2 selects the font to use, which is defined elsewhere in the PDF. That font definition also sets the encoding of the font (and possibly a /ToUnicode table). The font encoding determines which glyph shape is drawn when a specific character code appears in the text strings.

  • [<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ

This last part can be broken down into these parts:

  • <01>2 : <01> is the first character code. 2 is a parameter for the "individual glyph positioning" allowed by the TJ text-showing operator.
  • <0203>2 : <0203> are two more character codes. 2 is again a parameter for the "individual glyph positioning" of TJ .
  • <04>-10 : <04> is the fourth character code. -10 is again for "individual glyph positioning" with TJ .
  • <0503>2 : <05> is the fifth character code, <03> is the third character code (used earlier). 2 is for "individual glyph positioning" ...
  • and so on.
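
The breakdown above can be sketched in code. The following splits the array from the question into byte strings (character codes) and numbers (positioning adjustments). It is an illustration only: the name parse_tj_array is made up, literal (...) strings inside the array are not handled, and it assumes one byte per character code, as the breakdown above does.

```python
import re

# Either a hex string <...> or a number; anything else is skipped.
_ELEMENT = re.compile(r"<([0-9A-Fa-f\s]*)>|([+-]?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(array: str):
    """Return the array elements as bytes (codes) and floats (adjustments)."""
    items = []
    for hex_body, number in _ELEMENT.findall(array):
        if number:
            items.append(float(number))
        else:
            digits = "".join(hex_body.split())
            if len(digits) % 2:            # odd digit count: pad with 0
                digits += "0"
            items.append(bytes.fromhex(digits))
    return items


array = "[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]"
items = parse_tj_array(array)
codes = b"".join(item for item in items if isinstance(item, bytes))
print(items)
print(codes.hex(" "))
```

The joined codes are what has to be looked up in the font's encoding; the numbers only affect layout, although a text extractor may want to treat large negative values as a word gap.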

Individual glyph positioning . Individual glyph positioning works as follows:

  • Positive numbers move the next glyph to the left (reducing the gap to the preceding glyph).
  • Negative numbers move the next glyph to the right (widening the gap to the preceding glyph).
  • The numbers are expressed in thousandths of a unit of text space.
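
As a worked example of these rules, the shift caused by one number in the array can be computed as follows. This is a sketch under assumptions: the function name tj_offset is mine, it covers horizontal writing only, and it leaves out the glyph width as well as character and word spacing, which also contribute to the full displacement.

```python
def tj_offset(adjustment: float, font_size: float, h_scale: float = 1.0) -> float:
    """Horizontal shift, in scaled text space units, caused by one TJ number.

    The number is subtracted from the current horizontal position, hence
    the minus sign: positive values shift left, negative values shift right.
    h_scale is the horizontal scaling factor (1.0 means 100%).
    """
    return -adjustment / 1000.0 * font_size * h_scale


# With the 12 point font from the question:
print(tj_offset(2, 12))     # small shift to the left
print(tj_offset(-10, 12))   # larger shift to the right
```

So the values in the question (2, -10, -2, 1) are tiny kerning corrections of a fraction of a point, not word spaces.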

The meaning of character codes . To find out what the first, second, third, ... last character codes mean, you have to look them up in the /ToUnicode table of your PDF file. If the PDF does not embed such a table, you are out of luck!
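
Such a lookup can be sketched roughly like this. Note the assumptions: the CMap text below is invented for illustration and is not taken from the OP's file, only beginbfchar sections are read (beginbfrange is ignored), the code length is passed in by hand instead of being derived from the codespace range, and parse_bfchar and decode_codes are hypothetical names.

```python
import re

# Invented sample; a real /ToUnicode stream must first be located via the
# font dictionary and decompressed like any other stream.
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
endbfchar
"""

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict:
    """Map character codes (bytes) to Unicode strings from bfchar sections."""
    mapping = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _PAIR.findall(block):
            # Destination values are UTF-16BE encoded.
            mapping[bytes.fromhex(src)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes: bytes, mapping: dict, code_len: int = 1) -> str:
    """Translate codes to text; unmapped codes become U+FFFD."""
    chunks = (codes[i:i + code_len] for i in range(0, len(codes), code_len))
    return "".join(mapping.get(chunk, "\ufffd") for chunk in chunks)


mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_codes(b"\x01\x02\x03\x03", mapping))
```

With a real table in place of the sample, the same lookup applied to the codes 01 02 03 04 ... from the question would yield the actual text.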

Checking for easily extractable text . To check whether text can easily be extracted from a PDF file, you can use the pdffonts command-line tool. Here is some example output:

 $ pdffonts sample.pdf
 name                      type          encoding     emb sub uni object ID
 ------------------------- ------------- ------------ --- --- --- ---------
 IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
 SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0

In the above example, the subset font SSKFGJ+ArialMT uses a custom encoding, but the PDF has no /ToUnicode table for this font, as the uni column indicates. Therefore it is not easy to extract the text shown with this font (it would take manual reverse engineering, but then you could just as well “read” the PDF pages yourself).
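
If you want to automate this check, a small sketch like the following can flag the problematic fonts in captured pdffonts output. It works on the example output from above as text, and it assumes that font names and encoding names contain no spaces so each row can be split from the right; parse_pdffonts is a hypothetical helper, not part of any library.

```python
SAMPLE_OUTPUT = """\
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
"""


def parse_pdffonts(output: str):
    """Turn the rows below the dashed rule into dictionaries."""
    rows, seen_rule = [], False
    for line in output.splitlines():
        if not seen_rule:
            seen_rule = line.strip().startswith("---")
            continue
        tokens = line.split()
        if len(tokens) < 8:           # name, type, and six trailing fields
            continue
        rows.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),   # the type may contain spaces
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
            "object_id": " ".join(tokens[-2:]),
        })
    return rows


rows = parse_pdffonts(SAMPLE_OUTPUT)
hard_fonts = [row["name"] for row in rows if row["uni"] == "no"]
print(hard_fonts)
```

Fonts listed in hard_fonts have no /ToUnicode table, so text set in them cannot be mapped back to Unicode by a simple table lookup.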
