Java PDFBox - Read and modify PDF with special characters (diacritics)

I am trying to modify a PDF using the method from this thread (the first block of code there uses PDFStreamParser, iterates through the PDFOperator tokens, and updates each COSString when necessary):

http://www.coderanch.com/t/556009/open-source/PdfBox-Replace-String-double-pdf
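For context, the approach from that thread looks roughly like the sketch below, written against the PDFBox 1.x API (class and method names are from that API; treat the details as a sketch rather than tested code). It naively assumes the COSString bytes decode as plain text, which is exactly the assumption that breaks with subset-encoded fonts:

```java
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

public class ReplaceStringSketch {
    // Parse a page's content stream into tokens, swap the text of matching
    // COSString operands, and write the token list back to the page.
    static void replace(PDDocument doc, PDPage page, String search, String replacement)
            throws Exception {
        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int i = 0; i < tokens.size(); i++) {
            Object token = tokens.get(i);
            if (token instanceof COSString) {
                String text = ((COSString) token).getString();
                if (text.contains(search)) {
                    // Only works when the string's bytes happen to be a
                    // standard encoding; subset fonts defeat this.
                    tokens.set(i, new COSString(text.replace(search, replacement)));
                }
            }
        }
        PDStream updated = new PDStream(doc);
        OutputStream out = updated.createOutputStream();
        new ContentStreamWriter(out).writeTokens(tokens);
        out.close();
        page.setContents(updated);
    }
}
```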

I have a problem with some UTF-8 characters (diacritics): when I print the text I want to update, it looks like "Societ??ii Na?ionale" (where each "?" is a character code such as 0002 or 0004).

Fun stuff:

  • When I write the updated pdf file, the characters are displayed correctly (although I could not detect and replace them)
  • If I extract the text using PDFTextStripper.getText(...), the text comes out perfectly.
  • I tried two versions of PDFBox: 1.5.0 (which behaves as described above) and 1.8.1 (where the final, written pdf file does not display the special characters, and stray empty lines appear in the document)
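The "?" codes like 0002 and 0004 are consistent with a subset-embedded font: the string's bytes are glyph codes private to that font, not text, so printing them raw yields unprintable control characters even though a PDF viewer renders them fine. A minimal pure-Java illustration of the symptom (the byte values here are hypothetical, not taken from the actual PDF):

```java
import java.nio.charset.StandardCharsets;

public class SubsetCodes {
    public static void main(String[] args) {
        // 'S' plus two subset-encoded glyph codes, as a content-stream string
        // might contain them for a re-indexed embedded font.
        byte[] encoded = {0x53, 0x02, 0x04};
        String raw = new String(encoded, StandardCharsets.ISO_8859_1);
        for (char c : raw.toCharArray()) {
            // U+0002 and U+0004 are ISO control characters: consoles and
            // naive string matching render them as "?" or garbage.
            System.out.printf("U+%04X printable=%b%n", (int) c, !Character.isISOControl(c));
        }
    }
}
```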

What can I do (or configure) in the classes used to update the pdf (or at least attempt to) so that all the UTF-8 characters are handled correctly?

EDIT:

Screenshot: (image not reproduced here)

EDIT 2:

I searched through the pdfbox source code in PDFTextStripper and its superclass, and I found out how the text was extracted:

At the beginning of the processStream method, we have

graphicsState = new PDGraphicsState(aPage.findCropBox()); 

and when processing text in processEncodedText, an instance of the PDFont class is obtained as follows:

 final PDFont font = graphicsState.getTextState().getFont(); 

and the text is decoded from the byte[] with:

 String c = font.encode( string, i, codeLength ); 

The new problem is that when I set up a PDFont with the same two lines of code, the font comes back null, and therefore I cannot call its encode(...) method. The source code for these classes is here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFStreamEngine.java and http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.5.0/org/apache/pdfbox/util/PDFTextStripper.java

Now I'm digging more ...

2 answers

Finally, it seems that the process of extracting fonts from a pdf file is quite complicated. I could not instantiate the fonts explicitly, so I looked inside the PDFStreamEngine code and the classes that extend OperatorProcessor, and found how the PDFont objects are created and stored in a map (I pretty much copied the code snippets I needed in order to extract the diacritics). After that, I used the detected fonts while iterating through parser.getTokens(), calling encode(...) for each character in the string.
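The idea described above can be sketched like this against the PDFBox 1.x API (a sketch under assumptions, not the answerer's actual code: it tracks the `Tf` operator to know the current font, and for simplicity assumes one byte per character code, which does not hold for CID fonts):

```java
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.util.PDFOperator;

public class DecodeWithFonts {
    static void dump(PDPage page) throws Exception {
        // The fonts the page's resources declare, keyed by name (e.g. "F1").
        Map<String, PDFont> fonts = page.findResources().getFonts();
        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List<Object> tokens = parser.getTokens();
        PDFont currentFont = null;
        for (int i = 0; i < tokens.size(); i++) {
            Object token = tokens.get(i);
            if (!(token instanceof PDFOperator)) continue;
            String op = ((PDFOperator) token).getOperation();
            if ("Tf".equals(op) && i >= 2) {
                // "/F1 12 Tf" -- the font name is two operands back.
                COSName fontName = (COSName) tokens.get(i - 2);
                currentFont = fonts.get(fontName.getName());
            } else if ("Tj".equals(op) && currentFont != null) {
                // Decode the string operand byte by byte with the font.
                byte[] bytes = ((COSString) tokens.get(i - 1)).getBytes();
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < bytes.length; j++) {
                    String decoded = currentFont.encode(bytes, j, 1);
                    if (decoded != null) text.append(decoded);
                }
                System.out.println(text);
            }
        }
    }
}
```

A fuller version would also handle the `TJ` (array) and `'`/`"` operators, and ask the font for the code length instead of assuming one byte.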


You cannot just replace text in strings. I don't say that lightly. I worked on Acrobat many years ago, on the text search tool in the initial version, so I have a fairly deep understanding of the text encoding issues. The main problem is that every string in a PDF is encoded in some way. This is because PDF was made before Unicode was generally available and has its history in PostScript. PostScript favored very flexible encoding methods for fonts and encouraged re-encoding.

So, let's take a step back and understand the whole picture.

A character in a string in a PDF that is meant to be shown by a text operator is, by default, encoded as a sequence of 8-bit characters. To determine which glyph is drawn for each byte, the byte is pushed through the encoding vector for that font. The encoding vector maps the byte to a glyph name, which is then looked up in the font and drawn on the page. Keep in mind that this description is a half-truth (more on that later).
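A toy model of that lookup, assuming a tiny fragment of a WinAnsi-like encoding (the map below is illustrative; real encoding vectors live in the font program or the PDF's /Encoding entry):

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingVector {
    // Byte code -> glyph name, as a simple-font encoding vector does.
    static Map<Integer, String> winAnsiFragment() {
        Map<Integer, String> enc = new HashMap<>();
        enc.put(0x41, "A");
        enc.put(0x61, "a");
        enc.put(0xC4, "Adieresis"); // WinAnsi maps 0xC4 to A-with-diaeresis
        return enc;
    }

    public static void main(String[] args) {
        Map<Integer, String> enc = winAnsiFragment();
        byte[] stringBytes = {(byte) 0xC4, 0x61};
        for (byte b : stringBytes) {
            // Mask to 0..255; unmapped codes fall back to the .notdef glyph.
            System.out.println(enc.getOrDefault(b & 0xFF, ".notdef"));
        }
        // prints "Adieresis" then "a"
    }
}
```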

Most applications that generate PDFs are kind and just use a standard encoding such as StandardEncoding or WinAnsiEncoding, most of which are fairly reasonable. Others will use a standard encoding along with a delta encoding, which lists the differences from the standard encoding.

Some applications try to be more economical in the PDF they create, so they look at the glyphs actually used and decide to embed only a subset of the font. If they use just upper- and lower-case letters and numbers, they repack the font without the unused glyphs, and they may also re-index it and supply an encoding vector, so byte 0x00 goes to the glyph 'a', 0x01 goes to the glyph 'b', and so on.

Now back to the half-truth. There is a class of fonts that are encoded with a character identifier (CID), and TrueType and OpenType fonts fall into this category. In this case you get access to Unicode, but again there is an encoding step in which the string, now UTF-16BE, maps to a CID, which is then used to get the glyph from the font. And for no particularly good reason, Adobe uses a PostScript function for that mapping. Again, this is about three-quarters true, because there are yet other encodings used for Chinese, Japanese, and Korean fonts.
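To make the UTF-16BE step concrete: each character becomes a 16-bit code unit (two bytes), and it is those units that the CMap then maps to CIDs. A quick pure-Java check of what the bytes look like for A-with-diaeresis:

```java
import java.nio.charset.StandardCharsets;

public class Utf16beCodes {
    public static void main(String[] args) {
        // U+00C4 (A with diaeresis) as UTF-16BE: two bytes per code unit.
        byte[] units = "\u00C4".getBytes(StandardCharsets.UTF_16BE);
        System.out.printf("%02X %02X%n", units[0] & 0xFF, units[1] & 0xFF);
        // prints "00 C4" -- the CMap would map this code unit to a CID
    }
}
```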

So, before you blithely place a character into a string for a PDF font, you should ask a few questions:

  • Is my glyph in the font?
  • Is my glyph encoded?
  • What is the encoding of my glyph?

And any of them may differ from what you expect. So, for example, if you want to put in Ä (A with diaeresis), you should check whether the font has that glyph (it may not, because the font is a subset). Then the font may have an odd encoding that doesn't include the glyph. Finally, the actual byte value to use for Ä may not be the standard one.

So when I see someone trying to just replace a piece of PDF text, all I see is a world of pain. For most run-of-the-mill PDFs it will work in, say, 90% of cases, but for anything exotic, good luck. The vagaries of PDF text encoding are painful enough that it is sometimes easier to think of PDF as a write-only format.

