How to Identify Artificial Bold, Artificial Italic, and Artificial Outline Text Styles Using PDFBox

Question

I use PDFBox to validate a PDF document. One specific requirement is to check for the following kinds of text in the PDF:

Artificial bold text
Artificial italic text
Artificial outline text

I searched the PDFBox API but could not find anything that covers this.

Can someone tell me how to identify these different kinds of artificial font / text styles in a PDF using PDFBox?

pdf pdfbox detect font-size

Krishnendu Jan 2 '14 at 7:17

2 answers

My solution to this problem was to create a new class extending the PDFTextStripper class and overriding the function:

getCharactersByArticle()

Note: PDFBox version 1.8.5

Class CustomPDFTextStripper

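    // PDFBox 1.8.x: PDFTextStripper and TextPosition live in org.apache.pdfbox.util.
    public class CustomPDFTextStripper extends PDFTextStripper {
        public CustomPDFTextStripper() throws IOException {
            super();
        }

        // Expose the collected TextPosition lists to callers.
        public Vector<List<TextPosition>> getCharactersByArticle() {
            return charactersByArticle;
        }
    }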

That way, I can parse the PDF document and then get the TextPosition from the custom extraction function:

    // "pdf" is a field holding the path of the PDF file to parse.
    private void extractTextPosition() throws FileNotFoundException, IOException {
        PDFParser parser = new PDFParser(new FileInputStream(pdf));
        parser.parse();
        PDDocument document = parser.getPDDocument();
        try {
            StringWriter outString = new StringWriter();
            CustomPDFTextStripper stripper = new CustomPDFTextStripper();
            stripper.writeText(document, outString);
            // One list of TextPosition objects per article.
            for (List<TextPosition> tplist : stripper.getCharactersByArticle()) {
                for (TextPosition text : tplist) {
                    System.out.println(" String "
                            + "[x: " + text.getXDirAdj()
                            + ", y: " + text.getY()
                            + ", height:" + text.getHeightDir()
                            + ", space: " + text.getWidthOfSpace()
                            + ", width: " + text.getWidthDirAdj()
                            + ", yScale: " + text.getYScale() + "]"
                            + text.getCharacter());
                }
            }
        } finally {
            // Release the resources held by the parsed document.
            document.close();
        }
    }

The TextPosition objects contain a wealth of information about the characters of a PDF document.

OUTPUT:

String [x: 168.24, y: 64.15997, height: 6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664] J
String [x: 171.69745, y: 64.15997, height: 6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664] N
String [x: 176.25777, y: 64.15997, height: 6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664] N
String [x: 182.73778, y: 64.15997, height: 4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664] e .....

yeaaaahhhh..hamf hamf Nov 18 '14 at 13:55

mkl · Accepted Answer · 2014-01-04T18:24:54+0000

General procedure and a PDFBox issue

In theory, you should start by deriving a class from PDFTextStripper and overriding this method:

    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text);
    }

In your override, work with the List<TextPosition> textPositions rather than the String text; each TextPosition essentially represents a single letter together with information about the graphics state that was active when that letter was drawn.

Unfortunately, the textPositions list does not contain the correct content in the current version 1.8.3. For example, for the line "This is normal text." from your PDF, the writeString method is called four times, once each for the strings "This", "is", "normal" and "text." Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, "text."

In fact, this has already been recognized as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.

That being said, once you have a PDFBox version containing that fix, you can check for some of the artificial styles as follows:

Artificial Italic Text

This text style is created like this in the page content:

 BT /F0 1 Tf 24 0 5.10137 24 66 695.5877 Tm 0 Tr [<03>]TJ ...

The relevant part is the setting of the text matrix with Tm. The value 5.10137 is the factor by which the text is sheared.

When you inspect a TextPosition textPosition as described above, you can query this value using

 textPosition.getTextPos().getValue(1, 0)

If this value is greater than 0.0, you have artificial italics. If it is less than 0.0, you have an artificial backward slant.
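
As a rough illustration of that check, the following sketch turns the shear entry into a slant angle and classifies it. The class name ShearCheck, its methods and the tolerance parameter are made up for this example and are not part of PDFBox. It also assumes unrotated text with a positive vertical scale, because a rotated text matrix has a non-zero shear entry as well.

```java
/** Hypothetical helper, not part of PDFBox: classifies the slant implied by a text matrix. */
final class ShearCheck {

    /**
     * Slant angle in degrees for a text matrix "a b c d e f Tm".
     * c is the shear entry (what getValue(1, 0) returns), d the vertical scale.
     * Assumption: the text is not rotated (b == 0) and d > 0.
     */
    static double slantDegrees(double c, double d) {
        return Math.toDegrees(Math.atan2(c, d));
    }

    /** Treats angles within +/- toleranceDegrees as upright to absorb rounding noise. */
    static String classify(double c, double d, double toleranceDegrees) {
        double angle = slantDegrees(c, d);
        if (angle > toleranceDegrees) {
            return "artificial italic";
        }
        if (angle < -toleranceDegrees) {
            return "artificial backward slant";
        }
        return "upright";
    }

    public static void main(String[] args) {
        // Values taken from "24 0 5.10137 24 66 695.5877 Tm" in the sample above.
        System.out.println(classify(5.10137, 24, 0.5));
    }
}
```

Working with an angle instead of the raw matrix value makes the threshold independent of the font size, since c and d scale together.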

Artificial bold or outline text

These artificial styles print each letter twice, using different rendering modes; for example, the capital "T" in the case of bold:

 0 0 0 1 k ... BT /F0 1 Tf 24 0 0 24 66.36 729.86 Tm <03>Tj 4 M 0.72 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET

(i.e., the letter is first drawn in regular mode, filling the letter area, and then drawn again in outline mode, stroking a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

and in the case of outline:

 BT /F0 1 Tf 24 0 0 24 66 661.75 Tm 0 0 0 0 k <03>Tj /GS1 gs 4 M 0.288 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET

(i.e., the letter is first drawn in regular mode in white, CMYK 0, 0, 0, 0, filling the letter area, and then drawn again in outline mode, stroking a line along the letter border in black, CMYK 0, 0, 0, 1; this leaves the impression of a black outline on a white background.)
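
Both samples switch modes with the Tr operator (0 Tr before the first glyph, 1 Tr before the second). For reference, here is a small sketch of the eight text rendering modes defined by the PDF specification; the enum and its method names are invented for this example and do not exist in PDFBox.

```java
/** Hypothetical enum, not part of PDFBox: the operand values 0 to 7 of the Tr operator. */
enum TextRenderingMode {
    FILL,                       // 0
    STROKE,                     // 1
    FILL_THEN_STROKE,           // 2
    INVISIBLE,                  // 3
    FILL_AND_CLIP,              // 4
    STROKE_AND_CLIP,            // 5
    FILL_THEN_STROKE_AND_CLIP,  // 6
    CLIP;                       // 7

    /** Maps a Tr operand to its mode; throws ArrayIndexOutOfBoundsException outside 0..7. */
    static TextRenderingMode fromOperand(int tr) {
        return values()[tr];
    }

    /** True if glyph interiors are painted with the non-stroking color. */
    boolean fills() {
        return this == FILL || this == FILL_THEN_STROKE
                || this == FILL_AND_CLIP || this == FILL_THEN_STROKE_AND_CLIP;
    }

    /** True if glyph outlines are painted with the stroking color. */
    boolean strokes() {
        return this == STROKE || this == FILL_THEN_STROKE
                || this == STROKE_AND_CLIP || this == FILL_THEN_STROKE_AND_CLIP;
    }
}
```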

Unfortunately, PDFBox PDFTextStripper does not track the text rendering mode. In addition, it explicitly omits duplicate characters at approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.

If you really need to recognize them, you will have to change TextPosition so that it also contains the rendering mode, PDFStreamEngine so that it adds the rendering mode to the generated TextPosition, and PDFTextStripper so that it does not drop duplicate glyphs in processTextPosition.

Corrections

I wrote

Unfortunately, PDFBox PDFTextStripper does not track the text rendering mode.

This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you have the rendering mode available and can store the rendering mode (and colors!) for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

In addition, it explicitly omits duplicate characters at approximately the same position.

You can disable this by calling setSuppressDuplicateOverlappingText(false).

With these two changes, you can also make the tests required to check for artificial bold and outline.

The latter change may not even be necessary if you store and check the styles at the beginning of processTextPosition.

How to get the rendering mode and color

As mentioned in Corrections, you can actually get the rendering mode and color information by collecting it in a processTextPosition override.

Concerning this, the OP commented that

The stroking and non-stroking color always comes out as black

At first this was a bit unexpected, but after looking at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:

    # The following operators are not relevant to text extraction,
    # so we can silently ignore them.
    ...
    K
    k

Thus, color setting operators are ignored in this context (in particular those for CMYK colors, as used in this document)! Fortunately, the implementations of these operators for PageDrawer can also be used in this context.

Thus, the following proof of concept shows how you can get all the necessary information.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class TextWithStateStripperSimple extends PDFTextStripper {
        public TextWithStateStripperSimple() throws IOException {
            super();
            // Keep both glyphs of a fill/stroke pair instead of dropping the duplicate.
            setSuppressDuplicateOverlappingText(false);
            // Re-enable the CMYK color operators that text extraction ignores by default.
            registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
            registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
        }

        @Override
        protected void processTextPosition(TextPosition text) {
            // Record the graphics state in effect when this glyph was drawn.
            renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
            strokingColor.put(text, getGraphicsState().getStrokingColor());
            nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());
            super.processTextPosition(text);
        }

        Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
        Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
        Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            writeString(text + '\n');
            // One line per glyph: character, shear, position, rendering mode, colors.
            for (TextPosition textPosition : textPositions) {
                StringBuilder textBuilder = new StringBuilder();
                textBuilder.append(textPosition.getCharacter())
                           .append(" - shear by ")
                           .append(textPosition.getTextPos().getValue(1, 0))
                           .append(" - ")
                           .append(textPosition.getX())
                           .append(" ")
                           .append(textPosition.getY())
                           .append(" - ")
                           .append(renderingMode.get(textPosition))
                           .append(" - ")
                           .append(toString(strokingColor.get(textPosition)))
                           .append(" - ")
                           .append(toString(nonStrokingColor.get(textPosition)))
                           .append('\n');
                writeString(textBuilder.toString());
            }
        }

        String toString(PDColorState colorState) {
            if (colorState == null)
                return "null";
            StringBuilder builder = new StringBuilder();
            for (float f : colorState.getColorSpaceValue()) {
                builder.append(' ')
                       .append(f);
            }
            return builder.toString();
        }
    }

Using this, you get the following for the period '.' in normal text:

 . - shear by 0.0 - 256.5701 88.6875 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0

In artificial bold text you get:

 . - shear by 0.0 - 378.86 122.140015 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
 . - shear by 0.0 - 378.86002 122.140015 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0

In artificial italics:

 . - shear by 5.10137 - 327.121 156.4123 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0

And in artificial outline text:

 . - shear by 0.0 - 357.25 190.25 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
 . - shear by 0.0 - 357.25 190.25 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0

So there you are: all the information required to recognize these artificial styles. Now you just have to analyze the data.
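
One possible way to analyze such a pair of records is sketched below. The class ArtificialStyleClassifier and its Glyph holder are hypothetical and independent of PDFBox. The rule they encode is only a heuristic read off the sample output above: a filled glyph followed by a stroked glyph at the same position counts as bold when the fill and stroke colors match, and as outline when they differ.

```java
import java.util.Arrays;

/** Hypothetical classifier, not part of PDFBox, for two glyph records at the same position. */
final class ArtificialStyleClassifier {

    /** Per-glyph data as printed by the proof of concept: mode and CMYK color components. */
    static final class Glyph {
        final int renderingMode;
        final float[] stroking;
        final float[] nonStroking;

        Glyph(int renderingMode, float[] stroking, float[] nonStroking) {
            this.renderingMode = renderingMode;
            this.stroking = stroking;
            this.nonStroking = nonStroking;
        }
    }

    /** Assumes the caller has already established that both glyphs share one position. */
    static String classify(Glyph first, Glyph second) {
        // Mode 0 fills the glyph, mode 1 strokes its border.
        boolean fillThenStroke = first.renderingMode == 0 && second.renderingMode == 1;
        if (!fillThenStroke) {
            return "unknown";
        }
        // The first glyph is painted with the non-stroking color, the second with the stroking color.
        return Arrays.equals(first.nonStroking, second.stroking)
                ? "artificial bold"
                : "artificial outline";
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        float[] white = {0f, 0f, 0f, 0f};
        System.out.println(classify(new Glyph(0, black, black), new Glyph(1, black, black)));
        System.out.println(classify(new Glyph(0, black, white), new Glyph(1, black, white)));
    }
}
```

Real documents may build these effects differently (other color spaces, a different drawing order), so the rule would need extending for anything beyond the sample file.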

By the way, take a look at the artificial bold case: the coordinates may not always be identical but merely very similar. Thus, some leniency is required when checking whether two TextPosition objects describe the same position.
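
A minimal sketch of such a lenient comparison follows. PositionMatcher and its tolerance parameter are hypothetical names chosen for this example; a suitable tolerance depends on the document and has to be found by experiment.

```java
/** Hypothetical helper, not part of PDFBox: compares glyph positions with a tolerance. */
final class PositionMatcher {

    /** True if both coordinates differ by at most the given tolerance (in the same units). */
    static boolean samePosition(float x1, float y1, float x2, float y2, float tolerance) {
        return Math.abs(x1 - x2) <= tolerance && Math.abs(y1 - y2) <= tolerance;
    }

    public static void main(String[] args) {
        // The two artificial bold records from the output above differ only in the last digit of x.
        System.out.println(samePosition(378.86f, 122.140015f, 378.86002f, 122.140015f, 0.01f));
    }
}
```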
