General procedure and a PDFBox issue
In theory, you should start by deriving a class from PDFTextStripper and overriding its method:
protected void writeString(String text, List<TextPosition> textPositions) throws IOException { writeString(text); }
In your override, you should work with the List<TextPosition> textPositions rather than the String text; each TextPosition essentially represents a single letter together with information on the graphics state that was active when this letter was drawn.
Unfortunately, the textPositions list does not contain the correct content in the current version 1.8.3. For instance, for the line "This is normal text." from your PDF, the writeString method is called four times, once each for the strings "This", "is", "normal" and "text." Yet each time the textPositions list contains the TextPosition instances for the letters of the last string, "text."
In fact, this has already been recognized as PDFBox issue PDFBOX-1804, which in the meantime has been resolved as fixed for versions 1.8.4 and 2.0.0.
That being said, once you have a PDFBox version containing the fix, you can check for some of the artificial styles as follows:
Artificial Italic Text
This text style is created this way in the page content:
BT /F0 1 Tf 24 0 5.10137 24 66 695.5877 Tm 0 Tr [<03>]TJ ...
The relevant part is the setting of the text matrix by Tm. The 5.10137 is the factor by which the text is sheared.
When you inspect a TextPosition textPosition as indicated above, you can query this value using
textPosition.getTextPos().getValue(1, 0)
If this value is noticeably greater than 0.0, you have artificial italics. If it is noticeably less than 0.0, you have artificial backward italics.
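As a rough illustration of this check, the following sketch classifies a glyph from its shear value. The class SlantCheck, the enum Slant and the threshold parameter are hypothetical names, not PDFBox API. It also assumes that getTextPos().getValue(1, 1) holds the vertical scale (24 in the Tm operands above), so that the shear can be made independent of the font size; that is an assumption you should verify against your documents.

```java
// Sketch only: SlantCheck, Slant and the threshold are illustrative, not part of PDFBox.
final class SlantCheck {
    enum Slant { UPRIGHT, ITALIC, BACKWARD_ITALIC }

    /**
     * Classifies a glyph from two text matrix entries:
     * shear         = textPosition.getTextPos().getValue(1, 0)
     * verticalScale = textPosition.getTextPos().getValue(1, 1)  (assumption)
     */
    static Slant classify(float shear, float verticalScale, float threshold) {
        // Divide by the vertical scale so the result does not depend on the font size.
        float ratio = verticalScale == 0f ? 0f : shear / verticalScale;
        if (ratio > threshold) {
            return Slant.ITALIC;
        }
        if (ratio < -threshold) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.UPRIGHT;
    }

    /** Slant angle in degrees, measured from the vertical. */
    static double angleDegrees(float shear, float verticalScale) {
        return Math.toDegrees(Math.atan2(shear, verticalScale));
    }
}
```

With the operands from the sample above, SlantCheck.angleDegrees(5.10137f, 24f) comes out at roughly 12 degrees, a typical slant for synthesized italics. A threshold such as 0.05 is only a guess; pick one that suits your input.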
Artificial bold or outline text
These artificial styles print each letter twice, using different rendering modes; for example, the capital "T", in the case of bold:
0 0 0 1 k ... BT /F0 1 Tf 24 0 0 24 66.36 729.86 Tm <03>Tj 4 M 0.72 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET
(i.e., first the letter is drawn in regular mode, filling the letter area, and then it is drawn in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)
and in the case of outline:
BT /F0 1 Tf 24 0 0 24 66 661.75 Tm 0 0 0 0 k <03>Tj /GS1 gs 4 M 0.288 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET
(i.e., first the letter is drawn in regular mode in white, CMYK 0, 0, 0, 0, filling the letter area, and then it is drawn in outline mode, drawing a line along the letter border in black, CMYK 0, 0, 0, 1; this leaves the impression of a black outline on a white background.)
Unfortunately, PDFBox PDFTextStripper does not track the text rendering mode. In addition, it explicitly omits duplicate characters at approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.
If you really need to do this, you will have to change TextPosition so that it also contains the rendering mode, PDFStreamEngine so that it adds the rendering mode to the generated TextPosition instances, and PDFTextStripper so that it does not drop duplicate glyphs in processTextPosition.
Corrections
I wrote
Unfortunately, PDFBox PDFTextStripper does not track the text rendering mode.
This is not entirely true; you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you have the rendering mode available and can try to store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.
In addition, it explicitly omits duplicate characters at approximately the same position.
You can disable this by calling setSuppressDuplicateOverlappingText(false) .
With these two changes, you should be able to make the necessary tests to check for artificial bold and outline, too.
The latter change may not even be necessary if you store and check the styles early in processTextPosition.
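The tests for artificial bold and outline could look roughly like the following sketch. It works on plain values (rendering modes and CMYK component arrays) rather than on PDFBox objects; the class DoublePrintStyle and its members are hypothetical names. It also assumes a white page background and CMYK colors, as in the sample content streams above; other color spaces would need their own notion of "white".

```java
import java.util.Arrays;

// Sketch only: DoublePrintStyle and its members are illustrative, not part of PDFBox.
final class DoublePrintStyle {
    enum Style { BOLD, OUTLINE, UNKNOWN }

    static final int FILL = 0;    // text rendering mode 0: fill the glyph area
    static final int STROKE = 1;  // text rendering mode 1: stroke the glyph border

    /**
     * Classifies a glyph that was printed twice at the same position.
     * fillColor is the non-stroking color of the first (fill) pass,
     * strokeColor is the stroking color of the second (stroke) pass.
     */
    static Style classify(int firstMode, float[] fillColor, int secondMode, float[] strokeColor) {
        if (firstMode != FILL || secondMode != STROKE || fillColor == null || strokeColor == null) {
            return Style.UNKNOWN;
        }
        // Same color for area and border: the border merely thickens the letter.
        if (Arrays.equals(fillColor, strokeColor)) {
            return Style.BOLD;
        }
        // Assumption: white background, so a white fill leaves only the border visible.
        if (isCmykWhite(fillColor)) {
            return Style.OUTLINE;
        }
        return Style.UNKNOWN;
    }

    static boolean isCmykWhite(float[] cmyk) {
        if (cmyk == null || cmyk.length != 4) {
            return false;
        }
        for (float component : cmyk) {
            if (component != 0f) {
                return false;
            }
        }
        return true;
    }
}
```

The exact comparison of color components is deliberately simple; if your documents use slightly differing values for fill and border, a small tolerance would be more robust.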
How to get the rendering mode and color
As mentioned in Corrections, you can actually get the rendering mode and color information by collecting it in a processTextPosition override.
To this, the OP commented that
the stroking and non-stroking color always comes out as black
At first it was a bit unexpected, but after looking at PDFTextStripper.properties (from which the operators supported when extracting text were initialized), the reason became clear:
Thus, color-setting operators are ignored in this context (especially those for CMYK colors, as used in this document)! Fortunately, the implementations of these operators for the PageDrawer can be used in this context, too.
Thus, the following proof of concept shows how you can retrieve all the required information.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class TextWithStateStripperSimple extends PDFTextStripper {
        public TextWithStateStripperSimple() throws IOException {
            super();
            // Keep glyphs printed twice at the same position (bold, outline).
            setSuppressDuplicateOverlappingText(false);
            // The text stripper ignores the CMYK color operators by default; reuse the PageDrawer ones.
            registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
            registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
        }

        @Override
        protected void processTextPosition(TextPosition text) {
            // Record the graphics state that is active while this glyph is drawn.
            renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
            strokingColor.put(text, getGraphicsState().getStrokingColor());
            nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

            super.processTextPosition(text);
        }

        Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
        Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
        Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            writeString(text + '\n');

            // One output line per glyph: shear, position, rendering mode, stroking and non-stroking color.
            for (TextPosition textPosition : textPositions) {
                StringBuilder textBuilder = new StringBuilder();
                textBuilder.append(textPosition.getCharacter())
                           .append(" - shear by ")
                           .append(textPosition.getTextPos().getValue(1, 0))
                           .append(" - ")
                           .append(textPosition.getX())
                           .append(" ")
                           .append(textPosition.getY())
                           .append(" - ")
                           .append(renderingMode.get(textPosition))
                           .append(" - ")
                           .append(toString(strokingColor.get(textPosition)))
                           .append(" - ")
                           .append(toString(nonStrokingColor.get(textPosition)))
                           .append('\n');
                writeString(textBuilder.toString());
            }
        }

        String toString(PDColorState colorState) {
            if (colorState == null)
                return "null";
            StringBuilder builder = new StringBuilder();
            for (float f : colorState.getColorSpaceValue()) {
                builder.append(' ')
                       .append(f);
            }
            return builder.toString();
        }
    }
Using this, you get for the period '.' in normal text:
. - shear by 0.0 - 256.5701 88.6875 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
In artificial bold text you get:
. - shear by 0.0 - 378.86 122.140015 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
In artificial italics:
. - shear by 5.10137 - 327.121 156.4123 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
And in artificial outline:
. - shear by 0.0 - 357.25 190.25 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
So, there you are: all the information required to recognize these artificial styles. Now you just have to analyze the data.
By the way, take a look at the artificial bold case: the coordinates are not always identical, merely very similar. Thus, some tolerance is required when checking whether two text position objects describe the same position.
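A minimal sketch of such a tolerant comparison follows. The class DuplicateGlyphs, its nested Glyph record and the tolerance parameter are hypothetical; Glyph merely stands in for the values collected per TextPosition (character, getX(), getY(), rendering mode). The sketch also assumes that the two passes of a double-printed letter show up as consecutive entries, as they do in the artificial bold output above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: DuplicateGlyphs and Glyph are illustrative, not part of PDFBox.
final class DuplicateGlyphs {
    /** Minimal stand-in for the data collected per TextPosition. */
    static final class Glyph {
        final String character;
        final float x;
        final float y;
        final int renderingMode;

        Glyph(String character, float x, float y, int renderingMode) {
            this.character = character;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
        }
    }

    /** True if both glyphs lie within the given tolerance of each other on both axes. */
    static boolean samePosition(Glyph a, Glyph b, float tolerance) {
        return Math.abs(a.x - b.x) <= tolerance && Math.abs(a.y - b.y) <= tolerance;
    }

    /** Returns pairs of consecutive, identical characters printed at nearly the same position. */
    static List<Glyph[]> findDoublePrints(List<Glyph> glyphs, float tolerance) {
        List<Glyph[]> pairs = new ArrayList<Glyph[]>();
        for (int i = 0; i + 1 < glyphs.size(); i++) {
            Glyph first = glyphs.get(i);
            Glyph second = glyphs.get(i + 1);
            if (first.character.equals(second.character) && samePosition(first, second, tolerance)) {
                pairs.add(new Glyph[] { first, second });
                i++; // the second glyph belongs to this pair, so skip it
            }
        }
        return pairs;
    }
}
```

A tolerance of about 0.01 would already cover the difference between 378.86 and 378.86002 seen above; what value is appropriate in general depends on the font sizes and the PDF producer involved.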