How to get font color using pdfbox

I am trying to extract text together with all of its information from a PDF using PDFBox. I have everything I need except the color. I tried several ways to get the font color (including Retrieving Text Using PDFBox), but none of them worked. I have now copied the code from PDFBox's PageDrawer class, but the RGB value is still wrong.

    protected void processTextPosition(TextPosition text) {
        Composite com;
        Color col;
        switch (this.getGraphicsState().getTextState().getRenderingMode()) {
            case PDTextState.RENDERING_MODE_FILL_TEXT:
                com = this.getGraphicsState().getNonStrokeJavaComposite();
                int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
                int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
                int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
                int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
                float[] cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
                PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
                break;
            case PDTextState.RENDERING_MODE_STROKE_TEXT:
                System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
                System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
                break;
            case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
                // basic support for text rendering mode "invisible"
                Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
                float[] components = {Color.black.getRed(), Color.black.getGreen(), Color.black.getBlue()};
                Color c1 = new Color(nsc.getColorSpace(), components, 0f);
                System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
                break;
            default:
                System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
                System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
        }
    }

With the code above I get r = 0, g = 0 and b = 0. The cosp array contains [0.0], and inside the pd object both array and colorSpace are null. The RGB value is always -16777216. Please help me. Thanks in advance.

+7
4 answers

I tried the code in the link you provided, and it worked for me. The colors I get are 148.92, 179.01001 and 214.965. I wish I could give you my PDF file to work with; maybe I could host it somewhere outside SO? My PDF used a shade of blue, which seems to match those values. It was just one page of text, created in Word 2010 and exported, nothing too intense.

A few suggestions ....

  • Recall that the returned value is a float between 0 and 1. If it is accidentally cast to int, then of course almost all of the values will be 0. The linked code multiplies by 255 to get a range from 0 to 255.
  • As the commenter said, the most common color in a PDF file is black, which is 0 0 0.
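Both points can be checked without PDFBox. The following is only a minimal sketch: the class `ColorComponents` and its helpers `toByte` and `packRgb` are hypothetical names made up for illustration, and the sample components are hand-picked so that scaling by 255 lands near the blue values quoted above; they are not read from any PDF. `packRgb` mirrors the layout that `java.awt.Color.getRGB()` uses (alpha in the top byte, then red, green, blue).

```java
final class ColorComponents {
    /** Scales a component in [0, 1] to an int in [0, 255], clamping out-of-range input. */
    public static int toByte(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    /** Packs 0-255 components into one int with full alpha, like java.awt.Color.getRGB(). */
    public static int packRgb(int r, int g, int b) {
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        // Hypothetical sample values, roughly the blue shade mentioned above.
        float[] blue = {0.584f, 0.702f, 0.843f};

        // Pitfall: a plain cast truncates every component below 1.0 to 0.
        System.out.println("cast:   " + (int) blue[0] + " " + (int) blue[1] + " " + (int) blue[2]);

        // Scaling first keeps the information.
        System.out.println("scaled: " + toByte(blue[0]) + " " + toByte(blue[1]) + " " + toByte(blue[2]));

        // Opaque black packs to 0xFF000000, which is -16777216 as a signed int.
        System.out.println("black:  " + packRgb(0, 0, 0));
    }
}
```

So a packed value of -16777216 is simply opaque black; it says the color read was 0 0 0, not that the packing failed.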

That's all I can think of for now. Otherwise, I have version 1.7.1 of pdfbox and fontbox, and as I said, I pretty much followed the link you gave.

EDIT

Following up on my comments, here is a perhaps slightly invasive way to do this for PDFs such as color.pdf.

In PDFStreamEngine.java, inside the try block of the method that processes each operator (processOperator in the 1.x sources), you can add:

    if (operation.equals("RG")) {
        // stroking color space
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("rg")) {
        // non-stroking color space
        System.out.println(operation);
        System.out.println(arguments);
    } else if (operation.equals("BT")) {
        System.out.println(operation);
    } else if (operation.equals("ET")) {
        System.out.println(operation);
    }

This will show you the information, then it’s up to you to process the color information for each section according to your needs. Below is a snippet from the beginning of the output of the above code when running on color.pdf ...

    BT
    rg
    [COSInt(1), COSInt(0), COSInt(0)]
    RG
    [COSInt(1), COSInt(0), COSInt(0)]
    ET
    BT
    ET
    BT
    rg
    [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
    RG
    [COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
    ET
    ......

In the output above you can see an empty BT ... ET section; this is the section labeled DEVICEGRAY. All the others give you values in [0, 1] for the R, G and B components.
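One possible way to process this per section is to remember the last rg operands seen between each BT and ET. The sketch below works on a hand-written list of events standing in for the logged output; it does not call PDFBox at all, and `TextSectionColors`, `Event` and `colorsPerSection` are hypothetical names for illustration. It also assumes the color is set inside the text section. In a real content stream the color can be set outside BT ... ET or by other operators, which is why a section with no rg is reported as null here rather than as black.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TextSectionColors {
    /** One logged operator plus its numeric operands (empty for BT and ET). */
    static final class Event {
        final String operator;
        final float[] operands;

        Event(String operator, float... operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    /** Returns one entry per BT..ET section: the last rg operands seen in it, or null. */
    public static List<float[]> colorsPerSection(List<Event> events) {
        List<float[]> result = new ArrayList<>();
        float[] current = null;
        boolean inText = false;
        for (Event e : events) {
            switch (e.operator) {
                case "BT":
                    inText = true;
                    current = null;
                    break;
                case "rg":
                    if (inText) {
                        current = e.operands;
                    }
                    break;
                case "ET":
                    if (inText) {
                        result.add(current);
                    }
                    inText = false;
                    break;
                default:
                    // Other operators (RG included) are ignored in this sketch.
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same shape as the logged output above.
        List<Event> events = Arrays.asList(
                new Event("BT"), new Event("rg", 1f, 0f, 0f), new Event("RG", 1f, 0f, 0f), new Event("ET"),
                new Event("BT"), new Event("ET"),
                new Event("BT"), new Event("rg", 0.573f, 0.816f, 0.314f), new Event("ET"));
        for (float[] c : colorsPerSection(events)) {
            System.out.println(c == null ? "no rg in this section" : Arrays.toString(c));
        }
    }
}
```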

+5

I ended up doing something similar. I am pasting the code below; I hope it helps someone.

    import java.io.IOException;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.ResourceLoader;
    import org.apache.pdfbox.util.TextPosition;

    public class Parser extends PDFTextStripper {

        public Parser() throws IOException {
            // Use the PageDrawer operator set so that color operators update the graphics state.
            super(ResourceLoader.loadProperties(
                    "org/apache/pdfbox/resources/PageDrawer.properties", true));
            super.setSortByPosition(true);
        }

        public void parse(String path) throws IOException {
            PDDocument doc = PDDocument.load(path);
            try {
                List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
                for (PDPage page : pages) {
                    this.processStream(page, page.getResources(), page.getContents().getStream());
                }
            } finally {
                doc.close();
            }
        }

        @Override
        protected void processTextPosition(TextPosition text) {
            try {
                // Non-stroking (fill) color in effect when this glyph is drawn.
                PDGraphicsState graphicsState = getGraphicsState();
                System.out.println("R = " + graphicsState.getNonStrokingColor().getJavaColor().getRed());
                System.out.println("G = " + graphicsState.getNonStrokingColor().getJavaColor().getGreen());
                System.out.println("B = " + graphicsState.getNonStrokingColor().getJavaColor().getBlue());
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }

        public static void main(String[] args) throws IOException {
            Parser p = new Parser();
            p.parse("/Users/apple/Desktop/123.pdf");
        }
    }
+5

I found this code in one of the programs I maintain. I do not know whether it will work for you, so please try it. Also check out this link, which may help: http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/common/class-use/PDStream.html

    PDDocument doc = null;
    try {
        doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
        PDFStreamEngine engine = new PDFStreamEngine(
                ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
        PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        engine.processStream(page, page.findResources(), page.getContents().getStream());

        // Graphics state as it stands after the whole page has been processed.
        PDGraphicsState graphicState = engine.getGraphicsState();
        System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
        float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
        for (float c : colorSpaceValues) {
            // Components are in [0, 1]; scale to the 0-255 range.
            System.out.println(c * 255);
        }
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
+3

In PDFBox version 2.0+, you must register these operators in the constructor of your PDFTextStripper subclass:

    addOperator(new SetStrokingColorSpace());
    addOperator(new SetNonStrokingColorSpace());
    addOperator(new SetStrokingDeviceCMYKColor());
    addOperator(new SetNonStrokingDeviceCMYKColor());
    addOperator(new SetNonStrokingDeviceRGBColor());
    addOperator(new SetStrokingDeviceRGBColor());
    addOperator(new SetNonStrokingDeviceGrayColor());
    addOperator(new SetStrokingDeviceGrayColor());
    addOperator(new SetStrokingColor());
    addOperator(new SetStrokingColorN());
    addOperator(new SetNonStrokingColor());
    addOperator(new SetNonStrokingColorN());

Only then will getGraphicsState() return the correct information.

See https://pdfbox.apache.org/2.0/migration.html

+1