How to get text with a specific color from a PDF in C#

Question


I need to load data from a PDF file into a specific database structure, which means I have to extract particular pieces of data from the PDF. Since PDF has no tags or similar markup, I was wondering whether it is possible to select text by its formatting: for example, all the red text, or all the italic text in the document. Is this possible in C#? Or is there another way to easily filter data in a PDF document?


+3

c# colors pdf itextsharp

Ojtwist May 03 '11 at 15:41


4 answers

With this library, http://www.codeproject.com/KB/files/xpdf_csharp.aspx?msg=3154408, you can read the style of each word (font, color, ...):

this.pdfDoc.Pages[4].WordList.ElementAt(143).ForeColor

0

anth May 03 '11 at 16:14


iText's PdfTextExtractor (and all the code it relies on) DOES NOT track the current color. Ugh. It's not that hard to add, though, so you can modify iText yourself:

Add stroke and fill color members to the GraphicsState class (and update the various constructors accordingly).
Add ContentOperator classes for 'g', 'G', 'rg', 'RG', 'K' and 'k' (and possibly CS, cs, SC, sc, SCN, scn) that update the stroke and fill colors.
Add methods to TextRenderInfo that return the current stroke and fill colors.
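
The idea behind those steps can be illustrated without touching iText at all. What follows is only a minimal standalone sketch, assuming the page's content stream is already available as a decompressed string. `FillColorTracker` and `Extract` are hypothetical names made up for this example, not iText API. It handles only the fill operators 'g', 'rg' and 'k', the 'q'/'Q' state stack, and literal strings shown with 'Tj'; the stroke operators, 'TJ' arrays, hex strings, the cs/sc/scn family and font encodings are ignored, and the CMYK conversion is a naive approximation.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of iText: tracks the fill (non-stroking)
// color while scanning an already-decompressed PDF content stream.
public static class FillColorTracker
{
    // Returns each literal string shown with Tj, paired with the fill color
    // active at that point as RGB components in the range 0..1.
    public static List<(string Text, double[] Rgb)> Extract(string content)
    {
        var result = new List<(string Text, double[] Rgb)>();
        var operands = new List<string>();
        var saved = new Stack<double[]>();      // graphics state stack for q / Q
        double[] fill = { 0, 0, 0 };            // default fill color is black

        foreach (string token in Tokenize(content))
        {
            char first = token[0];
            if (first == '(' || first == '/' || first == '-' || first == '+'
                || first == '.' || char.IsDigit(first))
            {
                operands.Add(token);            // collect operands until an operator arrives
                continue;
            }
            int n = operands.Count;
            switch (token)
            {
                case "g" when n >= 1:           // gray fill
                    double gray = Num(operands, 1);
                    fill = new[] { gray, gray, gray };
                    break;
                case "rg" when n >= 3:          // RGB fill
                    fill = new[] { Num(operands, 3), Num(operands, 2), Num(operands, 1) };
                    break;
                case "k" when n >= 4:           // CMYK fill, naive conversion to RGB
                    double key = Num(operands, 1);
                    fill = new[]
                    {
                        (1 - Num(operands, 4)) * (1 - key),
                        (1 - Num(operands, 3)) * (1 - key),
                        (1 - Num(operands, 2)) * (1 - key)
                    };
                    break;
                case "q":                       // save graphics state
                    saved.Push(fill);
                    break;
                case "Q":                       // restore graphics state
                    if (saved.Count > 0) fill = saved.Pop();
                    break;
                case "Tj" when n >= 1 && operands[n - 1][0] == '(':
                    result.Add((operands[n - 1].Substring(1), fill));
                    break;
            }
            operands.Clear();                   // every operator consumes its operands
        }
        return result;
    }

    // Reads the operand that sits 'fromEnd' positions from the end of the list.
    private static double Num(List<string> ops, int fromEnd)
    {
        double v;
        return double.TryParse(ops[ops.Count - fromEnd], NumberStyles.Float,
                               CultureInfo.InvariantCulture, out v) ? v : 0;
    }

    // Splits the stream into tokens. A literal string is returned as "(" plus
    // its content; escape sequences are not decoded, the escaped character is
    // simply kept as-is.
    private static IEnumerable<string> Tokenize(string s)
    {
        int i = 0;
        while (i < s.Length)
        {
            if (char.IsWhiteSpace(s[i])) { i++; continue; }
            var sb = new StringBuilder();
            if (s[i] == '(')
            {
                int depth = 1;
                sb.Append('(');
                i++;
                while (i < s.Length)
                {
                    char d = s[i++];
                    if (d == '\\' && i < s.Length) { sb.Append(s[i++]); continue; }
                    if (d == '(') depth++;
                    else if (d == ')' && --depth == 0) break;
                    sb.Append(d);
                }
            }
            else
            {
                while (i < s.Length && !char.IsWhiteSpace(s[i]) && s[i] != '(')
                    sb.Append(s[i++]);
            }
            yield return sb.ToString();
        }
    }
}
```

Calling `FillColorTracker.Extract(stream)` and keeping only the entries whose `Rgb` is `{1, 0, 0}` would give the red text from that stream. A real implementation inside iText would hook the same logic into the existing operator dispatch rather than re-tokenizing the stream.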

0

Mark Storer May 03 '11 at 18:13


Try PDFlib TET: http://www.pdflib.com/products/tet/
It should be able to retrieve this kind of information about the text.

0

Fabrizio Accatino May 03 '11 at 19:35


Ojtwist · Accepted Answer · 2011-05-04 17:12

I took a different approach: I converted the PDF to an Excel file, and from there it was very easy to find the colored text.
