How to get text with a specific color from a PDF in C#

I need to put data from a PDF file into a specific database structure, which means I have to extract certain data from the PDF. Since PDF has no tags or similar markup, I was wondering whether it is possible to get text based on its color. Say, for example, I want all the red text, or all the italic text in the document. Is this possible in C#? Or is there another way to easily filter data in a PDF document?


+3
c# colors pdf itextsharp
May 03 '11 at 15:41
4 answers

I took a different approach: I converted the PDF to an Excel file, and from there it was very easy to find the colored text.

0
May 04 '11 at 17:12

With this library, http://www.codeproject.com/KB/files/xpdf_csharp.aspx?msg=3154408, you can access the style of each word (font, color, ...):

this.pdfDoc.Pages[4].WordList.ElementAt(143).ForeColor 
0
May 03 '11 at 16:14

iText's PdfTextExtractor (and all the code it relies on) DOES NOT track the current color. Ouch. It's not that hard to add, though, so you could modify iText yourself:

  • Add stroke and fill color members to the GraphicsState class (and update the various constructors accordingly).
  • Add ContentOperator classes for 'g', 'G', 'rg', 'RG', 'K' and 'k' (and possibly CS, cs, SC, sc, SCN, scn) that set the stroke and fill colors.
  • Add methods to TextRenderInfo to get the current stroke and fill colors.
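
To make the idea behind those operators concrete, here is a minimal standalone sketch. It does not use or modify iText. It scans an already-decompressed content stream, tracks the fill color set by 'g', 'rg' and 'k', and records the fill color in effect for each string shown with Tj. The names FillColorSketch, Scan and Tokenize are made up for this illustration, and the tokenizer is deliberately naive: it assumes no escapes or nested parentheses in strings, and ignores q/Q, TJ arrays, the stroke operators and the color space operators.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration only; not part of iText or iTextSharp.
public static class FillColorSketch
{
    // Returns each string shown with Tj, paired with the RGB fill color in effect.
    public static List<(string Text, double R, double G, double B)> Scan(string content)
    {
        var result = new List<(string Text, double R, double G, double B)>();
        var operands = new List<string>();  // operands seen since the last operator
        double r = 0, g = 0, b = 0;         // the default fill color in PDF is black

        foreach (string token in Tokenize(content))
        {
            bool isOperand = token.StartsWith("(") ||
                double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out _);
            if (isOperand) { operands.Add(token); continue; }

            // Anything else (including names such as /F1) is treated as an operator here.
            switch (token)
            {
                case "g" when operands.Count == 1:   // gray fill
                    r = g = b = Num(operands[0]);
                    break;
                case "rg" when operands.Count == 3:  // RGB fill
                    r = Num(operands[0]); g = Num(operands[1]); b = Num(operands[2]);
                    break;
                case "k" when operands.Count == 4:   // CMYK fill, naive conversion to RGB
                {
                    double c = Num(operands[0]), m = Num(operands[1]);
                    double y = Num(operands[2]), black = Num(operands[3]);
                    r = (1 - c) * (1 - black);
                    g = (1 - m) * (1 - black);
                    b = (1 - y) * (1 - black);
                    break;
                }
                case "Tj" when operands.Count == 1 && operands[0].StartsWith("("):
                    string text = operands[0].Substring(1, operands[0].Length - 2);
                    result.Add((text, r, g, b));
                    break;
            }
            operands.Clear();
        }
        return result;
    }

    private static double Num(string token) =>
        double.Parse(token, CultureInfo.InvariantCulture);

    // Splits on whitespace, keeping "(...)" literal strings as single tokens.
    private static IEnumerable<string> Tokenize(string s)
    {
        int i = 0;
        while (i < s.Length)
        {
            if (char.IsWhiteSpace(s[i])) { i++; continue; }
            int start = i;
            if (s[i] == '(')
            {
                i = s.IndexOf(')', i);
                if (i < 0) yield break;  // unterminated string: stop scanning
                i++;
            }
            else
            {
                while (i < s.Length && !char.IsWhiteSpace(s[i]) && s[i] != '(') i++;
            }
            yield return s.Substring(start, i - start);
        }
    }
}
```

Calling FillColorSketch.Scan("1 0 0 rg BT /F1 12 Tf (Red) Tj ET 0 g BT (Black) Tj ET") yields "Red" with fill (1, 0, 0) and "Black" with fill (0, 0, 0), so filtering for red text is a matter of comparing the stored components. In a real document the bytes inside the parentheses are font-encoded, which is one reason to do this inside iText's parser, as described above, rather than with a hand-written scanner.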
0
May 03 '11 at 18:13

Try PDFlib TET: http://www.pdflib.com/products/tet/
It should be able to retrieve information about the text.

0
May 03 '11 at 19:35


