Extract PDF text by coordinates

Question

I would like to know whether there is a PDF library for Microsoft .NET that can extract text from a region specified by coordinates.

For example (in pseudo-code):

    PdfReader reader = new PdfReader();
    reader.Load("file.pdf");
    // Top, bottom, left, right in pixels or any other unit
    string wholeText = reader.GetText(100, 150, 20, 50);

I tried to do this using PDFBox for .NET (the one that runs on top of IKVM), with no luck, and it seems very outdated and undocumented.

Maybe someone has a good example using PDFBox, iTextSharp, or any other open source library and can give me a hint.

Thanks in advance.

+4

c# pdf .net-4.0

Matías Fidemraizer Sep 13 '11 at 16:28

4 answers

This is not open source, but hopefully it will help you (and possibly anyone else using ABCpdf!).

I did this earlier today by going through the form fields available in the PDF. This means that the PDF file must have been created with proper form fields, and you need to know the name of the field whose text you want to read (you can find it by setting a breakpoint and inspecting the available fields).

    WebSupergoo.ABCpdf6.Doc newPDF = new WebSupergoo.ABCpdf6.Doc();
    newPDF.Read("existing_file.pdf");
    foreach ( WebSupergoo.ABCpdf6.Objects.Field field in newPDF.Form.Fields )
    {
        if ( field.Name == "Text1" )
        {
            // update "Text1"
            field.Value = "new value for Text1";
        }
    }
    newPDF.Save("new_file.pdf");
    newPDF.Clear();

In this example, “Text1” is the name of the field being updated. Note that the example also shows how to save the updated fields.

Hopefully this at least gives you an idea of how to approach this issue.

+3

Ben Pearson Sep 13 '11 at 16:44

This should work:

    RenderFilter[] filters = new RenderFilter[1];
    LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
    filters[0] = new RegionTextRenderFilter(new Rectangle(llx, lly, urx, ury));
    FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);
    String result = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
    Console.WriteLine(result);
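The llx, lly, urx and ury values above are the lower-left and upper-right corners of the region, measured from the bottom-left page origin that PDF uses by default. If your coordinates are measured from the top of the page instead, as in the question, they need converting first. Here is a minimal sketch in plain Java; PdfRegion and toLowerLeftCorners are hypothetical names that are not part of iText, and the sketch assumes all values are in PDF points and the page is neither rotated nor cropped:

```java
import java.util.Arrays;

/** Hypothetical helper, not part of iText. */
final class PdfRegion {
    private PdfRegion() {}

    /**
     * Converts edges measured from the top-left corner of the page
     * into {llx, lly, urx, ury} measured from the bottom-left corner.
     * Assumes every value, including pageHeight, is in the same unit.
     */
    static float[] toLowerLeftCorners(float top, float bottom,
                                      float left, float right,
                                      float pageHeight) {
        if (bottom < top || right < left) {
            throw new IllegalArgumentException("edges are not ordered");
        }
        float llx = left;
        float lly = pageHeight - bottom; // the bottom edge becomes the lower y
        float urx = right;
        float ury = pageHeight - top;    // the top edge becomes the upper y
        return new float[] { llx, lly, urx, ury };
    }

    public static void main(String[] args) {
        // 792 points is the height of a US Letter page; sample input only.
        float[] corners = toLowerLeftCorners(100f, 150f, 20f, 50f, 792f);
        System.out.println(Arrays.toString(corners));
    }
}
```

The four values can then be passed to the Rectangle constructor in the same order.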

+3

Timo Hoen Aug 3 '12 at 8:17

iText's RegionTextRenderFilter is exactly what you are looking for.

So you want something like this (forgive my Java, but it should be trivial to translate):

    PdfReader reader = new PdfReader(path);
    // Only text inside someRect reaches the wrapped extraction strategy.
    FilteredTextRenderListener regionFilter = new FilteredTextRenderListener(
        new SimpleTextExtractionStrategy(),
        new RegionTextRenderFilter(someRect));
    // iText page numbers start at 1.
    String regionText = PdfTextExtractor.getTextFromPage(reader, 1, regionFilter);

+2

Mark Storer Sep 13 '11 at 18:37

Matías Fidemraizer · Accepted Answer · 2011-09-13T17:00:25+0000

Ok, thanks for your effort.

I got it working using Apache PDFBox compiled for .NET with IKVM, and this is the final code:

    PDDocument doc = PDDocument.load(@"c:\invoice.pdf");
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    // java.awt.Rectangle takes (x, y, width, height)
    stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
    stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));
    string text = stripper.getTextForRegion("testRegion");

And it works like a charm.
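For anyone starting from the top, bottom, left and right values used in the question, here is a minimal sketch of how they map onto the (x, y, width, height) arguments of java.awt.Rectangle. AwtRegion and fromEdges are hypothetical names, not part of PDFBox. The sketch assumes the region is measured from the top-left corner of the page, with y growing downward, and that the values are in the units PDFBox expects (typically PDF points):

```java
import java.awt.Rectangle;

/** Hypothetical helper, not part of PDFBox. */
final class AwtRegion {
    private AwtRegion() {}

    /**
     * Builds a java.awt.Rectangle from edge positions measured from the
     * top-left corner of the page (an assumption about the caller's units).
     */
    static Rectangle fromEdges(int top, int bottom, int left, int right) {
        if (bottom < top || right < left) {
            throw new IllegalArgumentException("edges are not ordered");
        }
        // x and y are the top-left corner; width and height are the spans.
        return new Rectangle(left, top, right - left, bottom - top);
    }

    public static void main(String[] args) {
        // The values from the question: top 100, bottom 150, left 20, right 50.
        System.out.println(fromEdges(100, 150, 20, 50));
    }
}
```

Under IKVM the same class is available from C#, so the result can be passed directly as the second argument of addRegion.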

Thanks, and I hope my own answer helps others. If you need more information, just comment here and I will update this answer.
