Reading tables from a PDF using C#

I want to read tables inside a PDF file. I have a PDF file containing the whole table. Which SDK can be used in C# to recognize tables inside PDF files, and is there some mechanism for reading them cell by cell?

Can anyone suggest any DLLs that recognize tables inside PDF files?

+10
6 answers

In the PDF format there is no concept of a "table": its vector grammar consists of simple path primitives (that is, lines, curves, font outlines ...) and sampled content (that is, bitmap images).

However, a good heuristic algorithm could loosely detect the presence of a so-called "tabular" layout (typically, intersecting lines mixed with content).
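
To make the "intersecting lines" idea concrete, here is a minimal sketch in C#. It is not taken from any PDF library: the `HLine` and `VLine` types, the `TableHeuristic` class, and the 1-point tolerance are all hypothetical names and values chosen for illustration, and it assumes the segments have already been extracted from the page with `X1 <= X2` and `Y1 <= Y2`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; a real PDF library would have to
// supply these segments after parsing the page's drawing operators.
public readonly record struct HLine(double Y, double X1, double X2);
public readonly record struct VLine(double X, double Y1, double Y2);

public static class TableHeuristic
{
    // Tolerance in PDF points; an arbitrary assumption, tune per document.
    private const double Tol = 1.0;

    // True when a horizontal and a vertical segment cross or touch.
    public static bool Intersects(HLine h, VLine v) =>
        v.X >= h.X1 - Tol && v.X <= h.X2 + Tol &&
        h.Y >= v.Y1 - Tol && h.Y <= v.Y2 + Tol;

    // Counts crossing points between the two sets of segments.
    // A complete grid of r x c cells yields (r + 1) * (c + 1) crossings.
    public static int CountIntersections(IEnumerable<HLine> hs, IEnumerable<VLine> vs)
    {
        var vList = vs.ToList();
        return hs.Sum(h => vList.Count(v => Intersects(h, v)));
    }

    // Guess: the region looks tabular when there are at least two lines in
    // each direction and every horizontal line crosses every vertical one.
    public static bool LooksLikeTable(IReadOnlyList<HLine> hs, IReadOnlyList<VLine> vs) =>
        hs.Count >= 2 && vs.Count >= 2 &&
        CountIntersections(hs, vs) == hs.Count * vs.Count;
}
```

This only recognizes fully ruled grids. Tables with merged cells or without ruling lines would need a looser rule, for example one based on text alignment.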

+8

iTextPdf may be what you are looking for. I have not used it myself, but I have heard very good things about it. It is also open source and free (for non-commercial use), which is always nice.

+4

I know this is an old question, but someone might still need this.

"Absolutely obvious" introduction:
A PDF file is a stream of graphic objects (for example, lines) and text. When the PDF is rendered, the human eye perceives tables because of the lines and the text between them.

Solution (mine)
Starting from a PDF reader (iTextSharp), you need to:
1. read the lines (hopefully only vertical and horizontal lines),
2. join the lines (one table line may be drawn as several segments, for example one per cell),
3. work out where the tables are (sometimes making assumptions based on your needs),
4. optionally find the text outside the tables (it is better to keep all the text) and put it into paragraphs,
5. put the text inside the table cells.

If you need something already written to get started (it works with my PDF files), you can find it here: https://github.com/bubibubi/ExtractTablesFromPdf
It uses the GPL version of iTextSharp.
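
As an illustration of step 2 only, here is a rough sketch of joining horizontal segments that lie on the same line. It is not taken from the repository above and does not use iTextSharp; the `Segment` type, the `LineJoiner` class, and the default tolerance are hypothetical, and the segments are assumed to be already extracted with `X1 <= X2`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one horizontal stroke at height Y spanning X1..X2.
public readonly record struct Segment(double Y, double X1, double X2);

public static class LineJoiner
{
    // Merges segments that share (roughly) the same Y and whose X ranges
    // touch or overlap. tol is an assumed tolerance in points (must be > 0).
    public static List<Segment> JoinHorizontal(IEnumerable<Segment> segments, double tol = 1.0)
    {
        var result = new List<Segment>();

        // Bucket by rounded Y so nearly collinear segments end up together.
        foreach (var row in segments.GroupBy(s => Math.Round(s.Y / tol)))
        {
            Segment? current = null;
            foreach (var s in row.OrderBy(s => s.X1))
            {
                if (current is { } c && s.X1 <= c.X2 + tol)
                {
                    // Touching or overlapping: extend the current line.
                    current = c with { X2 = Math.Max(c.X2, s.X2) };
                }
                else
                {
                    // Gap found: emit the finished line and start a new one.
                    if (current is { } done) result.Add(done);
                    current = s;
                }
            }
            if (current is { } tail) result.Add(tail);
        }
        return result;
    }
}
```

Vertical segments can be handled the same way with the roles of X and Y swapped. Bucketing by rounded Y is a simplification: two segments just either side of a bucket boundary will not be merged.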

+3

PDFBox

Through IKVM.NET, PDFBox has been used successfully to parse PDF documents in .NET.

Using PDFBox to parse PDF files is quite simple:

    private static string parseUsingPDFBox(string filename)
    {
        PDDocument doc = PDDocument.load(filename);
        try
        {
            // Extracts plain text only; table structure is not preserved.
            return new PDFTextStripper().getText(doc);
        }
        finally { doc.close(); } // always release the document
    }

+2

I needed the same thing for a project. My process has a bit of overhead, but it works quite well. When I have improved it a bit, I will post it. Here is the main flow:

  • use libpdf to convert the PDF to JSON
  • import the JSON file to get the text lines with their coordinates
  • use Ghostscript to convert the PDF to an image
  • use the AForge BlobCounter to get the table cells
  • group the cells into tables
  • use each cell's location and size to determine which text lines it contains
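
The last step of that flow can be sketched roughly as follows. This is not the poster's code: the `Cell` and `TextLine` types and the `CellTextAssigner` class are hypothetical, and the sketch assumes the cell rectangles and the text coordinates have already been brought into the same coordinate system (the image from Ghostscript and the PDF text normally differ in scale and origin).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types: a cell rectangle and a text line with its anchor point.
public readonly record struct Cell(double X, double Y, double Width, double Height)
{
    // Inclusive point-in-rectangle test.
    public bool Contains(double px, double py) =>
        px >= X && px <= X + Width && py >= Y && py <= Y + Height;
}

public readonly record struct TextLine(string Text, double X, double Y);

public static class CellTextAssigner
{
    // Maps every cell to the text lines whose anchor point falls inside it.
    // Text that lies in no cell is ignored; each line goes to at most one cell.
    public static Dictionary<Cell, List<string>> Assign(
        IEnumerable<Cell> cells, IEnumerable<TextLine> lines)
    {
        var cellList = cells.Distinct().ToList();
        var map = cellList.ToDictionary(c => c, _ => new List<string>());

        foreach (var line in lines)
        {
            foreach (var cell in cellList)
            {
                if (cell.Contains(line.X, line.Y))
                {
                    map[cell].Add(line.Text);
                    break; // first matching cell wins
                }
            }
        }
        return map;
    }
}
```

Using a single anchor point per text line keeps the test simple. Testing the centre of the text's bounding box instead would be more robust when the text starts close to a cell border.
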
0

Can you post your code here, or at least list its main lines?

Bubi, you wrote a good answer. I have not tested it. But how do you detect a table that uses different kinds of dividing lines? And what about tables without lines?

Dz

-2
