Reading tables from a PDF using C#

Question

I want to read tables inside a PDF file. I have a PDF file that consists entirely of a table. Which SDK can be used in C# to recognize tables inside PDF files, and is there a mechanism for reading them cell by cell?

Can anyone suggest any DLLs that recognize tables inside PDF files?

+10

c# pdf

praveen Aug 05 '11 at 12:47

6 answers

Paolo moretti · Answer 1 · 2011-08-05T20:40:35+0000

The PDF format has no notion of a "table": its vector graphics grammar consists of simple primitives related to paths (that is, lines, curves, font outlines ...) and sampled content (that is, bitmap images).

Nonetheless, a good heuristic algorithm could detect the loose presence of a so-called "tabular" representation (typically, intersecting lines intermingled with content).
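As a rough illustration of such a heuristic, here is a minimal sketch in plain C#. It assumes the line segments have already been extracted from the page by some PDF library, and that every ruling spans the full width or height of the table, which real documents often violate. The `Segment` type and the `TableHeuristic` class are hypothetical names made up for this example, not part of any SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration; a real PDF library exposes its own path objects.
public readonly struct Segment
{
    public readonly double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public bool IsHorizontal(double tol) => Math.Abs(Y1 - Y2) <= tol;
    public bool IsVertical(double tol) => Math.Abs(X1 - X2) <= tol;
}

public static class TableHeuristic
{
    // Sorts the values and collapses any value within tol of the previously kept one.
    public static List<double> DistinctCoords(IEnumerable<double> values, double tol)
    {
        var result = new List<double>();
        foreach (var v in values.OrderBy(v => v))
            if (result.Count == 0 || v - result[result.Count - 1] > tol)
                result.Add(v);
        return result;
    }

    // Returns the (rows, columns) of the grid implied by the rulings,
    // or (0, 0) when there are too few lines to enclose a single cell.
    // Assumption: every ruling spans the whole table, so intersections are not checked.
    public static (int Rows, int Cols) GuessGrid(IEnumerable<Segment> segments, double tol = 1.0)
    {
        var list = segments.ToList();
        var ys = DistinctCoords(list.Where(s => s.IsHorizontal(tol)).Select(s => s.Y1), tol);
        var xs = DistinctCoords(list.Where(s => s.IsVertical(tol)).Select(s => s.X1), tol);
        if (ys.Count < 2 || xs.Count < 2) return (0, 0);
        return (ys.Count - 1, xs.Count - 1);
    }
}
```

Three horizontal and three vertical rulings would be reported as a 2 x 2 grid. A page with no rulings returns (0, 0), so borderless tables need a different signal, such as text alignment.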

Jetti · Answer 2 · 2011-08-05T13:02:35+0000

iTextPdf may be what you are looking for. I haven't used it, but I've heard very good things about it. It is also open source and free (for non-commercial use), which is always nice.

bubi · Answer 3 · 2017-07-04T09:30:42+0000

I know this is an old question, but someone might still need this.

"Absolutely obvious" introduction:
A PDF file is a stream of graphic objects (for example, lines) and text. When the PDF is rendered, the human eye recognizes tables because of the lines and the text between them.

Solution (mine)
Starting from a PDF reader (iTextSharp), you need to:
1. read the lines (hopefully only vertical and horizontal lines),
2. join the lines (one line of a table may be drawn as several segments, for example one per cell),
3. work out where the tables are (sometimes by making assumptions based on your needs),
4. optionally, find the text outside the tables (it is better to keep all the text) and put it into paragraphs,
5. insert the text into the table cells.

If you want something already written as a starting point (it works with my PDF files), you can find it here: https://github.com/bubibubi/ExtractTablesFromPdf
It uses the GPL version of iTextSharp.
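To make step 2 (joining the lines) more concrete, here is a minimal sketch in plain C#. It is not the code from the repository above; `HLine` and `LineJoiner` are hypothetical names, and it assumes the horizontal segments have already been read from the page. Segments are first grouped into bands of nearly equal height, then touching or overlapping pieces in each band are merged. Vertical lines could be handled the same way with the axes swapped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a horizontal ruling at height Y spanning [Left, Right].
public sealed class HLine
{
    public double Y, Left, Right;
    public HLine(double y, double left, double right)
    {
        Y = y; Left = Math.Min(left, right); Right = Math.Max(left, right);
    }
}

public static class LineJoiner
{
    // Groups pieces into bands of nearly equal Y, then merges the pieces of each band.
    public static List<HLine> JoinHorizontal(IEnumerable<HLine> pieces, double tol = 1.0)
    {
        var merged = new List<HLine>();
        var band = new List<HLine>();
        foreach (var p in pieces.OrderBy(p => p.Y))
        {
            // A jump in Y larger than tol closes the current band.
            if (band.Count > 0 && p.Y - band[band.Count - 1].Y > tol)
            {
                merged.AddRange(MergeBand(band, tol));
                band.Clear();
            }
            band.Add(p);
        }
        merged.AddRange(MergeBand(band, tol));
        return merged;
    }

    // Merges pieces of one band that touch or overlap (within tol) along X.
    private static List<HLine> MergeBand(List<HLine> band, double tol)
    {
        var result = new List<HLine>();
        foreach (var p in band.OrderBy(p => p.Left))
        {
            var last = result.Count > 0 ? result[result.Count - 1] : null;
            if (last != null && p.Left <= last.Right + tol)
                last.Right = Math.Max(last.Right, p.Right);
            else
                result.Add(new HLine(band[0].Y, p.Left, p.Right)); // snap to the band's lowest Y
        }
        return result;
    }
}
```

The tolerance is in page units and is a guess; it would need tuning for documents where rulings are drawn as thin filled rectangles rather than stroked lines.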

Justin shield · Answer 4 · 2011-08-05T14:18:09+0000

PDFBox

Through IKVM.NET, it has been used successfully to parse PDF documents in .NET.

http://www.codeproject.com/KB/string/pdf2text.aspx

Using PDFBox to parse PDF files is quite simple:

    private static string parseUsingPDFBox(string filename)
    {
        PDDocument doc = PDDocument.load(filename);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }

jason · Answer 5 · 2014-04-22T16:35:50+0000

I needed the same thing for a project. My process has a fair amount of overhead, but it works quite well. Once I have improved it a bit, I will post it. Here is the basic flow:

1. Use libpdf to convert the PDF to JSON.
2. Import the JSON file to get the text strings with their coordinates.
3. Use Ghostscript to convert the PDF to an image.
4. Use the AForge BlobCounter to get the table cells.
5. Group the cells into tables.
6. Use each cell's location and size to determine which text lines it contains.
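The last step can be sketched in plain C# as a point-in-rectangle test. This is only an illustration, not the code described above: `TextItem`, `CellBox` and `CellFiller` are hypothetical types, and neither libpdf's JSON layout nor AForge's blob type is reproduced. It also assumes the text coordinates have already been scaled into the same pixel space as the cell rectangles, with Y growing downward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a text string with the coordinates of its anchor point.
public sealed class TextItem
{
    public string Text; public double X, Y;
    public TextItem(string text, double x, double y) { Text = text; X = x; Y = y; }
}

// Hypothetical type: the bounding rectangle of one detected table cell.
public sealed class CellBox
{
    public double Left, Top, Width, Height;
    public CellBox(double left, double top, double width, double height)
    {
        Left = left; Top = top; Width = width; Height = height;
    }
    // Half-open test so a point on a shared border belongs to exactly one cell.
    public bool Contains(double x, double y) =>
        x >= Left && x < Left + Width && y >= Top && y < Top + Height;
}

public static class CellFiller
{
    // Maps each cell to the text whose anchor point falls inside it,
    // ordered top to bottom and then left to right.
    public static Dictionary<CellBox, string> Fill(IEnumerable<CellBox> cells, IEnumerable<TextItem> items)
    {
        var itemList = items.ToList();
        return cells.ToDictionary(
            c => c,
            c => string.Join(" ", itemList
                .Where(t => c.Contains(t.X, t.Y))
                .OrderBy(t => t.Y).ThenBy(t => t.X)
                .Select(t => t.Text)));
    }
}
```

Testing only the anchor point is a simplification; a string that starts in one cell and runs into the next would be assigned entirely to the first, so a stricter version might compare the whole text bounding box with the cell.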

Dz · Answer 6 · 2019-04-30T14:00:10+0000

Could you post your code here, or at least list its main lines?

Bubi, you wrote a good answer. I have not tested it, though. How do you determine whether there is a table when it uses different types of dividing lines? And what about tables without lines?
