I know this is an old question, but someone might still need this.
"Absolutely obvious" introduction:
A PDF file is a stream of graphic objects (for example, lines) and text. When the PDF is rendered, the human eye recognizes tables from the lines and the text placed between them.
Solution (mine)
Starting from reading the PDF (with iTextSharp), you need to:
1. read the lines (hopefully only vertical and horizontal lines),
2. join the lines (one table line may be drawn as several segments, for example one per cell),
3. understand where the tables are (sometimes by making a hypothesis based on your needs),
4. collect the text and put it into paragraphs (rather than searching only for the text outside the tables, it is better to keep all the text),
5. place the text inside the table cells.
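To make steps 2, 3 and 5 more concrete, here is a minimal sketch in plain Python. It is not the code from my repository and it does not use iTextSharp: it assumes the line segments and the positioned words have already been read from the PDF, and it makes the simplifying hypothesis that the table is a complete grid with no merged cells. All function names (merge_segments, grid_cells, fill_table) and the sample coordinates are made up for illustration.

```python
from collections import defaultdict


def merge_segments(segments, tol=1.0):
    """Step 2: join collinear segments that touch or overlap.

    segments: iterable of (x0, y0, x1, y1) in PDF units.
    Returns (horizontals, verticals); a horizontal is (y, x_start, x_end)
    and a vertical is (x, y_start, y_end). The fixed coordinate is rounded
    to an integer so that nearly collinear segments share one bucket.
    """
    buckets = {"h": defaultdict(list), "v": defaultdict(list)}
    for x0, y0, x1, y1 in segments:
        if abs(y0 - y1) <= tol:            # horizontal segment
            buckets["h"][round(y0)].append((min(x0, x1), max(x0, x1)))
        elif abs(x0 - x1) <= tol:          # vertical segment
            buckets["v"][round(x0)].append((min(y0, y1), max(y0, y1)))
        # slanted segments are ignored (assumption: tables use straight rules)

    merged = {"h": [], "v": []}
    for kind, by_pos in buckets.items():
        for pos, spans in sorted(by_pos.items()):
            spans.sort()
            start, end = spans[0]
            for s, e in spans[1:]:
                if s <= end + tol:         # touches or overlaps: extend
                    end = max(end, e)
                else:                      # gap: close the current line
                    merged[kind].append((pos, start, end))
                    start, end = s, e
            merged[kind].append((pos, start, end))
    return merged["h"], merged["v"]


def grid_cells(horizontals, verticals):
    """Step 3 (hypothesis: full grid): build cells from the line positions.

    Returns rows of (left, bottom, right, top), top row first. The PDF
    origin is at the bottom-left, so the top row has the largest y.
    """
    ys = sorted({y for y, _, _ in horizontals}, reverse=True)
    xs = sorted({x for x, _, _ in verticals})
    return [
        [(xs[c], ys[r + 1], xs[c + 1], ys[r]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


def locate(cells, x, y):
    """Return (row, column) of the cell containing the point, or None."""
    for r, row in enumerate(cells):
        for c, (left, bottom, right, top) in enumerate(row):
            if left <= x < right and bottom <= y < top:
                return r, c
    return None


def fill_table(cells, words):
    """Step 5: put each (x, y, text) word into its cell; keep the rest."""
    table = [["" for _ in row] for row in cells]
    outside = []
    for x, y, text in words:
        hit = locate(cells, x, y)
        if hit is None:
            outside.append(text)
        else:
            r, c = hit
            table[r][c] = (table[r][c] + " " + text).strip()
    return table, outside


# A 2x2 table in which every rule is drawn as two segments, one per cell.
segments = [
    (0, 100, 100, 100), (100, 100, 200, 100),
    (0, 50, 100, 50), (100, 50, 200, 50),
    (0, 0, 100, 0), (100, 0, 200, 0),
    (0, 0, 0, 50), (0, 50, 0, 100),
    (100, 0, 100, 50), (100, 50, 100, 100),
    (200, 0, 200, 50), (200, 50, 200, 100),
]
words = [(10, 70, "Name"), (110, 70, "Qty"), (10, 20, "Bolt"),
         (110, 20, "4"), (10, 150, "Invoice")]

horizontals, verticals = merge_segments(segments)
table, outside = fill_table(grid_cells(horizontals, verticals), words)
print(table)    # the words grouped by cell
print(outside)  # the words that fall outside the table
```

Real PDFs need more than this: rules drawn as thin filled rectangles, several tables on one page, and cells that span rows or columns all break the full-grid hypothesis, which is why step 3 usually depends on your own documents.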
If you need something already written to get started (it was written to work with my PDF files), you can find it here: https://github.com/bubibubi/ExtractTablesFromPdf
It uses the GPL version of iTextSharp.
bubi