Besides Chris' general answer, some background on iText (Sharp) content analysis ...
iText (Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser . This framework reads the page content, keeps track of the current graphics state, and forwards information about the individual pieces of content to an IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener supplied by the user (i.e. you ). In particular, it does not interpret any structure into this information.
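To make this concrete, here is a minimal sketch (assuming the iTextSharp 5.x API; the file name is a placeholder) of a render listener that does nothing but log each piece of content the framework feeds it:

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Minimal render listener: merely logs what the parsing framework reports.
class LoggingListener : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // One call per text-showing operation in the page content stream.
        Console.WriteLine("Text piece: '{0}'", renderInfo.GetText());
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        Console.WriteLine("Bitmap image encountered");
    }
}

class DumpContent
{
    static void Main()
    {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder path
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
                parser.ProcessContent(page, new LoggingListener());
        }
        finally
        {
            reader.Close();
        }
    }
}
```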
This render listener can be a text extraction strategy ( ITextExtractionStrategy / TextExtractionStrategy ), i.e. a special render listener that is primarily designed to extract a pure text stream without formatting or layout information. For this special case, iText (Sharp) additionally provides two sample implementations, SimpleTextExtractionStrategy and LocationTextExtractionStrategy .
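These ready-made strategies are usually driven through PdfTextExtractor; a minimal sketch, again assuming the iTextSharp 5.x API and a placeholder file name:

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class ExtractPlainText
{
    static void Main()
    {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder path
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // LocationTextExtractionStrategy sorts the pieces by position;
                // SimpleTextExtractionStrategy keeps the content stream order.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```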
For your task, you will need a more sophisticated render listener that either

- exports the text with coordinates (Chris in one of his answers provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks), allowing additional code to analyze the table structure afterwards (a sketch of such a chunk-collecting listener follows after this list); or
- performs the analysis of the tabular data itself.
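As a sketch of the first option (not Chris' actual code, merely an illustration of the idea, again against the iTextSharp 5.x API): collect each text piece together with a bounding box derived from its ascent and descent lines and leave the actual table analysis to later code.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

// A text piece plus an approximate bounding box.
class ChunkWithBox
{
    public string Text;
    public Rectangle Box;
}

// Collects every text piece reported by the framework together with a box
// spanning from the start of its descent line to the end of its ascent line.
class PositionalListener : IRenderListener
{
    public readonly List<ChunkWithBox> Chunks = new List<ChunkWithBox>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
        Chunks.Add(new ChunkWithBox
        {
            Text = renderInfo.GetText(),
            Box = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2],
                                topRight[Vector.I1], topRight[Vector.I2])
        });
    }
}
```

After processing a page with such a listener (e.g. via PdfReaderContentParser as shown above), Chunks contains the raw material on which column and row detection can be attempted.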
I have no example for the latter option because generic table recognition and parsing is a whole project in itself. You might want to look into the Tabula project for inspiration; that project is surprisingly good at extracting tables.
PS: If you feel more at home extracting structured content from a plain string representation of the content which nonetheless tries to reflect the original layout, you can try something like what is suggested in this answer , a LocationTextExtractionStrategy variant that works similarly to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
PPS: Extracting data from very specific PDF tables can be much easier; for example, see this answer , which demonstrates that after some PDF analysis, the specific way a given table is created can give rise to a simple custom render listener that retrieves the table data. This may make sense for a single PDF with a table spanning many pages, as in the case of that answer, or it may make sense if you have many PDFs created identically by the same software.
That is why I asked for a representative sample file in a comment to your question.
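As an illustration of such a tailored approach: when the layout is fixed and known, the stock RegionTextRenderFilter / FilteredTextRenderListener classes can already do most of the work. The following sketch assumes the iTextSharp 5.x API; the cell coordinates are invented and would have to be measured once for your specific documents (if your version of RegionTextRenderFilter does not accept an iTextSharp.text.Rectangle, use the System.util.RectangleJ overload instead).

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class FixedLayoutCellDump
{
    static void Main()
    {
        PdfReader reader = new PdfReader("table.pdf"); // placeholder path
        try
        {
            // Hypothetical cell region in user-space coordinates (llx, lly, urx, ury).
            Rectangle cellRegion = new Rectangle(100, 600, 250, 620);
            RenderFilter filter = new RegionTextRenderFilter(cellRegion);
            // Only text inside the region reaches the wrapped strategy.
            ITextExtractionStrategy strategy =
                new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, 1, strategy));
        }
        finally
        {
            reader.Close();
        }
    }
}
```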
Regarding your comments
Still with the sample PDF above, both with an ITextExtractionStrategy implementation from scratch and with the LocationTextExtractionStrategy extension, I see that RenderText is called with the following fragments: Fi, el, d, A, Fi, el, d ... and so on. Can this be changed?
The pieces of text that you receive as separate RenderText calls are not split apart randomly or by some arbitrary iText decision. They are the very strings drawn in the page content.
In your example, “Fi”, “el”, “d”, and “ A” arrive in different RenderText calls because the content stream contains operations that first draw “Fi”, then “el”, then “d”, then “ A”.
This may seem strange at first, but there is a common cause for such word splits: PDF does not use the kerning information from the fonts, so to kern, the PDF creation software has to insert tiny forward or backward jumps between characters that should be closer to or farther from each other than usual. Words are therefore often torn apart between kerning pairs; in the content stream this typically looks like a single text-showing operation interleaving strings and small adjustments, e.g. something like [(Fi) 28 (el) 20 (d)] TJ (the numbers here are invented for illustration).
So no, this cannot be changed; you will get these fragments, and it is the task of the text extraction strategy to put them back together.
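For illustration, a stripped-down sketch of such joining, assuming horizontal text and the iTextSharp 5.x API (the shipped SimpleTextExtractionStrategy and LocationTextExtractionStrategy are considerably more thorough):

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf.parser;

// Joins fragments like "Fi", "el", "d" back together: a fragment starting
// (nearly) where the previous one ended on the same baseline continues the
// same word; a noticeable horizontal gap becomes a space; a changed baseline
// or a large jump back becomes a line break.
class JoiningStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder result = new StringBuilder();
    private Vector lastEnd;

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();

        if (lastEnd != null)
        {
            float dx = start[Vector.I1] - lastEnd[Vector.I1];
            float dy = Math.Abs(start[Vector.I2] - lastEnd[Vector.I2]);

            if (dy > 2f || dx < -renderInfo.GetSingleSpaceWidth())
                result.Append('\n');   // new baseline or large jump back
            else if (dx > renderInfo.GetSingleSpaceWidth() / 2f)
                result.Append(' ');    // noticeable gap: word break
            // otherwise: tiny (kerning) gap, same word, append directly
        }

        result.Append(renderInfo.GetText());
        lastEnd = renderInfo.GetBaseline().GetEndPoint();
    }

    public string GetResultantText()
    {
        return result.ToString();
    }
}
```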
By the way, there are worse PDFs: some PDF generators position each and every glyph separately, especially generators that were first and foremost built for a GUI and merely offer exporting the GUI canvas as PDF as an additional feature.
I would have expected that by going into the “add my own implementation” territory I would get to control how a “piece” of text is determined.
You can ... well, you have to decide which of the incoming pieces belong together and which do not. E.g. do glyphs with the same y coordinate form one line? Or do they form separate rows in different columns that merely happen to be side by side?
So yes, you decide which glyphs you interpret as a single word or as the contents of a single table cell, but your input consists of the groups of glyphs used in the PDF content stream.
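A minimal sketch of one such decision, assuming horizontal text and the iTextSharp 5.x API: group the pieces into rows by (approximately) equal baseline y and order each row left to right; the 2-point tolerance is an arbitrary choice for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

// Collects each text piece with its baseline start point and groups the
// pieces into rows afterwards.
class RowGroupingListener : IRenderListener
{
    // (baseline y, baseline x, text)
    private readonly List<Tuple<float, float, string>> pieces =
        new List<Tuple<float, float, string>>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        pieces.Add(Tuple.Create(start[Vector.I2], start[Vector.I1], renderInfo.GetText()));
    }

    // Rows are keyed by the baseline y rounded to a ~2 point tolerance;
    // within a row the pieces are ordered left to right.
    public IEnumerable<string> GetRows()
    {
        return pieces
            .GroupBy(p => Math.Round(p.Item1 / 2.0))
            .OrderByDescending(g => g.Key) // PDF y grows upwards: top rows first
            .Select(g => string.Join(" ", g.OrderBy(p => p.Item2).Select(p => p.Item3)));
    }
}
```

For table cells you would additionally split each row at larger horizontal gaps or at known column boundaries.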
Not only that, in none of the interface methods can I “determine” how/where non-textual data/images are dealt with, so that I could account for a possible spacing problem (RenderImage is not called).
RenderImage is called for embedded bitmap images, JPEGs, etc. If you also want information about vector graphics, your strategy will additionally have to implement IExtRenderListener , which provides the methods ModifyPath , RenderPath and ClipPath .
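A sketch of such an IExtRenderListener implementation, e.g. to collect stroked line segments that might be table rulings. The member names used on the path render info objects (Operation, SegmentData) are assumed to match the iTextSharp 5.5.x parser classes, so please verify them against your version; also note that for real use the raw coordinates would still have to be transformed with the current transformation matrix.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Records straight line segments of every stroked path (candidate table rulings).
class LineCollectingListener : IExtRenderListener
{
    private readonly List<float[]> pendingSegments = new List<float[]>();
    private float[] lastPoint;

    public readonly List<float[]> StrokedSegments = new List<float[]>(); // x1, y1, x2, y2

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
    public void ClipPath(int rule) { }

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        // Only straight segments are considered here; curves and rectangles are ignored.
        switch (renderInfo.Operation)
        {
            case PathConstructionRenderInfo.MOVETO:
                lastPoint = new float[] { renderInfo.SegmentData[0], renderInfo.SegmentData[1] };
                break;
            case PathConstructionRenderInfo.LINETO:
                float[] to = new float[] { renderInfo.SegmentData[0], renderInfo.SegmentData[1] };
                if (lastPoint != null)
                    pendingSegments.Add(new float[] { lastPoint[0], lastPoint[1], to[0], to[1] });
                lastPoint = to;
                break;
        }
    }

    public Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        // Keep the collected segments only if the path is actually stroked.
        if ((renderInfo.Operation & PathPaintingRenderInfo.STROKE) != 0)
            StrokedSegments.AddRange(pendingSegments);
        pendingSegments.Clear();
        lastPoint = null;
        return null; // this listener contributes no extra clipping path
    }
}
```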