iTextSharp extracts the contents of a wrapped cell into new lines - how do you determine which column a given wrapped piece of data belongs to?

I am using iTextSharp to extract data from PDF files, and I ran into the problem described below:

I created an Excel sample to illustrate. Here's what it looks like: [screenshot of the sample spreadsheet]

I converted it to PDF using one of the many free online converters out there (when I created the PDF, I did not apply any styling to the Excel sheet): [screenshot of the generated PDF]

Now, extracting data from the PDF with iTextSharp returns the following as extracted text:

[screenshot of the extracted text]

As you can see, the wrapped cell data generates new lines, where each wrapped piece of data is separated by a single space.

Problem: how can I determine which column a given piece of wrapped data belongs to? If only iTextSharp preserved as many white spaces as there are columns...

In my example, how can I determine which column "111" belongs to?




Update 1:

A similar problem occurs whenever a field has more than one word (i.e. contains spaces). For example, given the first line of the above example:

say it looks like

 ---A--- ---B--- ---C--- ---D---
 aaaaaaa bb b    cccc

iText would again produce this extract:

 aaaaaaa bb b cccc 

The same problem arises here when you have to determine the boundaries of each column.




Update 2: a sample of the real PDF file I'm working with: [screenshot of the PDF]

0
itextsharp
Dec 30 '15 at 14:24
3 answers

Besides Chris's general answer, some background on iText(Sharp) content analysis...

iText(Sharp) provides a framework for extracting content in the iTextSharp.text.pdf.parser namespace / com.itextpdf.text.pdf.parser package. This framework reads the page content, tracks the current graphics state, and forwards information about the pieces of content to an IExtRenderListener or IRenderListener (ExtRenderListener or RenderListener in Java) supplied by the user, i.e. you. In particular, the framework does not interpret this information into any structure.

This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener designed primarily to extract a pure text stream without formatting or layout information. For this special case iText(Sharp) furthermore provides two sample implementations, SimpleTextExtractionStrategy and LocationTextExtractionStrategy.

For your task you will need a more sophisticated render listener which either

  • exports the text with coordinates (Chris, in one of his answers, provided an extended LocationTextExtractionStrategy which can additionally supply positions and bounding boxes of text chunks), allowing additional code to analyze table structures; or
  • does the analysis of tabular data itself.

I have no example for the latter option because generic table recognition and parsing is a whole project in itself. You might want to look at the Tabula project for inspiration; that project is surprisingly good at extracting tables.

PS: If you feel more at home extracting structured content from a pure string representation which nonetheless tries to reflect the original layout, you may want to try something like what is proposed in this answer, a LocationTextExtractionStrategy variant that works similarly to the pdftotext -layout tool; the answer only presents the changes to apply to LocationTextExtractionStrategy.

PPS: Extracting data from very specific PDF tables may be much easier; e.g. have a look at this answer, which demonstrates that, after some analysis of the PDF, the specific way a given table is created can lead to a fairly simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many pages, as in the case of that answer, or if you have many PDFs identically created by the same software.

That is why I asked for a representative sample file in a comment on your question.




Regarding your comments

Still with the example PDF above, both with an ITextExtractionStrategy implementation from scratch and with a LocationTextExtractionStrategy extension, I see that RenderText is called with the following snippets: "Fi", "el", "d", "A", "Fi", "el", "d"... and so forth. Can this be changed?

The pieces of text you receive in separate RenderText calls are not split randomly or by some arbitrary iText decision. They are the very strings drawn in the page content stream.

In your example, "Fi", "el", "d" and "A" arrive in different RenderText calls because the content stream contains operations which first draw "Fi", then "el", then "d", then "A".

It may seem strange at first. A common cause of such word splits is that PDF viewers do not apply kerning information from the fonts; to get kerning, the PDF creation software has to insert tiny forward or backward shifts between characters which shall sit closer to or farther from each other than without kerning. Thus, words often are torn apart at kerning pairs.
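For illustration, here is a hand-written content-stream fragment (an assumption about how such a file might look, not taken from your actual PDF) showing how a single text-showing operation splits a word at kerning pairs; the numbers inside the TJ array are offsets in thousandths of a text-space unit, and a positive value pulls the next glyph to the left:

```
BT
  /F1 12 Tf                    % select font F1 at 12 pt
  72 700 Td                    % move to x = 72, y = 700
  [ (Fi) 15 (el) 10 (d) ] TJ   % "Field" is drawn as "Fi", "el", "d"
ET
```

A render listener receives one RenderText call per string in that array, which is exactly the "Fi", "el", "d" pattern described above.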

So no, this cannot be changed: you will get these fragments, and it is the task of the text extraction strategy to put them together again.

By the way, there are worse PDFs; some PDF generators position every single glyph separately, especially generators which primarily create graphics but offer exporting GUI canvases as PDF as an additional feature.

I would have expected that by going into the "add my own implementation" area I would control how to determine what a "piece" of text is.

You can... well, you have to decide which of the input pieces belong together and which do not. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which merely happen to lie side by side?

So yes, you decide which glyphs to interpret as a single word or as the contents of a single table cell, but your input consists of the glyph groups as used in the PDF content stream.
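To make that concrete, here is a minimal, library-independent sketch of one possible decision rule: group chunks whose baselines lie within a small tolerance into rows, and map a chunk to a column by its x coordinate. The Chunk type, the tolerance value and the column boundaries are illustrative assumptions, not part of the iTextSharp API; in a real strategy you would fill such chunks from your RenderText implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical text chunk as a custom strategy might record it:
// the text plus the start of its baseline.
public record Chunk(string Text, float X, float Y);

public static class TableGrouper
{
    // Chunks whose baselines differ by less than this form one row.
    const float RowTolerance = 2f;

    // Sort top-to-bottom, left-to-right, then open a new row whenever
    // the y coordinate jumps by more than the tolerance.
    public static List<List<Chunk>> GroupIntoRows(IEnumerable<Chunk> chunks)
    {
        var rows = new List<List<Chunk>>();
        foreach (var c in chunks.OrderByDescending(c => c.Y).ThenBy(c => c.X))
        {
            var row = rows.Count > 0 ? rows[^1] : null;
            if (row != null && Math.Abs(row[0].Y - c.Y) < RowTolerance)
                row.Add(c);
            else
                rows.Add(new List<Chunk> { c });
        }
        return rows;
    }

    // boundaries[i] is the left edge of column i + 1 (e.g. measured
    // from the header row); everything left of boundaries[0] is column 0.
    public static int ColumnOf(Chunk c, float[] boundaries)
    {
        int col = 0;
        while (col < boundaries.Length && c.X >= boundaries[col]) col++;
        return col;
    }
}
```

With column boundaries at x = 100 and x = 200, a chunk like "111" starting at x = 210 on a line of its own lands in column 2, even though plain text extraction shows it at the start of a line.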

Not only that: in none of the interface methods can I determine how/where non-textual data / images are dealt with, so that I could account for the spacing they introduce (RenderImage is not called).

RenderImage will be called for embedded bitmaps, JPEGs, etc. If you also want information on vector graphics, your strategy additionally needs to implement IExtRenderListener, which provides ModifyPath, RenderPath and ClipPath.

+3
Jan 01 '15 at 9:55

This is not really an answer, but I need a place to show some things that may help you understand what is going on.

First, a "conversion" from Excel, Word, PowerPoint, HTML or anything else to PDF is almost always a destructive change. The destructive part is very important: you take data from a program that has very specific knowledge of what that data represents (Excel), and you turn it into drawing commands in a very general-purpose format (PDF) that only cares about what the data looks like, not about the data itself. Unless the data is "tagged" (and it almost never is these days), there is no context for the drawing commands. No paragraphs, no sentences, no columns, rows or tables. Literally just "draw this letter at x,y" and "draw this word at a,b".

Second, imagine that the Excel file has the following data, and for whatever reason the last column was narrower than the others when the PDF was created:

 Column A | Column B | Column
 C
 Data #1  | Data #2  | Data
 #3

You and I have context, so we know that the second and fourth lines are just continuations of the first and third lines. But since iText has no context at extraction time, it cannot see that, and it sees four lines of text. In fact, since it has no context, it does not even see columns, just lines of text.
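Under stated assumptions (you have already reconstructed lines as lists of (x, text) chunks with coordinates, and you know each column's left edge, e.g. measured from the header row), the context iText lacks can be re-applied mechanically: a line with no chunk in column 0 is treated as wrapped text continuing the previous line. This is an illustrative sketch, not iTextSharp API; names and thresholds are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowMerger
{
    // colLefts[i] is the left x of column i; a chunk belongs to the
    // right-most column whose left edge starts at or before its x
    // (with a small slack for rounding).
    static int ColumnOf(float x, float[] colLefts)
    {
        int col = 0;
        for (int i = 1; i < colLefts.Length; i++)
            if (x >= colLefts[i] - 1f) col = i;
        return col;
    }

    // rows: the extracted lines, top to bottom, as (x, text) chunks.
    // A line with no chunk in column 0 is folded into the previous row.
    public static List<Dictionary<int, string>> Merge(
        List<List<(float X, string Text)>> rows, float[] colLefts)
    {
        var merged = new List<Dictionary<int, string>>();
        foreach (var row in rows)
        {
            var cells = row.ToDictionary(c => ColumnOf(c.X, colLefts),
                                         c => c.Text);
            if (!cells.ContainsKey(0) && merged.Count > 0)
            {
                var prev = merged[^1];
                foreach (var kv in cells)
                    prev[kv.Key] = prev.TryGetValue(kv.Key, out var t)
                        ? t + " " + kv.Value : kv.Value;
            }
            else
            {
                merged.Add(cells);
            }
        }
        return merged;
    }
}
```

Applied to a wrapped table like the one just shown, this folds a continuation chunk such as "#3" back into "Data #3" in the previous row, restoring the logical rows.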

Third, you really need to understand that spaces are not drawn in a PDF. Imagine the three-column table below:

 Column A | Column B | Column C
                       Yes

If you extracted this from the PDF, you would get the following data:

 Column A | Column B | Column C
 Yes

Inside the PDF, the word "Yes" is simply drawn at a specific x coordinate, which you and I would count as under the third column; there is no bunch of spaces in front of it.

As I said at the beginning, this is not really an answer, but I hope it explains the problem you are trying to solve. If your PDF is tagged, it will have context, and you can use that context during extraction. Context is not universal, though; as a rule there is no magic "insert context" flag. Excel actually has a checkbox (if I remember correctly) to create a tagged PDF during export, and ultimately it produces a tagged PDF using HTML-like tags for tables. Very primitive, but it works. However, you will still need to parse that context.

+4
Dec 31 '15 at 21:15

Leaving here an alternative extraction strategy. It does not solve the problem of how spaces / non-textual data are handled, but it gives you a bit more control over the extraction by letting you specify the geometric regions you want to extract text from. Taken from here.

    // Requires: using System; using System.Collections.Generic; using System.Linq;
    // using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;

    public static System.util.RectangleJ GetRectangle(float distanceInPixelsFromLeft,
        float distanceInPixelsFromBottom, float width, float height)
    {
        return new System.util.RectangleJ(
            distanceInPixelsFromLeft,
            distanceInPixelsFromBottom,
            width,
            height);
    }

    public static void Strategy2()
    {
        // In this example, I'll declare a pageNumber integer variable to
        // only capture text from the page I'm interested in
        int pageNumber = 1;
        var text = new StringBuilder();
        List<Tuple<string, int>> result = new List<Tuple<string, int>>();

        // The PdfReader object implements IDisposable.Dispose, so you can
        // wrap it in the using keyword to automatically dispose of it
        using (var pdfReader = new PdfReader("D:/Example.pdf"))
        {
            float distanceInPixelsFromLeft = 20;
            //float distanceInPixelsFromBottom = 730;
            float width = 300;
            float height = 10;

            // Sweep a thin horizontal strip down the page, extracting the
            // text inside each strip separately
            for (int i = 800; i >= 0; i -= 10)
            {
                var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);
                var filters = new RenderFilter[1];
                filters[0] = new RegionTextRenderFilter(rect);
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filters);
                var currentText = PdfTextExtractor.GetTextFromPage(
                    pdfReader, pageNumber, strategy);
                currentText = Encoding.UTF8.GetString(Encoding.Convert(
                    Encoding.Default, Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));
                //text.Append(currentText);
                result.Add(new Tuple<string, int>(currentText, currentText.Length));
            }
        }

        // You'll do something else with it, here I write it to a console window
        //Console.WriteLine(text.ToString());
        foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
        {
            Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
        }
        //Console.WriteLine("", string.Join("\r\n", result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1))));
    }

Outputs:

[screenshot of the console output]

PS: We still have the problem of how to handle spaces / non-textual data.

0
Jan 13 '16 at 11:10


