Get the corresponding coordinates of all words on a page using itextsharp

Question

My goal is to get the coordinates of every word on a page. This is what I did:

    PdfReader reader = new PdfReader("cde.pdf");
    TextWithPositionExtractionStategy S = new TextWithPositionExtractionStategy();
    PdfTextExtractor.GetTextFromPage(reader, 1, S);

    Vector curBaseline = renderInfo.GetDescentLine().GetStartPoint();
    Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
    iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
    string x1 = curBaseline[Vector.I1].ToString();
    string x2 = curBaseline[Vector.I2].ToString();
    string x3 = topRight[Vector.I1].ToString();
    string x4 = topRight[Vector.I2].ToString();

But what I get is the coordinates of a whole string containing all the words, not the coordinates of the individual words. For example, if the PDF file contains "I am a girl", I receive the coordinates of "I am a girl", but not the coordinates of "I", "am", "a", and "girl". How can I change the code to get the coordinates of each word? Thanks.

c# itextsharp

chengzixiaohai Dec 05

1 answer

mkl · Accepted Answer · 2012-12-05 09:29

(I mainly work with the iText Java library, not the iTextSharp .NET library, so please excuse some Java-isms here; everything should be easy to translate.)

To extract the contents of a page using iText(Sharp), you use the classes in the parser package to feed those contents, after some preprocessing, to a RenderListener of your choice.

In a context in which you are only interested in text, you most often use a TextExtractionStrategy, which is derived from RenderListener and adds a single method, getResultantText, to retrieve the aggregated text from the page.

As the original intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.

Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.

Just as there is a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not make the same assumptions, but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.

Thus, as in the case of SimpleTextExtractionStrategy, your first, simple implementation expects the text rendering events forwarded to your listener to arrive line by line, and within each line from left to right. That way, as soon as you find a horizontal gap or punctuation, you know that your current word is finished and you can process it.
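The two checks this relies on (a line break and a horizontal gap) can be sketched on plain coordinates. This is only a minimal sketch, assuming horizontal, unrotated text: the `Chunk` record, the `LineChecks` class, and the tolerance of one unit are hypothetical names and values, not part of iText. In a real listener the coordinates would come from the baseline of each TextRenderInfo, and the space width from its getSingleSpaceWidth() method.

```java
// Hypothetical stand-in for the few values needed from one text chunk:
// baseline start x, baseline end x, baseline y, and the width of one space.
record Chunk(float startX, float endX, float baselineY, float spaceWidth) {}

final class LineChecks {
    // Assumed tolerance (in PDF units) for treating two baselines as one line.
    static final float LINE_TOLERANCE = 1.0f;

    // True when the current chunk sits on a different baseline than the previous one.
    static boolean isNewLine(Chunk previous, Chunk current) {
        return Math.abs(current.baselineY() - previous.baselineY()) > LINE_TOLERANCE;
    }

    // True when the horizontal distance between the end of the previous chunk
    // and the start of the current one exceeds half a space width.
    static boolean isGapFromPrevious(Chunk previous, Chunk current) {
        float spacing = Math.abs(current.startX() - previous.endX());
        return spacing > current.spaceWidth() / 2f;
    }
}
```

The half-space threshold mirrors the kind of heuristic the simple strategy uses when deciding whether to insert a space; for rotated text the comparison would have to be done along the baseline direction instead of the x axis.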

Unlike the text extraction strategies, you do not need a StringBuffer member to collect your result; instead you need a list of some "word with position" structure. In addition, you need a member variable to hold the TextRenderInfo events you have already collected for this page but could not yet finally process (a single word may arrive in several separate events).

As soon as you (i.e. your renderText method) are called for a new TextRenderInfo object, you should proceed as follows (pseudocode):

    if (unprocessedTextRenderInfos not empty)
    {
        if (isNewLine             // Check this like the simple text extraction strategy checks for hardReturn
            || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
        {
            process(unprocessedTextRenderInfos);
            unprocessedTextRenderInfos.clear();
        }
    }

    split new TextRenderInfo using its getCharacterRenderInfos() method;
    while (characterRenderInfos contain word end)
    {
        add characterRenderInfos up to excluding the white space/punctuation to unprocessedTextRenderInfos;
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
        remove used render infos from characterRenderInfos;
    }
    add remaining characterRenderInfos to unprocessedTextRenderInfos;
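The pseudocode translates fairly directly into Java. The following is a sketch under stated assumptions, not an iText implementation: `Glyph` is a hypothetical stand-in for the single-character TextRenderInfo objects returned by getCharacterRenderInfos(), `Word` and `WordSplitter` are illustrative names, only whitespace and a small fixed set of punctuation marks count as word ends, and the new-line/gap decision is passed in as a flag rather than computed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one single-character render info: the character
// and the x/y of its baseline start point.
record Glyph(char ch, float x, float y) {}

// Hypothetical "word with position" structure: text plus start coordinates.
record Word(String text, float x, float y) {}

final class WordSplitter {
    private final List<Word> words = new ArrayList<>();
    // Glyphs of a word that has started but may continue in the next chunk.
    private final List<Glyph> unprocessed = new ArrayList<>();

    // Assumption: whitespace and these punctuation marks end a word.
    private static boolean isWordEnd(char c) {
        return Character.isWhitespace(c) || ".,;:!?".indexOf(c) >= 0;
    }

    // Counterpart of renderText: breakBefore is true when the caller detected
    // a new line or a horizontal gap before this chunk.
    void renderText(List<Glyph> chunk, boolean breakBefore) {
        if (breakBefore) {
            flush();
        }
        for (Glyph g : chunk) {
            if (isWordEnd(g.ch())) {
                flush(); // the separator itself is dropped
            } else {
                unprocessed.add(g);
            }
        }
    }

    // Counterpart of process(...) followed by clear(); call it once more
    // after the last chunk of the page. Does nothing if no glyphs are pending.
    void flush() {
        if (unprocessed.isEmpty()) {
            return;
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : unprocessed) {
            text.append(g.ch());
        }
        Glyph first = unprocessed.get(0);
        words.add(new Word(text.toString(), first.x(), first.y()));
        unprocessed.clear();
    }

    List<Word> words() {
        return words;
    }
}
```

In an actual listener, renderText would receive a TextRenderInfo, build the glyph list from its character render infos, and derive the flag from the position of the previous chunk. Feeding the chunks "I am a gi" and "rl" without a break, followed by a final flush(), yields the four words "I", "am", "a", and "girl", each with the coordinates of its first character.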

In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos: you concatenate the individual text contents into a word and take the coordinates you want. If you only want the starting coordinates, take them from the first of those unprocessed TextRenderInfos; if you need more data, also use data from the other TextRenderInfos. With this data you fill a "word with position" structure and add it to your result list.
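If the full bounding box of a word is wanted rather than only its start point, the processing step can take the union of the character rectangles. This is a minimal sketch, assuming each character has already been reduced to a rectangle (for example from the start point of its descent line and the end point of its ascent line, as in the question); `CharBox`, `WordBox`, and `WordBoxes.merge` are hypothetical names.

```java
import java.util.List;

// Hypothetical per-character rectangle: lower-left (llx, lly), upper-right (urx, ury).
record CharBox(char ch, float llx, float lly, float urx, float ury) {}

// Hypothetical result structure: the word and the rectangle enclosing it.
record WordBox(String text, float llx, float lly, float urx, float ury) {}

final class WordBoxes {
    // Concatenates the characters and takes the union of their rectangles.
    static WordBox merge(List<CharBox> chars) {
        if (chars.isEmpty()) {
            throw new IllegalArgumentException("a word needs at least one character");
        }
        StringBuilder text = new StringBuilder();
        float llx = Float.POSITIVE_INFINITY;
        float lly = Float.POSITIVE_INFINITY;
        float urx = Float.NEGATIVE_INFINITY;
        float ury = Float.NEGATIVE_INFINITY;
        for (CharBox c : chars) {
            text.append(c.ch());
            llx = Math.min(llx, c.llx());
            lly = Math.min(lly, c.lly());
            urx = Math.max(urx, c.urx());
            ury = Math.max(ury, c.ury());
        }
        return new WordBox(text.toString(), llx, lly, urx, ury);
    }
}
```

Taking minima and maxima rather than the first and last character keeps the box correct when characters in a word differ in height or depth.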

When page processing is finished, you have to call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear() once more; alternatively, you may do that in the endTextBlock method.

Having done this, you might feel ready to implement the slightly more complex variant, which does not make the same assumptions about the structure of the page content. ;)
