Getting string coordinates using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

Question

I have a PDF file that I read line by line using ITextExtractionStrategy. From a line I take a substring, for example "My name is XYZ", and I need to get the rectangular coordinates of that substring in the PDF, but I have not been able to do this. While googling I found LocationTextExtractionStrategy, but I do not understand how to use it to get the coordinates.

Here is the code.

    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
    string getcoordinate="My name is XYZ";

How can I get the rectangular coordinates of this substring using iTextSharp?

Please help.

+15

c# itextsharp

user3664608 May 28 '14 at 11:05

3 answers

This is an old question, but I am leaving my answer here because I could not find a correct answer on the Internet.

As Chris Haas has shown, working with words is not easy because iText deals in chunks. The code Chris posted failed in most of my tests because a word is usually split across several chunks (he warns about this in his post).

To solve this problem, this is the strategy I used:

Split the chunks into characters (actually one TextRenderInfo object per character).
Group the characters by line. This is not straightforward, because you have to deal with chunk alignment.
Search each line for the text you want to find.

I leave the code here. I have tested it with several documents and it works well, but it may fail in some scenarios because this chunk → words conversion is a bit tricky.

Hope this helps someone.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using iTextSharp.text.pdf.parser;

    class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
    {
        private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
        private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
        public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
        private String m_SearchText;
        public const float PDF_PX_TO_MM = 0.3528f;
        public float m_PageSizeY;

        public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
            : base()
        {
            this.m_SearchText = sSearchText;
            this.m_PageSizeY = fPageSizeY;
        }

        //Search every reconstructed line and record the first character of the first match
        private void searchText()
        {
            foreach (LineInfo aLineInfo in m_LinesTextInfo)
            {
                int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
                if (iIndex != -1)
                {
                    TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                    SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                    this.m_SearchResultsList.Add(aSearchResult);
                }
            }
        }

        //Merge consecutive chunks that lie on the same line into one LineInfo
        private void groupChunksbyLine()
        {
            LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
            LocationTextExtractionStrategyEx.LineInfo textInfo = null;
            foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
            {
                if (textChunk1 == null)
                {
                    textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                    this.m_LinesTextInfo.Add(textInfo);
                }
                else if (textChunk2.sameLine(textChunk1))
                {
                    textInfo.appendText(textChunk2);
                }
                else
                {
                    textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                    this.m_LinesTextInfo.Add(textInfo);
                }
                textChunk1 = textChunk2;
            }
        }

        public override string GetResultantText()
        {
            groupChunksbyLine();
            searchText();
            //In this case the return value is not useful
            return "";
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment baseline = renderInfo.GetBaseline();
            //Create ExtendedChunk
            ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
            this.m_DocChunks.Add(aExtendedChunk);
        }

        public class ExtendedTextChunk
        {
            public string m_text;
            private Vector m_startLocation;
            private Vector m_endLocation;
            private Vector m_orientationVector;
            private int m_orientationMagnitude;
            private int m_distPerpendicular;
            private float m_charSpaceWidth;
            public List<TextRenderInfo> m_ChunkChars;

            public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth, List<TextRenderInfo> chunkChars)
            {
                this.m_text = txt;
                this.m_startLocation = startLoc;
                this.m_endLocation = endLoc;
                this.m_charSpaceWidth = charSpaceWidth;
                this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
                this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
                this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
                this.m_ChunkChars = chunkChars;
            }

            //Two chunks share a line when they have the same orientation and the same perpendicular distance
            public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
            {
                return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude
                    && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
            }
        }

        public class SearchResult
        {
            public int iPosX;
            public int iPosY;

            public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
            {
                //Get position of upperLeft coordinate
                Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
                //PosX
                float fPosX = vTopLeft[Vector.I1];
                //PosY
                float fPosY = vTopLeft[Vector.I2];
                //Transform to mm and get y from top of page
                iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
                iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
            }
        }

        public class LineInfo
        {
            public string m_Text;
            public List<TextRenderInfo> m_LineCharsList;

            public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
            {
                this.m_Text = initialTextChunk.m_text;
                this.m_LineCharsList = initialTextChunk.m_ChunkChars;
            }

            public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
            {
                m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
                this.m_Text += additionalTextChunk.m_text;
            }
        }
    }
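
The class above records only the top-left corner of the first matching character. As a rough illustration of how the full rectangle of a match could be derived once the characters of a line are available, here is a minimal sketch that deliberately avoids iTextSharp types. Rect, PlacedChar and SubstringLocator are hypothetical names introduced only for this example, and the sketch assumes exactly one list entry per character of the line text, which ligatures can break.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical plain types; in real code the bounds would come from
// each character's ascent and descent lines.
public struct Rect
{
    public float Left, Bottom, Right, Top;
}

public sealed class PlacedChar
{
    public char Ch;
    public Rect Bounds;
}

public static class SubstringLocator
{
    // Returns the rectangle enclosing the first occurrence of `search`
    // in the line, or null when the line does not contain it.
    public static Rect? FindBounds(IReadOnlyList<PlacedChar> lineChars, string search)
    {
        if (string.IsNullOrEmpty(search)) return null;

        // Rebuild the line text from the individual characters.
        var text = new string(lineChars.Select(c => c.Ch).ToArray());
        int index = text.IndexOf(search, StringComparison.Ordinal);
        if (index < 0) return null;

        // Take the boxes of the matched characters and enclose them all.
        var hit = lineChars.Skip(index).Take(search.Length).Select(c => c.Bounds).ToList();
        return new Rect
        {
            Left = hit.Min(b => b.Left),
            Bottom = hit.Min(b => b.Bottom),
            Right = hit.Max(b => b.Right),
            Top = hit.Max(b => b.Top)
        };
    }
}
```

Taking the minimum and maximum over all matched characters, rather than only the first and last, keeps the rectangle sensible even if the characters are not stored strictly left to right.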

+7

Ivan BASART Oct 08 '15 at 11:26

I know this is a really old question, but below is what I ended up with. I am posting it here in the hope that it will be useful to someone else.

The following code gives you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of words. Note: I tested this on iTextSharp 5.5.11.0, and it will not work on some older versions.

As mentioned above, PDF files have no concept of words, lines, or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of separating lines and words, so my solution is based on it.

DISCLAIMER:

This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs , and a comment in that file says it is a preview. So this may not work in the future.

Anyway, here is the code.

    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    namespace Logic
    {
        public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
        {
            private readonly List<TextChunk> locationalResult = new List<TextChunk>();
            private readonly ITextChunkLocationStrategy tclStrat;

            public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp())
            {
            }

            /**
             * Creates a new text extraction renderer, with a custom strategy for
             * creating new TextChunkLocation objects based on the input of the
             * TextRenderInfo.
             * @param strat the custom strategy
             */
            public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
            {
                tclStrat = strat;
            }

            private bool StartsWithSpace(string str)
            {
                if (str.Length == 0) return false;
                return str[0] == ' ';
            }

            private bool EndsWithSpace(string str)
            {
                if (str.Length == 0) return false;
                return str[str.Length - 1] == ' ';
            }

            /**
             * Filters the provided list with the provided filter
             * @param textChunks a list of all TextChunks that this strategy found during processing
             * @param filter the filter to apply. If null, filtering will be skipped.
             * @return the filtered list
             * @since 5.3.3
             */
            private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
            {
                if (filter == null)
                {
                    return textChunks;
                }

                var filtered = new List<TextChunk>();
                foreach (var textChunk in textChunks)
                {
                    if (filter.Accept(textChunk))
                    {
                        filtered.Add(textChunk);
                    }
                }
                return filtered;
            }

            public override void RenderText(TextRenderInfo renderInfo)
            {
                LineSegment segment = renderInfo.GetBaseline();
                if (renderInfo.GetRise() != 0)
                {
                    // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to
                    Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                    segment = segment.TransformBy(riseOffsetTransform);
                }
                TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
                locationalResult.Add(tc);
            }

            public IList<TextLocation> GetLocations()
            {
                var filteredTextChunks = filterTextChunks(locationalResult, null);
                filteredTextChunks.Sort();

                TextChunk lastChunk = null;
                var textLocations = new List<TextLocation>();

                foreach (var chunk in filteredTextChunks)
                {
                    if (lastChunk == null)
                    {
                        //initial
                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                    else
                    {
                        if (chunk.SameLine(lastChunk))
                        {
                            var text = "";
                            // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                            if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                                text += ' ';

                            text += chunk.Text;
                            textLocations[textLocations.Count - 1].Text += text;
                        }
                        else
                        {
                            textLocations.Add(new TextLocation
                            {
                                Text = chunk.Text,
                                X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                                Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                            });
                        }
                    }
                    lastChunk = chunk;
                }

                //now find the location(s) with the given texts
                return textLocations;
            }
        }

        public class TextLocation
        {
            public float X { get; set; }
            public float Y { get; set; }
            public string Text { get; set; }
        }
    }

How to call it:

    // Requires: using System.Linq;
    IList<TextLocation> res;
    using (var reader = new PdfReader(inputPdf))
    {
        var parser = new PdfReaderContentParser(reader);
        var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
        // Assign to a variable declared outside the using block so it is still in scope below
        res = strategy.GetLocations();
    }

    var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();

inputPdf is a byte[] that holds the PDF data, and pageNumber is the page you want to search.
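
The X and Y values returned above are start locations converted from PDF points to millimetres, while the earlier answer additionally flips y so that it is measured from the top of the page. A minimal sketch of both conversions follows, assuming an unrotated page whose origin is the bottom-left corner and 72 points per inch. PdfUnits and its members are hypothetical names for illustration, not part of iTextSharp.

```csharp
using System;

// Hypothetical helper: unit and origin conversions for PDF coordinates.
public static class PdfUnits
{
    public const float PointsPerInch = 72f;
    public const float MillimetersPerInch = 25.4f;

    // Converts a length in PDF points to millimetres.
    public static float PointsToMillimeters(float points)
    {
        return points * MillimetersPerInch / PointsPerInch;
    }

    // Converts a y coordinate measured upward from the bottom of the page
    // (in points) into a distance from the top of the page (in millimetres).
    // Assumes an unrotated page with its origin at the bottom-left corner.
    public static float FromTopMillimeters(float yPoints, float pageHeightPoints)
    {
        return PointsToMillimeters(pageHeightPoints - yPoints);
    }
}
```

For example, on a page 842 points high, a point at y = 770 lies 72 points below the top edge, so FromTopMillimeters(770f, 842f) is roughly 25.4 mm.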

+3

Amila Jun 09 '17 at 6:11

Chris Haas · Accepted Answer · 2014-05-28 15:11

Here is a very simple version of an implementation.

Before getting to it, it is very important to know that PDF files have zero concept of “words”, “paragraphs”, “sentences”, etc. In addition, the text in a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into a PDF as:

    Draw H at (10, 10)
    Draw ell at (20, 10)
    Draw rld at (90, 10)
    Draw o Wo at (50, 20)

It can also be written as

 Draw Hello World at (10,10)

The ITextExtractionStrategy interface that you need to implement has a method called RenderText , which is called once for every chunk of text in the PDF file. Notice I said “chunk” and not “word”. In the first example above, the method will be called four times for those two words. In the second example, it will be called once for those two words. This is the very important part to understand. PDF files have no words, and because of this, iTextSharp has no words either. The “word” part is 100% up to you to solve.
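
To make the chunk-versus-word distinction concrete, here is a minimal sketch that uses no iTextSharp types at all. The Chunk class, the ChunkDemo helpers and the coordinates are hypothetical stand-ins for what successive RenderText calls might deliver, with all chunks placed on one baseline for simplicity. It shows that a word can be absent from every individual chunk and still be present once the chunks are joined in reading order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one RenderText call: the text drawn and where it starts.
public sealed class Chunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public Chunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkDemo
{
    // True if any single chunk contains the search text on its own.
    public static bool AnyChunkContains(IEnumerable<Chunk> chunks, string search)
    {
        return chunks.Any(c => c.Text.Contains(search));
    }

    // Joins the chunks top line first (larger y in PDF space), then left to right.
    // No separator is inserted between lines; this sketch is meant for one line.
    public static string JoinInReadingOrder(IEnumerable<Chunk> chunks)
    {
        return string.Concat(chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }
}
```

With the chunks "H", "ell", "rld" and "o Wo" drawn at x = 10, 20, 90 and 50 on the same baseline, AnyChunkContains(chunks, "World") is false, while JoinInReadingOrder(chunks) yields "Hello World".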

Along the same lines, as I said above, there are no paragraphs in PDF files. The reason to be aware of this is that PDF files cannot wrap text onto a new line. Any time you see something that looks like a paragraph return, you are actually seeing a new text-drawing command that has a different y coordinate from the previous line.

The code below is a very simple implementation. For it, I subclass LocationTextExtractionStrategy , which already implements ITextExtractionStrategy . On each call to RenderText() I find the rectangle of the current chunk (using Mark's code) and store it for later. I use this simple helper class to store these chunks and rectangles:
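
Since a line break is nothing more than a change in y coordinate, lines have to be rebuilt by comparing baselines. The following is a minimal sketch of that idea using plain tuples instead of iTextSharp types. LineGrouping, GroupIntoLines and the one-point tolerance are assumptions made for illustration, and the sketch only handles horizontal, unrotated text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineGrouping
{
    // Groups (text, x, y) chunks into lines. A chunk joins the current line when
    // its y is within `tolerance` points of that line's first chunk; otherwise it
    // starts a new line. Lines come back top to bottom, each read left to right.
    public static List<string> GroupIntoLines(
        IEnumerable<(string Text, float X, float Y)> chunks, float tolerance = 1f)
    {
        var lines = new List<List<(string Text, float X, float Y)>>();

        // In PDF space y grows upward, so the largest y is the top line.
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - chunk.Y) <= tolerance)
            {
                current.Add(chunk);
            }
            else
            {
                lines.Add(new List<(string Text, float X, float Y)> { chunk });
            }
        }

        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

The tolerance exists because chunks on what looks like one line do not always share exactly the same baseline value; the right threshold depends on the document.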

    //Helper class that stores our rectangle and text
    public class RectAndText
    {
        public iTextSharp.text.Rectangle Rect;
        public String Text;

        public RectAndText(iTextSharp.text.Rectangle rect, String text)
        {
            this.Rect = rect;
            this.Text = text;
        }
    }

And here is the subclass:

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        //Hold each coordinate
        public List<RectAndText> myPoints = new List<RectAndText>();

        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);

            //Get the bounding box for the chunk of text
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            //Add this to our main collection
            this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
        }
    }

And finally, the implementation of the above:

    //Our test file
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

    //Create our test file, nothing special
    using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None))
    {
        using (var doc = new Document())
        {
            using (var writer = PdfWriter.GetInstance(doc, fs))
            {
                doc.Open();

                doc.Add(new Paragraph("This is my sample file"));

                doc.Close();
            }
        }
    }

    //Create an instance of our strategy
    var t = new MyLocationTextExtractionStrategy();

    //Parse page 1 of the document above
    using (var r = new PdfReader(testFile))
    {
        var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
    }

    //Loop through each chunk found
    foreach (var p in t.myPoints)
    {
        Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
    }

I can’t stress enough that the above does not take “words” into account; that part is up to you. The TextRenderInfo object that is passed to RenderText has a method called GetCharacterRenderInfos() , which you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.

EDIT

(I had a great lunch, so I feel a little more helpful.)

Here's an updated version of MyLocationTextExtractionStrategy that does what my comments describe, namely: it takes a string to search for and searches each chunk for that string. For all the reasons listed above, this will not work in some / many / most / all cases. If the substring occurs several times in one chunk, it will also return only the first instance. Ligatures and diacritics can also interfere with this.

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        //Hold each coordinate
        public List<RectAndText> myPoints = new List<RectAndText>();

        //The string that we're searching for
        public String TextToSearchFor { get; set; }

        //How to compare strings
        public System.Globalization.CompareOptions CompareOptions { get; set; }

        public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
        {
            this.TextToSearchFor = textToSearchFor;
            this.CompareOptions = compareOptions;
        }

        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);

            //See if the current chunk contains the text
            var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

            //If not found bail
            if (startPosition < 0)
            {
                return;
            }

            //Grab the individual characters (Skip/Take/ToList need System.Linq)
            var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

            //Grab the first and last character
            var firstChar = chars.First();
            var lastChar = chars.Last();

            //Get the bounding box for the chunk of text
            var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
            var topRight = lastChar.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            //Add this to our main collection
            this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
        }
    }

You would use it the same way as before, but now the constructor has one required parameter:

 var t = new MyLocationTextExtractionStrategy("sample");
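
The version above reports only the first match in each chunk. One way to lift that limitation would be to collect every start position in the chunk text and build a rectangle for each. The sketch below covers only the string-search half and uses no iTextSharp types; ChunkSearch and AllIndexesOf are hypothetical names, and matches are taken as non-overlapping.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ChunkSearch
{
    // Returns the start index of every non-overlapping occurrence of `search`
    // in `text`, using the current culture's comparison rules.
    public static List<int> AllIndexesOf(
        string text, string search, CompareOptions options = CompareOptions.None)
    {
        var result = new List<int>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(search))
        {
            return result;
        }

        var compareInfo = CultureInfo.CurrentCulture.CompareInfo;
        int start = 0;
        while (start < text.Length)
        {
            int index = compareInfo.IndexOf(text, search, start, options);
            if (index < 0)
            {
                break;
            }
            result.Add(index);
            // Continue after this match so occurrences do not overlap.
            start = index + search.Length;
        }
        return result;
    }
}
```

Each returned index could then take the place of startPosition in RenderText, with the Skip and Take step repeated per match. The caveats about ligatures and diacritics still apply, because a text index does not always correspond to exactly one character render info.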
