Here is a very simple version of the implementation.
Before looking at the implementation, it is very important to know that PDF files have zero concept of “words”, “paragraphs”, “sentences”, etc. In addition, the text in a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" can be written into a PDF as:
    Draw H at (10, 10)
    Draw ell at (20, 10)
    Draw rld at (90, 10)
    Draw o Wo at (50, 20)

It can also be written as:

    Draw Hello World at (10,10)
The ITextExtractionStrategy interface that you need to implement has a method called RenderText, which is called once for each chunk of text in the PDF file. Notice that I said “chunk,” not “word.” In the first example above, the method will be called four times for those two words. In the second example, it will be called once for those two words. This is a very important part to understand. PDF files have no words, and because of this, iTextSharp has no words either. The “word” part is 100% up to you to solve.
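To make the chunk-versus-word point concrete, here is a small sketch that does not touch iTextSharp at all. `DrawCommand` and `ChunkDemo` are made-up names used only for illustration; they simply replay the two "Hello World" examples above and show what a per-chunk callback would see.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

//Hypothetical stand-in for one text-drawing command; not an iTextSharp type
public class DrawCommand {
    public string Text;
    public float X;
    public float Y;

    public DrawCommand(string text, float x, float y) {
        this.Text = text;
        this.X = x;
        this.Y = y;
    }
}

public static class ChunkDemo {
    //The four-piece version of "Hello World" from the example above
    public static List<DrawCommand> FourPieces() {
        return new List<DrawCommand> {
            new DrawCommand("H", 10, 10),
            new DrawCommand("ell", 20, 10),
            new DrawCommand("rld", 90, 10),
            new DrawCommand("o Wo", 50, 20)
        };
    }

    //The single-piece version of the same phrase
    public static List<DrawCommand> OnePiece() {
        return new List<DrawCommand> { new DrawCommand("Hello World", 10, 10) };
    }

    //Concatenates the pieces in the order they were drawn, which is the
    //order a per-chunk callback would receive them
    public static string JoinInDrawOrder(IEnumerable<DrawCommand> commands) {
        return string.Concat(commands.Select(c => c.Text));
    }

    public static void Main() {
        Console.WriteLine(FourPieces().Count + " callbacks: " + JoinInDrawOrder(FourPieces()));
        Console.WriteLine(OnePiece().Count + " callback: " + JoinInDrawOrder(OnePiece()));
    }
}
```

The first list yields four callbacks and the scrambled string "Hellrldo Wo", while the second yields a single callback with the full phrase. Neither callback ever sees a "word" as such.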
Along the same lines, as I said above, PDF files have no paragraphs. The reason to be aware of this is that PDFs cannot wrap text onto a new line. Any time you see something that looks like a paragraph return, you are actually seeing a new text-drawing command that has a different y coordinate than the previous line. See for further discussion.
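Since a "line" is really just a set of chunks that share a y coordinate, one rough way to rebuild lines is to group chunks by y and sort each group by x. The sketch below is my own illustration with made-up names (`PositionedChunk`, `LineGrouper`) and made-up coordinates. Rounding y to the nearest unit and joining chunks with a single space are both simplifying assumptions that real documents may violate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

//Hypothetical chunk record: the text of one drawing command plus its start position
public class PositionedChunk {
    public string Text;
    public float X;
    public float Y;

    public PositionedChunk(string text, float x, float y) {
        this.Text = text;
        this.X = x;
        this.Y = y;
    }
}

public static class LineGrouper {
    //Groups chunks whose y coordinates round to the same value into one line.
    //PDF user space has y increasing upward, so larger y values come first.
    public static List<string> ToLines(IEnumerable<PositionedChunk> chunks) {
        return chunks
            .GroupBy(c => Math.Round(c.Y))
            .OrderByDescending(g => g.Key)
            //Assumption: chunks on one line are separated by a single space
            .Select(g => string.Join(" ", g.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }

    public static void Main() {
        //Made-up sample data, deliberately out of reading order
        var chunks = new List<PositionedChunk> {
            new PositionedChunk("wrapped line", 36, 686),
            new PositionedChunk("a paragraph", 80, 700),
            new PositionedChunk("This is", 36, 700)
        };

        foreach (var line in ToLines(chunks)) {
            Console.WriteLine(line);
        }
    }
}
```

With the sample data this prints "This is a paragraph" followed by "wrapped line". Deciding whether consecutive lines belong to the same paragraph would need a further heuristic, such as comparing the gap between y values.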
The code below is a very simple implementation. For it, I subclass LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I use this simple helper class to store the chunks and rectangles:
    //Helper class that stores our rectangle and text
    public class RectAndText {
        public iTextSharp.text.Rectangle Rect;
        public String Text;
        public RectAndText(iTextSharp.text.Rectangle rect, String text) {
            this.Rect = rect;
            this.Text = text;
        }
    }
And here is the subclass:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
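Only the class declaration appears above. As a minimal sketch of a body that matches the description (find the rectangle of each chunk and store it) and the `myPoints` collection used further down, something like the following should work. It assumes iTextSharp 5.x and the RectAndText helper shown earlier, and it is not necessarily the original code.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

//Sketch only: assumes iTextSharp 5.x and the RectAndText helper class shown earlier
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Holds every chunk seen so far, together with its bounding rectangle
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Called by the parser once for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        //Let the base class keep building its own text result
        base.RenderText(renderInfo);

        //Bottom-left comes from the descent line, top-right from the ascent line
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Build a rectangle from the two corners
        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        //Remember the rectangle together with the chunk's text
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}
```

Calling `base.RenderText()` first keeps the normal text extraction of the base class intact, so the strategy can still be passed to PdfTextExtractor while the rectangles are collected on the side.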
And finally, the implementation of the above:
    //Our test file
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

    //Create our test file, nothing special
    using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();
                doc.Add(new Paragraph("This is my sample file"));
                doc.Close();
            }
        }
    }

    //Create an instance of our strategy
    var t = new MyLocationTextExtractionStrategy();

    //Parse page 1 of the document above
    using (var r = new PdfReader(testFile)) {
        var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
    }

    //Loop through each chunk found
    foreach (var p in t.myPoints) {
        Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
    }
I can’t stress enough that the above does not take “words” into account; that part is up to you. The TextRenderInfo object that is passed to RenderText has a method called GetCharacterRenderInfos(), which you could use to get more information. You can also use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.
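As a rough illustration of those two methods, here is a sketch of a strategy that records every character separately, using the baseline rather than the descent line. `PerCharacterStrategy` is a name I made up, and the code assumes iTextSharp 5.x.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

//Hypothetical strategy name; a sketch assuming iTextSharp 5.x
public class PerCharacterStrategy : LocationTextExtractionStrategy {
    //One entry per character: its text and the point where its baseline starts
    public List<KeyValuePair<string, Vector>> Characters =
        new List<KeyValuePair<string, Vector>>();

    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Split the chunk into one TextRenderInfo per character
        foreach (TextRenderInfo ch in renderInfo.GetCharacterRenderInfos()) {
            //GetBaseline() ignores descenders; GetDescentLine() would include them
            Vector start = ch.GetBaseline().GetStartPoint();
            this.Characters.Add(new KeyValuePair<string, Vector>(ch.GetText(), start));
        }
    }
}
```

Having one position per character is what you would need if you wanted to build your own word detection, for example by looking at the horizontal gaps between neighbouring characters.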
EDIT
(I had a great lunch, so I feel a little more helpful.)
Here's an updated version of MyLocationTextExtractionStrategy that does what I describe in my comments below, namely: it takes a string to search for and looks for that string in each chunk. For all the reasons listed above, this will not work in some / many / most / all cases. If the substring occurs several times in a single chunk, only the first instance is returned. Ligatures and diacritics could also cause trouble.
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
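Again only the class declaration appears above. A minimal sketch consistent with the description (search each chunk for a string and record the rectangle of the first match) might look like this. It assumes iTextSharp 5.x and the RectAndText helper from earlier. The `Options` property and the optional `compareOptions` parameter are my own assumptions, not something confirmed by the text.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using iTextSharp.text.pdf.parser;

//Sketch only: assumes iTextSharp 5.x and the RectAndText helper class shown earlier
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Holds the rectangle of every match found
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we are searching for
    public string TextToSearchFor { get; private set; }

    //How to compare strings (assumed optional setting)
    public CompareOptions Options { get; private set; }

    public MyLocationTextExtractionStrategy(string textToSearchFor, CompareOptions compareOptions = CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.Options = compareOptions;
    }

    //Called by the parser once for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Find the first occurrence of the search string inside this chunk
        int start = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
            renderInfo.GetText(), this.TextToSearchFor, this.Options);
        if (start < 0) {
            return;
        }

        //Grab the individual characters that make up the match
        var chars = renderInfo.GetCharacterRenderInfos()
            .Skip(start)
            .Take(this.TextToSearchFor.Length)
            .ToList();
        if (chars.Count == 0) {
            return;
        }

        //Bounding box runs from the first character's descent to the last character's ascent
        Vector bottomLeft = chars.First().GetDescentLine().GetStartPoint();
        Vector topRight = chars.Last().GetAscentLine().GetEndPoint();

        var rect = new iTextSharp.text.Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}
```

Because the search runs per chunk, a string that is split across two drawing commands will never match, which is one of the reasons this approach fails on many real documents.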
You would use it the same way as before, but now the constructor has one required parameter:
    var t = new MyLocationTextExtractionStrategy("sample");