How to highlight a word or text in a PDF file using iTextSharp?

I need to find a word in an existing PDF file, highlight that text or word, and save the PDF file.

My idea is to use PdfAnnotation.CreateMarkup: find the position of the text and then add a background color to it, but I don't know how to implement it :(

Please help me

+3
pdf itextsharp
Jun 29 '11 at 15:33
5 answers

This is one of those things that "sounds easy but is actually really difficult." See the posts here and here. Ultimately, you will probably be pointed to LocationTextExtractionStrategy. Good luck! If you do figure out how to do this, post it here; there are several people wondering about exactly the same thing.

+4
Jun 29 2018-11-11T00:

I found out how to do this. In case someone needs to get words or sentences together with their locations (coordinates) from a PDF document, you will find an example project HERE. I used VB.NET 2010 for it. Do not forget to add a reference to your iTextSharp DLL in the project.

I added my own TextExtraction strategy class based on the LocationTextExtractionStrategy class. I focused on TextChunks because they already have these coordinates.
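
To illustrate the idea of working from text chunks that already carry coordinates, here is a minimal, library-free sketch in Python. It does not use iTextSharp and is not the code from the project: the `Chunk` class, its fields, and `match_rectangle` are hypothetical names, the search is case-sensitive, and the assumption that every character in a chunk is equally wide is a simplification (a real implementation would presumably use font metrics).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, bottom, right, top)


@dataclass
class Chunk:
    """Hypothetical stand-in for a text chunk with page coordinates."""
    text: str
    left: float
    right: float
    bottom: float
    top: float

    def x_at(self, char_index: int) -> float:
        # Assumption: all characters in a chunk are equally wide.
        width_per_char = (self.right - self.left) / len(self.text)
        return self.left + char_index * width_per_char


def match_rectangle(line: List[Chunk], term: str) -> Optional[Rect]:
    """Return the rectangle of the first occurrence of term in one line."""
    if not term:
        return None
    joined = "".join(c.text for c in line)
    start = joined.find(term)
    if start < 0:
        return None
    end = start + len(term)  # exclusive end index in the joined text

    left = right = None
    offset = 0  # index in the joined text where the current chunk begins
    for chunk in line:
        chunk_end = offset + len(chunk.text)
        if left is None and start < chunk_end:
            left = chunk.x_at(start - offset)   # match starts in this chunk
        if end <= chunk_end:
            right = chunk.x_at(end - offset)    # match ends in this chunk
            break
        offset = chunk_end

    bottom = min(c.bottom for c in line)
    top = max(c.top for c in line)
    return (left, bottom, right, top)
```

A match may start in one chunk and end in another, which is why the loop tracks the first and last chunk separately; the vertical extent is simply taken from the whole line.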

There are some known limitations, for example:

  • Searching across multiple lines (phrases) is not supported; only characters, a word, or a sentence within a single line.
  • It does not work with rotated text.
  • I have not tested landscape-oriented PDFs, but I assume some changes may be required.
  • If you need to draw the highlight rectangles above a watermark, you will need to add or change code, but only the code in the form; this is not related to the process of extracting text and locations.
+4
Jun 18 2018-12-12T00:

@Jcis, I actually managed a workaround to handle multiple search terms, using your example as a starting point. I use your project as a reference in a C# project and changed what it does. Instead of just highlighting, I draw a white rectangle over the search term and then, using the coordinates of that rectangle, place a form field there. I also had to change the content-writing mode so that the rectangle completely covers the search text. I created an array of search-term strings, and then, using a for loop, I create as many text fields as I need.

  Test.Form1 formBuilder = new Test.Form1();
  string[] fields = new string[] {
      "%AccountNumber%", "%MeterNumber%", "%EmailFieldHolder%", "%AddressFieldHolder%",
      "%EmptyFieldHolder%", "%CityStateZipFieldHolder%",
      "%emptyFieldHolder1%", "%emptyFieldHolder2%", "%emptyFieldHolder3%", "%emptyFieldHolder4%",
      "%emptyFieldHolder5%", "%emptyFieldHolder6%", "%emptyFieldHolder7%", "%emptyFieldHolder8%",
      "%SiteNameFieldHolder%", "%SiteNameFieldHolderWithExtraSpace%"
  };
  //int a = 0;
  for (int a = 0; a < fields.Length; )
  {
      string[] fieldNames = fields[a].Split('%');
      string[] fieldName = Regex.Split(fieldNames[1], "Field");
      formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, htmlToPdf, finalhtmlToPdf, fieldName[0]);
      File.Delete(htmlToPdf);
      System.Array.Clear(fieldNames, 0, 2);
      System.Array.Clear(fieldName, 0, 1);
      a++;
      if (a == fields.Length)
      {
          break;
      }
      string[] fieldNames1 = fields[a].Split('%');
      string[] fieldName1 = Regex.Split(fieldNames1[1], "Field");
      formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, finalhtmlToPdf, htmlToPdf, fieldName1[0]);
      File.Delete(finalhtmlToPdf);
      System.Array.Clear(fieldNames1, 0, 2);
      System.Array.Clear(fieldName1, 0, 1);
      a++;
  }

It calls the PDFTextGetter function from your example repeatedly, bouncing back and forth between the two files until it reaches the finished product. It works very well, and it would not have been possible without your initial project, so thanks for that. I also changed your VB code to place a text field, for example:

  For Each rect As iTextSharp.text.Rectangle In MatchesFound
      cb.Rectangle(rect.Left, rect.Bottom + 1, rect.Width, rect.Height + 4)
      Dim field As New TextField(stamper.Writer, rect, FieldName & Fields)
      Dim form = stamper.AcroFields
      Dim fieldKeys = form.Fields.Keys
      stamper.AddAnnotation(field.GetTextField(), page)
      Fields += 1
  Next

I just wanted to share what I managed to do with your project as a basis. It even increments the field names as I need them. I also had to add a new parameter to your function, but that is not worth listing here. Thanks again for this great start.
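
The C# loop above alternates the roles of two files so that each pass reads the output of the previous one. A minimal sketch of that pattern in Python, with ordinary files standing in for PDFs; `apply_passes`, `field_name`, and the `process` callback are hypothetical names, and `process` stands in for the modified `PDFTextGetter` call:

```python
import os
import re
from typing import Callable, Sequence


def field_name(placeholder: str) -> str:
    """'%EmailFieldHolder%' -> 'Email': text between '%', cut at 'Field'."""
    inner = placeholder.split("%")[1]
    return re.split("Field", inner)[0]


def apply_passes(terms: Sequence[str], first: str, second: str,
                 process: Callable[[str, str, str, str], None]) -> str:
    """Run one pass per term, swapping source and destination each time.

    process(term, source, destination, name) must write destination.
    Returns the path that holds the final result.
    """
    source, destination = first, second
    for term in terms:
        process(term, source, destination, field_name(term))
        os.remove(source)  # the consumed input is no longer needed
        source, destination = destination, source
    return source  # after the last swap, source is the latest output
```

With an odd number of terms the result ends up in the second path, so the caller should rely on the returned path rather than assume a fixed file name.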

+1
Jul 20 '12 at 17:25

Thanks Jcis!

After several hours of research and reflection, I found your solution, which helped me solve my problem.

There were two small mistakes.

First: the stamper has to be closed before the reader, otherwise an exception is thrown.

  Public Sub PDFTextGetter(ByVal pSearch As String, ByVal SC As StringComparison, ByVal SourceFile As String, ByVal DestinationFile As String)
      Dim stamper As iTextSharp.text.pdf.PdfStamper = Nothing
      Dim cb As iTextSharp.text.pdf.PdfContentByte = Nothing
      Me.Cursor = Cursors.WaitCursor
      If File.Exists(SourceFile) Then
          Dim pReader As New PdfReader(SourceFile)
          stamper = New iTextSharp.text.pdf.PdfStamper(pReader, New System.IO.FileStream(DestinationFile, FileMode.Create))
          PB.Value = 0 : PB.Maximum = pReader.NumberOfPages
          For page As Integer = 1 To pReader.NumberOfPages
              Dim strategy As myLocationTextExtractionStrategy = New myLocationTextExtractionStrategy
              'cb = stamper.GetUnderContent(page)
              cb = stamper.GetOverContent(page)
              Dim state As New PdfGState()
              state.FillOpacity = 0.3F
              cb.SetGState(state)
              'Send some data contained in PdfContentByte; it looks like the first is always zero for me and the second 100, but I'm not sure if this could change in some cases
              strategy.UndercontentCharacterSpacing = cb.CharacterSpacing
              strategy.UndercontentHorizontalScaling = cb.HorizontalScaling
              'It is not really needed to get the text back, but we have to call this line ALWAYS,
              'because it triggers the process that will get all chunks from the PDF into our strategy object
              Dim currentText As String = PdfTextExtractor.GetTextFromPage(pReader, page, strategy)
              'The real getter process starts in the following line
              Dim MatchesFound As List(Of iTextSharp.text.Rectangle) = strategy.GetTextLocations(pSearch, SC)
              'Set the fill color of the shapes. I don't use a border because it would make the rect bigger,
              'but maybe a thin border could be a solution if you see the current rect is not big enough to cover all the text it should cover
              cb.SetColorFill(BaseColor.PINK)
              'MatchesFound contains all text with locations, so do whatever you want with it; each match is filled in YELLOW below:
              For Each rect As iTextSharp.text.Rectangle In MatchesFound
                  ' cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                  cb.SaveState()
                  cb.SetColorFill(BaseColor.YELLOW)
                  cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                  cb.Fill()
                  cb.RestoreState()
              Next
              'cb.Fill()
              PB.Value = PB.Value + 1
          Next
          'Close the stamper first, then the reader
          stamper.Close()
          pReader.Close()
      End If
      Me.Cursor = Cursors.Default
  End Sub

Second: your solution does not work when the searched text is in the last line of the embedded text.

  Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
      Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
      Dim sb As New StringBuilder()
      Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
      Dim bStart As Boolean, bEnd As Boolean
      Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
      Dim sTextInUsedChunks As String = vbNullString
      ' For Each chunk As TextChunk In locationalResult
      For j As Integer = 0 To locationalResult.Count - 1
          Dim chunk As TextChunk = locationalResult(j)
          If chunk.text.Contains(pSearchString) Then
              Thread.Sleep(1)
          End If
          If ThisLineChunks.Count > 0 AndAlso (Not chunk.SameLine(ThisLineChunks.Last) Or j = locationalResult.Count - 1) Then
              If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
                  Dim sLine As String = sb.ToString
                  'Check how many times the Search String is present in this line:
                  Dim iCount As Integer = 0
                  Dim lPos As Integer
                  lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
                  Do While lPos > -1
                      iCount += 1
                      If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
                      lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
                  Loop
                  'Process each match found in this Text line:
                  Dim curPos As Integer = 0
                  For i As Integer = 1 To iCount
                      Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer
                      iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
                      curPos = iFromChar
                      iToChar = iFromChar + pSearchString.Length - 1
                      sCurrentText = vbNullString
                      sTextInUsedChunks = vbNullString
                      FirstChunk = Nothing
                      LastChunk = Nothing
                      'Get first and last Chunks corresponding to this match found, from all Chunks in this line
                      For Each chk As TextChunk In ThisLineChunks
                          sCurrentText = sCurrentText & chk.text
                          'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
                          If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
                              FirstChunk = chk
                              bStart = True
                          End If
                          'Keep getting Text from Chunks while we are in the part where the matching String had been found
                          If bStart And Not bEnd Then
                              sTextInUsedChunks = sTextInUsedChunks & chk.text
                          End If
                          'If we get out the matching String part then get this Chunk (last Chunk)
                          If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
                              LastChunk = chk
                              bEnd = True
                          End If
                          'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
                          'then it time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
                          If bStart And bEnd Then
                              FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
                              curPos = curPos + pSearchString.Length
                              bStart = False : bEnd = False
                              Exit For
                          End If
                      Next
                  Next
              End If
              sb.Clear()
              ThisLineChunks.Clear()
          End If
          ThisLineChunks.Add(chunk)
          sb.Append(chunk.text)
      Next
      Return FoundMatches
  End Function
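
The line-by-line scan in GetTextLocations can be sketched outside of iTextSharp as follows. This is a simplified Python illustration, not the project's code: chunks are plain `(text, line_y)` tuples (a hypothetical stand-in for `TextChunk` and `SameLine`), the search is case-sensitive, and the function reports character offsets rather than rectangles. One deliberate difference: the buffered line is flushed once more after the loop. In the function as posted, the end-of-input check appears to run before the final chunk is appended, so text in that very last chunk may not be searched; flushing after the loop is one possible way to cover it.

```python
from typing import Iterable, List, Optional, Tuple


def find_in_lines(chunks: Iterable[Tuple[str, float]],
                  term: str) -> List[Tuple[int, int]]:
    """Return (line_number, char_offset) for every occurrence of term.

    Chunks are (text, line_y) pairs in reading order; consecutive chunks
    with the same line_y are treated as one line (an assumed simplification).
    """
    matches: List[Tuple[int, int]] = []
    if not term:
        return matches

    line_parts: List[str] = []
    line_y: Optional[float] = None
    line_number = 0

    def flush() -> None:
        # Search the buffered line for non-overlapping occurrences.
        nonlocal line_number
        if not line_parts:
            return
        line = "".join(line_parts)
        pos = line.find(term)
        while pos != -1:
            matches.append((line_number, pos))
            pos = line.find(term, pos + len(term))
        line_parts.clear()
        line_number += 1

    for text, y in chunks:
        if line_parts and y != line_y:
            flush()  # the previous line is complete
        line_parts.append(text)
        line_y = y
    flush()  # also search the final line, including its last chunk
    return matches
```

For example, `find_in_lines([("last ", 680.0), ("foo", 680.0)], "foo")` finds the match even though it sits in the final chunk of the final line.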
+1
Mar 14 '17 at 15:46

I converted Jcis's VB project to a C# WPF application (file on Google Drive) and also applied Boris's corrections, but the project does not run. I would greatly appreciate it if someone who understands the program's algorithm could correct it.

0
Apr 19 '17 at 12:36