How to convert a PDF to a text file with iTextSharp

I need to extract text from a PDF file, but the following code produces an empty text file.

    for (int i = 0; i < n; i++)
    {
        pagenumber = i + 1;
        filename = pagenumber.ToString();
        while (filename.Length < digits) filename = "0" + filename;
        filename = "_" + filename;
        filename = splitFile + name + filename;
        // step 1: creation of a document-object
        document = new Document(reader.GetPageSizeWithRotation(pagenumber));
        // step 2: we create a writer that listens to the document
        PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(filename + ".pdf", FileMode.Create));
        // step 3: we open the document
        document.Open();
        PdfContentByte cb = writer.DirectContent;
        PdfImportedPage page = writer.GetImportedPage(reader, pagenumber);
        int rotation = reader.GetPageRotation(pagenumber);
        if (rotation == 90 || rotation == 270)
        {
            cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(pagenumber).Height);
        }
        else
        {
            cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
        }
        // step 5: we close the document
        document.Close();
        PDFParser parser = new PDFParser();
        parser.ExtractText(filename + ".pdf", filename + ".txt");
    }

What am I doing wrong, and how do I extract text from a PDF?

2 answers

To extract text using iTextSharp, get the current version of the library and call

 PdfTextExtractor.GetTextFromPage(reader, pageNumber); 

Beware: some 5.3.x version has a bug in the text extraction code, which has since been fixed in trunk. You may therefore want to check out the trunk version.
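For the per-page output the question is after, the call above can be applied to the original reader directly, without first writing one PDF per page. The following is only a sketch, assuming iTextSharp 5.x; the class name `PageTextExport` and the helpers `PageFileName` and `ExtractPages` are hypothetical names chosen for illustration, and the zero-padded naming imitates the question's loop.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PageTextExport
{
    // Hypothetical helper: builds "<baseName>_<zero-padded page>.txt".
    public static string PageFileName(string baseName, int page, int digits)
    {
        return baseName + "_" + page.ToString().PadLeft(digits, '0') + ".txt";
    }

    // Writes the text of every page of inputPdf to its own UTF-8 file.
    public static void ExtractPages(string inputPdf, string baseName)
    {
        var reader = new PdfReader(inputPdf);
        try
        {
            int n = reader.NumberOfPages;
            int digits = n.ToString().Length;
            // iTextSharp page numbers start at 1, not 0.
            for (int page = 1; page <= n; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                File.WriteAllText(PageFileName(baseName, page, digits), text, Encoding.UTF8);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call such as `PageTextExport.ExtractPages("input.pdf", "input")` (file names assumed) would then produce one text file per page.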

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    namespace Pdf2Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                if (!args.Any()) return;
                var file = args[0];
                var output = Path.ChangeExtension(file, ".txt");
                if (!File.Exists(file)) return;
                var bytes = File.ReadAllBytes(file);
                File.WriteAllText(output, ConvertToText(bytes), Encoding.UTF8);
            }

            private static string ConvertToText(byte[] bytes)
            {
                var sb = new StringBuilder();
                try
                {
                    var reader = new PdfReader(bytes);
                    var numberOfPages = reader.NumberOfPages;
                    for (var currentPageIndex = 1; currentPageIndex <= numberOfPages; currentPageIndex++)
                    {
                        sb.Append(PdfTextExtractor.GetTextFromPage(reader, currentPageIndex));
                    }
                }
                catch (Exception exception)
                {
                    Console.WriteLine(exception.Message);
                }
                return sb.ToString();
            }
        }
    }

Source: https://habr.com/ru/post/928022/

