PDF to String

What is the easiest way to get the text (words) of a PDF file, either as a single long line or as an array of lines?

I tried PDFBox, but it did not work for me.

+6
java io text pdf
4 answers

Use iText. The following snippet, for example, extracts the text of a single page.

    PdfTextExtractor parser = new PdfTextExtractor(new PdfReader("C:/Text.pdf"));
    String text = parser.getTextFromPage(3); // text of page 3
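
Once a library hands back the page text as a String, the two shapes asked for in the question need only the standard library. This is a minimal sketch, assuming the extracted text uses ordinary line breaks; `TextShapes`, `toLines` and `toSingleLine` are hypothetical helper names, not part of iText.

```java
import java.util.Arrays;

final class TextShapes {
    // Split on any line break (\n, \r\n, \r, ...), trim each line, drop blank lines.
    static String[] toLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .toArray(String[]::new);
    }

    // Collapse every run of whitespace, including line breaks, into one space.
    static String toSingleLine(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Stand-in for text returned by a PDF library.
        String pageText = "First line\r\nSecond  line\n\nThird line\n";
        System.out.println(Arrays.toString(toLines(pageText)));
        System.out.println(toSingleLine(pageText));
    }
}
```

Collapsing whitespace discards the page layout, so keep the array form if line boundaries matter later.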

+4

PDFBox barfs on many newer PDF files, especially ones with embedded PNG images.

I was very impressed with PDFTextStream.

+2

JPedal and Multivalent also offer text extraction in Java, or you can call xpdf as an external program using Runtime.exec.
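
The external-program route could look roughly like the sketch below. It assumes xpdf's `pdftotext` binary is installed and on the PATH, and it needs Java 9+ for `readAllBytes`. It uses `ProcessBuilder`, the more configurable sibling of `Runtime.exec`. `XpdfTextExtractor`, `buildCommand` and `extract` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class XpdfTextExtractor {
    // "-" as the output file asks pdftotext to write the text to stdout.
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    static String extract(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                // Send the tool's warnings to our stderr so they neither
                // pollute the extracted text nor fill an unread pipe.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String output;
        try (InputStream in = process.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the PDF to read.
        System.out.println(extract(args[0]));
    }
}
```

Passing the arguments as a list rather than one command string avoids quoting problems with paths that contain spaces.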

+1

I used Tika to extract the plain text from PDFs (its PDF support is based on PDFBox), but I think Tika is mainly useful when you need to extract text from several different file formats, where its automatic format detection helps a lot.

If you only need to extract text from PDFs, I would suggest PDFTextStream, because in my experience it is a much better parser than the other APIs (e.g. iText and PDFBox).

With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> text units), and it lets you retrieve related information such as character encoding, height, and the location of each character on the page.

Example:

    import java.io.IOException;

    import com.snowtide.pdf.OutputTarget;
    import com.snowtide.pdf.PDFTextStream;

    public class ExtractTextAllPages {
        public static void main(String[] args) throws IOException {
            String pdfFilePath = args[0];
            PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
            StringBuilder text = new StringBuilder(1024);
            // Pipe the text of every page into the StringBuilder.
            pdfts.pipe(new OutputTarget(text));
            pdfts.close();
            System.out.printf("The text extracted from %s is:%n", pdfFilePath);
            System.out.println(text);
        }
    }
0