PDF to String

Question

What is the easiest way to get the text (words) of a PDF file, either as a single long line or as an array of lines?

I tried PDFBox, but it does not work for me.

+6

java io text pdf

Ankur Nov 05 '09 at 4:59

4 answers

Kushal paudyal · Answer 1 · 2009-11-05T16:29:47+0000

Use iText. The following snippet, for example, extracts the text of a single page.

    PdfReader reader = new PdfReader("C:/Text.pdf");
    PdfTextExtractor parser = new PdfTextExtractor(reader);
    String text = parser.getTextFromPage(3); // page numbers start at 1

Sam barnum · Answer 2 · 2009-11-05T15:53:01+0000

PDFBox barfs on many newer PDF files, especially ones with embedded PNG images.

I was very impressed with PDFTextStream.

mark stephens · Answer 3 · 2009-11-05T07:44:11+0000

JPedal and Multivalent also offer text extraction in Java, or you can call xpdf through Runtime.exec.
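
For the xpdf route, here is a minimal sketch, assuming the `pdftotext` tool that ships with xpdf is installed and on the PATH. It uses `ProcessBuilder` rather than `Runtime.exec` directly, because that makes it easier to pass the arguments as a list and discard stderr. The class and method names (`XpdfText`, `command`, `extract`) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

class XpdfText {
    // Command line: pdftotext -enc UTF-8 <pdf> -
    // The trailing "-" tells pdftotext to write the text to stdout.
    static List<String> command(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns everything it printed as one String.
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

Calling `XpdfText.extract(Path.of("Text.pdf"))` would then return the document text, provided the tool is available; the file name here is only a placeholder.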

yeaaaahhhh..hamf hamf · Answer 4 · 2014-02-24T12:12:58+0000

I used Tika to extract text from PDFs (it is built on PDFBox), but I think Tika is only worthwhile when you need to extract text from several different file formats (its automatic format detection helps a lot).

If you only need to extract text from PDFs, I would suggest PDFTextStream, because it is a much better parser than the other APIs (e.g. iText and PDFBox).

With PDFTextStream you can easily get structured text (pages -> blocks -> lines -> text units), and it lets you retrieve associated information such as the character encoding, height, and position of each character on the page.

Example:

    import java.io.IOException;
    import com.snowtide.pdf.OutputTarget;
    import com.snowtide.pdf.PDFTextStream;

    public class ExtractTextAllPages {
        public static void main(String[] args) throws IOException {
            String pdfFilePath = args[0];
            PDFTextStream pdfts = new PDFTextStream(pdfFilePath);
            // Pipe the text of every page into a single buffer.
            StringBuilder text = new StringBuilder(1024);
            pdfts.pipe(new OutputTarget(text));
            pdfts.close();
            System.out.printf("The text extracted from %s is:%n", pdfFilePath);
            System.out.println(text);
        }
    }
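
Whichever library produces the text, you still have to turn it into the two shapes asked for in the question. A minimal sketch using only the JDK follows; the class and method names (`TextShapes`, `toSingleLine`, `toLines`) are hypothetical helpers, not part of any PDF library, and dropping blank lines is my assumption about what is wanted.

```java
import java.util.Arrays;

class TextShapes {
    // Collapse every run of whitespace (line breaks included) into one space.
    static String toSingleLine(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Split on any line terminator (\n, \r\n, ...), trim each line, drop blank ones.
    static String[] toLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .toArray(String[]::new);
    }
}
```

For example, `TextShapes.toSingleLine(text.toString())` applied to the `StringBuilder` above gives the single long line, and `TextShapes.toLines(text.toString())` gives the array of lines.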
