Extract text from a PDF using pdfbox

Question

I am trying to extract text from a PDF file using pdfbox, not as a command-line tool but from inside my Java application. I download the PDF using jsoup.

 res = Jsoup.connect(host+action)
     .ignoreContentType(true)
     .data(data)
     .cookies(cookies)
     .method(Method.POST)
     .timeout(20*1000)
     .execute();

 // prepare document
 InputStream is = new ByteArrayInputStream(res.bodyAsBytes());
 PDDocument pdf = new PDDocument();
 pdf.load(is,true);

 // extract text
 PDFTextStripper stripper = new PDFTextStripper();
 String text = stripper.getText(pdf);

 // print extracted text
 System.out.println(text);

This code prints only an empty string. When I do this:

 System.out.println(res.body());

it prints the raw content of the PDF file, as follows:

 %PDF-1.4 %     6 0 obj << /Filter /FlateDecode /Length 1869 >> stream x  X n

...

 << /Size 28 /Info 27 0 R /Root 26 0 R >> startxref 20632 %%EOF

So I am sure that the PDF data is downloaded correctly; only the PDF text stripper does not work.

Edit: this problem is resolved. Working code is here: http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/

java pdf jsoup pdfbox

user606521 Jan 16 '13 at 8:54

1 answer

Brian Tompsett - 汤莱恩 · Answer 1 · 2015-01-26T14:21:57+0000

(The question was answered in the comments. See "Question with no answers, but issue solved in the comments (or extended in chat)".)

@WeloSefer wrote:

Maybe this can help you get started ... I have never worked with jsoup or pdfbox, so I am not much help, but I will certainly give pdfbox a try, since I have been testing the itextpdf reader for extracting text.

OP wrote:

Thank you, this is what I was looking for. It works now :) This problem is solved; the working code is here: http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/
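The linked post is not reproduced here, so what follows is only a likely explanation for the empty output, not the OP's confirmed fix. In PDFBox 1.x, PDDocument.load is a static factory that returns a new document. Calling pdf.load(is,true) on a fresh new PDDocument() therefore throws the loaded document away and leaves pdf empty. Assigning the result instead, as in PDDocument pdf = PDDocument.load(is, true);, should avoid that. The sketch below reproduces the pitfall with the standard library only. The class Doc and both helper methods are hypothetical stand-ins and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

class StaticLoadPitfall {

    /** Hypothetical stand-in for a document class whose load method is a static factory. */
    static final class Doc {
        private final byte[] content;

        Doc() {
            this.content = new byte[0]; // a freshly constructed document is empty
        }

        private Doc(byte[] content) {
            this.content = content;
        }

        /** Static factory: returns a NEW document and never touches an existing instance. */
        static Doc load(byte[] bytes) {
            return new Doc(bytes.clone());
        }

        int size() {
            return content.length;
        }
    }

    /** Mirrors the pattern in the question: the loaded document is discarded. */
    static int sizeAfterBuggyLoad(byte[] bytes) {
        Doc doc = new Doc();
        doc.load(bytes); // compiles, but this is really Doc.load(bytes) with the result ignored
        return doc.size();
    }

    /** Keeps the document returned by the static factory. */
    static int sizeAfterCorrectLoad(byte[] bytes) {
        Doc doc = Doc.load(bytes);
        return doc.size();
    }

    public static void main(String[] args) {
        byte[] bytes = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sizeAfterBuggyLoad(bytes));   // prints 0
        System.out.println(sizeAfterCorrectLoad(bytes)); // prints 8
    }
}
```

Java allows a static method to be called through an instance reference, so the buggy form compiles without error. Compiling with javac -Xlint:static reports such calls as warnings.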
