I am trying to extract text from a PDF file using PDFBox, not as a command-line tool but from inside my Java application. I download the PDF using jsoup.
    res = Jsoup
            .connect(host + action)
            .ignoreContentType(true)
            .data(data)
            .cookies(cookies)
            .method(Method.POST)
            .timeout(20 * 1000)
            .execute();

    // prepare document
    InputStream is = new ByteArrayInputStream(res.bodyAsBytes());
    PDDocument pdf = new PDDocument();
    pdf.load(is, true);

    // extract text
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdf);

    // print extracted text
    System.out.println(text);
This code prints only an empty string. When I do this:
System.out.println(res.body());
it prints the raw contents of the PDF file, as follows:
%PDF-1.4 % 6 0 obj << /Filter /FlateDecode /Length 1869 >> stream x X n
...
<< /Size 28 /Info 27 0 R /Root 26 0 R >> startxref 20632 %%EOF
So I am sure that the PDF data is downloaded correctly; only the PDF text stripper does not work.
Edit: this problem is resolved. Working code is here: http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/
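For anyone who lands here with the same symptom, one likely cause is worth spelling out. Assuming PDFBox 1.x, where `PDDocument.load(InputStream, boolean)` is a static factory method that returns a new document, the call `pdf.load(is, true)` does not fill the existing `pdf` object. The loaded document is thrown away and the stripper receives the empty one. The sketch below reproduces that mechanism with the standard library only. `Doc` is a hypothetical stand-in for `PDDocument`, not a PDFBox class, and the method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StaticLoadPitfall {

    // Hypothetical stand-in for PDDocument: a fresh instance is empty,
    // and the static load(...) returns a NEW, populated instance.
    static class Doc {
        private final byte[] content;

        Doc() {
            this.content = new byte[0];
        }

        private Doc(byte[] content) {
            this.content = content;
        }

        static Doc load(InputStream in) {
            try {
                return new Doc(in.readAllBytes());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        int size() {
            return content.length;
        }
    }

    // Mirrors the question: the static method is called through an instance
    // and its return value is dropped, so doc stays empty.
    static int buggySize(byte[] data) {
        Doc doc = new Doc();
        doc.load(new ByteArrayInputStream(data));
        return doc.size();
    }

    // Keeps the document returned by the static factory.
    static int fixedSize(byte[] data) {
        Doc doc = Doc.load(new ByteArrayInputStream(data));
        return doc.size();
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII);
        System.out.println("buggy: " + buggySize(data));
        System.out.println("fixed: " + fixedSize(data));
    }
}
```

Under the same assumption, the corresponding change in the code above would be to replace the two lines that create and load the document with `PDDocument pdf = PDDocument.load(is, true);` and to call `pdf.close()` once the text has been extracted. Later PDFBox releases changed the loading API, so check the Javadoc for the version you are using.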