I am trying to extract the text of a PDF that opens in the browser.
After clicking the "Print" button, a new URL opens in a new tab.
https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW
I ran the same program against other URLs and it worked fine. I used the same code as in the linked example (Extract PDF text).
I am using the following PDFBox versions:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>
Below is the code, which works fine with other URLs:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;
    // try-with-resources closes the HTTP stream on every path
    try (BufferedInputStream file = new BufferedInputStream(new URL(strURL).openStream())) {
        PDFParser parser = new PDFParser(file);
        parser.parse();
        cosDoc = parser.getDocument();
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // Only the first page is extracted
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed " + e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    } finally {
        // Release the documents whether or not parsing succeeded
        try {
            if (cosDoc != null) cosDoc.close();
            if (pdDoc != null) pdDoc.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");
    // parsedText is still null if parsing failed, so guard against a NullPointerException
    return parsedText != null && parsedText.contains(reqTextInPDF);
}
And below is the stack trace of the exception I get:
java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
    at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)
Update: I have added the screenshot I took while debugging, once with the URL and once with a file.
Please help me. Could this have something to do with "https"?
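To narrow this down, a small check like the one below might help. It is only a sketch: the class `PdfResponseCheck` and its methods are made up for illustration and are not part of PDFBox. The stack trace shows the parser reaching end of file inside `parseHeader`, so one possibility (an assumption, not verified against this server) is that the body Java receives is empty or is not a PDF at all, for example a redirect to a login page because the browser's session cookies are not sent by `url.openStream()`. The sketch prints the HTTP status, the `Content-Type`, any `Location` header, and whether the body starts with the `%PDF-` marker.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of PDFBox.
public class PdfResponseCheck {

    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first `length` bytes of `leadingBytes` begin with "%PDF-". */
    public static boolean hasPdfHeader(byte[] leadingBytes, int length) {
        if (length < PDF_MARKER.length) {
            return false; // too short to contain the marker (e.g. an empty body)
        }
        return Arrays.equals(PDF_MARKER, Arrays.copyOf(leadingBytes, PDF_MARKER.length));
    }

    /** Prints what the server returns to a plain Java client for the given URL. */
    public static void describe(String strURL) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(strURL).openConnection();
        // Do not follow redirects, so a redirect to a login page shows up as a 3xx status
        conn.setInstanceFollowRedirects(false);
        int status = conn.getResponseCode();
        System.out.println("Status:       " + status);
        System.out.println("Content-Type: " + conn.getContentType());
        System.out.println("Location:     " + conn.getHeaderField("Location"));
        if (status != HttpURLConnection.HTTP_OK) {
            return; // no PDF body to inspect
        }
        try (InputStream in = new BufferedInputStream(conn.getInputStream())) {
            byte[] head = new byte[PDF_MARKER.length];
            int read = 0;
            // Read up to five bytes; stop early if the body ends
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            System.out.println("Starts with %PDF-: " + hasPdfHeader(head, read));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java PdfResponseCheck <url>");
            return;
        }
        describe(args[0]);
    }
}
```

Run it with the print URL as the only argument. A 3xx, 401 or 403 status, or a `text/html` content type, would point at authentication rather than at HTTPS itself. A 200 status with a body that does not start with `%PDF-` would explain why the parser fails while reading the header.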