I am using iText to read from a PDF document and I get an ArrayIndexOutOfBoundsException. Strangely, it happens only for certain files, and only at certain places in those files. I suspect it is caused by the way the PDF is encoded at those places, but I cannot figure out what the problem is.
I have looked at the question "Reading pdf using iText", but the asker there seems to have solved the problem by changing the location of the file. That will not work for me, because I get the exception at specific places in some files, so it is not the file itself but the page in question that causes the exception.
Stack trace
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$...(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)
Line 61 of Extractor.java is:
content = extractor.getTextFromPage(page);
So the exception is clearly raised inside the getTextFromPage() method.
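// Imports used by this method:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public static void main(String[] args) throws IOException {
    // Keywords to look for in the extracted text (not used yet in this excerpt)
    ArrayList<String> keywords = new ArrayList<String>();
    keywords.add("location");
    keywords.add("Mass Spectrometry");
    keywords.add("vacuole");
    keywords.add("cytosol");

    // Read every file in the papers directory
    String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
    File directoryToRead = new File(directory);
    String[] sa_filesToRead = directoryToRead.list();
    List<String> filesToRead = Arrays.asList(sa_filesToRead);
    Iterator<String> fileItr = filesToRead.iterator();
    while (fileItr.hasNext()) {
        String nextFile = fileItr.next();
        PdfReader reader = new PdfReader(directory + nextFile);
        int noPages = reader.getNumberOfPages();
        PdfTextExtractor extractor = new PdfTextExtractor(reader);
        String content = "";
        // Pages are 1-based in iText
        for (int page = 1; page <= noPages; page++) {
            int index = 1;
            System.out.println(page);
            // This call throws the exception on certain pages
            content = extractor.getTextFromPage(page);
        }
    }
}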
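To narrow down exactly which pages fail, the page loop above could catch the exception per page and keep going instead of aborting on the first bad page. The following is only a minimal sketch of that idea, not a fix for the underlying decoding problem. The class PageProbe, the interface PageTextSource, and the method findFailingPages are hypothetical names introduced for illustration; PageTextSource stands in for extractor.getTextFromPage(page), so the sketch runs without iText on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class PageProbe {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(page). */
    interface PageTextSource {
        String getTextFromPage(int page) throws Exception;
    }

    /**
     * Tries to extract every page (1-based, as in the loop above) and
     * returns the numbers of the pages on which extraction threw.
     */
    static List<Integer> findFailingPages(PageTextSource source, int noPages) {
        List<Integer> failing = new ArrayList<Integer>();
        for (int page = 1; page <= noPages; page++) {
            try {
                source.getTextFromPage(page);
            } catch (Exception e) {
                // Record the page and continue with the next one
                failing.add(page);
                System.err.println("Page " + page + " failed: " + e);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Fake source for demonstration: page 3 throws, as a bad page would
        PageTextSource fake = page -> {
            if (page == 3) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        };
        System.out.println(findFailingPages(fake, 5)); // prints [3]
    }
}
```

With the real extractor, the lambda would presumably be replaced by something like page -> extractor.getTextFromPage(page). The resulting list of page numbers could then be compared across files to see what the failing pages have in common, for example a particular embedded font.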