Reading a PDF with iText sometimes fails

I am using iText to read text from a PDF document, and I get an ArrayIndexOutOfBoundsException. Strangely, it happens only for certain files, and only at certain places in those files. I suspect it is caused by the way the PDF is encoded at those places, but I cannot figure out what the problem is.

I looked at the question "Reading pdf using iText", but its author seems to have solved the problem by changing the location of the file. That will not work for me, because I get the exception at specific places in certain files - so it is not the file itself, but the page in question, that causes the exception.

Stack trace

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor... (Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)

Line 61 is:
  content = extractor.getTextFromPage(page);
So the exception is thrown from inside getTextFromPage().

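import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class Extractor {
    public static void main(String[] args) throws IOException {
        // Keywords to search for later (not used yet in this snippet)
        ArrayList<String> keywords = new ArrayList<String>();
        keywords.add("location");
        keywords.add("Mass Spectrometry");
        keywords.add("vacuole");
        keywords.add("cytosol");

        // Every file in this directory is opened as a PDF
        String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
        File directoryToRead = new File(directory);
        String[] sa_filesToRead = directoryToRead.list();
        List<String> filesToRead = Arrays.asList(sa_filesToRead);

        Iterator<String> fileItr = filesToRead.iterator();
        while (fileItr.hasNext()) {
            String nextFile = fileItr.next();

            PdfReader reader = new PdfReader(directory + nextFile);
            int noPages = reader.getNumberOfPages();
            PdfTextExtractor extractor = new PdfTextExtractor(reader);

            String content = "";
            for (int page = 1; page <= noPages; page++) {
                System.out.println(page);
                // The exception is thrown here for some pages
                content = extractor.getTextFromPage(page);
            }
        }
    }
}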
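
Since the failure depends on the page rather than on the file, one way to narrow it down is to catch the exception per page and record which pages fail, instead of letting the first bad page abort the whole run. The following is only a minimal sketch of that idea, not a fix for the underlying problem. PageTextSource and failingPages are hypothetical names introduced here, and the fake source in main merely simulates one bad page so that the example runs without iText.

```java
import java.util.ArrayList;
import java.util.List;

final class PageFailureScan {

    /** Hypothetical stand-in for a per-page text extraction call. */
    interface PageTextSource {
        String textOf(int page) throws Exception;
    }

    /** Returns the 1-based numbers of the pages for which extraction threw. */
    static List<Integer> failingPages(PageTextSource source, int pageCount) {
        List<Integer> failed = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                source.textOf(page);
            } catch (Exception e) {
                // Log the page and keep going so later pages are still checked
                System.err.println("page " + page + ": " + e);
                failed.add(page);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Simulated document: page 2 fails, pages 1 and 3 extract fine
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        };
        System.out.println(failingPages(fake, 3)); // prints [2]
    }
}
```

With the code from the question, the fake source could be replaced by something like page -> extractor.getTextFromPage(page), which gives a list of the problematic pages per file to inspect or to attach to a bug report.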

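In most programming languages, Java included, indexes start at 0, so my guess is that getTextFromPage(int) is zero-based: getTextFromPage(0) would return page 1, getTextFromPage(1) page 2, and so on.

Your for loop, the one that throws the ArrayIndexOutOfBoundsException, starts at 1.

Are you sure that iText's getTextFromPage(int) starts at 1 and not, like arrays, at 0?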


Which version of iText are you using?


I have a similar problem, and it always happens when the text contains special characters. I wonder if there is a way to work around the encoding.

(Update) I had this problem with com.itextpdf.itextpdf 5.1.3, but after updating to 5.3.4 the issue was fixed.
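
The exception message may itself be a hint. As an assumption only (not confirmed anywhere in this thread, and not checked against the iText source), "Invalid index: 02" could mean that the decoder tried to read a 2-byte character code at offset 0 from a string that had fewer than two bytes left, with the offset and the length simply concatenated in the message. The sketch below illustrates that kind of failure and a lenient alternative. TwoByteDecoder and its methods are hypothetical names, and treating each 2-byte code directly as a UTF-16 code unit is a simplification; a real PDF font would map codes through its own CMap.

```java
final class TwoByteDecoder {

    /** Assumed code width: every character code occupies two bytes. */
    static final int CODE_LENGTH = 2;

    /** Simplification: interpret two big-endian bytes as one UTF-16 code unit. */
    private static char toChar(byte[] bytes, int offset) {
        return (char) (((bytes[offset] & 0xFF) << 8) | (bytes[offset + 1] & 0xFF));
    }

    /** Throws if the input ends in the middle of a 2-byte code. */
    static String decodeStrict(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (int offset = 0; offset < bytes.length; offset += CODE_LENGTH) {
            if (offset + CODE_LENGTH > bytes.length) {
                // String concatenation: offset 0 and length 2 yield "02"
                throw new ArrayIndexOutOfBoundsException(
                        "Invalid index: " + offset + CODE_LENGTH);
            }
            out.append(toChar(bytes, offset));
        }
        return out.toString();
    }

    /** Replaces a truncated trailing code with U+FFFD instead of throwing. */
    static String decodeLenient(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (int offset = 0; offset < bytes.length; offset += CODE_LENGTH) {
            if (offset + CODE_LENGTH > bytes.length) {
                out.append('\uFFFD');
                break;
            }
            out.append(toChar(bytes, offset));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeStrict(new byte[] {0x00, 0x41}));        // prints A
        System.out.println(decodeLenient(new byte[] {0x00, 0x41, 0x42})); // prints A followed by U+FFFD
        decodeStrict(new byte[] {0x41}); // throws: Invalid index: 02
    }
}
```

If this reading is right, it would fit the observation that only strings with special characters fail, and a library fix between versions would be the proper solution rather than a workaround in calling code.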
