Reading a PDF document with iText not working sometimes

I am using iText to read from a PDF document and I am getting an ArrayIndexOutOfBoundsException. The strange thing is that it only happens for certain files, and at certain locations in those files. I suspect it has something to do with the way the PDF is encoded at those locations, but I can't figure out what the problem is.

I have looked at the question "Read pdf using iText", but the asker there seems to have solved his problem by changing the location of the file. That is not going to work for me, because I get the exception at certain locations within some files. So it is not the file itself but the page in question that is causing the exception.

The stack trace is

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
    at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
    at com.pdfextractor.main.Extractor.main(Extractor.java:61)

And line 61 corresponds to this line:
content = extractor.getTextFromPage(page);
So it seems quite obvious that the getTextFromPage() method is not working.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.parser.PdfTextExtractor;

public class Extractor {

    public static void main(String[] args) throws IOException {
        // Keywords to look for in the extracted text (not used yet in this excerpt).
        List<String> keywords = new ArrayList<String>();
        keywords.add("location");
        keywords.add("Mass Spectrometry");
        keywords.add("vacuole");
        keywords.add("cytosol");

        // Every file in this directory is opened as a PDF.
        String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
        File directoryToRead = new File(directory);
        String[] filesToRead = directoryToRead.list();

        for (String nextFile : filesToRead) {
            PdfReader reader = new PdfReader(directory + nextFile);
            int noPages = reader.getNumberOfPages();
            PdfTextExtractor extractor = new PdfTextExtractor(reader);

            String content = "";
            for (int page = 1; page <= noPages; page++) {
                System.out.println(page);
                // This is the call that throws on some pages of some files.
                content = extractor.getTextFromPage(page);
            }
        }
    }
}
Best answer

Most Java classes and libraries expect a method like getTextFromPage(int) to be indexed starting at 0, meaning that getTextFromPage(0) would return the text from page 1 and getTextFromPage(1) the text from page 2.

Your for loop that causes the ArrayIndexOutOfBoundsException is indexed starting with 1.

Are you sure that iText's getTextFromPage(int) is indexed starting at 1 rather than the (almost) standard 0?
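
To illustrate the kind of mismatch I mean, here is a minimal sketch that uses a plain Java array rather than iText. The class IndexDemo and the method firstFailingIndex are made up for this example. Whether iText actually counts pages from 0 is exactly the open question, and the stack trace in the question points at CMapAwareDocumentFont.decodeSingleCID rather than at a page lookup, so treat this only as an illustration of the off-by-one pattern.

```java
public class IndexDemo {

    // Walks the array with a loop that starts at `start` and runs up to
    // pages.length inclusive, the way a 1-based page loop would.
    // Returns the first index that throws, or -1 if none does.
    static int firstFailingIndex(String[] pages, int start) {
        for (int i = start; i <= pages.length; i++) {
            try {
                String text = pages[i]; // arrays are zero-based
            } catch (ArrayIndexOutOfBoundsException e) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] pages = {"page one", "page two", "page three"};
        // A 1-based loop over a zero-based array fails at index 3.
        System.out.println(firstFailingIndex(pages, 1));
    }
}
```

A mismatch like this fails only on the final iteration. If the exception shows up on pages in the middle of a file, the indexing is probably not the cause.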

Other answers

Have you tried posting on the very active iText mailing list?

I have a similar problem, and it always occurs where the text contains special characters. I wonder if there is a way to work around the encoding.
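
One stopgap, assuming the failure is confined to individual pages as described in the question, is to catch the exception per page so that the rest of the file is still processed. This does not fix the decoding itself. The sketch below uses a hypothetical PageTextSource interface as a stand-in for extractor.getTextFromPage(page), so it runs without iText on the classpath; SafeExtraction and extractAll are likewise invented names.

```java
import java.util.ArrayList;
import java.util.List;

public class SafeExtraction {

    /** Hypothetical stand-in for extractor.getTextFromPage(page). */
    interface PageTextSource {
        String getTextFromPage(int page) throws Exception;
    }

    /**
     * Extracts pages 1..noPages into `out`, one page per line.
     * Pages that throw are skipped and their numbers are returned.
     */
    static List<Integer> extractAll(PageTextSource source, int noPages, StringBuilder out) {
        List<Integer> failedPages = new ArrayList<>();
        for (int page = 1; page <= noPages; page++) {
            try {
                out.append(source.getTextFromPage(page)).append('\n');
            } catch (Exception e) {
                // For example, the ArrayIndexOutOfBoundsException from the font decoder.
                failedPages.add(page);
            }
        }
        return failedPages;
    }

    public static void main(String[] args) {
        // Fake source: page 2 fails the way the problematic pages do.
        PageTextSource fake = page -> {
            if (page == 2) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 02");
            }
            return "text of page " + page;
        };
        StringBuilder text = new StringBuilder();
        List<Integer> failed = extractAll(fake, 3, text);
        System.out.println("Failed pages: " + failed);
    }
}
```

With the real library, the lambda would simply delegate to the extractor, and the returned list tells you which pages to inspect by hand.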

(Updated) I had this problem with com.itextpdf.itextpdf 5.1.3, but after updating to 5.3.4 the problem was fixed.




