I am using iText to read from a PDF doc. I am getting an ArrayIndexOutOfBoundsException. The strange thing is it only happens for certain files and at certain locations in those files. I suspect it s something to do with the way the PDF is encoded at those locations but can t figure out what the problem is.
I have looked at this question Read pdf using iText but he seems to have solved his problem by changing the location of this file. This is not going to work for me as I get the exception at certain locations within some files - so it s not the file itself but the page in question that is causing the exception.
The stack trace is
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Invalid index: 02 at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source) at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowText.invoke(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source) at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source) at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source) at com.pdfextractor.main.Extractor.main(Extractor.java:61)
And line 61 corresponds to this line:
content = extractor.getTextFromPage(page);
So it seems quite obvious that the getTextFromPage() method is not working.
public static void main(String[] args) throws IOException{
ArrayList<String> keywords = new ArrayList<String>();
keywords.add("location");
keywords.add("Mass Spectrometry");
keywords.add("vacuole");
keywords.add("cytosol");
String directory = "C:/Ankur/Projects/PEB/Extractor/papers/";
File directoryToRead = new File(directory);
String[] sa_filesToRead = directoryToRead.list();
List<String> filesToRead = Arrays.asList(sa_filesToRead);
Iterator<String> fileItr = filesToRead.iterator();
while(fileItr.hasNext()){
String nextFile = fileItr.next();
PdfReader reader = new PdfReader(directory+nextFile);
int noPages = reader.getNumberOfPages();
PdfTextExtractor extractor = new PdfTextExtractor(reader);
String content="";
for(int page=1;page<=noPages;page++){
int index = 1;
System.out.println(page);
content = extractor.getTextFromPage(page);
}
}
}