I'm looking for a PDF library that will allow me to extract the text from a PDF document. I've looked at PyPDF, which extracts the text from a PDF document very nicely. The problem is that if there are tables in the document, the text in the tables is extracted inline with the rest of the document text. This is problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together).
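To make the problem concrete, here is a minimal sketch of the kind of manual post-filtering this forces on already-extracted text: it drops lines whose tokens are mostly numeric. The helper name `drop_table_like_lines`, the `max_numeric_ratio` threshold, and the sample text are all made up for illustration; this is a rough heuristic and not something PyPDF provides.

```python
import re

# Hypothetical helper, not part of PyPDF or any PDF library.
# A token counts as "numeric" if it looks like 1200, 1,350.50, -12%, or $40.
NUMERIC_TOKEN = re.compile(r"^[-+(]?[$€£]?\d[\d,.]*%?\)?$")


def drop_table_like_lines(text, max_numeric_ratio=0.5):
    """Keep only lines where at most `max_numeric_ratio` of tokens are numeric."""
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            # Preserve blank lines so paragraph breaks survive.
            kept.append(line)
            continue
        numeric = sum(1 for t in tokens if NUMERIC_TOKEN.match(t))
        if numeric / len(tokens) <= max_numeric_ratio:
            kept.append(line)
    return "\n".join(kept)


# Made-up sample standing in for text already extracted from a PDF.
sample = (
    "Revenue grew in 2019 compared with the previous year.\n"
    "Q1 1200 1350 1500\n"
    "Q2 1100 1280 1490\n"
    "The figures above are in thousands."
)
cleaned = drop_table_like_lines(sample)
print(cleaned)
```

A filter like this only guesses from the text itself, so it will miss tables made of words and may drop legitimate numeric sentences, which is why a library that understands the page layout would be preferable.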
I'd like to extract the text from a PDF document, excluding any tables and special formatting. Is there a library out there that does this?