Question

I am trying to load an online PDF with the Python langchain library from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf

This is the code that I'm running locally:

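from langchain.document_loaders import PyPDFLoader
datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
loader = PyPDFLoader(datasheet_path)
pages = loader.load_and_split()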

I am getting the following error:
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
Cell In[4], line 8
      6 datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
      7 loader = PyPDFLoader(datasheet_path)
----> 8 pages = loader.load_and_split()
     11 query = """

File ***\.venv\lib\site-packages\langchain\document_loaders\base.py:36, in BaseLoader.load_and_split(self, text_splitter)
     34 else:
     35     _text_splitter = text_splitter
---> 36 docs = self.load()
     37 return _text_splitter.split_documents(docs)
...
   (...)
    114         for i, page in enumerate(pdf_reader.pages)
    115     ]

PermissionError: [Errno 13] Permission denied: 'C:\Users\****\AppData\Local\Temp\tmpu_59ngam'

Note 1: Running the same code in Google Colab works well.

Note 2: Running the following code in the same notebook works correctly, so I'm not sure that access to the temp folder is the problem:

with open(r"C:\Users\benis\AppData\Local\Temp\test.txt", "w") as h:
    h.write("test")

Note 3: I have tested several different online PDFs and got the same error for all of them.

The code should convert the PDF to text and split it into pages using langchain and pypdf.

Answer 1

You will not succeed with this task using langchain on Windows with their current implementation. You can take a look at the source code here. Consider the following abridged code:

class BasePDFLoader(BaseLoader, ABC):
    def __init__(self, file_path: str):
        ...
        # If the file is a web path, download it to a temporary file, and use that
        if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
            r = requests.get(self.file_path)

            ...
            self.web_path = self.file_path
            self.temp_file = tempfile.NamedTemporaryFile()
            self.temp_file.write(r.content)
            self.file_path = self.temp_file.name
            ...

    def __del__(self) -> None:
        if hasattr(self, "temp_file"):
            self.temp_file.close()

Note that they open the file in the constructor and close it in the destructor. Now let's look at the Python documentation on NamedTemporaryFile (emphasis mine, docs are for Python 3.9):

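This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).

In other words, the loader keeps the temporary file open and then hands its name to the PDF reader, which tries to open it a second time. On Windows that second open fails, which is the PermissionError shown in the question. On Unix (including Google Colab) it succeeds.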
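The platform difference described in the documentation can be checked directly with the standard library. The following is a minimal sketch, not langchain's code; `can_reopen_while_open` is a hypothetical helper name introduced only for this illustration.

```python
import os
import tempfile


def can_reopen_while_open() -> bool:
    """Return True if a NamedTemporaryFile can be opened a second time by
    name while the first handle is still open (hypothetical helper)."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"test")
        tmp.flush()
        try:
            # Second open by name, while `tmp` is still open.
            with open(tmp.name, "rb") as second:
                return second.read() == b"test"
        except PermissionError:
            # This is the failure mode reported on Windows with the
            # default delete=True.
            return False


if __name__ == "__main__":
    print(os.name, can_reopen_while_open())
```

Running this on the affected machine and on Colab should make the difference between the two environments visible.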

Answer 2

Update pdf.py (https://github.com/hwchase17/langchain/blob/5cfa72a130f675c8da5963a11d416f553f692e72/langchain/document_loaders/pdf.py#L65) so that the NamedTemporaryFile is created with delete=False. The file is then no longer removed automatically when it is closed, so it stays on disk until you delete it yourself:

self.temp_file = tempfile.NamedTemporaryFile(delete=False)

Reference: https://stackoverflow.com/questions/3924117/how-to-use-tempfile-namedtemporaryfile-in-python#:~:text=To%20fix%20this%20use%3A%20tf%20%3D%20tempfile.NamedTemporaryFile%20%28delete%3DFalse%29,won%27t%20let%20you%20open%20it%20using%20another%20application.

Alternatively, there is an open PR in langchain that addresses this: https://github.com/hwchase17/langchain/pull/5887/files
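If you would rather not patch the installed library, the same delete=False idea can be applied on your side: download the PDF yourself, write it to a temporary file, close it, and pass the local path to the loader. The sketch below assumes this approach; `temp_pdf_path` is a hypothetical helper name, not part of langchain.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temp_pdf_path(content: bytes) -> Iterator[str]:
    """Write `content` to a temporary .pdf file and yield its path.

    Hypothetical helper: the file is closed before the path is handed out,
    so another reader can open it by name on Windows as well.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(content)
        tmp.close()  # release the handle before anyone reopens the file
        yield tmp.name
    finally:
        tmp.close()  # no-op if the file is already closed
        os.remove(tmp.name)  # delete=False means we clean up ourselves


# Usage sketch (needs `requests` and `langchain`; `url` is your PDF URL):
# import requests
# from langchain.document_loaders import PyPDFLoader
# content = requests.get(url).content
# with temp_pdf_path(content) as path:
#     pages = PyPDFLoader(path).load_and_split()
```

Because the loader only ever sees a local path here, the temporary-file handling inside langchain is not involved at all.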

Answer 3

You need pypdf installed:

pip install pypdf -q

Write a reusable function to load the PDF:

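def load_doc(file):
    from langchain.document_loaders import PyPDFLoader

    # `file` is a local path or a URL pointing to a PDF
    loader = PyPDFLoader(file)
    # load_and_split() loads the PDF and splits it into chunks (printed for inspection only)
    pages = loader.load_and_split()
    print("pages", pages)
    # load() returns one Document per page
    data = loader.load()
    return data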

If you work locally, pass the path of the file as the file argument. If you want to load an online PDF, pass the URL instead:

data = load_doc("https://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf")
# The indices below assume the PDF has at least 21 pages
print(data[1].page_content)
print(data[10].metadata)
print(f"you have {len(data)} pages in your data")
print(f"there are {len(data[20].page_content)} characters in the page")
