English 中文(简体)
Issue with loading online pdf in python notebook using langchain PyPDFLoader
原标题:

I am trying to load with python langchain library an online pdf from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf

This is the code that I m running locally:

loader = PyPDFLoader(datasheet_path)
pages  = loader.load_and_split()
Am getting the following error
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
Cell In[4], line 8
      6 datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
      7 loader = PyPDFLoader(datasheet_path)
----> 8 pages = loader.load_and_split()
     11 query = """

File ***.venvlibsite-packageslangchaindocument_loadersase.py:36, in BaseLoader.load_and_split(self, text_splitter)
     34 else:
     35     _text_splitter = text_splitter
---> 36 docs = self.load()
     37 return _text_splitter.split_documents(docs)
...
   (...)
    114         for i, page in enumerate(pdf_reader.pages)
    115     ]

PermissionError: [Errno 13] Permission denied:  C:\Users\****\AppData\Local\Temp\tmpu_59ngam 

Note1: running the same code in google Colab works well Note2: running the following code in the same notebook is working correctly so I m not sure access to the temp folder is problematic in any manner:

with open( C:\Users\benis\AppData\Local\Temp\test.txt ,  w ) as h:
    h.write("test")

Note3: I have tested several different online pdf. got same error for all.

The code should covert pdf to text and split to pages using Langchain and pyplot

最佳回答

You will not succeed with this task using langchain on windows with their current implementation. You can take a look at the source code here. Consider the following abridged code:

class BasePDFLoader(BaseLoader, ABC):
    def __init__(self, file_path: str):
        ...
        # If the file is a web path, download it to a temporary file, and use that
        if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
            r = requests.get(self.file_path)

            ...
            self.web_path = self.file_path
            self.temp_file = tempfile.NamedTemporaryFile()
            self.temp_file.write(r.content)
            self.file_path = self.temp_file.name
            ...

    def __del__(self) -> None:
        if hasattr(self, "temp_file"):
            self.temp_file.close()

Note that they open the file in the constructor, and close it in the destructor. Now let s look at the python documentation on NamedTemporaryFile (emphasis mine, docs are for python3.9):

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).

问题回答

you need pypdf installed

pip install pypdf -q

write a reusable def to load pdf

def load_doc(file):
    from langchain.document_loaders import PyPDFLoader
    loader=PyPDFLoader(file)
    pages  = loader.load_and_split()
    print("pages",pages)
    data=loader.load()
    return data

if you work locally, you pass the destination of the file as the file arg. but if you want to load online pdf, you pass the url

data=load_doc( https://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf )
print(data[1].page_content)
print(data[10].metadata)
print(f you have {len(data)} pages in your data )
print(f there are {len(data[20].page_content)} characters in the page )




相关问题
Can Django models use MySQL functions?

Is there a way to force Django models to pass a field to a MySQL function every time the model data is read or loaded? To clarify what I mean in SQL, I want the Django model to produce something like ...

An enterprise scheduler for python (like quartz)

I am looking for an enterprise tasks scheduler for python, like quartz is for Java. Requirements: Persistent: if the process restarts or the machine restarts, then all the jobs must stay there and ...

How to remove unique, then duplicate dictionaries in a list?

Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single ...

What is suggested seed value to use with random.seed()?

Simple enough question: I m using python random module to generate random integers. I want to know what is the suggested value to use with the random.seed() function? Currently I am letting this ...

How can I make the PyDev editor selectively ignore errors?

I m using PyDev under Eclipse to write some Jython code. I ve got numerous instances where I need to do something like this: import com.work.project.component.client.Interface.ISubInterface as ...

How do I profile `paster serve` s startup time?

Python s paster serve app.ini is taking longer than I would like to be ready for the first request. I know how to profile requests with middleware, but how do I profile the initialization time? I ...

Pragmatically adding give-aways/freebies to an online store

Our business currently has an online store and recently we ve been offering free specials to our customers. Right now, we simply display the special and give the buyer a notice stating we will add the ...

Converting Dictionary to List? [duplicate]

I m trying to convert a Python dictionary into a Python list, in order to perform some calculations. #My dictionary dict = {} dict[ Capital ]="London" dict[ Food ]="Fish&Chips" dict[ 2012 ]="...

热门标签