Shell script “dvc pull” not working at Streamlit server

In my Streamlit app.py file, I used the code os.system("dvc pull") to load a .csv data file (labeled_projects.csv) from my Google service account (Google Drive), and it has been working well since I deployed it a few months ago. The code itself is loaded from my GitHub account.

But the code suddenly stopped working, and I now get the error message FileNotFoundError: [Errno 2] No such file or directory: /mount/src/mlops/data/labeled_projects.csv.

The Streamlit server provides no error message regarding the execution of os.system("dvc pull").

Replacing os.system("dvc pull") with a .sh file created through the tempfile package and executed through the subprocess package does not help. I get the same FileNotFoundError, with no error message about dvc pull.

Also, running find . -name labeled_projects.csv on the Streamlit server returns no matches, which seems to indicate that the file is never downloaded.

The dvc pull call in the Streamlit app.py file works fine when executed locally.

Thanks for your help!

Answers

Assuming you are using DVC 3.x, it's likely this bug, which is fixed in the latest release (3.14.0): https://github.com/iterative/dvc/issues/9651

Once you have updated DVC to the latest release, you may also need to clear the DVC site_cache_dir. If you run dvc doctor on your server you will see output like:

DVC version: 3.14.0 (pip)
---------------------------------------------
Platform: Python 3.11.4 on macOS-13.5-arm64-arm-64bit
Subprojects:
    ...
Supports:
    ...
Config:
    ...
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/73fbe43150218852364cf8d86e2a305f

To clear the cache, you can simply remove the directory listed in the Repo.site_cache_dir line (the last line in dvc doctor output) like:

rm -r /Library/Caches/dvc/repo/...

(On Linux, the path will probably be under /var/tmp/dvc/...)

If you cannot determine the specific repo/... path for your app instance on the server, it is also safe to remove the entire /var/tmp/dvc/... directory. This clears the cache for all DVC repos located on the server.
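On a hosted server where you cannot easily open a shell, the same cleanup could be done from Python. The following is only a minimal sketch: it assumes DVC is installed as a Python module so that python -m dvc doctor works, and that the output contains a Repo.site_cache_dir line as in the listing above. The helper names find_site_cache_dir and clear_site_cache_dir are hypothetical, not part of DVC.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Optional


def find_site_cache_dir(doctor_output: str) -> Optional[Path]:
    """Return the path on the 'Repo.site_cache_dir:' line, or None if absent."""
    prefix = "Repo.site_cache_dir:"
    for line in doctor_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            return Path(stripped[len(prefix):].strip())
    return None


def clear_site_cache_dir() -> Optional[Path]:
    """Run `dvc doctor` and delete the site cache directory it reports.

    Assumption: DVC is importable by the interpreter running this script.
    Returns the reported path, or None if no such line was found.
    """
    result = subprocess.run(
        [sys.executable, "-m", "dvc", "doctor"],
        capture_output=True,
        text=True,
    )
    cache_dir = find_site_cache_dir(result.stdout)
    if cache_dir is not None and cache_dir.is_dir():
        shutil.rmtree(cache_dir)
    return cache_dir
```

Calling clear_site_cache_dir() once before dvc pull should be enough; the parsing is split into its own function so it can be checked against pasted dvc doctor output without touching the file system.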

Thanks, @pmrowla and @ruslan-kuprieiev for the feedback.

First, I updated DVC to version 3.14.0. I then found that the actual issue in my case was different: a plain shell call to dvc pull from the Streamlit app.py file cannot reach the correct dvc executable, that is, the one installed on the Streamlit server from the requirements.txt file.
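To see this kind of mismatch for yourself, you could compare what a bare dvc command resolves to with the interpreter that runs the app. This is a rough diagnostic sketch, not something from the original answers: describe_dvc_lookup is a hypothetical helper, and the idea that the environment's scripts sit next to the interpreter is an assumption that holds for typical virtual environments.

```python
import shutil
import sys
from pathlib import Path


def describe_dvc_lookup() -> dict:
    """Collect where a bare `dvc` command resolves versus the app's interpreter."""
    return {
        # Executable found for `dvc` on the current PATH, or None if there is none.
        "dvc_on_path": shutil.which("dvc"),
        # Interpreter running this code.
        "python": sys.executable,
        # Directory containing that interpreter (assumed location of console scripts).
        "scripts_dir": str(Path(sys.executable).parent),
    }


if __name__ == "__main__":
    for key, value in describe_dvc_lookup().items():
        print(f"{key}: {value}")
```

If dvc_on_path is None, or points outside scripts_dir, a shell call such as os.system("dvc pull") would not be using the DVC that was installed for the app.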

In the following, I just quote the answer I got from ChatGPT 4 for your information. The code in the quote resolved my problem and the .csv data file from my Google Drive was successfully loaded.

(start quote) "To run dvc pull (or any other command that requires a Python package installed in the virtual environment) within a Streamlit app deployed on Streamlit sharing, you'll need to take several steps.

  1. Install the Required Packages: Ensure you have dvc listed in your requirements.txt file in your GitHub repository. When you deploy the app on Streamlit sharing, it will automatically install the packages listed in requirements.txt.

  2. Use Python's sys.executable: You can use Python's sys.executable to get the path to the Python interpreter. This will help ensure you're calling the right Python environment where dvc is installed.

  3. Run the Command from Python: Use subprocess to run the dvc pull command from within your Streamlit app.

Here's an example:

import sys
import subprocess
import streamlit as st

def pull_data_with_dvc():
    cmd = [sys.executable, "-m", "dvc", "pull"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        st.write("Data pulled successfully!")
        st.write(result.stdout)
    else:
        st.write("Error pulling data!")
        st.write(result.stderr)

# Use this function somewhere in your Streamlit app.
pull_data_with_dvc()

Remember:

  • Ensure you've set up your DVC remote correctly.
  • If DVC requires authentication, you'll need to provide the necessary credentials, which might be more involved, especially if you want to keep credentials secure.
  • This method assumes that DVC is installed as a Python module (as opposed to a standalone system binary).

Lastly, note that Streamlit sharing has some limitations in terms of storage and resources. If you're pulling a large amount of data, you might hit those limits. Always review Streamlit's documentation and limitations for the latest details." (end quote)
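Since the original symptom was a FileNotFoundError raised long after a silent dvc pull failure, it may also help to fail right at the pull step. The sketch below is an illustration under assumptions, not part of the quoted answer: pull_and_verify is a hypothetical helper, and its default command is the same sys.executable -m dvc pull invocation used in the quoted code.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def pull_and_verify(expected_file: str, cmd: Optional[Sequence[str]] = None) -> Path:
    """Run the pull command and confirm the expected file exists afterwards."""
    if cmd is None:
        # Assumption: DVC is installed as a module for this interpreter.
        cmd = [sys.executable, "-m", "dvc", "pull"]
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the command's stderr instead of failing later on a missing file.
        raise RuntimeError(
            f"pull command failed ({result.returncode}): {result.stderr}"
        )
    path = Path(expected_file)
    if not path.is_file():
        raise FileNotFoundError(
            f"pull command succeeded but {path} is still missing"
        )
    return path
```

In the app from the question, this would be called as pull_and_verify("data/labeled_projects.csv") before the CSV is read, with the path adjusted to wherever the app expects the file.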