I'm running into a problem when trying to run an NVIDIA GPU-enabled Docker container on my system. Although the NVIDIA drivers and GPUs are detected correctly by nvidia-smi, attempting to start a container with the command docker run --rm --gpus all ubuntu:18.04 nvidia-smi fails with the following error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as legacy nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
Here's the output of nvidia-smi, showing that the NVIDIA drivers and GPUs are correctly detected and operational:
$ nvidia-smi
Thu Feb 22 02:39:45 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:18:00.0 Off | 0 |
| 30% 37C P8 14W / 230W | 11671MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:86:00.0 Off | 0 |
| 55% 80C P2 211W / 230W | 13119MiB / 23028MiB | 79% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
To troubleshoot, I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, are detected. However, the Docker error persists, which points to a problem locating libnvidia-ml.so.1.
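For reference, the diagnostic command is shown below; the second line is an additional check one can run (not part of my original steps) to see whether the dynamic linker on the host can resolve libnvidia-ml.so.1 at all:
$ sudo nvidia-container-cli -k -d /dev/tty info
$ ldconfig -p | grep libnvidia-ml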
So far, I have tried:
- Reinstalling the NVIDIA drivers and the CUDA Toolkit.
- Reinstalling the NVIDIA Container Toolkit.
- Ensuring Docker and the NVIDIA Container Toolkit are correctly configured.
- Setting LD_LIBRARY_PATH to include the path to the NVIDIA libraries.

Despite these efforts, the problem remains unresolved. I'm running a Linux system with NVIDIA driver version 525.85.12.
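In case the details matter, the Container Toolkit reinstall and the Docker reconfiguration were done roughly like this (the package command assumes an apt-based distro; adjust for yours):
$ sudo apt-get install --reinstall nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker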
Has anyone run into a similar issue, or can anyone offer insight into what might be causing this error and how to resolve it? Any suggestions or guidance would be greatly appreciated.
What I Tried:
• Running a Docker container with NVIDIA GPU support: I attempted to start a GPU-enabled Docker container with the command
docker run --rm --gpus all ubuntu:18.04 nvidia-smi
• Checking NVIDIA driver and GPU detection: I used nvidia-smi to confirm that the NVIDIA drivers and GPUs are detected and operational on the host.
• Diagnosing with the NVIDIA Container Toolkit: I ran nvidia-container-cli -k -d /dev/tty info, which confirmed that the NVIDIA libraries, including libnvidia-ml.so.525.85.12, are detected on the system.
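As an extra sanity check (not something the toolkit does automatically), listing the driver library directly can show whether the versioned file exists but the .so.1 symlink is missing; the path below is the usual Debian/Ubuntu location and is only an assumption:
$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*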
Attempted fixes:
- Reinstalling NVIDIA Container Toolkit to ensure proper integration with Docker.
- Setting the LD_LIBRARY_PATH to include the path to NVIDIA libraries, attempting to resolve any issues related to the library path.
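The LD_LIBRARY_PATH attempt was along these lines; the directory is an assumption based on where the driver libraries typically land and would need adjusting for other layouts:
$ export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
$ docker run --rm --gpus all ubuntu:18.04 nvidia-smi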
What I Was Expecting:
• Successful container startup: I expected the Docker container to start successfully with NVIDIA GPU support, allowing me to use GPU resources inside the container.
• Resolution of the library detection issue: I expected the steps above to fix whatever was preventing libnvidia-ml.so.1 from being located, so that Docker and the NVIDIA Container Toolkit could access the required NVIDIA libraries.
• Working GPU support in Docker: Ultimately, I expected these troubleshooting steps to give me seamless GPU support inside Docker containers, so that GPU-accelerated applications would run as intended.
The gap between the expected and actual results is the persistent error saying that libnvidia-ml.so.1 cannot be found, even though the NVIDIA drivers and libraries have been confirmed to be present. This suggests an underlying problem with how the Docker/NVIDIA integration is set up, with the library paths, or possibly with the specific versions of the tools and drivers involved.