[pytorch] Ubuntu + Anaconda + CUDA + pytorch setup guide | Fixing the nvidia-smi error "NVIDIA-SMI has failed"
Overall reference for the Ubuntu + Anaconda + CUDA + pytorch setup: https://aitechtogether.com/article/14143.html
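Download and install Anaconda
Download the Anaconda .sh installer on your local machine, then upload it to the server over SFTP.
Start the installation with bash: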
bash Anaconda3-2023.03-Linux-x86_64.sh
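Press Enter through the prompts and type yes when asked.
Run conda -V to check the installed conda version. If you see -sh: conda: command not found, conda has not been added to the system PATH. Add it with the following command: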
export PATH=/home/yourName/anaconda3/bin/:$PATH
Then run conda -V again; it should now print the conda version.
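An export typed at the prompt only lasts for the current shell session. The sketch below shows one way to make it persistent by appending the line to a startup file exactly once. The helper name append_line_once is hypothetical, and it assumes that bash is the login shell and reads ~/.bashrc.

```shell
# Append a line to a file only if that exact line is not already present.
# append_line_once is an illustrative helper, not part of conda.
append_line_once() {
    local line="$1" file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (assumes Anaconda was installed to /home/yourName/anaconda3):
# append_line_once 'export PATH=/home/yourName/anaconda3/bin/:$PATH' "$HOME/.bashrc"
```

Because the helper checks for the exact line first, running it again does not add a duplicate entry.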
Create a new environment with conda
First, add mirror channels hosted in mainland China:
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/r
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/pro
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/msys2
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/mro
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/free
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/main
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
Run conda config --show-sources to list all configured sources.
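The repeated conda config calls above can also be driven from a list. This is a minimal sketch: the function name add_channels and the CONDA_CMD variable are hypothetical and exist only to allow a dry run. Note that --add places each channel at the front of the list, so the channel added last ends up with the highest priority.

```shell
# Add each channel URL read from stdin, one URL per line.
# CONDA_CMD is a hypothetical override for dry runs, e.g. CONDA_CMD=echo.
add_channels() {
    local url
    while IFS= read -r url; do
        # Skip blank lines
        [ -n "$url" ] || continue
        "${CONDA_CMD:-conda}" config --add channels "$url"
    done
}

# Dry run: print the commands instead of executing them
# printf '%s\n' http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ | CONDA_CMD=echo add_channels
```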
Create the new environment
conda create -n envName python=3.8
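Once the download finishes, run source activate envName to enter the new environment (on recent conda versions, conda activate envName is the equivalent).
Run conda list to see the packages installed in it.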
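Install pytorch
First check the CUDA version on your server. When choosing a pytorch build, try to pick one that targets a CUDA version lower than the one on your server; otherwise you may run into compatibility problems.
Run nvidia-smi to check the CUDA version. The "CUDA Version" field in its header is the highest CUDA version the installed driver supports.
My CUDA version is 12.0, so I chose the build for CUDA 11.7.
If nvidia-smi prints something abnormal, follow the steps in the next section to fix it.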
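The version check above can be sketched as two small helpers: one pulls the "CUDA Version" field out of the nvidia-smi header, the other compares two version strings. Both function names are hypothetical, and version_le assumes a sort that supports -V (GNU coreutils, as shipped with Ubuntu).

```shell
# Print the "CUDA Version" field from nvidia-smi output read on stdin.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is less than or equal to version $2.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example: check whether a CUDA 11.7 build fits under what the driver reports
# driver_cuda=$(nvidia-smi | cuda_version_from_smi)
# version_le 11.7 "$driver_cuda" && echo "CUDA 11.7 build is within the driver's range"
```

With the header shown later in this article (CUDA Version: 12.0), the check for 11.7 succeeds, which matches the choice made here.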
nvidia-smi error
Running nvidia-smi fails with the following error:
(pttest) $ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The NVIDIA driver has to be installed. Add the graphics drivers PPA and refresh the package index:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
Next, find the matching driver version.
Install the nvidia-cuda-toolkit package:
sudo apt-get install nvidia-cuda-toolkit
Check which graphics drivers the system recommends and note the entry marked recommended:
sudo ubuntu-drivers devices
(pttest) $ sudo ubuntu-drivers devices
== /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 ==
modalias : pci:v000010DEd00001F07sv00001043sd0000866Dbc03sc00i00
vendor : NVIDIA Corporation
model : TU106 [GeForce RTX 2070 Rev. A]
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-515-open - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-525-open - distro non-free recommended
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-515 - distro non-free
driver : nvidia-driver-525-server - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-525 - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
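The recommended driver is nvidia-driver-525-open.
In practice, however, nvidia-driver-525-open did not work on this machine; nvidia-driver-525 was needed instead.
With nvidia-driver-525-open installed, nvidia-smi ends up reporting No devices were found: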
$ nvidia-smi
No devices were found
So install nvidia-driver-525 here:
sudo apt-get install nvidia-driver-525
sudo reboot
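The driver selection above can be sketched as a small parser over the ubuntu-drivers devices listing. Both function names are hypothetical. Stripping the -open suffix reflects only what worked on the RTX 2070 in this article and is an assumption for any other machine.

```shell
# Print the package marked "recommended" in `ubuntu-drivers devices` output (stdin).
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Map an "-open" package name to its non-open counterpart; other names pass through.
proprietary_variant() {
    printf '%s\n' "${1%-open}"
}

# Example:
# drv=$(sudo ubuntu-drivers devices | recommended_driver)
# sudo apt-get install "$(proprietary_variant "$drv")"
```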
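Use nvcc -V to check the CUDA toolkit (nvcc is the CUDA compiler, so this reports the toolkit version rather than the kernel driver version):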
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
The CUDA compiler is already present (release 11.5).
Check the version of the installed driver:
ls /usr/src | grep nvidia
$ ls /usr/src | grep nvidia
nvidia-525.105.17
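Rather than copying the version by hand, it can be extracted from the directory listing. A minimal sketch, with a hypothetical function name, assuming the driver sources appear under /usr/src as nvidia-<version> like in the output above:

```shell
# Read directory names on stdin and print the highest nvidia-<version> found.
nvidia_src_version() {
    sed -n 's/^nvidia-\([0-9][0-9.]*\)$/\1/p' | sort -V | tail -n 1
}

# Example:
# ver=$(ls /usr/src | nvidia_src_version)
# echo "driver source version: $ver"
```

The resulting value is what the -v option of dkms expects in the commands that follow.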
Run the following commands in order:
sudo apt-get install dkms
sudo dkms install -m nvidia -v 525.105.17
sudo dkms install -m nvidia -v 525.105.17 fails with the following error:
$ sudo dkms install -m nvidia -v 525.105.17
Error! Your kernel headers for kernel 5.15.0-69-generic cannot be found.
Please install the linux-headers-5.15.0-69-generic package or use the --kernelsourcedir option to tell DKMS where it's located.
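This error means the kernel headers matching your running kernel are missing.
DKMS builds the nvidia module against those headers, so they must be installed before the driver can be set up.
Run the following commands to install them: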
sudo apt-get update
sudo apt-get install linux-headers-$(uname -r)
Here $(uname -r) expands to the version of the currently running kernel.
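Whether the headers are in place can be checked before calling dkms again. This sketch assumes the usual layout in which /lib/modules/<release>/build points at the installed headers. The function name and the optional root argument are hypothetical; the root argument is there only so the check can be tried against a scratch directory.

```shell
# Succeed if a build directory exists for kernel release $1.
# $2 is an optional filesystem root (defaults to the real root).
have_kernel_headers() {
    local release="$1" root="${2:-}"
    [ -d "$root/lib/modules/$release/build" ]
}

# Example:
# have_kernel_headers "$(uname -r)" || sudo apt-get install "linux-headers-$(uname -r)"
```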
Once the headers are installed, rerun sudo dkms install -m nvidia -v 525.105.17, then reboot.
$ sudo dkms install -m nvidia -v 525.105.17
Module nvidia/525.105.17 already installed on kernel 5.15.0-69-generic (x86_64).
sudo reboot
Run nvidia-smi again:
$ nvidia-smi
Fri Apr 14 20:21:29 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:3B:00.0 Off | N/A |
| 20% 47C P8 23W / 185W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvidia-smi now displays normally.
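The three nvidia-smi outcomes seen in this article can be told apart by their text. A minimal sketch with a hypothetical function name and labels; it matches only the messages quoted above and reports anything else as unknown.

```shell
# Classify nvidia-smi output read on stdin into one of four labels.
classify_smi_output() {
    local out
    out="$(cat)"
    case "$out" in
        *"couldn't communicate with the NVIDIA driver"*) echo driver-not-running ;;
        *"No devices were found"*) echo no-devices ;;
        *"Driver Version:"*) echo ok ;;
        *) echo unknown ;;
    esac
}

# Example:
# nvidia-smi 2>&1 | classify_smi_output
```

In this walkthrough, driver-not-running was resolved by installing the driver, and no-devices by switching from nvidia-driver-525-open to nvidia-driver-525.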