Check for zombie processes
ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]'
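A zombie cannot be killed directly (it is already dead, just not reaped), so the practical step is to signal its parent. A small sketch building on the ps command above; the helper name list_zombies is my own:

```shell
# List zombie processes together with the parent that should reap them.
# Column order (stat, ppid, pid, cmd) matches the ps command above.
list_zombies() {
  ps -A -o stat,ppid,pid,cmd | awk '$1 ~ /^[Zz]/ {print "pid="$3, "ppid="$2, "cmd="$4}'
}

list_zombies
# To make the parent reap its dead child, try SIGCHLD first:
#   kill -s SIGCHLD <ppid>
# If the parent ignores it, killing the parent re-parents the zombie to
# init/systemd, which then reaps it:
#   kill <ppid>
```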
Continuously refresh nvidia-smi
watch -n 1 -d nvidia-smi
The -d flag highlights differences between successive refreshes.
Check the Ubuntu version
cat /etc/issue
Uninstall CUDA (did not succeed)
https://www.jianshu.com/p/6b0e2c617591
sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
sudo apt-get remove cuda
sudo apt autoremove
sudo apt-get remove cuda*
sudo rm -rf /usr/local/cuda*
Install CUDA (did not succeed)
- Find the installer for your platform on the official site:
https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
sh cuda_11.2.0_460.27.04_linux.run
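If the interactive runfile installer is what fails, the runfile also has a non-interactive mode. A sketch, assuming the standard runfile flags --silent and --toolkit; check `sh cuda_11.2.0_460.27.04_linux.run --help` for the exact set supported by this version:

```shell
# Non-interactive install of the toolkit only, skipping the bundled driver
# (the driver is the usual source of conflicts with an existing install).
sudo sh cuda_11.2.0_460.27.04_linux.run --silent --toolkit
```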
Pull an image from NVIDIA, create a container, and enter it
- NVIDIA's official image catalog (includes Kaldi):
  https://docs.nvidia.com/deeplearning/frameworks/kaldi-release-notes/rel_20-03.html#rel_20-03
- This time version 21.02 is used, which contains the following:
Ubuntu 20.04 including Python 3.8
NVIDIA CUDA 11.2.0 including cuBLAS 11.3.1
NVIDIA cuDNN 8.1.0
NVIDIA NCCL 2.8.4 (optimized for NVLink™)
MLNX_OFED 5.1
OpenMPI 4.0.5
Nsight Compute 2020.3.0.18
Nsight Systems 2020.4.3.7
TensorRT 7.2.2
- Pull command: docker pull nvcr.io/nvidia/kaldi:21.02-py3
  After the pull finishes, the image shows up in docker images.
- Create the container with the following command:
  NV_GPU=0,1 nvidia-docker run -itd -P \
      --name wyr_kaldi_cuda11.2 \
      --mount type=bind,source=/home/work/wangyaru05,target=/home/work/wangyaru05 \
      -v /opt/wfs1/aivoice:/opt/wfs1/aivoice \
      --net host \
      nvcr.io/nvidia/kaldi:21.02-py3 bash
- Start the container:
  docker container start wyr_kaldi_cuda11.2
- Enter the container:
  nvidia-docker exec -it wyr_kaldi_cuda11.2 bash
- Shortcut for entering the container, added via vim ~/.bashrc:
  alias wyr_docker_connect='nvidia-docker exec -it wyr_kaldi_cuda11.2 bash'
Linux cannot input Chinese characters
Check available locales: locale -a
Install the language pack: apt-get install -y language-pack-zh-hans
Add to ~/.bashrc: export LC_CTYPE='zh_CN.UTF-8'
A simpler method:
Open the file
vim /etc/bash.bashrc
and add at the end:
export LANG="C.UTF-8"
export LANGUAGE="C.UTF-8"
export LC_ALL="C.UTF-8"
Apply the change:
source /etc/bash.bashrc
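Before exporting, it can help to confirm that the target locale actually exists on the system; a small sketch (note that `locale -a` often reports it as "C.utf8" rather than "C.UTF-8"):

```shell
# Verify that C.UTF-8 is known to the system before pointing
# LANG/LC_ALL at it; the spelling in `locale -a` output varies.
target="C.UTF-8"
if locale -a 2>/dev/null | grep -qiE '^C\.UTF-?8$'; then
  echo "$target is available"
else
  echo "$target missing; install a language pack first (e.g. language-pack-zh-hans)"
fi
```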
Check the machine's hardware configuration
- Number of physical CPUs:
  cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
- Number of logical CPUs:
  cat /proc/cpuinfo | grep "processor" | wc -l
- Number of cores per CPU:
  cat /proc/cpuinfo | grep "cores" | uniq
- CPU clock frequency:
  cat /proc/cpuinfo | grep MHz | uniq
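The four queries above can be wrapped into one helper; a sketch assuming a Linux /proc/cpuinfo (the function name cpu_summary is made up here):

```shell
# Print the same four facts in one call; all values come from /proc/cpuinfo.
cpu_summary() {
  echo "physical CPUs : $(grep 'physical id' /proc/cpuinfo | sort -u | wc -l)"
  echo "logical CPUs  : $(grep -c '^processor' /proc/cpuinfo)"
  echo "cores per CPU : $(grep 'cpu cores' /proc/cpuinfo | sort -u | head -1 | cut -d: -f2)"
  echo "CPU MHz       : $(grep 'cpu MHz' /proc/cpuinfo | sort -u | head -1 | cut -d: -f2)"
}

cpu_summary
```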
The nvidia-smi GPU ID order does not match PyTorch's
https://blog.csdn.net/sdnuwjw/article/details/111615052
- nvidia-smi -L lists the GPUs in nvidia-smi's order:
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-394b2f98-bdb5-f8bb-c773-f89fe6743b56)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-c35d0ab3-0eb7-8a44-2cc6-589370dcef70)
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-c6d27a3b-d4d6-91a0-67b2-aca6a5766e49)
GPU 3: NVIDIA A30 (UUID: GPU-0172e91e-ac9c-e234-2e00-402510d431d0)
GPU 4: NVIDIA A30 (UUID: GPU-aa50e590-a124-715d-f78e-4cf4a01b5fc4)
GPU 5: NVIDIA A100-PCIE-40GB (UUID: GPU-667d62ab-3140-9c22-2737-32ef349195e9)
GPU 6: NVIDIA A100-PCIE-40GB (UUID: GPU-132b588c-fe8c-3a66-c3ec-857ed2b7da10)
GPU 7: NVIDIA A100-PCIE-40GB (UUID: GPU-9b762f3b-f945-79d3-81b3-5d2039a6cab0)
- torch.cuda.get_device_name(3) returns the name of the GPU PyTorch sees as device 3:
  torch.cuda.get_device_name(3)
  'NVIDIA A100-PCIE-40GB'
Fix for the mismatch:
Add to ~/.bashrc: export CUDA_DEVICE_ORDER="PCI_BUS_ID"
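For a single shell session the variable can also be exported directly before launching the job; the effect on the environment can be checked even without a GPU. (By default CUDA orders devices FASTEST_FIRST, which is why a mixed A100/A30 machine gets reshuffled relative to nvidia-smi's PCI bus order.)

```shell
# Force CUDA (and therefore PyTorch) to number devices by PCI bus order,
# matching nvidia-smi, instead of the default FASTEST_FIRST ordering.
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
echo "$CUDA_DEVICE_ORDER"
```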
nvidia-smi is very slow (enable persistence mode):
nvidia-smi -pm 1
Steps after a Linux reboot
(1)
If docker cannot be entered due to missing permissions, run the following as root:
chmod a+rw /var/run/docker.sock
(2)
Restart the wfs client:
cd /opt/wfs1/wfs1_client
nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice -s aivoice.key > log/wfs-client-nohup.log 2>&1 &
(3)
Re-enable GPU persistence mode:
nvidia-smi -pm 1
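The three post-reboot steps can be collected into one guarded script; a sketch in which each step is skipped when its prerequisite is absent (the paths and the wfs-client binary name are taken verbatim from the notes above):

```shell
# Re-run the three recovery steps after a reboot; each step is a no-op
# when the relevant file or command does not exist on this machine.
post_reboot() {
  # (1) restore docker socket permissions
  if [ -S /var/run/docker.sock ]; then
    chmod a+rw /var/run/docker.sock || true
  fi
  # (2) restart the wfs client
  if [ -x /opt/wfs1/wfs1_client/wfs-client-20201001 ]; then
    cd /opt/wfs1/wfs1_client && \
      nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice \
        -s aivoice.key > log/wfs-client-nohup.log 2>&1 &
  fi
  # (3) re-enable GPU persistence mode
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1 || true
  fi
}

post_reboot
```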
The 101 server is extremely slow on nvidia-smi, and the output shows ERR for GPU 0
- Stop all programs running on the GPUs; the ERR disappears.
- Set the GPUs' persistence mode (following a tutorial):
  /usr/bin/nvidia-persistenced --verbose
- Cap the maximum power limit so it does not run too high:
  sudo nvidia-smi -pl 200 -i 2
- Reset the GPU:
  nvidia-smi -r
Running the first command above fails with:
error while loading shared libraries: libtirpc.so.1
Workaround: it would probably have to be installed manually; left unresolved.