CentOS 7: NVIDIA driver installation and basic testing

Published: 2026/7/6 4:14:09
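CentOS 7 NVIDIA driver installation

Category                Information
Server model            Rack Mount Chassis NF5280M6
CPU                     Intel® Xeon® Silver 4310 CPU @ 2.10GHz * 2
OS version              CentOS 7
Kernel version          3.10.0-1160.el7.x86_64
GPU model               NVIDIA A100 40G * 4
NVIDIA driver version   525.85.05
CUDA version            12.0.0
Docker version          20.10.9

I. Base system (skip if already installed)

1. Install base software

    yum update
    yum -y install openssh-server openssh-client apt-utils freeipmi ipmitool sshpass \
        ethtool zip unzip nano less git netplan.io iputils-ping mtr ipvsadm \
        smartmontools python3-pip socat conntrack libvirt-clients libnuma-dev \
        ctorrent nvme-cli gcc-12 g++-12 vim wget apt git unzip zip ntp ntpdate \
        lrzsz lftp tree bash-completion elinks dos2unix tmux jq
    yum -y install nmap net-tools mtr traceroute tcptraceroute aptitude htop iftop \
        hping3 fping nethogs sshuttle tcpdump figlet stress iperf iperf3 dnsutils \
        curl linux-tools-generic linux-cloud-tools-generic
    yum groupinstall -y "Development Tools"
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
    yum install git-lfs
    git lfs install

Note: several names in the lists above (for example apt-utils, netplan.io, iputils-ping, libnuma-dev, gcc-12, g++-12, aptitude, dnsutils, linux-tools-generic) are Debian/Ubuntu package names. yum on CentOS 7 will likely report them as unavailable, so drop them or substitute the CentOS equivalents. The git-lfs repository script is the RPM variant (script.rpm.sh), because this is a yum-based system.

2. Raise the file descriptor limits

    echo "ulimit -SHn 655350" >> /etc/profile
    echo "fs.file-max = 655350" >> /etc/sysctl.conf
    echo "root soft nofile 655350" >> /etc/security/limits.conf
    echo "root hard nofile 655350" >> /etc/security/limits.conf
    echo "* soft nofile 655350" >> /etc/security/limits.conf
    echo "* hard nofile 655350" >> /etc/security/limits.conf
    source /etc/profile

Tune shell history

    cat >> /etc/profile << 'EOF'
    export HISTTIMEFORMAT="%Y-%m-%d %H:%M:%S `whoami` "
    export HISTFILESIZE=50000
    export HISTSIZE=50000
    EOF
    source /etc/profile

5. Tune kernel parameters

Back up the current file, then open it for editing:

    cp /etc/sysctl.conf /etc/sysctl.conf.bak
    vi /etc/sysctl.conf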

Settings to add to /etc/sysctl.conf:

    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_abort_on_overflow = 1
    net.ipv4.tcp_max_tw_buckets = 6000
    net.ipv4.tcp_sack = 1
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.tcp_rmem = 4096 87380 4194304
    net.ipv4.tcp_wmem = 4096 66384 4194304
    net.ipv4.tcp_mem = 94500000 915000000 927000000
    net.core.optmem_max = 81920
    net.core.wmem_default = 8388608
    net.core.wmem_max = 16777216
    net.core.rmem_default = 8388608
    net.core.rmem_max = 16777216
    net.ipv4.tcp_max_syn_backlog = 1020000
    net.core.netdev_max_backlog = 862144
    net.core.somaxconn = 262144
    net.ipv4.tcp_max_orphans = 327680
    net.ipv4.tcp_timestamps = 0
    net.ipv4.tcp_synack_retries = 1
    net.ipv4.tcp_syn_retries = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 15
    net.ipv4.tcp_keepalive_time = 30
    net.ipv4.ip_local_port_range = 1024 65535
    net.netfilter.nf_conntrack_tcp_timeout_established = 180
    net.netfilter.nf_conntrack_max = 1048576
    net.nf_conntrack_max = 1048576
    fs.file-max = 655350

Load the conntrack module first (the nf_conntrack keys only exist once it is loaded), then apply the settings:

    modprobe nf_conntrack
    sysctl -p /etc/sysctl.conf
    sysctl -w net.ipv4.route.flush=1

II. GPU driver, CUDA, and related deployment

Manually create the configuration that disables nouveau

    bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
    echo options nouveau modeset=0 | tee -a /etc/modprobe.d/nouveau-kms.conf
    # Back up /boot
    cp -r /boot/ /root/
    dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
    # Reboot, then verify that nouveau is disabled
    reboot
    lsmod | grep nouveau

After the reboot, open a terminal and run the lsmod command above. If it prints nothing, nouveau was disabled correctly.

Install the NVIDIA driver

The recommended driver version is listed at https://download.nvidia.com/XFree86/Linux-x86_64 (you do not have to pick the recommended version).

    # Import the ELRepo public key
    sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    # Install the ELRepo repository
    sudo yum install -y https://www.elrepo.org/elrepo-release-7.0-4.el7.elrepo.noarch.rpm
    sudo yum makecache
    lspci | grep -i nvidia

Install the kernel development packages that match the running kernel, otherwise the driver build can fail:

    # Install yum-config-manager so the older CentOS 7 kernel packages can be found
    yum install -y yum-utils
    # Enable the vault repository
    yum-config-manager --enable vault
    yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    wget https://download.nvidia.com/XFree86/Linux-x86_64/525.85.05/NVIDIA-Linux-x86_64-525.85.05.run
    chmod +x NVIDIA-Linux-x86_64-525.85.05.run
    bash NVIDIA-Linux-x86_64-525.85.05.run --no-opengl-files --ui=none --no-questions \
        --accept-license

After the installation finishes, run nvidia-smi to check the result:

    [root@gnode196 ~]# nvidia-smi
    Tue Jan 27 16:48:41 2026
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:4B:00.0 Off |                    0 |
    | N/A   32C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-PCI...  Off  | 00000000:65:00.0 Off |                    0 |
    | N/A   33C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
    | N/A   31C    P0    38W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-PCI...  Off  | 00000000:E3:00.0 Off |                    0 |
    | N/A   32C    P0    39W / 250W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

Install CUDA

The nvidia-smi output above shows that this driver supports CUDA up to version 12.0. Go to https://developer.nvidia.com/cuda-toolkit-archive and download CUDA 12.0:

    wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run
    bash cuda_12.0.0_525.60.13_linux.run --toolkit --silent --override

Add the environment variables and verify

Append the CUDA environment variables to /etc/profile, reload it, and check the compiler version:

    cat >> /etc/profile << 'EOF'
    export PATH=/usr/local/cuda-12.0/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
    EOF
    source /etc/profile
    nvcc -V

Install nvidia-docker

    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
        sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    yum install -y nvidia-container-toolkit

Verify the installation

    nvidia-container-cli --version
    nvidia-ctk --version

Configure Docker to use the NVIDIA runtime

    nvidia-ctk runtime configure --runtime=docker
    systemctl restart docker

Pin the kernel

The versionlock subcommand is provided by the yum-plugin-versionlock package.

    yum versionlock add kernel-3.10.0-1160.el7.x86_64
    yum versionlock add kernel-core-3.10.0-1160.el7.x86_64
    yum versionlock add kernel-modules-3.10.0-1160.el7.x86_64
    echo "exclude=kernel*" >> /etc/yum.conf

Enable CPU/GPU performance settings

    # Enable Persistence Mode
    nvidia-smi -pm 1
    # Enable ECC memory mode (the change takes effect after a reboot)
    nvidia-smi -e ENABLED
    # Lock the CPU frequency
    yum install -y kernel-tools
    cpupower idle-set -D 0
    cpupower frequency-set -g performance
    echo "cpupower frequency-set -g performance" >> /etc/rc.local
    chmod +x /etc/rc.d/rc.local
    # GPU tuning: lock the graphics clock to its maximum
    nvidia-smi -lgc 1410,1410
    # Disable PCIe ASPM power saving
    grubby --update-kernel=ALL --args="pcie_aspm=off"

Deploy HPC-X (choose the version to download at the bottom of the https://developer.nvidia.com/networking/hpc-x page)

    wget "http://www.mellanox.com/page/hpcx_eula?mrequest=downloads&mtype=hpc&mver=hpc-x&mname=v2.18.1/hpcx-v2.18.1-gcc-inbox-redhat7-cuda12-x86_64.tbz"
    tar -xf \
        hpcx-v2.18.1-gcc-inbox-redhat7-cuda12-x86_64.tbz -C /opt/
    ln -s /opt/hpcx-v2.18.1-gcc-inbox-redhat7-cuda12-x86_64 /opt/hpcx
    export HPCX_HOME=/opt/hpcx
    . $HPCX_HOME/hpcx-init.sh
    hpcx_load

NCCL / gpu-burn tests

Install nccl (static build)

    mkdir -p /root/nccl/
    cd /root/nccl
    git clone https://github.com/NVIDIA/nccl.git
    cd nccl
    # -j sets the number of parallel build jobs
    make -j24 src.build CUDA_HOME=/usr/local/cuda PATH=$PATH:/usr/local/cuda/bin LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Install nccl-tests (static build)

    mkdir -p /root/nccl/
    cd /root/nccl
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    which mpirun
    # /opt/hpcx/ompi/bin/mpirun, so MPI_HOME=/opt/hpcx/ompi
    cd /root/nccl/nccl-tests
    PATH=$PATH:/usr/local/cuda/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 \
    LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64 \
    make -j30 CUDA_HOME=/usr/local/cuda NCCL_HOME=/root/nccl/nccl/build \
        NCCL_LIBDIR=/root/nccl/nccl/build/lib NCCL_STATIC=1 \
        NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"

Run the NCCL test

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/nccl/nccl/build/lib
    ./build/all_reduce_perf -b 8 -e 35G -f 2 -g 4 -n 50

Test parameters

    -b <size>     start size, e.g. -b 8, -b 1M
    -e <size>     end size, e.g. -e 10G
    -f <factor>   multiplier applied at each step, e.g. -f 2 doubles the size
    -g <count>    number of GPUs to use, e.g. -g 1, -g 4
    -n <count>    number of test iterations, e.g. -n 100 (default 20)

    # 1. Single-GPU test, from 8 bytes to 10 GB, doubling each step
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 1
    # 2. 4-GPU test
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 4
    # 3. Larger data sizes (35 GB, 4 GPUs)
    ./build/all_reduce_perf -b 8 -e 35G -f 2 -g 4
    # 4. More iterations for more stable results
    ./build/all_reduce_perf -b 8 -e 10G -f 2 -g 4 -n 100
    # 5. Quick test over a small size range
    ./build/all_reduce_perf -b 1M -e 1G -f 2 -g 4

gpu-burn

    git clone https://github.com/wilicc/gpu-burn.git

Edit the Makefile

    cd gpu-burn
    vi Makefile

Change the link rule

    gpu_burn: gpu_burn-drv.o compare.ptx
            g++ -o $@ $< -O3 ${LDFLAGS}

to

    gpu_burn: gpu_burn-drv.o compare.ptx
            g++ -o $@ $< -O3 ${LDFLAGS} -static-libgcc -static-libstdc++

Build and test

Build after making the change. Because libgcc and libstdc++ are linked statically, the resulting binary can be copied to other machines and used there directly.

    yum install -y libstdc++-static
    make clean
    make
    ./gpu_burn 3600    # test duration in seconds

Model deployment

Hugging Face download

These commands use apt-get, so they apply to a Debian/Ubuntu environment rather than to the CentOS 7 host itself.

    apt-get -y install git-lfs
    git lfs install
    apt-get install python3 python-is-python3
    python3 -m pip install --upgrade pip==20.3.4 -i https://mirrors.aliyun.com/pypi/simple/
    pip3.12 config set global.index-url https://pypi.org/simple/
    pip3.12 install -U huggingface_hub --break-system-packages

Hugging Face login

    huggingface-cli login    # hf auth login
    # The latest huggingface_hub release (1.2.3) renamed the CLI command from
    # huggingface-cli to hf; the old huggingface-cli command is no longer
    # supported in new versions.

Output of the login:

    ⚠️ Warning: 'huggingface-cli login' is deprecated. Use 'hf auth login' instead.
    (Hugging Face ASCII-art banner)
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
    Enter your token (input will not be visible):
    Add token as git credential? (Y/n) y
    Token is valid (permission: fineGrained).
    The token `deploy` has been saved to /root/.cache/huggingface/stored_tokens
    [root@gnode196 ~]# git config --global credential.helper store
    [root@gnode196 ~]# git config --global credential.helper
    store
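
Returning to the all_reduce_perf size flags from the NCCL test section: before starting a long run it can help to preview which message sizes a given -b / -e / -f combination would cover. The sketch below is plain shell arithmetic and does not call nccl-tests. The function names to_bytes and nccl_sweep_sizes are hypothetical helpers, and the sketch assumes that the sweep multiplies the size by the factor until it exceeds the end size and that the K/M/G suffixes are powers of 1024.

```shell
# Hypothetical helper: convert a size such as 8, 1M or 10G to bytes.
# Assumption: K/M/G suffixes are powers of 1024.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# Hypothetical helper: print every size from <begin> to <end>,
# multiplying by <factor> at each step (mirrors -b, -e and -f).
nccl_sweep_sizes() {
    sweep_size=$(to_bytes "$1")
    sweep_end=$(to_bytes "$2")
    sweep_factor=$3
    if [ "$sweep_factor" -lt 2 ] || [ "$sweep_size" -lt 1 ]; then
        echo "factor must be >= 2 and begin size >= 1" >&2
        return 1
    fi
    while [ "$sweep_size" -le "$sweep_end" ]; do
        echo "$sweep_size"
        sweep_size=$(( sweep_size * sweep_factor ))
    done
}

# Count the message sizes for: -b 8 -e 10G -f 2
nccl_sweep_sizes 8 10G 2 | wc -l
```

Under these assumptions, a doubling sweep from 8 bytes never lands exactly on 10G; its last size is 8589934592 bytes, so the value passed to -e acts as an upper bound rather than a size that is guaranteed to be measured.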