
Fixing nvidia-smi hangs

Check the kernel log:

$ sudo dmesg -T
[Fri Sep 19 16:09:52 2025] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xf9ffe068 flags=0x0020]
[Fri Sep 19 16:35:01 2025] show_signal_msg: 132 callbacks suppressed
[Fri Sep 19 16:35:01 2025] python[288587]: segfault at 7a84da1734c0 ip 00007a84da1734c0 sp 00007ffe936b2388 error 15 likely on CPU 158 (core 18, socket 1)
[Fri Sep 19 16:35:01 2025] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Fri Sep 19 17:09:41 2025] nvme 0000:23:00.0: Using 64-bit DMA addresses
[Fri Sep 19 17:43:18 2025] python[137596]: segfault at 71bf6cc734c0 ip 000071bf6cc734c0 sp 00007ffe87f205c8 error 15 likely on CPU 24 (core 24, socket 0)
[Fri Sep 19 17:43:18 2025] Code: 00 00 a6 87 33 00 00 00 00 00 60 fa d1 a4 c1 71 00 00 10 cc d7 60 bf 71 00 00 70 0e c7 6c bf 71 00 00 80 fd 73 00 00 00 00 00 <09> 00 00 00 00 00 00 00 3b a7 ce 3a 06 5f 10 95 e4 60 77 4d c1 71
[Fri Sep 19 18:11:49 2025] nvme 0000:25:00.0: Using 64-bit DMA addresses
[Fri Sep 19 18:26:52 2025] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Sat Sep 20 13:40:04 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 64 times, consider switching to WQ_UNBOUND
[Sat Sep 20 15:36:09 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 15:36:09 2025] Bluetooth: Core ver 2.22
[Sat Sep 20 15:36:09 2025] NET: Registered PF_BLUETOOTH protocol family
[Sat Sep 20 15:36:09 2025] Bluetooth: HCI device and connection manager initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: HCI socket layer initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: L2CAP socket layer initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: SCO socket layer initialized
[Sat Sep 20 15:36:11 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 15:36:14 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 15:36:16 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 17:00:24 2025] NVRM: failed to allocate page table!
[Sat Sep 20 17:54:13 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 17:54:15 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 21:26:38 2025] workqueue: delayed_vfree_work hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
[Sat Sep 20 21:47:19 2025] Pid 1407577(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 21:47:19 2025] Pid 1407582(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 21:47:19 2025] Pid 1407567(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from pmaAllocatePages(pMemReserveInfo->pPma, pageSize / PMA_CHUNK_SIZE_64K, PMA_CHUNK_SIZE_64K, &allocOptions, &pageBegin) @ pool_alloc.c:266
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || ((status == NV_ERR_NO_MEMORY) && (flags & VASPACE_FLAGS_RETRY_PTE_ALLOC_IN_SYS)) @ pool_alloc.c:601
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from rmMemPoolReserve(pCtxBufPool->pMemPool[i], totalSize[i], 0) @ ctx_buf_pool.c:315
[Sat Sep 20 22:13:36 2025] NVRM: ctxBufPoolReserve: Failed to reserve memory. trimming all pools
[Sat Sep 20 22:13:36 2025] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from ctxBufPoolReserve(pGpu, pKernelChannelGroup->pCtxBufPool, bufInfoList, bufCount) @ kernel_channel_group_api.c:557
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: pLockInfo->state == initialLockState @ rs_server.c:872
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
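
Most of the log above is unrelated noise (Bluetooth, i40e, workqueue messages). The entries that matter for this problem are the NVRM out-of-memory assertions, the AMD-Vi IO_PAGE_FAULT event, and the python segfaults. A minimal sketch for filtering them out of the log follows; `filter_gpu_events` is a hypothetical helper name, and the pattern list is an assumption based only on the messages seen in this log.

```shell
#!/bin/sh
# filter_gpu_events (hypothetical helper): read kernel log text on stdin
# and keep only the lines that look related to the GPU problem.
# The patterns are taken from the log in this post; extend them as needed.
filter_gpu_events() {
    grep -E 'NVRM|IO_PAGE_FAULT|segfault'
}

# Usage:
#   sudo dmesg -T | filter_gpu_events
```

Reading from stdin keeps the helper usable on a saved log file as well, for example `filter_gpu_events < dmesg.txt`.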

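Solution:

  1. Use a large swap space:

    sudo swapoff /swap.img            # disable the swap file
    sudo fallocate -l 512G /swap.img  # allocate space for the swap file
    sudo chmod 600 /swap.img          # root-only access (mkswap warns otherwise)
    sudo mkswap /swap.img             # format the file as swap
    sudo swapon /swap.img             # enable the swap file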
    
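  2. Enable the IOMMU:

    sudoedit /etc/default/grub
    
    GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt pcie_aspm=off"  # IOMMU options for AMD (iommu=pt is passthrough mode)
    
    sudo update-grub
    sudo reboot                       # kernel parameters take effect on the next boot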
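
After both steps (and a reboot for the GRUB change), it may be worth confirming that the settings are actually in effect. The sketch below assumes a Linux system with `/proc/meminfo` and `/proc/cmdline`; `swap_total_gib` and `has_kernel_param` are hypothetical helper names, not existing tools. The standard `swapon --show` command gives the same swap information interactively.

```shell
#!/bin/sh
# swap_total_gib [FILE] (hypothetical helper): print the SwapTotal value
# from a meminfo-format file in GiB, with one decimal place.
# FILE defaults to /proc/meminfo, where SwapTotal is reported in kB.
swap_total_gib() {
    LC_ALL=C awk '/^SwapTotal:/ { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# has_kernel_param PARAM [FILE] (hypothetical helper): succeed if PARAM
# appears as a complete word on the kernel command line.
# FILE defaults to /proc/cmdline. The whole-word match keeps "iommu=on"
# from matching inside "amd_iommu=on".
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage:
#   swap_total_gib                      # about 512.0 after step 1
#   has_kernel_param amd_iommu=on && echo "amd_iommu=on is active"
#   has_kernel_param iommu=pt     && echo "iommu=pt is active"
```

Both helpers accept an optional file argument so they can be tried against a saved copy of the files without touching the live system.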
    