Check the kernel log:
$ sudo dmesg -T
[Fri Sep 19 16:09:52 2025] nvidia 0000:21:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xf9ffe068 flags=0x0020]
[Fri Sep 19 16:35:01 2025] show_signal_msg: 132 callbacks suppressed
[Fri Sep 19 16:35:01 2025] python[288587]: segfault at 7a84da1734c0 ip 00007a84da1734c0 sp 00007ffe936b2388 error 15 likely on CPU 158 (core 18, socket 1)
[Fri Sep 19 16:35:01 2025] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Fri Sep 19 17:09:41 2025] nvme 0000:23:00.0: Using 64-bit DMA addresses
[Fri Sep 19 17:43:18 2025] python[137596]: segfault at 71bf6cc734c0 ip 000071bf6cc734c0 sp 00007ffe87f205c8 error 15 likely on CPU 24 (core 24, socket 0)
[Fri Sep 19 17:43:18 2025] Code: 00 00 a6 87 33 00 00 00 00 00 60 fa d1 a4 c1 71 00 00 10 cc d7 60 bf 71 00 00 70 0e c7 6c bf 71 00 00 80 fd 73 00 00 00 00 00 <09> 00 00 00 00 00 00 00 3b a7 ce 3a 06 5f 10 95 e4 60 77 4d c1 71
[Fri Sep 19 18:11:49 2025] nvme 0000:25:00.0: Using 64-bit DMA addresses
[Fri Sep 19 18:26:52 2025] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[Fri Sep 19 19:15:11 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Sat Sep 20 13:40:04 2025] workqueue: inode_switch_wbs_work_fn hogged CPU for >10000us 64 times, consider switching to WQ_UNBOUND
[Sat Sep 20 15:36:09 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 15:36:09 2025] Bluetooth: Core ver 2.22
[Sat Sep 20 15:36:09 2025] NET: Registered PF_BLUETOOTH protocol family
[Sat Sep 20 15:36:09 2025] Bluetooth: HCI device and connection manager initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: HCI socket layer initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: L2CAP socket layer initialized
[Sat Sep 20 15:36:09 2025] Bluetooth: SCO socket layer initialized
[Sat Sep 20 15:36:11 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 15:36:14 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 15:36:16 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 17:00:24 2025] NVRM: failed to allocate page table!
[Sat Sep 20 17:54:13 2025] i40e 0000:02:00.0 enp2s0f0np0: entered promiscuous mode
[Sat Sep 20 17:54:15 2025] i40e 0000:02:00.0 enp2s0f0np0: left promiscuous mode
[Sat Sep 20 21:26:38 2025] workqueue: delayed_vfree_work hogged CPU for >10000us 256 times, consider switching to WQ_UNBOUND
[Sat Sep 20 21:47:19 2025] Pid 1407577(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 21:47:19 2025] Pid 1407582(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 21:47:19 2025] Pid 1407567(ray::WorkerDict) over core_pipe_limit
[Sat Sep 20 21:47:19 2025] Skipping core dump
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from pmaAllocatePages(pMemReserveInfo->pPma, pageSize / PMA_CHUNK_SIZE_64K, PMA_CHUNK_SIZE_64K, &allocOptions, &pageBegin) @ pool_alloc.c:266
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: (status == NV_OK) || ((status == NV_ERR_NO_MEMORY) && (flags & VASPACE_FLAGS_RETRY_PTE_ALLOC_IN_SYS)) @ pool_alloc.c:601
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from rmMemPoolReserve(pCtxBufPool->pMemPool[i], totalSize[i], 0) @ ctx_buf_pool.c:315
[Sat Sep 20 22:13:36 2025] NVRM: ctxBufPoolReserve: Failed to reserve memory. trimming all pools
[Sat Sep 20 22:13:36 2025] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051) returned from ctxBufPoolReserve(pGpu, pKernelChannelGroup->pCtxBufPool, bufInfoList, bufCount) @ kernel_channel_group_api.c:557
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: pLockInfo->state == initialLockState @ rs_server.c:872
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
[Sat Sep 20 22:13:36 2025] NVRM: nvAssertFailedNoLog: Assertion failed: rmGpuLocksGetOwnedMask() == 0 @ rmapi.c:566
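Most of the output above is unrelated noise (Bluetooth, promiscuous mode, workqueue warnings). As a rough sketch, the GPU- and crash-related lines can be pulled out with a small filter. The function name `filter_gpu_events` is hypothetical, and the pattern list is an assumption based only on the messages seen in this log:

```shell
# Hypothetical helper: keep only the log lines that look relevant to the
# GPU memory failures and process crashes seen above. Reads from stdin.
filter_gpu_events() {
    grep -E 'NVRM|segfault|IO_PAGE_FAULT|core_pipe_limit'
}

# Usage (needs root to read the kernel ring buffer on most systems):
#   sudo dmesg -T | filter_gpu_events
```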
Solutions:
- Use a large swap space:
sudo swapoff /swap.img            # disable the swap file
sudo fallocate -l 512G /swap.img  # allocate space for the swap file
sudo mkswap /swap.img             # format it as swap
sudo swapon /swap.img             # enable the swap file
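To confirm that the resized swap file is actually active, `swapon --show` or `free -h` gives a quick view. For scripting, a minimal sketch that sums the sizes listed in `/proc/swaps` (reported in KiB) might look like this; the function name `swap_total_kib` is hypothetical:

```shell
# Hypothetical helper: print the total size of all active swap areas in KiB.
# Reads /proc/swaps by default; an alternative file can be passed as $1.
swap_total_kib() {
    # Skip the header row, then add up the third column (Size, in KiB).
    awk 'NR > 1 { sum += $3 } END { print sum + 0 }' "${1:-/proc/swaps}"
}

# Usage:
#   swap_total_kib    # should report roughly 512 GiB worth of KiB after the steps above
```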
- Enable the IOMMU:
sudoedit /etc/default/grub
GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt pcie_aspm=off" # IOMMU options for AMD
sudo update-grub
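The new kernel command line only takes effect after a reboot. Afterwards, the running kernel's parameters can be checked in `/proc/cmdline`. The following is a small sketch of such a check; the function name `has_kernel_param` is hypothetical:

```shell
# Hypothetical helper: succeed if the exact parameter $1 appears on the
# kernel command line. Reads /proc/cmdline by default; a file can be passed as $2.
has_kernel_param() {
    # Put one parameter per line, then match the whole line literally.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

# Usage (after rebooting):
#   for p in amd_iommu=on iommu=pt pcie_aspm=off; do
#       has_kernel_param "$p" && echo "$p: set" || echo "$p: missing"
#   done
```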