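This comes in handy when studying CUDA operators.
Use the cuobjdump tool, which is available on both Linux and Windows. It is installed with the CUDA Toolkit, alongside nvcc.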
wget https://download.pytorch.org/libtorch/cu128/libtorch-shared-with-deps-2.9.0%2Bcu128.zip
After downloading, unzip the archive.
cd libtorch/lib
cuobjdump -symbols libtorch_cuda.so > libtorch_cuda.txt
Searching libtorch_cuda.txt for div_floor_kernel_cuda turns up more than 1000 overloaded functions (template instantiations). Demangle them with c++filt to find the function name of the specific type implementation you need. For example: _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implINS0_13BinaryFunctorIN3c108BFloat16ES5_S5_ZZZNS0_15binary_internal21div_floor_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUlS5_S5_E_EEEEvS8_RKT_EUliE_EEviT1_
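Going through more than 1000 matches by hand is tedious, so here is a minimal sketch of a filter that narrows the list before demangling. The function name list_candidates is made up for this note, and it assumes the mangled name is the last whitespace-separated field on each line of the symbol dump; check a few lines of your own libtorch_cuda.txt first.

```shell
# list_candidates DUMP KERNEL TYPE   (hypothetical helper, not part of cuobjdump)
# Prints the unique mangled names in DUMP that mention both KERNEL and TYPE.
# Assumption: the mangled name is the last whitespace-separated field of a line.
list_candidates() {
    grep -F -- "$2" "$1" | awk '{ print $NF }' | grep -F -- "$3" | sort -u
}

# Example (run in libtorch/lib after creating libtorch_cuda.txt):
#   list_candidates libtorch_cuda.txt div_floor_kernel_cuda BFloat16 | c++filt
```

Piping the result through c++filt prints the demangled signatures, which makes it easier to tell the instantiations apart. To go back from a demangled signature to the mangled name, it probably helps to keep both lists side by side.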
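Use a command like the following to extract the SASS assembly of a specific function. Unlike PTX, SASS is already specific to a particular architecture, so you have to specify the arch: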
cuobjdump -sass -arch sm_70 -fun _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implINS0_13BinaryFunctorIN3c108BFloat16ES5_S5_ZZZNS0_15binary_internal21div_floor_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUlS5_S5_E_EEEEvS8_RKT_EUliE_EEviT1_ libtorch_cuda.so > div-floor0.sass
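Note that this may print many "not found" messages; you can ignore them. The library contains many fatbins, and a message is reported for every fatbin that does not contain the function.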
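If you are not sure which value to pass to -arch, the dump headers can tell you which architectures the library was built for. The sketch below counts the "arch = sm_XX" header lines in a cuobjdump output file. The helper name list_archs is made up for this note, and it assumes the symbol dump carries the same "arch = ..." header lines as the SASS dump.

```shell
# list_archs DUMP   (hypothetical helper, not part of cuobjdump)
# Prints each distinct architecture named in "arch = sm_XX" header lines,
# followed by the number of header lines that name it.
list_archs() {
    awk '$1 == "arch" && $2 == "=" { count[$3]++ }
         END { for (a in count) print a, count[a] }' "$1" | sort
}

# Example (run in libtorch/lib after creating libtorch_cuda.txt):
#   list_archs libtorch_cuda.txt
```

Any architecture that does not show up in this list has no embedded ELF code, so asking for it with -arch would presumably yield nothing but "not found" messages.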
The output file is long. Somewhere in it you can find:
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
host = linux
compile_size = 64bit
compressed
code for sm_70
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
host = linux
compile_size = 64bit
compressed
code for sm_70
Function : _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_15gpu_kernel_implINS0_13BinaryFunctorIN3c108BFloat16ES5_S5_ZZZNS0_15binary_internal21div_floor_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE2_clEvEUlS5_S5_E_EEEEvS8_RKT_EUliE_EEviT1_
.headerflags @"EF_CUDA_64BIT_ADDRESS EF_CUDA_SM70 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM70)"
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /* 0x00000a00ff017624 */
/* 0x000fe400078e00ff */
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ; /* 0x000000fffffff389 */
/* 0x000fe200000e00ff */
/*0020*/ S2R R2, SR_CTAID.X ; /* 0x0000000000027919 */
/* 0x000e220000002500 */
/*0030*/ BMOV.32.CLEAR RZ, B6 ; /* 0x0000000006ff7355 */
/* 0x000fe20000100000 */
/*0040*/ BSSY B6, 0x5cb0 ; /* 0x00005c6000067945 */
/* 0x000fe40003800000 */
/*0050*/ S2R R3, SR_TID.X ; /* 0x0000000000037919 */
/* 0x000e240000002100 */
/*0060*/ IMAD R3, R2, 0x200, R3 ; /* 0x0000020002037824 */
/* 0x001fc800078e0203 */
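Since the file is long and mostly fatbin headers, it can help to cut out just the listing of the one function. This is a minimal sketch that relies only on the layout shown above: a listing starts at a "Function : NAME" line and is taken to end at the next "Function :" line or the next "Fatbin ... code:" header. The helper name extract_sass and the output file name in the example are made up for this note.

```shell
# extract_sass NAME FILE   (hypothetical helper, not part of cuobjdump)
# Prints the "Function : NAME" line and everything after it, up to the next
# "Function :" line or the next "Fatbin elf/ptx code:" header.
extract_sass() {
    awk -v name="$1" '
        /^[[:space:]]*Function :/              { keep = ($NF == name) }
        /^[[:space:]]*Fatbin (elf|ptx) code:/  { keep = 0 }
        keep                                   { print }
    ' "$2"
}

# Example, with SYM set to the mangled name used above:
#   extract_sass "$SYM" div-floor0.sass > div-floor-only.sass
```

If the same function appears under several headers, all copies are printed one after another.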