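I: Background
1. The story
A while ago a student from my training camp reached out: their software had crashed at a customer site, they could not find the cause, and they were getting anxious, so they asked me to take a look. My students get free dump analysis for life, so of course I had to help.
II: Crash analysis
1. Why did it crash
The training camp already has a routine for analyzing crash dumps: start with !analyze -v and let the debugger analyze the crash automatically. The simplified output is as follows: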
0:000> !analyze -v
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************
CONTEXT:  (.ecxr)
eax=15c96638 ebx=010fecb0 ecx=00000000 edx=000109a8 esi=000109a8 edi=0000001c
eip=02f1d218 esp=010fec7c ebp=010feca8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
02f1d218 8b410c          mov     eax,dword ptr [ecx+0Ch] ds:002b:0000000c=????????
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 02f1d218
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 0000000c
Attempt to read from address 0000000c

STACK_TEXT:  
WARNING: Frame IP not in any known module. Following frames may be wrong.
010feca8 758d139b     000109a8 0000001c 00000000 0x2f1d218
010fecd4 758c836a     15c9664e 000109a8 0000001c user32!_InternalCallWinProc+0x2b
010fedb8 758c7f6a     15c9664e 00000000 0000001c user32!UserCallWinProcCheckWow+0x33a
010fee1c 758cbb2f     01aef180 00000000 0000001c user32!DispatchClientMessage+0xea
010fee58 77a64f5d     010fee74 00000020 010ff110 user32!__fnDWORD+0x3f
010feee0 758cbdca     010fefb8 00000000 00000000 ntdll!KiUserCallbackDispatcher+0x4d
010feee0 758cbd3e     00000000 00000000 00000000 user32!_PeekMessage+0x2a
010fef1c 6f8a707c     010fefb8 00000000 00000000 user32!PeekMessageW+0x16e
010fef68 6f85443a     00000000 00000000 00000000 System_Windows_Forms_ni+0x22707c
010feffc 6f8540d1     00000000 ffffffff 00000000 System_Windows_Forms_ni!System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop+0x1b6
010ff050 6f853f23     00000000 00000000 00000000 System_Windows_Forms_ni!System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner+0x175
010ff07c 6f82c83d     00000000 00000000 00000000 System_Windows_Forms_ni!System.Windows.Forms.Application.ThreadContext.RunMessageLoop+0x4f
010ff094 02fa0b04     00000000 00000000 00000000 System_Windows_Forms_ni!System.Windows.Forms.Application.Run+0x35
010ff0f8 7337f066     00000000 00000000 00000000 xxx!xxx.Program.Main+0x2bc
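...
The DispatchClientMessage frame in the output shows that a message pulled from the message queue was being dispatched when the access violation occurred at 0x2f1d218. The next question is: which message was being processed?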
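As a side note on reading the exception record above: for an access violation (c0000005), Parameter[0] encodes the kind of access (0 = read, 1 = write, 8 = execute) and Parameter[1] is the faulting address. The following is a minimal Python sketch of that decoding. The helper name `describe_access_violation` is hypothetical and is not part of WinDbg; the input values are copied from the dump.

```python
# Hypothetical helper (not a WinDbg API): decode the two parameters of an
# access-violation exception record as printed by `.exr -1`.
ACCESS_VIOLATION = 0xC0000005

# Meaning of Parameter[0] for an access violation.
ACCESS_KINDS = {0: "read from", 1: "write to", 8: "execute non-executable"}


def describe_access_violation(code, params):
    """Return a one-line description of an access-violation record."""
    if code != ACCESS_VIOLATION or len(params) < 2:
        return "not an access violation"
    kind = ACCESS_KINDS.get(params[0], "access")
    return f"Attempt to {kind} address {params[1]:08x}"


# Values copied from the exception record in the dump.
print(describe_access_violation(0xC0000005, [0x00000000, 0x0000000C]))
```

An address as small as 0000000c is the typical signature of a field read through a null pointer: base 0 plus a field offset of 0Ch.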
2. Which message was being processed
To answer that, use !dso to look for the MSG structure on the call stack. The simplified output is as follows:
0:000> !dso
OS Thread Id: 0x20b0 (0)
ESP/REG  Object   Name
010FEF9C 175ea6ec System.Windows.Forms.NativeMethods+MSG[]
0:000> !mdt -e:2 175ea6ec
175ea6ec (System.Windows.Forms.NativeMethods+MSG[], Elements: 1, ElementMT=6f688e60)
[0] (System.Windows.Forms.NativeMethods+MSG) VALTYPE (MT=6f688e60, ADDR=175ea6f4)
    hwnd:00140488 (System.IntPtr)
    message:0x113 (System.Int32)
    wParam:00000531 (System.IntPtr)
    lParam:00000000 (System.IntPtr)
    time:0xfbf4f32 (System.Int32)
    pt_x:0x118 (System.Int32)
    pt_y:0x42d (System.Int32)
The message:0x113 field is the classic WM_TIMER message, that is, a timer event. In C# terms it is the form's Timer control. See the MSDN screenshot:
[Screenshot: MSDN documentation of WM_TIMER]
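The lookup from message ID to message name can be sketched in a few lines of Python. The table holds only a handful of well-known IDs from winuser.h and is deliberately incomplete. The helper name `message_name` and the `msg` dictionary are illustrative assumptions; the field values are copied from the !mdt output.

```python
# A few well-known Win32 window message IDs (values from winuser.h).
# The table is intentionally incomplete; extend it as needed.
WINDOW_MESSAGES = {
    0x000F: "WM_PAINT",
    0x0010: "WM_CLOSE",
    0x0100: "WM_KEYDOWN",
    0x0111: "WM_COMMAND",
    0x0113: "WM_TIMER",
    0x0200: "WM_MOUSEMOVE",
}


def message_name(msg_id):
    """Map a message ID to its symbolic name, if it is in the table."""
    return WINDOW_MESSAGES.get(msg_id, f"unknown (0x{msg_id:04x})")


# Fields copied from the MSG element in the !mdt output.
msg = {"hwnd": 0x00140488, "message": 0x113, "wParam": 0x531, "lParam": 0x0}

name = message_name(msg["message"])
print(name)
if name == "WM_TIMER":
    # For WM_TIMER, wParam carries the timer identifier.
    print(f"timer id = 0x{msg['wParam']:x}")
```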

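The next focus is the assembly code at the crash site. Use the ub command to disassemble backwards from the faulting address. The output is as follows: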
0:000> .ecxr
eax=15c96638 ebx=010fecb0 ecx=00000000 edx=000109a8 esi=000109a8 edi=0000001c
eip=02f1d218 esp=010fec7c ebp=010feca8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
02f1d218 8b410c          mov     eax,dword ptr [ecx+0Ch] ds:002b:0000000c=????????
0:000> ub 02f1d218 La
02f1d200 50              push    eax
02f1d201 107567          adc     byte ptr [ebp+67h],dh
02f1d204 51              push    ecx
02f1d205 83ec04          sub     esp,4
02f1d208 ff7304          push    dword ptr [ebx+4]
02f1d20b ff7308          push    dword ptr [ebx+8]
02f1d20e ff730c          push    dword ptr [ebx+0Ch]
02f1d211 8b13            mov     edx,dword ptr [ebx]
02f1d213 8b4808          mov     ecx,dword ptr [eax+8]
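02f1d216 8b09            mov     ecx,dword ptr [ecx]
No function name is shown for 02f1d218, so from experience this is probably a small function generated dynamically by the JIT, with 02f1d204 as its entry point. The program crashed because the instruction at 02f1d218 read from [ecx+0Ch] while ecx=0. Next, trace backwards to where ecx came from and see whether anything new turns up. The output is as follows: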
0:000> dp 15c96638+0x8 L1
15c96640  015a8658
0:000> dp 015a8658 L1
015a8658  00000000
0:000> !do 015a8658
<Note: this object has an invalid CLASS field>
Invalid object
0:000> !dumpmd 015a8658
015a8658 is not a MethodDesc
0:000> !dumpmt 015a8658
015a8658 is not a MethodTable
Nothing turns up here: 015a8658 is not an object, not a MethodTable, and not a MethodDesc. That threw me straight into the abyss...
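The pointer chain that ends in the crash can be replayed on paper. Below is a small Python sketch that models only the two memory cells shown by the dp commands and walks the last three instructions of the disassembly. The `MEMORY` dictionary, the `AccessViolation` class, and the 64 KB "unmapped" threshold are simplifying assumptions for illustration, not something the debugger provides.

```python
# Sparse model of the dump: address -> 32-bit value, from the dp output.
MEMORY = {
    0x15C96640: 0x015A8658,  # [eax+8]
    0x015A8658: 0x00000000,  # [ecx]
}


class AccessViolation(Exception):
    """Raised when the model reads from an unmapped low address."""


def read_dword(address):
    # Assumption: treat the first 64 KB as unmapped, which is where
    # null-pointer dereferences with small offsets land on Windows.
    if address < 0x10000:
        raise AccessViolation(f"read from {address:08x}")
    return MEMORY[address]


def replay(eax):
    ecx = read_dword(eax + 8)     # mov ecx,dword ptr [eax+8]
    ecx = read_dword(ecx)         # mov ecx,dword ptr [ecx]
    return read_dword(ecx + 0xC)  # mov eax,dword ptr [ecx+0Ch]  (faults)


try:
    replay(0x15C96638)  # eax value from .ecxr
except AccessViolation as exc:
    print(exc)
```

The replay faults at address 0000000c, which matches Parameter[1] of the exception record: the slot at 015a8658 holds 0, so the final load is a field read through a null pointer.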
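3. Finding hope in despair
No good idea came to mind right away, so I went to the doorway to smoke and think. message:0x113 is a Win32 timer, so the timer callback most likely crashed unexpectedly inside the JIT-generated function, and in principle the corresponding C# Timer should be somewhere in memory near the crash site. With that idea I searched the memory around 015a8658, and really did find it, as follows: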
0:000> dp 015a8658 L4
015a8658  00000000 2d61d1a8 2d4ef48c 00000000
0:000> !do 2d61d1a8
Name:        System.Windows.Forms.NativeMethods+WndProc
MethodTable: 6f687200
EEClass:     6f681458
Size:        32(0x20) bytes
File:        C:\windows\Microsoft.Net\assembly\GAC_MSIL\System.Windows.Forms\v4.0_4.0.0.0__b77a5c561934e089\System.Windows.Forms.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
71ec2734  40002f3        4        System.Object  0 instance 2d61d164 _target
71ec2734  40002f4        8        System.Object  0 instance 00000000 _methodBase
71ec7b18  40002f5        c        System.IntPtr  1 instance  5b73c34 _methodPtr
71ec7b18  40002f6       10        System.IntPtr  1 instance        0 _methodPtrAux
71ec2734  4000300       14        System.Object  0 instance 00000000 _invocationList
71ec7b18  4000301       18        System.IntPtr  1 instance        0 _invocationCount
0:000> !do 2d61d164
Name:        System.Windows.Forms.Timer+TimerNativeWindow
MethodTable: 6f6995e4
EEClass:     6f6ede04
Size:        56(0x38) bytes
File:        C:\windows\Microsoft.Net\assembly\GAC_MSIL\System.Windows.Forms\v4.0_4.0.0.0__b77a5c561934e089\System.Windows.Forms.dll
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
71ec2734  40005ba        4        System.Object  0 instance 00000000 __identity
71ec7b18  4001cf9       18        System.IntPtr  1 instance        0 handle
6f687200  4001cfa        8 ...veMethods+WndProc  0 instance 2d61d1a8 windowProc
71ec7b18  4001cfb       1c        System.IntPtr  1 instance 15ca1bee windowProcPtr
71ec7b18  4001cfc       20        System.IntPtr  1 instance 77a77f70 defWindowProc
71ec878c  4001cfd       28       System.Boolean  1 instance        1 suppressedGC
71ec878c  4001cfe       29       System.Boolean  1 instance        0 ownHandle
6f685da8  4001cff        c ...orms.NativeWindow  0 instance 00000000 previousWindow
6f685da8  4001d00       10 ...orms.NativeWindow  0 instance 00000000 nextWindow
71ec6018  4001d01       14 System.WeakReference  0 instance 2d61d19c weakThisPtr
70229854  4001d02       24         System.Int32  1 instance        0 windowDpiAwarenessContext
713fe7cc  4001ce3      b88 ...stics.TraceSwitch  0   static 00000000 WndProcChoice
71ec426c  4001ce4      b8c       System.Int32[]  0   static 03111988 primes
71ec878c  4001ceb     1312       System.Boolean  1   static        1 anyHandleCreatedInApp
71ec42a8  4001ced     1304         System.Int32  1   static     1786 handleCount
71ec42a8  4001cee     1308         System.Int32  1   static     2915 hashLoadSize
6f685e9c  4001cef      b90 ...ow+HandleBucket[]  0   static 2c7b5f14 hashBuckets
71ec7b18  4001cf0     130c        System.IntPtr  1   static 77a77f70 userDefWindowProc
71ec3a08  4001cf3     1313          System.Byte  1   static        0 userSetProcFlagsForApp
71ec882c  4001cf4     1310         System.Int16  1   static        1 globalID
71f1c594  4001cf5      b94 ...ntPtr, mscorlib]]  0   static 03111bc8 hashForIdHandle
71f1c6d0  4001cf6      b98 ...Int16, mscorlib]]  0   static 03111c3c hashForHandleId
71ec2734  4001cf7      b9c        System.Object  0   static 03111b90 internalSyncObject
71ec2734  4001cf8      ba0        System.Object  0   static 03111b9c createWindowSyncObject
71ec878c  4001cea      979       System.Boolean  1 TLstatic  anyHandleCreated
    >> Thread:Value 20b0:1 <<
71ec3a08  4001cf1      97a          System.Byte  1 TLstatic  wndProcFlags
    >> Thread:Value 20b0:1 <<
71ec3a08  4001cf2      97b          System.Byte  1 TLstatic  userSetProcFlags
    >> Thread:Value 20b0:1 <<
6f69ad98  400415e       2c ...ndows.Forms.Timer  0 instance 14462858 _owner
71ec42a8  400415f       30         System.Int32  1 instance        0 _timerID
71ec878c  4004161       2a       System.Boolean  1 instance        0 _stoppingTimer
71ec42a8  4004160     190c         System.Int32  1   static     2462 TimerID
As if it were meant to be... After a burst of excitement, the next step is to find out where this Timer comes from, which !gcroot 2d61d164 will show.
0:000> !gcroot 2d61d164
Thread 20b0:
    010fef80 6f85443a System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr, Int32, Int32)
        ebx:  (interior)
            ->  040e5568 System.Object[]
            ->  031ced2c System.Windows.Forms.FormCollection
            ->  031ced44 System.Collections.ArrayList
            ->  1df9aaec System.Object[]
            ...
            ->  144625fc DevComponents.DotNetBar.Controls.ComboBoxEx
            ->  14462858 System.Windows.Forms.Timer
            ->  2d61d164 System.Windows.Forms.Timer+TimerNativeWindow
The reference chain shows that the Timer hangs under a DevComponents.DotNetBar.Controls.ComboBoxEx control, so the next step was to look up the source code of that control. Screenshot below:
[Screenshot: decompiled source of DevComponents.DotNetBar.Controls.ComboBoxEx]

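Damn, the code turned out to be encrypted, which left me speechless. Since the code belongs to the DevComponents component, I checked the component's version and found that it dates from 2002, 23 years ago. It would be strange if it had no bugs... Screenshot below:
[Screenshot: DevComponents version information]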

My final advice to my friend was to upgrade DevComponents or find a replacement.
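The step of reading the !gcroot chain to find the responsible component can be sketched as a small Python routine: walk the chain from the target object back toward the root and stop at the first type that does not belong to the framework. The chain is copied from the output earlier, including the part the original elides with "...", which stays elided here. The function name `nearest_non_framework_owner` and the "System." prefix rule are assumptions made for illustration.

```python
# Reference chain from the !gcroot output (root first, target last).
# The original output elides some entries between 1df9aaec and 144625fc;
# they are not reconstructed here.
CHAIN = [
    ("040e5568", "System.Object[]"),
    ("031ced2c", "System.Windows.Forms.FormCollection"),
    ("031ced44", "System.Collections.ArrayList"),
    ("1df9aaec", "System.Object[]"),
    ("144625fc", "DevComponents.DotNetBar.Controls.ComboBoxEx"),
    ("14462858", "System.Windows.Forms.Timer"),
    ("2d61d164", "System.Windows.Forms.Timer+TimerNativeWindow"),
]


def nearest_non_framework_owner(chain, framework_prefixes=("System.",)):
    """Walk from the target toward the root and return the first entry
    whose type is not a framework type, or None if there is none."""
    for address, type_name in reversed(chain):
        if not type_name.startswith(framework_prefixes):
            return address, type_name
    return None


print(nearest_non_framework_owner(CHAIN))
```

For this chain the routine stops at 144625fc, the ComboBoxEx control, which is the same conclusion reached by reading the chain by eye.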
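III: Summary
Some say bug analysis is a kind of forensic science: you keep looking for hope in despair. As the old poem goes, sifting a thousand times is hard work, but only when the sand has been blown away does the gold appear!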
 
