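Our articles are published simultaneously on the WeChat official account IT民工的龙马人生 and on the blog ( www.htz.pw ). You are welcome to follow, bookmark, and repost them; when reposting, please credit the source at the top of the article. Thank you!
Because the blog posts contain a lot of code, they read better in a web browser.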
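This case was shared by a friend and is a rather odd one, so I am passing it along here. In an enterprise database cluster, time synchronization is fundamental to stable operation. It can be implemented with the operating system's NTP service or with Oracle's CTSS (Cluster Time Synchronization Service); among the customers we have served, most chose the operating system's NTP.
I. Case Background
A customer deployed a 6-node Oracle RAC (version 11.2.0.4). No NTP server was configured for the cluster; instead, the CTSS component that ships with RAC was used to synchronize clocks between nodes. The customer reported that nodes 1 and 2 synchronized normally, but nodes 3 through 6 never managed to synchronize, and the problem had gone unresolved for a long time.
II. How CTSS Time Synchronization Works
CTSS is the time synchronization service that ships with Oracle RAC. When no NTP is configured on the cluster, CTSS takes over automatically (active mode) and elects one node as the “reference node”; the other nodes use it as the baseline and make small adjustments to their clocks. This keeps the time on all nodes in the cluster roughly consistent and avoids the various anomalies caused by clock drift.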
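The reference-node idea can be illustrated with a small Python sketch. This is not Oracle's implementation: the `Node` class, the `slew_step` function, and the 5 ms per-step limit are all hypothetical names and values chosen for illustration. It only shows how a follower node might measure its offset against the reference and correct it in small steps rather than in one jump.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    clock_ms: float  # local clock reading in milliseconds (simulated)


def offset_ms(node: Node, reference: Node) -> float:
    """Positive result means the node is ahead of the reference node."""
    return node.clock_ms - reference.clock_ms


def slew_step(node: Node, reference: Node, max_step_ms: float = 5.0) -> float:
    """Move the node's clock toward the reference by at most max_step_ms.

    Returns the correction that was applied (negative slows the clock down).
    The 5 ms limit is an assumption for this sketch, not a CTSS setting.
    """
    diff = offset_ms(node, reference)
    correction = -max(-max_step_ms, min(max_step_ms, diff))
    node.clock_ms += correction
    return correction


reference = Node("node1", 1_000_000.0)
follower = Node("node3", 1_000_012.0)  # simulated: 12 ms ahead of the reference

steps = 0
while abs(offset_ms(follower, reference)) > 1e-9:
    slew_step(follower, reference)
    steps += 1

print(steps, offset_ms(follower, reference))
```

With these simulated values the follower converges in three steps (5 ms, 5 ms, 2 ms). The point of the sketch is that every follower needs a working channel to the reference node; if that channel fails, as in this case, no offset can be measured at all.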
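III. Symptoms and Initial Analysis
The GI (Grid Infrastructure) alert log on node 3 showed that the CTSS reference node was node 1, but node 3 reported a large number of errors when communicating with the reference node: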
- “Msg NOT meant for this member”
- “gipcInternalSend: cannot send empty buffer”
- “gipcSend failed. rc= 1.”
- “failed to resolve ret gipcretKeyNotFound (36)”
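- “disconnect”-related entries
These entries indicate that the GIPC (Grid Interprocess Communication) layer that CTSS depends on was failing, so node 3 could not communicate properly with the reference node (node 1).
The detailed log follows: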
Msg NOT meant for this member. This member id (3, 1718544606). Dest in Msg (3, 4294802950). Message dropped.
2023-03-16 10:50:32.856: [ CRSCCL][3769472768] ****
Connection NOT meant for this member. Dest in Msg (3, 4294802950)
2023-03-16 10:50:32.856: [GIPCXCPT][3769472768] gipcInternalSend: cannot send empty buffer buf 0x7fa9c800c140, len 0, ret gipcretFail (1)
2023-03-16 10:50:32.856: [GIPCXCPT][3769472768] gipcSendF [clsCclGipcSend : clsCclCommHandler.c : 3756]: EXCEPTION[ ret gipcretFail (1) ] failed to send on endp 0x7fa9cc57c0e0 [00000000000538de] { gipcEndpoint : localAddr 'gipcha://core-js3:CTSSGROUP_3/af57-c2f1-bdb7-8cc', remoteAddr 'gipcha://core-js1:6208-4b90-6a5f-02e', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 81272, readyRef (nil), ready 0, wobj 0x7fa9cc59cb40, sendp (nil)flags 0x138606, usrFlags 0x20 }, addr 0000000000000000, buf 0x7fa9c800c140, len 0, cookie (nil), flags 0x0
2023-03-16 10:50:32.856: [ CRSCCL][3769472768]gipcSend failed. rc= 1.
2023-03-16 10:50:32.856: [ CRSCCL][3769472768]Failed to send new connection deny msg.
2023-03-16 10:50:33.857: [GIPCHAUP][3776157440] gipchaUpperDisconnect: initiated discconnect umsg 0x7fa9cc523880 { msg 0x7fa9cc58ef68, ret gipcretRequestPending (15), flags 0x2 }, msg 0x7fa9cc58ef68 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-000538d3, dstCid 00000000-0ebca523 }, endp 0x7fa9cc58e010 [00000000000538d3] { gipchaEndpoint : port 'CTSSGROUP_3/af57-c2f1-bdb7-8cc2', peer 'core-js1:6208-4b90-6a5f-02ec', srcCid 00000000-000538d3, dstCid 00000000-0ebca523, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c }
2023-03-16 10:50:33.858: [GIPCHAUP][3776157440] gipchaUpperCallbackDisconnect: completed DISCONNECT ret gipcretSuccess (0), umsg 0x7fa9cc523880 { msg 0x7fa9cc58ef68, ret gipcretSuccess (0), flags 0x2 }, msg 0x7fa9cc58ef68 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-000538d3, dstCid 00000000-0ebca523 }, hendp 0x7fa9cc58e010 [00000000000538d3] { gipchaEndpoint : port 'CTSSGROUP_3/af57-c2f1-bdb7-8cc2', peer 'core-js1:6208-4b90-6a5f-02ec', srcCid 00000000-000538d3, dstCid 00000000-0ebca523, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c }
2023-03-16 10:50:39.519: [GIPCXCPT][3763169024] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port '8588-94d7-4477-b30a', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2023-03-16 10:50:39.519: [GIPCHGEN][3763169024] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port '8588-94d7-4477-b30a', flags 0x0
2023-03-16 10:50:39.519: [GIPCHAUP][3776157440] gipchaUpperDisconnect: initiated discconnect umsg 0x7fa9cc5045c0 { msg 0x7fa9cc57a788, ret gipcretRequestPending (15), flags 0x2 }, msg 0x7fa9cc57a788 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00053886, dstCid 00000000-0ebca3e8 }, endp 0x7fa9c0014af0 [0000000000053886] { gipchaEndpoint : port 'd707-f99e-f6ed-759a', peer 'core-js1:CTSSGROUP_1/ca26-4aaf-b16c-efcf', srcCid 00000000-00053886, dstCid 00000000-0ebca3e8, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c }
2023-03-16 10:50:39.519: [ CRSCCL][3769472768]GW: No record of this endp= 0x53871
2023-03-16 10:50:39.519: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8550, flags:0x0
2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDissociateF : gipc.c : 2230]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcDissociateF [clsCclGipcDisconnect : clsCclCommHandler.c : 4031]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed to dissociate obj 0000000000053871, flags 0x0
2023-03-16 10:50:39.519: [ CRSCCL][3769472768]clsCclGipcDisconnect: gipcDissociate() failed. rc= 3.
2023-03-16 10:50:39.519: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8550, flags:0x0
2023-03-16 10:50:39.519: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDisconnectF : gipc.c : 1572]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2023-03-16 10:50:39.519: [GIPCTRAC][3769472768] gipcDisconnectF [clsCclGipcDisconnect : clsCclCommHandler.c : 4037]: EXCEPTION[ ret gipcretInvalidObject (3) ] failed disconnect endp 0000000000053871, flags 0x0
2023-03-16 10:50:39.520: [ GIPCLIB][3769472768] gipclibMapSearch: gipcMapSearch() -> gipcMapGetNodeAddr() failed: ret:gipcretKeyNotFound (36), ht:0x1016010, idxPtr:0x1020720, key:0x7fa9e0ad8560, flags:0x0
2023-03-16 10:50:39.520: [GIPCXCPT][3769472768] gipcObjectLookupF [gipcDestroyF : gipc.c : 2982]: search found no matching oid 0000000000053871, ret gipcretKeyNotFound (36), ret gipcretInvalidObject (3)
2023-03-16 10:50:39.520: [GIPCXCPT][3769472768] gipcDestroyF [clsCclGipcDisconnect : clsCclCommHandler.c : 4046]: EXCEPTION[ ret gipcretInvalidObject (3) ] failure to destroy obj 0000000000053871, flags 0x0
2023-03-16 10:50:39.520: [GIPCHAUP][3776157440] gipchaUpperCallbackDisconnect: completed DISCONNECT ret gipcretSuccess (0), umsg 0x7fa9cc5045c0 { msg 0x7fa9cc57a788, ret gipcretSuccess (0), flags 0x2 }, msg 0x7fa9cc57a788 { type gipchaMsgTypeDisconnect (5), srcCid 00000000-00053886, dstCid 00000000-0ebca3e8 }, hendp 0x7fa9c0014af0 [0000000000053886] { gipchaEndpoint : port 'd707-f99e-f6ed-759a', peer 'core-js1:CTSSGROUP_1/ca26-4aaf-b16c-efcf', srcCid 00000000-00053886, dstCid 00000000-0ebca3e8, numSend 0, maxSend 100, groupListType 2, hagroup 0x1079f90, usrFlags 0x4000, flags 0x21c }
2023-03-16 10:50:50.371: [ CTSS][3784607488]ctss_checkcb: clsdm requested check alive. checkcb_data{mode[0x84], offset[0 ms]}, length=[8].
2023-03-16 10:51:03.861: [GIPCXCPT][3776157440] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port 'b5b6-c3d5-47e6-d678', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2023-03-16 10:51:03.861: [GIPCHGEN][3776157440] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port 'b5b6-c3d5-47e6-d678', flags 0x0
2023-03-16 10:51:03.862: [GIPCXCPT][3776157440] gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'core-js3', port '09ad-25a6-7338-b805', hctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2023-03-16 10:51:03.862: [GIPCHGEN][3776157440] gipchaResolveF [gipcmodGipcResolve : gipcmodGipc.c : 815]: EXCEPTION[ ret gipcretKeyNotFound (36) ] failed to resolve ctx 0x104ea40 [0000000000000010] { gipchaContext : host 'core-js3', name '57d6-c579-303e-038b', luid '4b08e6b2-00000000', numNode 1, numInf 2, usrFlags 0x0, flags 0x5 }, host 'core-js3', port '09ad-25a6-7338-b805', flags 0x0
2023-03-16 10:51:03.862: [ CRSCCL][3769472768]clsCclNewConn: added new conn to tempConList: newPeerCon = c800af60
2023-03-16 10:51:03.862: [ CRSCCL][3769472768] ****
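With this much output, it helps to summarize the log before reading it line by line. The following Python sketch counts entries per component tag and per keyword. The regular expression is an assumption based only on the line layout visible in the excerpt above (timestamp, bracketed component, bracketed thread id), and the function name `summarize` is hypothetical. The sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "<timestamp>: [<component>][<thread>]<message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<component>\w+)\]\[(?P<thread>\d+)\]\s*(?P<msg>.*)$"
)

# Keywords taken from the error list in this section.
KEYWORDS = (
    "Msg NOT meant for this member",
    "cannot send empty buffer",
    "gipcSend failed",
    "gipcretKeyNotFound",
    "disconnect",
)


def summarize(lines):
    """Return (entries per component, lines per keyword), case-insensitive."""
    by_component = Counter()
    by_keyword = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            by_component[match["component"]] += 1
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword.lower() in lowered:
                by_keyword[keyword] += 1
    return by_component, by_keyword


sample = [
    "Msg NOT meant for this member. This member id (3, 1718544606). "
    "Dest in Msg (3, 4294802950). Message dropped.",
    "2023-03-16 10:50:32.856: [ CRSCCL][3769472768]gipcSend failed. rc= 1.",
    "2023-03-16 10:50:32.856: [ CRSCCL][3769472768]Failed to send new connection deny msg.",
    "2023-03-16 10:50:39.519: [ CRSCCL][3769472768]GW: No record of this endp= 0x53871",
    "2023-03-16 10:50:39.519: [ CRSCCL][3769472768]clsCclGipcDisconnect: gipcDissociate() failed. rc= 3.",
]

components, keywords = summarize(sample)
print(components.most_common())
print(keywords.most_common())
```

In practice the same function could be fed the whole log file with `summarize(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever the CTSS log lives on the node.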
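IV. Troubleshooting Process
1. Check the firewall configuration
The customer had the operating system firewall enabled on multiple nodes, so the first suspicion was that the firewall was blocking inter-node traffic. A check of the firewall rules showed that both the private network IPs and the HAIP (High Availability IP) addresses were already allowed, which ruled out the firewall.
The firewall rules in detail: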
# For RAC HAIP
-A INPUT -s 169.254.0.0/16 -j ACCEPT
# For RAC and Storage
-A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 10.xxx.xxx.xxx-10.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
-A INPUT -m iprange --src-range 172.xxx.xxx.xxx-172.xxx.xxx.xxx -j ACCEPT
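To double-check that every interconnect address is covered by rules like these, the matching logic can be reproduced offline. The sketch below is a rough Python model of the two rule types shown (a `-s` CIDR match and `--src-range` matches). Because the real ranges are masked in the listing, the addresses in `ALLOWED_RANGES` are hypothetical placeholders, and `is_accepted` is an illustrative name.

```python
from ipaddress import ip_address, ip_network

# From the "-s 169.254.0.0/16" rule (HAIP link-local addresses).
HAIP_NET = ip_network("169.254.0.0/16")

# Hypothetical stand-ins for the masked --src-range values in the listing.
ALLOWED_RANGES = [
    (ip_address("10.0.0.1"), ip_address("10.0.0.6")),
    (ip_address("172.16.0.1"), ip_address("172.16.0.6")),
]


def is_accepted(src: str) -> bool:
    """Return True if a source address would match one of the ACCEPT rules."""
    addr = ip_address(src)
    if addr in HAIP_NET:
        return True
    # iprange matches are inclusive on both ends.
    return any(low <= addr <= high for low, high in ALLOWED_RANGES)


for candidate in ("169.254.12.34", "10.0.0.3", "10.0.0.7", "192.168.1.1"):
    print(candidate, is_accepted(candidate))
```

Running each node's private IP and HAIP through a check like this is a quick way to confirm the "firewall is not the cause" conclusion without touching the live rules.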
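2. Check the GIPC process and the network
Killing the GIPC process on node 3 did not help. Watching the network statistics with watch + netstat -s showed no abnormal growth in any counter, which suggests the network itself was not dropping packets or producing errors.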
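Comparing two netstat -s snapshots by eye is error-prone, so the comparison can be scripted. The Python sketch below parses the common "number followed by description" counter lines and reports only error-like counters that grew between two snapshots. The snapshot text is made-up sample data, not output from this case; the function names and keyword list are assumptions; counter lines in other layouts are simply skipped.

```python
import re

# Matches indented lines of the form "    <count> <description>".
COUNTER_RE = re.compile(r"^\s+(\d+)\s+(.+?)\s*$")


def parse_netstat_s(text):
    """Parse 'netstat -s' style text into {"Section: description": count}."""
    counters = {}
    section = ""
    for line in text.splitlines():
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            section = line.strip().rstrip(":")
            continue
        match = COUNTER_RE.match(line)
        if match:
            counters[f"{section}: {match.group(2)}"] = int(match.group(1))
    return counters


def growing_counters(before, after, keywords=("error", "drop", "fail", "retrans")):
    """Return error-like counters whose value increased between snapshots."""
    deltas = {}
    for key, value in after.items():
        delta = value - before.get(key, 0)
        if delta > 0 and any(word in key.lower() for word in keywords):
            deltas[key] = delta
    return deltas


# Illustrative snapshots only; the numbers are invented for the example.
BEFORE = """Udp:
    1200 packets received
    0 packet receive errors
Tcp:
    15 segments retransmitted
"""
AFTER = """Udp:
    1850 packets received
    0 packet receive errors
Tcp:
    15 segments retransmitted
"""

print(growing_counters(parse_netstat_s(BEFORE), parse_netstat_s(AFTER)))
```

An empty result corresponds to the observation in this step: traffic counters increase, but no error, drop, or retransmission counter does.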
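3. Review the logs on other nodes
Next, the CTSS log on node 1 was analyzed; only the key entries are excerpted here.
Node 1 successfully received messages only from node 2.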
2023-03-16 14:29:13.716: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b3e150 [000000000ebfc6ab] { gipchaEndpoint : port 'CTSSGROUP_1/7879-176e-8975-a49d', peer 'core-js5:3c81-0971-ff00-3b4c', srcCid 00000000-0ebfc6ab, dstCid 00000000-00f82fbc, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 }
2023-03-16 14:29:22.945: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b08b10 [000000000ebfc7a1] { gipchaEndpoint : port 'CTSSGROUP_1/073a-7810-bae1-426a', peer 'core-js4:388e-ae87-6944-0202', srcCid 00000000-0ebfc7a1, dstCid 00000000-02848bb3, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 }
2023-03-16 14:29:18.616: [ CRSCCL][2799126272]PNC: Waiting for peer join from grpstat for peer (3,1718544606).
2023-03-16 14:29:20.327: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: processing DISCONNECT for hendp 0x7fb623b5b3a0 [000000000ebfc755] { gipchaEndpoint : port 'CTSSGROUP_1/4e5d-8fc4-b709-d718', peer 'core-js6:d1a4-904c-bd8f-3cb1', srcCid 00000000-0ebfc755, dstCid 00000000-00f83f34, numSend 0, maxSend 100, groupListType 1, hagroup 0x1654f90, usrFlags 0x4000, flags 0x204 }
2023-03-16 14:29:20.328: [ CRSCCL][2799126272]clsCclGipcWait: GW: Marking Disconnected on temp con list.
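The CTSS log on node 1 shows that it successfully received messages only from node 2, while nodes 3-6 were all in a disconnected state. “processing DISCONNECT” appears repeatedly in the node 1 log, indicating that node 1 was dropping its connections to the other nodes.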
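The list of affected peers can be pulled out of the node 1 log mechanically. The Python sketch below extracts the peer host name from each “processing DISCONNECT” entry and the member number from each “Waiting for peer join” entry. Both regular expressions are assumptions derived from the excerpt above, and the sample lines are trimmed copies of those entries (the endpoint details after the peer field are cut).

```python
import re

# Host name is the part of the peer field before the first colon.
PEER_RE = re.compile(r"processing DISCONNECT for hendp .*?peer '([^':]+):")
# Member number is the first value in "peer (<member>,<incarnation>)".
PENDING_RE = re.compile(r"Waiting for peer join from grpstat for peer \((\d+),")


def disconnected_peers(lines):
    """Return the sorted host names seen in DISCONNECT entries."""
    return sorted({m.group(1) for line in lines if (m := PEER_RE.search(line))})


def pending_members(lines):
    """Return the sorted member numbers node 1 is still waiting for."""
    return sorted({int(m.group(1)) for line in lines if (m := PENDING_RE.search(line))})


sample = [
    "2023-03-16 14:29:13.716: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: "
    "processing DISCONNECT for hendp 0x7fb623b3e150 [000000000ebfc6ab] "
    "{ gipchaEndpoint : port 'CTSSGROUP_1/7879-176e-8975-a49d', peer 'core-js5:3c81-0971-ff00-3b4c' }",
    "2023-03-16 14:29:22.945: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: "
    "processing DISCONNECT for hendp 0x7fb623b08b10 [000000000ebfc7a1] "
    "{ gipchaEndpoint : port 'CTSSGROUP_1/073a-7810-bae1-426a', peer 'core-js4:388e-ae87-6944-0202' }",
    "2023-03-16 14:29:18.616: [ CRSCCL][2799126272]PNC: Waiting for peer join from grpstat for peer (3,1718544606).",
    "2023-03-16 14:29:20.327: [GIPCHAUP][2805810944] gipchaUpperProcessDisconnect: "
    "processing DISCONNECT for hendp 0x7fb623b5b3a0 [000000000ebfc755] "
    "{ gipchaEndpoint : port 'CTSSGROUP_1/4e5d-8fc4-b709-d718', peer 'core-js6:d1a4-904c-bd8f-3cb1' }",
]

print(disconnected_peers(sample))
print(pending_members(sample))
```

For the excerpt, this yields core-js4, core-js5, and core-js6 as disconnected peers and member 3 as still pending, which matches the observation that only node 2 is talking to node 1.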
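4. Reference node switchover experiment
After the GIPC process on node 1 was killed, the CTSS reference node automatically switched to node 2, and time synchronization returned to normal on all nodes!
V. Root Cause Analysis
Talking with the customer revealed that the cluster originally had only nodes 1 and 2 and was later expanded to 6 nodes through the “add node” procedure. Time synchronization on nodes 3-6 had been abnormal ever since they joined.
Taking the logs and symptoms together, the working hypothesis is that in the “add node” scenario, when the CTSS component on a newly added node talks to the original reference node over GIPC, stale configuration or an abnormal internal state may cause the communication to fail, so the clocks cannot be synchronized. After the reference node was switched, all nodes re-established their connections and the problem disappeared.
VI. Lessons Learned and Recommendations
- When to use CTSS: CTSS is still rarely used in China, and hands-on experience with it is limited. For production environments, prefer configuring the operating system's NTP rather than relying on CTSS.
- Adding nodes: after expanding a RAC cluster, consider restarting the CTSS service or switching the reference node to make sure all nodes can communicate properly.
- Troubleshooting approach: when time synchronization misbehaves, check the network, the firewall, the state of the GIPC process, and the reference node configuration first.
- Log analysis: make good use of the GI alert log, the CTSS log, and the GIPC log, and watch for keywords such as “disconnect” and “failed to resolve”.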
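As a follow-up to the recommendations above, the CTSS state of each node can be checked after an expansion or a reference node switch. The Python sketch below parses text in the shape that `crsctl check ctss` typically prints in 11.2 (a CRS-4701 mode line and a CRS-4702 offset line). The exact wording should be treated as an assumption and verified against your own environment. The sample outputs, the offsets, the function names, and the 1000 ms limit are all illustrative values, not data from this case.

```python
import re

MODE_RE = re.compile(r"is in (\w+) mode", re.IGNORECASE)
OFFSET_RE = re.compile(r"Offset \(in msec\):\s*(-?\d+)")


def parse_ctss_check(output):
    """Return (mode, offset_ms) parsed from one node's check output."""
    mode_match = MODE_RE.search(output)
    offset_match = OFFSET_RE.search(output)
    mode = mode_match.group(1).lower() if mode_match else None
    offset = int(offset_match.group(1)) if offset_match else None
    return mode, offset


def needs_attention(outputs, limit_ms=1000):
    """Return nodes that are not in active mode or exceed the offset limit.

    limit_ms is a hypothetical threshold for this sketch.
    """
    flagged = []
    for node, output in outputs.items():
        mode, offset = parse_ctss_check(output)
        if mode != "active" or offset is None or abs(offset) > limit_ms:
            flagged.append(node)
    return sorted(flagged)


# Invented sample outputs, keyed by node name.
SAMPLE = {
    "node1": (
        "CRS-4701: The Cluster Time Synchronization Service is in Active mode.\n"
        "CRS-4702: Offset (in msec): 0\n"
    ),
    "node3": (
        "CRS-4701: The Cluster Time Synchronization Service is in Active mode.\n"
        "CRS-4702: Offset (in msec): 5400\n"
    ),
}

print(needs_attention(SAMPLE))
```

Collecting the command output from every node and running it through a check like this gives a quick pass/fail view of the whole cluster instead of relying on a single node's log.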
------------------About the Author-----------------------
Name: 黄廷忠
Current employer: Oracle China Advanced Services team
Previous employers: OceanBase, 云和恩墨, 东方龙马, and others
Phone, WeChat, QQ: 18081072613
Personal blog: (http://www.htz.pw)
CSDN: (https://blog.csdn.net/wwwhtzpw)
cnblogs (博客园): (https://www.cnblogs.com/www-htz-pw)