ora--00481引起RAC 脑裂，节点2反复重启【已解决，感谢B总】

Fung920 · 发表于 2014-8-26 12:19:02

本帖最后由 Fung920 于 2014-8-26 17:04 编辑

环境：11.2.0.3 + Linux 64bit 双节点 RAC
1.状况描述
在8月22号开始，节点2首次出现
Fri Aug 22 19:26:27 2014
PMON (ospid: 13912): terminating the instance due to error 481
Instance terminated by PMON, pid = 13912

节点2开始重启，重启过程中出现
NOTE: client +ASM2:+ASM registered, osid 20296, mbr 0x0
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.RECOVERY.dg (unable to communicate with CRSD/OHASD)

再次报错
Fri Aug 22 19:41:42 2014
PMON (ospid: 20218): terminating the instance due to error 481
Instance terminated by PMON, pid = 20218

之后便起不来，昨天下午17点后手动启动节点2 Cluster，但今天客户巡检的时候发现节点2 Cluster一直在重启，查看lmon的trace文件，初步怀疑是DRM问题，导致脑裂的原因是时间的不同步，但是集群是ctss同步时间，offset也是0，同时检测两个节点的时间，发现时间也是一致的，该如何解决这个问题？

补充：
22号相关日志：
--alter_db2.log
Fri Aug 22 19:26:25 2014 --第一次错误发生时间为22号19:26
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
ASMB (ospid: 26030): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 26030
Fri Aug 22 19:28:02 2014
Starting ORACLE instance (normal) --开始自动重启Cluster

--alternode2.log
2014-08-22 19:26:16.065 --时间为22号19:26
[cssd(13669)]CRS-1612:50% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 14.140 秒后从集群中删除此节点
2014-08-22 19:26:23.079
[cssd(13669)]CRS-1611:75% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 7.130 秒后从集群中删除此节点

--ocssd.log：
2014-08-22 19:26:16.065: [ CSSD][1113200960]clssnmPollingThread: node dbserver_node1 (1) at 50% heartbeat fatal, removal in 14.140 seconds --貌似是Heartbeat不通
2014-08-22 19:26:16.065: [ CSSD][1113200960]clssnmPollingThread: node dbserver_node1 (1) is impending reconfig, flag 2491406, misstime 15860
2014-08-22 19:26:16.065: [ CSSD][1113200960]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-08-22 19:26:16.066: [ CSSD][1106893120]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424336, LATS 576293252, lastSeqNo 25251022, uniqueness 1399539197, timestamp 1408706775/576374502
2014-08-22 19:26:17.005: [ CSSD][1114777920]clssnmSendingThread: sending status msg to all nodes
2014-08-22 19:26:17.005: [ CSSD][1114777920]clssnmSendingThread: sent 4 status msgs to all nodes
2014-08-22 19:26:17.068: [ CSSD][1106893120]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424342, LATS 576294252, lastSeqNo 27424336, uniqueness 1399539197, timestamp 1408706776/576375512 --心跳出问题，内部通信有问题

客户使用的是VMWARE下部署的RAC，之前就跟他们说别在虚拟机下搞，唉。。。

26号部分日志：
--db alter_node2.log
Tue Aug 26 15:15:21 2014 --时间在15:15分左右
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
ASMB (ospid: 27935): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 27935
Tue Aug 26 15:17:00 2014
Starting ORACLE instance (normal)

Fung920 · 发表于 2014-8-26 16:49:59

--crsd.log
2014-08-26 15:15:19.380: [UiServer][1179924800] CS(0x2aaaac22a520)set Properties ( grid,0x2aaab40337d0) --时间在15:15分左右
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Container [ Name: FENCESERVER
      API_HDR_VER:
      TextMessage[2]
      CLIENT:
      TextMessage[]
      CLIENT_NAME:
      TextMessage[ocssd.bin]
      CLIENT_PID:
      TextMessage[27158]
      CLIENT_PRIMARY_GROUP:
      TextMessage[oinstall]
      LOCALE:
      TextMessage[SIMPLIFIED CHINESE_CHINA.AL32UTF8]
] -- 开始在fence 节点2了？
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Sending message to AGFW. ctx= 0x2aaab0022430, Client PID: 27158
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Force-disconnecting [0]  existing PE clients...
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Sending message: 714 to AGFW proxy server.
2014-08-26 15:15:19.394: [ AGFW][1165216064] {2:12891:157} Agfw Proxy Server received the message: UnKnown[Proxy] ID 20489:714
2014-08-26 15:15:19.394: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.CRS.dg
2014-08-26 15:15:19.394: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.DATA.dg
2014-08-26 15:15:19.394: [ AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.LISTENER.lsnr dbserver_node2 1] ID 4100:715 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.394: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.RECOVERY.dg
2014-08-26 15:15:19.394: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.asm
2014-08-26 15:15:19.394: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.dbrac.db
2014-08-26 15:15:19.394: [ AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [ AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [ AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.395: [ CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.registry.acfs
2014-08-26 15:15:19.419: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.419: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [ AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.net1.network]
2014-08-26 15:15:19.421: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.421: [ AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.dbserver_node2.vip]
2014-08-26 15:15:19.508: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.LISTENER.lsnr dbserver_node2 1] ID 4100:715 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.514: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [ AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [ AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.ons]
--ctssd.log
2014-08-26 15:15:18.799: [ CTSS][1107577152]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2014-08-26 15:15:18.799: [ CTSS][1107577152]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctssslave_swm2_3: Received time sync message from master.
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctssslave_swm: sendtime{sec[1409037290], usec[789768]}, receivetime{sec[1409037318], usec[799071]}.
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctssslave_swm: The RTT of sync msg [28009303] is too large for time sync to be accurate. Recommends retry. Returns [17].
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [dbserver_node1] )
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctsselect_msm: Failed in sync_with_master [17]
2014-08-26 15:15:18.799: [ CTSS][1111779648]ctsselect_msm: Sync interval returned in [1]
2014-08-26 15:15:18.799: [ CTSS][1107577152]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2014-08-26 15:15:21.539: [GIPCHDEM][1103374656] gipchaDaemonProcessHAInvalidate: completed ha name invalidate for node 0x1d826f90 { host 'dbserver_node1', haName '26ce-13b4-ca6c-3882', srcLuid 1caa0a47-6941674c, dstLuid c2a91655-c788666c numInf 1, contigSeq 152, lastAck 152, lastValidAck 152, sendSeq [152 : 152], createTime 905619662, sentRegister 1, localMonitor 0, flags 0x28 }
2014-08-26 15:15:21.665: [ CTSS][3027770656]Oracle Database CTSS Release 11.2.0.3.0 Production Copyright 2006, 2011 Oracle.  All rights reserved.
2014-08-26 15:15:21.665: [ CTSS][3027770656]ctss_scls_init: SCLs Context is 0x6d94d30
[  clsdmt][1098041664]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=dbserver_node2DBG_CTSSD))
2014-08-26 15:15:21.667: [  clsdmt][1098041664]PID for the Process [30476], connkey 11

还需要提供哪些信息呢？

Fung920 · 发表于 2014-8-26 17:03:28

最后确认心跳有丢包

parknkjun · 发表于 2014-9-26 11:10:24

初步看是由于你的rac-scan无法解析导致的，时间是8月22日，但是你的css,crs日志只有8月26号当天的，没办法进一步判断。

Fung920 · 发表于 2014-10-17 10:19:20

parknkjun 发表于 2014-9-26 11:10
初步看是由于你的rac-scan无法解析导致的，时间是8月22日，但是你的css,crs日志只有8月26号当天的，没办法 ...

跟scan一点关系都没有哇，就是心跳中断

new_db · 发表于 2014-11-6 17:23:06

哥们，最后咋解决的，分享下呗，

		自动登录	找回密码
密码			立即注册

ora--00481引起RAC 脑裂，节点2反复重启【已解决，感谢B总】

本帖子中包含更多资源

本帖子中包含更多资源

浏览过的版块