ORACLE SOS-故障诊断-ORACLE运维管理-ora-00481引起RAC 脑裂，节点2反复重启【已解决，感谢B总】

Fung920 发表于 2014-8-26 12:19:02

ora--00481引起RAC 脑裂，节点2反复重启【已解决，感谢B总】

本帖最后由 Fung920 于 2014-8-26 17:04 编辑

环境：11.2.0.3 + Linux 64bit 双节点 RAC
1.状况描述
在8月22号开始，节点2首次出现
Fri Aug 22 19:26:27 2014
PMON (ospid: 13912): terminating the instance due to error 481
Instance terminated by PMON, pid = 13912

节点2开始重启，重启过程中出现
NOTE: client +ASM2:+ASM registered, osid 20296, mbr 0x0
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.RECOVERY.dg (unable to communicate with CRSD/OHASD)

再次报错
Fri Aug 22 19:41:42 2014
PMON (ospid: 20218): terminating the instance due to error 481
Instance terminated by PMON, pid = 20218

之后便起不来，昨天下午17点后手动启动节点2 Cluster，但今天客户巡检的时候发现节点2 Cluster一直在重启，查看lmon的trace文件，初步怀疑是DRM问题，导致脑裂的原因是时间的不同步，但是集群是ctss同步时间，offset也是0，同时检测两个节点的时间，发现时间也是一致的，该如何解决这个问题？

补充：
22号相关日志：
--alter_db2.log
Fri Aug 22 19:26:25 2014--第一次错误发生时间为22号19:26
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
ASMB (ospid: 26030): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 26030
Fri Aug 22 19:28:02 2014
Starting ORACLE instance (normal) --开始自动重启Cluster

--alternode2.log
2014-08-22 19:26:16.065 --时间为22号19:26
CRS-1612:50% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 14.140 秒后从集群中删除此节点
2014-08-22 19:26:23.079
CRS-1611:75% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 7.130 秒后从集群中删除此节点

--ocssd.log：
2014-08-22 19:26:16.065: [ CSSD]clssnmPollingThread: node dbserver_node1 (1) at 50% heartbeat fatal, removal in 14.140 seconds --貌似是Heartbeat不通
2014-08-22 19:26:16.065: [ CSSD]clssnmPollingThread: node dbserver_node1 (1) is impending reconfig, flag 2491406, misstime 15860
2014-08-22 19:26:16.065: [ CSSD]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-08-22 19:26:16.066: [ CSSD]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424336, LATS 576293252, lastSeqNo 25251022, uniqueness 1399539197, timestamp 1408706775/576374502
2014-08-22 19:26:17.005: [ CSSD]clssnmSendingThread: sending status msg to all nodes
2014-08-22 19:26:17.005: [ CSSD]clssnmSendingThread: sent 4 status msgs to all nodes
2014-08-22 19:26:17.068: [ CSSD]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424342, LATS 576294252, lastSeqNo 27424336, uniqueness 1399539197, timestamp 1408706776/576375512 --心跳出问题，内部通信有问题

客户使用的是VMWARE下部署的RAC，之前就跟他们说别在虚拟机下搞，唉。。。

26号部分日志：
--db alter_node2.log
Tue Aug 26 15:15:21 2014--时间在15:15分左右
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
ASMB (ospid: 27935): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 27935
Tue Aug 26 15:17:00 2014
Starting ORACLE instance (normal)

Fung920 发表于 2014-8-26 16:49:59

--crsd.log
2014-08-26 15:15:19.380: CS(0x2aaaac22a520)set Properties ( grid,0x2aaab40337d0) --时间在15:15分左右
2014-08-26 15:15:19.393: {2:12891:157} Container [ Name: FENCESERVER
   API_HDR_VER:
   TextMessage
   CLIENT:
   TextMessage[]
   CLIENT_NAME:
   TextMessage
   CLIENT_PID:
   TextMessage
   CLIENT_PRIMARY_GROUP:
   TextMessage
   LOCALE:
   TextMessage
] -- 开始在fence 节点2了？
2014-08-26 15:15:19.393: {2:12891:157} Sending message to AGFW. ctx= 0x2aaab0022430, Client PID: 27158
2014-08-26 15:15:19.393: {2:12891:157} Force-disconnecting existing PE clients...
2014-08-26 15:15:19.393: {2:12891:157} Sending message: 714 to AGFW proxy server.
2014-08-26 15:15:19.394: [ AGFW] {2:12891:157} Agfw Proxy Server received the message: UnKnown ID 20489:714
2014-08-26 15:15:19.394: [ CRSPE] {2:12891:157} Skipping Fence of : ora.CRS.dg
2014-08-26 15:15:19.394: [ CRSPE] {2:12891:157} Skipping Fence of : ora.DATA.dg
2014-08-26 15:15:19.394: [ AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:715 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.394: [ CRSPE] {2:12891:157} Skipping Fence of : ora.RECOVERY.dg
2014-08-26 15:15:19.394: [ CRSPE] {2:12891:157} Skipping Fence of : ora.asm
2014-08-26 15:15:19.394: [ CRSPE] {2:12891:157} Skipping Fence of : ora.dbrac.db
2014-08-26 15:15:19.394: [ AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:716 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [ AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:717 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [ AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:718 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.395: [ CRSPE] {2:12891:157} Skipping Fence of : ora.registry.acfs
2014-08-26 15:15:19.419: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.419: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [ AGFW] {2:12891:157} Fenced off the resource
2014-08-26 15:15:19.421: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.421: [ AGFW] {2:12891:157} Fenced off the resource
2014-08-26 15:15:19.508: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:715 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.514: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [ AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [ AGFW] {2:12891:157} Fenced off the resource
--ctssd.log
2014-08-26 15:15:18.799: [ CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
2014-08-26 15:15:18.799: [ CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
2014-08-26 15:15:18.799: [ CTSS]ctssslave_swm2_3: Received time sync message from master.
2014-08-26 15:15:18.799: [ CTSS]ctssslave_swm: sendtime{sec, usec}, receivetime{sec, usec}.
2014-08-26 15:15:18.799: [ CTSS]ctssslave_swm: The RTT of sync msg is too large for time sync to be accurate. Recommends retry. Returns .
2014-08-26 15:15:18.799: [ CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
2014-08-26 15:15:18.799: [ CTSS]ctsselect_msm: Failed in sync_with_master
2014-08-26 15:15:18.799: [ CTSS]ctsselect_msm: Sync interval returned in
2014-08-26 15:15:18.799: [ CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2014-08-26 15:15:21.539: gipchaDaemonProcessHAInvalidate: completed ha name invalidate for node 0x1d826f90 { host 'dbserver_node1', haName '26ce-13b4-ca6c-3882', srcLuid 1caa0a47-6941674c, dstLuid c2a91655-c788666c numInf 1, contigSeq 152, lastAck 152, lastValidAck 152, sendSeq , createTime 905619662, sentRegister 1, localMonitor 0, flags 0x28 }
2014-08-26 15:15:21.665: [ CTSS]Oracle Database CTSS Release 11.2.0.3.0 Production Copyright 2006, 2011 Oracle.All rights reserved.
2014-08-26 15:15:21.665: [ CTSS]ctss_scls_init: SCLs Context is 0x6d94d30
Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=dbserver_node2DBG_CTSSD))
2014-08-26 15:15:21.667: PID for the Process , connkey 11

还需要提供哪些信息呢？

Fung920 发表于 2014-8-26 17:03:28

最后确认心跳有丢包

parknkjun 发表于 2014-9-26 11:10:24

初步看是由于你的rac-scan无法解析导致的，时间是8月22日，但是你的css,crs日志只有8月26号当天的，没办法进一步判断。

Fung920 发表于 2014-10-17 10:19:20

parknkjun 发表于 2014-9-26 11:10
初步看是由于你的rac-scan无法解析导致的，时间是8月22日，但是你的css,crs日志只有8月26号当天的，没办法 ...

跟scan一点关系都没有哇，就是心跳中断

new_db 发表于 2014-11-6 17:23:06

哥们，最后咋解决的，分享下呗，

页: [1]

ORACLE SOS's Archiver

ora--00481引起RAC 脑裂，节点2反复重启【已解决，感谢B总】