Fung920 发表于 2014-8-26 12:19:02

ora--00481引起RAC 脑裂,节点2反复重启【已解决,感谢B总】

本帖最后由 Fung920 于 2014-8-26 17:04 编辑

环境:11.2.0.3 + Linux 64bit 双节点 RAC
1.状况描述
在8月22号开始,节点2首次出现
Fri Aug 22 19:26:27 2014
PMON (ospid: 13912): terminating the instance due to error 481
Instance terminated by PMON, pid = 13912

节点2开始重启,重启过程中出现
NOTE: client +ASM2:+ASM registered, osid 20296, mbr 0x0
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.RECOVERY.dg (unable to communicate with CRSD/OHASD)

再次报错
Fri Aug 22 19:41:42 2014
PMON (ospid: 20218): terminating the instance due to error 481
Instance terminated by PMON, pid = 20218

之后便起不来,昨天下午17点后手动启动节点2 Cluster,但今天客户巡检的时候发现节点2 Cluster一直在重启,查看lmon的trace文件,初步怀疑是DRM问题,导致脑裂的原因是时间的不同步,但是集群是ctss同步时间,offset也是0,同时检测两个节点的时间,发现时间也是一致的,该如何解决这个问题?

补充:
22号相关日志:
--alter_db2.log
Fri Aug 22 19:26:25 2014--第一次错误发生时间为22号19:26
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 82 ???: 35
ASMB (ospid: 26030): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 26030
Fri Aug 22 19:28:02 2014
Starting ORACLE instance (normal) --开始自动重启Cluster

--alternode2.log
2014-08-22 19:26:16.065   --时间为22号19:26
CRS-1612:50% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 14.140 秒后从集群中删除此节点
2014-08-22 19:26:23.079
CRS-1611:75% 的超时时间间隔内缺少与节点 dbserver_node1 (1) 的网络通信。将在 7.130 秒后从集群中删除此节点

--ocssd.log:
2014-08-22 19:26:16.065: [    CSSD]clssnmPollingThread: node dbserver_node1 (1) at 50% heartbeat fatal, removal in 14.140 seconds --貌似是Heartbeat不通
2014-08-22 19:26:16.065: [    CSSD]clssnmPollingThread: node dbserver_node1 (1) is impending reconfig, flag 2491406, misstime 15860
2014-08-22 19:26:16.065: [    CSSD]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-08-22 19:26:16.066: [    CSSD]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424336, LATS 576293252, lastSeqNo 25251022, uniqueness 1399539197, timestamp 1408706775/576374502
2014-08-22 19:26:17.005: [    CSSD]clssnmSendingThread: sending status msg to all nodes
2014-08-22 19:26:17.005: [    CSSD]clssnmSendingThread: sent 4 status msgs to all nodes
2014-08-22 19:26:17.068: [    CSSD]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424342, LATS 576294252, lastSeqNo 27424336, uniqueness 1399539197, timestamp 1408706776/576375512 --心跳出问题,内部通信有问题

客户使用的是VMWARE下部署的RAC,之前就跟他们说别在虚拟机下搞,唉。。。


26号部分日志:
--db alter_node2.log
Tue Aug 26 15:15:21 2014--时间在15:15分左右
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: ? ASM ??????
ORA-03113: ?????????
?? ID:
?? ID: 4 ???: 5
ASMB (ospid: 27935): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 27935
Tue Aug 26 15:17:00 2014
Starting ORACLE instance (normal)


Fung920 发表于 2014-8-26 16:49:59

--crsd.log
2014-08-26 15:15:19.380: CS(0x2aaaac22a520)set Properties ( grid,0x2aaab40337d0) --时间在15:15分左右
2014-08-26 15:15:19.393: {2:12891:157} Container [ Name: FENCESERVER
      API_HDR_VER:
      TextMessage
      CLIENT:
      TextMessage[]
      CLIENT_NAME:
      TextMessage
      CLIENT_PID:
      TextMessage
      CLIENT_PRIMARY_GROUP:
      TextMessage
      LOCALE:
      TextMessage
] -- 开始在fence 节点2了?
2014-08-26 15:15:19.393: {2:12891:157} Sending message to AGFW. ctx= 0x2aaab0022430, Client PID: 27158
2014-08-26 15:15:19.393: {2:12891:157} Force-disconnecting existing PE clients...
2014-08-26 15:15:19.393: {2:12891:157} Sending message: 714 to AGFW proxy server.
2014-08-26 15:15:19.394: [    AGFW] {2:12891:157} Agfw Proxy Server received the message: UnKnown ID 20489:714
2014-08-26 15:15:19.394: [   CRSPE] {2:12891:157} Skipping Fence of : ora.CRS.dg
2014-08-26 15:15:19.394: [   CRSPE] {2:12891:157} Skipping Fence of : ora.DATA.dg
2014-08-26 15:15:19.394: [    AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:715 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.394: [   CRSPE] {2:12891:157} Skipping Fence of : ora.RECOVERY.dg
2014-08-26 15:15:19.394: [   CRSPE] {2:12891:157} Skipping Fence of : ora.asm
2014-08-26 15:15:19.394: [   CRSPE] {2:12891:157} Skipping Fence of : ora.dbrac.db
2014-08-26 15:15:19.394: [    AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:716 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [    AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:717 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [    AGFW] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN ID 4100:718 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.395: [   CRSPE] {2:12891:157} Skipping Fence of : ora.registry.acfs
2014-08-26 15:15:19.419: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.419: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [    AGFW] {2:12891:157} Fenced off the resource
2014-08-26 15:15:19.421: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.421: [    AGFW] {2:12891:157} Fenced off the resource
2014-08-26 15:15:19.508: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:715 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.514: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [    AGFW] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [    AGFW] {2:12891:157} Fenced off the resource
--ctssd.log
2014-08-26 15:15:18.799: [    CTSS]ctsscomm_recv_cb2: Receive incoming message event. Msgtype .
2014-08-26 15:15:18.799: [    CTSS]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state.
2014-08-26 15:15:18.799: [    CTSS]ctssslave_swm2_3: Received time sync message from master.
2014-08-26 15:15:18.799: [    CTSS]ctssslave_swm: sendtime{sec, usec}, receivetime{sec, usec}.
2014-08-26 15:15:18.799: [    CTSS]ctssslave_swm: The RTT of sync msg is too large for time sync to be accurate. Recommends retry. Returns .
2014-08-26 15:15:18.799: [    CTSS]ctssslave_swm: Received from master (mode nodenum hostname )
2014-08-26 15:15:18.799: [    CTSS]ctsselect_msm: Failed in sync_with_master
2014-08-26 15:15:18.799: [    CTSS]ctsselect_msm: Sync interval returned in
2014-08-26 15:15:18.799: [    CTSS]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2014-08-26 15:15:21.539: gipchaDaemonProcessHAInvalidate: completed ha name invalidate for node 0x1d826f90 { host 'dbserver_node1', haName '26ce-13b4-ca6c-3882', srcLuid 1caa0a47-6941674c, dstLuid c2a91655-c788666c numInf 1, contigSeq 152, lastAck 152, lastValidAck 152, sendSeq , createTime 905619662, sentRegister 1, localMonitor 0, flags 0x28 }
2014-08-26 15:15:21.665: [    CTSS]Oracle Database CTSS Release 11.2.0.3.0 Production Copyright 2006, 2011 Oracle.All rights reserved.
2014-08-26 15:15:21.665: [    CTSS]ctss_scls_init: SCLs Context is 0x6d94d30
Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=dbserver_node2DBG_CTSSD))
2014-08-26 15:15:21.667: PID for the Process , connkey 11

还需要提供哪些信息呢?

Fung920 发表于 2014-8-26 17:03:28

最后确认心跳有丢包

parknkjun 发表于 2014-9-26 11:10:24

初步看是由于你的rac-scan无法解析导致的,时间是8月22日,但是你的css,crs日志只有8月26号当天的,没办法进一步判断。

Fung920 发表于 2014-10-17 10:19:20

parknkjun 发表于 2014-9-26 11:10
初步看是由于你的rac-scan无法解析导致的,时间是8月22日,但是你的css,crs日志只有8月26号当天的,没办法 ...

跟scan一点关系都没有哇,就是心跳中断

new_db 发表于 2014-11-6 17:23:06

哥们,最后咋解决的,分享下呗,
页: [1]
查看完整版本: ora--00481引起RAC 脑裂,节点2反复重启【已解决,感谢B总】