ORACLE SOS

ORA-00481 causes RAC split-brain, node 2 restarts repeatedly [Solved, thanks to Mr. B]

Posted on 2014-8-26 12:19:02
Last edited by Fung920 on 2014-8-26 17:04

Environment: 11.2.0.3 + Linux 64-bit, two-node RAC
1. Problem description
Starting on August 22, node 2 reported the following for the first time:
Fri Aug 22 19:26:27 2014
PMON (ospid: 13912): terminating the instance due to error 481
Instance terminated by PMON, pid = 13912

Node 2 then began to restart, and the following appeared during the restart:
NOTE: client +ASM2:+ASM registered, osid 20296, mbr 0x0
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to communicate with CRSD/OHASD)
WARNING: failed to online diskgroup resource ora.RECOVERY.dg (unable to communicate with CRSD/OHASD)

The error then occurred again:
Fri Aug 22 19:41:42 2014
PMON (ospid: 20218): terminating the instance due to error 481
Instance terminated by PMON, pid = 20218

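After that the instance would not come up. Yesterday after 17:00 I manually started the cluster on node 2, but during today's customer health check we found that the cluster on node 2 had been restarting continuously. After looking at the LMON trace file, I initially suspected a DRM problem, with the split-brain caused by the clocks being out of sync. However, the cluster uses CTSS for time synchronization and the offset is 0, and when I checked the time on both nodes it was also consistent. How should this problem be solved?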

Additional information:
Relevant logs from the 22nd:
--alter_db2.log
Fri Aug 22 19:26:25 2014  --the first error occurred at 19:26 on the 22nd
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
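ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 35
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_26030.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 82 Serial number: 35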
ASMB (ospid: 26030): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 26030
Fri Aug 22 19:28:02 2014
Starting ORACLE instance (normal) --the cluster starts restarting automatically

--alternode2.log
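2014-08-22 19:26:16.065   --the time is 19:26 on the 22nd
[cssd(13669)]CRS-1612:Network communication with node dbserver_node1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.140 seconds
2014-08-22 19:26:23.079
[cssd(13669)]CRS-1611:Network communication with node dbserver_node1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 7.130 seconds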

--ocssd.log:
2014-08-22 19:26:16.065: [    CSSD][1113200960]clssnmPollingThread: node dbserver_node1 (1) at 50% heartbeat fatal, removal in 14.140 seconds --looks like the heartbeat is not getting through
2014-08-22 19:26:16.065: [    CSSD][1113200960]clssnmPollingThread: node dbserver_node1 (1) is impending reconfig, flag 2491406, misstime 15860
2014-08-22 19:26:16.065: [    CSSD][1113200960]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-08-22 19:26:16.066: [    CSSD][1106893120]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424336, LATS 576293252, lastSeqNo 25251022, uniqueness 1399539197, timestamp 1408706775/576374502
2014-08-22 19:26:17.005: [    CSSD][1114777920]clssnmSendingThread: sending status msg to all nodes
2014-08-22 19:26:17.005: [    CSSD][1114777920]clssnmSendingThread: sent 4 status msgs to all nodes
2014-08-22 19:26:17.068: [    CSSD][1106893120]clssnmvDHBValidateNCopy: node 1, dbserver_node1, has a disk HB, but no network HB, DHB has rcfg 295030317, wrtcnt, 27424342, LATS 576294252, lastSeqNo 27424336, uniqueness 1399539197, timestamp 1408706776/576375512 --the heartbeat has a problem; the interconnect communication is faulty
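
The numbers in the ocssd.log lines above can be cross-checked with a little arithmetic. The sketch below is only illustrative. It assumes that misstime is reported in milliseconds and that misstime plus the remaining removal countdown gives the effective network heartbeat timeout; the result, 30 seconds, agrees with the default CSS misscount on Linux in 11.2. The regular expressions and the function name implied_timeout_ms are made up for this example.

```python
import re

# Hypothetical patterns for the two clssnmPollingThread lines quoted above.
REMOVAL_RE = re.compile(r"at (\d+)% heartbeat fatal, removal in ([\d.]+) seconds")
MISSTIME_RE = re.compile(r"misstime (\d+)")

FATAL_LINE = ("clssnmPollingThread: node dbserver_node1 (1) at 50% heartbeat fatal, "
              "removal in 14.140 seconds")
RECONFIG_LINE = ("clssnmPollingThread: node dbserver_node1 (1) is impending reconfig, "
                 "flag 2491406, misstime 15860")


def implied_timeout_ms(fatal_line: str, reconfig_line: str) -> int:
    """Return misstime + remaining countdown, both in milliseconds.

    Assumption: misstime is in milliseconds (not confirmed by the log itself).
    """
    removal = REMOVAL_RE.search(fatal_line)
    miss = MISSTIME_RE.search(reconfig_line)
    if removal is None or miss is None:
        raise ValueError("lines do not match the expected ocssd.log format")
    remaining_ms = round(float(removal.group(2)) * 1000)
    return int(miss.group(1)) + remaining_ms


print(implied_timeout_ms(FATAL_LINE, RECONFIG_LINE))  # prints 30000
```

Under these assumptions, node 2 had already gone about 15.9 seconds without a network heartbeat from node 1 when the 50% warning was logged.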

The customer runs this RAC on VMware. I told them before not to do this on virtual machines, sigh...


Partial logs from the 26th:
--db alter_node2.log
Tue Aug 26 15:15:21 2014  --around 15:15
NOTE: ASMB terminating
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
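ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 4 Serial number: 5
Errors in file /opt/u01/app/oracle/diag/rdbms/dbrac/dbrac2/trace/dbrac2_asmb_27935.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 4 Serial number: 5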
ASMB (ospid: 27935): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 27935
Tue Aug 26 15:17:00 2014
Starting ORACLE instance (normal)
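
To see at a glance which background process terminated the instance and with which error, the termination lines quoted in this post can be pulled out of the alert log with a small script. This is a minimal sketch that assumes the line format shown in the excerpts above; the function name find_terminations and the regular expression are made up for illustration.

```python
import re

# Matches lines such as:
#   PMON (ospid: 13912): terminating the instance due to error 481
TERMINATION_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)$"
)

# Lines copied from the alert log excerpts in this post.
SAMPLE = [
    "PMON (ospid: 13912): terminating the instance due to error 481",
    "Instance terminated by PMON, pid = 13912",
    "ASMB (ospid: 26030): terminating the instance due to error 15064",
    "ASMB (ospid: 27935): terminating the instance due to error 15064",
]


def find_terminations(lines):
    """Return (process, ospid, error) for every termination line found."""
    events = []
    for line in lines:
        match = TERMINATION_RE.match(line.strip())
        if match:
            events.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return events


for process, ospid, error in find_terminations(SAMPLE):
    print(f"{process} (ospid {ospid}) terminated the instance with error {error}")
```

On a real alert log the same function could be fed the open file object, since it only iterates over lines.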


Original poster | Posted on 2014-8-26 16:49:59
--crsd.log
2014-08-26 15:15:19.380: [UiServer][1179924800] CS(0x2aaaac22a520)set Properties ( grid,0x2aaab40337d0) --around 15:15
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Container [ Name: FENCESERVER
        API_HDR_VER:
        TextMessage[2]
        CLIENT:
        TextMessage[]
        CLIENT_NAME:
        TextMessage[ocssd.bin]
        CLIENT_PID:
        TextMessage[27158]
        CLIENT_PRIMARY_GROUP:
        TextMessage[oinstall]
        LOCALE:
        TextMessage[SIMPLIFIED CHINESE_CHINA.AL32UTF8]
] --is it starting to fence node 2 here?
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Sending message to AGFW. ctx= 0x2aaab0022430, Client PID: 27158
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Force-disconnecting [0]  existing PE clients...
2014-08-26 15:15:19.393: [UiServer][1177823552] {2:12891:157} Sending message: 714 to AGFW proxy server.
2014-08-26 15:15:19.394: [    AGFW][1165216064] {2:12891:157} Agfw Proxy Server received the message: UnKnown[Proxy] ID 20489:714
2014-08-26 15:15:19.394: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.CRS.dg
2014-08-26 15:15:19.394: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.DATA.dg
2014-08-26 15:15:19.394: [    AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.LISTENER.lsnr dbserver_node2 1] ID 4100:715 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.394: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.RECOVERY.dg
2014-08-26 15:15:19.394: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.asm
2014-08-26 15:15:19.394: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.dbrac.db
2014-08-26 15:15:19.394: [    AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [    AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 to the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.395: [    AGFW][1165216064] {2:12891:157} Agfw Proxy Server sending message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 to the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:19.395: [   CRSPE][1165216064] {2:12891:157} Skipping Fence of : ora.registry.acfs
2014-08-26 15:15:19.419: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.419: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.net1.network dbserver_node2 1] ID 4100:717 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.420: [    AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.net1.network]
2014-08-26 15:15:19.421: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.dbserver_node2.vip 1 1] ID 4100:716 from the agent /opt/u01/app/11gr2/grid/bin/orarootagent_root
2014-08-26 15:15:19.421: [    AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.dbserver_node2.vip]
2014-08-26 15:15:19.508: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.LISTENER.lsnr dbserver_node2 1] ID 4100:715 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.514: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [    AGFW][1165216064] {2:12891:157} Received the reply to the message: RESOURCE_CLEAN[ora.ons dbserver_node2 1] ID 4100:718 from the agent /opt/u01/app/11gr2/grid/bin/oraagent_grid
2014-08-26 15:15:20.623: [    AGFW][1165216064] {2:12891:157} Fenced off the resource [ora.ons]
--ctssd.log
2014-08-26 15:15:18.799: [    CTSS][1107577152]ctsscomm_recv_cb2: Receive incoming message event. Msgtype [2].
2014-08-26 15:15:18.799: [    CTSS][1107577152]ctssslave_msg_handler4_1: Waiting for slave_sync_with_master to finish sync process. sync_state[3].
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctssslave_swm2_3: Received time sync message from master.
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctssslave_swm: sendtime{sec[1409037290], usec[789768]}, receivetime{sec[1409037318], usec[799071]}.
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctssslave_swm: The RTT of sync msg [28009303] is too large for time sync to be accurate. Recommends retry. Returns [17].
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctssslave_swm: Received from master (mode [0xcc] nodenum [1] hostname [dbserver_node1] )
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctsselect_msm: Failed in sync_with_master [17]
2014-08-26 15:15:18.799: [    CTSS][1111779648]ctsselect_msm: Sync interval returned in [1]
2014-08-26 15:15:18.799: [    CTSS][1107577152]ctssslave_msg_handler4_3: slave_sync_with_master finished sync process. Exiting clsctssslave_msg_handler
2014-08-26 15:15:21.539: [GIPCHDEM][1103374656] gipchaDaemonProcessHAInvalidate: completed ha name invalidate for node 0x1d826f90 { host 'dbserver_node1', haName '26ce-13b4-ca6c-3882', srcLuid 1caa0a47-6941674c, dstLuid c2a91655-c788666c numInf 1, contigSeq 152, lastAck 152, lastValidAck 152, sendSeq [152 : 152], createTime 905619662, sentRegister 1, localMonitor 0, flags 0x28 }
2014-08-26 15:15:21.665: [    CTSS][3027770656]Oracle Database CTSS Release 11.2.0.3.0 Production Copyright 2006, 2011 Oracle.  All rights reserved.
2014-08-26 15:15:21.665: [    CTSS][3027770656]ctss_scls_init: SCLs Context is 0x6d94d30
[  clsdmt][1098041664]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=dbserver_node2DBG_CTSSD))
2014-08-26 15:15:21.667: [  clsdmt][1098041664]PID for the Process [30476], connkey 11
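
The round-trip time that ctssd complains about can be reproduced from the sendtime and receivetime fields in the same log line. The sketch below assumes the bracketed RTT is simply receivetime minus sendtime, expressed in microseconds; that assumption matches the logged value, but how ctssd actually computes it is not confirmed here. The function name rtt_usec is made up for this example.

```python
def rtt_usec(send_sec: int, send_usec: int, recv_sec: int, recv_usec: int) -> int:
    """Difference between two (seconds, microseconds) timestamps, in microseconds."""
    return (recv_sec - send_sec) * 1_000_000 + (recv_usec - send_usec)


# Values copied from the ctssslave_swm line in ctssd.log above.
rtt = rtt_usec(1409037290, 789768, 1409037318, 799071)

print(rtt)              # prints 28009303, the value logged as "RTT of sync msg"
print(rtt / 1_000_000)  # roughly 28 seconds
```

A sync message that takes about 28 seconds to arrive is in the same range as the heartbeat timeouts reported by cssd, which fits an interconnect stall rather than a clock problem.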

What other information do I need to provide?

Original poster | Posted on 2014-8-26 17:03:28
In the end we confirmed that the heartbeat network is dropping packets.

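Posted on 2014-9-26 11:10:24
At first glance this looks like it was caused by your rac-scan failing to resolve. The incident happened on August 22, but your CSS and CRS logs only cover August 26, so I can't analyze it any further.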

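Original poster | Posted on 2014-10-17 10:19:20
parknkjun wrote on 2014-9-26 11:10:
At first glance this looks like it was caused by your rac-scan failing to resolve. The incident happened on August 22, but your CSS and CRS logs only cover August 26, so I can't ...

It has nothing to do with SCAN at all; the heartbeat was simply interrupted.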

Posted on 2014-11-6 17:23:06
Buddy, how did you solve it in the end? Please share.
