gotoywh posted on 2015-3-12 11:20:14

My RAC node 2 went down and will not start

Node 1 is healthy:
$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rac1
ora.FRA.dg
               ONLINE  ONLINE       rac1
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1
ora.asm
               ONLINE  ONLINE       rac1                     Started
ora.gsd
               OFFLINE OFFLINE      rac1
ora.net1.network
               ONLINE  ONLINE       rac1
ora.ons
               ONLINE  ONLINE       rac1
ora.registry.acfs
               ONLINE  ONLINE       rac1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac1
ora.cvu
      1        ONLINE  ONLINE       rac1
ora.itsm.db
      1        ONLINE  ONLINE       rac1                     Open
      2        ONLINE  OFFLINE
ora.oc4j
      1        ONLINE  ONLINE       rac1
ora.rac1.vip
      1        ONLINE  ONLINE       rac1
ora.rac2.vip
      1        ONLINE  INTERMEDIATE rac1                     FAILED OVER
ora.scan1.vip
      1        ONLINE  ONLINE       rac1
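
In a listing this long, the rows that matter are the ones where TARGET and STATE disagree. The following is a minimal Python sketch for picking them out of saved `crsctl stat res -t` output. It assumes the whitespace-separated layout shown above and that resource names start with `ora.`; the function name `mismatched_resources` and the `SAMPLE` text (a shortened copy of the listing) are illustrative only.

```python
def mismatched_resources(text):
    """Return (resource, instance, target, state) for rows where TARGET != STATE."""
    rows = []
    resource = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("-"):
            continue  # blank line or dashed separator
        if not line[0].isspace():
            # Unindented lines are the column header, a section title, or a
            # resource name. Assumption: resource names start with "ora.".
            resource = stripped if stripped.startswith("ora.") else None
            continue
        if resource is None:
            continue
        tokens = stripped.split()
        instance = None
        if tokens[0].isdigit():
            # Cluster resources carry an instance number before TARGET.
            instance = int(tokens[0])
            tokens = tokens[1:]
        if len(tokens) < 2:
            continue
        target, state = tokens[0], tokens[1]
        if target != state:
            rows.append((resource, instance, target, state))
    return rows


# Shortened copy of the listing above, used only as sample input.
SAMPLE = """\
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
ora.gsd
               OFFLINE OFFLINE      rac1
ora.itsm.db
      1        ONLINE  ONLINE       rac1                     Open
      2        ONLINE  OFFLINE
ora.rac2.vip
      1        ONLINE  INTERMEDIATE rac1                     FAILED OVER
"""

for row in mismatched_resources(SAMPLE):
    print(row)
```

On this sample it reports instance 2 of ora.itsm.db (ONLINE target, OFFLINE state) and ora.rac2.vip (ONLINE target, INTERMEDIATE state), and skips ora.gsd because OFFLINE/OFFLINE is consistent.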



Node 2 is where the problem is:
# crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.

I tried to start it:
# crsctl start cluster -all
CRS-4404: The following nodes did not reply within the allotted time:
rac1, rac2
CRS-2672: Attempting to start 'ora.cssd' on 'rac2'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac2'
CRS-2676: Start of 'ora.diskmon' on 'rac2' succeeded
CRS-4705: Start of Clusterware failed on node rac2.
CRS-4000: Command Start failed, or completed with errors.


I checked the OCSSD log; the nodes can ping each other:
2015-02-22 18:41:03.029: [    CSSD]clssgmClientConnectMsg: Connect from con(0x28b9) proc(0x2b651d0) pid(3845) version 11:2:1:4, properties: 1,2,3,4,5
2015-02-22 18:41:03.029: [    CSSD]clssgmClientConnectMsg: msg flags 0x0000
2015-02-22 18:41:03.031: [    CSSD]clssscSelect: cookie accept request 0x2b651d0
2015-02-22 18:41:03.031: [    CSSD]clssscevtypSHRCON: getting client with cmproc 0x2b651d0
2015-02-22 18:41:03.031: [    CSSD]clssgmRegisterClient: proc(4/0x2b651d0), client(1/0x2b50a80)
2015-02-22 18:41:03.031: [    CSSD]clssgmJoinGrock: global grock CRF- new client 0x2b50a80 with con 0x7f1e000028e8, requested num -1, flags 0x4000e00
2015-02-22 18:41:03.031: [    CSSD]clssgmJoinGrock: ignoring grock join for client not requiring fencing until group information has been received from the master; group name CRF-, member number -1, flags 0x4000e00
2015-02-22 18:41:03.032: [    CSSD]clssgmDiscEndpcl: gipcDestroy 0x28e8
2015-02-22 18:41:03.509: [    CSSD]clssgmWaitOnEventValue: after CmInfo Stateval 3, eval 1 waited 0
2015-02-22 18:41:03.709: [    CSSD]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 320163447, wrtcnt, 109095, LATS 41143754, lastSeqNo 109094, uniqueness 1424715417, timestamp 1426129157/97505444
2015-02-22 18:41:04.510: [    CSSD]clssgmWaitOnEventValue: after CmInfo Stateval 3, eval 1 waited 0
2015-02-22 18:41:04.712: [    CSSD]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 320163447, wrtcnt, 109096, LATS 41144754, lastSeqNo 109095, uniqueness 1424715417, timestamp 1426129158/97506444
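
The repeating `clssnmvDHBValidateNcopy` line says that CSSD on this node sees rac1's heartbeat on the voting disk but is not receiving rac1's heartbeat over the network. A small Python sketch for counting those lines per node in a saved ocssd.log could look like this. The function name `nodes_missing_network_hb` is made up for illustration, and the regular expression assumes the exact message wording shown in the excerpt above; `SAMPLE_LOG` is copied from that excerpt.

```python
import re

# Matches the CSSD message seen in the excerpt; the wording is assumed to be
# exactly as logged above.
PATTERN = re.compile(
    r"clssnmvDHBValidateNcopy: node (\d+), (\S+), has a disk HB, but no network HB"
)


def nodes_missing_network_hb(log_text):
    """Count matching lines per (node number, node name)."""
    counts = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            key = (int(match.group(1)), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts


# Three lines taken from the log excerpt above.
SAMPLE_LOG = """\
2015-02-22 18:41:03.509: [    CSSD]clssgmWaitOnEventValue: after CmInfo Stateval 3, eval 1 waited 0
2015-02-22 18:41:03.709: [    CSSD]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 320163447, wrtcnt, 109095, LATS 41143754, lastSeqNo 109094, uniqueness 1424715417, timestamp 1426129157/97505444
2015-02-22 18:41:04.712: [    CSSD]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 320163447, wrtcnt, 109096, LATS 41144754, lastSeqNo 109095, uniqueness 1424715417, timestamp 1426129158/97506444
"""

for (number, name), count in nodes_missing_network_hb(SAMPLE_LOG).items():
    print(f"node {number} ({name}): {count} line(s) with disk HB but no network HB")
```

On the three sample lines this counts two occurrences for node 1 (rac1). A count that keeps growing would suggest checking the private interconnect rather than the shared storage, though that reading is a rule of thumb and not something the log states by itself.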

Then I saw that the database was not running, so I tried to start it:
$ sqlplus / as sysdba


SQL*Plus: Release 11.2.0.4.0 Production on Sun Feb 22 18:42:59 2015


Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to an idle instance.


SQL> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA/itsm/spfileitsm.ora'
ORA-17503: ksfdopn:2 Failed to open file +DATA/itsm/spfileitsm.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
SQL>

I also tried to start the ASM instance on node 2. The database and ASM instance on node 1 are both fine:


$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.4.0 Production on Sun Feb 22 18:44:04 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
ORA-29701: unable to connect to Cluster Synchronization Service
I cannot work out the cause. Any advice would be appreciated.


gotoywh posted on 2015-3-12 11:47:00

The logs are attached. Nodes 1 and 2 can ping and SSH to each other without problems.

gotoywh posted on 2015-3-12 11:49:18

No attachment?

xifenfei posted on 2015-3-12 12:04:06

2015-02-22 07:31:49.082: CS(0x7f1c5c064560)set Properties ( root,0x361e410)
2015-02-22 07:31:49.094: {2:7263:43} Sending message to PE. ctx= 0x7f1c5c07c890, Client PID: 7246
2015-02-22 07:31:49.094: {2:7263:43} Master is not known. Rejecting the command: 13
2015-02-22 07:31:49.470: gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'rac2', port 'bb5b-e793-981d-7626', hctx 0x2d85bf0 { gipchaContext : host 'rac2', name '84bf-806f-7ec4-8d29', luid '42b44c60-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2015-02-22 07:31:49.471: gipchaResolveF : EXCEPTION[ ret gipcretKeyNotFound (36) ]failed to resolve ctx 0x2d85bf0 { gipchaContext : host 'rac2', name '84bf-806f-7ec4-8d29', luid '42b44c60-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'rac2', port 'bb5b-e793-981d-7626', flags 0x0
2015-02-22 07:31:49.473: gipchaInternalResolve: failed to resolve ret gipcretKeyNotFound (36), host 'rac2', port '0da4-309d-2db4-bf1b', hctx 0x2d85bf0 { gipchaContext : host 'rac2', name '84bf-806f-7ec4-8d29', luid '42b44c60-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, ret gipcretKeyNotFound (36)
2015-02-22 07:31:49.473: gipchaResolveF : EXCEPTION[ ret gipcretKeyNotFound (36) ]failed to resolve ctx 0x2d85bf0 { gipchaContext : host 'rac2', name '84bf-806f-7ec4-8d29', luid '42b44c60-00000000', numNode 1, numInf 1, usrFlags 0x0, flags 0x5 }, host 'rac2', port '0da4-309d-2db4-bf1b', flags 0x0
2015-02-22 07:31:49.668: CS(0x7f1c5c03e110)set Properties ( grid,0x34f5050)
2015-02-22 07:31:49.680: {2:7263:44} Sending message to PE. ctx= 0x7f1c5c03fdd0, Client PID: 3652
2015-02-22 07:31:49.680: {2:7263:44} Master is not known. Rejecting the command: 14
2015-02-22 07:31:49.795: [   CRSPE]{2:7263:2} Join request has been processed by the Master.

Try pinging rac2, and also post your hosts file.

gotoywh posted on 2015-3-12 12:08:53

Below is the output from rac2. rac1 can also ping everything, and SSH works fine too.
$ ping rac2
PING rac2.localdomain (192.168.0.106) 56(84) bytes of data.
64 bytes from rac2.localdomain (192.168.0.106): icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from rac2.localdomain (192.168.0.106): icmp_seq=2 ttl=64 time=0.037 ms
^C
--- rac2.localdomain ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.037/0.037/0.037/0.000 ms
$ ping rac2-priv
PING rac2-priv.localdomain (192.168.1.106) 56(84) bytes of data.
64 bytes from rac2-priv.localdomain (192.168.1.106): icmp_seq=1 ttl=64 time=0.039 ms
^C
--- rac2-priv.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.039/0.039/0.039/0.000 ms
$ ping rac2-vip
PING rac2-vip.localdomain (192.168.0.110) 56(84) bytes of data.
64 bytes from rac2-vip.localdomain (192.168.0.110): icmp_seq=1 ttl=64 time=5.34 ms
^C
--- rac2-vip.localdomain ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.342/5.342/5.342/0.000 ms
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1                localhost.localdomain localhost
::1                localhost6.localdomain6 localhost6
192.168.0.105   rac1.localdomain      rac1
192.168.0.106   rac2.localdomain      rac2
# Private
192.168.1.105   rac1-priv.localdomain   rac1-priv
192.168.1.106   rac2-priv.localdomain   rac2-priv
# Virtual
192.168.0.109   rac1-vip.localdomain    rac1-vip
192.168.0.110   rac2-vip.localdomain    rac2-vip
# SCAN
192.168.0.11   scan.localdomain      scan
$
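
Since the same hosts file has to be correct on both nodes, one way to check it is to parse a saved copy and confirm that every cluster name has an entry. The Python sketch below does this under stated assumptions: `parse_hosts`, `missing_names`, and the `EXPECTED` list are illustrative names, and the expected hostnames are simply the ones that appear in the file above.

```python
import ipaddress


def parse_hosts(text):
    """Map every hostname and alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        # ip_address() raises ValueError if the first field is not a valid IP.
        address = str(ipaddress.ip_address(fields[0]))
        for name in fields[1:]:
            mapping.setdefault(name, address)  # keep the first entry per name
    return mapping


def missing_names(mapping, expected):
    """Return the expected names that have no entry in the mapping."""
    return [name for name in expected if name not in mapping]


# Copy of the hosts file posted above.
SAMPLE_HOSTS = """\
127.0.0.1                localhost.localdomain localhost
::1                localhost6.localdomain6 localhost6
192.168.0.105   rac1.localdomain      rac1
192.168.0.106   rac2.localdomain      rac2
# Private
192.168.1.105   rac1-priv.localdomain   rac1-priv
192.168.1.106   rac2-priv.localdomain   rac2-priv
# Virtual
192.168.0.109   rac1-vip.localdomain    rac1-vip
192.168.0.110   rac2-vip.localdomain    rac2-vip
# SCAN
192.168.0.11   scan.localdomain      scan
"""

# Assumption: these are the names the cluster needs, taken from the file above.
EXPECTED = ["rac1", "rac2", "rac1-priv", "rac2-priv", "rac1-vip", "rac2-vip", "scan"]

hosts = parse_hosts(SAMPLE_HOSTS)
print("missing:", missing_names(hosts, EXPECTED))
```

Running the same check against the hosts file copied from each node and comparing the two mappings would show whether the nodes disagree about any name. It only validates the file contents; it does not test whether the addresses are actually reachable.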
