ORACLE SOS-故障诊断-ORACLE运维管理-tnsping时vip能通，public ip报错无监听

travel.liu 发表于 2014-3-16 16:01:04

root 46727224 1 0 Mar 14 -0:00 /oracle/app/crs/bin/crsd.bin reboot
crs的时间不对，是3月14号，出现问题了。检查日志

mos : Listener endpoint for public host is removed for a brief period when crsd.bin crash (Doc ID 1605983.1)

xifenfei 发表于 2014-3-16 16:01:44

2014-03-14 07:08:02.074: {0:7:2} (:CLSN00010:)ons is running ...
2014-03-14 07:08:02.075: TM is changing desired thread # to 8. Current # is 7
2014-03-14 07:08:02.076: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.076: {0:7:2} execCmd ret = 0
2014-03-14 07:08:02.628: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.628: {0:7:2} OnsAgent::check }
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.2.0 - Production on 14-MAR-2014 07:08:02
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)Copyright (c) 1991, 2010, Oracle.All rights reserved.
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.645: {0:7:2} (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2014-03-14 07:08:02.693: {0:7:2} (:CLSN00010:)Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
2014-03-14 07:08:02.693: {0:7:2} (:CLSN00008:) category: -2, operation: fork, loc: spawnproc28, OS error: 12, other: forked failed [-1]
2014-03-14 07:07:59.706: [ COMMCRS]clsc_thrd_spawn: (1123b1a18) waited 60000 milliseconds for thread to become active
2014-03-14 07:08:02.716: [ AGFW] {0:7:2} Created alert : (:CRSAGF00113:) :Aborting the command: check for resource: ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1
2014-03-14 07:08:02.716: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.783: {0:7:2} clsn_agent::abort {
2014-03-14 07:08:02.783: {0:7:2} abort {
2014-03-14 07:08:02.783: {0:7:2} abort command: check
2014-03-14 07:08:02.784: {0:7:2} tryActionLock {
2014-03-14 07:08:02.883: {0:7:2} (:CLSN00009:)Utils:execCmd abort request: killing /oracle/app/crs/bin/lsnrctl
2014-03-14 07:08:02.914: [ AGFW] {1:18767:989} Agent received the message: AGENT_HB ID 12293:2510237
2014-03-14 07:08:03.115: [ AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.115: [ AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.279: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5013: Agent "/oracle/app/crs/bin/oraagent.bin" failed to start process "/oracle/app/crs/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/oracle/app/crs/log/gdgz-ps-esb-db01-p780/agent/crsd/oraagent_grid/oraagent_grid.log"

2014-03-14 07:08:03.280: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5014: Agent "/oracle/app/crs/bin/oraagent.bin" timed out starting process "/oracle/app/crs/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/oracle/app/crs/log/gdgz-ps-esb-db01-p780/agent/crsd/oraagent_grid/oraagent_grid.log"

2014-03-14 07:08:03.316: [ AGFW] {0:7:2} Created alert : (:CRSAGF00113:) :Aborting the command: check for resource: ora.asm gdgz-ps-esb-db01-p780 1
2014-03-14 07:08:03.316: {0:7:2} clsn_agent::abort {
2014-03-14 07:08:03.453: {0:7:2} abort {
2014-03-14 07:08:03.453: {0:7:2} abort command: check
2014-03-14 07:08:03.488: {0:7:2} tryActionLock {
2014-03-14 07:08:03.537: [ AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.537: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.LISTENER.lsnr check" encountered the following error:
(:CLSN00009:)Utils:execCmd aborted

2014-03-14 07:08:03.622: {0:7:2} clsn_agent::check: Exception SclsProcessSpawnException
2014-03-14 07:08:03.693: [ AGFW] {0:7:2} ora.LISTENER_SCAN1.lsnr 1 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:03.693: [ AGFW] {0:7:2} ora.LISTENER_SCAN1.lsnr 1 1 would be continued to monitored!
2014-03-14 07:08:03.724: [ AGFW] {0:7:38} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:03.724: [ AGFW] {0:7:38} Agent sending message to PE: RESOURCE_STATUS ID 20481:681928
2014-03-14 07:08:05.844: {0:7:2} clsnUtils::setResAttrib attribute GEN_USR_ORA_INST_NAME value +ASM1
2014-03-14 07:08:06.784: {0:7:2} got lock
2014-03-14 07:08:06.784: {0:7:2} tryActionLock }
2014-03-14 07:08:06.784: {0:7:2} abort}
2014-03-14 07:08:06.788: {0:7:2} clsn_agent::abort }
2014-03-14 07:08:06.788: [ AGFW] {0:7:2} Command: check for resource: ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 completed with status: TIMEDOUT
2014-03-14 07:08:06.789: [ AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:06.789: [ AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 would be continued to monitored!
2014-03-14 07:08:06.808: [ AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state details has changed from:to: CHECK TIMED OUT
2014-03-14 07:08:06.808: [ AGFW] {0:7:39} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:06.808: [ AGFW] {0:7:39} Agent sending message to PE: RESOURCE_STATUS ID 20481:681929
2014-03-14 07:08:08.114: {0:7:2} setResAttrib: attr GEN_USR_ORA_INST_NAME@SERVERNAME(gdgz-ps-esb-db01-p780) clsagfw attribVal +ASM1 clscrs tmpAttrValue +ASM1
2014-03-14 07:08:09.138: {0:7:2} AsmProxyAgent::check clsagfw_res_status 0
2014-03-14 07:08:11.488: {0:7:2} got lock
2014-03-14 07:08:11.488: {0:7:2} tryActionLock }
2014-03-14 07:08:11.488: {0:7:2} abort}
2014-03-14 07:08:11.499: {0:7:2} clsn_agent::abort }
2014-03-14 07:08:11.499: [ AGFW] {0:7:2} Command: check for resource: ora.asm gdgz-ps-esb-db01-p780 1 completed with status: TIMEDOUT
2014-03-14 07:08:11.499: [ AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:11.499: [ AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 would be continued to monitored!
2014-03-14 07:08:11.521: [ AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 state details has changed from: Started to: CHECK TIMED OUT
2014-03-14 07:08:11.521: [ AGFW] {0:7:40} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:11.521: [ AGFW] {0:7:40} Agent sending message to PE: RESOURCE_STATUS ID 20481:681930
2014-03-14 07:08:12.362: [ USRTHRD] ClusterSubscriber::SubscriberWorker::InternalClusterSubscriber::handleEventCBexecuting for reason 1

2014-03-14 07:08:27.084: {0:7:2} (:CLSN00010:)The command completed successfully
2014-03-14 07:08:27.097: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:27.098: {0:7:2} execCmd ret = 0
2014-03-14 07:08:27.098: {0:7:2} return val (clsagfwret) = 0
2014-03-14 07:08:27.098: {0:7:2} LsnrAgent::check }
2014-03-14 07:08:27.108: [ AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state changed from: UNKNOWN to: ONLINE
2014-03-14 07:08:27.108: [ AGFW] {0:7:41} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:27.108: [ AGFW] {0:7:41} Agent sending message to PE: RESOURCE_STATUS ID 20481:681944
2014-03-14 07:08:27.196: {0:7:2} clsnUtils::setResAttrib attribute GEN_USR_ORA_INST_NAME value +ASM1
2014-03-14 07:08:28.381: {0:7:40} ConnectionPool::releaseConnection InstConnection 11cd75f0
2014-03-14 07:08:28.382: {0:7:40} DgpAgent::runCheck:clsagfw_res_status 0
2014-03-14 07:08:28.382: [ AGFW] {0:7:40} Agent sending last reply for: RESOURCE_PROBE ID 4097:2510260
2014-03-14 07:08:32.660: {0:7:2} InstConnection:connectInt:is up
2014-03-14 07:08:33.541: {0:7:2} ConnectionPool::releaseConnection InstConnection 11ccce10
2014-03-14 07:08:33.541: {0:7:2} DgpAgent::runCheck:clsagfw_res_status 0
2014-03-14 07:08:33.625: TM is changing desired thread # to 7. Current # is 8
2014-03-14 07:08:33.626: {0:7:43} Worker thread is exiting in TM to meet the desired count of 7. New count is 7
2014-03-14 07:08:35.377: {0:7:2} did not get lock
2014-03-14 07:08:35.377: {0:7:2} tryActionLock }
2014-03-14 07:08:35.414: {0:7:2} clsn_agent::abort: Exception LockAcquireTimeoutException
2014-03-14 07:08:35.414: {0:7:2} clsn_agent::abort, agent exiting }
2014-03-14 07:08:35.453: [ AGFW] {0:7:2} Agent is exiting with exit code: -1

从这里看应该是你的listener应该值在7:08的时候发生了一次震荡导致public ip在监听中丢失，至于震荡的具体原因，需要结合系统和其他crs日志进行分析该震荡应该是由于root 46727224 1 0 Mar 14 -0:00 /oracle/app/crs/bin/crsd.bin reboot的重启有一定关系
建议可以考虑在系统负载轻的时候，尝试kill crsd.bin让其再次重启，观察看看效果

travel.liu 发表于 2014-3-16 16:04:13

你在节点2上 ps -ef | grep d.bin 看下
以下是11.2.0.3的正常进程状态root    3188 12 15:53 ?    00:00:17 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid    3303 10 15:53 ?    00:00:08 /u01/11.2.0/grid/bin/oraagent.bin
grid    3315 10 15:53 ?    00:00:00 /u01/11.2.0/grid/bin/mdnsd.bin
grid    3325 10 15:53 ?    00:00:01 /u01/11.2.0/grid/bin/gpnpd.bin
root    3336 10 15:53 ?    00:00:07 /u01/11.2.0/grid/bin/orarootagent.bin
grid    3347 10 15:53 ?    00:00:03 /u01/11.2.0/grid/bin/gipcd.bin
root    3353 12 15:53 ?    00:00:23 /u01/11.2.0/grid/bin/osysmond.bin
root    3365 10 15:53 ?    00:00:01 /u01/11.2.0/grid/bin/cssdmonitor
root    3383 10 15:53 ?    00:00:02 /u01/11.2.0/grid/bin/cssdagent
grid    3395 11 15:53 ?    00:00:09 /u01/11.2.0/grid/bin/ocssd.bin
root    3425 10 15:53 ?    00:00:03 /u01/11.2.0/grid/bin/ologgerd -M -d /u01/11.2.0/grid/crf/db/rac1
root    3464 11 15:53 ?    00:00:09 /u01/11.2.0/grid/bin/octssd.bin reboot
grid    3487 11 15:53 ?    00:00:10 /u01/11.2.0/grid/bin/evmd.bin
root    3622 13 15:54 ?    00:00:27 /u01/11.2.0/grid/bin/crsd.bin reboot
grid    369134870 15:54 ?    00:00:00 /u01/11.2.0/grid/bin/evmlogger.bin -o /u01/11.2.0/grid/evm/log/evmlogger.info -l /u01/11.2.0/grid/evm/log/evmlogger.log
grid    3733 10 15:54 ?    00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
root    3743 12 15:54 ?    00:00:19 /u01/11.2.0/grid/bin/orarootagent.bin
grid    3858 10 15:54 ?    00:00:00 /u01/11.2.0/grid/bin/scriptagent.bin
grid    3940 10 15:54 ?    00:00:00 /u01/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid    3954 10 15:54 ?    00:00:00 /u01/11.2.0/grid/bin/tnslsnr LISTENER -inherit
oracle 3986 11 15:54 ?    00:00:08 /u01/11.2.0/grid/bin/oraagent.bin
root    468631520 16:06 pts/1 00:00:00 grep d.bin

蜗牛发表于 2014-3-16 16:13:54

OSWatch中的当时vmstat日志

蜗牛发表于 2014-3-16 16:17:22

节点二状态是正常的：
$ps -ef | grep d.bin
root6160394    1 1 Nov 16    - 2086:40 /oracle/app/crs/bin/ohasd.bin reboot
grid6619362    1 0 Nov 16    - 132:12 /oracle/app/crs/bin/gpnpd.bin
root7143648    1 1 Nov 26    - 2801:59 /oracle/app/crs/bin/crsd.bin reboot
grid7471350    1 0 Nov 16    -2:51 /oracle/app/crs/bin/mdnsd.bin
grid8126652    1 0 Nov 16    - 1903:27 /oracle/app/crs/bin/gipcd.bin
grid9830450    1 0 Nov 26    - 1611:11 /oracle/app/crs/bin/evmd.bin
root 13828190    1 0 Nov 26    - 1717:38 /oracle/app/crs/bin/octssd.bin
grid 21364908 36044948 0 16:18:05pts/10:00 grep d.bin
grid 24641786 12648566 2 Nov 26    - 3597:31 /oracle/app/crs/bin/ocssd.bin

travel.liu 发表于 2014-3-16 16:23:48

系统负载比较高！这个机器还是微分区的
pc消耗物理处理器的数目。只在使用共享处理器运行的分区显示（只在微分区环境中显示）
ec消耗授权容量的百分比。只在使用共享处理器运行的分区显示（只在微分区环境中显示）zzz ***Fri Mar 14 07:07:08 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr memory          page          faults          cpu
----- ----------- ------------------------ ------------ -----------------------
rb avm frerepipofr srcyin sycs us sy id wa pc ec
2 19 92716715405 0 438 2383 2270 2270 0 7491 69727 21737 369 42 132.6766.7
2 17 92778733081 0 205 4129 6310 6311 0 8033 70028 22515 38 11 41 112.8471.1
3 27 9275577 10133 0 246 5016 2836 2836 0 8916 72609 25018 40 11 33 162.9172.7
zzz ***Fri Mar 14 07:07:19 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr memory          page          faults          cpu
----- ----------- ------------------------ ------------ -----------------------
rb avm frerepipofr srcyin sycs us sy id wa pc ec
3 22 9306734 12855 0 145 4163 2481 2481 0 7362 61736 20302 269 53 122.1353.4
1 16 93176122998 0 152 1180 3347 3347 0 7137 63813 20834 329 45 142.3759.3
0 27 93213383012 0 0 3763 3894 3894 0 7811 60712 22373 178 53 211.7242.9
zzz ***Fri Mar 14 07:08:11 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr memory          page          faults          cpu
----- ----------- ------------------------ ------------ -----------------------
rb avm frerepipofr srcyin sycs us sy id wa pc ec
6 22 9375642 63708 0 143 0 0 0 0 6838 56665 23709 527 30 113.2982.3
6 28 9375802 63440 078 0 0 0 0 5804 51272 21176 425 40 122.8170.3
2 26 9375919 63278 035 0 0 0 0 5552 59807 19973 286 45 212.1954.7
zzz ***Fri Mar 14 07:08:25 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr memory          page          faults          cpu
----- ----------- ------------------------ ------------ -----------------------
rb avm frerepipofr srcyin sycs us sy id wa pc ec
11 14 9413395 15620 0 707 0 0 0 0 6997 82487 21825 64 12 1683.7994.7
7 13 9407597 18168 0 639 0 0 0 0 6062 79402 20079 57 11 2573.5388.1
9 14 9409496 13235 0 552 0 0 0 0 6252 65025 19924 567 3073.4085.1
zzz ***Fri Mar 14 07:08:32 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr memory          page          faults          cpu
----- ----------- ------------------------ ------------ -----------------------
rb avm frerepipofr srcyin sycs us sy id wa pc ec
8 10 94199144587 0 564 4377 7035 10042 0 8222 80437 24056 57 12 14 173.5689.1
72 9414540 10552 0 1531 2071 1131 1476 0 7610 68948 21092 519 30 103.2781.8
44 9414343 10002 0 745 0 0 0 0 6419 63686 19852 537 3373.3082.5

页: 1 [2]

ORACLE SOS's Archiver