travel.liu 发表于 2014-3-16 16:01:04

root 46727224      1   0   Mar 14      -0:00 /oracle/app/crs/bin/crsd.bin reboot
crs的时间不对,是3月14号,出现问题了 。检查日志

mos : Listener endpoint for public host is removed for a brief period when crsd.bin crash (Doc ID 1605983.1)

xifenfei 发表于 2014-3-16 16:01:44

2014-03-14 07:08:02.074: {0:7:2} (:CLSN00010:)ons is running ...
2014-03-14 07:08:02.075: TM is changing desired thread # to 8. Current # is 7
2014-03-14 07:08:02.076: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.076: {0:7:2} execCmd ret = 0
2014-03-14 07:08:02.628: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.628: {0:7:2} OnsAgent::check }
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)LSNRCTL for IBM/AIX RISC System/6000: Version 11.2.0.2.0 - Production on 14-MAR-2014 07:08:02
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)Copyright (c) 1991, 2010, Oracle.All rights reserved.
2014-03-14 07:08:02.629: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.645: {0:7:2} (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2014-03-14 07:08:02.693: {0:7:2} (:CLSN00010:)Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
2014-03-14 07:08:02.693: {0:7:2} (:CLSN00008:) category: -2, operation: fork, loc: spawnproc28, OS error: 12, other: forked failed [-1]
2014-03-14 07:07:59.706: [ COMMCRS]clsc_thrd_spawn: (1123b1a18) waited 60000 milliseconds for thread to become active
2014-03-14 07:08:02.716: [    AGFW] {0:7:2} Created alert : (:CRSAGF00113:) :Aborting the command: check for resource: ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1
2014-03-14 07:08:02.716: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:02.783: {0:7:2} clsn_agent::abort {
2014-03-14 07:08:02.783: {0:7:2} abort {
2014-03-14 07:08:02.783: {0:7:2} abort command: check
2014-03-14 07:08:02.784: {0:7:2} tryActionLock {
2014-03-14 07:08:02.883: {0:7:2} (:CLSN00009:)Utils:execCmd abort request: killing /oracle/app/crs/bin/lsnrctl
2014-03-14 07:08:02.914: [    AGFW] {1:18767:989} Agent received the message: AGENT_HB ID 12293:2510237
2014-03-14 07:08:03.115: [   AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.115: [   AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.279: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5013: Agent "/oracle/app/crs/bin/oraagent.bin" failed to start process "/oracle/app/crs/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/oracle/app/crs/log/gdgz-ps-esb-db01-p780/agent/crsd/oraagent_grid/oraagent_grid.log"

2014-03-14 07:08:03.280: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5014: Agent "/oracle/app/crs/bin/oraagent.bin" timed out starting process "/oracle/app/crs/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/oracle/app/crs/log/gdgz-ps-esb-db01-p780/agent/crsd/oraagent_grid/oraagent_grid.log"

2014-03-14 07:08:03.316: [    AGFW] {0:7:2} Created alert : (:CRSAGF00113:) :Aborting the command: check for resource: ora.asm gdgz-ps-esb-db01-p780 1
2014-03-14 07:08:03.316: {0:7:2} clsn_agent::abort {
2014-03-14 07:08:03.453: {0:7:2} abort {
2014-03-14 07:08:03.453: {0:7:2} abort command: check
2014-03-14 07:08:03.488: {0:7:2} tryActionLock {
2014-03-14 07:08:03.537: [   AGENT] {0:7:2} UserErrorException: Locale is
2014-03-14 07:08:03.537: {0:7:2} clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.LISTENER.lsnr check" encountered the following error:
(:CLSN00009:)Utils:execCmd aborted

2014-03-14 07:08:03.622: {0:7:2} clsn_agent::check: Exception SclsProcessSpawnException
2014-03-14 07:08:03.693: [    AGFW] {0:7:2} ora.LISTENER_SCAN1.lsnr 1 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:03.693: [    AGFW] {0:7:2} ora.LISTENER_SCAN1.lsnr 1 1 would be continued to monitored!
2014-03-14 07:08:03.724: [    AGFW] {0:7:38} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:03.724: [    AGFW] {0:7:38} Agent sending message to PE: RESOURCE_STATUS ID 20481:681928
2014-03-14 07:08:05.844: {0:7:2} clsnUtils::setResAttrib attribute GEN_USR_ORA_INST_NAME value +ASM1
2014-03-14 07:08:06.784: {0:7:2} got lock
2014-03-14 07:08:06.784: {0:7:2} tryActionLock }
2014-03-14 07:08:06.784: {0:7:2} abort}
2014-03-14 07:08:06.788: {0:7:2} clsn_agent::abort }
2014-03-14 07:08:06.788: [    AGFW] {0:7:2} Command: check for resource: ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 completed with status: TIMEDOUT
2014-03-14 07:08:06.789: [    AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:06.789: [    AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 would be continued to monitored!
2014-03-14 07:08:06.808: [    AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state details has changed from:to: CHECK TIMED OUT
2014-03-14 07:08:06.808: [    AGFW] {0:7:39} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:06.808: [    AGFW] {0:7:39} Agent sending message to PE: RESOURCE_STATUS ID 20481:681929
2014-03-14 07:08:08.114: {0:7:2} setResAttrib: attr GEN_USR_ORA_INST_NAME@SERVERNAME(gdgz-ps-esb-db01-p780) clsagfw attribVal +ASM1 clscrs tmpAttrValue +ASM1
2014-03-14 07:08:09.138: {0:7:2} AsmProxyAgent::check clsagfw_res_status 0
2014-03-14 07:08:11.488: {0:7:2} got lock
2014-03-14 07:08:11.488: {0:7:2} tryActionLock }
2014-03-14 07:08:11.488: {0:7:2} abort}
2014-03-14 07:08:11.499: {0:7:2} clsn_agent::abort }
2014-03-14 07:08:11.499: [    AGFW] {0:7:2} Command: check for resource: ora.asm gdgz-ps-esb-db01-p780 1 completed with status: TIMEDOUT
2014-03-14 07:08:11.499: [    AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 state changed from: ONLINE to: UNKNOWN
2014-03-14 07:08:11.499: [    AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 would be continued to monitored!
2014-03-14 07:08:11.521: [    AGFW] {0:7:2} ora.asm gdgz-ps-esb-db01-p780 1 state details has changed from: Started to: CHECK TIMED OUT
2014-03-14 07:08:11.521: [    AGFW] {0:7:40} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:11.521: [    AGFW] {0:7:40} Agent sending message to PE: RESOURCE_STATUS ID 20481:681930
2014-03-14 07:08:12.362: [ USRTHRD] ClusterSubscriber::SubscriberWorker::InternalClusterSubscriber::handleEventCBexecuting for reason 1



2014-03-14 07:08:27.084: {0:7:2} (:CLSN00010:)The command completed successfully
2014-03-14 07:08:27.097: {0:7:2} (:CLSN00010:)
2014-03-14 07:08:27.098: {0:7:2} execCmd ret = 0
2014-03-14 07:08:27.098: {0:7:2} return val (clsagfwret) = 0
2014-03-14 07:08:27.098: {0:7:2} LsnrAgent::check }
2014-03-14 07:08:27.108: [    AGFW] {0:7:2} ora.LISTENER.lsnr gdgz-ps-esb-db01-p780 1 state changed from: UNKNOWN to: ONLINE
2014-03-14 07:08:27.108: [    AGFW] {0:7:41} Generating new Tint for unplanned state change. Original Tint: {0:7:2}
2014-03-14 07:08:27.108: [    AGFW] {0:7:41} Agent sending message to PE: RESOURCE_STATUS ID 20481:681944
2014-03-14 07:08:27.196: {0:7:2} clsnUtils::setResAttrib attribute GEN_USR_ORA_INST_NAME value +ASM1
2014-03-14 07:08:28.381: {0:7:40} ConnectionPool::releaseConnection InstConnection 11cd75f0
2014-03-14 07:08:28.382: {0:7:40} DgpAgent::runCheck:clsagfw_res_status 0
2014-03-14 07:08:28.382: [    AGFW] {0:7:40} Agent sending last reply for: RESOURCE_PROBE ID 4097:2510260
2014-03-14 07:08:32.660: {0:7:2} InstConnection:connectInt:is up
2014-03-14 07:08:33.541: {0:7:2} ConnectionPool::releaseConnection InstConnection 11ccce10
2014-03-14 07:08:33.541: {0:7:2} DgpAgent::runCheck:clsagfw_res_status 0
2014-03-14 07:08:33.625: TM is changing desired thread # to 7. Current # is 8
2014-03-14 07:08:33.626: {0:7:43} Worker thread is exiting in TM to meet the desired count of 7. New count is 7
2014-03-14 07:08:35.377: {0:7:2} did not get lock
2014-03-14 07:08:35.377: {0:7:2} tryActionLock }
2014-03-14 07:08:35.414: {0:7:2} clsn_agent::abort: Exception LockAcquireTimeoutException
2014-03-14 07:08:35.414: {0:7:2} clsn_agent::abort, agent exiting }
2014-03-14 07:08:35.453: [    AGFW] {0:7:2} Agent is exiting with exit code: -1



从这里看应该是你的listener应该值在7:08的时候发生了一次震荡导致public ip在监听中丢失,至于震荡的具体原因,需要结合系统和其他crs日志进行分析该震荡应该是由于root 46727224      1   0   Mar 14    -0:00 /oracle/app/crs/bin/crsd.bin reboot的重启有一定关系
建议可以考虑在系统负载轻的时候,尝试kill crsd.bin让其再次重启,观察看看效果

travel.liu 发表于 2014-3-16 16:04:13

你在节点2上 ps -ef | grep d.bin 看下
以下是11.2.0.3的正常进程状态root      3188   12 15:53 ?      00:00:17 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid      3303   10 15:53 ?      00:00:08 /u01/11.2.0/grid/bin/oraagent.bin
grid      3315   10 15:53 ?      00:00:00 /u01/11.2.0/grid/bin/mdnsd.bin
grid      3325   10 15:53 ?      00:00:01 /u01/11.2.0/grid/bin/gpnpd.bin
root      3336   10 15:53 ?      00:00:07 /u01/11.2.0/grid/bin/orarootagent.bin
grid      3347   10 15:53 ?      00:00:03 /u01/11.2.0/grid/bin/gipcd.bin
root      3353   12 15:53 ?      00:00:23 /u01/11.2.0/grid/bin/osysmond.bin
root      3365   10 15:53 ?      00:00:01 /u01/11.2.0/grid/bin/cssdmonitor
root      3383   10 15:53 ?      00:00:02 /u01/11.2.0/grid/bin/cssdagent
grid      3395   11 15:53 ?      00:00:09 /u01/11.2.0/grid/bin/ocssd.bin
root      3425   10 15:53 ?      00:00:03 /u01/11.2.0/grid/bin/ologgerd -M -d /u01/11.2.0/grid/crf/db/rac1
root      3464   11 15:53 ?      00:00:09 /u01/11.2.0/grid/bin/octssd.bin reboot
grid      3487   11 15:53 ?      00:00:10 /u01/11.2.0/grid/bin/evmd.bin
root      3622   13 15:54 ?      00:00:27 /u01/11.2.0/grid/bin/crsd.bin reboot
grid      369134870 15:54 ?      00:00:00 /u01/11.2.0/grid/bin/evmlogger.bin -o /u01/11.2.0/grid/evm/log/evmlogger.info -l /u01/11.2.0/grid/evm/log/evmlogger.log
grid      3733   10 15:54 ?      00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
root      3743   12 15:54 ?      00:00:19 /u01/11.2.0/grid/bin/orarootagent.bin
grid      3858   10 15:54 ?      00:00:00 /u01/11.2.0/grid/bin/scriptagent.bin
grid      3940   10 15:54 ?      00:00:00 /u01/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid      3954   10 15:54 ?      00:00:00 /u01/11.2.0/grid/bin/tnslsnr LISTENER -inherit
oracle    3986   11 15:54 ?      00:00:08 /u01/11.2.0/grid/bin/oraagent.bin
root      468631520 16:06 pts/1    00:00:00 grep d.bin

蜗牛 发表于 2014-3-16 16:13:54

OSWatch中的当时vmstat日志

蜗牛 发表于 2014-3-16 16:17:22

节点二状态是正常的:
$ps -ef | grep d.bin
    root6160394      1   1   Nov 16      - 2086:40 /oracle/app/crs/bin/ohasd.bin reboot
    grid6619362      1   0   Nov 16      - 132:12 /oracle/app/crs/bin/gpnpd.bin
    root7143648      1   1   Nov 26      - 2801:59 /oracle/app/crs/bin/crsd.bin reboot
    grid7471350      1   0   Nov 16      -2:51 /oracle/app/crs/bin/mdnsd.bin
    grid8126652      1   0   Nov 16      - 1903:27 /oracle/app/crs/bin/gipcd.bin
    grid9830450      1   0   Nov 26      - 1611:11 /oracle/app/crs/bin/evmd.bin
    root 13828190      1   0   Nov 26      - 1717:38 /oracle/app/crs/bin/octssd.bin
    grid 21364908 36044948   0 16:18:05pts/10:00 grep d.bin
    grid 24641786 12648566   2   Nov 26      - 3597:31 /oracle/app/crs/bin/ocssd.bin

travel.liu 发表于 2014-3-16 16:23:48

系统负载比较高!这个机器还是微分区的
pc消耗物理处理器的数目。只在使用共享处理器运行的分区显示(只在微分区环境中显示)
ec消耗授权容量的百分比。只在使用共享处理器运行的分区显示(只在微分区环境中显示)zzz ***Fri Mar 14 07:07:08 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr    memory            page            faults            cpu         
----- ----------- ------------------------ ------------ -----------------------
rb   avm   frerepipofr   srcyin   sycs us sy id wa    pc    ec
2 19 92716715405   0 438 2383 2270 2270   0 7491 69727 21737 369 42 132.6766.7
2 17 92778733081   0 205 4129 6310 6311   0 8033 70028 22515 38 11 41 112.8471.1
3 27 9275577 10133   0 246 5016 2836 2836   0 8916 72609 25018 40 11 33 162.9172.7
zzz ***Fri Mar 14 07:07:19 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr    memory            page            faults            cpu         
----- ----------- ------------------------ ------------ -----------------------
rb   avm   frerepipofr   srcyin   sycs us sy id wa    pc    ec
3 22 9306734 12855   0 145 4163 2481 2481   0 7362 61736 20302 269 53 122.1353.4
1 16 93176122998   0 152 1180 3347 3347   0 7137 63813 20834 329 45 142.3759.3
0 27 93213383012   0   0 3763 3894 3894   0 7811 60712 22373 178 53 211.7242.9
zzz ***Fri Mar 14 07:08:11 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr    memory            page            faults            cpu         
----- ----------- ------------------------ ------------ -----------------------
rb   avm   frerepipofr   srcyin   sycs us sy id wa    pc    ec
6 22 9375642 63708   0 143   0   0    0   0 6838 56665 23709 527 30 113.2982.3
6 28 9375802 63440   078   0   0    0   0 5804 51272 21176 425 40 122.8170.3
2 26 9375919 63278   035   0   0    0   0 5552 59807 19973 286 45 212.1954.7
zzz ***Fri Mar 14 07:08:25 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr    memory            page            faults            cpu         
----- ----------- ------------------------ ------------ -----------------------
rb   avm   frerepipofr   srcyin   sycs us sy id wa    pc    ec
11 14 9413395 15620   0 707   0   0    0   0 6997 82487 21825 64 12 1683.7994.7
7 13 9407597 18168   0 639   0   0    0   0 6062 79402 20079 57 11 2573.5388.1
9 14 9409496 13235   0 552   0   0    0   0 6252 65025 19924 567 3073.4085.1
zzz ***Fri Mar 14 07:08:32 GMT+08:00 2014

System configuration: lcpu=16 mem=32768MB ent=4.00

kthr    memory            page            faults            cpu         
----- ----------- ------------------------ ------------ -----------------------
rb   avm   frerepipofr   srcyin   sycs us sy id wa    pc    ec
8 10 94199144587   0 564 4377 7035 10042   0 8222 80437 24056 57 12 14 173.5689.1
72 9414540 10552   0 1531 2071 1131 1476   0 7610 68948 21092 519 30 103.2781.8
44 9414343 10002   0 745   0   0    0   0 6419 63686 19852 537 3373.3082.5
页: 1 [2]
查看完整版本: tnsping时vip能通,public ip报错无监听