ORACLE SOS

ASM disk write timeout causes CRS service abort

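Posted 2014-3-28 17:41:14
Platform: AIX 6.1
Database version: 11.2.0.4 RAC
Problem description:
The CRS service on node 2 aborted, but the listener and the database instance stayed up with no errors reported, and no split-brain occurred. On inspection, the CRS disk group had been dismounted.
Workaround:
Manually mounted the CRS disk group and started the CRS service.
ASM instance alert log: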
2014-03-25 15:32:42.352000 +08:00
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 3 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 1.
NOTE: process _b000_+asm2 (8716440) initiating offline of disk 0.2901283133 (CRS_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm2 (8716440) initiating offline of disk 1.2901283134 (CRS_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm2 (8716440) initiating offline of disk 2.2901283135 (CRS_0002) with mask 0x7e in group 1
NOTE: process _b000_+asm2 (8716440) initiating offline of disk 3.2901283136 (CRS_0003) with mask 0x7e in group 1
NOTE: process _b000_+asm2 (8716440) initiating offline of disk 4.2901283137 (CRS_0004) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 21 for pid 33, osid 8716440
ERROR: no read quorum in group: required 3, found 0 disks
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp = 1, dsk = 0/0xacee113d, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 1/0xacee113e, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 2/0xacee113f, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 3/0xacee1140, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 1, dsk = 4/0xacee1141, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 22 for pid 33, osid 8716440
ERROR: no read quorum in group: required 3, found 0 disks
NOTE: cache dismounting (not clean) group 1/0x3BBEE1E0 (CRS)
WARNING: Offline for disk CRS_0000 in mode 0x7f failed.
WARNING: Offline for disk CRS_0001 in mode 0x7f failed.
WARNING: Offline for disk CRS_0002 in mode 0x7f failed.
WARNING: Offline for disk CRS_0003 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 22741528, image: oracle@OSS-JH-DB2 (B001)
WARNING: Offline for disk CRS_0004 in mode 0x7f failed.
NOTE: halting all I/Os to diskgroup 1 (CRS)
NOTE: LGWR doing non-clean dismount of group 1 (CRS)
NOTE: LGWR sync ABA=9.63 last written ABA 9.63
NOTE: No asm libraries found in the system
kjbdomdet send to inst 1
detach from dom 1, sending detach message to inst 1
List of instances:
1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
520 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x3BBEE1E0 (CRS)
SQL> alter diskgroup CRS dismount force /* ASM SERVER:1002365408 */
NOTE: cache deleting context for group CRS 1/0x3bbee1e0
GMON dismounting group 1 at 23 for pid 34, osid 22741528
NOTE: Disk CRS_0000 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0001 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0002 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0003 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0004 in mode 0x7f marked for de-assignment
NOTE: Disk CRS_0005 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 1
2014-03-25 15:32:47.666000 +08:00
ASM Health Checker found 1 new failures
2014-03-25 15:33:12.917000 +08:00
SUCCESS: diskgroup CRS was dismounted
SUCCESS: alter diskgroup CRS dismount force /* ASM SERVER:1002365408 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group CRS
NOTE: diskgroup resource ora.CRS.dg is offline
Errors in file /u01/db/grid/base/diag/asm/+asm/+ASM2/trace/+ASM2_ora_7143732.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
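
The pattern in the log above (repeated PST write waits on every disk of group 1, followed by a forced dismount) can be picked out of a longer alert log with a small script. This is only a minimal sketch: the function name `pst_write_waits` and the sample text are illustrative, and the regular expression assumes the warning wording shown in this excerpt.

```python
import re
from collections import Counter

# Matches lines such as:
#   WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)\."
)

def pst_write_waits(log_text):
    """Count PST write-wait warnings per (group, disk) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PST_WAIT.search(line)
        if match:
            disk = int(match.group(2))
            group = int(match.group(3))
            counts[(group, disk)] += 1
    return counts

# Illustrative sample, shortened from the excerpt above.
sample = """\
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
NOTE: checking PST: grp = 1
"""

for (group, disk), n in sorted(pst_write_waits(sample).items()):
    print(f"group {group} disk {disk}: {n} warning(s)")
```

If every PST disk of a group shows up at the same timestamp, as it does here, that points at a shared path or storage stall rather than a single failed disk.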



My Oracle Support (Metalink) recommendation:
ASM Disks Offline When Few Paths In The Storage Is Lost (Doc ID 1581684.1)

If possible, please check with multipath vendor if OS level timeout value can be reduced to at least 15 seconds or less.
If not, then set parameter "_asm_hbeatiowait" from 15 to 35 secs at ASM level for this kind of issue after consulting oracle support only.
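
The recommendation above comes down to comparing two timeouts. The sketch below is a deliberately simplified model and an assumption on my part, not Oracle's documented logic: it treats a disk as at risk whenever the ASM PST wait (`_asm_hbeatiowait`, 15 s before the change) is shorter than the OS-level timeout. The function name `asm_gives_up_first` is hypothetical.

```python
def asm_gives_up_first(os_timeout_secs, asm_hbeat_wait_secs=15):
    """Return True when the ASM PST wait is shorter than the OS-level timeout.

    Simplified model (an assumption): if ASM stops waiting before the
    OS/multipath layer has timed out and failed over, ASM offlines the disk.
    """
    return asm_hbeat_wait_secs < os_timeout_secs

# (OS-level timeout, _asm_hbeatiowait) pairs in seconds; values are illustrative.
for os_t, hbeat in [(30, 15), (15, 15), (30, 35)]:
    at_risk = asm_gives_up_first(os_t, hbeat)
    print(f"OS timeout {os_t}s, ASM wait {hbeat}s -> at risk: {at_risk}")
```

Either option in the note removes the mismatch: lowering the OS-level timeout to 15 s or less, or raising the ASM wait to 35 s.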
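Questions:
1. Where can I check the OS-level timeout value?
2. Is it OK to change the hidden parameter _asm_hbeatiowait? Oracle recommends changing it only with Oracle Support's approval.
Any help would be appreciated, thanks!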

Posted 2014-3-28 19:04:41
What storage are you using, and which multipath software?
-------------------------------------------
Travel
The road is long; my thoughts are on the distant way
Email:travel.liu@outlook.com
www.traveldba.com
--------------------------------------------
Original poster | Posted 2014-4-3 09:21:15
travel.liu wrote on 2014-3-28 19:04:
What storage are you using, and which multipath software?

Hitachi storage with Hitachi's multipath software, 8 paths.
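Original poster | Posted 2014-4-3 09:25:16
The CRS service failure left node 2 inaccessible. The root cause was that the CRS service on node 2 timed out writing to the ASM disk group, so Oracle automatically dismounted the ASM disk group that holds the OCR, which in turn stopped the CRS service. According to Oracle's official documentation, the fix is to raise the hidden parameter _asm_hbeatiowait to 35 s, although the documentation recommends changing it only after consulting Oracle Support.

The integration team also consulted Hitachi's storage engineers, who proposed changing the CRS MISSCOUNT and DISKTIMEOUT settings. Changing those two parameters would have no effect on this problem. MISSCOUNT and DISKTIMEOUT govern the two CRS heartbeat mechanisms: the network heartbeat over the private interconnect and the disk heartbeat through the voting disk. Both are used to decide whether a node is alive so that RAC keeps running properly, and they have little to do with this issue. We therefore recommend raising _asm_hbeatiowait to 35 s.

At the operating system level there is also a per-disk write-timeout parameter, rw_timeout. It defaulted to 30 s in older OS versions and defaults to 60 s in current ones. It means that if the OS has not completed a write to storage within 60 s, the I/O returns a timeout and is retried. After discussing this with Engineer Feng, we concluded that changing this parameter could cause unexpected read/write problems, so we are leaving it unchanged for now.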
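
For the earlier question about where to look at the OS-level value: on AIX, disk attributes such as rw_timeout are normally listed with `lsattr -El <hdisk>`. The sketch below parses that kind of output. The helper name `parse_lsattr_value` and the sample line are illustrative assumptions, not output captured from the system in this thread, and third-party multipath drivers may expose their own timeout settings elsewhere.

```python
def parse_lsattr_value(lsattr_output, attribute="rw_timeout"):
    """Return the integer value of `attribute` from `lsattr -El` style output.

    Each line is expected to look like:
        <attribute> <value> <description ...> <user_settable>
    Returns None if the attribute is not present.
    """
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == attribute:
            return int(fields[1])
    return None

# Illustrative output, e.g. from: lsattr -El hdisk0 -a rw_timeout
sample_output = "rw_timeout 60 READ/WRITE time out value True\n"

print(parse_lsattr_value(sample_output))
```

The parsed value can then be compared against the ASM-side wait of 15 s (or 35 s after the change) discussed above.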