Voting disk goes offline, causing the cssd process to abort and leaving the node's cluster resources unavailable

    Tech  2022-07-13

    System environment:

    Virtualization platform: Huawei FusionSphere

    Operating system: RedHat

    Storage: EMC unit-400

    Symptom: the application team reported that the database could not be reached, with an error pointing to the listener.

    Analysis:

    1. Log in to the database server and check the cluster resource status

    crsctl status res -t

    The command reports that the cluster resources are abnormal and the crs resource is offline.
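
    When the crs resource itself is offline, it can help to look at the lower stack as well. The following is a minimal sketch, assuming a Grid Infrastructure 11.2 environment with crsctl on the PATH of the current user; the wrapper name check_gi_stack is made up for illustration and was not part of the original session:

```shell
# Hypothetical wrapper: run a few standard crsctl health checks,
# or fail cleanly when crsctl is not on PATH.
check_gi_stack() {
    if ! command -v crsctl >/dev/null 2>&1; then
        echo "crsctl not found on PATH" >&2
        return 127
    fi
    crsctl check crs              # OHAS, CRS, CSS and EVM daemon status
    crsctl stat res -t -init      # lower-stack resources such as ora.cssd
    crsctl query css votedisk     # configured voting files and their state
}
```

    Running check_gi_stack as the grid owner or root prints the three reports in order; the -init view is the one that shows whether ora.cssd is up.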

    2. lsblk hangs and prints nothing
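
    A hanging lsblk is consistent with block-device I/O that never returns. One way to check this from another session is to list processes in uninterruptible sleep (state D). This is a sketch only, assuming a Linux host with procps ps and awk; the function name d_state_filter is hypothetical:

```shell
# Hypothetical helper: keep only processes whose STAT column starts
# with "D" (uninterruptible sleep, typically waiting on I/O).
# Expects the output of `ps -eo pid,stat,wchan:32,comm` on stdin.
d_state_filter() {
    awk 'NR > 1 && $2 ~ /^D/'
}
```

    Typical use would be: ps -eo pid,stat,wchan:32,comm | d_state_filter. If lsblk or ocssd.bin shows up in the result, the wchan column hints at where in the kernel the process is waiting.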

    3. Check crsd.log; the errors are as follows:

    2020-06-30 02:17:18.112: [    CRSD][2161772320] Logging level for Module: OCRASM  1

    2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Checking the OCR device

    2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Sync-up with OCR

    2020-06-30 02:17:18.112: [ CRSMAIN][2161772320] Connecting to the CSS Daemon

    2020-06-30 02:17:18.123: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

    2020-06-30 02:17:48.185: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

    2020-06-30 02:18:18.190: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

    2020-06-30 02:18:48.194: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

    2020-06-30 02:19:18.197: [ CSSCLNT][2155321088]clssnsquerymode: not connected to CSSD

    2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clssscConnect: gipcWait failed with 16 (0x1a)

    2020-06-30 02:19:18.214: [ CSSCLNT][2161772320]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16

    2020-06-30 02:19:18.218: [  CRSRTI][2161772320] CSS is not ready. Received status 3

    2020-06-30 02:19:18.218: [    CRSD][2161772320] Created alert : (:CRSD00109:) :  Could not init the CSS context, error: 3

    2020-06-30 02:19:18.218: [    CRSD][2161772320][PANIC] CRSD exiting: Could not init the CSS context, error: 3

    2020-06-30 02:19:18.218: [    CRSD][2161772320] Done.

    2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clssscConnect: gipcWait failed with 16 (0x1a)

    2020-06-30 02:21:18.504: [ CSSCLNT][387819296]clsssInitNative: connect to (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_oracle-rac01_)) failed, rc 16

    2020-06-30 02:21:18.508: [  CRSRTI][387819296] CSS is not ready. Received status 3

    2020-06-30 02:21:18.508: [ CRSMAIN][387819296] First attempt: init CSS context failed. Error = 3

    [  clsdmt][381368064]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=oracle-rac01DBG_CRSD))

    2020-06-30 02:21:18.511: [  clsdmt][381368064]PID for the Process [23043], connkey 1

    2020-06-30 02:21:18.512: [  clsdmt][381368064]Creating PID [23043] file for home /u01/11.2.0/grid host oracle-rac01 bin crs to /u01/11.2.0/grid/crs/init/

    2020-06-30 02:21:18.512: [  clsdmt][381368064]Writing PID [23043] to the file [/u01/11.2.0/grid/crs/init/oracle-rac01.pid]

    crsd.log shows that CRSD cannot connect to the CSS daemon (CSS is unavailable), so CRSD exits.

    4. Check the CSS log; the errors are as follows:

    2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 196870 msecs

    2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 811354450 msecs

    2020-06-30 02:04:04.814: [    CSSD][2121881344]clssscMonitorThreads clssnmvWorkerThread not scheduled for 811354540 msecs

    2020-06-30 02:04:04.942: [    CSSD][2095437568]clssnmSendingThread: sending status msg to all nodes

    2020-06-30 02:04:04.942: [    CSSD][2095437568]clssnmSendingThread: sent 5 status msgs to all nodes

    2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200460 ms for voting file /dev/asm-diskf)

    2020-06-30 02:04:07.813: [    CSSD][2101761792]clssnmCompleteGMReq: Completed request type 17 with status 1

    2020-06-30 02:04:07.813: [    CSSD][2101761792]clssgmDoneQEle: re-queueing req 0x7fba7653d510 status 1

    2020-06-30 02:04:07.813: [    CSSD][2101761792]clssnmvDiskAvailabilityChange: voting file /dev/asm-diskf now offline

    2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSNM00018:)clssnmvDiskCheck: Aborting, 1 of 3 configured voting disks available, need 2

    2020-06-30 02:04:07.813: [    CSSD][2101761792]###################################

    2020-06-30 02:04:07.813: [    CSSD][2101761792]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

    2020-06-30 02:04:07.813: [    CSSD][2101761792]###################################

    2020-06-30 02:04:07.813: [    CSSD][2101761792](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

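    The CSS log shows that I/O to the voting file /dev/asm-diskf saw no completions for about 200 seconds, so CSS marked it offline. That left only 1 of the 3 configured voting disks available where 2 are required, and the CSS daemon aborted. Since losing this one disk took the count down to 1, another voting disk had presumably already been unavailable before this excerpt begins.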
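
    The "need 2" in the abort message follows from the majority rule for voting disks: a node must be able to access more than half of the configured voting disks. Below is a small sketch of that arithmetic, together with a filter for the relevant log lines. The function names are hypothetical, and the grep patterns are taken only from the messages quoted above:

```shell
# Minimum number of voting disks a node must reach (strict majority).
votedisk_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Print voting-file I/O stalls, offline transitions and quorum aborts
# from a CSS log read on stdin.
votedisk_events() {
    grep -E 'clssnmvDiskCheck|clssnmvDiskAvailabilityChange'
}
```

    For example, votedisk_quorum 3 gives the threshold for this cluster, and feeding the CSS log into votedisk_events shows when each voting file was declared offline.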

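    5. Troubleshooting steps

    5.1 Checked the related disks on the virtualization platform and the storage array. No errors were logged, so the voting disk itself appears to be healthy.

    5.2 Tried to restart the cluster stack. The ocssd process could not be stopped and the CRS restart failed. We suspected that system management processes could no longer communicate and that operating system processes were in an abnormal state.

    5.3 Rebooted the operating system. After the reboot the cluster resources returned to normal and service was restored.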

    Open question: the reason the voting disk went offline still needs to be analyzed.
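
    One possible starting point for that analysis is the kernel log from around 02:00 on 2020-06-30, since stalled block I/O often leaves traces there. This is a sketch under assumptions: the function name io_stall_hints is hypothetical, and the three patterns are common Linux kernel messages, not lines taken from this incident:

```shell
# Hypothetical filter: keep kernel log lines that commonly accompany
# stalled or failed block I/O. Reads the log on stdin.
io_stall_hints() {
    grep -i -E 'blocked for more than [0-9]+ seconds|I/O error|rejecting I/O to offline device'
}
```

    It could be fed from dmesg or from an archived /var/log/messages file covering the incident window. An empty result does not rule out a storage-path problem.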
