Handling a MongoDB Cluster Fault on the BlueKing Platform

    Tech  2022-07-10

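    Problem Recap

    On the BlueKing platform, the configuration platform (cmdb) stores its data in a MongoDB cluster (a replica set made up of three MongoDB servers). I happened to notice errors in the log of one node in the cluster. The error message was: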

    2020-06-30T19:27:50.622+0800 I REPL [replication-0] We are too stale to use 10.10.10.2:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1592818085, 6) is before their earliest timestamp: Timestamp(1593465706, 461) for 1min until:

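    The log shows that this node is too stale to use 10.10.10.2:27017 as a sync source. I then checked whether the other two nodes in the cluster were in sync with the primary (the masked parts are ip+port; this is a customer's production environment, so they have to be desensitized). It turned out that the second MongoDB node was 184.81 hours behind the primary and that its state was RECOVERING. Because the cluster has three nodes and the other two were healthy, it could still serve the configuration platform. If one more MongoDB node failed, however, recovery would become troublesome. (Honestly, I would not know what to do in that case. Restore from a backup? And what if there were no backup?)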
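    The size of the gap can be read directly from the two timestamps in the error line. The sketch below is only an illustration (the helper name `stale_gap_hours` is made up for this example): it extracts the seconds part of each `Timestamp(...)` and converts the difference to hours. The result is the distance between this node's last fetched operation and the oldest entry still in the sync source's oplog. The 184.81 hours quoted above was presumably measured against the primary's newest operation, which would explain why it is somewhat larger.

```python
import re

# Error line copied from the faulty node's log (the trailing "until:" is
# truncated in the source and is left that way here).
LOG_LINE = (
    "2020-06-30T19:27:50.622+0800 I REPL [replication-0] We are too stale to use "
    "10.10.10.2:27017 as a sync source. Blacklisting this sync source because our "
    "last fetched timestamp: Timestamp(1592818085, 6) is before their earliest "
    "timestamp: Timestamp(1593465706, 461) for 1min until:"
)

# Captures the seconds-since-epoch part of Timestamp(seconds, increment).
TS_PATTERN = re.compile(r"Timestamp\((\d+), \d+\)")


def stale_gap_hours(line: str) -> float:
    """Hours between our last fetched op and the source's oldest oplog entry."""
    seconds = [int(s) for s in TS_PATTERN.findall(line)]
    if len(seconds) != 2:
        raise ValueError("expected exactly two Timestamp(...) values")
    last_fetched, source_earliest = seconds
    return (source_earliest - last_fetched) / 3600


print(f"{stale_gap_hours(LOG_LINE):.2f} hours")  # 179.89 hours
```

    A positive gap means the operations this node still needs have already been rotated out of the source's oplog, so it cannot catch up incrementally and needs a full resync.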

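    Repairing the Node

    1. Back up all data in MongoDB (in BlueKing, the gse, cmdb, and config databases all need to be backed up, along with MongoDB's own local and admin databases).
    2. Stop all MongoDB services.
    3. Move the faulty node's data into a newly created folder, so that the original data path contains nothing except the mongo.key file (in effect, the data files are cleared).
    4. Start the MongoDB service on all three servers.
    5. The faulty node's MongoDB then syncs its data over from another node.

    Log records during the sync.

    While syncing: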

    2020-06-30T19:48:44.023+0800 I - [repl writer worker 7] cmdb.cc_OperationLog collection clone progress: 16253350/47188084 34% (documents copied)
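    Progress lines like the one above are easy to pick out of the log when watching a long initial sync. The following is a minimal sketch, assuming the line format shown here stays the same; the function name `clone_progress` and the regular expression are illustrative and not part of MongoDB.

```python
import re

# Matches "<namespace> collection clone progress: <done>/<total> <pct>%".
PROGRESS = re.compile(
    r"(?P<ns>\S+) collection clone progress: (?P<done>\d+)/(?P<total>\d+) \d+%"
)


def clone_progress(line):
    """Return (namespace, copied, total, percent), or None for other lines."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    done, total = int(match["done"]), int(match["total"])
    percent = 100.0 * done / total if total else 0.0
    return match["ns"], done, total, percent


line = (
    "2020-06-30T19:48:44.023+0800 I - [repl writer worker 7] "
    "cmdb.cc_OperationLog collection clone progress: "
    "16253350/47188084 34% (documents copied)"
)
ns, done, total, percent = clone_progress(line)
print(f"{ns}: {done}/{total} documents copied ({percent:.1f}%)")
```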

    Sync complete:

    2020-06-30T20:14:24.825+0800 I STORAGE [replication-28] Finishing collection drop for local.temp_oplog_buffer (259eb844-9e40-48d4-92c3-e7a65e6c7ae0).
    2020-06-30T20:14:24.830+0800 I REPL [replication-28] initial sync done; took 2245s.
    2020-06-30T20:14:24.830+0800 I REPL [replication-28] transition to RECOVERING from STARTUP2
    2020-06-30T20:14:24.830+0800 I REPL [replication-28] Starting replication fetcher thread
    2020-06-30T20:14:24.830+0800 I REPL [replication-28] Starting replication applier thread
    2020-06-30T20:14:24.830+0800 I REPL [replication-28] Starting replication reporter thread
    2020-06-30T20:14:24.830+0800 I REPL [rsBackgroundSync] could not find member to sync from
    2020-06-30T20:14:24.831+0800 I REPL [rsSync] transition to SECONDARY from RECOVERING
    2020-06-30T20:14:34.835+0800 I REPL [rsBackgroundSync] sync source candidate: 10.10.10.2:110
    2020-06-30T20:14:34.837+0800 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 10.10.10.2:110
    2020-06-30T20:14:34.840+0800 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 10.10.10.2:110, took 3ms (2 connections now open to 10.10.10.2:110)

    Connecting with the mongo shell:

    mongo --host $MONGODB_IP --port $MONGODB_PORT --username root --password $MONGODB_PASS --authenticationDatabase admin

    Status during the sync:

    rs0:PRIMARY> rs.status();
    {
        "set" : "rs0",
        "date" : ISODate("2020-06-30T12:11:18.630Z"),
        "myState" : 1,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "optimes" : {
            "lastCommittedOpTime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
            "readConcernMajorityOpTime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
            "appliedOpTime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
            "durableOpTime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) }
        },
        "members" : [
            {
                "_id" : 0,
                "name" : "10.10.10.1:110",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 2194,
                "optime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
                "optimeDate" : ISODate("2020-06-30T12:11:13Z"),
                "electionTime" : Timestamp(1593516951, 1),
                "electionDate" : ISODate("2020-06-30T11:35:51Z"),
                "configVersion" : 4,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "10.10.10.2:110",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 2127,
                "optime" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
                "optimeDurable" : { "ts" : Timestamp(1593519073, 1), "t" : NumberLong(7) },
                "optimeDate" : ISODate("2020-06-30T12:11:13Z"),
                "optimeDurableDate" : ISODate("2020-06-30T12:11:13Z"),
                "lastHeartbeat" : ISODate("2020-06-30T12:11:18.429Z"),
                "lastHeartbeatRecv" : ISODate("2020-06-30T12:11:18.332Z"),
                "pingMs" : NumberLong(0),
                "syncingTo" : "10.10.10.1:110",
                "configVersion" : 4
            },
            {
                "_id" : 2,
                "name" : "10.10.10.3:110",
                "health" : 1,
                "state" : 5,
                "stateStr" : "STARTUP2",
                "uptime" : 2059,
                "optime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
                "optimeDurable" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
                "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
                "lastHeartbeat" : ISODate("2020-06-30T12:11:17.281Z"),
                "lastHeartbeatRecv" : ISODate("2020-06-30T12:11:17.007Z"),
                "pingMs" : NumberLong(1),
                "syncingTo" : "10.10.10.2:110",
                "configVersion" : 4
            }
        ],
        "ok" : 1,
        "operationTime" : Timestamp(1593519073, 1),
        "$clusterTime" : {
            "clusterTime" : Timestamp(1593519073, 1),
            "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) }
        }
    }

    After the sync completed:

    rs0:PRIMARY> rs.status();
    {
        "set" : "rs0",
        "date" : ISODate("2020-06-30T12:16:56.590Z"),
        "myState" : 1,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "optimes" : {
            "lastCommittedOpTime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
            "readConcernMajorityOpTime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
            "appliedOpTime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
            "durableOpTime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) }
        },
        "members" : [
            {
                "_id" : 0,
                "name" : "10.10.10.1:110",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 2532,
                "optime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
                "optimeDate" : ISODate("2020-06-30T12:16:53Z"),
                "electionTime" : Timestamp(1593516951, 1),
                "electionDate" : ISODate("2020-06-30T11:35:51Z"),
                "configVersion" : 4,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "10.10.10.2:110",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 2465,
                "optime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
                "optimeDurable" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
                "optimeDate" : ISODate("2020-06-30T12:16:53Z"),
                "optimeDurableDate" : ISODate("2020-06-30T12:16:53Z"),
                "lastHeartbeat" : ISODate("2020-06-30T12:16:56.568Z"),
                "lastHeartbeatRecv" : ISODate("2020-06-30T12:16:56.474Z"),
                "pingMs" : NumberLong(0),
                "syncingTo" : "10.10.10.1:110",
                "configVersion" : 4
            },
            {
                "_id" : 2,
                "name" : "10.10.10.3:110",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 2396,
                "optime" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
                "optimeDurable" : { "ts" : Timestamp(1593519413, 1), "t" : NumberLong(7) },
                "optimeDate" : ISODate("2020-06-30T12:16:53Z"),
                "optimeDurableDate" : ISODate("2020-06-30T12:16:53Z"),
                "lastHeartbeat" : ISODate("2020-06-30T12:16:55.521Z"),
                "lastHeartbeatRecv" : ISODate("2020-06-30T12:16:55.932Z"),
                "pingMs" : NumberLong(0),
                "syncingTo" : "10.10.10.2:110",
                "configVersion" : 4
            }
        ],
        "ok" : 1,
        "operationTime" : Timestamp(1593519413, 1),
        "$clusterTime" : {
            "clusterTime" : Timestamp(1593519413, 1),
            "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) }
        }
    }

    rs0:PRIMARY> db.getReplicationInfo();
    {
        "logSizeMB" : 4155.82568359375,
        "usedMB" : 4126.77,
        "timeDiff" : 54019,
        "timeDiffHours" : 15.01,
        "tFirst" : "Tue Jun 30 2020 05:16:51 GMT+0800 (CST)",
        "tLast" : "Tue Jun 30 2020 20:17:10 GMT+0800 (CST)",
        "now" : "Tue Jun 30 2020 20:17:14 GMT+0800 (CST)"
    }

    rs0:PRIMARY> db.printSlaveReplicationInfo();
    source: 10.10.10.2:110
        syncedTo: Tue Jun 30 2020 20:25:33 GMT+0800 (CST)
        10 secs (0 hrs) behind the primary
    source: 10.10.10.3:110
        syncedTo: Tue Jun 30 2020 20:25:43 GMT+0800 (CST)
        0 secs (0 hrs) behind the primary
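    The "behind the primary" figures can also be derived from the `members` array of the `rs.status()` output by comparing each member's `optimeDate` with the primary's. The sketch below is a hedged illustration: `lag_behind_primary` is a hypothetical helper, and the member list is a hand-copied subset of the post-sync output above. With pymongo, the same structure would presumably come from `client.admin.command("replSetGetStatus")["members"]`; that call is not exercised here.

```python
from datetime import datetime, timezone


def lag_behind_primary(members):
    """Map each non-primary member name to its lag in seconds."""
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in member list")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m is not primary
    }


# optimeDate shared by all three members in the post-sync rs.status() output.
synced_at = datetime(2020, 6, 30, 12, 16, 53, tzinfo=timezone.utc)
AFTER_SYNC = [
    {"name": "10.10.10.1:110", "stateStr": "PRIMARY", "optimeDate": synced_at},
    {"name": "10.10.10.2:110", "stateStr": "SECONDARY", "optimeDate": synced_at},
    {"name": "10.10.10.3:110", "stateStr": "SECONDARY", "optimeDate": synced_at},
]

print(lag_behind_primary(AFTER_SYNC))
# {'10.10.10.2:110': 0.0, '10.10.10.3:110': 0.0}
```

    A member that is still in STARTUP2 reports an `optimeDate` of 1970-01-01, as in the first status output, so the lag computed for it is not meaningful until the initial sync finishes.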

    Finally, restart the cmdb services one server at a time.
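    For reference, step 3 of the repair procedure (moving the faulty node's data aside while keeping only mongo.key) can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact commands used at the time: the function name `move_data_aside` and all paths are hypothetical, and it must only be run after the MongoDB service on that node has been stopped.

```python
import shutil
import tempfile
from pathlib import Path


def move_data_aside(data_dir, backup_dir, keep=("mongo.key",)):
    """Move everything except the files in `keep` from data_dir to backup_dir."""
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    # Refuse a backup folder inside the data path: the data path has to end
    # up containing nothing but mongo.key.
    if data_dir.resolve() in backup_dir.resolve().parents:
        raise ValueError("backup_dir must be outside data_dir")
    # The folder is "newly created", so fail if it already exists.
    backup_dir.mkdir(parents=True, exist_ok=False)
    moved = []
    for entry in sorted(data_dir.iterdir()):
        if entry.name in keep:
            continue
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return moved


if __name__ == "__main__":
    # Demonstration on a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "mongodb-data"
        data.mkdir()
        for name in ("mongo.key", "WiredTiger", "collection-0.wt"):
            (data / name).write_text("placeholder")
        print(move_data_aside(data, Path(tmp) / "mongodb-data-old"))
        print([p.name for p in data.iterdir()])  # ['mongo.key']
```

    Moving the files instead of deleting them keeps the old data available until the resynced node has been confirmed healthy.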
