Redis issue: connect timeout or command timeout


    When clients see connect timeout or command timeout errors, work through the following causes; this article focuses on cause 4: 1. network problems; 2. slow queries; 3. oversized values; 4. AOF rewrite.
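    As a quick triage sketch for the first three causes (assuming redis-cli access to the instance; host and port below are placeholders), before turning to AOF rewrite:

        # 1. Network: measure round-trip latency from the client host
        redis-cli -h <redis-host> -p 6379 --latency
        # 2. Slow queries: show the most recent slow log entries
        redis-cli -h <redis-host> -p 6379 SLOWLOG GET 10
        # 3. Oversized values: sample the keyspace for large keys
        redis-cli -h <redis-host> -p 6379 --bigkeys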

    Investigating the Redis server

    Redis configuration: auto-aof-rewrite-percentage: 100, no-appendfsync-on-rewrite: no, appendfsync: everysec
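    These settings can be confirmed on the running instance (a sketch; the names are the standard redis.conf directives):

        redis-cli CONFIG GET auto-aof-rewrite-percentage
        redis-cli CONFIG GET no-appendfsync-on-rewrite
        redis-cli CONFIG GET appendfsync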

    1. Check the Redis log, which shows the following:

    20949:M 03 Jul 12:28:02.956 * Starting automatic rewriting of AOF on 100% growth
    20949:M 03 Jul 12:28:03.080 * Background append only file rewriting started by pid 6394
    20949:M 03 Jul 12:30:06.777 * Background AOF buffer size: 80 MB
    20949:M 03 Jul 12:30:14.046 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
    20949:M 03 Jul 12:30:15.408 * Background AOF buffer size: 180 MB
    20949:M 03 Jul 12:30:24.953 * Background AOF buffer size: 280 MB
    20949:M 03 Jul 12:30:41.336 * AOF rewrite child asks to stop sending diffs.
    6394:C 03 Jul 12:30:41.336 * Parent agreed to stop sending diffs. Finalizing AOF...
    6394:C 03 Jul 12:30:41.336 * Concatenating 97.17 MB of AOF diff received from parent.
    6394:C 03 Jul 12:30:46.735 * SYNC append only file rewrite performed
    6394:C 03 Jul 12:30:46.819 * AOF rewrite: 542 MB of memory used by copy-on-write
    20949:M 03 Jul 12:30:46.958 * Background AOF rewrite terminated with success
    20949:M 03 Jul 12:30:59.909 * Residual parent diff successfully flushed to the rewritten AOF (243.96 MB)
    20949:M 03 Jul 12:30:59.911 * Background AOF rewrite finished successfully
    20949:M 03 Jul 12:31:00.047 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

    From the log we can see: 1. an AOF rewrite was triggered; 2. the AOF fsync took a long time to complete because the disk was busy; 3. Redis went ahead and wrote the AOF buffer without waiting for the fsync to complete, and the blocked write stalled the processing of other commands.

    First, recall that Redis processes commands in a single thread. With AOF persistence enabled, after handling each event Redis calls write(2) to append the changes from the aof buffer to the AOF file; if that write(2) blocks, Redis cannot process any other command. On Linux, if fdatasync(2) is in progress on a file, flushing its buffer to the physical disk, a concurrent write(2) to the same file is blocked, so the whole Redis process is blocked and cannot serve other commands. When disk I/O is busy (for example during an AOF rewrite or an RDB flush), the fdatasync(2) of the aof buffer takes a long time, the write(2) to the AOF file blocks behind it, and Redis in turn blocks other commands.

    Redis's flush strategy allows for this: when it notices that an fdatasync(2) is still running against the AOF file, it skips the write(2) and keeps the data in its buffer to avoid being blocked. If this situation persists for more than two seconds, however, it bites the bullet and issues the write(2) even though Redis will be blocked. At that point the telltale log line is printed: "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis." and the aof_delayed_fsync counter (the number of times Redis has been blocked because fsync(2) stalled write(2)) is incremented by one.

    Therefore the most precise statement about possible data loss with appendfsync set to everysec is: if an fdatasync(2) has been running for a long time and Redis shuts down unexpectedly, no more than two seconds of data can be lost from the file; if fdatasync(2) is running normally, an unexpected Redis shutdown loses nothing, and only an operating-system crash can lose less than one second of data.
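    The aof_delayed_fsync counter and the rewrite status can be watched directly (a sketch; both fields come from INFO persistence):

        # Counter incremented each time the warning above is logged, i.e. Redis was blocked on the pending fsync
        redis-cli INFO persistence | grep aof_delayed_fsync
        # Whether an AOF rewrite is currently in progress
        redis-cli INFO persistence | grep aof_rewrite_in_progress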

    2. Remedies:

    1. Tune the kernel parameter vm.dirty_bytes (once dirty data reaches this threshold the system syncs it to disk automatically, instead of flushing a large burst of dirty data once per second):
        echo "vm.dirty_bytes=4194304" >> /etc/sysctl.conf
        sysctl -p
    2. Disable RDB or AOF.
    3. Disable RDB and adjust the Redis parameters (see the sketch after this list):
        appendfsync: always (write the aof buffer to the file and fsync(2) after every event) / everysec (write the aof buffer to the file after every event, fsync(2) once per second) / no (write the aof buffer to the file after every event, let the operating system decide when to fsync(2))
        no-appendfsync-on-rewrite: yes (skip the fsync(2) of the aof buffer while an AOF rewrite is running; up to 30 seconds of data may be lost)
    4. Disable RDB and AOF on the master and run persistence only on the replica. Note: the master must not be allowed to restart automatically after a failure, otherwise it will come back empty and replicate that empty dataset, wiping all data.
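    A minimal sketch of remedy 3 applied at runtime (the same settings should also be persisted in redis.conf); this accepts that up to 30 seconds of writes can be lost if Redis crashes during a rewrite:

        # Disable RDB snapshotting
        redis-cli CONFIG SET save ""
        # Keep the once-per-second fsync policy
        redis-cli CONFIG SET appendfsync everysec
        # Skip fsync of the AOF while a rewrite (or RDB save) is in progress
        redis-cli CONFIG SET no-appendfsync-on-rewrite yes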

    A good write-up of a similar incident: https://blog.csdn.net/crisschan/article/details/51514087
