nginx大量返回502排查过程

    技术2022-07-13  89

    排查过程

    1,观察问题时间流量,大量5xx,每秒5000个,实际work的qps很少,几百。

    2,登录机器,观察check health status,除一台健康检查曾经down过之外,其他均正常。

    3,观察日志,有大量no live upstreams while connecting to upstream,但该日志代表所有后端健康检查全部失败,与健康检查结果不符合。

    4,看代码逻辑,在健康检查没问题的情况下,打印no live代表返回了NGX_BUSY状态码,联想我们之前测过的例子,怀疑是不断的connect failed和upstream_next之后返回no live。

    5,排查日志,没有connect() failed对应,该日志是connect失败一定会打印的日志,因此,该怀疑有问题。

    6,继续观察日志,发现有大量的upstream timed out,观察代码发现,调用upstream_next不止有connect失败之后,在proxy_read_timeout(1m)之后,也会继续试用其他的rs,而该日志正是单个rs proxy_read_timeout超时之后,打印的。

    static void

    ngx_http_upstream_send_request(ngx_http_request_t *r, ngx_http_upstream_t *u,

        ngx_uint_t do_write)

    {

    if (!u->request_body_sent) {

            u->request_body_sent = 1;

     

            if (u->header_sent) {

                return;

            }

     

            ngx_add_timer(c->read, u->conf->read_timeout);

     

            if (c->read->ready) {

                ngx_http_upstream_process_header(r, u);

                return;

            }

        }

    }

     

     

    static void

    ngx_http_upstream_process_header(ngx_http_request_t *r, ngx_http_upstream_t *u)

    {

    c->log->action = "reading response header from upstream";

     

        if (c->read->timedout) {

            ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_TIMEOUT);

            return;

        }

    }

     

     

    static void

    ngx_http_upstream_next(ngx_http_request_t *r, ngx_http_upstream_t *u,

        ngx_uint_t ft_type)

    {

    if (u->peer.sockaddr) {

     

            if (u->peer.connection) {

                u->state->bytes_sent = u->peer.connection->sent;

            }

     

            if (ft_type == NGX_HTTP_UPSTREAM_FT_HTTP_403

                || ft_type == NGX_HTTP_UPSTREAM_FT_HTTP_404)

            {

                state = NGX_PEER_NEXT;

     

            } else {

                state = NGX_PEER_FAILED;

            }

    // 调用ngx_http_upstream_free_round_robin_peer,将该peer disable

            u->peer.free(&u->peer, u->peer.data, state);

            u->peer.sockaddr = NULL;

        }

    if (ft_type == NGX_HTTP_UPSTREAM_FT_TIMEOUT) {

            ngx_log_error(NGX_LOG_ERR, r->connection->log, NGX_ETIMEDOUT,

                          "upstream timed out");

        }

    }

     

    7,curl测试,在后端两台超时rs,健康检查成功的条件下,只能打印两条upstream timed out,不会打印no live upstreams 。

    8,改用长连接进行测试,在第七步其他条件不变的情况下,成功复现,打印两条upstream timed out之后,后续的所有请求都会打印no live upstreams。

    9,根据现象排查代码逻辑,单个rs proxy_read_timeout之后,会将代表这个rs的peer的fails++:

     

    void

    ngx_http_upstream_free_round_robin_peer(ngx_peer_connection_t *pc, void *data,

        ngx_uint_t state)

    {

    if (state & NGX_PEER_FAILED) {

            now = ngx_time();

     

            peer->fails++;

            peer->accessed = now;

            peer->checked = now;

     

     

    if (peer->max_fails) {

                peer->effective_weight -= peer->weight / peer->max_fails;

     

                if (peer->fails >= peer->max_fails) {

                    ngx_log_error(NGX_LOG_WARN, pc->log, 0,

                                  "upstream server temporarily disabled");

                }

            }

    }

     

    而在其后的upstream_next,调用ngx_http_upstream_get_peer时,会进行判断:

    默认条件下,max_fails是1,fail_timeout是10,10秒内,被检查失败的rs没有被选择出来作为peer的能力

    if (peer->max_fails

                && peer->fails >= peer->max_fails

                && now - peer->checked <= peer->fail_timeout)

            {

                continue;

            }

    两个peer都continue,之后便会返回busy,在10s内,return NGX_BUSY

     

    10,每个peer对应一个rs,存放在ngx_http_upstream_srv_conf_t中,每个request在ngx_http_upstream_init_round_robin_peer的时候会指向已有的配置,等待后续使用,因此,该问题一旦将某个rs对应的peer设置为fail,fail timeout之前该进程下该rs相当于被摘除。

    ngx_int_t

    ngx_http_upstream_init_round_robin_peer(ngx_http_request_t *r,

        ngx_http_upstream_srv_conf_t *us)

    {

    rrp = r->upstream->peer.data;

    rrp->peers = us->peer.data;

     

    }

    Processed: 0.011, SQL: 9