1,观察问题时间流量,大量5xx,每秒5000个,实际work的qps很少,几百。
2,登录机器,观察check health status,除一台健康检查曾经down过之外,其他均正常。
3,观察日志,有大量no live upstreams while connecting to upstream,但该日志代表所有后端健康检查全部失败,与健康检查结果不符合。
4,看代码逻辑,在健康检查没问题的情况下,打印no live代表返回了NGX_BUSY状态码,联想我们之前测过的例子,怀疑是不断的connect failed和upstream_next之后返回no live。
5,排查日志,没有connect() failed对应,该日志是connect失败一定会打印的日志,因此,该怀疑有问题。
6,继续观察日志,发现有大量的upstream timed out,观察代码发现,调用upstream_next不止有connect失败之后,在proxy_read_timeout(1m)之后,也会继续试用其他的rs,而该日志正是单个rs proxy_read_timeout超时之后,打印的。
static void
ngx_http_upstream_send_request(ngx_http_request_t *r, ngx_http_upstream_t *u,
ngx_uint_t do_write)
{
if (!u->request_body_sent) {
u->request_body_sent = 1;
if (u->header_sent) {
return;
}
ngx_add_timer(c->read, u->conf->read_timeout);
if (c->read->ready) {
ngx_http_upstream_process_header(r, u);
return;
}
}
}
static void
ngx_http_upstream_process_header(ngx_http_request_t *r, ngx_http_upstream_t *u)
{
c->log->action = "reading response header from upstream";
if (c->read->timedout) {
ngx_http_upstream_next(r, u, NGX_HTTP_UPSTREAM_FT_TIMEOUT);
return;
}
}
static void
ngx_http_upstream_next(ngx_http_request_t *r, ngx_http_upstream_t *u,
ngx_uint_t ft_type)
{
if (u->peer.sockaddr) {
if (u->peer.connection) {
u->state->bytes_sent = u->peer.connection->sent;
}
if (ft_type == NGX_HTTP_UPSTREAM_FT_HTTP_403
|| ft_type == NGX_HTTP_UPSTREAM_FT_HTTP_404)
{
state = NGX_PEER_NEXT;
} else {
state = NGX_PEER_FAILED;
}
// 调用ngx_http_upstream_free_round_robin_peer,将该peer disable
u->peer.free(&u->peer, u->peer.data, state);
u->peer.sockaddr = NULL;
}
if (ft_type == NGX_HTTP_UPSTREAM_FT_TIMEOUT) {
ngx_log_error(NGX_LOG_ERR, r->connection->log, NGX_ETIMEDOUT,
"upstream timed out");
}
}
7,curl测试,在后端两台超时rs,健康检查成功的条件下,只能打印两条upstream timed out,不会打印no live upstreams 。
8,改用长连接进行测试,在第七步其他条件不变的情况下,成功复现,打印两条upstream timed out之后,后续的所有请求都会打印no live upstreams。
9,根据现象排查代码逻辑,单个rs proxy_read_timeout之后,会将代表这个rs的peer的fails++:
void
ngx_http_upstream_free_round_robin_peer(ngx_peer_connection_t *pc, void *data,
ngx_uint_t state)
{
if (state & NGX_PEER_FAILED) {
now = ngx_time();
peer->fails++;
peer->accessed = now;
peer->checked = now;
if (peer->max_fails) {
peer->effective_weight -= peer->weight / peer->max_fails;
if (peer->fails >= peer->max_fails) {
ngx_log_error(NGX_LOG_WARN, pc->log, 0,
"upstream server temporarily disabled");
}
}
}
而在其后的upstream_next,调用ngx_http_upstream_get_peer时,会进行判断:
默认条件下,max_fails是1,fail_timeout是10,10秒内,被检查失败的rs没有被选择出来作为peer的能力
if (peer->max_fails
&& peer->fails >= peer->max_fails
&& now - peer->checked <= peer->fail_timeout)
{
continue;
}
两个peer都continue,之后便会返回busy,在10s内,return NGX_BUSY
10,每个peer对应一个rs,存放在ngx_http_upstream_srv_conf_t中,每个request在ngx_http_upstream_init_round_robin_peer的时候会指向已有的配置,等待后续使用,因此,该问题一旦将某个rs对应的peer设置为fail,fail timeout之前该进程下该rs相当于被摘除。
ngx_int_t
ngx_http_upstream_init_round_robin_peer(ngx_http_request_t *r,
ngx_http_upstream_srv_conf_t *us)
{
rrp = r->upstream->peer.data;
rrp->peers = us->peer.data;
}