In a Kubernetes cluster, an Ingress solution is typically used as the unified proxy for traffic entering the cluster, and the most common choices are traefik and nginx. Just like a traditional Nginx acting as an in-house reverse proxy and load balancer, different scenarios may require dedicated configuration; otherwise you will run into strange error status codes. In this article we walk through the 500 and 502 errors we hit with traefik.
Before we start, let's review a few error status codes that frequently come up when Nginx is used as a reverse proxy:
- 499: the client closed the connection. Typically the request did not return within the client's own deadline, so the client actively closed the connection and received no data (Nginx records a 499). Usually a client-side timeout.
- 500: internal server error. The server hit an unexpected error and could not finish handling the request, usually caused by an exception (i.e. a bug) in the backend business logic.
- 502: bad gateway. The gateway did not get the expected response from the upstream (the upstream returned no data, or data that violates the protocol), so the gateway concludes it has nothing useful to return and reports a gateway error. Usually the backend server is down or its business logic timed out.
- 504: gateway timeout. The gateway did not receive the upstream's response in time. Usually the Nginx gateway, acting as a client requesting the upstream service, hit its own timeout; the upstream, meanwhile, keeps executing until it finishes. (A timeout of Nginx in its client role.)

```shell
# The 499 case in practice: the client sets a timeout of N seconds, but the request
# actually needs M seconds on the server side (M > N); the client loses patience and hangs up.
# For 499, either optimize the backend logic or adjust the nginx parameter below:
$ cat nginx.conf
proxy_ignore_client_abort on;

$ curl -i -m 3 http://127.0.0.1/hello.php

# The 502 case is usually that the server behind the Nginx gateway is simply down
# (so no upstream response can be obtained).
# It can also be that the upstream's real execution time exceeds the upstream's own
# timeout (e.g. php-fpm.conf sets request_terminate_timeout 5s but the business logic
# needs 7s): the upstream hits a "business logic timeout" and returns broken data to
# the Nginx gateway.
# Typical backend-related error logs for 502:
#   recv() failed (104: Connection reset by peer) while reading response header from upstream
#   upstream prematurely closed connection while reading response header from upstream
#   connect() failed (111: Connection refused) while connecting to upstream
# Overall, a 502 usually means the backend is down, or so overloaded it temporarily
# cannot respond.
# Increasing proxy_read_timeout on the nginx side can mitigate it for a while:
$ cat nginx.conf
proxy_read_timeout 20s;

# The 504 case: in client->Nginx->Backend, Nginx acts as a client to the Backend,
# but the Backend has not finished when Nginx exceeds its own client-side timeout,
# producing the 504 (and what does the end client see then?).
# For 504, the usual approach is to optimize the Backend logic to cut execution time;
# alternatively, increase Nginx's timeouts in its client role.
# Remember that when Nginx acts as a client it is playing the proxy role; configure:
$ cat nginx.conf
uwsgi_connect_timeout 5;
uwsgi_send_timeout 5;
uwsgi_read_timeout 5;
fastcgi_read_timeout 5;
fastcgi_send_timeout 5;
proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;
```

Our cluster's current traefik configuration is as follows:
```yaml
# traefik configmap
$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http","https"]
    debug = false
    logLevel = "INFO"
    InsecureSkipVerify = true
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    [web]
    address = ":8080"
    [kubernetes]
    [metrics]
      [metrics.prometheus]
      buckets=[0.1,0.3,1.2,5.0]
      entryPoint = "traefik"
    [ping]
    entryPoint = "http"

# traefik DaemonSet
$ cat traefik-ds-v1.7.16.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: admin
          containerPort: 8080
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        args:
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /logdata
          name: access-log
      volumes:
      - configMap:
          name: traefik-config
        name: config
      - name: access-log
        hostPath:
          path: /opt/logs/ingress/
---
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-service
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
  - protocol: TCP
    port: 80
    name: web
  - protocol: TCP
    port: 8080
    name: admin
```

After the service was deployed everything looked fine, but a quick load test showed that some requests were failing.
```shell
# Load test with ab
# 20 concurrent clients, 200 requests total; 7 requests failed
$ ab -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
...
Benchmarking data-api.bgbiao.cn (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests

Server Software:
Server Hostname:        data-api.bgbiao.cn
Server Port:            80

Document Path:          /account/users/login/
Document Length:        250 bytes

Concurrency Level:      20
Time taken for tests:   1.340 seconds
Complete requests:      200
Failed requests:        7
   (Connect: 0, Receive: 0, Length: 7, Exceptions: 0)
Write errors:           0
Non-2xx responses:      7
Total transferred:      91371 bytes
Total body sent:        46400
HTML transferred:       48387 bytes
Requests per second:    149.21 [#/sec] (mean)
Time per request:       134.035 [ms] (mean)
Time per request:       6.702 [ms] (mean, across all concurrent requests)
Transfer rate:          66.57 [Kbytes/sec] received
                        33.81 kb/s sent
                        100.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.1      1       1
Processing:     2  116  27.8    114     179
Waiting:        2  116  27.8    114     179
Total:          3  117  27.8    116     180

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    121
  75%    125
  80%    129
  90%    154
  95%    167
  98%    173
  99%    175
 100%    180 (longest request)

# Save the verbose output to a file for a quick breakdown
# In this 200-request run, 4 requests failed: 2 with 500 and 2 with 502
$ ab -v 10 -c 20 -n 200 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/ > ab-log.txt
$ cat ab-log.txt | grep HTTP | sort | uniq -c
    196 HTTP/1.0 200 OK
      2 HTTP/1.0 500 Internal Server Error
      2 HTTP/1.0 502 Bad Gateway
      1 POST /account/users/login/ HTTP/1.0
```

Earlier we covered the causes of 500 and 502 in the Nginx scenario and how to address them; in a Kubernetes cluster, traefik plays much the same role as Nginx.
At the beginning we showed the cluster's traefik configuration. For an SRE, any production service must come with observability data, so by default we persist traefik's access log and process log (access.log and traefik.log respectively) and expose traefik's Prometheus metrics endpoint.
For the load test above, we caught the following anomalies in the access log:
```shell
$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
```

These are the same status codes as the failed requests in the load test: 500 and 502. There are more than 4 of them here, but that is not important for now.
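With larger logs it also helps to tally the errors per backend pod, so a single failing upstream would stand out immediately. A minimal sketch (the regex assumes the traefik v1.x access-log layout shown above; `tally` is a hypothetical helper, not part of traefik):

```python
import re
from collections import Counter

# Tally (backend URL, status code) pairs from traefik v1.x access-log lines
LINE = re.compile(r'HTTP/1\.\d" (\d{3}) .* "(http://[\d.:]+)" \d+ms$')

def tally(lines):
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m:
            # group(2) is the backend the request was forwarded to,
            # group(1) is the status code traefik returned
            counts[(m.group(2), m.group(1))] += 1
    return counts

sample = [
    '192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms',
    '192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms',
]
counts = tally(sample)
```

In our case the errors were spread across both backend pods, which is a first hint that the problem is not a single bad pod.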
For most people, seeing a 500 or 502 at the proxy layer triggers the reflex of "it must be the upstream service". That guess, however, can be ruled out quickly, roughly as follows:
- While the load test is producing 500s and 502s, keep requesting the upstream service directly.
- Run the same load-test parameters directly against the upstream service.
- Deploy the upstream service on a standalone ECS instance and load-test it with the same parameters.

After these three tests we had essentially ruled out the upstream service, so we turned to traefik itself: did it have a performance problem or a misconfigured parameter?
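The second check above (same concurrency, straight at the upstream) can be sketched like this. A local stub server stands in for the real upstream, since the point is only the method: fire the same concurrency and count non-200 responses.

```python
import http.server
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Stub upstream that always answers 200 (stand-in for the real service)
class OkHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep output quiet
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/"

def hit(_):
    try:
        return urllib.request.urlopen(url, timeout=5).getcode()
    except Exception:
        return 502  # treat any connection error as a failure

# Replay the ab parameters: -c 20, -n 200
with ThreadPoolExecutor(max_workers=20) as pool:
    codes = list(pool.map(hit, range(200)))
failed = sum(1 for c in codes if c != 200)
```

Hitting the real upstream this way produced zero failures, while the same load through traefik failed, which is what pinned the problem on the proxy layer.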
Change the log level in traefik to DEBUG:
```shell
$ cat traefik-ds-v1.7.16.yaml
....
        args:
        - --api
        - --kubernetes
        - --logLevel=DEBUG
....
```

The logs then showed the following:
```shell
# 500-related logs
time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276ms"

# 502-related logs
time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677ms"
```

From the 500 message we can roughly tell that an idle connection was closed out from under the traefik process, while the 502 was caused by an EOF: the connection appears to have been cut before the response data was fully read, so traefik returned 502. This situation is common with Nginx, where the relevant parameters are easy to tune (as mentioned at the start of this article), but for traefik's deployment model and parameters we had to spend some time in the documentation.
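The "server closed idle connection" / EOF pair is exactly the race you get when a proxy reuses a keep-alive connection that the backend has just closed. A toy reproduction in plain Python (no traefik involved; the tiny socket server below is a hypothetical stand-in for a backend with a short keep-alive idle timeout):

```python
import http.client
import socket
import threading
import time

# Toy backend: answers one request with "Connection: keep-alive",
# then silently closes the socket after a short idle window
def backend(srv):
    conn, _ = srv.accept()
    conn.recv(65536)  # read the first request
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
                 b"Connection: keep-alive\r\n\r\nok")
    time.sleep(0.2)   # idle timeout on the backend side
    conn.close()      # backend drops the idle connection

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=backend, args=(srv,), daemon=True).start()

c = http.client.HTTPConnection("127.0.0.1", srv.getsockname()[1])
c.request("GET", "/")
r1 = c.getresponse()
r1.read()             # first request succeeds over the fresh connection

time.sleep(0.5)       # wait past the backend's idle timeout
try:
    c.request("GET", "/")   # the client (like traefik) reuses the dead connection
    c.getresponse().read()
    second = "ok"
except (ConnectionError, http.client.HTTPException):
    second = "EOF"          # what traefik surfaces as 500/502
```

The mitigation is to make the proxy's idle timeout shorter than the backend's (or to disable connection reuse, or enable retries), which is precisely what the fixes below do.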
We then searched traefik's issues on GitHub and found that this problem had come up many times:
- Traefik return 500 internal error - no 500 on backend
- Sporadic 502 response only when running through traefik

One issue is about the 500s and the other about the 502s, but the two usually come as a pair.
The first issue points out that traefik enables HTTP keep-alive by default for its reverse-proxy connections, whereas the Python application had keep-alive disabled. Since our test service above also happens to be written in Python, we adjusted that parameter first.
```shell
# Enable keep-alive in uwsgi.ini
$ cat uwsgi.ini
[uwsgi]
http = 0.0.0.0:8080
http-keepalive = 1
chdir = /opt/app/
wsgi-file = /opt/app/main.py
callable = app
stats = 0.0.0.0:8081
processes = 2
threads = 10

# After restarting the app, run the load test again
# The 500s and 502s seen earlier are essentially gone
# 200 concurrent clients, 10,000 requests, 0 failures, ~1 minute total (QPS only ~160 😸)
$ ab -c 200 -n 10000 -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
........
Concurrency Level:      200
Time taken for tests:   59.323 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      4670000 bytes
Total body sent:        2320000
HTML transferred:       2500000 bytes
Requests per second:    168.57 [#/sec] (mean)
Time per request:       1186.454 [ms] (mean)
Time per request:       5.932 [ms] (mean, across all concurrent requests)
Transfer rate:          76.88 [Kbytes/sec] received
                        38.19 kb/s sent
                        115.07 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    3  38.6      1    1035
Processing:   101  942 1457.7    857   32684
Waiting:      101  942 1457.7    857   32684
Total:        102  945 1458.0    861   32685

# p99 reaches 7.3s
Percentage of the requests served within a certain time (ms)
  50%    861
  66%   1033
  75%   1136
  80%   1191
  90%   1886
  95%   2281
  98%   4209
  99%   7399
 100%  32685 (longest request)
```

Enabling HTTP keep-alive at the application layer worked around the 500 and 502 problem. Can it also be addressed with parameters on the traefik side? The second issue above actually mentions that as well.
That is, modify the following traefik parameters and redeploy the whole traefik cluster:
```shell
# Disable traefik's keep-alive (connection pooling). The default is 200; a value of 0
# falls back to Go's DefaultMaxIdleConnsPerHost.
# keep-alive exists to reuse connections and reduce open files, but with large
# numbers of short-lived connections this reuse can produce the errors above
--maxidleconnsperhost=-1

# Set the number of retries (default is 0)
--retry.attempts=10

# This flag has been replaced by --respondingtimeouts.idletimeout (default 3m0s)
--idletimeout=60s

# Default is 0s
--forwardingtimeouts.responseheadertimeout=60s

# traefik idle connection timeout
$ ./traefik --help | grep idletimeout
    --idletimeout                              (Deprecated) maximum amount of time an idle (keep-alive) connection will remain (default "0s")
    --respondingtimeouts.idletimeout           IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")

# response timeout parameters
$ ./traefik --help | grep respondingtimeouts
    --respondingtimeouts                       Timeouts for incoming requests to the Traefik instance (default "true")
    --respondingtimeouts.idletimeout           IdleTimeout is the maximum amount duration an idle (keep-alive) connection will (default "3m0s")
    --respondingtimeouts.readtimeout           ReadTimeout is the maximum duration for reading the entire request, including (default "0s")
    --respondingtimeouts.writetimeout          WriteTimeout is the maximum duration before timing out writes of the response. (default "0s")

# forwarding timeouts
$ ./traefik --help | grep forwardingtimeouts
    --forwardingtimeouts                       Timeouts for requests forwarded to the backend servers (default "true")
    --forwardingtimeouts.dialtimeout           The amount of time to wait until a connection to a backend server can be (default "30s")
    --forwardingtimeouts.responseheadertimeout The amount of time to wait for a server's response headers after fully writing (default "0s")
```

The final traefik arguments look like this:
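What `--retry.attempts` buys can be illustrated with a small sketch: when a forward fails because the reused connection turned out to be dead, retry instead of surfacing the error to the client. `forward_with_retry` and `flaky` below are hypothetical stand-ins, not traefik code:

```python
# Retry a forward that may fail on a stale keep-alive connection,
# like traefik's --retry.attempts=10
def forward_with_retry(send_once, attempts):
    last = None
    for _ in range(attempts):
        try:
            return send_once()
        except ConnectionError as exc:  # the "EOF on dead connection" case
            last = exc
    raise last

calls = {"n": 0}

def flaky():
    # Simulated backend: the first two tries hit a dead connection,
    # the third (over a fresh connection) succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("EOF")
    return 200

result = forward_with_retry(flaky, attempts=10)
```

Note that blindly retrying is only safe for idempotent requests; for a POST like our login endpoint, fixing the timeouts is the cleaner solution.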
```yaml
        # Depending on your situation, consider also disabling keep-alive
        # by adding: --maxidleconnsperhost=-1
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        - --forwardingtimeouts.responseheadertimeout=60s
        - --respondingtimeouts.idletimeout=180s
        - --retry.attempts=10
        - --idletimeout=180s
```

Once the domain resolved directly to the traefik nodes no longer produced 502s or 500s, we attached the traefik nodes to an Alibaba Cloud internal SLB, at which point a puzzling 503 problem appeared.
A quick load test after going through the SLB (the internal one is the free low-spec SLB 😹):
```shell
# client -> ali-slb -> traefik -> pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt
$ cat slb-log.txt | grep 'HTTP/1.1 200' | wc -l
1322
$ cat slb-log.txt | grep 'HTTP/1.1 503' | wc -l
678

# client -> traefik -> pods
$ ab -v 10 -c 200 -n 2000 -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests
$ cat traefik-log.txt | grep 'HTTP/1.0 200' | wc -l
2000
```

We later found the explanation in the Alibaba Cloud documentation: the SLB enforces rate limits that depend on the instance specification, and when a limit is exceeded it returns 503 to the client.
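The SLB's behavior resembles a token-bucket limiter sized by instance spec: requests beyond the allowance are answered with 503 instead of being forwarded. A toy model of that idea (the rate and burst numbers are made up for illustration; real SLB specs differ):

```python
# Toy token bucket: refill at `rate` tokens/sec up to `burst`;
# each forwarded request costs one token, otherwise answer 503
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # refill according to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200
        return 503

bucket = TokenBucket(rate=1000, burst=100)  # hypothetical low-spec instance
# 200 requests arriving in the same instant: only the burst of 100 get through
codes = [bucket.allow(now=0.0) for _ in range(200)]
```

This matches what we observed: under a concurrency burst, a fixed fraction of requests bounce off the SLB with 503 while traefik itself, tested directly, serves all of them.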
Alibaba Cloud TPS-503
