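The symptom: the service IP answers ping, but the application cannot be reached, and the kube-proxy component logs errors (see the log below). Searching for the error text shows that the problem only appears after upgrading to Kubernetes 1.18. An issue has already been filed on the Kubernetes GitHub, and the maintainers' analysis points to a likely cause: the IPVS code involved is relatively new and needs support from the kernel version. I hit this on the latest CentOS 7 release, and yum update kernel did not fix it.
The conclusion first: replacing CentOS 7.x with CentOS 8.2 solves the problem. I have tried it myself.
What follows is an analysis of the problem and recommendations for fixing it in a production environment.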
The log output is as follows:
    [root@node1 logs]# tail -20 kube-proxy.WARNING
    E0703 13:49:39.609152 1393 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[192 168 50 65 0 0 0 0 0 0 0 0 0 0 0 0]
    E0703 13:49:39.609230 1393 proxier.go:1192] Failed to sync endpoint for service: 10.244.0.1:443/TCP, err: parseIP Error ip=[192 168 50 65 0 0 0 0 0 0 0 0 0 0 0 0]
    E0703 13:49:39.609564 1393 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[172 30 1 75 0 0 0 0 0 0 0 0 0 0 0 0]
    E0703 13:49:39.609621 1393 proxier.go:1192] Failed to sync endpoint for service: 10.244.75.157:80/TCP, err: parseIP Error ip=[172 30 1 75 0 0 0 0 0 0 0 0 0 0 0 0]
Picking out a few key phrases:
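    IPVS
    Failed to list IPVS destinations
    Failed to sync endpoint for service
    parseIP Error
First, try to narrow down the scope of the problem. The errors are reported by the IPVS code in kube-proxy, in two operations: listing IPVS destinations and syncing endpoints for a service. Both fail for the same reason, "parseIP Error". The IPs in the log fail whether they start with 172 or with 192, so the problem is not tied to a particular subnet. The key function to focus on is parseIP.
I found this code on GitHub:
    func parseIP(ip []byte, family uint16) (net.IP, error) {
        var resIP net.IP
        switch family {
        case syscall.AF_INET:
            resIP = (net.IP)(ip[:4])
        case syscall.AF_INET6:
            resIP = (net.IP)(ip[:16])
        default:
            return nil, fmt.Errorf("parseIP Error ip=%v", ip)
        }
        return resIP, nil
    }
The code that calls parseIP: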
    // parse Address after parse AddressFamily incase of parseIP error
    if addressBytes != nil {
        ip, err := parseIP(addressBytes, d.AddressFamily)
        if err != nil {
            return nil, err
        }
        d.Address = ip
    }
Fortunately, a Kubernetes engineer reproduced the problem and posted an analysis with a conclusion:
netlink will try to get d.AddressFamily attribute, but I find the kernel just does not support it...in /usr/include/linux/ip_vs.h (kernel 3.10) No IPVS_DEST_ATTR_ADDR_FAMILY attribute is defined! But in new kernel version, the Destination Attributes is defined...That is why kube-proxy works well on systems with a higher version of the kernel.