A strange problem with LVS + Keepalived + Nginx

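Our project's server architecture is being upgraded. For high availability I decided to use keepalived for hot standby between two LVS servers, with LVS acting as the load balancer for the DB servers and for the front-end Nginx.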

My environment:
VIP 10.8.12.200
DB RealServer1 10.8.12.208
DB RealServer2 10.8.12.209
Tomcat 1 10.8.12.203
Tomcat 2 10.8.12.204
LVS Server1 & Nginx RealServer1 10.8.12.201
LVS Server2 & Nginx RealServer2 10.8.12.202

gateway 10.8.12.254

All of the servers above have a single NIC and run Ubuntu 11.04 Server.

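All of these are virtual machines created with VMware. Because the production environment has a limited number of servers, the LVS Server and the Nginx RealServer are installed on the same machine. ipvsadm and keepalived are installed on 10.8.12.201 (LVS Server1 & Nginx RealServer1) and 10.8.12.202 (LVS Server2 & Nginx RealServer2).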

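I prepared two schemes:
Scheme 1:
The front end uses Nginx as a reverse proxy that also separates static and dynamic content, load balancing to the back-end Tomcat cluster and web servers. At the back end, LVS is the load balancer for the DB servers. keepalived provides the two-node hot standby.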

The configuration file on the keepalived master is as follows:

global_defs {
    router_id Nginx_Id_1
}

vrrp_script Monitor_Nginx {
    script "/usr/local/keepalived/etc/keepalived/scripts/monitor_nginx.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 33
    priority 101
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass 1111
    }

    #VIP
    virtual_ipaddress {
        10.8.12.200
    }

    track_script {
        Monitor_Nginx
    }
}

virtual_server 10.8.12.200 5432 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 0
    protocol TCP

    real_server 10.8.12.208 5432 {
        weight 1
        TCP_CHECK {
            connect_port 5432
            connect_timeout 10
        }
    }

    real_server 10.8.12.209 5432 {
        weight 1
        TCP_CHECK {
            connect_port 5432
            connect_timeout 10
        }
    }
}

The keepalived backup configuration file is omitted here...
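As a rough illustration of how the `weight 2` in vrrp_script interacts with the two priorities: with a positive weight, keepalived adds the weight to the instance priority while the check script exits 0. The sketch below only does that arithmetic. The MASTER priority 101 comes from the configuration above; the BACKUP priority 100 is an assumption, since the backup file is not shown, and `effective_priority` is a hypothetical helper name.

```shell
#!/bin/bash
# Effective VRRP priority for one tracked script.
# Positive weight: added while the check succeeds.
# Negative weight: applied (subtracted) while the check fails.
effective_priority() {
    local base=$1 weight=$2 script_ok=$3   # script_ok: 1 = check passed, 0 = check failed
    if [ "$script_ok" -eq 1 ] && [ "$weight" -gt 0 ]; then
        echo $((base + weight))
    elif [ "$script_ok" -eq 0 ] && [ "$weight" -lt 0 ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

# 101 is the MASTER priority from the post; 100 for the BACKUP is assumed.
echo "both healthy:         master=$(effective_priority 101 2 1) backup=$(effective_priority 100 2 1)"
echo "nginx down on master: master=$(effective_priority 101 2 0) backup=$(effective_priority 100 2 1)"
```

Under these assumptions the master leads 103 to 102 while both checks pass and drops to 101 against 102 when its check fails, which is what lets the backup take over the VIP.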

Routing information on the LVS Server (ipvsadm output):
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.8.12.200:5432 rr
  -> 10.8.12.208:5432             Route   1      0          0
  -> 10.8.12.209:5432             Route   1      0          0
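For reference, keepalived builds this table from the virtual_server block, but the same entries can be expressed as ipvsadm commands. The sketch below only prints the commands rather than running them, so it is safe to try anywhere; `print_ipvs_rules` is a hypothetical helper, and the weight of 1 is taken from the configuration above.

```shell
#!/bin/bash
# Print (do not execute) the ipvsadm commands for one DR-mode TCP virtual service.
# Arguments: VIP, port, scheduler, then one or more realserver addresses.
print_ipvs_rules() {
    local vip=$1 port=$2 sched=$3
    shift 3
    # -A adds a virtual service, -t marks it as TCP, -s picks the scheduler
    echo "ipvsadm -A -t ${vip}:${port} -s ${sched}"
    local rs
    for rs in "$@"; do
        # -a adds a realserver, -g selects direct routing ("Route"), -w sets the weight
        echo "ipvsadm -a -t ${vip}:${port} -r ${rs}:${port} -g -w 1"
    done
}

print_ipvs_rules 10.8.12.200 5432 rr 10.8.12.208 10.8.12.209
```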

The lvs script on the DB RealServers is as follows:

#!/bin/bash
#Description : RealServer Start!

# Must match the VIP on the LVS Server (10.8.12.200, as in the route table for this host)
VIP=10.8.12.200
LVS_TYPE=DR

. /lib/lsb/init-functions

case "$1" in
    start)
        echo "start LVS of REALServer"
        # Bind the VIP to a loopback alias so this host accepts packets addressed to it
        /sbin/ifconfig lo:0 "$VIP" broadcast "$VIP" netmask 255.255.255.255 up
        /sbin/route add -host "$VIP" dev lo:0
        # Keep this host from answering or announcing ARP for the VIP
        echo "1" > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo "2" > /proc/sys/net/ipv4/conf/lo/arp_announce
        echo "1" > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo "2" > /proc/sys/net/ipv4/conf/all/arp_announce
        ;;
    stop)
        /sbin/route del -host "$VIP" dev lo:0
        /sbin/ifconfig lo:0 down
        echo "stop LVS of REALServer"
        # Restore the default ARP behaviour
        echo "0" > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo "0" > /proc/sys/net/ipv4/conf/lo/arp_announce
        echo "0" > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo "0" > /proc/sys/net/ipv4/conf/all/arp_announce
        ;;
    *)
        echo "Usage $0 {start|stop}"
        exit 1
        ;;
esac
exit 0

route on the DB RealServer:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.8.12.200     *               255.255.255.255 UH    0      0        0 lo
10.8.12.0       *               255.255.255.0   U     0      0        0 eth0
default         10.8.12.254     0.0.0.0         UG    100    0        0 eth0
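Besides the route, it can be worth confirming that the ARP settings written by the RealServer script are actually in effect. The following is a minimal sketch of such a check, not part of the original setup: `check_arp_settings` is a hypothetical helper, and the directory argument exists only so the check can be pointed somewhere other than /proc/sys for testing.

```shell
#!/bin/bash
# Report whether arp_ignore=1 and arp_announce=2 are set for lo and all,
# which are the values the RealServer script writes on "start".
# Returns 0 if all four values match, 1 otherwise.
check_arp_settings() {
    local conf_root=${1:-/proc/sys/net/ipv4/conf}
    local iface failed=0
    for iface in lo all; do
        if [ "$(cat "$conf_root/$iface/arp_ignore" 2>/dev/null)" != "1" ]; then
            echo "$iface/arp_ignore is not 1"
            failed=1
        fi
        if [ "$(cat "$conf_root/$iface/arp_announce" 2>/dev/null)" != "2" ]; then
            echo "$iface/arp_announce is not 2"
            failed=1
        fi
    done
    return $failed
}

if check_arp_settings; then
    echo "ARP settings match the RealServer script"
else
    echo "ARP settings differ from the RealServer script"
fi
```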

This scheme tested without problems.

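Scheme 2:
The front end uses LVS as the load balancer for Nginx. Nginx in turn acts as a reverse proxy that also separates static and dynamic content, load balancing to the back-end Tomcat cluster and web servers, and keepalived provides hot standby for the two LVS servers. At the back end, LVS is the load balancer for the DB servers. The LVS Server for the DB and the LVS Server for Nginx are the same one, only on different ports. The LVS Server and the Nginx RealServer share one machine, while the DB RealServers are two other machines.
Compared with Scheme 1, Scheme 2 only adds one more LVS load-balancing layer in front of Nginx; the role of Nginx itself does not change.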

The keepalived master configuration file is as follows:

global_defs {
    router_id Nginx_Id_1
}

vrrp_script Monitor_Nginx {
    script "/usr/local/keepalived/etc/keepalived/scripts/monitor_nginx.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 33
    priority 100
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass 1111
    }

    #VIP
    virtual_ipaddress {
        10.8.12.200
    }

    track_script {
        Monitor_Nginx
    }
}

virtual_server 10.8.12.200 80 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 60
    protocol TCP

    real_server 10.8.12.201 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 10
        }
    }

    real_server 10.8.12.202 80 {
        weight 1
        TCP_CHECK {
            connect_port 80
            connect_timeout 10
        }
    }
}

virtual_server 10.8.12.200 5432 {
    delay_loop 6
    lb_algo rr
    lb_kind DR
    persistence_timeout 0
    protocol TCP

    real_server 10.8.12.208 5432 {
        weight 1
        TCP_CHECK {
            connect_port 5432
            connect_timeout 10
        }
    }

    real_server 10.8.12.209 5432 {
        weight 1
        TCP_CHECK {
            connect_port 5432
            connect_timeout 10
        }
    }
}

As you can see, compared with Scheme 1 the only addition is the LVS configuration for 10.8.12.200 port 80.

The routing information on the LVS Server is now as follows:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.8.12.200:80 rr persistent 60
  -> 10.8.12.201:80               Route   1      0          0
  -> 10.8.12.202:80               Route   1      0          0
TCP  10.8.12.200:5432 rr
  -> 10.8.12.208:5432             Route   1      0          0
  -> 10.8.12.209:5432             Route   1      0          0

Because the LVS Server is also an Nginx RealServer node, I also created the following lvs script on 10.8.12.201 (LVS Server1 & Nginx RealServer1) and 10.8.12.202 (LVS Server2 & Nginx RealServer2):

#!/bin/bash
#Description : RealServer Start!

VIP=10.8.12.200
LVS_TYPE=DR

. /lib/lsb/init-functions

case "$1" in
    start)
        echo "start LVS of REALServer"
        # Bind the VIP to a loopback alias so this host accepts packets addressed to it
        /sbin/ifconfig lo:0 "$VIP" broadcast "$VIP" netmask 255.255.255.255 up
        /sbin/route add -host "$VIP" dev lo:0
        # Keep this host from answering or announcing ARP for the VIP
        echo "1" > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo "2" > /proc/sys/net/ipv4/conf/lo/arp_announce
        echo "1" > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo "2" > /proc/sys/net/ipv4/conf/all/arp_announce
        ;;
    stop)
        /sbin/route del -host "$VIP" dev lo:0
        /sbin/ifconfig lo:0 down
        echo "stop LVS of REALServer"
        # Restore the default ARP behaviour
        echo "0" > /proc/sys/net/ipv4/conf/lo/arp_ignore
        echo "0" > /proc/sys/net/ipv4/conf/lo/arp_announce
        echo "0" > /proc/sys/net/ipv4/conf/all/arp_ignore
        echo "0" > /proc/sys/net/ipv4/conf/all/arp_announce
        ;;
    *)
        echo "Usage $0 {start|stop}"
        exit 1
        ;;
esac
exit 0

route on the Nginx RealServer:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.8.12.200     *               255.255.255.255 UH    0      0        0 lo
10.8.12.0       *               255.255.255.0   U     0      0        0 eth0
default         10.8.12.254     0.0.0.0         UG    100    0        0 eth0

The DB RealServer nodes are configured the same as in Scheme 1 and are omitted here.

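The symptoms are as follows:
Right after the LVS servers start, accessing 10.8.12.200 works fine.
After the servers had been running for a while (only a few minutes, during which no pages were requested), accessing 10.8.12.200 again failed with a 502, and repeated refreshes made no difference. The routing information on the LVS Server had not changed, and route was normal as well. I then tried accessing nginx on 10.8.12.201 directly, which worked, and accessing 10.8.12.203:8080 (the back-end Tomcat) directly, which also worked. In other words, the LVS Server was load balancing the DB RealServers correctly at that point, and only the load balancing of the Nginx RealServers was broken. After I stopped the keepalived master, the backup took over correctly, and 10.8.12.200 was reachable again. When I started the keepalived master again, it reclaimed the VIP, but 10.8.12.200 was still unreachable. After running ipvsadm -C, access worked again (I found this by accident and really cannot work out why).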
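The manual checks described above (request via the VIP, then nginx directly, then Tomcat directly) could be scripted roughly as follows. This is only a sketch of the elimination logic, not a diagnosis: `locate_failure` and `is_ok` are hypothetical helpers, and the commented curl lines show one way the status codes might be collected using the addresses from the environment above.

```shell
#!/bin/bash
# Treat any 2xx or 3xx HTTP status as healthy; everything else
# (including curl's "000" for no response) counts as a failure.
is_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}

# Given the status seen via the VIP, via nginx directly and via Tomcat directly,
# name the innermost tier that fails, working from the back end outward.
locate_failure() {
    local vip_code=$1 nginx_code=$2 tomcat_code=$3
    if ! is_ok "$tomcat_code"; then
        echo "tomcat"
    elif ! is_ok "$nginx_code"; then
        echo "nginx"
    elif ! is_ok "$vip_code"; then
        echo "lvs-to-nginx path"
    else
        echo "none"
    fi
}

# One way the codes might be collected (not run here):
#   vip=$(curl -s -o /dev/null -w '%{http_code}' http://10.8.12.200/)
#   nginx=$(curl -s -o /dev/null -w '%{http_code}' http://10.8.12.201/)
#   tomcat=$(curl -s -o /dev/null -w '%{http_code}' http://10.8.12.203:8080/)
# The values below are the ones reported in the post.
locate_failure 502 200 200   # prints: lvs-to-nginx path
```

Capturing the output of `ipvsadm -L -n -c` (the connection table) before running `ipvsadm -C` may also help show what state is being cleared.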

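Thoughts:
Tutorials online say the LVS Server and a RealServer node can share one machine without any problem. Here, though, only the Nginx RealServers that share a machine with the LVS Server become unreachable, while the DB RealServers keep working. I really do not know where the problem lies.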