This document details the use of the Linux Virtual Servers kernel patch used in the JWCS pilot LVS service. By allowing us to finely tune the traffic passing through any server in the cluster, the LVS system helps us to reduce the chance of any single server overloading, and the chance of any single server failure causing a total loss of service.
We have N proxy boxes and wish to divide the request stream between them in a scalable, robust and flexible system, preferably not limited to physical connections - so that spare capacity at one location can be utilised to ease congestion at the other.
A proxy HTTP request/response stream, at bare minimum, looks like this:
michael@natcache1:~ > telnet man0.sites.wwwcache.ja.net 8080
Trying 126.96.36.199...
Connected to paprika.mcc.wwwcache.ja.net.
Escape character is '^]'.
GET http://rascal.mcc.ac.uk:81/squidtimes/ HTTP/1.0

<Response from server - 53 lines>
That is, the incoming request is significantly smaller than the response.
These work as follows:
In this setup, the real servers are on a private network, and use the Director as their gateway. The gateway acts in the same way as a masquerading gateway for the servers for all external connections. From a bandwidth availability perspective, this is the least desirable option since the load balancer becomes a bottleneck between 5 & 10Mbit/s. If however the servers are doing processor intensive work, and the aim is to distribute processing (eg across multiple CGI servers), this can be extremely useful.
A big boost here is that no modifications to the servers are required.
Following an incoming request:
People using NAT/firewalls will see that the only interesting part here is the second bullet point - the rest is normal NAT stuff.
For those interested in this version, during the testing phase of the LVS Cluster, Red Hat released Red Hat 6.1, which includes the necessary software to set up and maintain a NAT based LVS cluster.
In this setup, the Director & realservers do not have to be on the same network - indeed, none of the servers need be on the same network.
Following a proxy HTTP request through the system:
The fact that the realservers do not need to be plugged into the Director and that only a small fraction of the traffic associated with an HTTP request passes through the Director is a major difference between the LVS and the Level 4 switch approach.
A key advantage of this basic approach is that if the servers at location X are underloaded and the servers at location Y are overloaded, or there are bandwidth problems at either location, for whatever reason, traffic can be diverted to the other location quickly and easily.
Finally, it should be noted that this approach isn't really very different from the NAT approach, and that NAT still happens on the inbound portion of the data stream.
This case is essentially the same as IPIP encapsulation except it works at the ethernet (etc) link protocol level, rather than the IP level. In this scenario, the real servers need to be on the same network segment (eg same ethernet hub/switch), and the real servers configure a dummy device (that doesn't ARP!) with the same address as the virtual service address. Routing for the dummy device is set up in a similar way to the IPIP setup.
Following a request through the system:
Assume ethernet for purposes of discussion.
Notes: this technique can potentially be made to work with real servers other than Linux - people have reported success with a diverse mix of operating systems for the real servers: Linux (obviously), HP-UX, Solaris, FreeBSD, NT.
This technique clearly avoids the overhead of doing IPIP encapsulation - and is probably the most scalable version of the LVS system.
Finally note that in both the IPIP Tunneling & Direct Routing scenarios, users can still connect direct to the real servers in case of catastrophic failure of the Director, and all other failover mechanisms. In the case of NAT if all the possible Directors fail, the entire system needs to be reconfigured.
There is the possibility of using the Director itself as a real server as well, however with our traffic loads it would be unwise for us to do this, due to memory overheads. As a result, if you want more info on this, please see the Linux Virtual Servers website. (It's trivial conceptually, and to set up.)
Due to the flexibility of IPIP Tunneling, the fact that we have more than one location for the JWCS, and that there are no major problems with it at this stage, we have been using IPIP tunneling. However the transport mechanism can be chosen on a per server basis, so this is not hard and fast - and may well change to direct routing for some machines.
All of the above mechanisms mentioned the following step:
The rules the Director uses to choose a real server are called a routing policy. There are 2 basic kinds of policy, and an optional sub-policy.
In this scenario, given a list of servers X0, X1, X2, X3, the first request goes to the first, second to second, third to third, fourth to fourth, and then you loop through the servers ad infinitum.
There is another version of this available in the system called weighted round robin, which allows you to skew the traffic so that servers which can handle more traffic receive a greater share of it.
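The two round robin variants described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the LVS kernel patch uses: the function names and the weights are assumptions made for the example, and the weighted version simply repeats each server according to its weight, which is the most naive way of skewing the rotation.

```python
from itertools import cycle, islice

def round_robin(servers):
    """Yield the servers in order, looping forever."""
    return cycle(servers)

def weighted_round_robin(weights):
    """Yield servers in proportion to their integer weights.

    Naive expansion (an assumption of this sketch): a server with
    weight 2 simply appears twice in each pass through the list.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)

servers = ["X0", "X1", "X2", "X3"]
# First six requests: X0, X1, X2, X3, then back to X0, X1.
first_six = list(islice(round_robin(servers), 6))

# Hypothetical weights: X0 can take twice the traffic of the others.
weights = {"X0": 2, "X1": 1, "X2": 1, "X3": 1}
one_pass = list(islice(weighted_round_robin(weights), 5))
```

In one pass of five requests X0 receives two and every other server one. A real scheduler would normally interleave the repeated entries rather than sending them back to back.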
In this scenario, the Director keeps track of how many connections each server has, and allocates each new connection to the server with the fewest. Again, a weighted form of this exists, so that beefier servers can take greater strain than others.
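As a rough sketch of the weighted form, the choice can be thought of as picking the server with the lowest ratio of active connections to weight. The server names, connection counts and weights below are made up for illustration, and skipping servers whose weight is zero is a choice made for this sketch rather than a description of the kernel code.

```python
def pick_least_connection(active, weights):
    """Return the server with the lowest connections-to-weight ratio.

    `active` maps server name -> current connection count.
    `weights` maps server name -> weight; servers with a weight of
    zero (or no weight at all) are skipped in this sketch.
    """
    candidates = [s for s in active if weights.get(s, 0) > 0]
    if not candidates:
        raise ValueError("no server with a positive weight")
    return min(candidates, key=lambda s: active[s] / weights[s])

# Hypothetical snapshot of the cluster.
active = {"cache1": 40, "cache2": 30, "cache3": 50}
equal = {"cache1": 1000, "cache2": 1000, "cache3": 1000}
skewed = {"cache1": 1000, "cache2": 500, "cache3": 1000}

unweighted_choice = pick_least_connection(active, equal)
weighted_choice = pick_least_connection(active, skewed)
```

With equal weights the new connection goes to cache2, which has the fewest connections. Once cache2's weight is halved, its 30 connections count for more than cache1's 40, so cache1 is chosen instead.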
The other thing that weighting allows us to do is to state that the default weight is (say) 1000, and if a server is becoming overloaded, take some heat off it by (say) halving its weight. Likewise we can slowly ramp up the amount of traffic going through a server by increasing its weight in increments, giving us very fine control over the traffic levels.
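The halving and ramping described here amount to simple arithmetic on the weight. The sketch below assumes the default weight of 1000 mentioned above and a step of 100 per adjustment; the step size and the function names are assumptions for the example only.

```python
DEFAULT_WEIGHT = 1000  # the "(say) 1000" default from the text

def ease_off(weight):
    """Halve a server's weight to take some load off it (never below 1)."""
    return max(1, weight // 2)

def ramp_up(weight, step=100, ceiling=DEFAULT_WEIGHT):
    """Raise a server's weight by one step, without passing the ceiling."""
    return min(ceiling, weight + step)

# An overloaded server is halved, then brought back in gradually.
weight = ease_off(DEFAULT_WEIGHT)
history = [weight]
while weight < DEFAULT_WEIGHT:
    weight = ramp_up(weight)
    history.append(weight)
```

Here the weight drops to 500 and then climbs back to 1000 in five steps. A smaller step gives a gentler return to full load.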
This policy says:
To simplify traffic loading we are using a weighted least-connection policy without persistence - this allows us to instantaneously take a server out of service should this become necessary, and to smoothly ramp a server back in after being taken out of service for whatever reason.
Currently this is a simple, effective, brute force approach that would be overkill in a non-clustering environment. In the LVS system though, we need to notice server failure instantaneously, since otherwise users will be affected. Hence the monitoring is highly targeted at monitoring one thing and one thing only, in a situation as close to the service mechanism as possible.
This would have the potential side effect of allowing the system to find the best levels of traffic for a server, rather than relying on human judgement. Secondly it could result in a system whereby no single server could become overloaded by sudden increases in traffic - unlike a DNS based approach.
If the Director fails, any one of the servers in the cluster can be utilised to take over the role of load balancing. The mechanics of this are quite simple:
NB During trials - over a month - the Director has had 100% uptime and experienced no failures of any kind.
The system currently consists of 1 Director and 4 Real (cache) Servers.
The 4 Cache servers all have identical hardware and software installations - with capacity to handle 2.5 Million requests per day each. As such the cluster should be able to handle 10 Million requests per day without problems. During the test phase, utilising 3 caches, traffic levels reached 6.5 Million requests per day, shipping over 60Gb per day.
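The capacity figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are ours, and the utilisation figure is simply the test-phase load per cache divided by the stated per-cache capacity.

```python
PER_CACHE_CAPACITY = 2_500_000   # requests per day each cache can handle
PILOT_CACHES = 4

# Nominal capacity of the pilot cluster.
cluster_capacity = PER_CACHE_CAPACITY * PILOT_CACHES

# Test phase: 3 caches carried 6.5 million requests per day.
TEST_REQUESTS = 6_500_000
TEST_CACHES = 3
per_cache_load = TEST_REQUESTS / TEST_CACHES
utilisation = per_cache_load / PER_CACHE_CAPACITY

print(f"pilot capacity: {cluster_capacity:,} requests/day")
print(f"test load per cache: {per_cache_load:,.0f} requests/day")
print(f"test utilisation: {utilisation:.0%}")
```

On these figures each cache in the test cluster was carrying roughly 2.2 million requests per day, a little under its stated capacity.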
As will be noted in the next section, capacity of the Pilot service will increase as we transition sites across. All the hardware in the cluster is on normal production service callout procedures.
Some of the sites making use of the test cluster will be transitioned first. We need to do it this way since these users are already using the test cluster by pointing their caches directly at the cluster as an extra parent.
Capacity of the pilot cluster is higher than that of the original test cluster, so some other sites will be transitioned as well. Since sites are being transferred to a service with normal production service callout procedures, we will choose which sites to transition according to operational requirements. One key thing to notice here is that this requires no intervention by individual sites, or modifications to your existing setup.
After the initial transfer of sites, we will be transitioning sites based on which machine they are currently using - ie a machine will be added to the cluster, and its sites added at the same time. During this time there will in all likelihood be a small change in response times - akin to throwing a stone in a pond of water - but the effect will be temporary.
To minimise the effects of this, we will be making these transitions at times when loading is least on the caches - which means early evening (5-6pm) or weekends. It should be stressed that this transition is solely a DNS based one, and service will continue transparently during such a change. (For example, the test cluster was safely transitioned from one physical network to another without downstream caches being affected in any way.)
The transfer plan is deliberately cautious, but steady.
One major benefit of transitioning an entire machine and its sites in this way is for reversion purposes:
Clearly for this to be effective, we will only be changing one of your machine's parents in this way at a time.
With any new service there is always the possibility of encountering problems. The test cluster allowed us to iron out a number of problems that only happen in a real environment and whilst we would sincerely hope that we have now caught the majority of problem sources, we recognise that we live in the real world!
If you notice that a reverse lookup on you
On a note unrelated to the service, it should be noted that the LVS mechanism is very flexible and can be applied in a number of other areas - including, but not limited to:
The Linux Virtual Servers website can be found at www.linuxvirtualserver.org
The Linux Virtual Servers project is led by Wensong Zhang, and our thanks go out to everyone on the LVS mailing list during the test phase. It's largely to the credit of the coding and existing documentation though that there was very little need to contact the list regarding problems!