Load Balancing The UK National JANET Web Cache Service
Using Linux Virtual Servers

Michael Sparks, November 1999

Abstract

This document details the use of the Linux Virtual Servers kernel patch in the JWCS pilot LVS service. By allowing us to finely tune the traffic passing through any server in the cluster, the LVS system helps us to reduce the chance of any single server overloading, and to prevent the failure of any single server from causing a total loss of service.

1 The problem

We have N proxy boxes and wish to divide the request stream between them in a scalable, robust, and flexible system, preferably not limited to physical connections - so that spare capacity at one location can be utilised to ease congestion at the other.

1.1 An Observation on HTTP request stream characteristics

A proxy HTTP request/response stream, at bare minimum, looks like this:

michael@natcache1:~ > telnet man0.sites.wwwcache.ja.net 8080
Trying 194.83.240.20...
Connected to paprika.mcc.wwwcache.ja.net.
Escape character is '^]'.
GET http://rascal.mcc.ac.uk:81/squidtimes/ HTTP/1.0

<Response from server - 53 lines>

That is, the incoming request is significantly smaller than the response.

2 Linux Virtual Servers

These work as follows:

We advertise a single IP address and port number for the service.
A Load balancer - Director in LVS parlance - listens on this IP address, and looks for packets that are used to start a TCP connection.
It then routes requests to real servers in the server farm via NAT, IPIP tunneling, or direct routing, based on a load balancing policy - which is either numbers of connections oriented, client to server oriented or round-robin based, or a combination of these.
We then use a monitoring system to look for servers that are having problems, so that we can pull them out of service either manually or automatically. This is vital since LVS works on the assumption that it manages connections according to a policy, not a monitoring system. In practice this turns out to be an extremely useful double edged sword.

2.1 Transport Mechanisms

2.1.1 Network Address Translation

In this setup, the real servers are on a private network, and use the Director as their gateway. The Director acts in the same way as a masquerading gateway for the servers for all external connections. From a bandwidth availability perspective, this is the least desirable option since the load balancer becomes a bottleneck between 5 & 10Mbit/s. If however the servers are doing processor intensive work, and the aim is to distribute processing (eg across multiple CGI servers), this can be extremely useful.

A big boost here is that no modifications to the servers are required.

Following an incoming request:

User at IP A.B.C.D sends a request from local port J to the Director at IP L.M.N.O on the Director's port K.
The Director chooses a real server W.X.Y.Z port L.
Inbound packets from the client are re-written so that rather than going to L.M.N.O:K, the destination address is changed to W.X.Y.Z:L. Since the real servers are on a private network, these can be normal IP packets, and normal routing does the rest.
Outbound packets from W.X.Y.Z:L are re-written so that their source address is L.M.N.O:K instead.
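
The rewriting steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the LVS kernel code: the Packet and NatDirector names, the choose_real callback and the addresses used in the usage note are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: tuple  # (ip, port) of the sender
    dst: tuple  # (ip, port) of the receiver
    payload: bytes = b""


class NatDirector:
    """Toy model of the Director in NAT mode (hypothetical, for illustration)."""

    def __init__(self, virtual, choose_real):
        self.virtual = virtual          # advertised (ip, port) of the service
        self.choose_real = choose_real  # routing policy: returns a real (ip, port)
        self.table = {}                 # client (ip, port) -> chosen real (ip, port)

    def inbound(self, pkt):
        """Rewrite the destination of a client packet to the chosen real server."""
        if pkt.dst != self.virtual:
            return pkt
        real = self.table.get(pkt.src)
        if real is None:
            real = self.choose_real()   # the only LVS-specific step
            self.table[pkt.src] = real
        return replace(pkt, dst=real)

    def outbound(self, pkt):
        """Rewrite the source of a reply so it appears to come from the service."""
        if self.table.get(pkt.dst) == pkt.src:
            return replace(pkt, src=self.virtual)
        return pkt
```

For example, with a virtual service at 192.0.2.1:8080 and a single real server at 10.0.0.2:3128 (both made up), a request from 198.51.100.7:40000 leaves inbound() addressed to 10.0.0.2:3128, and the reply leaves outbound() with source 192.0.2.1:8080.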

People using NAT/firewalls will see that the only interesting part here is the second bullet point - the rest is normal NAT stuff.

For those interested in this version, during the testing phase of the LVS Cluster, Red Hat released Red Hat 6.1, which includes the necessary software to set up and maintain a NAT based LVS cluster.

2.1.2 IPIP Tunnelling

In this setup, the Director & realservers do not have to be on the same network - indeed, none of the servers need be on the same network.

Following a proxy HTTP request through the system:

User at IP A.B.C.D sends a request from local port J to the Director at IP L.M.N.O on the Director's port K.
The Director chooses a real server W.X.Y.Z port K.
From this point on, all packets (including the first) from A.B.C.D:J to L.M.N.O:K are placed as the data payload in an IP packet sent using normal IP mechanisms destined for W.X.Y.Z. The packet is tagged as IPIP encapsulated.
The realserver W.X.Y.Z has a local tunnel device that is configured with the IP address L.M.N.O, and routes all requests to/from that IP via that device.
When the realserver receives the IPIP encapsulated packet, it retrieves the payload - the original IP packet marked from A.B.C.D:J to L.M.N.O:K - and then uses normal IP mechanisms to deliver the packet. Since its local routing says that L.M.N.O is local, the server accepts the connection on that local IP address. As a result all replies from the real server to the client come from the IP address L.M.N.O, despite the reply coming from a different machine with its own real IP address.
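
As a rough illustration of the encapsulation steps above, the following Python sketch models an IP packet carried as the payload of another IP packet. It is a simplified model under stated assumptions (ports are omitted, and the IPPacket, encapsulate and realserver_receive names are hypothetical), not the LVS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPPacket:
    src: str
    dst: str
    proto: str       # "tcp" for ordinary traffic, "ipip" for an encapsulated packet
    payload: object  # bytes for "tcp", an inner IPPacket for "ipip"


def encapsulate(inner, director_ip, real_ip):
    """Director side: wrap the untouched client packet in an outer IP packet."""
    return IPPacket(src=director_ip, dst=real_ip, proto="ipip", payload=inner)


def realserver_receive(pkt, local_addresses):
    """Real server side: return the packet finally accepted locally, or None.

    local_addresses holds the server's own address plus the virtual address
    configured on its tunnel device.
    """
    if pkt.dst not in local_addresses:
        return None
    if pkt.proto == "ipip":
        # Unwrap and deliver the original packet using normal local routing.
        return realserver_receive(pkt.payload, local_addresses)
    return pkt
```

Because the inner packet is delivered unmodified, the real server sees a connection from the client to the virtual address, which is why its replies carry the virtual address as their source and can go straight back to the client.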

Observations:

All data from the clients to the real servers passes through the Director. In this setup, the smaller the client-to-server traffic is relative to the server-to-client traffic, the better. For example sending POST & CGI requests through the system will decrease scalability.
Data passing from the real server to the client does so directly without passing through the Director - eliminating a bottleneck for the majority of data.
The real servers communicate with the origin servers directly, without the overhead of passing through the Director again - unlike with a NAT/Masquerading approach.

The fact that the realservers do not need to be plugged into the Director and that only a small fraction of the traffic associated with an HTTP request passes through the Director is a major difference between the LVS and the Level 4 switch approach.

A key advantage of this basic approach is that if servers at location X are underloaded, and the servers at location Y are overloaded, or there are bandwidth problems at either location, for whatever reason, traffic can be diverted to the other location quickly and easily.

Finally, it should be noted that this approach isn't really very different from the NAT approach, and that NAT still happens on the inbound portion of the data stream.

2.1.3 Direct Routing

This case is essentially the same as IPIP encapsulation except it works on the ethernet (etc) link protocol, rather than the IP level. In this scenario, the real servers need to be on the same network segment (eg same ethernet hub/ switch), and the real servers configure a dummy device (that doesn't ARP!) with the same address as the virtual service address. Routing for the dummy device is set up in a similar way to the IPIP setup.

Following a request through the system:

Assume ethernet for purposes of discussion.

User at IP A.B.C.D sends a request from local port J to the Director at IP L.M.N.O on the Director's port K.
The Director chooses a real server W.X.Y.Z port K.
The Director then sends the unmodified IP packet out onto the network, but does so by sending it in ethernet frames which are tagged with the MAC address of the realserver.
The realserver's transceiver accepts the frame because of this tag, and the system then attempts to route the IP packet as normal. Since the destination address of the IP packet inside is locally routable, the packet is delivered locally, and communication with the client & origin server happens as before - bypassing the Director.
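
The direct routing steps above can be modelled in the same toy style. The sketch below assumes ethernet and uses hypothetical names throughout (EthernetFrame, director_forward, RealServer); it only illustrates that the IP packet is left untouched and that the real server holds the virtual address without answering ARP for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPPacket:
    src: str
    dst: str
    payload: bytes = b""


@dataclass(frozen=True)
class EthernetFrame:
    dst_mac: str
    packet: IPPacket


def director_forward(packet, real_mac):
    """Director side: leave the IP packet alone, address the frame to the real server."""
    return EthernetFrame(dst_mac=real_mac, packet=packet)


class RealServer:
    def __init__(self, mac, real_ip, virtual_ip):
        self.mac = mac
        self.arp_ips = {real_ip}                 # the dummy device does not ARP
        self.local_ips = {real_ip, virtual_ip}   # but the virtual IP is still local

    def answers_arp_for(self, ip):
        return ip in self.arp_ips

    def receive(self, frame):
        """Return the IP packet if it is accepted locally, otherwise None."""
        if frame.dst_mac != self.mac:
            return None
        if frame.packet.dst not in self.local_ips:
            return None
        return frame.packet
```

The design point is that only the Director answers ARP for the virtual address, so clients always reach the Director first, while every real server is still willing to accept packets addressed to it.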

Note: this technique can potentially be made to work with real servers other than Linux - people have reported success with a diverse mix of operating systems for the real servers: Linux (obviously), HP-UX, Solaris, FreeBSD, NT.

This technique clearly avoids the overhead of doing IPIP encapsulation - and is probably the most scalable version of the LVS system.

Finally note that in both the IPIP Tunnelling & Direct Routing scenarios, users can still connect directly to the real servers in case of catastrophic failure of the Director and of all other failover mechanisms. In the case of NAT, if all the possible Directors fail, the entire system needs to be reconfigured.

2.1.4 Local Node

There is the possibility of using the Director itself as a real server as well, however with our traffic loads it would be unwise for us to do this, due to memory overheads. As a result if you want more info on this, please see the Linux Virtual Servers website. (It's trivial conceptually, and to set up.)

2.1.5 Transport Mechanism Chosen for use in the JWCS Cluster

Due to the flexibility of IPIP tunnelling, the fact that we have more than one location for the JWCS and that there are no major problems against it at this stage, we have been using IPIP tunnelling. However the transport mechanism can be chosen on a per server basis, so this is not hard and fast - and may well change to direct routing for some machines.

2.2 Routing Policies

All of the above mechanisms mentioned the following step:

The Director chooses a real server W.X.Y.Z port K.

The set of rules the Director uses to choose a real server is called a routing policy. There are 2 basic kinds of policy, and an optional sub-policy.

2.2.1 Round Robin

In this scenario, given a list of servers X0, X1, X2, X3, the first request goes to the first, second to second, third to third, fourth to fourth, and then you loop through the servers ad infinitum.

There is another version of this available in the system called weighted round robin, which allows you to skew the traffic so that servers which can handle more traffic are weighted to receive a greater share of it.
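
A minimal Python sketch of both variants follows. It is a naive illustration, not the scheduler LVS uses internally: the weighted version simply repeats each server in the cycle in proportion to an integer weight, and the function names are hypothetical.

```python
from itertools import cycle


def round_robin(servers):
    """Yield servers in order, looping forever."""
    return cycle(servers)


def weighted_round_robin(weights):
    """Yield servers forever, each appearing in proportion to its weight.

    weights maps a server name to a positive integer weight. This naive
    expansion sends a heavier server several requests in a row; a real
    scheduler would normally interleave them more evenly.
    """
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return cycle(expanded)
```

With servers X0 to X3, round_robin hands out X0, X1, X2, X3, X0, and so on. With weights of 2 for X0 and 1 for X1, weighted_round_robin sends two requests in every three to X0.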

2.2.2 Least Connections

In this scenario, the Director keeps track of how many connections each server has, and allocates each new connection to the server with the fewest. Again, a weighted form of this exists, so that beefier servers can take greater strain than others.

The other thing that weighting allows us to do is to state that the default weight is (say) 1000, and if a server is becoming overloaded take some heat off it, by (say) halving its weight. Likewise we can slowly ramp up the amount of traffic going through a server by increasing weights in increments giving us very fine control over the traffic levels.
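
The weighted least-connections idea, including the weight adjustments described above, can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, and treating a weight of 0 as "send no new connections" is an assumption of this sketch.

```python
from fractions import Fraction


class WeightedLeastConnections:
    def __init__(self, weights):
        self.weights = dict(weights)                  # server -> integer weight
        self.connections = {s: 0 for s in self.weights}

    def choose(self):
        """Pick the in-service server with the lowest connections/weight ratio."""
        candidates = [s for s, w in self.weights.items() if w > 0]
        if not candidates:
            raise RuntimeError("no server in service")
        best = min(
            candidates,
            key=lambda s: Fraction(self.connections[s], self.weights[s]),
        )
        self.connections[best] += 1
        return best

    def release(self, server):
        """Record that a connection to the server has finished."""
        self.connections[server] -= 1

    def set_weight(self, server, weight):
        """Ramp a server up or down; 0 takes it out of service in this sketch."""
        self.weights[server] = weight
```

Halving the weight of an overloaded server roughly halves its share of new connections, and stepping the weight back up in small increments ramps it in gradually.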

2.2.3 Persistence - The Sub Policy

This policy says:

If we have already handled a connection for client IP A that resulted in us choosing server X0, then forward this new request to server X0 as well. This is a big boost if the servers are web servers handling CGI requests.
If we haven't already handled a connection for client IP A, then use the primary policy (Round Robin, Least Connections, or a weighted variant of either) to route this request, and remember the choice for future reference.
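
The two rules above amount to a lookup table in front of the primary policy. The sketch below is a simplified illustration with hypothetical names; in particular it never forgets a client, which a real deployment would not want.

```python
from itertools import cycle


class PersistentScheduler:
    def __init__(self, primary_choose):
        self.primary_choose = primary_choose  # callable returning a server
        self.remembered = {}                  # client IP -> server chosen earlier

    def choose(self, client_ip):
        """Reuse the earlier choice for this client, else ask the primary policy."""
        server = self.remembered.get(client_ip)
        if server is None:
            server = self.primary_choose()
            self.remembered[client_ip] = server
        return server


# Usage: round robin as the primary policy.
servers = cycle(["X0", "X1"])
scheduler = PersistentScheduler(lambda: next(servers))
first = scheduler.choose("198.51.100.7")   # new client, goes to the primary policy
again = scheduler.choose("198.51.100.7")   # same client, same server as before
```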

2.2.4 The Routing Policy Used by the JWCS Cluster

To simplify traffic loading we are using a weighted least-connection policy without persistence - this allows us to instantaneously take a server out of service should this become necessary, and to smoothly ramp a server back in after being taken out of service for whatever reason.

2.3 The Monitoring System for the Servers

Currently this is a simple, effective, brute force approach that would be overkill in a non-clustering environment. In the LVS system though, we need to notice server failure instantaneously, since otherwise users will be affected. Hence the monitoring is highly targeted at monitoring one thing and one thing only, in a situation as close to the service mechanism as possible.

Assumption:

If packets in a TCP stream cannot get from the real server to the director, then the real server is as good as dead.
Given that the real servers have thousands of concurrent connections, one more TCP stream to check that the system is alive is useful, and adds minimal overhead.

Implementation:

Real servers connect to the Director using a TCP stream, and send data pulses regularly out along it.
The Director expects a pulse every N seconds, and can therefore detect if pulses are late, repeatedly late, or if the server appears to have stopped sending them altogether.
If the TCP connection is broken at either end, the other end will notice straight away:
- in the case of a real server detecting this, it will expect a new master to become active shortly, and so pause, and then start sending pulses to the new master automatically, and be added into the cluster automatically.
- in the case of the master detecting this, it can mark the realserver down and send no more requests to that server.
The data pulses can be anything. This provides us with a window to explore more interesting - and potentially very useful - automatic load balancing options. For example, the pulses could carry the median TCP HIT time for a cache, with traffic reduced automatically if this becomes too high and increased automatically if it is "too low".
This would have the potential side effect of allowing the system to find the best levels of traffic for a server, rather than relying on human judgement. Secondly it could result in a system whereby no single server could become overloaded by sudden increases in traffic - unlike a DNS based approach.
A simple API to help the production of such a load balancing system has been developed and has proved itself very useful, and helps to point the way to more flexible systems in the future.
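
To make the pulse mechanism concrete, here is a small Python sketch of the bookkeeping the Director might do. It is not the API mentioned above: the PulseMonitor name, the late_after and dead_after thresholds and the status strings are all hypothetical, and timestamps are passed in explicitly so that no real sockets or clocks are needed.

```python
class PulseMonitor:
    """Director-side record of when each real server last sent a pulse."""

    def __init__(self, interval, late_after=1.5, dead_after=3.0):
        self.interval = interval      # expected seconds between pulses (N)
        self.late_after = late_after  # multiples of N before a pulse counts as late
        self.dead_after = dead_after  # multiples of N before the server counts as down
        self.last_seen = {}           # server -> time of the most recent pulse
        self.last_data = {}           # server -> payload of the most recent pulse

    def pulse(self, server, now, data=None):
        """Record a pulse; data could carry a load figure such as a median hit time."""
        self.last_seen[server] = now
        self.last_data[server] = data

    def disconnect(self, server):
        """The TCP stream broke: forget the server so it is reported as down."""
        self.last_seen.pop(server, None)
        self.last_data.pop(server, None)

    def status(self, server, now):
        last = self.last_seen.get(server)
        if last is None:
            return "down"
        age = now - last
        if age > self.dead_after * self.interval:
            return "down"
        if age > self.late_after * self.interval:
            return "late"
        return "ok"
```

A server reported as "down" would then be removed from the routing policy, and one reported as "late" might have its weight reduced rather than being removed outright.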

2.4 Failure of the Director

If the Director fails, any one of the servers in the cluster can be utilised to take over the role of load balancing. The mechanics of this are quite simple:

The backup server detects failure of the primary server.
It uses the software "fake" to take over the IP address (broadcasting ARPs etc to ensure routers pick this up straight away).
The backup activates all the software/routing setups that were on the original Director.

NB During trials - over a month - the Director has had 100% uptime and experienced no failures of any kind.

3 Hardware currently in the Pilot Service Cluster & Support

The system currently consists of 1 Director and 4 Real (cache) Servers.

The 4 Cache servers all have identical hardware and software installations - with capacity to handle 2.5 Million requests per day each. As such the cluster should be able to handle 10 Million requests per day without problems. During the test phase, utilising 3 caches, traffic levels reached 6.5 Million requests per day, shipping over 60Gb per day.

As will be noted in the next section capacity of the Pilot service will increase as we transition sites across. All the hardware in the cluster is on normal production service callout procedures.

4 Transfer of Sites

4.1 Sites to be Initially Transferred

Some of the sites making use of the test cluster will be transitioned first. We need to do it this way since these users are already using the test cluster by pointing their caches directly at the cluster as an extra parent.

Capacity of the pilot cluster is higher than that of the original test cluster so some other sites will be transitioned as well. Since sites are being transferred to a service with normal production service callout procedures, we will choose which sites to transition according to operational requirements. One key thing to notice here is that this requires no intervention by individual sites, or modifications to your existing setup.

4.2 Transition of sites and machines

After the initial transfer of sites, we will be transitioning sites based on which machine they are currently using - ie a machine will be added to the cluster, and its sites also added at the same time. During this time there will in all likelihood be a small change in response times - akin to throwing a stone in a pond of water - but the effect will be temporary.

To minimise the effects of this, we will be making these transitions at times when loading is least on the caches - which means early evening (5-6pm) or weekends. It should be stressed that this transition is solely a DNS based one, and service will continue transparently during such a change. (For example, the test cluster was safely transitioned from one physical network to another without downstream caches being affected in any way.)

The transfer plan is deliberately cautious, but steady.

4.3 Reversion Procedures

One major benefit of transitioning an entire machine and its sites in this way is for reversion purposes:

Currently each 0.sites... points at an icpX.servers name, which points at a machine name.
We transition a machine to the cluster by pointing its icpX.servers name at the cluster, whilst keeping a reference to the original. Reversion to a known loading level on a machine is then possible simply by commenting out the new icpX line and uncommenting the original.

Clearly for this to be effective, we will only be changing one of your machine's parents in this way at a time.

5 Finally

With any new service there is always the possibility of encountering problems. The test cluster allowed us to iron out a number of problems that only happen in a real environment and whilst we would sincerely hope that we have now caught the majority of problem sources, we recognise that we live in the real world!

If you notice that a reverse lookup on your 0/1.sites.wwwcache.ja.net name refers to a cluster, and are experiencing problems with it, do not hesitate to contact us. (This clearly goes for the main production service too.)

On a note unrelated to the service, it should be noted that the LVS mechanism is very flexible and can be applied in a number of other areas - including, but not limited to:

Web servers
LDAP/IMAP servers
POP3 Servers
etc.

References

The Linux Virtual Servers website can be found at www.linuxvirtualserver.org

Linux Virtual Servers Credits

The Linux Virtual Servers project is led by Wensong Zhang, and our thanks go out to everyone on the LVS mailing list during the test phase. It's largely to the credit of the coding and existing documentation though that there was very little need to contact the list regarding problems!
