by Greg Orzell
It turns out that there is a rather nasty bug/behavior in Amazon’s TCP/IP load balancing that can lead to traffic being black holed. This is of course the nightmare scenario for pretty much anything TCP/IP. Let’s take a detailed look at what can happen and what I think is going on.
TCP/IP Load Balancing
Before talking about the nature of the problem and how it was diagnosed, it is probably worth a quick review of how TCP/IP load balancing works. Clients connecting to the load balancer (LB) establish a session with it. All of the TCP/IP protocol requirements are handled by the LB, so as far as the client is concerned it IS the server. This is really important because it means that things like keepalives and other protocol-related packets are handled independently of the connection to the actual server.
When a new connection arrives at the LB, it creates an additional connection to one of the servers that are registered with it, using a round robin algorithm. The payload of each packet from the client is unwrapped at the LB, then re-wrapped for the LB-to-server connection and sent on. Response packets do the same thing in reverse. In essence, the LB is acting as a proxy for packets going between the client and the server.
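To make the "by connection" part concrete, here is a minimal sketch of round robin assignment at connection time. It is a toy model, not how any real load balancer is implemented: the class name, the method names, and the client and server labels are all made up for illustration, and no real sockets are involved.

```python
import itertools


class RoundRobinBalancer:
    """Toy model of connection-level (not request-level) load balancing."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = itertools.count()
        # client connection id -> server chosen when the connection was opened
        self.assignments = {}

    def on_new_connection(self, client_id):
        # The server is picked exactly once, when the client connects.
        server = self.servers[next(self._counter) % len(self.servers)]
        self.assignments[client_id] = server
        return server

    def route(self, client_id, payload):
        # Every later payload on this connection goes to the same server.
        return self.assignments[client_id], payload


lb = RoundRobinBalancer(["server-a", "server-b"])
first = lb.on_new_connection("client-1")
second = lb.on_new_connection("client-2")
third = lb.on_new_connection("client-3")
```

The point to notice is that route never makes a fresh choice. Once a client is pinned to a server, everything it sends follows that mapping for the life of the connection.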
Finding The Hole
A little while back I was helping to debug why connections were behaving in odd ways when going through an Elastic Load Balancer (ELB). The application was using TCP/IP load balancing for long-lived TCP/IP connections (specifically XMPP). It’s also important to note that the XMPP protocol does not follow a request/response pattern for all actions. This means that the client can send requests for which it doesn’t expect a response from the server. Using long-lived connections and lengthy timeouts was particularly important because most of our clients were mobile devices, and we did not want to power up the radio to send heartbeats any more frequently than absolutely necessary.
For the most part this worked as expected. However, there were times when clients were mysteriously unable to continue talking to the servers they were connected to. Through a series of tests we were able to determine that these times tended to correlate with when we were deploying new code. Whenever new code was pushed and the client sent further information, that information would never reach any of the servers. This would continue until a heartbeat request (which expects a response) failed. Because we wanted to limit radio use, this type of request was only sent every few minutes, so numerous updates from the client could be lost.
At this point we could see that packets were being successfully delivered to the ELB from the client, but we couldn’t find the same packets being forwarded on to any of the servers. This led us to take a closer look at the ELB itself and how it was interacting with the servers behind it as they came and went. It’s important to note that we were using a Red/Black push methodology, where a new set of servers is registered with the load balancer and the old servers are deregistered. The diagrams below show a simplified view of what was happening during the push process.
As you would expect, because the client’s connection is terminated at the ELB and then proxied to the server, it is not directly affected by the de-registration event. As mentioned before, TCP/IP load balancing is done by connection, not by request (as with HTTP) or by packet. When a server is deregistered, it leaves things in an awkward state on the client side. The client’s packets no longer have a destination, and the client has no way of knowing this, because they are still successfully sent to and ACKed by the ELB. What exactly the ELB does with them after that is a bit of a mystery, but they are never to be seen again.
A Better Way
When this happens, I think that the load balancer should help you out, but it turns out it doesn’t. Instead it just sends all the client packets into nothingness. Bummer. What I think would make more sense would be for the ELB to send RST packets, when de-registration occurs, to all of the clients whose connections it was proxying to that host. The ELB should have a state table mapping ELB/client connections to ELB/server connections, so I would think that this kind of solution would be fairly trivial to implement. If it behaved in this way, the clients would establish new connections to the ELB, which could then be proxied to hosts that are still registered with the ELB, and everything would be more or less happy.
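As a rough sketch of that idea, the state table could look something like the following. To be clear, this is my guess at the shape of a solution and says nothing about how the ELB is actually built. ConnectionTable, reset_client and the client and server labels are hypothetical names. The abort_with_rst helper shows one common way to make a socket close with a reset rather than a normal FIN, and it assumes a Unix-like platform where SO_LINGER takes two C ints.

```python
import socket
import struct
from collections import defaultdict


def abort_with_rst(sock):
    """Close a TCP socket so the peer sees a reset instead of a clean close.

    Assumption: a Unix-like platform. Enabling SO_LINGER with a zero timeout
    makes close() abort the connection rather than shut it down gracefully.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


class ConnectionTable:
    """Hypothetical LB state table: which client connections use which server."""

    def __init__(self, reset_client):
        # reset_client is whatever tears down one client connection;
        # with real sockets it could be abort_with_rst.
        self._reset_client = reset_client
        self._by_server = defaultdict(set)

    def add(self, client_conn, server):
        self._by_server[server].add(client_conn)

    def clients_of(self, server):
        return set(self._by_server.get(server, set()))

    def deregister(self, server):
        # Reset every client that was being proxied to the departing server,
        # so it reconnects and lands on a server that is still registered.
        clients = self._by_server.pop(server, set())
        for client_conn in clients:
            self._reset_client(client_conn)
        return clients


reset_log = []
table = ConnectionTable(reset_client=reset_log.append)
table.add("client-1", "old-server")
table.add("client-2", "old-server")
table.add("client-3", "new-server")
table.deregister("old-server")
```

In this example the two clients attached to old-server are reset, while client-3 on new-server is left alone.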
While we wait for something like this to be implemented, there are a couple of ways to work around the problem. One is to send requests that have an expected response more frequently, so that clients can more quickly identify a dead connection. The other is to do a rolling push where you shut down the servers rather than de-register them. When this happens, the server connections to the ELB are closed and the upstream clients are notified appropriately that what they were talking to is no longer available.
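The first workaround is a trade-off between radio use and how long a dead connection can go unnoticed. Here is a small sketch of the client-side bookkeeping. The HeartbeatMonitor class and the 300 second interval and 10 second timeout are illustrative assumptions, not the values we actually used.

```python
class HeartbeatMonitor:
    """Tracks one outstanding ping and reports when the connection looks dead."""

    def __init__(self, interval, timeout):
        self.interval = interval      # seconds between pings
        self.timeout = timeout        # seconds to wait for a reply
        self.next_ping_at = interval
        self.pending_since = None     # when the unanswered ping was sent

    def should_ping(self, now):
        return self.pending_since is None and now >= self.next_ping_at

    def ping_sent(self, now):
        self.pending_since = now

    def reply_received(self, now):
        self.pending_since = None
        self.next_ping_at = now + self.interval

    def is_dead(self, now):
        # Dead means a ping has been unanswered for longer than the timeout.
        return self.pending_since is not None and now - self.pending_since > self.timeout

    def worst_case_detection_delay(self):
        # A connection that breaks right after a reply is only noticed once
        # the next ping has been sent and has then timed out.
        return self.interval + self.timeout


monitor = HeartbeatMonitor(interval=300, timeout=10)
monitor.ping_sent(now=300)
alive_at_305 = not monitor.is_dead(now=305)
dead_at_311 = monitor.is_dead(now=311)
```

With these made-up numbers a black-holed connection can go undetected for a little over five minutes. Shrinking the interval shrinks that window, at the cost of waking the radio more often.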
Luckily this isn’t a very common use case, so it probably doesn’t affect a large portion of the ELB user base. But it does give me pause as more people start to use WebSockets and other long-lived connections for their services. Will they do the right thing? It’s probably worth taking a closer look.