by Greg Orzell
It turns out that there is a rather nasty bug/behavior in Amazon’s TCP/IP load balancing that can lead to traffic being black holed. This is of course the nightmare scenario for pretty much anything TCP/IP. Let’s take a detailed look at what can happen and what I think is going on.
TCP/IP Load Balancing
Before talking about the nature of the problem and how it was diagnosed, it is probably worth a quick review of how TCP/IP load balancing works. Clients connecting to the load balancer (LB) establish a session with it. All of the TCP/IP protocol requirements are handled by the LB, so as far as the client is concerned it IS the server. This is really important because it means that things like keepalives and other protocol-related packets are handled independently of the connection to the actual server.
When a new connection arrives at the LB it creates an additional connection to one of the servers that are registered with it, using a round robin algorithm. The bodies of the packets from the client are unwrapped at the LB, then re-wrapped for the LB-to-server connection and sent on. Response packets do the same thing in reverse. In essence the LB is acting as a proxy for packets going between the client and the server.
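To make the per-connection behavior concrete, here is a minimal toy model in Python. It is only a sketch: the class and method names are made up for illustration, it does not touch real sockets, and it is not a description of how the ELB is actually implemented. It shows only that a backend is picked once, when the client connects, and that every later payload on that connection follows the same mapping.

```python
import itertools


class RoundRobinBalancer:
    """Toy model: one backend is chosen per client connection, not per packet."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._cycle = itertools.cycle(self.backends)
        # Maps a client connection id to the backend chosen for it.
        self.sessions = {}

    def accept(self, client_id):
        """Called once per new client connection; picks the next backend in turn."""
        backend = next(self._cycle)
        self.sessions[client_id] = backend
        return backend

    def forward(self, client_id, payload):
        """Re-wrap a client payload for the backend tied to this connection."""
        return (self.sessions[client_id], payload)


lb = RoundRobinBalancer(["server-a", "server-b"])
first = lb.accept("client-1")
second = lb.accept("client-2")
third = lb.accept("client-3")
routed = lb.forward("client-1", b"hello")
```

Note that `forward` never consults the list of registered backends again. That detail matters later, because a long-lived connection keeps pointing at whatever server it was given when it was first accepted.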
Finding The Hole
A little while back I was helping to debug why connections were behaving in odd ways when going through an Elastic Load Balancer (ELB). The application was using TCP/IP load balancing for long-lived TCP/IP connections (specifically XMPP). It’s also important to note that the XMPP protocol does not follow a request/response pattern for all actions. This means that the client can send requests out for which it doesn’t expect a response from the server. Using long-lived connections and lengthy timeouts was particularly important because most of our clients were mobile devices. It was important that we not power up the radio to send heartbeats any more frequently than absolutely necessary.
For the most part this worked as expected. However, there were times when clients were mysteriously unable to continue talking to the servers they were connected to. Through a series of tests we were able to determine that these times tended to correlate with when we were deploying new code. Whenever new code was pushed and the client sent further information, it would never reach any of the servers. This would continue until a heartbeat request (which expects a response) failed. Because we wanted to limit radio use, this type of request was only sent every few minutes, so numerous updates from the client could be lost.
At this point we could see that packets were being successfully delivered to the ELB from the client, but we couldn’t find the same packets being forwarded on to any of the servers. This led us to take a closer look at the ELB itself and how it was interacting with the servers behind it as they came and went. It’s important to note that we were using a Red/Black push methodology, where a new set of servers is registered with the load balancer and the old servers are deregistered. The diagrams below show a simplified view of what was happening during the push process.
As you would expect, because the client’s connection is terminated at the ELB and then proxied to the server, it is not directly affected by the de-registration event. As mentioned before, TCP/IP load balancing is done by connection, not by request (as with HTTP) or by packet. When a server is deregistered it leaves things in an awkward state on the client side. The client’s packets no longer have a destination, and the client has no way of knowing this, because they are still successfully sent to and ACKed by the ELB. What exactly the ELB does with them after that is a bit of a mystery, but they are never to be seen again.
A Better Way
When this happens, I think that the load balancer should help you out, but it turns out it doesn’t. Instead it just sends all the client packets into nothingness. Bummer. What I think would make more sense would be for the ELB to send RST packets to all of the clients for which it was proxying connections to that host, when de-registration occurs. The ELB should have a state table mapping ELB/client connections to ELB/server connections. So I would think that this kind of solution would be fairly trivial to implement. If it behaved in this way, the clients would establish new connections to the ELB which could then be proxied to hosts that are still registered with the ELB and everything would be more or less happy.
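Here is a rough Python sketch of what I have in mind. Everything in it is hypothetical: `ProxyTable`, `track`, `deregister` and `close_with_rst` are names invented for this example, and nothing here reflects the ELB’s real internals. The one real mechanism it leans on is the `SO_LINGER` socket option. On Linux and most Unix-like systems, closing a TCP socket with linger enabled and a zero timeout aborts the connection with an RST rather than a normal FIN (the struct layout used below is the Unix one and is an assumption about the platform).

```python
import socket
import struct
from collections import defaultdict


def close_with_rst(sock):
    """Abort a TCP connection so the peer sees a reset instead of a clean close.

    Assumes a Unix-like platform, where struct linger is two C ints.
    """
    # l_onoff=1, l_linger=0: close() discards unsent data and sends RST.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


class ProxyTable:
    """Hypothetical state table: backend -> client connections proxied to it."""

    def __init__(self):
        self._by_backend = defaultdict(set)

    def track(self, backend, client_conn):
        """Record that client_conn is being proxied to backend."""
        self._by_backend[backend].add(client_conn)

    def deregister(self, backend, reset=close_with_rst):
        """Drop a backend and reset every client connection that depended on it.

        Returns the number of client connections that were reset.
        """
        clients = self._by_backend.pop(backend, set())
        for conn in clients:
            reset(conn)
        return len(clients)
```

The `reset` argument is injectable so the bookkeeping can be exercised without real sockets. In a real proxy the tracked objects would be the accepted client sockets and the default `close_with_rst` would apply.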
While we wait for something like this to be implemented, there are a couple of ways to work around the problem. One is to send requests that have an expected response more frequently so that clients can more quickly identify a dead connection. The other is to do a rolling push where you shut down rather than de-register the servers. When this happens the server connections to the ELB are closed and the upstream clients are notified appropriately that what they were talking to is no longer available.
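The first workaround boils down to a small piece of client-side bookkeeping. The sketch below is one way to express it in Python. The `HeartbeatMonitor` class, its method names, and the interval and timeout values are all assumptions made for illustration, not what our client actually ran. The clock is injectable so the logic can be checked without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Tracks request/response heartbeats to detect a black-holed connection."""

    def __init__(self, interval, timeout, clock=time.monotonic):
        self.interval = interval  # seconds between heartbeats
        self.timeout = timeout    # seconds to wait for a reply
        self.clock = clock
        self.last_sent = None
        self.awaiting_reply = False

    def should_send(self):
        """True when no heartbeat is outstanding and the interval has elapsed."""
        if self.awaiting_reply:
            return False
        if self.last_sent is None:
            return True
        return self.clock() - self.last_sent >= self.interval

    def sent(self):
        """Record that a heartbeat request was just written to the connection."""
        self.last_sent = self.clock()
        self.awaiting_reply = True

    def reply_received(self):
        """Record that the server answered the outstanding heartbeat."""
        self.awaiting_reply = False

    def is_dead(self):
        """True when a heartbeat has gone unanswered for longer than the timeout."""
        return self.awaiting_reply and self.clock() - self.last_sent > self.timeout
```

The trade-off is the one described above: a shorter `interval` finds a dead connection sooner, but on a mobile client it also wakes the radio more often. Once `is_dead()` returns true, the client should drop the connection and reconnect through the ELB.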
Luckily this isn’t a very common use case so it probably doesn’t affect a large portion of the ELB user base. But it does give me pause as more people start to use websockets and other long lived connections for their services. Will they do the right thing? It’s probably worth taking a closer look.
Did the ELB black-hole your traffic if you forced the servers to fail the health check, instead of deregistering them?
There are really some nasty connection/communication methods. Persistent search (which is a great feature of LDAP servers) is another one. Often silently dropped without notifying the clients.
doesn’t the new ELB “Connection Draining” feature alleviate the problem?
Thanks for the interesting post. I am using ELB via Elastic Beanstalk for a Web application (no mobile yet, but someday). In theory I’d like it to add new servers and scale them back as needed. But when I’ve been testing with micro instances I’ve seen the Elastic Beanstalk environment get corrupted so I have to recreate it. I’m hoping that this was a result of the micro instance running out of memory and that this will go away when I use medium memory instances in production. It was interesting to read about the problem you had.
By the way, please tell Julie that I said “Hi”
Hi, Ian! Thanks for reading. Long time, no see 🙂
It has been a while. Sorry, I didn’t see your reply until today when Max responded.
I’ve been using AWS to build nderground (http://www.nderground.net) on nights and weekends. nderground is a social network built for privacy. nderground is hosted on AWS.
I’ve gotten more experience with scaling the Elastic Load Balancer and I’m pretty happy with it. I am now running two t2.micro instances and I have not run out of performance (although I look forward to the day this happens). I also use RDS/Postgres and DynamoDB, so I think that the performance I am seeing may result from offloading the compute instances.
Max’s comment about connection timeout looks interesting.
I am having problems with Apple devices and connection timeout on photo upload. This happens less with Safari than Chrome, but it’s still a problem. So I may experiment with increasing the timeout limit.
Apple is also annoying in that they don’t correct the orientation of photos. If the photo is stored on the device they write EXIF information so you can fix the orientation. But if the image comes directly from the camera to up-load, there is no EXIF information, so you can’t correct the image (as far as I know).
I since solved this problem using the rotation function in the imgscalr Java package. I also use imgscalr for image scaling.
You may set connection timeout (https://aws.amazon.com/blogs/aws/elb-idle-timeout-control/) to 1 second.
Thanks everyone. Any updates on this? I’m seeing the same behavior today. I think my server X is even sending RSTs to the ELB, and my clients are still half open!