Wednesday, November 17, 2010

What Happened During Sweepstakes?

I'm sure I'm not the only one who got burned by this during the ARRL Sweepstakes. Periodically the RBN Telnet server would go down, throwing everyone off, reject connection requests for a few minutes, and then come back up. But unless your client had automatic re-connect capability, you might not notice until the spots stopped flowing - and then, if you tried to reconnect, you might or might not get back on.
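For clients that lack built-in re-connect, a wrapper loop around the Telnet session is usually enough. Here is a minimal Python sketch of the idea; the host name and port are placeholders, not the real RBN address, and the backoff values are just illustrative:

```python
import socket
import time

# Placeholder address for illustration - substitute your cluster's host/port.
HOST, PORT = "telnet.example.org", 7000

def next_backoff(delay, cap=300):
    """Double the retry delay after each failure, up to a cap in seconds."""
    return min(delay * 2, cap)

def read_spots_forever():
    """Print spot lines, reconnecting automatically if the server drops us."""
    delay = 5
    while True:
        try:
            with socket.create_connection((HOST, PORT), timeout=30) as sock:
                delay = 5  # reset the backoff once we are connected again
                buf = b""
                while True:
                    chunk = sock.recv(4096)
                    if not chunk:  # server closed the connection
                        raise ConnectionError("server hung up")
                    buf += chunk
                    while b"\r\n" in buf:
                        line, buf = buf.split(b"\r\n", 1)
                        print(line.decode("ascii", "replace"))
        except OSError as exc:
            print(f"disconnected ({exc}); retrying in {delay}s")
            time.sleep(delay)
            delay = next_backoff(delay)
```

The backoff keeps a crowd of dropped clients from hammering the server the instant it comes back, which matters when everyone gets thrown off at once.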

We (mainly Felipe and Nick, F5VIH) have been studying the logs and configuration details for a few days, trying to figure out what is going on. The spot volume doesn't appear to have been to blame - only 470k spots on the busiest 24-hour day, or about 5.5 per second on average ("only," he says!). Instead, it seems to have been the number of users - we're once again victims of our own success.

There are several potential avenues for fixing this. The best would probably be to adjust the current server's parameters to increase its inbound bandwidth and other capacities very substantially, but that may not be feasible with a Virtual Private Server such as we're currently using. We could add another server and set up round-robin DNS, so that connection requests are routed to the two machines alternately. Fine, except that this requires the added server to have a fixed IP address, and if either server gets overburdened and crashes, half the users would find themselves getting rejected or thrown off until it recovers. A third possibility might be to move the whole thing to a dedicated server that we don't share with anyone, and that we can configure for high volume. A fourth, crude solution would be to set up another server with an entirely separate URL and publicize its availability, in the hope that the users themselves will redistribute among the servers.
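One mitigation for the round-robin weakness is for the client to try each published address in turn, rather than giving up when the first one refuses. A sketch of that failover idea in Python (the hostname is hypothetical; with round-robin DNS, `getaddrinfo` returns all the A records):

```python
import socket

def connect_any(host, port, timeout=10):
    """Try every address published for `host`, returning the first socket
    that connects. With round-robin DNS this skips a dead server instead
    of failing half the time."""
    last_exc = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError as exc:
            last_exc = exc  # remember the failure, try the next address
    raise last_exc or OSError(f"no addresses found for {host}")
```

This only helps clients that implement it, of course - the server-side options above don't depend on user cooperation.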

Anyhow, we're scrambling to figure it out and solve the problem before CQWW CW in ten days' time - please stay tuned.

73, Pete N4ZR

1 comment:

  1. Hi Pete,
    I'd be curious to know more about the back-end system you are using (VPS available memory and bandwidth to the net). My rough calculations lead me to believe it's not an upload bandwidth issue. Assuming 40 characters (bytes) per spot and 5.5 spots per second, I figure that's 220 bytes/sec uploaded from the skimmers. It may be less, depending on how/if the skimmers compress their upload packets.

    Assuming a lowly T1 line was your connection to the internet, with a maximum capacity of roughly 200 Kbytes/sec, that would allow you to receive 5000 spots/sec in bursts - almost 1000 times your reported average. My guess, though, is you're sitting on a circuit with 10-100 times that capacity if you're with a large cloud computing provider. So I would think upload bandwidth isn't the issue.

    A more important question is how many viewers/users are connecting: 100, 500, 1000, more? That dictates your outgoing traffic, among other resource needs. Again at 220 bytes/sec (for those 5.5 spots), that could require 22 Kbytes/s, 110 Kbytes/s, or 220 Kbytes/s of outbound traffic respectively for each quoted number of users above. That last number would just saturate a T1 connection's outbound capacity, but would be easily handled by a large cloud provider, who may have upwards of 10 Mbytes/s capacity to the internet - roughly 45x your maximum needed capacity to support 1000 users.
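    The arithmetic above is easy to check in a few lines of Python, under the same assumptions (40-byte spots, 5.5 spots/sec average, every user receiving the full stream):

    ```python
    SPOT_BYTES = 40       # assumed size of one spot line
    SPOTS_PER_SEC = 5.5   # reported average rate

    inbound = SPOT_BYTES * SPOTS_PER_SEC  # bytes/sec arriving from skimmers
    print(f"inbound: {inbound:.0f} bytes/s")

    for users in (100, 500, 1000):
        # each connected user gets a full copy of the spot stream
        outbound_kbs = inbound * users / 1000
        print(f"{users:4d} users -> {outbound_kbs:.0f} Kbytes/s outbound")
    ```

    The asymmetry is the point: inbound load is fixed by the skimmers, but outbound load scales linearly with the user count.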

    It would seem that you may be up against a memory limit on your VPS. I'm unsure how your system works, but I'll quote numbers off a Linux system I use, some daemon knowledge, and a worst-case scenario. Knowing you allow users to connect with telnet, I spawned a telnet daemon and checked the memory footprint it requires to handle one user connection. It's ~2 Mbytes. Now the next question is how much memory is allocated to your VPS: 256M, 512M, 1024M, more? A good portion of that will be allocated to running the O/S, the web server (say Apache), maybe a DB (say MySQL). Assume for the discussion, though, that those take no memory.

    The three quoted memory allocations above would allow for roughly 128, 256, or 512 users respectively. Once you exceed that, your O/S will start swapping memory and processes out to disk. That is a BIG performance hit and will soon bring your system to its knees, with very little chance to recover, given the continuous traffic you say the system digests and relays.
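    The same estimate, as a quick Python sketch (2 MB/connection is the measured figure above; O/S, web server, and DB overhead are ignored, as in the worst-case assumption):

    ```python
    PER_CONN_MB = 2  # measured footprint of one telnetd connection

    for vps_mb in (256, 512, 1024):
        max_users = vps_mb // PER_CONN_MB  # ignores O/S and daemon overhead
        print(f"{vps_mb:4d} MB VPS -> ~{max_users} users before swapping")
    ```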

    Without knowing any real numbers about your system this is just a shot in the dark, but I've seen many instances of servers that were under-equipped to handle the memory requirements of a system. Hope this sheds some light - let me know if I can be of any assistance.


    andyz - K1RA