soam's home

home mail us syndication

Peak Load

The graph below shows the requests/sec on the Delve production load balancers for our playlist service system. The time frame roughly covers the past 7 days.

As you can see, we’ve had at least three major peaks over the past couple of days. Some of these have been due to some big traffic partners coming online (including Pokemon) and at least one of them (the most recent one) was because of singer/songwriter Jay Reatard’s untimely passing and the subsequent massive demand for his videos by way of pitchfork, a partner of ours. In other words, some we predicted. Others – well those just happen.

All of these hits are great for the business and for our growth but is definitely white knuckle time for those of us responsible for keeping the system running. Fortunately, through some luck and a whole lot of planning, things have gone very smoothly thus far, fingers crossed. Some of the things we did in advance to prepare:

  • load testing the entire system to isolate the weakest links: we found apachebench and httperf to be our good friends
  • instrument components to print out response times: in particular, we did this with nginx, our load balancer of choice, as it is very easy to print out upstream and client response times
  • utilize the cloud to prepare testbeds: instead of hitting our production system, we were able to set up smaller replicas in the cloud and test there
  • monitor each machine in the chain: running something as simple as top or a little more sophisticated as netstat during load testing can provide great insights. In fact, this is something not limited to load testing. Simply, monitoring production machines during heavy traffic can provide a lot of information.

Our testing showed that:

  • we needed to offload serving more static files to the CDN
  • we could use the load balancer to serve some files which were generated dynamically in the backend, yet never changed. This was an enormous saving.
  • our backend slave cloud dbs needed some semblance of tuning. I don’t consider myself to be a mysql expert by any stretch of the imagination. However, I found that our dbs were small enough and there was sufficient RAM in our AWS instances such that tweaks like increasing the query cache size and raising the innodb buffer pool ensured no disk I/O when serving requests.
  • altering our backend caching to evict after a longer period of time – this would reduce load on our dbs
  • smoothen our deployment process so we can fire up additional backend nodes and load balancers if necessary

There’s much more to be done but surviving the onslaught thus far (with plenty of remaining capacity) has definitely been very heartening. It almost (but not quite) makes up for working through most of the holiday season :)

Leave a Comment