soam's home


Archive for January, 2010

Bing’s Engineers

Nice San Jose Mercury article on the ex-Inktomi and Yahoo-ites behind Bing’s real-time search launch – I particularly enjoyed the opening paragraphs:

Microsoft engineer Chad Carson wasn’t thrilled about surrendering his solo window seat on the Alaska Airlines flight from San Jose to Seattle so he could talk shop with his boss Sean Suchter and colleague Eric Scheel.

But that innocent decision last July 22 would spark a 91-day sprint to a previously unreached Internet milestone.

By the time Flight 321 was over Oregon, the group in Row 6 had evolved from a technology klatch to a cabal of plotters who scrawled a schematic tangle of boxes on a sheet of paper to map out something no big Internet search engine had yet achieved. The three members of Microsoft’s new Silicon Valley search team would try to make their company’s Bing a window into America’s stream of consciousness, serving up the chatter on Twitter and blog posts, with the latest updates on everything from celebrity gossip to breaking news.

Knowing Sean and Chad’s talent and work ethic, it’s great to see them get this exposure, particularly after spending so much time in Google’s shadow. Congrats guys! Also, I found the mention of Row 6 on Alaska Airlines particularly amusing. If you’re not an MVP or Gold or any other type of high falutin’ flyin’ status holder, you can still board early, ahead of the rest of the folks in cattle class, if you score a seat in Row 6. It’s my own shortcut on flights to Delve HQ in Seattle.

Peak Load

The graph below shows the requests/sec on the Delve production load balancers for our playlist service system. The time frame roughly covers the past 7 days.

As you can see, we’ve had at least three major peaks over the past couple of days. Some of these were due to big traffic partners coming online (including Pokemon), and at least one (the most recent) was due to singer/songwriter Jay Reatard’s untimely passing and the subsequent massive demand for his videos by way of Pitchfork, a partner of ours. In other words, some we predicted. Others – well, those just happen.

All of these hits are great for the business and for our growth, but they definitely make for white-knuckle time for those of us responsible for keeping the system running. Fortunately, through some luck and a whole lot of planning, things have gone very smoothly thus far, fingers crossed. Some of the things we did in advance to prepare:

  • load testing the entire system to isolate the weakest links: we found apachebench and httperf to be good friends here (a rough sketch of the idea follows this list)
  • instrumenting components to report response times: in particular, we did this with nginx, our load balancer of choice, since it is very easy to have it log upstream and client response times
  • utilizing the cloud to prepare testbeds: instead of hitting our production system, we were able to set up smaller replicas in the cloud and test there
  • monitoring each machine in the chain: running something as simple as top, or something a little more sophisticated like netstat, during load testing can provide great insights. In fact, this isn’t limited to load testing – simply monitoring production machines during heavy traffic can provide a lot of information.
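To make the load testing point concrete, here’s a minimal sketch in Python of the basic idea – hammer an endpoint concurrently and look at the latency distribution. This is not our actual harness (apachebench and httperf did the real work), and the URL and request counts below are made up for illustration:

    # Minimal load-test sketch: fire a batch of concurrent requests at a test
    # endpoint and report the latency distribution, in the spirit of what
    # apachebench/httperf summarize for you.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    TEST_URL = "http://test-lb.example.com/playlist/12345"  # hypothetical testbed endpoint
    REQUESTS = 500       # illustrative numbers, not what we actually ran
    CONCURRENCY = 50

    def timed_get(url):
        start = time.time()
        with urlopen(url) as resp:
            resp.read()
        return time.time() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_get, [TEST_URL] * REQUESTS))

    print("median: %.3fs  95th: %.3fs  max: %.3fs" % (
        latencies[len(latencies) // 2],
        latencies[int(len(latencies) * 0.95)],
        latencies[-1],
    ))

Looking at the tail (95th percentile and max) rather than just the average tends to be what surfaces the weak links.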

Our testing showed that:

  • we needed to offload more static file serving to the CDN
  • we could use the load balancer to serve some files which were generated dynamically in the backend yet never changed – an enormous saving
  • our backend slave cloud dbs needed some tuning. I don’t consider myself a MySQL expert by any stretch of the imagination, but I found that our dbs were small enough, and there was sufficient RAM in our AWS instances, that tweaks like increasing the query cache size and raising the InnoDB buffer pool size ensured no disk I/O when serving requests.
  • we should alter our backend caching to evict after a longer period of time – this would reduce load on our dbs (a rough sketch follows this list)
  • we needed to smooth our deployment process so we could fire up additional backend nodes and load balancers quickly if necessary
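For the caching change, here is a minimal sketch of the idea (the function and key names are made up, and this stands in for whatever cache client you actually use): keep rendered playlist responses around with a longer expiry so that repeat traffic during a spike is absorbed before it ever reaches the database.

    # Sketch of caching with a longer eviction window (hypothetical names):
    # serve repeat requests from the cache and only hit the db on a miss
    # or after the entry expires.
    import time

    CACHE_TTL_SECONDS = 15 * 60   # raised from a much shorter window; value is illustrative
    _cache = {}                   # key -> (expires_at, value)

    def fetch_playlist_from_db(playlist_id):
        """Stand-in for the real (expensive) database query."""
        return {"id": playlist_id, "videos": []}

    def get_playlist(playlist_id):
        key = "playlist:%s" % playlist_id
        entry = _cache.get(key)
        if entry and entry[0] > time.time():
            return entry[1]       # cache hit: the db never sees the request
        value = fetch_playlist_from_db(playlist_id)
        _cache[key] = (time.time() + CACHE_TTL_SECONDS, value)
        return value

The obvious trade-off is staleness: a longer expiry only makes sense for data that changes rarely.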

There’s much more to be done but surviving the onslaught thus far (with plenty of remaining capacity) has definitely been very heartening. It almost (but not quite) makes up for working through most of the holiday season 🙂