It’s been a while since I updated and with good reason. We were acquired by Limelight Networks last year and are now happily ensconced within the extended family of other acquisitions like Kiptronics and Clickability. While it was great to have little perks like my pick of multiple offices (Limelight has a couple scattered around the Bay Area) after toiling at home for most of two plus years, we had other mandates to fulfill. A big one was moving our infrastructure from AWS to Limelight. The reason for this was simple enough. Limelight operates a humongous number of data centers around the world, all connected by private fiber. Business cost efficiencies aside, it really didn’t make sense for us to not leverage this.
AWS makes it really easy to migrate to the cloud. Firing up machines and leveraging new services is a snap. All you really need is a credit card and a set of digital keys. It’s not until you try going the other way you realize how intertwined your applications are with AWS. I am reminded of the alien wrapped around the face of John Hurt!
Obviously, we’d not be where we were if we hadn’t been able to leverage AWS infrastructure – and for that I am eternally thankful. However, as we found, if you do need to move off of it for whatever reason, you’ll find the difficulty of your task to be directly proportional to the number of cloud services you use. In our case, we employed a tremendous amount – EC2, S3, SimpleDB, Elastic Map Reduce, Elastic Load Balancing, CloudFront, Cloud monitoring and, to a lesser extent, Elastic Block Storage and SQS. This list doesn’t even include the management tools that we used to handle our applications in the AWS cloud. Consequently, part of our task was to find suitable replacements for all of these and perform migration while ensuring our core platform continued to run (and scale) smoothly.
I’ll be writing more about our migration strategy, planning and execution further on down the line but for now, I thought it would be useful to summarize some of our experiences with Amazon services over the past couple of years. Here goes:
- EC2 – one of the things I’d expected when I first moved our services to EC2 would be that instances would be a lot more unstable. You’re asked to plan for instances going down at any time and I’ve experienced instances of all types crashing but when taking out the failures due to application issues, the larger instances turned out to be definitely more stable. Very much so, actually.
- S3 is both amazing and frustrating. It’s mind blowing how it’s engineered to store an indefinite number of files per bucket. It’s astonishing at how highly available it is. In my 3 years with AWS, I can recall S3 having serious issues only twice or thrice. It’s also annoying because it does not behave like a standard file system, in particular, in its eventual consistency policy, yet it’s very easy to write your applications to treat it like one. That’s dangerous. We kept changing our S3 related code to include many, many retries as the eventual consistency model doesn’t guarantee that just because you have successfully written a file to S3, it will actually be there.
- Cloudfront is an expensive service and the main reason to use it is because of its easy integration with S3. Other CDNs are more competitive price-wise.
- Elastic Map Reduce provides Hadoop as a service. Currently, there are two main branches of hadoop supported – 0.18.3 and 0.20.2 although EMR folks encourage you to use the latter, and I believe, is the default for job submission tools. The EMR team has incorporated many fixes of their own into the hadoop code base. I’ve heard of grumblings of this causing yet further splintering of the hadoop codebase (in addition to what comes out of Yahoo and Cloudera). I’ve also heard these will be merged back into the main branches but I am not sure of the status. Being one of the earliest and consistent users of this service (we run every couple of hours and at peak can have 100 plus instances running), I’ve found it to be very stable and the EMR team to be very responsive when jobs fail for whatever reason. Our map reduce applications use 0.18.3 and I am currently in the process of moving over to 0.20.2, something recommended by the EMR team. Once that occurs, I’d expect performance and stability to improve further. Two further thoughts regarding our EMR usage:
- EMR is structured such that it’s extremely tempting to scale your job by simply increasing the number of nodes in your cluster. It’s just a number after all. In doing so, however, you run the risk of ignoring fundamental issues with your workflow until it assumes calamitous proportions, especially as the size of your unprocessed data increases. Changing node types is a temporary palliative but at some point you have to bite the bullet and get down to computational efficiency and job tuning.
- Job run time matters! I’ve found similar jobs to run much faster in the middle of the night than during the day. I can only assume this is the dark side of multi tenancy in a shared cloud. In other words, the performance a framework like hadoop relying heavily on network interconnectivity is going to be dependent on the load imposed on the network by other customers. Perhaps this is why AWS has rolled out the new cluster instance types that can run in a more dedicated network setting.
- SimpleDB: my experience with SimpleDB is not particularly positive. The best thing I can say about it is that it rarely goes down wholesale. However, retrieval performance is not that great and endemic with 503 type errors. Whatever I said for S3 and retries goes double for SimpleDB. In addition, there’s been no new features added to SimpleDB for quite a while and I am not sure what AWS long term goals are for it, if any.
That’s it for now. I hope to add more in a future edition.
Note: post updated thanks to Sid Anand.