Archive for aws
Slides from my talk at CloudTech III in early October 2012:
As more startups use Amazon Web Services, the following scenario becomes increasingly frequent – the startup is acquired but required by the parent company to move away from AWS and into their own data centers. Given the all encompassing nature of AWS, this is not a trivial task and requires careful planning at both the application and systems level. In this presentation, I recount my experiences at Delve, a video publishing SaaS platform, with our post acquisition migration to Limelight Networks, a global CDN, during a period of tremendous growth in traffic. In particular, I share some of the tips/techniques we employed during this process to reduce AWS dependence and evolve to a hybrid private/AWS global architecture that allowed us to compete effectively with other digital video leaders.
It’s been a while since I updated and with good reason. We were acquired by Limelight Networks last year and are now happily ensconced within the extended family of other acquisitions like Kiptronics and Clickability. While it was great to have little perks like my pick of multiple offices (Limelight has a couple scattered around the Bay Area) after toiling at home for most of two plus years, we had other mandates to fulfill. A big one was moving our infrastructure from AWS to Limelight. The reason for this was simple enough. Limelight operates a humongous number of data centers around the world, all connected by private fiber. Business cost efficiencies aside, it really didn’t make sense for us to not leverage this.
AWS makes it really easy to migrate to the cloud. Firing up machines and leveraging new services is a snap. All you really need is a credit card and a set of digital keys. It’s not until you try going the other way you realize how intertwined your applications are with AWS. I am reminded of the alien wrapped around the face of John Hurt!
Obviously, we’d not be where we were if we hadn’t been able to leverage AWS infrastructure – and for that I am eternally thankful. However, as we found, if you do need to move off of it for whatever reason, you’ll find the difficulty of your task to be directly proportional to the number of cloud services you use. In our case, we employed a tremendous amount – EC2, S3, SimpleDB, Elastic Map Reduce, Elastic Load Balancing, CloudFront, Cloud monitoring and, to a lesser extent, Elastic Block Storage and SQS. This list doesn’t even include the management tools that we used to handle our applications in the AWS cloud. Consequently, part of our task was to find suitable replacements for all of these and perform migration while ensuring our core platform continued to run (and scale) smoothly.
I’ll be writing more about our migration strategy, planning and execution further on down the line but for now, I thought it would be useful to summarize some of our experiences with Amazon services over the past couple of years. Here goes:
- EC2 – one of the things I’d expected when I first moved our services to EC2 would be that instances would be a lot more unstable. You’re asked to plan for instances going down at any time and I’ve experienced instances of all types crashing but when taking out the failures due to application issues, the larger instances turned out to be definitely more stable. Very much so, actually.
- S3 is both amazing and frustrating. It’s mind blowing how it’s engineered to store an indefinite number of files per bucket. It’s astonishing at how highly available it is. In my 3 years with AWS, I can recall S3 having serious issues only twice or thrice. It’s also annoying because it does not behave like a standard file system, in particular, in its eventual consistency policy, yet it’s very easy to write your applications to treat it like one. That’s dangerous. We kept changing our S3 related code to include many, many retries as the eventual consistency model doesn’t guarantee that just because you have successfully written a file to S3, it will actually be there.
- Cloudfront is an expensive service and the main reason to use it is because of its easy integration with S3. Other CDNs are more competitive price-wise.
- Elastic Map Reduce provides Hadoop as a service. Currently, there are two main branches of hadoop supported – 0.18.3 and 0.20.2 although EMR folks encourage you to use the latter, and I believe, is the default for job submission tools. The EMR team has incorporated many fixes of their own into the hadoop code base. I’ve heard of grumblings of this causing yet further splintering of the hadoop codebase (in addition to what comes out of Yahoo and Cloudera). I’ve also heard these will be merged back into the main branches but I am not sure of the status. Being one of the earliest and consistent users of this service (we run every couple of hours and at peak can have 100 plus instances running), I’ve found it to be very stable and the EMR team to be very responsive when jobs fail for whatever reason. Our map reduce applications use 0.18.3 and I am currently in the process of moving over to 0.20.2, something recommended by the EMR team. Once that occurs, I’d expect performance and stability to improve further. Two further thoughts regarding our EMR usage:
- EMR is structured such that it’s extremely tempting to scale your job by simply increasing the number of nodes in your cluster. It’s just a number after all. In doing so, however, you run the risk of ignoring fundamental issues with your workflow until it assumes calamitous proportions, especially as the size of your unprocessed data increases. Changing node types is a temporary palliative but at some point you have to bite the bullet and get down to computational efficiency and job tuning.
- Job run time matters! I’ve found similar jobs to run much faster in the middle of the night than during the day. I can only assume this is the dark side of multi tenancy in a shared cloud. In other words, the performance a framework like hadoop relying heavily on network interconnectivity is going to be dependent on the load imposed on the network by other customers. Perhaps this is why AWS has rolled out the new cluster instance types that can run in a more dedicated network setting.
- SimpleDB: my experience with SimpleDB is not particularly positive. The best thing I can say about it is that it rarely goes down wholesale. However, retrieval performance is not that great and endemic with 503 type errors. Whatever I said for S3 and retries goes double for SimpleDB. In addition, there’s been no new features added to SimpleDB for quite a while and I am not sure what AWS long term goals are for it, if any.
That’s it for now. I hope to add more in a future edition.
Note: post updated thanks to Sid Anand.
Following up from Some Tips on Amazon’s EMR
Recently, I was asked for some pointers for a log processing application intended to be implemented via Amazon’s Elastic Map Reduce. Questions included how big the log files should be, what format worked best and so on. I found it to be a good opportunity to summarize some of my own experiences. Obviously, there’s no hard and fast rule on optimal file size as input for EMR per se. Much depends on the computation being performed and the types of instances chosen. That being said, here are some considerations:
- it’s a good idea to batch log files as much as possible. Hadoop/EMR will generate a map job for every individual file in S3 that serves as input. So, the lower the number of files, the less maps in the initial step and the faster your computation.
- another reason to batch files: hadoop doesn’t deal with many small files very well. See http://www.cloudera.com/blog/2009/02/the-small-files-problem/ It’s possible to get better performance with a smaller number of large files.
- however, it’s not a good idea to batch files too much – if 10 100M files are blended into a single 1GB file, then it’s not possible to take advantage of the parallelization of map reduce. One mapper will then be working on downloading one file while the rest are idle, waiting for that step to finish.
- gzip vs other compression schemes for input splits – hadoop cannot peer inside gzipped files, so it can’t produce splits very well with that kind of input files. It does better with LZO. See https://forums.aws.amazon.com/message.jspa?messageID=184519
In short, experimenting with log file sizes and instance types is essential for good performance down the road.
Amazon provides a whole variety of instance types for EC2 but lists their CPU capabilities via “EC2 Compute Units” where
One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
That’s somewhat helpful for m1.small but what about c1.xlarge which has something like 20 EC2 compute units? How to map that to the real world? Fortunately, I found a cloud computing presentation from Constantinos Evangelinos and Chris Hill from MIT/EAPS which contained mappings of most of the common ec2 instance types. It’s from 2008 but should still be applicable. Drawing from the slides, we have:
- m1.small => (1 virtual core with 1 EC2 Compute Unit) => half core of a 2 socket node, dual core AMD Opteron(tm) Processor 2218 HE, 2.6GHz
- m1.large => (2 virtual cores with 2 EC2 Compute Units each) => half of a 2 socket node, dual core AMD Opteron(tm) Processor 270, 2.0GHz
- m1.xlarge => (4 virtual cores with 2 EC2 Compute Units each) => one 2 socket node, dual core AMD Opteron(tm) Processor 270, 2.0GHz
- c1.medium => (2 virtual cores with 2.5 EC2 Compute Units each) => half of a 2 socket node, quad core Xeon E5345, 2.33GHz
- c1.xlarge => (8 virtual cores with 2.5 EC2 Compute Units each) => one 2 socket node, quad core Xeon E5345, 2.33GHz
Brian Aker talks about the post Oracle MySQL world in this O’Reilly Radar interview. Good stuff. One section though caused me to raise an eyebrow:
MapReduce works as a solution when your queries are operating over a lot of data; Google sizes of data. Few companies have Google-sized datasets though. The average sites you see, they’re 10-20 gigs of data. Moving to a MapReduce solution for 20 gigs of data, or even for a terabyte or two of data, makes no sense. Using MapReduce with NoSQL solutions for small sites? This happens because people don’t understand how to pick the right tools.
Hmm. First of all, just because you have 10-20GB of data right now doesn’t mean you’ll have 10-20GB of data in the future. From my experience, once you start getting into this range of data, scaling mysql becomes painful. More likely as not, your application has absolutely no sharding/distributed processing capability built in to your mysql setup, so at this point, your choices are:
- vertical scaling => bigger boxes, RAID/SSD disks etc.
- introduce sharding into mysql, retrofit your application to deal with it
- bite the bullet and offload your processing into some other type of setup such as MapReduce
(1) is merely kicking the can down the road.
(2) involves maintaining more mysql servers, worrying about sharding schemes, setting up a middleman to deal with partitioning, data collation etc.
In both (1) and (2), you still have to worry about many little things in mysql such as setting up replication, setting up indexes for tables, tuning queries etc. And in (2), you’ll have more servers running. While it is true mysql clustering exists, as does native partitioning support in newer mysql versions, setting that stuff up is still painful and it’s not clear whether the associated maintenance overhead is worth the performance you get.
It’s not a surprise more and more people are turning to (3). A hadoop cluster provides more power out of the box than a sharded mysql setup, and a more brain dead scalable path. Just add more machines! Yes, there are configuration issues involved in a hadoop cluster as well but I think they’re far easier to deal with than the equivalent mysql setup. The main drawback here is (3) only works if your processing requirements are batch based, not real time.
It is true that not all of the technologies in the Hadoop ecosystem outside of Hadoop itself are all that mature. However, BigTable solutions like Hbase are still not that easy to setup and run. Pig is still evolving but Cascading is an amazing library. Additionally, if one uses Amazon’s cloud products judiciously, it may actually be possible to do (3) really cheap (as opposed to (2) which requires more and bigger machines).
How? Store persistent files in S3 (logs etc). Use Elastic MapReduce periodically so you are not running a dedicated hadoop cluster. Use SimpleDB for your db needs. SimpleDB has limitations (2500 limit on selects, restricted attributes, strings only) but more and more people (such as Netflix) are using it for high volume applications. Furthermore, all of these technologies are enabling single entrepreneurs to do things like crawl and maintain big chunks of the web so that they can build interesting new applications on top, something that would have been too cost prohibitive in the older MySQL world. I hope to write more about it soon.
After Amazon’s reserved instance pricing announcement last year, there were quite a few folks writing about the breakeven point for your ec2 instance i.e. the length of time you’d need to run your instance continuously before the reserved pricing turned out to be cheaper than the standard pay-as-you-go scheme. Looking around, I believe the general consensus was that it would take around 4643 hours or 6.3 months. See here, here and here, for example.
Around late October of last year, Amazon announced even cheaper pricing for their ec2 instances. However, not seeing any newer breakeven numbers computed in the wake of lower prices, I decided to post some of my own. These are for one year reserved pricing for Amazon’s US-N-Virginia data center. All data is culled from the AWS ec2 page.
As we can see, the break even numbers have dropped quite a bit – down to 4136 hours on most of the instance types, a drop of almost 500 hours or so. That translates to better pricing 3 weeks earlier than before, in about 5.7 months. Interestingly enough, the high memory instances have slightly earlier break even points (by about 50 hours or so). Not quite sure why.
Recently, I discovered Practical Cloud Computing, a blog run by Siddharth Anand, an architect in Netflix’s cloud infrastructure group. In a recent post, he writes:
I was recently tasked with fork-lifting ~1 billion rows from Oracle into SimpleDB. I completed this forklift in November 2009 after many attempts. To make this as efficient as possible, I worked closely with Amazon’s SimpleDB folks to troubleshoot performance problems and create new APIs.
Why would they need something like this? From another entry titled, Introducing the Oracle-SimpleDB Hybrid, Siddharth writes:
My company would like to migrate its systems to the cloud. As this will take several months, the engineering team needs to support data access in both the cloud and its data center in the interim. Also, the RDBMS system might be maintained until some functionality (e.g. Backup-Restore) is created in SimpleDB.
To this aim, for the past 9 months, I have been building an eventually-consistent, multi-master data store. This system is comprised of an Oracle replica and several SimpleDB replicas.
In other words, Netflix is planning to move many of its constituent services into the AWS cloud starting with their main data repository. This sounded like a pilot project, albeit a massive one, and understandably so given the size of Netflix. If this went smoothly the immediate upside would be Netflix not spending a fortune on Oracle licenses and maintenance. In addition, AWS would have proved itself to be able handle Netflix’s scale requirements.
Evidently things went well as I came across a slide deck detailing Netflix’s cloud usages further:
Fascinating stuff. From the deck, it appears that in addition to using SimpleDB for data storage, Netflix is using many AWS components in for its online streaming setup. Specifically:
- ec2 for encoding
- S3 for storing source and encoded files
- SQS for application communication
I also saw references to EBS (Elastic Block Storage), ELB (Elastic Load Balancing) and EMR (Elastic Map Reduce).
I think for the longest time, AWS and other services of its ilk, were viewed as resources used by startups (such as ourselves) in an effort to ramp up to scale quickly so as to go toe to toe with the big guys. It’s interesting to see the big guys get in on the act themselves.
The graph below shows the requests/sec on the Delve production load balancers for our playlist service system. The time frame roughly covers the past 7 days.
As you can see, we’ve had at least three major peaks over the past couple of days. Some of these have been due to some big traffic partners coming online (including Pokemon) and at least one of them (the most recent one) was because of singer/songwriter Jay Reatard’s untimely passing and the subsequent massive demand for his videos by way of pitchfork, a partner of ours. In other words, some we predicted. Others – well those just happen.
All of these hits are great for the business and for our growth but is definitely white knuckle time for those of us responsible for keeping the system running. Fortunately, through some luck and a whole lot of planning, things have gone very smoothly thus far, fingers crossed. Some of the things we did in advance to prepare:
- load testing the entire system to isolate the weakest links: we found apachebench and httperf to be our good friends
- instrument components to print out response times: in particular, we did this with nginx, our load balancer of choice, as it is very easy to print out upstream and client response times
- utilize the cloud to prepare testbeds: instead of hitting our production system, we were able to set up smaller replicas in the cloud and test there
- monitor each machine in the chain: running something as simple as top or a little more sophisticated as netstat during load testing can provide great insights. In fact, this is something not limited to load testing. Simply, monitoring production machines during heavy traffic can provide a lot of information.
Our testing showed that:
- we needed to offload serving more static files to the CDN
- we could use the load balancer to serve some files which were generated dynamically in the backend, yet never changed. This was an enormous saving.
- our backend slave cloud dbs needed some semblance of tuning. I don’t consider myself to be a mysql expert by any stretch of the imagination. However, I found that our dbs were small enough and there was sufficient RAM in our AWS instances such that tweaks like increasing the query cache size and raising the innodb buffer pool ensured no disk I/O when serving requests.
- altering our backend caching to evict after a longer period of time – this would reduce load on our dbs
- smoothen our deployment process so we can fire up additional backend nodes and load balancers if necessary
There’s much more to be done but surviving the onslaught thus far (with plenty of remaining capacity) has definitely been very heartening. It almost (but not quite) makes up for working through most of the holiday season
A key advantage associated with cloud computing is that of scalability. Theoretically, it should be easy to provision new machines or decommission older ones from an existing application and scale thus. In reality, things are not so simple. The application has to be suitably structured from ground up in order to best leverage this feature. Merely adding more CPUs or storage will not deliver linear performance improvements unless the application was explicitly designed with that goal. Most legacy systems are not and consequently, as traffic and usage grows, must be continually monitored and patched to keep performing at an acceptable level. This is not optimal. Consequently, extracting maximum utility from the cloud requires applications follow a set of architectural guidelines. Some thoughts on what those should be:
Stateless, immutable components
An important guideline for linear scalabilty is to have relatively lightweight, independent stateless processes which can execute anywhere and run on newly deployed resource (threads/nodes/cpus) as appropriate in order to serve an increasing number of requests. These services share nothing with others, merely processing asynchronous messages. At Delve, we make extensive use of this technique for multimedia operations such as thumbnail extraction, transcoding and transcription that fit well into this paradigm. Scaling for these services involves spinning up, automatically configuring and deploying additional dedicated instances which can be put to work immediately and subsequently taken down once they are no longer needed. Without planning for this type of scenario, however, it is difficult for legacy applications to leverage this type of functionality.
Reduced reliance on relational databases
Relational databases are primarily designed for managing updates and transactions on a single instance. They scale well, but usually on a single node. When the capacity of that single node is reached, it is necessary to scale out and distribute the load across multiple machines. While there are best practices such as clustering, replication and sharding to allow this type of functionality, they have to be incorporated into the system design from the beginning for the application to benefit. Moving into the cloud does not get rid of this problem.
Furthermore, even if these techniques are utilized by the application, their complexity makes it very difficult to scale to hundreds or thousands of nodes, drastically reducing their viability for large distributed systems. Legacy applications are more likely to be reliant on relational databases and moving the actual database system to the cloud does not eliminate any of these issues.
Alternatively, applications designed for the cloud have the opportunity to leverage a number of cloud based storage systems to reduce their dependence on RDBMS systems. For example, we use Amazon’s SimpleDB as core for a persistent key/value store instead of MySQL. Our use case does not require relational database features such as joining multiple tables. However, scalability is essential and SimpleDB provides a quick and uncomplicated way for us to implement this feature. Similarly, we use Amazon’s Simple Storage Service (S3) to store Write Once Read Many data such as very large video file backups and our analytics reports. Both of these requirements, were we to use MySQL like many legacy applications providing similar functionality, would require a heavy initial outlay of nodes and management infrastructure. By using SimpleDB and S3, we are able to provide functionality comparable to or better than legacy systems at lower cost.
There are caveats, however, with using nosql type systems. They have their own constraints and using them effectively requires understanding those limitations. For example, S3 works under a version of the eventual consistency model which does not provide the same guarantees as a standard file system. Treating it as such would lead to problems. Similarly, SimpleDB provides limited db functionality – treating it as a mysql equivalent would be a mistake.
Integration with other cloud based applications
A related advantage of designing for the cloud is the ability to leverage systems offered by the cloud computing provider. In our case, we extensively use Amazon’s Elastic Map Reduce (EMR) service for our analytics. EMR, like Amazon’s other cloud offerings, is a pay as you go system. It is also tightly coupled with the rest of Amazon’s cloud infrastructure such that transferring data within Amazon is free. At periodic intervals, our system spins up a number of nodes within EMR, transfers data from S3, performs computations, saves the results and tears down the instances. The functionality we thus achieve is similar to constantly maintaining a large dedicated map-reduce cluster such as that would be required by a legacy application but at a fraction of the cost.
Deploying an application to the cloud demands special preparation. Cloud machines are typically commodity hardware – preparing an environment able to run the different types of services required by an application is time consuming. In addition, deployment is not a one time operation. The service may need additional capacity to be added later, fast. Consequently, it is important to be able to quickly commission, customize and deploy a new set of boxes as necessary. Existing tools do not provide the functionality required. As cloud computing is relatively new, tools to deploy and administer in the cloud are similarly nascent and must be developed in addition to the actual application. Furthermore, developing, using and maintaining such tools requires skills typically not found in the average sysadmin. The combination of tools and personnel required to develop and run them poses yet another hurdle for moving existing applications to the cloud. For new applications, these considerations must be part of any resource planning.
Typically, cloud infrastructure providers do not guarantee uptime for a node. This implies a box can go down at any time. Additionally, providers such as Amazon will provide a certain percentage uptime confidence level for data centers. While in reality nodes are usually stable, an application designed for the cloud has to have redundancy built in such that a) backups are running and b) they are running in separate data centers. These backup systems also must meet other application requirements such as scalability. Their data must also be in sync with that of the primary stores. Deploying and coordinating such a system, imposes additional overhead in terms of design, implementation, deployment and maintenance, particularly relational databases are involved. Consequently, applications designed from the grounds up with these constraints in mind are much more likely to have an easier transition to the cloud.
MongoDB, Tokyo Cabinet, Project Voldemort … Systems that provide distributed, persistent key value store functionality are proliferating like kudzu. Sometimes it seems that not a day goes by without my hearing about a new one. Case in point: just right now, while browsing, I came across Riak, a “decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.”
I understand the motivation behind the NoSQL movement: one of them has to be backlash at the problems associated with MySQL. It’s one of those beasts that is very easy to get started on but, if you don’t start with the right design decisions and growth plan, can be hellacious to scale and maintain. Sadly, this is something I’ve seen happen at places repeatedly throughout my tenure at various companies. It happens all too often. Small wonder then that developers and companies have identified one of the most frequent use cases with modern web applications and MySQL – that of needing to quickly look up key value pairs reliably – and have built or are building numerous pieces of software optimized for doing precisely that at very high performance levels.
The trouble is if you need one yourself. Which one to pick? There are some nice surveys out there (here’s one from highscalability with many good links) but most are in various stages of development, usually with version numbers less than 1 and qualifiers like “alpha” or “beta” appended. Some try to assuage your fears:
Version 0.1? Is it ready for my production data?
That kind of decision depends on many factors, most of which cannot be answered in general but depend on your business. We gave it a low version number befitting a first public appearance, but Riak has robustly served the needs of multiple revenue-generating applications for nearly two years now.
In other words, “we’ve had good experiences with it but caveat emptor. You get what you pay for.”
This is why I really enjoyed the following entry in BJ Clark’s recent survey:
type: key/value store
Conclusion: Scales amazingly well
You’re probably all like “What?!?”. But guess what, S3 is a killer key/value store. It is not as performant as any of the other options, but it scales *insanely* well. It scales so well, you don’t do anything. You just keep sticking shit in it, and it keeps pumping it out. Sometimes it’s faster than other times, but most of the time it’s fast enough. In fact, it’s faster than hitting mysql with 10 queries (for us). S3 is my favorite k/v data store of any out there.
I couldn’t agree more. Recently, I finished a major project at Delve (and I hope to write about more of this later) where one of our goals was to have all our reports we computed for our customers to be available indefinitely. Our current system stores all the reports in, you guessed it, MySQL. The trouble is this eats up MySQL resources and since we don’t do any queries on these reports, we, in essence, are simply using MySQL as a repository. By moving our reporting storage to S3 (and setting up a simple indexing scheme to list and store lookup keys), we have greatly increased the capacity for our current MySQL installation and are now able to keep and lookup reports for customers indefinitely. We are reliant on S3 lookup times and availability – but, for this use case, the former is not as big an issue and having Amazon take care of the latter frees us to worry about other pressing problems of which are fairly plentiful at a startup!