soam's home

home mail us syndication

Archive for performance

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies

Slides from my talk at CloudTech III in early October 2012:

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies from Soam Acharya


Talk abstract:

As more startups use Amazon Web Services, the following scenario becomes increasingly frequent – the startup is acquired but required by the parent company to move away from AWS and into their own data centers. Given the all encompassing nature of AWS, this is not a trivial task and requires careful planning at both the application and systems level. In this presentation, I recount my experiences at Delve, a video publishing SaaS platform, with our post acquisition migration to Limelight Networks, a global CDN, during a period of tremendous growth in traffic. In particular, I share some of the tips/techniques we employed during this process to reduce AWS dependence and evolve to a hybrid private/AWS global architecture that allowed us to compete effectively with other digital video leaders.

Some Tips on Amazon’s EMR II

Following up from Some Tips on Amazon’s EMR

Recently, I was asked for some pointers for a log processing application intended to be implemented via Amazon’s Elastic Map Reduce. Questions included how big the log files should be, what format worked best and so on. I found it to be a good opportunity to summarize some of my own experiences. Obviously, there’s no hard and fast rule on optimal file size as input for EMR per se. Much depends on the computation being performed and the types of instances chosen. That being said, here are some considerations:

  • it’s a good idea to batch log files as much as possible. Hadoop/EMR will generate a map job for every individual file in S3 that serves as input. So, the lower the number of files, the less maps in the initial step and the faster your computation.
  • another reason to batch files: hadoop doesn’t deal with many small files very well. See It’s possible to get better performance with a smaller number of large files.
  • however, it’s not a good idea to batch files too much – if 10 100M files are blended into a single 1GB file, then it’s not possible to take advantage of the parallelization of map reduce. One mapper will then be working on downloading one file while the rest are idle, waiting for that step to finish.
  • gzip vs other compression schemes for input splits – hadoop cannot peer inside gzipped files, so it can’t produce splits very well with that kind of input files. It does better with LZO. See
In short, experimenting with log file sizes and instance types is essential for good performance down the road.

MySQL High Availability Talk – My Notes

Last month, Lenz Grimmer, Community Manager at MySQL (now Oracle, I suppose), gave an overview of a number of MySQL HA techniques at the SF MySQL Meetup group. My notes from that talk:

MySQL not trying to be an Oracle replacement, rather the goal is to make it better for its specific needs and requirements.

HA Concepts and Considerations:

  • something can and will fail
  • can’t afford downtime eg. maintenance
  • adding HA to an existing system is complex (Note: from my experience, this is definitely the case with MySQL!)

HA components:

  • a heartbeat checker is definitely necessary to check
    • whether services still alive?
    • components: individual servers, services, network etc
  • HA monitoring
    • have to be able to add and remove services
    • have to allow shutdown/startup operations on services
    • have to allow manual control if necessary
  • Shared storage/replication

One of the possible failure scenarios for a distributed system is the Split Brain syndrome whereby communication failures can lead to cluster partitioning. Further problems can ensue when each partition tries to take control of the entire cluster. Have to use approaches such as fencing or moderation/arbitration to avoid.

Some notes on MySQL replication:

  • replicate statements that have changed. This is a statement or row based approach.
  • can be asynchronous so slaves can lag
  • new in mysql 5.5 – semi sync replication
  • not fully synchronous but replication is included in the transaction i.e. transcation will not proceed until master receives ok from at least one slave
  • master maintains binary log and index
  • replication on slave is single threaded i.e. parallel transactions are serialized
  • there is no automated fail-over
  • a new feature in 5.5 – replication heartbeat

The master master configuration is not suitable for write load balancing. Don’t write to both masters at the same time, use sharding/partitioning instead eg. auto increments is a PITA (audience query)

Disk replication is another HA technique. This is not mysql level replication. Instead, files are replicated to another disk at the disk level via block level replication. DRBD (Disk Replacement Block Device) is one such technology. Some features:

  • raid-1 over network
  • synchronous/async block replication
  • automatic resync
  • application agnostic since operating at the disk level
  • can mask local I/O issues

By default, DRBD operates on an active-passive configuration such that block device on 2nd system isn’t accessible. Now, DRBD has changed to allow writes on the passive device as well but it only really works if using clustered file system underneath like GFS or OCFS2. However, it remains a dangerous practice.

When using DRBD with MySQL:

  • really bad with MyISAM tables since the replication occurs at the block level. Failover could lead to an inconsistent file system, so integrity check to repair would be required. Hence, can only use journaled file system with DRBD. Also, Innodb more easily repaired than MyISAM.
  • MySQL server runs only on primary drbd node, not on secondary.

Instead of replication, another possibility is to use a storage area network to secure your data. However, in that case, the SAN can become a single point of failure. In addition, following a switchover, new MySQL instances can have cold caches – since they have had not had time to warm up yet.

MySQL Cluster technology, on the other hand, is not good with cross table JOINs. In addition, owing to their architecture, they may not be suitable for all applications.

A number of companies were mentioned in the talk that are active in the space. One such example is Galera, a Norwegian company which provides their own take on MySQL replication. Essentially, they have produced a patch for Innodb as well as an external library. This allows single or multi master setups as well as multicast based replication.

Peak Load

The graph below shows the requests/sec on the Delve production load balancers for our playlist service system. The time frame roughly covers the past 7 days.

As you can see, we’ve had at least three major peaks over the past couple of days. Some of these have been due to some big traffic partners coming online (including Pokemon) and at least one of them (the most recent one) was because of singer/songwriter Jay Reatard’s untimely passing and the subsequent massive demand for his videos by way of pitchfork, a partner of ours. In other words, some we predicted. Others – well those just happen.

All of these hits are great for the business and for our growth but is definitely white knuckle time for those of us responsible for keeping the system running. Fortunately, through some luck and a whole lot of planning, things have gone very smoothly thus far, fingers crossed. Some of the things we did in advance to prepare:

  • load testing the entire system to isolate the weakest links: we found apachebench and httperf to be our good friends
  • instrument components to print out response times: in particular, we did this with nginx, our load balancer of choice, as it is very easy to print out upstream and client response times
  • utilize the cloud to prepare testbeds: instead of hitting our production system, we were able to set up smaller replicas in the cloud and test there
  • monitor each machine in the chain: running something as simple as top or a little more sophisticated as netstat during load testing can provide great insights. In fact, this is something not limited to load testing. Simply, monitoring production machines during heavy traffic can provide a lot of information.

Our testing showed that:

  • we needed to offload serving more static files to the CDN
  • we could use the load balancer to serve some files which were generated dynamically in the backend, yet never changed. This was an enormous saving.
  • our backend slave cloud dbs needed some semblance of tuning. I don’t consider myself to be a mysql expert by any stretch of the imagination. However, I found that our dbs were small enough and there was sufficient RAM in our AWS instances such that tweaks like increasing the query cache size and raising the innodb buffer pool ensured no disk I/O when serving requests.
  • altering our backend caching to evict after a longer period of time – this would reduce load on our dbs
  • smoothen our deployment process so we can fire up additional backend nodes and load balancers if necessary

There’s much more to be done but surviving the onslaught thus far (with plenty of remaining capacity) has definitely been very heartening. It almost (but not quite) makes up for working through most of the holiday season 🙂