soam's home

home mail us syndication

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies

Slides from my talk at CloudTech III in early October 2012:

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies from Soam Acharya


Talk abstract:

As more startups use Amazon Web Services, the following scenario becomes increasingly frequent – the startup is acquired but required by the parent company to move away from AWS and into their own data centers. Given the all encompassing nature of AWS, this is not a trivial task and requires careful planning at both the application and systems level. In this presentation, I recount my experiences at Delve, a video publishing SaaS platform, with our post acquisition migration to Limelight Networks, a global CDN, during a period of tremendous growth in traffic. In particular, I share some of the tips/techniques we employed during this process to reduce AWS dependence and evolve to a hybrid private/AWS global architecture that allowed us to compete effectively with other digital video leaders.

Building Analytical Applications on Hadoop

Last week, I attended a talk by Josh Wills from Cloudera entitled “Building Analytical Applications on Hadoop” at the SF Data Mining meetup. Here’s a link to the talk video. An entertaining speaker, Josh did a great job on tying together various strands of thought on what it takes to build interesting big data analytical applications that engage, inform and entertain. Here are the notes I took during his presentation.

Building Analytical Applications On Hadoop

Josh Wills – Director of Data Science at Cloudera
Nov 2012

@Cloudera 16 months
@Google before that, ~4 years. Worked on Ad Auctions, data infrastructure stuff (logging, dashboards, friend recommendations for Google+, News etc).
Operations Research background.

He is now at an enterprise S/W company. Cloudera, that by itself doesn’t have a lot of data but gets to work with companies and data sets that do.

Data Scientist definition: person who is better at stats than a software engineer and better at software engineering than any statistician.

What are analytical applications?
Field is overrun by words like bigdata, data science. What does it mean exactly?

His definition is “applications that allow individuals to work with and make decisions about data.”

This is too broad, so here’s an example to clarify –  writing a dashboard. This is Data Science 101. Use tools like D3, Tableau etc to produce dashboard. All of these tools support pie charts, 3d pie charts etc. He doesn’t like pie charts.

Another example: Crossfilter with Flight Info. It’s from D3. Great for analyzing time series oriented data.

Other topical examples: New York Times (NYT) Electoral Vote Map (differentials by county from 2008). Mike Bostock did a great series of visualizations at NYT including that one.

Make distinction between analytics applications vs frameworks. All the examples so for have used tool + data. However, technologies like R, SAS, QlikView are frameworks by themselves. BTW, R released a new framework today called Shiny used to help build new webapps. He can’t wait to start playing with it.

2012: The Predicting of the President
Talks about Nate Silver. NYT, 538,com etc. Models for presidential prediction.

Real Clear Politics: simple average of polls for a state, transparent, simple model. Nice interactions with the UI. by Nate Silver: “Foxy” model. Currently reading Nate’s “The Signal and the Noise.” Lots of inputs to his model: state polls, economic data, inter state poll correlations. It’s pretty opaque. The site provides simple interactions with a richer UI.

Princeton Election Consortium (PEC): model based on poll medians and polynomials (take poll median, convert to probability via polynomial function and predict). Very transparent. Site has a nice set of rich interactions. Can change assumptions via java applet on site and produce new predictions.

Both PEC, RCP 29/50. Nate got all 50 states. All did a good job of predicting state elections. 538 did the best.

But 1 expert actually beat Nate. Kos from did better than Nate. He took all the swing state polls – bumped Obama margins in COL, NV since polls undercount Latino turnouts. One other simple trick: picked a poll he knew to be more accurate than others.

Index funds, hedge funds and Warren Buffett
3 different approaches: simple, complex, expert + data.

Moral: having an expert armed with really simple data can beat complex models.

A Brief Intro to Hadoop
Didn’t take notes here other than on data economics: return on byte. How much does it cost to store data? How much value do I get from it?

Big Data Economics: no individual report is particularly valuable but having every record is very valuable (web index, recommendation systems, sensor data, online advertising, market basket analysis).

Developing analytical applications with hadoop
Rule #1: novelty is the enemy of adoption.

Try to make the new way of doing things appear similar to the old tool on the surface.

First hadoop application he developed was new take on an old tool with the same CLI. He calls this Seismic Hadoop. Seismic data processing: how we find oil. It involves sending waves down the earth, recording and analyzing what’s reflected back. Geophysicists have been developing open source s/w to track this stuff since ’80s. He took one of the examples and ported it to hadoop.

Reccommends Hive as the best way of getting started with Hadoop. Has a familiar SQL interface. All business tools such as Tableau connect to Hive.

Borrowing abstractions like spreadsheets. Data community has developed some really good metaphors over the years. Use them. Examples of big data tools that do this – Karmasphere.

However, if you just see hadoop through the lens of these abstractions, it limits your tapping of its main power. Consider Stonebraker who hates hadoop but just doesn’t get it. Who told him it was a relational db?

Improving the UX: Impala. Fast query engine on top of hadoop. Took ideas from classic dbs and built it.

Moving beyond the abstractions

First, make the abstractions concrete. A star schema is a very useful abstraction. Helps one understand a relational database. But you can’t take that model and map it to hadoop as it’s not a relational db. You’ll miss new ways of creating analytical applications.
How do I make this patient data available to a doctor who doesn’t know sql? A new set of users who have to work with big data sets without requisite programming skills. This is a real challenge and part of the mandate of data scientists.

Plug for Cloudera’s data science course: how to munge data on hadoop for input to machine learning model.

Analytical Applications He Loves
An Experiment Dashboard: deploy experiement, write data to hadoop cluster, compute popluation size, confidence intervals +- 1%, 2%, 3% etc. Scope for a startup here.

Adverse Drug Events: his 2nd project at Cloudera. Data on patients who die from drug reactions. FDA sample db available online. Looking for drug reactions. Analyze all possible combinatorial explosion of all possible drug interactions. This involves making assumptions as regards bucketing patients into similar groups. It’s a series of pig scripts. 20 MR jobs. Create a UI where the expert can construct a bucketing scheme, click an UI triggers pipeline and shows graphical output.

Gene Sequencing and Analytics. Guys at NextBio did a great analysis on making a gene sequencing example available in a way that makes sense to biologists. Storing genome data doesn’t map into sql very well. So people are using hadoop to create solr indexes and using solr to lookup gene sequences (google HBaseCon to find slides).

The Doctor’s Perspective: Electronic Med Records Storm for inputs, HBase for key value pairs, Solr for searches.

Couple of themes:
– structure the data in the way it makes sense for the proeblem
– interactive inputs, not just outputs (let people interact with the model eg. PEC)
– simpler interfaces that yield more sophisticated answers (for users who don’t have the technical sophistication) how to make large quantities of data who don’t have the skills to process them at scale

Holy grail is Wolfram Alpha but not proprietary.

Moving Beyond Map Reduce:
eg YARN, move beyond hadoop constraints.

The Cambrian explosion of frameworks: mezos at twitter, YARN from Yahoo.

Spark is good: in memory framework. Defines operations on distributed in memory collections. Written in Scala. Supports reading to and writing from HDFS.

Graphlab – from CMU. Map/Reduce => Update/Sort. lower level promitev. Lots of machine learning libraries out of the box. Reads from HDFS.

Playing with YARN – developed a config lang like Borg called Kitten
BranchReduce ->


Impala: not intended as Hive killer. Uses Hive meta store. Use Hive for ETLs, Impala for more interactive queries.

The one h/w technology that could evolve to better serve Big Data: network, network, network.


Deployment at Facebook

Facebook at Mozcon - Alex


An Ars Technica article, A behind-the-scenes look at Facebook release engineering, has been making the rounds this morning, yielding some fascinating behind the scenes details about FB’s release process. The whole thing is worth a read but some things stood out for me:

  • the Release Engineering team have their own built in bar. The booze must alleviate the pre-release tension!
  • very frequent deployments. It’s not unusual for code to be released daily.
  • the entire FB codebase is bundled into a single 1.5GB binary and pushed out to FB servers everywhere
  • they use bittorrent to distribute the binary since point to point pushes are that much more inefficient
  • they fully expect a percentage of servers to fail the upgrade. This can be due to hardware or network issues.
  • DevOps i.e. FB developers are on the hook for release and performance. This is part of the “”developers on the wrong end of a pager” approach that companies are increasingly taking. Having toiled as defacto Director of Ops at our startup for quite a while and having worked with Ops as well, I’ve seen all sides of the issue. What FB does here is similar to Amazon.
  • The release team actually rate developers depending on deployment smoothness with subsequent impact on their job performance reviews. Rollback is a last resort.
  • Facebook’s HipHop php compiler reduced CPU requirements by 50%. That’s pretty amazing.

The last mind blowing detail I found:

The many data sources tracked by Facebook’s internal monitoring tools even include tweets about Facebook. That information is displayed in a graph with separate trend lines to show the change in volume of positive and negative remarks. This is useful, since one of the things that people do when they encounter a technical problem on a social network is complain about it on a different social network.



Which Linux Distro Is The Most Popular?

About six years ago, I had a hankering to dig deeper into Linux distros. I took an old desktop (and when I mean old, I mean like dated circa 2000!) and went through the long and painful process of installing Gentoo on it. I then installed apache and movable type and pretty soon I had this desktop box running 24/7 at home powering my site. By today’s standards, the h/w specs of the box meant that it would have comfortably been beaten by my iPhone blindfolded and with two hands tied. Yet, because Gentoo insists on compiling the entire distro from scratch per installation, it actually performed its web hosting duties pretty well. Ultimately, I ended up moving the site over to an ISP but only because such a setup provides things like 24/7 power and net access, something not possible at home then due to the PG & E imposed rolling blackouts in the SF Bay Area.

The next time I had to consider Linux distros in any meaningful way was when I had to start moving the services in our startup to Amazon. I ended up picking Ubuntu for our AMI. It seemed to have the biggest footprint and support. Gentoo didn’t really enter the picture at the time. Since then, I’ve seen, at least post acquisition at Limelight, the slow supplanting of Ubuntu and Debian by CentOS, certainly for server installs.

Imagine my surprise when a friend recently updated his FB status thus: “Setting up gentoo linux. It’s really designed for self torture.” Did people still use Gentoo? So, I did a bit of digging and found a site, DistroWatch, that offers various distro downloads and keeps track of their popularity. According to them, the top five are:

  1. Mint
  2. Ubuntu
  3. Fedora
  4. openSUSE
  5. Debian

Gentoo comes in at 18. To be honest, I never really had heard of Mint either. Apparently it is a desktop distro that is:

an Ubuntu-based distribution whose goal is to provide a more complete out-of-the-box experience by including browser plugins, media codecs, support for DVD playback, Java and other components. It also adds a custom desktop and menus, several unique configuration tools, and a web-based package installation interface. Linux Mint is compatible with Ubuntu software repositories.

I would imagine DistroWatch is targeted at desktop downloads, hence the skew. Interesting nonetheless.

Here’s a post at Geektrio dated nearly two years ago listing the then top ten from DistroWatch. The top five at that time:

  1. Ubuntu
  2. Fedora
  3. openSUSE
  4. Debian
  5. Mandriva

Mint is at 6 and Gentoo comes in at 9. The trend for these two would seem to be pretty clear. Mandriva (also known as Linux Mandrake) has now dropped to 17. Fascinating stuff for Linux enthusiasts.


The Case for an Open Source As Service Platform

In his article on Steve Jobs (The Tinkerer), Malcolm Gladwell gets to the core of what made the UK dominate the industrial revolution:

They believe that Britain dominated the industrial revolution because it had a far larger population of skilled engineers and artisans than its competitors: resourceful and creative men who took the signature inventions of the industrial age and tweaked them—refined and perfected them, and made them work.

Similarly, Steve Jobs, as per Isaacson’s biography:

But Isaacson’s biography suggests that he was much more of a tweaker. He borrowed the characteristic features of the Macintosh—the mouse and the icons on the screen—from the engineers at Xerox PARC, after his famous visit there, in 1979. The first portable digital music players came out in 1996. Apple introduced the iPod, in 2001, because Jobs looked at the existing music players on the market and concluded that they “truly sucked.” Smart phones started coming out in the nineteen-nineties. Jobs introduced the iPhone in 2007, more than a decade later, because, Isaacson writes, “he had noticed something odd about the cell phones on the market: They all stank, just like portable music players used to.

And so on. This observation does give rise to a question – if Jobs could rise to such exalted heights by mere ruthless refinement, what hope is left for the rest of us mere mortals? What the article does not say and should be obvious to anyone in the tech industry is that we’ve been living through the golden age of tweaking. After all, what is open source if not tweaking unleashed? I don’t need to go through the sheer quantity and variety of tools, programs, methods and systems that open source has produced. There is an open source equivalent for pretty much every functionality you can think of. Yet, I wonder if we have already lived through its golden age and are moving on to something else.

What I mean is that in the past ten years or so, we have moved from the paradigm of software as executable to software as service. It is not enough to produce a program or system. You have to run it as well and keep it running. Hence websites, search engines, social networks and pretty much everything else in our grand world wide web. This leads to the next question: while there is an open source equivalent to pretty much any software from Microsoft or Adobe, where is the open source as service (OSaaS) equivalent to Google or Facebook? Nutch doesn’t count – it’s code that has to be installed and run. Doing so is nontrivial and illustrates some of the issues facing OSaaS:

  • cost of machines to run the service and supporting services
  • cost of bandwidth
  • storage costs
  • operations costs

I would argue Wikipedia is a good example of a OSaaS – and it is continually in fundraising mode. Furthermore, eiven Google and Facebook’s masive scale, it’s impossible to produce any kind of competing system the traditional way without massive amounts of money. So much for open source purity! While it is true SaaS and PaaS providers often make available APIs and platforms on a tiered basis with the first x or so requests being free and hence is a great way of getting started with your app, you have to pay after you exceed a certain level of usage. Again, non scalable and sure to discourage the next budding Steve Jobs toiling away in his/her garage.

I think the success of the SETI project (not in finding aliens but in getting people to contribute their spare cycles) or even what I saw at Looksmart when we acquired Grub indicates there might be another way. Grub was an open source crawler that users could download and install on their computers. It showed nifty screen savers when your computer needed to snooze and crawled URLs at the same time. We were surprised by Grub’s uptake. People wanted to make search better and were happy to download and run it on their own computers. We had people allocating farms of machines devoted to running Grub. We used it for nothing but dead link checking for our Wisenut search engine – but even that made people happy to contribute.

One possible lesson from this could be that if it is possible to develop a framework/platform for effectively partitioning the service amongst many participants, each participant would pay a fraction of the total cost. Of course, as BitTorrent shows, load balancing has to be carefully done. People that host too many files leech too much ISP resources and get sued. Grid computing and projects like BOINC are certainly promising but seem to be specialized for long running jobs of certain types like protein folding or astrophysics computations. It’s not clear whether they can provide a distributed, public, OSaaS platform. Such a platform, if carefully engineered, could pave the way to many interesting applications that could provide alternatives to the Facebooks and Googles of the world and ensure tinkering in the new millenium remains within the reach of dreamers.


Appetite For Self Destruction

Appetite For Self Destruction is a recounting of the fall, rise and subsequent decimation of the US music industry. The books starts with the “Disco Sucks” backlash and the subsequent precipitous fall in LP sales. The CD comes along just in time to rescue the music industry from these doldrums with Michael Jackson’s “Thriller” being one of the first killer apps of this new technology. Guess what? The industry then sees an opportunity in the improved fidelity of CDs. It uses this to jack up prices and rides the subsequent boom hard to amazing profits and profligacy.

The book has a great time recounting some of these merry stories of excess and the unsavory characters that flocked into the businss. We all know the bottom finally started to fall out with the introduction of Napster but author Steve Knopper makes the point that this occurred more due to the insistence of the RIAA and the rest of the gang in clinging to what had hitherto worked. In doing so, however, they began alienating fans and musicians alike and never recovered. Suing grandmothers is hardly a particularly good business model.

A nice graphic from a related article in Business Insider (The Real Death of the Music Industry) says it all:

Apparently, Steve Jobs was essentially forced to step in and create a digital music distribution system as he badly needed content for the then recently introduced iPod. By that time, the music industry had realized they needed a legal online presence and Apple with a scant 4-5% of the PC market hardly seemed any kind of threat. Accordingly, head honchos of labels like Warner, Sony and others ended up signing agreemens largely biased in favor of Apple, realizing only belatedly they had given away the farm.

Interspersed through the book are chapters covering in painful detail every mistake made by the record companies on their way to their current depleted state. How music can survive, new business models, apps and services – all these topics are hot areas of discussion by pundits. The contribution of this book is illustrating how we got here in the first place.

Flattering Streaming Media Coverage for Our Video Platform

Streaming Media writes that OVP (Online Video Platform) is a pretty crowded space with more than a hundred vendors. They then list the top ten platforms and yes, we’re in there. Here’s what they say:

The OVP formerly known as Delve (until Limelight bought it) stands out for its user experience and workflow. The user interface was designed to be friendly and to make it easy to accomplish tasks. When a user uploads a video, he or she can update its metadata and set the channel even before the video is fully uploaded. That’s a timesaver few can match. It also offers strong analytics and APIs.

The platform’s Video Clipper tool makes it easy to shorten videos and works well in combination with Limelight’s real-time analytics. If users see that viewers are routinely quitting a video at a certain point, they have the option of ending the video there.

The OVP was acquired by Limelight in August 2010. Rather than focusing on just the mobile experience or just analytics, the Limelight Video Platform focuses on offering the best end-to-end user experience. Since it’s now under the same roof as a top CDN, it’s able to offer tie-ins that others can’t. Users gain from functionality such as player edge scaling, which the video platform is able to offer through low-level access to the CDN. Users also get to use developer APIs that aren’t open to the public.

Given that we started relatively late in this space (the company had started to pivot away from semantic video search to the platform SaaS approach just prior to my coming on board), this recognition is particularly heartening. It has been a monumental amount of work and accumulation of many battle scars for all of us. Full credit to Alex, Edgardo and the rest of the team!

Being committed to a startup means you have to be prepared to do anything and everything. That and my own role(s) at Delve/Limelight mean involvement with pretty much all system aspects, especially the backend. I am particularly chuffed at the repeated mentions of analytics since a lot of my own blood, sweat and tears go into it.

Okay, better stop now before this starts reading like an acceptance speech 🙂

Reverse Migration – Moving Out of AWS Part I

It’s been a while since I updated and with good reason. We were acquired by Limelight Networks last year and are now happily ensconced within the extended family of other acquisitions like Kiptronics and Clickability. While it was great to have little perks like my pick of multiple offices (Limelight has a couple scattered around the Bay Area) after toiling at home for most of two plus years, we had other mandates to fulfill. A big one was moving our infrastructure from AWS to Limelight. The reason for this was simple enough. Limelight operates a humongous number of data centers around the world, all connected by private fiber. Business cost efficiencies aside, it really didn’t make sense for us to not leverage this.

Amazon Web Services

AWS makes it really easy to migrate to the cloud. Firing up machines and leveraging new services is a snap. All you really need is a credit card and a set of digital keys. It’s not until you try going the other way you realize how intertwined your applications are with AWS. I am reminded of the alien wrapped around the face of John Hurt!

Obviously, we’d not be where we were if we hadn’t been able to leverage AWS infrastructure – and for that I am eternally thankful. However, as we found, if you do need to move off of it for whatever reason, you’ll find the difficulty of your task to be directly proportional to the number of cloud services you use. In our case, we employed a tremendous amount – EC2, S3, SimpleDB, Elastic Map Reduce, Elastic Load Balancing, CloudFront, Cloud monitoring and, to a lesser extent, Elastic Block Storage and SQS. This list doesn’t even include the management tools that we used to handle our applications in the AWS cloud. Consequently, part of our task was to find suitable replacements for all of these and perform migration while ensuring our core platform continued to run (and scale) smoothly.

I’ll be writing more about our migration strategy, planning and execution further on down the line but for now, I thought it would be useful to summarize some of our experiences with Amazon services over the past couple of years. Here goes:

  • EC2 – one of the things I’d expected when I first moved our services to EC2 would be that instances would be a lot more unstable. You’re asked to plan for instances going down at any time and I’ve experienced instances of all types crashing but when taking out the failures due to application issues, the larger instances turned out to be definitely more stable. Very much so, actually.
  • S3 is both amazing and frustrating. It’s mind blowing how it’s engineered to store an indefinite number of files per bucket. It’s astonishing at how highly available it is. In my 3 years with AWS, I can recall S3 having serious issues only twice or thrice. It’s also annoying because it does not behave like a standard file system, in particular, in its eventual consistency policy, yet it’s very easy to write your applications to treat it like one. That’s dangerous. We kept changing our S3 related code to include many, many retries as the eventual consistency model doesn’t guarantee that just because you have successfully written a file to S3, it will actually be there.
  • Cloudfront is an expensive service and the main reason to use it is because of its easy integration with S3. Other CDNs are more competitive price-wise.
  • Elastic Map Reduce provides Hadoop as a service. Currently, there are two main branches of hadoop supported – 0.18.3 and 0.20.2 although EMR folks encourage you to use the latter, and I believe, is the default for job submission tools. The EMR team has incorporated many fixes of their own into the hadoop code base. I’ve heard of grumblings of this causing yet further splintering of the hadoop codebase (in addition to what comes out of Yahoo and Cloudera). I’ve also heard these will be merged back into the main branches but I am not sure of the status. Being one of the earliest and consistent users of this service (we run every couple of hours and at peak can have 100 plus instances running), I’ve found it to be very stable and the EMR team to be very responsive when jobs fail for whatever reason. Our map reduce applications use 0.18.3 and I am currently in the process of moving over to 0.20.2, something recommended by the EMR team. Once that occurs, I’d expect performance and stability to improve further. Two further thoughts regarding our EMR usage:
    • EMR is structured such that it’s extremely tempting to scale your job by simply increasing the number of nodes in your cluster. It’s just a number after all. In doing so, however, you run the risk of ignoring fundamental issues with your workflow until it assumes calamitous proportions, especially as the size of your unprocessed data increases. Changing node types is a temporary palliative but at some point you have to bite the bullet and get down to computational efficiency and job tuning.
    • Job run time matters! I’ve found similar jobs to run much faster in the middle of the night than during the day. I can only assume this is the dark side of multi tenancy in a shared cloud. In other words, the performance a framework like hadoop relying heavily on network interconnectivity is going to be dependent on the load imposed on the network by other customers. Perhaps this is why AWS has rolled out the new cluster instance types that can run in a more dedicated network setting.
  • SimpleDB: my experience with SimpleDB is not particularly positive. The best thing I can say about it is that it rarely goes down wholesale. However, retrieval performance is not that great and endemic with 503 type errors. Whatever I said for S3 and retries goes double for SimpleDB. In addition, there’s been no new features added to SimpleDB for quite a while and I am not sure what AWS long term goals are for it, if any.

That’s it for now. I hope to add more in a future edition.

Note: post updated thanks to Sid Anand.

Some Tips on Amazon’s EMR II

Following up from Some Tips on Amazon’s EMR

Recently, I was asked for some pointers for a log processing application intended to be implemented via Amazon’s Elastic Map Reduce. Questions included how big the log files should be, what format worked best and so on. I found it to be a good opportunity to summarize some of my own experiences. Obviously, there’s no hard and fast rule on optimal file size as input for EMR per se. Much depends on the computation being performed and the types of instances chosen. That being said, here are some considerations:

  • it’s a good idea to batch log files as much as possible. Hadoop/EMR will generate a map job for every individual file in S3 that serves as input. So, the lower the number of files, the less maps in the initial step and the faster your computation.
  • another reason to batch files: hadoop doesn’t deal with many small files very well. See It’s possible to get better performance with a smaller number of large files.
  • however, it’s not a good idea to batch files too much – if 10 100M files are blended into a single 1GB file, then it’s not possible to take advantage of the parallelization of map reduce. One mapper will then be working on downloading one file while the rest are idle, waiting for that step to finish.
  • gzip vs other compression schemes for input splits – hadoop cannot peer inside gzipped files, so it can’t produce splits very well with that kind of input files. It does better with LZO. See
In short, experimenting with log file sizes and instance types is essential for good performance down the road.

MySQL High Availability Talk – My Notes

Last month, Lenz Grimmer, Community Manager at MySQL (now Oracle, I suppose), gave an overview of a number of MySQL HA techniques at the SF MySQL Meetup group. My notes from that talk:

MySQL not trying to be an Oracle replacement, rather the goal is to make it better for its specific needs and requirements.

HA Concepts and Considerations:

  • something can and will fail
  • can’t afford downtime eg. maintenance
  • adding HA to an existing system is complex (Note: from my experience, this is definitely the case with MySQL!)

HA components:

  • a heartbeat checker is definitely necessary to check
    • whether services still alive?
    • components: individual servers, services, network etc
  • HA monitoring
    • have to be able to add and remove services
    • have to allow shutdown/startup operations on services
    • have to allow manual control if necessary
  • Shared storage/replication

One of the possible failure scenarios for a distributed system is the Split Brain syndrome whereby communication failures can lead to cluster partitioning. Further problems can ensue when each partition tries to take control of the entire cluster. Have to use approaches such as fencing or moderation/arbitration to avoid.

Some notes on MySQL replication:

  • replicate statements that have changed. This is a statement or row based approach.
  • can be asynchronous so slaves can lag
  • new in mysql 5.5 – semi sync replication
  • not fully synchronous but replication is included in the transaction i.e. transcation will not proceed until master receives ok from at least one slave
  • master maintains binary log and index
  • replication on slave is single threaded i.e. parallel transactions are serialized
  • there is no automated fail-over
  • a new feature in 5.5 – replication heartbeat

The master master configuration is not suitable for write load balancing. Don’t write to both masters at the same time, use sharding/partitioning instead eg. auto increments is a PITA (audience query)

Disk replication is another HA technique. This is not mysql level replication. Instead, files are replicated to another disk at the disk level via block level replication. DRBD (Disk Replacement Block Device) is one such technology. Some features:

  • raid-1 over network
  • synchronous/async block replication
  • automatic resync
  • application agnostic since operating at the disk level
  • can mask local I/O issues

By default, DRBD operates on an active-passive configuration such that block device on 2nd system isn’t accessible. Now, DRBD has changed to allow writes on the passive device as well but it only really works if using clustered file system underneath like GFS or OCFS2. However, it remains a dangerous practice.

When using DRBD with MySQL:

  • really bad with MyISAM tables since the replication occurs at the block level. Failover could lead to an inconsistent file system, so integrity check to repair would be required. Hence, can only use journaled file system with DRBD. Also, Innodb more easily repaired than MyISAM.
  • MySQL server runs only on primary drbd node, not on secondary.

Instead of replication, another possibility is to use a storage area network to secure your data. However, in that case, the SAN can become a single point of failure. In addition, following a switchover, new MySQL instances can have cold caches – since they have had not had time to warm up yet.

MySQL Cluster technology, on the other hand, is not good with cross table JOINs. In addition, owing to their architecture, they may not be suitable for all applications.

A number of companies were mentioned in the talk that are active in the space. One such example is Galera, a Norwegian company which provides their own take on MySQL replication. Essentially, they have produced a patch for Innodb as well as an external library. This allows single or multi master setups as well as multicast based replication.

Next entries »