soam's home

home mail us syndication

Archive for hadoop

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies

Slides from my talk at CloudTech III in early October 2012:

Escape From Amazon: Tips/Techniques for Reducing AWS Dependencies from Soam Acharya


Talk abstract:

As more startups use Amazon Web Services, the following scenario becomes increasingly frequent – the startup is acquired but required by the parent company to move away from AWS and into their own data centers. Given the all encompassing nature of AWS, this is not a trivial task and requires careful planning at both the application and systems level. In this presentation, I recount my experiences at Delve, a video publishing SaaS platform, with our post acquisition migration to Limelight Networks, a global CDN, during a period of tremendous growth in traffic. In particular, I share some of the tips/techniques we employed during this process to reduce AWS dependence and evolve to a hybrid private/AWS global architecture that allowed us to compete effectively with other digital video leaders.

Building Analytical Applications on Hadoop

Last week, I attended a talk by Josh Wills from Cloudera entitled “Building Analytical Applications on Hadoop” at the SF Data Mining meetup. Here’s a link to the talk video. An entertaining speaker, Josh did a great job on tying together various strands of thought on what it takes to build interesting big data analytical applications that engage, inform and entertain. Here are the notes I took during his presentation.

Building Analytical Applications On Hadoop

Josh Wills – Director of Data Science at Cloudera
Nov 2012

@Cloudera 16 months
@Google before that, ~4 years. Worked on Ad Auctions, data infrastructure stuff (logging, dashboards, friend recommendations for Google+, News etc).
Operations Research background.

He is now at an enterprise S/W company. Cloudera, that by itself doesn’t have a lot of data but gets to work with companies and data sets that do.

Data Scientist definition: person who is better at stats than a software engineer and better at software engineering than any statistician.

What are analytical applications?
Field is overrun by words like bigdata, data science. What does it mean exactly?

His definition is “applications that allow individuals to work with and make decisions about data.”

This is too broad, so here’s an example to clarify –  writing a dashboard. This is Data Science 101. Use tools like D3, Tableau etc to produce dashboard. All of these tools support pie charts, 3d pie charts etc. He doesn’t like pie charts.

Another example: Crossfilter with Flight Info. It’s from D3. Great for analyzing time series oriented data.

Other topical examples: New York Times (NYT) Electoral Vote Map (differentials by county from 2008). Mike Bostock did a great series of visualizations at NYT including that one.

Make distinction between analytics applications vs frameworks. All the examples so for have used tool + data. However, technologies like R, SAS, QlikView are frameworks by themselves. BTW, R released a new framework today called Shiny used to help build new webapps. He can’t wait to start playing with it.

2012: The Predicting of the President
Talks about Nate Silver. NYT, 538,com etc. Models for presidential prediction.

Real Clear Politics: simple average of polls for a state, transparent, simple model. Nice interactions with the UI. by Nate Silver: “Foxy” model. Currently reading Nate’s “The Signal and the Noise.” Lots of inputs to his model: state polls, economic data, inter state poll correlations. It’s pretty opaque. The site provides simple interactions with a richer UI.

Princeton Election Consortium (PEC): model based on poll medians and polynomials (take poll median, convert to probability via polynomial function and predict). Very transparent. Site has a nice set of rich interactions. Can change assumptions via java applet on site and produce new predictions.

Both PEC, RCP 29/50. Nate got all 50 states. All did a good job of predicting state elections. 538 did the best.

But 1 expert actually beat Nate. Kos from did better than Nate. He took all the swing state polls – bumped Obama margins in COL, NV since polls undercount Latino turnouts. One other simple trick: picked a poll he knew to be more accurate than others.

Index funds, hedge funds and Warren Buffett
3 different approaches: simple, complex, expert + data.

Moral: having an expert armed with really simple data can beat complex models.

A Brief Intro to Hadoop
Didn’t take notes here other than on data economics: return on byte. How much does it cost to store data? How much value do I get from it?

Big Data Economics: no individual report is particularly valuable but having every record is very valuable (web index, recommendation systems, sensor data, online advertising, market basket analysis).

Developing analytical applications with hadoop
Rule #1: novelty is the enemy of adoption.

Try to make the new way of doing things appear similar to the old tool on the surface.

First hadoop application he developed was new take on an old tool with the same CLI. He calls this Seismic Hadoop. Seismic data processing: how we find oil. It involves sending waves down the earth, recording and analyzing what’s reflected back. Geophysicists have been developing open source s/w to track this stuff since ’80s. He took one of the examples and ported it to hadoop.

Reccommends Hive as the best way of getting started with Hadoop. Has a familiar SQL interface. All business tools such as Tableau connect to Hive.

Borrowing abstractions like spreadsheets. Data community has developed some really good metaphors over the years. Use them. Examples of big data tools that do this – Karmasphere.

However, if you just see hadoop through the lens of these abstractions, it limits your tapping of its main power. Consider Stonebraker who hates hadoop but just doesn’t get it. Who told him it was a relational db?

Improving the UX: Impala. Fast query engine on top of hadoop. Took ideas from classic dbs and built it.

Moving beyond the abstractions

First, make the abstractions concrete. A star schema is a very useful abstraction. Helps one understand a relational database. But you can’t take that model and map it to hadoop as it’s not a relational db. You’ll miss new ways of creating analytical applications.
How do I make this patient data available to a doctor who doesn’t know sql? A new set of users who have to work with big data sets without requisite programming skills. This is a real challenge and part of the mandate of data scientists.

Plug for Cloudera’s data science course: how to munge data on hadoop for input to machine learning model.

Analytical Applications He Loves
An Experiment Dashboard: deploy experiement, write data to hadoop cluster, compute popluation size, confidence intervals +- 1%, 2%, 3% etc. Scope for a startup here.

Adverse Drug Events: his 2nd project at Cloudera. Data on patients who die from drug reactions. FDA sample db available online. Looking for drug reactions. Analyze all possible combinatorial explosion of all possible drug interactions. This involves making assumptions as regards bucketing patients into similar groups. It’s a series of pig scripts. 20 MR jobs. Create a UI where the expert can construct a bucketing scheme, click an UI triggers pipeline and shows graphical output.

Gene Sequencing and Analytics. Guys at NextBio did a great analysis on making a gene sequencing example available in a way that makes sense to biologists. Storing genome data doesn’t map into sql very well. So people are using hadoop to create solr indexes and using solr to lookup gene sequences (google HBaseCon to find slides).

The Doctor’s Perspective: Electronic Med Records Storm for inputs, HBase for key value pairs, Solr for searches.

Couple of themes:
– structure the data in the way it makes sense for the proeblem
– interactive inputs, not just outputs (let people interact with the model eg. PEC)
– simpler interfaces that yield more sophisticated answers (for users who don’t have the technical sophistication) how to make large quantities of data who don’t have the skills to process them at scale

Holy grail is Wolfram Alpha but not proprietary.

Moving Beyond Map Reduce:
eg YARN, move beyond hadoop constraints.

The Cambrian explosion of frameworks: mezos at twitter, YARN from Yahoo.

Spark is good: in memory framework. Defines operations on distributed in memory collections. Written in Scala. Supports reading to and writing from HDFS.

Graphlab – from CMU. Map/Reduce => Update/Sort. lower level promitev. Lots of machine learning libraries out of the box. Reads from HDFS.

Playing with YARN – developed a config lang like Borg called Kitten
BranchReduce ->


Impala: not intended as Hive killer. Uses Hive meta store. Use Hive for ETLs, Impala for more interactive queries.

The one h/w technology that could evolve to better serve Big Data: network, network, network.


Reverse Migration – Moving Out of AWS Part I

It’s been a while since I updated and with good reason. We were acquired by Limelight Networks last year and are now happily ensconced within the extended family of other acquisitions like Kiptronics and Clickability. While it was great to have little perks like my pick of multiple offices (Limelight has a couple scattered around the Bay Area) after toiling at home for most of two plus years, we had other mandates to fulfill. A big one was moving our infrastructure from AWS to Limelight. The reason for this was simple enough. Limelight operates a humongous number of data centers around the world, all connected by private fiber. Business cost efficiencies aside, it really didn’t make sense for us to not leverage this.

Amazon Web Services

AWS makes it really easy to migrate to the cloud. Firing up machines and leveraging new services is a snap. All you really need is a credit card and a set of digital keys. It’s not until you try going the other way you realize how intertwined your applications are with AWS. I am reminded of the alien wrapped around the face of John Hurt!

Obviously, we’d not be where we were if we hadn’t been able to leverage AWS infrastructure – and for that I am eternally thankful. However, as we found, if you do need to move off of it for whatever reason, you’ll find the difficulty of your task to be directly proportional to the number of cloud services you use. In our case, we employed a tremendous amount – EC2, S3, SimpleDB, Elastic Map Reduce, Elastic Load Balancing, CloudFront, Cloud monitoring and, to a lesser extent, Elastic Block Storage and SQS. This list doesn’t even include the management tools that we used to handle our applications in the AWS cloud. Consequently, part of our task was to find suitable replacements for all of these and perform migration while ensuring our core platform continued to run (and scale) smoothly.

I’ll be writing more about our migration strategy, planning and execution further on down the line but for now, I thought it would be useful to summarize some of our experiences with Amazon services over the past couple of years. Here goes:

  • EC2 – one of the things I’d expected when I first moved our services to EC2 would be that instances would be a lot more unstable. You’re asked to plan for instances going down at any time and I’ve experienced instances of all types crashing but when taking out the failures due to application issues, the larger instances turned out to be definitely more stable. Very much so, actually.
  • S3 is both amazing and frustrating. It’s mind blowing how it’s engineered to store an indefinite number of files per bucket. It’s astonishing at how highly available it is. In my 3 years with AWS, I can recall S3 having serious issues only twice or thrice. It’s also annoying because it does not behave like a standard file system, in particular, in its eventual consistency policy, yet it’s very easy to write your applications to treat it like one. That’s dangerous. We kept changing our S3 related code to include many, many retries as the eventual consistency model doesn’t guarantee that just because you have successfully written a file to S3, it will actually be there.
  • Cloudfront is an expensive service and the main reason to use it is because of its easy integration with S3. Other CDNs are more competitive price-wise.
  • Elastic Map Reduce provides Hadoop as a service. Currently, there are two main branches of hadoop supported – 0.18.3 and 0.20.2 although EMR folks encourage you to use the latter, and I believe, is the default for job submission tools. The EMR team has incorporated many fixes of their own into the hadoop code base. I’ve heard of grumblings of this causing yet further splintering of the hadoop codebase (in addition to what comes out of Yahoo and Cloudera). I’ve also heard these will be merged back into the main branches but I am not sure of the status. Being one of the earliest and consistent users of this service (we run every couple of hours and at peak can have 100 plus instances running), I’ve found it to be very stable and the EMR team to be very responsive when jobs fail for whatever reason. Our map reduce applications use 0.18.3 and I am currently in the process of moving over to 0.20.2, something recommended by the EMR team. Once that occurs, I’d expect performance and stability to improve further. Two further thoughts regarding our EMR usage:
    • EMR is structured such that it’s extremely tempting to scale your job by simply increasing the number of nodes in your cluster. It’s just a number after all. In doing so, however, you run the risk of ignoring fundamental issues with your workflow until it assumes calamitous proportions, especially as the size of your unprocessed data increases. Changing node types is a temporary palliative but at some point you have to bite the bullet and get down to computational efficiency and job tuning.
    • Job run time matters! I’ve found similar jobs to run much faster in the middle of the night than during the day. I can only assume this is the dark side of multi tenancy in a shared cloud. In other words, the performance a framework like hadoop relying heavily on network interconnectivity is going to be dependent on the load imposed on the network by other customers. Perhaps this is why AWS has rolled out the new cluster instance types that can run in a more dedicated network setting.
  • SimpleDB: my experience with SimpleDB is not particularly positive. The best thing I can say about it is that it rarely goes down wholesale. However, retrieval performance is not that great and endemic with 503 type errors. Whatever I said for S3 and retries goes double for SimpleDB. In addition, there’s been no new features added to SimpleDB for quite a while and I am not sure what AWS long term goals are for it, if any.

That’s it for now. I hope to add more in a future edition.

Note: post updated thanks to Sid Anand.

Some Tips on Amazon’s EMR II

Following up from Some Tips on Amazon’s EMR

Recently, I was asked for some pointers for a log processing application intended to be implemented via Amazon’s Elastic Map Reduce. Questions included how big the log files should be, what format worked best and so on. I found it to be a good opportunity to summarize some of my own experiences. Obviously, there’s no hard and fast rule on optimal file size as input for EMR per se. Much depends on the computation being performed and the types of instances chosen. That being said, here are some considerations:

  • it’s a good idea to batch log files as much as possible. Hadoop/EMR will generate a map job for every individual file in S3 that serves as input. So, the lower the number of files, the less maps in the initial step and the faster your computation.
  • another reason to batch files: hadoop doesn’t deal with many small files very well. See It’s possible to get better performance with a smaller number of large files.
  • however, it’s not a good idea to batch files too much – if 10 100M files are blended into a single 1GB file, then it’s not possible to take advantage of the parallelization of map reduce. One mapper will then be working on downloading one file while the rest are idle, waiting for that step to finish.
  • gzip vs other compression schemes for input splits – hadoop cannot peer inside gzipped files, so it can’t produce splits very well with that kind of input files. It does better with LZO. See
In short, experimenting with log file sizes and instance types is essential for good performance down the road.

Data Trends For 2011

From Ed Dumbill at O’Reilly Radar comes some nice thoughts on key data trends for 2011. First, the emergence of a data marketplace:

Marketplaces mean two things. Firstly, it’s easier than ever to find data to power applications, which will enable new projects and startups and raise the level of expectation—for instance, integration with social data sources will become the norm, not a novelty. Secondly, it will become simpler and more economic to monetize data, especially in specialist domains.

The knock-on effect of this commoditization of data will be that good quality unique databases will be of increasing value, and be an important competitive advantage. There will also be key roles to play for trusted middlemen: if competitors can safely share data with each other they can all gain an improved view of their customers and opportunities.

There’s a number of companies emerging that crawl the general web, Facebook and Twitter to extract raw data, process/cross-reference that data and sell access. The article mentions InfoChimp and Gnip. Other practitioners include BackType, Klout, RapLeaf etc. Their success indicates a growing hunger for this type of information. I definitely seeing this need where I am currently. Limelight, by virtue of its massive CDN infrastructure and customers such as Netflix, collects massive amounts of user data. Such data could greatly increase in value when cross referenced against other databases which provide additional dimensions such as demographic information. This is something that might best be obtained from some sort of third party exchange.

Another trend that seems familiar is the rise of real time analytics:

This year’s big data poster child, Hadoop, has limitations when it comes to responding in real-time to changing inputs. Despite efforts by companies such as Facebook to pare Hadoop’s MapReduce processing time down to 30 seconds after user input, this still remains too slow for many purposes.
It’s important to note that MapReduce hasn’t gone away, but systems are now becoming hybrid, with both an instant element in addition to the MapReduce layer.

The drive to real-time, especially in analytics and advertising, will continue to expand the demand for NoSQL databases. Expect growth to continue for Cassandra and MongoDB. In the Hadoop world, HBase will be ever more important as it can facilitate a hybrid approach to real-time and batch MapReduce processing.

Having built Delve’s (near) real time analytics last year, I am familiar with the pain points of leveraging hadoop to fit into this kind of role. In addition NoSQL based solutions, I’d note that other approaches are emerging:

It’s interesting to see how a new breed of companies have evolved from treating their actual code as a valuable asset to giving away their code and tools and treating their data (and the models they extract from that data) as major assets instead. With that in mind, I would add a third trend to this list: the rise of cloud based data processing. Many of the startups in the data space use Amazon’s cloud infrastructure for storage and processing. Amazon’s ElasticMapReduce, which I’ve written about before, is a very well put together and stable system that obviates the need to maintain a continuously running Hadoop cluster. Obviously, not all applications fit this criteria but if it does, it can be very cost effective.

MapReduce vs MySQL

Brian Aker talks about the post Oracle MySQL world in this O’Reilly Radar interview. Good stuff. One section though caused me to raise an eyebrow:

MapReduce works as a solution when your queries are operating over a lot of data; Google sizes of data. Few companies have Google-sized datasets though. The average sites you see, they’re 10-20 gigs of data. Moving to a MapReduce solution for 20 gigs of data, or even for a terabyte or two of data, makes no sense. Using MapReduce with NoSQL solutions for small sites? This happens because people don’t understand how to pick the right tools.

Hmm. First of all, just because you have 10-20GB of data right now doesn’t mean you’ll have 10-20GB of data in the future. From my experience, once you start getting into this range of data, scaling mysql becomes painful. More likely as not, your application has absolutely no sharding/distributed processing capability built in to your mysql setup, so at this point, your choices are:

  1. vertical scaling => bigger boxes, RAID/SSD disks etc.
  2. introduce sharding into mysql, retrofit your application to deal with it
  3. bite the bullet and offload your processing into some other type of setup such as MapReduce

(1) is merely kicking the can down the road.

(2) involves maintaining more mysql servers, worrying about sharding schemes, setting up a middleman to deal with partitioning, data collation etc.

In both (1) and (2), you still have to worry about many little things in mysql such as setting up replication, setting up indexes for tables, tuning queries etc. And in (2), you’ll have more servers running. While it is true mysql clustering exists, as does native partitioning support in newer mysql versions, setting that stuff up is still painful and it’s not clear whether the associated maintenance overhead is worth the performance you get.

It’s not a surprise more and more people are turning to (3). A hadoop cluster provides more power out of the box than a sharded mysql setup, and a more brain dead scalable path. Just add more machines! Yes, there are configuration issues involved in a hadoop cluster as well but I think they’re far easier to deal with than the equivalent mysql setup. The main drawback here is (3) only works if your processing requirements are batch based, not real time.

It is true that not all of the technologies in the Hadoop ecosystem outside of Hadoop itself are all that mature. However, BigTable solutions like Hbase are still not that easy to setup and run. Pig is still evolving but Cascading is an amazing library. Additionally, if one uses Amazon’s cloud products judiciously, it may actually be possible to do (3) really cheap (as opposed to (2) which requires more and bigger machines).

How? Store persistent files in S3 (logs etc). Use Elastic MapReduce periodically so you are not running a dedicated hadoop cluster. Use SimpleDB for your db needs. SimpleDB has limitations (2500 limit on selects, restricted attributes, strings only) but more and more people (such as Netflix) are using it for high volume applications. Furthermore, all of these technologies are enabling single entrepreneurs to do things like crawl and maintain big chunks of the web so that they can build interesting new applications on top, something that would have been too cost prohibitive in the older MySQL world. I hope to write more about it soon.