soam's home

home mail us syndication

Archive for November, 2012

Building Analytical Applications on Hadoop

Last week, I attended a talk by Josh Wills from Cloudera entitled “Building Analytical Applications on Hadoop” at the SF Data Mining meetup. Here’s a link to the talk video. An entertaining speaker, Josh did a great job on tying together various strands of thought on what it takes to build interesting big data analytical applications that engage, inform and entertain. Here are the notes I took during his presentation.

Building Analytical Applications On Hadoop

Josh Wills – Director of Data Science at Cloudera
Nov 2012

@Cloudera 16 months
@Google before that, ~4 years. Worked on Ad Auctions, data infrastructure stuff (logging, dashboards, friend recommendations for Google+, News etc).
Operations Research background.

He is now at an enterprise S/W company. Cloudera, that by itself doesn’t have a lot of data but gets to work with companies and data sets that do.

Data Scientist definition: person who is better at stats than a software engineer and better at software engineering than any statistician.

What are analytical applications?
Field is overrun by words like bigdata, data science. What does it mean exactly?

His definition is “applications that allow individuals to work with and make decisions about data.”

This is too broad, so here’s an example to clarify –  writing a dashboard. This is Data Science 101. Use tools like D3, Tableau etc to produce dashboard. All of these tools support pie charts, 3d pie charts etc. He doesn’t like pie charts.

Another example: Crossfilter with Flight Info. It’s from D3. Great for analyzing time series oriented data.

Other topical examples: New York Times (NYT) Electoral Vote Map (differentials by county from 2008). Mike Bostock did a great series of visualizations at NYT including that one.

Make distinction between analytics applications vs frameworks. All the examples so for have used tool + data. However, technologies like R, SAS, QlikView are frameworks by themselves. BTW, R released a new framework today called Shiny used to help build new webapps. He can’t wait to start playing with it.

2012: The Predicting of the President
Talks about Nate Silver. NYT, 538,com etc. Models for presidential prediction.

Real Clear Politics: simple average of polls for a state, transparent, simple model. Nice interactions with the UI. by Nate Silver: “Foxy” model. Currently reading Nate’s “The Signal and the Noise.” Lots of inputs to his model: state polls, economic data, inter state poll correlations. It’s pretty opaque. The site provides simple interactions with a richer UI.

Princeton Election Consortium (PEC): model based on poll medians and polynomials (take poll median, convert to probability via polynomial function and predict). Very transparent. Site has a nice set of rich interactions. Can change assumptions via java applet on site and produce new predictions.

Both PEC, RCP 29/50. Nate got all 50 states. All did a good job of predicting state elections. 538 did the best.

But 1 expert actually beat Nate. Kos from did better than Nate. He took all the swing state polls – bumped Obama margins in COL, NV since polls undercount Latino turnouts. One other simple trick: picked a poll he knew to be more accurate than others.

Index funds, hedge funds and Warren Buffett
3 different approaches: simple, complex, expert + data.

Moral: having an expert armed with really simple data can beat complex models.

A Brief Intro to Hadoop
Didn’t take notes here other than on data economics: return on byte. How much does it cost to store data? How much value do I get from it?

Big Data Economics: no individual report is particularly valuable but having every record is very valuable (web index, recommendation systems, sensor data, online advertising, market basket analysis).

Developing analytical applications with hadoop
Rule #1: novelty is the enemy of adoption.

Try to make the new way of doing things appear similar to the old tool on the surface.

First hadoop application he developed was new take on an old tool with the same CLI. He calls this Seismic Hadoop. Seismic data processing: how we find oil. It involves sending waves down the earth, recording and analyzing what’s reflected back. Geophysicists have been developing open source s/w to track this stuff since ’80s. He took one of the examples and ported it to hadoop.

Reccommends Hive as the best way of getting started with Hadoop. Has a familiar SQL interface. All business tools such as Tableau connect to Hive.

Borrowing abstractions like spreadsheets. Data community has developed some really good metaphors over the years. Use them. Examples of big data tools that do this – Karmasphere.

However, if you just see hadoop through the lens of these abstractions, it limits your tapping of its main power. Consider Stonebraker who hates hadoop but just doesn’t get it. Who told him it was a relational db?

Improving the UX: Impala. Fast query engine on top of hadoop. Took ideas from classic dbs and built it.

Moving beyond the abstractions

First, make the abstractions concrete. A star schema is a very useful abstraction. Helps one understand a relational database. But you can’t take that model and map it to hadoop as it’s not a relational db. You’ll miss new ways of creating analytical applications.
How do I make this patient data available to a doctor who doesn’t know sql? A new set of users who have to work with big data sets without requisite programming skills. This is a real challenge and part of the mandate of data scientists.

Plug for Cloudera’s data science course: how to munge data on hadoop for input to machine learning model.

Analytical Applications He Loves
An Experiment Dashboard: deploy experiement, write data to hadoop cluster, compute popluation size, confidence intervals +- 1%, 2%, 3% etc. Scope for a startup here.

Adverse Drug Events: his 2nd project at Cloudera. Data on patients who die from drug reactions. FDA sample db available online. Looking for drug reactions. Analyze all possible combinatorial explosion of all possible drug interactions. This involves making assumptions as regards bucketing patients into similar groups. It’s a series of pig scripts. 20 MR jobs. Create a UI where the expert can construct a bucketing scheme, click an UI triggers pipeline and shows graphical output.

Gene Sequencing and Analytics. Guys at NextBio did a great analysis on making a gene sequencing example available in a way that makes sense to biologists. Storing genome data doesn’t map into sql very well. So people are using hadoop to create solr indexes and using solr to lookup gene sequences (google HBaseCon to find slides).

The Doctor’s Perspective: Electronic Med Records Storm for inputs, HBase for key value pairs, Solr for searches.

Couple of themes:
– structure the data in the way it makes sense for the proeblem
– interactive inputs, not just outputs (let people interact with the model eg. PEC)
– simpler interfaces that yield more sophisticated answers (for users who don’t have the technical sophistication) how to make large quantities of data who don’t have the skills to process them at scale

Holy grail is Wolfram Alpha but not proprietary.

Moving Beyond Map Reduce:
eg YARN, move beyond hadoop constraints.

The Cambrian explosion of frameworks: mezos at twitter, YARN from Yahoo.

Spark is good: in memory framework. Defines operations on distributed in memory collections. Written in Scala. Supports reading to and writing from HDFS.

Graphlab – from CMU. Map/Reduce => Update/Sort. lower level promitev. Lots of machine learning libraries out of the box. Reads from HDFS.

Playing with YARN – developed a config lang like Borg called Kitten
BranchReduce ->


Impala: not intended as Hive killer. Uses Hive meta store. Use Hive for ETLs, Impala for more interactive queries.

The one h/w technology that could evolve to better serve Big Data: network, network, network.