soam's home


Archive for April, 2010

MapReduce vs MySQL

Brian Aker talks about the post-Oracle MySQL world in this O’Reilly Radar interview. Good stuff. One section, though, caused me to raise an eyebrow:

MapReduce works as a solution when your queries are operating over a lot of data; Google sizes of data. Few companies have Google-sized datasets though. The average sites you see, they’re 10-20 gigs of data. Moving to a MapReduce solution for 20 gigs of data, or even for a terabyte or two of data, makes no sense. Using MapReduce with NoSQL solutions for small sites? This happens because people don’t understand how to pick the right tools.

Hmm. First of all, just because you have 10-20GB of data right now doesn’t mean you’ll have 10-20GB of data in the future. From my experience, once you start getting into this range of data, scaling mysql becomes painful. More likely than not, your application has absolutely no sharding/distributed processing capability built into your mysql setup, so at this point, your choices are:

  1. vertical scaling => bigger boxes, RAID/SSD disks etc.
  2. introduce sharding into mysql, retrofit your application to deal with it
  3. bite the bullet and offload your processing into some other type of setup such as MapReduce

(1) is merely kicking the can down the road.

(2) involves maintaining more mysql servers, worrying about sharding schemes, setting up a middleman to deal with partitioning, data collation etc.

In both (1) and (2), you still have to worry about many little things in mysql such as setting up replication, setting up indexes for tables, tuning queries etc. And in (2), you’ll have more servers running. While it is true mysql clustering exists, as does native partitioning support in newer mysql versions, setting that stuff up is still painful and it’s not clear whether the associated maintenance overhead is worth the performance you get.

It’s not a surprise that more and more people are turning to (3). A hadoop cluster provides more power out of the box than a sharded mysql setup, and a more brain-dead path to scaling: just add more machines! Yes, there are configuration issues involved in running a hadoop cluster as well, but I think they’re far easier to deal with than the equivalent mysql setup. The main drawback is that (3) only works if your processing requirements are batch-based, not real-time.
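To make the batch-processing point concrete, here is a minimal sketch of what such a job might look like with Hadoop Streaming and Python. The script names and log format are made up for illustration; the pair just computes per-URL hit counts over raw log files, the kind of GROUP BY aggregation you’d otherwise run in mysql.

    #!/usr/bin/env python
    # hit_count_mapper.py: the "map" half of the batch equivalent of
    # "SELECT url, COUNT(*) FROM hits GROUP BY url".
    # Assumes (hypothetically) that the first whitespace-separated field
    # of each log line is a URL; emits "url<TAB>1" for every hit.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if fields:
            print '%s\t1' % fields[0]

    #!/usr/bin/env python
    # hit_count_reducer.py: sums the 1s for each URL. Hadoop sorts mapper
    # output by key, so all counts for a given URL arrive together.
    import sys

    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip('\n').split('\t', 1)
        if url != current_url:
            if current_url is not None:
                print '%s\t%d' % (current_url, count)
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print '%s\t%d' % (current_url, count)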

It is true that not all of the technologies in the Hadoop ecosystem outside of Hadoop itself are all that mature. For example, BigTable-style solutions like HBase are still not that easy to set up and run, and Pig is still evolving, though Cascading is an amazing library. Additionally, if one uses Amazon’s cloud products judiciously, it may actually be possible to do (3) really cheaply (as opposed to (2), which requires more and bigger machines).

How? Store persistent files (logs, etc.) in S3. Use Elastic MapReduce periodically so you are not running a dedicated hadoop cluster. Use SimpleDB for your db needs. SimpleDB has limitations (a 2500-item limit on selects, restricted attributes, strings only), but more and more people (such as Netflix) are using it for high-volume applications. Furthermore, all of these technologies are enabling single entrepreneurs to do things like crawl and maintain big chunks of the web so that they can build interesting new applications on top, something that would have been cost-prohibitive in the older MySQL world. I hope to write more about this soon.
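As a rough illustration of that setup, here is a sketch of kicking off a periodic Elastic MapReduce run over logs sitting in S3, using the boto Python library’s EMR support. The bucket names, paths and instance counts are invented; it simply wires a streaming job like the one above into an on-demand job flow instead of a standing hadoop cluster.

    # A rough sketch: launch an Elastic MapReduce job flow over logs stored
    # in S3 using boto. All bucket names and paths below are hypothetical.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()  # picks up AWS credentials from the environment/boto config

    step = StreamingStep(
        name='Nightly hit counts',
        mapper='s3n://my-bucket/scripts/hit_count_mapper.py',
        reducer='s3n://my-bucket/scripts/hit_count_reducer.py',
        input='s3n://my-bucket/logs/2010-04-01/',
        output='s3n://my-bucket/output/hit-counts/2010-04-01/')

    jobflow_id = conn.run_jobflow(
        name='nightly-log-crunch',
        log_uri='s3n://my-bucket/emr-logs/',
        steps=[step],
        num_instances=4,
        master_instance_type='m1.small',
        slave_instance_type='m1.small')

    print 'Started job flow %s' % jobflow_id

The cluster spins up, runs the step against the S3 input, writes the results back to S3 and shuts down, so you only pay for the hours the job actually runs.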

Brave New World Of Oversharing

From the New York Times:

“Ten years ago, people were afraid to buy stuff online. Now they’re sharing everything they buy,” said Barry Borsboom, a student at Leiden University in the Netherlands, who this year created an intentionally provocative site called Please Rob Me. The site collected and published Foursquare updates that indicated when people were out socializing — and therefore away from their homes.

In this day and age of Too Much Information (TMI), the only real security, it would seem, would be of the “security through obscurity” variety. If everyone flooded the web with the minutiae of their day-to-day lives, chances are it would be tough to single out anyone in particular. That approach, however, puts early adopters at risk. No longer would they be just a face in the crowd. Comes with the territory, I guess.

That being said, websites making said TMI possible should probably realize there are still some boundaries best left uncrossed.

Recruiter LOL

[LinkedIn screenshot]

The picture says it all really. For the record, the full subject line from the recruiter was “Data Analytics Architect Opportunity – NOT SPAM.”

EC2 Reserved Instance Breakeven Point 2.0

After Amazon’s reserved instance pricing announcement last year, there were quite a few folks writing about the breakeven point for your ec2 instance, i.e. the length of time you’d need to run your instance continuously before the reserved pricing turned out to be cheaper than the standard pay-as-you-go scheme. Looking around, I believe the general consensus was that it would take around 4643 hours, or 6.3 months. See here, here and here, for example.

Around late October of last year, Amazon announced even cheaper pricing for their ec2 instances. However, not having seen any newer breakeven numbers computed in the wake of the lower prices, I decided to post some of my own. These are for one-year reserved pricing in Amazon’s US N. Virginia data center. All data is culled from the AWS ec2 page.
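The arithmetic itself is straightforward: the breakeven point is the upfront reservation fee divided by the hourly savings, i.e. the difference between the on-demand and reserved hourly rates. A quick sketch, plugging in the rough m1.small rates from the time (about $0.085/hr on demand versus $227.50 upfront plus $0.03/hr reserved; check the AWS page for the exact current figures):

    # Breakeven point for a one-year EC2 reserved instance: the number of hours
    # at which the reserved option becomes cheaper than pay-as-you-go.
    def breakeven_hours(on_demand_rate, upfront_fee, reserved_rate):
        # The reserved instance wins once the hourly savings pay back the upfront fee.
        return upfront_fee / (on_demand_rate - reserved_rate)

    # Rough m1.small rates: $0.085/hr on demand; $227.50 upfront + $0.03/hr reserved.
    hours = breakeven_hours(0.085, 227.50, 0.03)
    print '%.0f hours (~%.1f months)' % (hours, hours / 730)  # ~4136 hours, ~5.7 months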

As we can see, the breakeven numbers have dropped quite a bit, down to 4136 hours on most of the instance types, a drop of roughly 500 hours. That translates to better pricing about 3 weeks earlier than before, at around 5.7 months. Interestingly enough, the high-memory instances have slightly earlier breakeven points (by about 50 hours or so). Not quite sure why.