soam's home


Archive for Uncategorized

View From My Office

After a couple of years of working remotely, it still feels strange to have an office of my own, let alone one with a modicum of a view of the downtown SF skyline, so I am enjoying it while it lasts. We’re scheduled to move to a more central SOMA location sometime in the next month.


What You Know Vs What You Don’t

To this, I have to add one of my favorite quotes:

I am not young enough to know everything

– Oscar Wilde

Delve Analytics: A New Foundation

If you’re using Delve Networks read on – we just pulled the rug out from under your feet and I bet you didn’t even know it! If you’re considering using us, well, you’d definitely want to know about this too.

For the past two weeks, Delve has been running a totally new, revamped analytics system – one that allows us to scale well beyond our current levels and provide a foundation for many more features to come. A peek under the hood:

  • Our event collection system is now completely in the cloud, running on Amazon EC2 instances. This lets us scale up (hopefully always up :=) or down quickly depending on load. Correspondingly, we can now instrument our players to send more granular data, improving the accuracy of metrics such as playback times.
  • Our analytics processing is also now completely in the cloud. We use Amazon’s Elastic MapReduce (EMR), a service built atop Hadoop, to process our event data and generate reports. We are early adopters of this service and have engaged the Amazon EMR tech team to catch and resolve issues. One of the biggest benefits of using EMR is that we don’t need to maintain a dedicated Hadoop cluster. Instead, we simply select the number of machines to run for each job submission; again, this simplifies scaling as our data sets grow.
  • We have now moved away from our dependence on MySQL and instead use S3 as a report storage repository. This allows publishers access to an archive of all past computed reports. While our front end does not offer this feature yet, the basic structure is in place to allow us to do so in the near future.

Perhaps the biggest advantage of the overhaul is that we now have a strong foundation for further enhancements, reports and features, futuristic and otherwise. Several are already in the works and are slated to be released soon. Stay tuned!

Some Tips On Amazon’s EMR

Amazon’s Elastic Map Reduce is a fascinating new service aiming to further commodify Map Reduce based data operations. The model is best captured by Tom White’s description of it: the S3 source/sink approach, in which S3 serves as both the source of input data and the sink for output.


I am in the process of writing some heavy jobs for EMR at the moment and thought it would be useful to gather together some of the pointers posted in various bulletin boards thus far. Here goes.

Amazon Elastic MapReduce and Amazon S3 Bucket Names:

Amazon Elastic MapReduce uses the S3N Native File System for Hadoop. This file system uses the “hostname” method for accessing data in Amazon S3, which places restrictions on the bucket names used in Amazon Elastic MapReduce job flows. To conform with DNS requirements, bucket names should:
• Not contain underscores “_”
• Be between 3 and 63 characters long
• Not end with a dash
• Not contain dashes next to periods (e.g., “my.-bucket” is invalid)
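
The rules above are easy to check before submitting a job flow. Here is a minimal sketch in Python that covers only the four rules listed here; the function name `is_emr_safe_bucket_name` is my own invention rather than part of any AWS tool, and S3 itself may enforce further naming rules that this does not check:

```python
def is_emr_safe_bucket_name(name):
    """Check a bucket name against the four rules listed above (and nothing else)."""
    if "_" in name:
        return False  # must not contain underscores
    if not 3 <= len(name) <= 63:
        return False  # must be between 3 and 63 characters long
    if name.endswith("-"):
        return False  # must not end with a dash
    if "-." in name or ".-" in name:
        return False  # must not have a dash next to a period
    return True
```

Running this over the bucket names in a job definition catches the underscore case in particular, which is easy to miss because S3 will happily create such a bucket.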

Common Problems Running Job Flows:

Using s3:// instead of s3n://

If your files have simply been uploaded to Amazon S3 as ordinary Amazon S3 objects, then you must specify s3n:// for resources used by your job flow, such as the input directory or jar file. The reason is that in Hadoop s3n:// refers to the Amazon S3 Native File System, while s3:// refers to a block structured file system which expects the files to be in a very particular block structured format.

Note that when you specify a resource such as a jar, input or output in the Elastic Map Reduce Tab in the AWS Console, it will have s3n:// prepended to it as a convenience. Please also note that this prepending of s3n:// is *not* applied to jar arguments, streaming arguments, or parameters.

Path to s3n:// must have at least three slashes

You must have a terminating slash on the end of your s3n URL. It is not sufficient to supply a bucket, e.g. s3n://mybucket; rather, you must specify s3n://mybucket/, otherwise Hadoop will in most instances fail your job flow.
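
Both of these URL problems can be tidied up in one place before arguments are handed to a job flow. The following is a rough sketch, not anything EMR provides: `normalize_s3n_url` is a hypothetical helper, and rewriting s3:// to s3n:// is only appropriate under the assumption that the data was uploaded as plain S3 objects rather than written through Hadoop's block file system.

```python
def normalize_s3n_url(url):
    """Return url with the s3n:// scheme and at least three slashes.

    Assumes the data lives in S3 as ordinary objects, so s3:// is rewritten.
    """
    if url.startswith("s3://"):
        url = "s3n://" + url[len("s3://"):]
    elif "://" in url and not url.startswith("s3n://"):
        # Other schemes (for example http://) are rejected, not rewritten.
        raise ValueError("not an S3 URL: " + url)
    elif not url.startswith("s3n://"):
        # A bare "bucket/key" path gets the scheme prepended.
        url = "s3n://" + url.lstrip("/")
    if "/" not in url[len("s3n://"):]:
        url += "/"  # bucket only: add the terminating slash
    return url
```

For example, `normalize_s3n_url("s3n://mybucket")` returns `s3n://mybucket/`, while a URL that already has a key part is left unchanged.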

Hadoop Does not Recurse Input Directories

It would be nice if Hadoop were able to recursively search input directories for input files. It doesn’t. So if you have a directory structure like /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt and you specify /corpus/ as the input to your job then no files will be input to the job because Hadoop does not look through subdirectories, even when using Amazon S3.
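
One workaround is to name each directory that directly contains files, since Hadoop's input settings generally accept more than one path (as a comma-separated list, or as a repeated -input argument in streaming, as far as I know). A small sketch in Python, assuming you already have a listing of the file paths; `input_directories` is a hypothetical helper name:

```python
import posixpath

def input_directories(file_paths):
    """Return the sorted, de-duplicated directories that directly contain the files."""
    return sorted({posixpath.dirname(p) + "/" for p in file_paths})

paths = ["/corpus/01/01.txt", "/corpus/01/02.txt", "/corpus/02/01.txt"]
print(",".join(input_directories(paths)))  # prints /corpus/01/,/corpus/02/
```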

The Output Path Must Not Exist

If the output path you specified already exists then Hadoop in most instances will fail the job. This means that if you run a job flow once and then run it again with exactly the same parameters it could work the first time and then never again since after the first run the output path exists and causes all successive runs to fail.
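
A simple way around this is to give every run its own output path. The sketch below appends a UTC timestamp to a base path; the helper name `unique_output_path` and the timestamp format are my own choices, not anything Hadoop or EMR requires:

```python
from datetime import datetime, timezone

def unique_output_path(base, now=None):
    """Append a UTC timestamp directory to base so each run writes somewhere new."""
    now = now or datetime.now(timezone.utc)
    return base.rstrip("/") + "/" + now.strftime("%Y%m%dT%H%M%SZ") + "/"
```

Two runs started within the same second would still collide, so a job flow id or a counter may be a better suffix if you submit jobs in quick succession.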

Resources cannot be specified as http://

Hadoop does not recognize resources specified as http://, so you cannot specify a resource via an HTTP URL, e.g. specifying the Jar argument as http://mysite/myjar.jar will not work.

Using -cacheFile requires a ‘#’ separator.

When you specify a cacheFile as a streaming argument to Hadoop then you must specify a destination in the distributed cache to place this file. So for example

-cacheFile s3n://mybucket/my_program#my_program

This will copy the file from s3n://mybucket/my_program to my_program in the distributed cache to be made available to mappers and reducers.

The Elastic Map Reduce Web Service creates -cacheFile entries for resources passed to the -mapper and -reducer arguments if they refer to resources in Amazon S3.
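
When building streaming arguments by hand, the ‘#’ part is easy to forget. A minimal sketch of a helper that always adds it, defaulting the destination to the last component of the S3 path; `cache_file_args` is a hypothetical name, not part of Hadoop or the EMR tools:

```python
import posixpath

def cache_file_args(s3_uri, link_name=None):
    """Build the -cacheFile argument pair, always including the '#' destination."""
    name = link_name or posixpath.basename(s3_uri)
    if not name:
        raise ValueError("cannot derive a destination name from " + s3_uri)
    return ["-cacheFile", s3_uri + "#" + name]
```

Calling `cache_file_args("s3n://mybucket/my_program")` yields the same argument shown in the example above.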

Cannot SSH To Master

There are two common causes for not being able to SSH to the master. The first possibility is that the pem file containing your ssh key has the wrong permissions. If your pem file is called myfile.pem then you can fix it using chmod:

chmod og-rwx myfile.pem

The second possibility is that the name of the keypair you specified does not match your pem file. Check in the AWS Console which keypair was specified when the job flow was created.

The command to ssh to the master is

ssh -i mykey.pem

But of course specify your own pem file, followed by the login user and the public DNS name of the master node.
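
The permissions check from the first cause can also be scripted, which is handy if key files get copied between machines. A minimal sketch, assuming a POSIX system; `key_file_is_private` is a hypothetical helper and tests the same condition that chmod og-rwx establishes:

```python
import os
import stat

def key_file_is_private(path):
    """True if neither group nor others have any permission bits on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```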

Running DistCp Requires a Custom Jar

You cannot run DistCp by specifying a Jar residing on the AMI. Instead you can use the samples/distcp/distcp.jar in the elasticmapreduce S3 bucket. Remember to substitute your jobflow id in the following:

elastic-mapreduce --jobflow j-ABABABABABAB \
--jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
--arg s3n://elasticmapreduce/samples/wordcount/input \
--arg hdfs:///samples/wordcount/input

Where are the Logs?

To see why your job flow step failed, it is helpful to inspect the log files produced when the step ran. To see logs when running job flows from the AWS Console, specify a path to one of your buckets in Amazon S3 in the advanced options.

Note that your logs will not be uploaded into S3 until 5 minutes after your step has completed.

And for those using Cascading in conjunction with Hadoop, some notes:

…you want to use the local HDFS as your default in all your jobs, and only integrate with S3 to pull/push the data that needs to live longer than your cluster.

So just use Hfs and relative paths everywhere, except when that data is in S3 or must go to S3 (new Hfs( “s3n://…..” ))

And my recommendation is to use s3n:// not s3://, this way other apps can get at the data (s3cmd, http://, etc). The drawback is that you must consider that on input, you can only have one mapper for every file being read from S3 (in the first MR job in your Flow).

Yahoo Cupcake

From happier times at Yahoo, a snap of a Yahoo cupcake in a Yahoo Oktoberfest mug.


Many things are said about Yahoo’s external image problems – however, their internal branding was excellent.