soam's home


Archive for May, 2009

Some Tips On Amazon’s EMR

Amazon’s Elastic Map Reduce is a fascinating new service aiming to further commoditize Map Reduce based data operations. The model is best summarized by Tom White’s description of the S3 source/sink based approach.


I am in the process of writing some heavy jobs for EMR at the moment and thought it would be useful to gather together some of the pointers posted in various bulletin boards thus far. Here goes.

Amazon Elastic MapReduce and Amazon S3 Bucket Names:

Amazon Elastic MapReduce uses the S3N Native File System for Hadoop. This file system uses the “hostname” method for accessing data in Amazon S3, which places restrictions on the bucket names used in Amazon Elastic MapReduce job flows. To conform with DNS requirements, bucket names should:
• Not contain underscores (“_”)
• Be between 3 and 63 characters long
• Not end with a dash
• Not contain dashes next to periods (e.g., “my.-bucket” is invalid)
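The rules above translate directly into a quick validation check. A minimal sketch in Python (the function name is ours, and the real S3 naming specification has a few more corner cases, e.g. names that look like IP addresses):

```python
def valid_emr_bucket_name(name):
    """Check an S3 bucket name against the DNS rules listed above (sketch only)."""
    if not 3 <= len(name) <= 63:
        return False          # must be between 3 and 63 characters
    if "_" in name:
        return False          # must not contain underscores
    if name.endswith("-"):
        return False          # must not end with a dash
    if ".-" in name or "-." in name:
        return False          # no dashes adjacent to periods
    return True

print(valid_emr_bucket_name("mybucket"))    # True
print(valid_emr_bucket_name("my.-bucket"))  # False
```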

Common Problems Running Job Flows:

Using s3:// instead of s3n://

If your files were simply uploaded to Amazon S3 as ordinary S3 objects, then you must specify s3n:// for the resources used by your job flow, such as the input directory or jar file. The reason is that in Hadoop, s3n:// refers to the Amazon S3 Native File System, while s3:// refers to a block structured file system which expects its files in a very particular block structured format.

Note that when specifying resources in the Elastic Map Reduce tab of the AWS Console, if you specify a resource such as the jar, input, or output, it will have s3n:// prepended to it as a convenience. Please also note that this prepending of s3n:// is *not* applied to jar arguments, streaming arguments, or parameters.
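As an illustration of the distinction, here is a hypothetical helper that rewrites a plain-object URI from s3:// to s3n://. This is only safe when you know the data was uploaded as ordinary S3 objects rather than written by Hadoop’s block file system, so treat it as a sketch, not a general fix:

```python
def check_streaming_uri(uri):
    """Rewrite the common s3:// vs s3n:// mix-up described above.

    Plain S3 objects must be addressed with s3n://, because s3:// selects
    Hadoop's block file system, which expects a special block format.
    """
    if uri.startswith("s3://"):
        return "s3n://" + uri[len("s3://"):]   # point at the native file system
    return uri

print(check_streaming_uri("s3://mybucket/input/"))  # s3n://mybucket/input/
```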

Path to s3n:// must have at least three slashes

You must have a terminating slash on the end of your s3n URL. It is not sufficient to supply just a bucket, e.g. s3n://mybucket; rather, you must specify s3n://mybucket/, otherwise Hadoop will in most instances fail your job flow.
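A hypothetical normalizer for the bare-bucket case described above:

```python
def normalize_s3n_path(path):
    """Ensure a bare-bucket s3n URL carries the trailing slash Hadoop wants.

    Illustrative sketch only: s3n://mybucket -> s3n://mybucket/
    """
    prefix = "s3n://"
    if path.startswith(prefix) and "/" not in path[len(prefix):]:
        return path + "/"   # bare bucket: append the required slash
    return path

print(normalize_s3n_path("s3n://mybucket"))        # s3n://mybucket/
print(normalize_s3n_path("s3n://mybucket/input"))  # s3n://mybucket/input
```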

Hadoop Does not Recurse Input Directories

It would be nice if Hadoop were able to recursively search input directories for input files. It doesn’t. So if you have a directory structure like /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt and you specify /corpus/ as the input to your job, then no files will be input to the job, because Hadoop does not look through subdirectories, even when using Amazon S3.
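One workaround is to expand the corpus root into one -input argument per subdirectory that actually holds files, and pass them all to the streaming job. A sketch against a local directory tree (with S3 you would enumerate the bucket’s keys instead, e.g. with a listing tool):

```python
import os
import tempfile

def streaming_input_args(root):
    """Expand a corpus root into one -input argument per directory with files,
    working around Hadoop's lack of recursive input search."""
    args = []
    for dirpath, dirnames, filenames in os.walk(root):
        if filenames:                      # only directories that hold files
            args += ["-input", dirpath]
    return args

# Build a tiny /corpus/01, /corpus/02 layout and expand it:
root = tempfile.mkdtemp()
for sub in ("01", "02"):
    os.makedirs(os.path.join(root, sub))
    open(os.path.join(root, sub, "01.txt"), "w").close()
print(streaming_input_args(root))
```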

The Output Path Must Not Exist

If the output path you specified already exists, then Hadoop will in most instances fail the job. This means that if you run a job flow once and then run it again with exactly the same parameters, it may work the first time and then never again, since after the first run the output path exists and causes all successive runs to fail.
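A common dodge is to suffix the output path with something unique per run, such as a timestamp. Illustrative only; any unique suffix (run id, date) works equally well:

```python
import time

def unique_output_path(base):
    """Append a timestamp so repeated runs never reuse an existing output path."""
    return "%s-%d" % (base.rstrip("/"), int(time.time()))

print(unique_output_path("s3n://mybucket/output/"))
```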

Resources cannot be specified as http://

Hadoop does not recognize resources specified as http://, so you cannot specify a resource via an HTTP URL; e.g., specifying the Jar argument as http://mysite/myjar.jar will not work.

Using -cacheFile requires a ‘#’ separator.

When you specify a cacheFile as a streaming argument to Hadoop, you must specify a destination in the distributed cache where the file should be placed. For example:

-cacheFile s3n://mybucket/my_program#my_program

This will copy the file from s3n://mybucket/my_program to my_program in the distributed cache to be made available to mappers and reducers.

The Elastic Map Reduce Web Service creates -cacheFile entries for resources passed to the -mapper and -reducer arguments if they refer to resources in Amazon S3.
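The separator rule above can be sketched as a small parser (a hypothetical helper, not part of any EMR tooling):

```python
def parse_cache_file(spec):
    """Split a -cacheFile value into its source URL and '#' link name.

    Raises if the required '#' separator is missing.
    """
    if "#" not in spec:
        raise ValueError("-cacheFile values need a '#' destination: %r" % spec)
    source, link = spec.split("#", 1)
    return source, link

print(parse_cache_file("s3n://mybucket/my_program#my_program"))
```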

Cannot SSH To Master

There are two common causes for not being able to SSH to the master. The first possibility is that the pem file containing your ssh key might have the wrong permissions. If your pem file is called myfile.pem, you can fix the permissions using chmod:

chmod og-rwx myfile.pem

The second possibility is that the name of the keypair you specified does not match your pem file. Check in the AWS Console which keypair was specified when the job flow was created.

The command to ssh to the master is:

ssh -i mykey.pem hadoop@<master-public-dns-name>

But of course, substitute your own pem file and the public DNS name of your master node.

Running DistCp Requires a Custom Jar

You cannot run DistCp by specifying a jar residing on the AMI. Instead, you can use samples/distcp/distcp.jar in the elasticmapreduce S3 bucket. Remember to substitute your own job flow id in the following:

elastic-mapreduce --jobflow j-ABABABABABAB \
  --jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
  --arg s3n://elasticmapreduce/samples/wordcount/input \
  --arg hdfs:///samples/wordcount/input

Where are the Logs?

To see why your job flow step failed, it is helpful to inspect the log files produced when the step ran. To be able to see logs when running job flows from the AWS Console, specify a path to one of your Amazon S3 buckets in the advanced options.

Note that your logs will not be uploaded to S3 until 5 minutes after your step has completed.

And for those using Cascading in conjunction with Hadoop, some notes:

…you want to use the local HDFS as your default in all your jobs, and only integrate with S3 to pull/push the data that needs to live longer than your cluster.

So just use Hfs and relative paths everywhere, except when that data is in S3 or must go to S3 (new Hfs( “s3n://…..” ))

And my recommendation is to use s3n:// not s3://, so that other apps can get at the data (s3cmd, http://, etc). The drawback is that you must consider that on input, you can only have one mapper for every file being read from S3 (in the first MR job in your Flow).

Online Video And The Future Of Journalism

Interesting observation by Arianna Huffington in her statement to the US Senate Hearings on “The Future Of Journalism” held last week (May 6, 2009):

No, the future is to be found elsewhere. It is a linked economy. It is search engines. It is online advertising. It is citizen journalism and foundation-supported investigative funds. That’s where the future is. And if you can’t find your way to that, then you can’t find your way.

Online video offers a useful example of the importance of being able to adapt. Not that long ago, content providers were committed to the idea of requiring viewers to come to their site to view their content — and railed against anyone who dared show even a short clip.

But content hoarding — the walled garden — didn’t work. And instead of sticking their finger in the dike, trying to hold back the flow of innovation, smart companies began providing embeddable players that allowed their best stuff to be posted all over the web, accompanied by links and ads that helped generate additional traffic and revenue.

This dovetails into the type of work we’re trying to accomplish at Delve. It’s nice to think that in some small way, we might be a part of the solution and not the problem.


I’ve been quite a fan of Roku’s set top box since buying it for Netflix’s Watch Now service. The ease of installation and use, a small form factor, and bargain basement pricing clearly mark it out as a winner. The video quality is very good as well, although my experimentation has been limited to 480p, not HD. Though limited to two channels at the moment (Netflix and Amazon), Roku has plans to add more to its lineup later this year. Thus far, Roku has been mum on exactly what those channels might be. However, interviews with the top brass over there, particularly Tim Twerdahl, VP of Consumer Products, have revealed some tantalizing details about their plans, philosophy and how well they are doing.

From trenderresearch:

“We are running at full tilt, selling them almost as fast as we can make them,” he said. Tim did not disclose numbers, but says “we are absolutely making money on the hardware” and tells me unit sales are well into the “hundreds of thousands.” Surprisingly, Tim also downplayed how much money Roku makes on content sales either through revenue-sharing or referral fees, saying “We really make money on the box… occasionally there are bounties for bringing new (eyeballs) to the content owners.”


Tim tells me that the Roku player takes advantage of the new cloud computing model that allows the box to be a “thin client.” The Roku player only needs to stream and buffer about 4 minutes of video at a time, eliminating the need for a large hard drive. It does not need encryption inside the box, instead leveraging the DRM protections inherent in the connectors— similar to what a DVD player does.

From Contentinople:

Contentinople: Can you tell me more about the content providers you’re talking to — like MTV Networks or Comedy Central or Hulu?

Tim Twerdahl: I can’t obviously talk about any of those specific deals. But I’ll tell you philosophically where we’re going.

We look at content for our boxes as this matrix: You have various kinds of content — TV shows, movies, YouTube. One of the key cells that we don’t have today in that matrix is ad-supported TV. We will have ad-supported TV this year.

User-generated clips that are free or ad supported is something we’re focused on. Subscription-based music and free music are things we’re looking at.

Rounding out that content matrix is what we’re focused on now. This summer, we’ll be releasing an SDK [software development kit] publicly where anybody can create a channel, and we will have a marketplace for those channels so that our users can choose to add the channels they’d like to the box.

We strongly believe that we don’t have a monopoly on every great idea for what would make sense to put on the TV, and so this SDK will really let the community help define that for us.

Bottom line: though initially this box was marketed as a hardware-based Netflix player, there is much more potential here for programming, including music-only services. Worrying news for cable companies and other set top box manufacturers.