
Some Tips On Amazon’s EMR

Amazon’s Elastic MapReduce is a fascinating new service aiming to further commodify MapReduce-based data operations. The model is best summarized by Tom White’s diagram of the S3 source/sink approach:

[Diagram: Amazon S3 acting as source and sink for a Hadoop cluster running on EC2]

I am in the process of writing some heavy jobs for EMR at the moment and thought it would be useful to gather together some of the pointers posted on various bulletin boards thus far. Here goes.

Amazon Elastic MapReduce and Amazon S3 Bucket Names:

Amazon Elastic MapReduce uses the S3N Native File System for Hadoop. This file system uses the “hostname” method for accessing data in Amazon S3, which places restrictions on the bucket names used in Amazon Elastic MapReduce job flows. To conform with DNS requirements, bucket names should:
• not contain underscores “_”
• be between 3 and 63 characters long
• not end with a dash
• not contain dashes next to periods (e.g., “my-.bucket.com” and “my.-bucket” are invalid)
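
For illustration, a few hypothetical paths checked against these rules:

s3n://my-emr-data/input/ (valid: lowercase, hyphens only, 11-character bucket name)
s3n://my_emr_data/input/ (invalid: bucket name contains an underscore)
s3n://my-emr-data-/input/ (invalid: bucket name ends with a dash)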

Common Problems Running Job Flows:

Using s3:// instead of s3n://

If your files have simply been uploaded to Amazon S3 as ordinary Amazon S3 objects, then you must specify s3n:// for resources used by your job flow, such as the input directory or jar file. The reason is that in Hadoop, s3n:// refers to the Amazon S3 Native File System, while s3:// refers to a block-structured file system that expects its files in a very particular block format.

Note that when specifying resources in the Elastic MapReduce tab of the AWS Console, a resource such as a jar, input, or output will have s3n:// prepended to it as a convenience. Please also note that this prepending of s3n:// is *not* applied to jar arguments, streaming arguments, or parameters.
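
For example, a job flow launched from the elastic-mapreduce command-line client might reference its resources like this (a sketch; the bucket and jar names are hypothetical):

elastic-mapreduce --create --name "my job" \
--jar s3n://mybucket/myjar.jar \
--arg s3n://mybucket/input/ \
--arg s3n://mybucket/output/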

Path to s3n:// must have at least three slashes

You must have a terminating slash at the end of your s3n URL. It is not sufficient to supply just a bucket, e.g. s3n://mybucket; rather, you must specify s3n://mybucket/, otherwise Hadoop will in most instances fail your job flow.

Hadoop Does not Recurse Input Directories

It would be nice if Hadoop were able to recursively search input directories for input files. It doesn’t. So if you have a directory structure like /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt and you specify /corpus/ as the input to your job, then no files will be input to the job, because Hadoop does not look through subdirectories, even when using Amazon S3. A workaround is sketched below.
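
One workaround is to list each subdirectory explicitly; Hadoop streaming accepts multiple -input arguments (the paths here are hypothetical):

-input s3n://mybucket/corpus/01/ \
-input s3n://mybucket/corpus/02/

Hadoop’s input paths also accept glob patterns, so something like s3n://mybucket/corpus/*/ may work as well.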

The Output Path Must Not Exist

If the output path you specify already exists, then Hadoop will in most instances fail the job. This means that a job flow run with exactly the same parameters could work the first time and then never again, since after the first run the output path exists and causes all successive runs to fail.
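
One fix is to remove the output path before re-running. On the master node, something like the following should work (a sketch; the paths are hypothetical, and -rmr is the recursive delete in the Hadoop versions EMR runs):

hadoop fs -rmr hdfs:///my/output/
hadoop fs -rmr s3n://mybucket/my/output/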

Resources cannot be specified as http://

Hadoop does not recognize resources specified as http://, so you cannot supply a resource via an HTTP URL; e.g., specifying the jar argument as http://mysite/myjar.jar will not work.
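
The workaround is to copy the resource into one of your S3 buckets and reference it via s3n://. For example, with s3cmd (the names here are hypothetical):

wget http://mysite/myjar.jar
s3cmd put myjar.jar s3://mybucket/myjar.jar

and then point the job flow at s3n://mybucket/myjar.jar.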

Using -cacheFile requires a ‘#’ separator

When you specify a cacheFile as a streaming argument to Hadoop, you must specify a destination in the distributed cache where the file will be placed. For example:

-cacheFile s3n://mybucket/my_program#my_program

This copies the file from s3n://mybucket/my_program to my_program in the distributed cache, where it is made available to mappers and reducers.

The Elastic MapReduce web service creates -cacheFile entries for resources passed to the -mapper and -reducer arguments if they refer to resources in Amazon S3.
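
Putting this together, a hand-run streaming invocation on the master might look like the following (a sketch: the streaming jar location is assumed from the stock EMR AMI layout, and the bucket and program names are hypothetical):

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
-input s3n://mybucket/input/ \
-output hdfs:///wordcount/output/ \
-mapper my_program \
-reducer aggregate \
-cacheFile s3n://mybucket/my_program#my_program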

Cannot SSH To Master

There are two likely causes for not being able to SSH to the master. The first possibility is that the pem file containing your ssh key has the wrong permissions. If your pem file is called myfile.pem, then you can fix it using chmod:

chmod og-rwx myfile.pem

The second possibility is that the name of the keypair you specified does not match your pem file. Check in the AWS Console (http://console.aws.amazon.com) to see which keypair was specified when the job flow was created.

The command to ssh to the master is

ssh -i mykey.pem hadoop@ec2-01-001-001-1.compute-1.amazonaws.com

But of course, substitute your own pem file and the public DNS name of the master node.

Running DistCp Requires a Custom Jar

You cannot run DistCp by specifying a jar residing on the AMI. Instead, use the samples/distcp/distcp.jar in the elasticmapreduce S3 bucket. Remember to substitute your own job flow id in the following:

elastic-mapreduce --jobflow j-ABABABABABAB \
--jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
--arg s3n://elasticmapreduce/samples/wordcount/input \
--arg hdfs:///samples/wordcount/input
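
The same pattern works in reverse for pushing results from HDFS back out to S3 (a sketch; the output paths here are hypothetical):

elastic-mapreduce --jobflow j-ABABABABABAB \
--jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
--arg hdfs:///samples/wordcount/output \
--arg s3n://mybucket/wordcount/output/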

Where are the Logs?

To see why a job flow step failed, it is helpful to inspect the log files produced when the step ran. To see logs when running job flows from the AWS Console, specify a path to one of your Amazon S3 buckets in the advanced options.

Note that your logs will not be uploaded to S3 until 5 minutes after your step has completed.
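
From the command-line client, the equivalent is to pass a log path at job flow creation; the elastic-mapreduce client takes a --log-uri option (the bucket name here is hypothetical):

elastic-mapreduce --create --log-uri s3n://mybucket/logs/ ...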

And for those using Cascading in conjunction with Hadoop, some notes:

…you want to use the local HDFS as your default in all your jobs, and only integrate with S3 to pull/push the data that needs to live longer than your cluster.

So just use Hfs and relative paths everywhere, except when that data is in S3 or must go to S3 (new Hfs( “s3n://…..” ))

And my recommendation is to use s3n://, not s3://; this way other apps can get at the data (s3cmd, http://, etc). The drawback is that you must consider that on input, you can only have one mapper for every file being read from S3 (in the first MR job in your Flow).

Rod said,

September 18, 2010 @ 12:53 am

Just spent the last 30 minutes trying to connect to the hadoop master using the wrong username (root@). Thanks for the tip!

Also, instead of using an explicit ssh command as above, the following also works: elastic-mapreduce -j $JOB --ssh

Dolan Antenucci said,

June 24, 2012 @ 11:37 am

Note: on EMR, s3:// and s3n:// both map to the native S3 file system according to http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html
