soam's home


Archive for March, 2011

Some Tips on Amazon’s EMR II

Following up from Some Tips on Amazon’s EMR

Recently, I was asked for some pointers on a log processing application to be implemented with Amazon’s Elastic MapReduce (EMR). Questions included how big the log files should be, what format works best, and so on. It was a good opportunity to summarize some of my own experiences. Obviously, there’s no hard and fast rule on the optimal input file size for EMR per se; much depends on the computation being performed and the instance types chosen. That being said, here are some considerations:

  • it’s a good idea to batch log files as much as possible. Hadoop/EMR will generate a map task for every individual file in S3 that serves as input. So, the fewer the files, the fewer maps in the initial step and the faster your computation.
  • another reason to batch files: Hadoop doesn’t deal with many small files very well. See http://www.cloudera.com/blog/2009/02/the-small-files-problem/ It’s possible to get better performance with a smaller number of large files.
  • however, it’s not a good idea to batch files too much. If ten 100 MB files are blended into a single 1 GB file, you can no longer take advantage of the parallelism of MapReduce: one mapper will be working through that single file while the rest sit idle, waiting for the step to finish.
  • gzip vs. other compression schemes for input splits: Hadoop cannot peer inside gzipped files, so it can’t produce splits from that kind of input. It does better with LZO. See https://forums.aws.amazon.com/message.jspa?messageID=184519
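To make the batching tradeoff concrete, here’s a minimal sketch of the planning step before upload: greedily group small log files into batches capped at a target size, so Hadoop sees a modest number of medium-sized inputs rather than thousands of tiny ones or one giant blob. This is plain Python, not anything EMR-specific; the file names, sizes, and the 128 MB target are illustrative assumptions.

```python
def plan_batches(files, target_bytes):
    """Greedily pack (name, size) pairs into batches whose combined
    size stays at or under target_bytes. Returns a list of batches,
    each a list of file names."""
    batches, current, current_size = [], [], 0
    # Largest files first, so oversized stragglers don't end up alone at the end.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    # Hypothetical input: 100 log files of 10 MB each.
    logs = [("log-%03d.gz" % i, 10 * 1024 * 1024) for i in range(100)]
    plan = plan_batches(logs, 128 * 1024 * 1024)  # cap batches at ~128 MB
    print(len(plan))
```

With these numbers you end up with a handful of ~120 MB batches instead of 100 tiny files, which keeps the number of map tasks small while still leaving several inputs to process in parallel.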
In short, experimenting with log file sizes and instance types is essential for good performance down the road.