Writing single Hadoop MapReduce output into multiple S3 objects

I am implementing a Hadoop MapReduce job that needs to create output in multiple S3 objects. Hadoop itself creates only a single output file (an S3 object), but I need to partition the output into multiple files.

How do I achieve this?

Answers

I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.

In general, you want Hadoop to handle your input and output as much as possible, which keeps your mappers and reducers cleaner. And, of course, you want to write to S3 only at the very end of your pipeline, so that Hadoop's code-moving can do its job over HDFS.

In any case, I recommend doing all of your data partitioning and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. It paid off for me because I ended up with a minimal Hadoop S3 toolkit, which I used for several task flows.

I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
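The approach in this answer (partition the data, then write each complete output set as one S3 object from the final reduce step) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the toolkit the answer refers to: `ObjectStore`, `InMemoryStore`, `PartitionedS3Writer`, and the tab-based partitioning rule are all hypothetical names invented for the example, and the in-memory store stands in for a real S3 client so the code runs without AWS credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an S3 toolkit: one call uploads one whole object.
interface ObjectStore {
    void putObject(String key, byte[] body);
}

// In-memory implementation (an assumption for this sketch) so it runs without S3.
class InMemoryStore implements ObjectStore {
    final Map<String, byte[]> objects = new LinkedHashMap<>();

    @Override
    public void putObject(String key, byte[] body) {
        objects.put(key, body);
    }
}

class PartitionedS3Writer {
    // Hypothetical partitioning rule: the text before the first tab names the partition.
    static String partitionOf(String record) {
        int tab = record.indexOf('\t');
        return tab < 0 ? "default" : record.substring(0, tab);
    }

    // Groups all records by partition, then writes each group as a single object.
    // Returns the object keys that were written, sorted by partition name.
    static List<String> writePartitions(Iterable<String> records, String prefix, ObjectStore store) {
        Map<String, StringBuilder> sets = new TreeMap<>();
        for (String record : records) {
            sets.computeIfAbsent(partitionOf(record), k -> new StringBuilder())
                .append(record)
                .append('\n');
        }
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, StringBuilder> entry : sets.entrySet()) {
            String key = prefix + "/" + entry.getKey() + ".txt";
            store.putObject(key, entry.getValue().toString().getBytes(StandardCharsets.UTF_8));
            keys.add(key);
        }
        return keys;
    }
}
```

In a real job, the records would come from the values passed to the reducer, and `ObjectStore` would wrap whatever S3 client is in use. Note that this sketch buffers each output set in memory before uploading, which is only reasonable when a single set fits comfortably in the reducer's heap.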

Do you know about MultipleOutputFormat? It's not related to S3, but in general it lets you write output to multiple files according to logic that you implement.
