Merging multiple files into one within Hadoop

I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?

Thanks!

Answers

To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op); compression can be added via MapReduce flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression, add:
-Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
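
Note that -getmerge concatenates into a file on the local filesystem, so if the merged result needs to end up back in HDFS it has to be re-uploaded. A minimal sketch, with placeholder paths:

# getmerge writes the concatenated part files to a local file...
hadoop fs -getmerge <dir_of_input_files> /tmp/merged_single_file
# ...which can then be pushed back into HDFS if the result must live there
hadoop fs -put /tmp/merged_single_file <hdfs_merged_file>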

Okay... I figured out a way using hadoop fs commands:

hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]

It worked when I tested it...any pitfalls one can think of?

Thanks!
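
One pitfall worth noting: if the input directory was produced by a MapReduce job, the bare [dir]/* glob may also pick up bookkeeping entries such as an empty _SUCCESS marker or a _logs directory (and cat on a directory prints an error). Restricting the glob to the data files avoids that; a sketch, assuming the usual part-* naming:

# Stream only the actual part files, skipping _SUCCESS/_logs style entries
hadoop fs -cat [dir]/part-* | hadoop fs -put - [destination file]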

If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.

For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt

Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
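
The mount itself is usually done with the fuse-dfs module that ships with Hadoop (packaged as hadoop-fuse-dfs in some distributions); the binary name, NameNode address, and mount point below are assumptions that will differ per cluster:

# Mount HDFS at /mnt/hdfs via fuse-dfs (binary/package name varies by distribution)
mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs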

You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.

If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the hadoop-streaming-2.7.1.2.3.2.0-2950.jar jar, which runs a single reducer and writes the merged file to the HDFS output location.

$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/hdfs/input/dir" \
    -output "/hdfs/output/dir" \
    -mapper cat \
    -reducer cat

You can also download this jar ("Get hadoop streaming jar") if it is not already present on the cluster.
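
If Hadoop is already installed on the cluster, the streaming jar normally ships with it; a quick way to locate it (the search paths are just common install locations and may differ on your distribution):

# Search typical install locations for the streaming jar (paths vary by distribution)
find $HADOOP_HOME/share/hadoop/tools/lib /usr/hdp /usr/lib/hadoop-mapreduce \
    -name "hadoop-streaming*.jar" 2>/dev/null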

If you are writing Spark jobs and want to get a merged file to avoid multiple RDD creations and performance bottlenecks, use this piece of code before transforming your RDD:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

This will merge all the part files into one and save it back to an HDFS location.
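
Keep in mind that saveAsTextFile still writes a directory; with coalesce(1) it should contain a single part file plus a _SUCCESS marker. A quick check, reusing the placeholder path from above:

# The coalesced output directory should contain exactly one part file
hadoop fs -ls hdfs://...../filename
hadoop fs -cat hdfs://...../filename/part-00000 | head

Also note that coalesce(1) forces all of the data through a single task, so this is only practical when the merged output is small enough for one executor to handle.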

Addressing this from an Apache Pig perspective:

To merge two files with an identical schema via Pig, the UNION operator can be used:

A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
C = UNION A, B;
store C into 'tmp/fileoutput' Using PigStorage('\t');
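
Note that the store writes tmp/fileoutput as a directory, and depending on the job's parallelism it may still contain several part files; if a single physical file is required, the output can be flattened with one of the commands above, for example (the merged file name is just an example):

# Flatten the stored Pig output into a single HDFS file, staying on the grid
hadoop fs -cat tmp/fileoutput/part-* | hadoop fs -put - tmp/fileoutput_merged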

All the solutions are equivalent to doing:

hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file [destination file]

It only means that the local machine's I/O is on the critical path of data transfer.




