I get multiple small files in my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?
Thanks!
In order to keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op), and add compression via the MR flags.
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat
If you want compression, add:
-Dmapred.output.compress=true
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
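As a quick sanity check (a sketch: part-00000 is just the conventional single-reducer output name, and the file only gets a .gz suffix if the compression flags above were used), the merged result can be listed and read back with:

hadoop fs -ls "$OUTPUT"
hadoop fs -text "$OUTPUT"/part-00000*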
hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
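Note that -getmerge writes the concatenated result to the local filesystem, which the question wants to avoid. One hedged workaround (paths here are placeholders) is to merge locally and then push the single file back into HDFS:

hadoop fs -getmerge <dir_of_input_files> /tmp/mergedsinglefile
hadoop fs -put /tmp/mergedsinglefile <hdfs_destination>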
Okay... I figured out a way using hadoop fs commands:
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it...any pitfalls one can think of?
Thanks!
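One small variation worth noting (a sketch; the .gz destination name is only illustrative): the same pipe can compress the merged file on the way back in, still without writing to local disk:

hadoop fs -cat [dir]/* | gzip | hadoop fs -put - [destination file].gz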
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted locally at /mnt/hdfs. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
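For reference, a hedged sketch of the mount step itself (the binary name, namenode host/port, and mount point are assumptions and vary by distribution; CDH ships it as hadoop-fuse-dfs):

hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs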
You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.
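For completeness, a hedged sketch of invoking it from the command line; the class name and argument order (target file first, then the source files, mirroring the underlying FileSystem.concat API) are assumptions, so check the tooling shipped with your HDFS version:

hadoop org.apache.hadoop.hdfs.tools.HDFSConcat /path/to/target_file /path/to/file1 /path/to/file2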
If you are working on a Hortonworks cluster and want to merge multiple files in an HDFS location into a single file, you can run the hadoop-streaming-2.7.1.2.3.2.0-2950.jar jar with a single reducer, which writes the merged file to the HDFS output location.
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/hdfs/input/dir" \
    -output "/hdfs/output/dir" \
    -mapper cat \
    -reducer cat
You can download this jar from the "Get hadoop streaming jar" page.
If you are writing Spark jobs and want a single merged output file, to avoid multiple RDD partitions and performance bottlenecks downstream, use this piece of code when saving your RDD:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
This will merge all the part files into one and save it back to an HDFS location.
Addressing this from the Apache Pig perspective: to merge two files with an identical schema, the UNION command can be used:
A = LOAD 'tmp/file1' USING PigStorage() AS (schema1);
B = LOAD 'tmp/file2' USING PigStorage() AS (schema1);
C = UNION A, B;
STORE C INTO 'tmp/fileoutput' USING PigStorage();
All these solutions are ultimately equivalent to doing

hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file [destination file]

It only means that the local machine's I/O is on the critical path of data transfer.