Hadoop's input splitting - how does it work?
  • Date: 2012-05-23 11:41:41
  • Tags: hadoop

I know a little about Hadoop.

I would like to know how it works.

To be precise, I want to know how exactly it divides/splits the input file.

Does it split the file into equal-sized chunks?

Or is it something configurable?

I read through this, but I couldn't understand it.

Best answer

It depends on the InputFormat; for most file-based formats this is defined in the FileInputFormat base class.

There are a number of configurable options that determine whether Hadoop will process a single file as one split or divide it into multiple splits:

  • If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream). BZip2 is splittable. See the specific isSplitable() implementation for your input format for more information.
  • If the file size is less than or equal to its defined HDFS block size, then hadoop will most probably process it in a single split (this can be configured, see a later point about split size properties)
  • If the file size is greater than its defined HDFS block size, then hadoop will most probably divide up the file into splits based upon the underlying blocks (4 blocks would result in 4 splits)
  • You can configure two properties, mapred.min.split.size and mapred.max.split.size, which help the input format when breaking blocks up into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size); a configuration sketch follows this list.
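
As an illustration only (not part of the original answer), here is a minimal driver sketch showing how the two split-size properties above might be set. It assumes the old mapred.* property names mentioned in the answer, and the job name and the input/output paths taken from args are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Ask for splits of at least 64 MB and at most 256 MB.
            // These are the old property names used in the answer; newer
            // releases also accept mapreduce.input.fileinputformat.split.minsize
            // and .maxsize.
            conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "split-size-example");
            job.setJarByClass(SplitSizeExample.class);

            // Placeholder paths: args[0] = input dir, args[1] = output dir.
            // With no mapper/reducer set this runs as an identity job, which
            // is enough to observe how many splits (map tasks) get created.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The number of map tasks reported for such a job reflects how FileInputFormat turned the input files into splits under those settings.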

If you want to know more and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (the new and old API both have this method, though there may be some subtle differences).
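
For orientation when reading that code (a hedged sketch, not the verbatim source): the per-file split size that getSplits() works with is essentially the HDFS block size clamped between the configured minimum and maximum, mirroring the computeSplitSize() helper in FileInputFormat:

    // Sketch of how FileInputFormat derives a split size from the block size
    // and the configured min/max split sizes; treat it as an illustration of
    // the logic rather than the exact shipped code.
    public final class SplitSizeMath {

        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            // The split size can never exceed maxSize and never drop below
            // minSize; with the defaults it simply equals the block size.
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // e.g. a 128 MB HDFS block
            long minSize = 1L;                   // default minimum split size
            long maxSize = Long.MAX_VALUE;       // default maximum split size

            // With the defaults the split size equals the block size, which is
            // why a file spanning 4 blocks ends up as 4 splits.
            System.out.println(computeSplitSize(blockSize, minSize, maxSize));
        }
    }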
