I know about Hadoop and I want to understand how it works. Specifically, I want to know how it splits/partitions its input files. Does it split the input into equal-sized chunks, or is the chunk size configurable? I have read this.
This depends on the input format, which for most file-based formats is defined in the FileInputFormat base class.

There are a number of configurable options that determine whether Hadoop will process a single file as a single split or divide it into multiple splits:

- See the InputFormat.isSplittable() implementation for your input format for more information.
- mapred.min.split.size and mapred.max.split.size help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).

If you want to know more and are comfortable browsing the source, check out the getSplits() method in FileInputFormat (the new and old APIs have the same method, but there may be some subtle differences).
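For instance, here is a minimal sketch of how an input format can refuse to split files at all (the class name WholeFileTextInputFormat is hypothetical, and note that Hadoop's new mapreduce API actually spells the method isSplitable, with a single "t"):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical input format that forces one split per file, however
    // many HDFS blocks the file occupies.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Returning false tells getSplits() never to break this file up,
            // so each input file is handled by exactly one map task.
            return false;
        }
    }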
When you submit a map-reduce job (or a Pig/Hive job), Hadoop first calculates the input splits; each input split's size is generally equal to the HDFS block size. For example, for a file of 1GB, if the block size is 64MB, there will be 16 input splits. However, the split size can be configured to be smaller or larger than the HDFS block size. The computation of input splits is done by FileInputFormat. A map task has to be started for each of these input splits.
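To make the arithmetic concrete, here is a toy (non-Hadoop) calculation of the split count in that example:

    // Toy calculation for the example above: a 1GB file on 64MB blocks.
    public class SplitCount {
        public static void main(String[] args) {
            long fileSize  = 1024L * 1024 * 1024; // 1GB
            long blockSize = 64L * 1024 * 1024;   // 64MB HDFS block size
            long splits = (fileSize + blockSize - 1) / blockSize; // ceiling division
            System.out.println(splits + " input splits");         // prints "16 input splits"
        }
    }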
However, you can change the input split size by configuring the following properties (a short configuration sketch follows this list):
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
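As a sketch of how these might be set programmatically (old mapred API, matching the property names above; the paths and job name are placeholders, not from the original answer):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeConfigDemo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SplitSizeConfigDemo.class);
            conf.setJobName("split-size-demo"); // placeholder job name

            // Hints for the input format, in bytes: keep every split
            // between 32MB and 128MB regardless of the HDFS block size.
            conf.setLong("mapred.min.split.size", 32L * 1024 * 1024);
            conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);

            FileInputFormat.setInputPaths(conf, new Path("/input"));   // placeholder
            FileOutputFormat.setOutputPath(conf, new Path("/output")); // placeholder

            // Mapper/reducer setup omitted; Hadoop's identity defaults apply.
            JobClient.runJob(conf);
        }
    }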
The formula for the input split size is:
Math.max("mapred.min.split.size", Math.min("mapred.max.split.size", blockSize));
You can check out further examples here.