Version inside file-format

Question

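I'd like to know how to version data in Hadoop/HDFS/HBase. Versioning should be part of your data model, because changes are very likely (big data is collected over a long time).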

The main example is HDFS (a file-based backend).

sample-log-file.log :

timestamp x1 y1 z1 ...
timestamp x2 y2 z2 ...

I am now wondering how to add version information to it.

Version inside file-format

log-file.log :


timestamp V1 x1 y1 z1 ...
timestamp V2 w1 x2 y2 z1 ...
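A minimal sketch of how one reader could handle such a mixed file, assuming whitespace-separated fields and a version tag in the second column. The field names in `FIELDS_BY_VERSION` are hypothetical and only mirror the sample lines above.

```python
# Hypothetical field layouts per version, mirroring the sample lines.
FIELDS_BY_VERSION = {
    "V1": ["x", "y", "z"],
    "V2": ["w", "x", "y", "z"],
}


def parse_line(line):
    """Split a log line into timestamp, version tag, and named fields."""
    timestamp, version, *values = line.split()
    names = FIELDS_BY_VERSION.get(version)
    if names is None:
        raise ValueError(f"unknown record version: {version}")
    if len(values) < len(names):
        raise ValueError(f"too few fields for {version}: {line!r}")
    record = {"timestamp": timestamp, "version": version}
    # Pair each expected field name with its value; extra values are ignored.
    record.update(zip(names, values))
    return record
```

Because the version travels with every line, a single job can dispatch per record, at the cost of repeating the tag on each line.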

Version inside file-name

*log-file_V1.log*


timestamp x1 y1 z1 ...

*log-file_V2.log*

timestamp w1 x1 y1 z1 ...

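The second option (version in the file name) feels a bit cleaner to me and fits HDFS well (I can simply use *V2* as a pattern to exclude files in the old version style). On the other hand, I would then need to run two different jobs, because the differently versioned files can no longer be analyzed in a single job.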
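To illustrate the file-name pattern idea, here is a small local sketch using Python's `fnmatch` module. The helper `select_version` and the `_<version>.log` naming are assumptions taken from the sample file names, and Hadoop's own path globbing follows its own rules, so this only shows the selection logic.

```python
from fnmatch import fnmatchcase


def select_version(paths, version):
    """Return only the paths whose name ends in _<version>.log (case-sensitive)."""
    pattern = f"*_{version}.log"
    return [p for p in paths if fnmatchcase(p, pattern)]
```

For example, `select_version(["log-file_V1.log", "log-file_V2.log"], "V2")` keeps only the second file.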

As for HBase, I guess the version would end up as another column in a table (HDFS is an implementation detail there, used as the backend for HBase)?

Are there any other ways to version data for the Hadoop/HDFS/HBase backends?

Thanks!

My question is about how to handle the version information itself, not the timestamp.

Answer 1

In my view, efficient data versioning requires storing records of the same version in some proximity. Then you can have application logic select the right version for your need. This is similar to what some relational databases do.
This approach might be used by CouchDB, although I am not 100% sure about it.
Now let's look at HDFS and HBase. They are quite different from this perspective, since HBase allows data to be mutated and edited, while HDFS does not.
So for HBase you can put the timestamp as the last part of the key, and all versions of a record will be stored together.
HDFS is suited to storing a small number of big files, and we cannot edit them. I would suggest writing all versions to the files in the order they arrive and using MapReduce to group all versions of a record, with their different timestamps, together in the reducer. It will not be efficient, since all the data has to pass through the shuffle, but you will be in control. To mitigate this, we can run the resolution periodically and store the data with most records in a single version.
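The grouping step can be sketched in plain Python, with a dictionary standing in for the shuffle phase. This is only an illustration and not a real MapReduce job. The line layout `key timestamp payload` and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_record(line):
    """Map step: emit (record_key, (timestamp, payload))."""
    key, timestamp, payload = line.split(maxsplit=2)
    return key, (int(timestamp), payload)


def reduce_latest(key, versions):
    """Reduce step: keep the version with the largest timestamp."""
    timestamp, payload = max(versions, key=lambda v: v[0])
    return key, timestamp, payload


def compact(lines):
    """Group all versions of each record, then resolve each group to one version."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        key, value = map_record(line)
        grouped[key].append(value)
    return [reduce_latest(k, grouped[k]) for k in sorted(grouped)]
```

Running `compact` periodically and writing its output back would correspond to the periodic resolution described above.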

Answer 2

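For HDFS, storing the timestamp in the file uses more space (the timestamp is repeated on every line), but it lets you keep multiple dates in a single file. Which is best depends entirely on your use case.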

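For HBase, you have a couple of options: you can explicitly include a timestamp (and/or a version number) in the row key, so that different versions of a data item become different rows in the table; or you can use HBase's built-in time dimension, which effectively attaches a timestamp to every cell in the database (i.e. every value of every column of every row) and lets you keep a configurable number of versions. By default, a scan returns only the latest version of each key/value, but you can change that behavior at scan time to return multiple versions, or only versions within a given time range.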
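For the first option, a rough sketch of a composite row key follows. It relies on HBase ordering row keys lexicographically by bytes. The reversed timestamp (newest version first), the zero-byte separator, and the helper names are assumptions for illustration, not something this answer prescribes. The sketch also assumes entity ids never contain a zero byte.

```python
import struct

MAX_TS = 2**63 - 1  # largest signed 64-bit value


def make_row_key(entity_id, timestamp_ms):
    """Entity id, a zero-byte separator, then the reversed timestamp as 8 big-endian bytes."""
    reversed_ts = MAX_TS - timestamp_ms
    return entity_id.encode("utf-8") + b"\x00" + struct.pack(">Q", reversed_ts)


def timestamp_of(row_key):
    """Recover the original timestamp from the last 8 bytes of a row key."""
    return MAX_TS - struct.unpack(">Q", row_key[-8:])[0]
```

Sorting such keys as byte strings keeps all versions of one entity adjacent, with the most recent first, so a prefix scan on the entity id would return its versions in that order.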
