Our logs are stored in S3, and one of our (Pig) queries needs to grab three different types of records. Each record type lives in sets of subdirectories organized by type/date. For example:
/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*
My query would want to load all three log types for some window of time. For example:
type1 = load 's3://logs/type1/2011/03/08' as ...
type2 = load 's3://logs/type2/2011/03/08' as ...
type3 = load 's3://logs/type3/2011/03/08' as ...
result = join type1 by ..., type2 by ..., etc.
My queries would then run against all of these logs.
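To make that concrete, here is a rough sketch of the kind of script I have in mind; the schemas and the join key (id) are placeholders, not our real layout:

-- placeholder schemas and join key; the real records are more involved
type1 = load 's3://logs/type1/2011/03/08/*' as (id:chararray, a:chararray);
type2 = load 's3://logs/type2/2011/03/08/*' as (id:chararray, b:chararray);
type3 = load 's3://logs/type3/2011/03/08/*' as (id:chararray, c:chararray);
result = join type1 by id, type2 by id, type3 by id;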
What is the most efficient way to handle this?
- Do we need to use bash script expansion? I'm not sure that works across multiple directories, and I doubt it would be efficient (or even possible) if there were 10k logs to load. (See the sketch after this list.)
- Do we create a service to aggregate all of the logs and push them to HDFS directly?
- Custom Java/Python importers?
- Other thoughts?
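For instance, if Pig's load paths accept Hadoop-style globs (as far as I can tell they do), a single load could cover many directories without any bash expansion; the paths below are illustrative:

-- * matches every hour directory under the day
type1 = load 's3://logs/type1/2011/03/08/*' using TextLoader() as (line:chararray);
-- {...} braces glob several days in one statement
wider = load 's3://logs/type1/2011/03/{06,07,08}/*' using TextLoader() as (line:chararray);

Is that the idiomatic route here, or is one of the options above better at this scale?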
It would be helpful if you could leave some sample code where appropriate.
Addendum