Our logs are stored in S3, and one of our (Pig) queries needs to grab three different types of records. Each record type lives in sets of subdirectories organized by type/date. For example:
/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*
My query would want to load all three log types for some window of time. For example:
type1 = load 's3://logs/type1/2011/03/08' as ...
type2 = load 's3://logs/type2/2011/03/08' as ...
type3 = load 's3://logs/type3/2011/03/08' as ...
result = join type1 by ..., type2 by ..., etc.
My queries would then run against all of these logs.
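To make that concrete, here is a rough sketch of the kind of script I have in mind; the schemas and the join key (id) are placeholders, not our real layout:

-- placeholder schemas and join key; the real records are more involved
type1 = load 's3://logs/type1/2011/03/08/*' as (id:chararray, a:chararray);
type2 = load 's3://logs/type2/2011/03/08/*' as (id:chararray, b:chararray);
type3 = load 's3://logs/type3/2011/03/08/*' as (id:chararray, c:chararray);
result = join type1 by id, type2 by id, type3 by id;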
What is the most efficient way to handle this?
- Do we need to use bash script expansion? I'm not sure that works across multiple directories, and I doubt it would be efficient (or even possible) if there were 10k logs to load. (See the sketch after this list.)
- Do we create a service to aggregate all of the logs and push them to HDFS directly?
- Custom Java/Python importers?
- Other thoughts?
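For instance, if Pig's load paths accept Hadoop-style globs (as far as I can tell they do), a single load could cover many directories without any bash expansion; the paths below are illustrative:

-- * matches every hour directory under the day
type1 = load 's3://logs/type1/2011/03/08/*' using TextLoader() as (line:chararray);
-- {...} braces glob several days in one statement
wider = load 's3://logs/type1/2011/03/{06,07,08}/*' using TextLoader() as (line:chararray);

Is that the idiomatic route here, or is one of the options above better at this scale?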
It would be helpful if you could leave some sample code where appropriate.
Addendum