Importing multi-level directories of logs in hadoop/pig

We store our logs in S3, and one of our (Pig) queries greps three different log types. Each log type is in sub-directories based upon type/date. For example:

/logs/<type>/<year>/<month>/<day>/<hour>/lots_of_logs_for_this_hour_and_type.log*

What I would like to do is load all three types of logs for a given span of time. For example:

type1 = load 's3:/logs/type1/2011/03/08' as ...
type2 = load 's3:/logs/type2/2011/03/08' as ...
type3 = load 's3:/logs/type3/2011/03/08' as ...
result = join type1 ..., type2, etc...

My queries would then run against all of these logs.

What is the most efficient way to handle this?

  1. Do we need to use bash script expansion? I'm not sure whether this works with multiple directories, and I doubt it would be efficient (or even possible) if there were 10k logs to load (see the sketch after this list).
  2. Do we create a service to aggregate all of the logs and push them to hdfs directly?
  3. Custom java/python importers?
  4. Other thoughts?
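To make option 1 concrete: Pig's load statement accepts a comma-separated list of paths, so a wrapper script would only need to assemble that list for the desired span and feed it to a single load. A minimal sketch, assuming a hypothetical (ts, msg) schema standing in for the real log fields:

-- one load over a script-built, comma-separated path list;
-- the (ts, msg) schema is invented for illustration
logs = load 's3:/logs/type1/2011/03/08,s3:/logs/type2/2011/03/08,s3:/logs/type3/2011/03/08' as (ts:chararray, msg:chararray);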

Examples, where appropriate, would be helpful.


Best answer

Globbing is supported by PigStorage, so you could just try:

type1 = load 's3:/logs/type{1,2,3}/2011/03/08' as ...

or even:

type1 = load 's3:/logs/*/2011/03/08' as ...
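Hadoop's glob syntax also supports character ranges alongside {a,b} alternation, so a single load can cover a span of hours across all three types. A sketch under the same assumptions (the (ts, msg) schema is invented):

-- {1,2,3} expands to the three type directories, 0[0-5] to hours 00-05;
-- the schema is hypothetical
morning = load 's3:/logs/type{1,2,3}/2011/03/08/0[0-5]' as (ts:chararray, msg:chararray);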

Answers

I have a similar log setup to yours; the only difference is that I actually analyze the logs not by date but by type, so I use:

type1 = load 's3:/logs/type1/2011/03/' as ...

to analyze a month's worth of type1 logs without mixing them with type2. Since you analyze not by type but by date, I would suggest you change your structure to:

/logs/<year>/<month>/<day>/<hour>/<type>/lots_of_logs_for_this_hour_and_type.log*

That way you can load daily (or monthly) data and then filter it by type, which is more convenient.
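With that layout, a load-then-filter query might look like the sketch below. It assumes Pig 0.12 or later, where PigStorage accepts a '-tagPath' option that prepends each tuple with the path of its source file; the other field names are invented:

-- load every hour/type directory for one day; '-tagPath' exposes each record's source path
day_logs = load 's3:/logs/2011/03/08/*/*' using PigStorage('\t', '-tagPath') as (src:chararray, msg:chararray);
-- keep only type1 records by matching the type component of the path
type1 = filter day_logs by src matches '.*/type1/.*';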




