What's the best way to count unique visitors with Hadoop?

In short, what is the best way in MapReduce to count unique visitors if your log files look like this?

DATE       siteID  action   username
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview tom
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview bob
05-05-2010 siteA   pageview mike

and you want to find the number of unique visitors for each site?

I was thinking the mapper would emit (siteID, username) and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that could mean storing millions of usernames in memory, which doesn't seem right. Does anyone have a better way?

I'm using Hadoop Streaming, by the way.

Thanks

Best answer

You can do it as a two-stage operation:

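In the first step, emit `(username => siteID)` and have the reducer collapse multiple occurrences of the same siteID using a `set`. Since you typically have far fewer sites than users, this should be fine.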

Then, in the second step, you can emit `(siteID => username)` and do a simple count, since the duplicates have already been removed.
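The two stages can be sketched in plain Python as below. This is only a local simulation, not a real Hadoop job: `shuffle`, `run_job`, `unique_visitors` and the stage function names are hypothetical helpers invented for illustration, and the whitespace-separated log layout is assumed from the sample in the question.

```python
from itertools import groupby

SAMPLE_LOG = """\
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview tom
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview bob
05-05-2010 siteA pageview mike
"""


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: sort by key and group the values."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def run_job(records, mapper, reducer):
    """Run one map/shuffle/reduce pass over an in-memory list of records."""
    mapped = [kv for record in records for kv in mapper(record)]
    return [out for key, values in shuffle(mapped) for out in reducer(key, values)]


def stage1_map(line):
    # Assumed layout: DATE siteID action username
    _date, site, _action, user = line.split()
    yield user, site


def stage1_reduce(user, sites):
    # The set holds the distinct sites of ONE user, which is expected to be small.
    for site in set(sites):
        yield site, user


def stage2_map(pair):
    # Identity mapper: stage 1 already produced (siteID, username) pairs.
    yield pair


def stage2_reduce(site, users):
    # Each user now appears at most once per site, so a plain count is enough.
    yield site, sum(1 for _ in users)


def unique_visitors(lines):
    deduplicated = run_job(lines, stage1_map, stage1_reduce)
    return dict(run_job(deduplicated, stage2_map, stage2_reduce))
```

Calling `unique_visitors(SAMPLE_LOG.strip().splitlines())` on the sample data gives two unique visitors for each site. The point of the design is that the only in-memory set is per user rather than per site.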

Other answers

My approach is similar to the previous answer, with a small twist:

  1. map output : (username, siteid) => ("")
  2. reduce output: (siteid) => (1)
  3. map : identity mapper
  4. reduce : LongSumReducer (i.e. simply sum)

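Note that the first reduce does not need to go over any of the records it is presented with. You can simply examine the key and produce the output.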
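A minimal local sketch of the four steps above, in Python. It only imitates the shuffle with `sorted` and `groupby`; the helper names (`group_by_key`, `job1_map`, `job1_reduce`, `job2_reduce`, `count_unique`) are made up for illustration, and the second reducer stands in for what a sum reducer such as LongSumReducer would do in a real job.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(pairs):
    """Imitate the shuffle: sort (key, value) pairs and group values by key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, (value for _, value in group)


def job1_map(line):
    # Assumed layout: DATE siteID action username
    _date, site, _action, user = line.split()
    # The composite key carries all the information; the value is empty.
    return (user, site), ""


def job1_reduce(key, _values):
    # The values are never read: one call per distinct (username, siteid) key.
    _user, site = key
    return site, 1


def job2_reduce(site, ones):
    # Same role as a sum reducer: add up the 1s for each site.
    return site, sum(ones)


def count_unique(lines):
    job1_out = [job1_reduce(k, vs) for k, vs in group_by_key(map(job1_map, lines))]
    # The second map is the identity, so job1_out feeds the second shuffle directly.
    return dict(job2_reduce(k, vs) for k, vs in group_by_key(job1_out))
```

Because the first reducer looks only at the key, it holds nothing in memory beyond the current record, regardless of how many times a user visited a site.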

HTH

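Use a secondary sort to sort on the user ID. That way you don't need to keep anything in memory: just stream the data through and increment your distinct counter every time you see the value change for a particular site ID.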

Here is some documentation.
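The secondary-sort idea might look like the following streaming-style reducer in Python. It is a sketch under assumptions: the input lines are taken to be tab-separated `siteID<TAB>username`, already sorted by site and then by username, with all lines for a site reaching the same reducer. How the job is configured to guarantee that ordering is not shown here, and `count_distinct` is a hypothetical name.

```python
def count_distinct(sorted_lines):
    """Yield (siteID, distinct user count) from lines sorted by site, then user.

    Only the previous site, the previous user and one counter are kept,
    so memory use does not grow with the number of users.
    """
    prev_site = None
    prev_user = None
    count = 0
    for line in sorted_lines:
        site, user = line.rstrip("\n").split("\t")
        if site != prev_site:
            # A new site starts: flush the finished one and reset the state.
            if prev_site is not None:
                yield prev_site, count
            prev_site, prev_user, count = site, None, 0
        if user != prev_user:
            # The username changed within this site, so it is a new visitor.
            count += 1
            prev_user = user
    if prev_site is not None:
        yield prev_site, count
```

In a streaming reducer script this would be driven by something like `for site, n in count_distinct(sys.stdin): print(f"{site}\t{n}")`, after importing `sys`.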




