Mongo map/reduce slowdown on large collections

We have a seemingly simple map/reduce job that goes through logging data once a day. On our development server, we can run this job over a large number of documents (~1M) without any problems. When we move the job to our production server, an Amazon EC2 server, the job races through about 50% of the rows very quickly and then crawls through the rest. Getting through a few hundred thousand documents takes hours rather than the minute or two we expect. So I'm hoping we've made an obvious mistake in our map/reduce job.

Here's a sample input document:

{
    "_id" : ObjectId("4f147a92d72b292c02000057"),
    "cid" : 25,
    "ip" : "123.45.67.89",
    "b" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7",
    "r" : "",
    "l" : "en-US,en;q=0.8",
    "ts" : ISODate("2012-01-16T19:29:22Z"),
    "s" : 0,
    "cv" : "4f143a5fd72b292d7f000007",
    "c" : ""
}

We only query on ranges of _ids.

Here's the map code:

function () {
    var browser = {},
        referrer = {};
    browser[this.b] = { count: 1 };
    referrer[this.r] = { count: 1 };
    var objEmit = {
        count: 1,
        browsers: browser,
        referrers: referrer
    };
    // Truncate the ObjectId's embedded timestamp to the start of the day
    var date = this._id.getTimestamp();
    date.setHours(0);
    date.setMinutes(0);
    date.setSeconds(0);
    emit({ cv: this.cv, date: date, cid: this.cid }, objEmit);
};
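To sanity-check the map stage outside MongoDB, the function above can be simulated in plain Node.js (a sketch, not from the original post: the mock document and the fake ObjectId.getTimestamp stand in for the shell types):

```javascript
// Sketch: run the map function against a mock document and capture what it
// emits, to inspect the key/value shape MongoDB's reduce will receive.
const emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Mock document; getTimestamp() imitates a MongoDB ObjectId.
const doc = {
    _id: { getTimestamp: () => new Date("2012-01-16T19:29:22Z") },
    cid: 25,
    b: "Mozilla/5.0 (Macintosh; ...) Chrome/16.0.912.63",
    r: "",
    cv: "4f143a5fd72b292d7f000007"
};

function map() {
    var browser = {}, referrer = {};
    browser[this.b] = { count: 1 };
    referrer[this.r] = { count: 1 };
    var objEmit = { count: 1, browsers: browser, referrers: referrer };
    var date = this._id.getTimestamp();
    date.setHours(0);
    date.setMinutes(0);
    date.setSeconds(0);
    emit({ cv: this.cv, date: date, cid: this.cid }, objEmit);
}

map.call(doc);
console.log(JSON.stringify(emitted[0], null, 2));
```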

Here's the reduce code:

function (key, emits) {
    var total = 0,
        browsers = {},
        referrers = {};
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
        // Loop variables are named b/r so they don't shadow the key parameter
        for (var b in emits[i].browsers) {
            if (emits[i].browsers.hasOwnProperty(b)) {
                if (!browsers[b]) { browsers[b] = { count: 0 }; }
                browsers[b].count += emits[i].browsers[b].count;
            }
        }
        for (var r in emits[i].referrers) {
            if (emits[i].referrers.hasOwnProperty(r)) {
                if (!referrers[r]) { referrers[r] = { count: 0 }; }
                referrers[r].count += emits[i].referrers[r].count;
            }
        }
    }
    return { count: total, browsers: browsers, referrers: referrers };
};
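MongoDB may invoke reduce several times per key, feeding earlier reduce outputs back in as inputs, so the function must be safe to re-reduce. A quick Node.js check (a sketch, not from the original post) that this reduce preserves counts when applied to its own output:

```javascript
// Sketch: verify the reduce function is re-reduce safe, i.e. that
// reduce(key, [reduce(key, partials), more...]) yields consistent totals.
function reduce(key, emits) {
    var total = 0, browsers = {}, referrers = {};
    for (var i = 0; i < emits.length; i++) {
        total += emits[i].count;
        for (var b in emits[i].browsers) {
            if (!browsers[b]) { browsers[b] = { count: 0 }; }
            browsers[b].count += emits[i].browsers[b].count;
        }
        for (var r in emits[i].referrers) {
            if (!referrers[r]) { referrers[r] = { count: 0 }; }
            referrers[r].count += emits[i].referrers[r].count;
        }
    }
    return { count: total, browsers: browsers, referrers: referrers };
}

// Two single-document emits, as the map stage would produce them.
var e1 = { count: 1, browsers: { Chrome: { count: 1 } }, referrers: { "": { count: 1 } } };
var e2 = { count: 1, browsers: { Chrome: { count: 1 } }, referrers: { "": { count: 1 } } };

var once = reduce(null, [e1, e2]);        // combine two raw emits
var rereduced = reduce(null, [once, e1]); // feed a reduce result back in

console.log(once.count, rereduced.count);
```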

There's no finalize step, and we output the map/reduce job into an existing collection using the "merge" option set to true.
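For completeness, here's a sketch (not from the original post; the collection name and the startId/endId variables are assumptions) of how a job like this might be invoked from the mongo shell, querying an _id range and merging into an existing collection:

```javascript
// Sketch with assumed names: mapFn/reduceFn are the functions above,
// startId/endId are ObjectIds bounding the day (ObjectIds embed a
// timestamp, so an _id range selects a time window).
db.hits.mapReduce(mapFn, reduceFn, {
    query: { _id: { $gte: startId, $lt: endId } },
    out: { merge: "daily_stats" }
});
```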

Any help is greatly appreciated.

Answers

Since it's the same code running on dev and in production, and you've been running it on dev against very large sets where it returns quickly, is there any particular reason you suspect your code is at fault?

Is it possible you're running on a micro instance? In case you didn't know, micro instances have capped average CPU usage (http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html), and that can wreck your map/reduce activity: the job creates a large amount of data churn without being allowed to process it (I/O is not capped the same way, so data keeps coming in and the Linux kernel then spends most of its time managing it, making things even worse).

Switching from micro to small might get you out of trouble, even with a lower nominal CPU speed, because you have a steady "flow" of fixed CPU cycles to work with (as a normal machine always does), and MongoDB's internal scheduling is probably better suited to that.

Typical query "spikes" don't last long enough for the CPU throttling to kick in.




