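Background
I recently used MongoDB for some statistics. The requirement was to count, for each day (derived from a creation timestamp), how many documents have each value of a given field, and to output the result paginated by day. My first thought was aggregate, but in my tests it could not solve the problem, so I reluctantly turned to MapReduce. As this was my first time using it, I ran into a few pitfalls, which I record here for fellow newcomers.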
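Overview
MapReduce, as the name suggests, consists of two main steps, map and reduce. It is not unique to MongoDB; it is a widely used approach to large-scale data analysis, found for example in Hadoop and HBase. This article covers MongoDB's implementation. MongoDB embeds a JavaScript engine (V8 in versions 2.4 to 3.0, SpiderMonkey from 3.2 onward), and the map and reduce functions must be written in JavaScript. **Note that the embedded engine differs between MongoDB versions.** This was the first pitfall I hit: my development environment ran MongoDB v3.4, which accepted ES6 syntax, while production ran v3.2, which in my case did not and only accepted ES5. My job failed in production because of a let, and replacing it with var fixed it.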
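To make the first pitfall concrete, here is a minimal sketch of the same map function written once with ES5-only syntax and once with ES6 syntax. The field name day and the helper looksLikeEs6 are hypothetical and exist only for this illustration; the helper is a crude textual check, not a parser, and is no substitute for testing against the same MongoDB version as production.

```javascript
// A map function restricted to ES5 syntax: var and function expressions only.
var es5Map = function () {
    var day = this.day;
    emit(day, 1);
};

// The same logic using ES6 syntax (let and a template literal).
var es6Map = function () {
    let day = `${this.day}`;
    emit(day, 1);
};

// Hypothetical helper: looks for a few ES6-only tokens in the function source.
// It can give false positives, for example when a token appears in a string.
function looksLikeEs6(fn) {
    return /\b(let|const)\s|=>|`/.test(fn.toString());
}

console.log(looksLikeEs6(es5Map)); // false
console.log(looksLikeEs6(es6Map)); // true
```

Neither map function is called here, so the missing emit global does not matter; only the source text is inspected.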
In the mongo shell, a MapReduce operation can be run with either db.runCommand or db.collection.mapReduce (the method name is case-sensitive).
db.collection.mapReduce(
    <map function>,
    <reduce function>,
    {
        out: <output collection name, or { inline: 1 }>,
        query: <query filter document>,
        sort: <sort order for the input documents; useful for optimization>,
        limit: <maximum number of input documents>,
        keeptemp: <true|false>,
        finalize: <finalize function>,
        scope: <document whose fields become global variables in the JavaScript functions>,
        verbose: <boolean>,
        jsMode: <boolean, default false>
    }
)
map, reduce, and out are required; all other parameters are optional.
For what each parameter does, see the article "MongoDB MapReduce基础说明".
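Basic usage
For the basic principles, see the 菜鸟教程 tutorial.
Suppose there is a collection named test whose documents contain, among other fields, PACKAGE_CREATED (a Unix timestamp in seconds) and ROUTER_STATUS (the original screenshot of the collection is not reproduced here).

The problem can then be solved as follows: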
// Step 1: compute the per-day counts and write them to test_temp
db.test.mapReduce(
    // map: emit one value per document, keyed by its UTC+8 day "YYYY-MM-DD"
    function() {
        // PACKAGE_CREATED is in seconds; add 8 hours (28800000 ms) and read
        // the UTC fields so the key does not depend on the server time zone
        var datetime = new Date(this.PACKAGE_CREATED * 1000 + 28800000);
        var key = datetime.getUTCFullYear() + "-" + ("0" + (datetime.getUTCMonth() + 1)).slice(-2) + "-" + ("0" + datetime.getUTCDate()).slice(-2);
        // single: 0 marks a raw mapped value that has not been reduced yet
        var value = {router_status: this.ROUTER_STATUS, single: 0, total: 1, cancellation: 1, availability: 1, pick_up: 1, transit: 1, delivered: 1, abnormal: 1};
        emit(key, value);
    },
    // reduce: values may mix raw mapped values and the output of earlier reduce calls
    function(key, values) {
        var ret = {router_status: -1000, total: 0, cancellation: 0, availability: 0, pick_up: 0, transit: 0, delivered: 0, abnormal: 0};
        for (var i = 0; i < values.length; i++) {
            switch (values[i].router_status) {
                case 1:
                    ret.delivered += values[i].delivered;
                    ret.total += values[i].total;
                    break;
                case 3:
                    ret.transit += values[i].transit;
                    ret.total += values[i].total;
                    break;
                case 2:
                    ret.cancellation += values[i].cancellation;
                    ret.total += values[i].total;
                    break;
                case -1000:
                    // output of an earlier reduce call: carry over every counter
                    ret.abnormal += values[i].abnormal;
                    ret.transit += values[i].transit;
                    ret.pick_up += values[i].pick_up;
                    ret.delivered += values[i].delivered;
                    ret.cancellation += values[i].cancellation;
                    // intentional fall-through so that total is added as well
                default:
                    ret.total += values[i].total;
            }
        }
        ret.availability = ret.total - ret.cancellation;
        ret.created = (new Date(key)).getTime() / 1000;
        ret.loading = false;
        ret.single = 1; // marks the value as reduce output
        return ret;
    },
    {
        //out: {inline: 1},
        out: {replace: "test_temp"},
        query: {PACKAGE_CREATED: {$ne: null}},
        // finalize: a day with a single document never goes through reduce,
        // so its raw mapped value is converted to the reduce output shape here
        finalize: function(key, reduceResult) {
            if (reduceResult.single == 1) {
                return reduceResult;
            } else {
                var ret = {router_status: -1000, total: 0, cancellation: 0, availability: 0, pick_up: 0, transit: 0, delivered: 0, abnormal: 0};
                switch (reduceResult.router_status) {
                    case 1:
                        ret.delivered += 1;
                        ret.total += 1;
                        break;
                    case 3:
                        ret.transit += 1;
                        ret.total += 1;
                        break;
                    case 2:
                        ret.cancellation += 1;
                        ret.total += 1;
                        break;
                    default:
                        ret.total += reduceResult.total;
                }
                ret.availability = ret.total - ret.cancellation;
                ret.created = (new Date(key)).getTime() / 1000;
                ret.loading = false;
                ret.single = 1;
                return ret;
            }
        }
    }
)
// Step 2: page through the results, ordered by day (a dotted key must be quoted)
db.test_temp.find({}).sort({"value.created": 1}).skip(0).limit(20)
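The day key used in step 1 can be checked outside MongoDB. The sketch below is a variant of that computation which shifts the timestamp and then reads the UTC fields, so the result does not depend on the time zone of the machine running it. The helper name dayKey and its offsetHours parameter are assumptions made for this illustration.

```javascript
// Hypothetical helper modeled on the key computation in the map function.
// tsSeconds: Unix timestamp in seconds; offsetHours: local offset from UTC.
function dayKey(tsSeconds, offsetHours) {
    // Shift the instant by the offset, then read the UTC fields.
    var shifted = new Date(tsSeconds * 1000 + offsetHours * 3600 * 1000);
    var month = ("0" + (shifted.getUTCMonth() + 1)).slice(-2);
    var day = ("0" + shifted.getUTCDate()).slice(-2);
    return shifted.getUTCFullYear() + "-" + month + "-" + day;
}

// 2017-12-31 16:30:00 UTC is already 2018-01-01 00:30:00 in UTC+8.
var ts = Date.UTC(2017, 11, 31, 16, 30, 0) / 1000;
console.log(dayKey(ts, 0)); // 2017-12-31
console.log(dayKey(ts, 8)); // 2018-01-01
```

A document created late in the evening UTC therefore lands on the next day's bucket once the 8 hour correction is applied.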
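As the example shows, steps 1 and 2 together solve the problem. Step 1 includes a time zone correction: MongoDB works in UTC, so converting to China time requires adding 8 hours. This is also where I hit the second pitfall: when a key has only one mapped value, that value never goes through reduce, so it has to be handled in finalize. That is why I added the finalize step.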
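The second pitfall can be reproduced without a database. The following is a minimal in-memory sketch, not MongoDB code: runMapReduce is a hypothetical stand-in that skips reduce for keys with a single value, and the fields day, status, and reduced are invented for the illustration. Unlike in MongoDB, emit is passed to map as an argument rather than provided as a global.

```javascript
// Hypothetical stand-in for the pipeline: reduce is skipped for single-value keys.
function runMapReduce(docs, map, reduce, finalize) {
    var groups = {};
    docs.forEach(function (doc) {
        map.call(doc, function emit(key, value) {
            (groups[key] = groups[key] || []).push(value);
        });
    });
    var out = {};
    Object.keys(groups).forEach(function (key) {
        var values = groups[key];
        // A single mapped value goes straight to finalize.
        var reduced = values.length > 1 ? reduce(key, values) : values[0];
        out[key] = finalize(key, reduced);
    });
    return out;
}

var docs = [
    { day: "2018-01-01", status: 1 },
    { day: "2018-01-01", status: 2 },
    { day: "2018-01-02", status: 1 } // the only document for this day
];

function map(emit) {
    emit(this.day, { reduced: false, status: this.status, total: 1 });
}

function reduce(key, values) {
    var ret = { reduced: true, total: 0 };
    for (var i = 0; i < values.length; i++) { ret.total += values[i].total; }
    return ret;
}

// finalize normalizes a raw mapped value that never went through reduce.
function finalize(key, value) {
    return value.reduced ? value : { reduced: true, total: value.total };
}

var result = runMapReduce(docs, map, reduce, finalize);
console.log(JSON.stringify(result));
```

Without the finalize step, the entry for 2018-01-02 would keep the raw mapped shape (including status) while the entry for 2018-01-01 would have the reduced shape, which is the inconsistency the finalize function in step 1 removes.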
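The third pitfall: when a key has more than 100 mapped values, they are not all passed to reduce in a single call. They are passed in batches of 100, and the output of an earlier reduce call is passed back in as one of the values of a later call. That is why reduce uses ret.delivered += values[i].delivered; rather than ret.delivered += 1;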
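The effect of batching can be sketched in plain JavaScript. The driver below is an assumption about the general shape of the process (each partial result is fed back in with the next batch), not MongoDB's actual implementation; reduceInBatches, reduceBySum, and reduceByOne are hypothetical names, and batchSize is a parameter of the sketch rather than a MongoDB option.

```javascript
// Reduce the values for one key in batches, feeding each partial result
// back in as the first value of the next batch.
function reduceInBatches(key, values, reduce, batchSize) {
    var partial = null;
    for (var start = 0; start < values.length; start += batchSize) {
        var batch = values.slice(start, start + batchSize);
        if (partial !== null) { batch.unshift(partial); }
        partial = reduce(key, batch);
    }
    return partial;
}

// Correct: adds the count carried by each value.
function reduceBySum(key, values) {
    var ret = { count: 0 };
    for (var i = 0; i < values.length; i++) { ret.count += values[i].count; }
    return ret;
}

// Wrong: counts every value as 1, so a partial result that already covers
// a whole batch is counted as a single item.
function reduceByOne(key, values) {
    var ret = { count: 0 };
    for (var i = 0; i < values.length; i++) { ret.count += 1; }
    return ret;
}

var values = [];
for (var n = 0; n < 250; n++) { values.push({ count: 1 }); }

console.log(reduceInBatches("2018-01-01", values, reduceBySum, 100).count); // 250
console.log(reduceInBatches("2018-01-01", values, reduceByOne, 100).count); // 51
```

With 250 values and batches of 100, the summing version returns the true count, while the version that adds 1 per value only counts the last batch plus one partial result.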
For details, see the article "MongoDB MapReduce原理实战之坑注意点".
PHP usage
When using the MongoDB extension:
use MongoDB\BSON\Javascript;
$condition ['PACKAGE_CREATED'] = ['$ne' => null];
$map = new Javascript('function(){...}');
$reduce = new Javascript('function(){...}');
$finalize = new Javascript('function(){...}');
$options = [
'finalize' =>$finalize,
'query' => $condition
];
// Create the temporary collection test_temp and store the aggregated results in it
$temp = $this->mongodb->test->mapReduce($map, $reduce, 'test_temp',$options);
When using the legacy Mongo extension:
$condition ['PACKAGE_CREATED'] = ['$ne' => null];
$map = 'function(){...}';
$reduce = 'function(){...}';
$finalize = 'function(){...}';
$temp = $mongo->command(array(
'mapreduce' => 'test',
'map' => $map,
'reduce' => $reduce,
'finalize' => $finalize,
'query' => $condition,
'sort' => ['PACKAGE_CREATED' => -1],
//'out' => ['inline' => 1]
'out' => 'test_temp'
));