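Aggregation Pipeline as Alternative
The aggregation pipeline provides better performance and a more coherent interface than map-reduce.
Various map-reduce operations can be rewritten using aggregation pipeline operators such as $group and $merge. The example below includes an aggregation pipeline alternative.
To perform map-reduce operations, MongoDB provides the mapReduce command and, in the mongo shell, the db.collection.mapReduce() wrapper method.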
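If the map-reduce data set is constantly growing, you may want to perform an incremental map-reduce rather than running the map-reduce operation over the entire data set each time.
To perform incremental map-reduce:
- Run a map-reduce job over the current collection and output the result to a separate collection.
- When you have more data to process, run a subsequent map-reduce job with:
* the query parameter, which specifies conditions that match only the new documents;
* the out parameter, which specifies the reduce action to merge the new results into the existing output collection.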
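Because the reduce output action re-applies the reduce function to an existing output value together with a newly computed one, the reduce function should give the same answer whether it sees all values at once or a previously reduced value plus new ones. The following plain JavaScript sketch runs without MongoDB; `sumReduce` and `naiveAvgReduce` are hypothetical names used only for illustration. It contrasts a reducer that has this property with one that does not.

```javascript
// Hypothetical reducer that keeps a running total and count.
// It can safely be re-applied to its own output.
function sumReduce(key, values) {
  const out = { total: 0, count: 0 };
  for (const v of values) {
    out.total += v.total;
    out.count += v.count;
  }
  return out;
}

// Hypothetical reducer that averages its inputs directly.
// Re-applying it to its own output skews the result.
function naiveAvgReduce(key, values) {
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length;
}

const day1 = [{ total: 95, count: 1 }, { total: 105, count: 1 }];
const day2 = [{ total: 130, count: 1 }];

// Reducing everything at once and reducing incrementally agree.
const allAtOnce = sumReduce("a", day1.concat(day2));
const incremental = sumReduce("a", [sumReduce("a", day1)].concat(day2));
console.log(allAtOnce, incremental);

// Averaging an earlier average with a new value does not agree.
const avgAll = naiveAvgReduce("a", [95, 105, 130]);
const avgIncremental = naiveAvgReduce("a", [naiveAvgReduce("a", [95, 105]), 130]);
console.log(avgAll, avgIncremental); // 110 versus 115
```

This is why the example in this tutorial carries total_time and count through the reduce step and derives the average only afterwards.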
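Consider the following example, in which you schedule a map-reduce operation on the usersessions collection to run at the end of each day.
1 Data Setup
The usersessions collection contains documents that log users' sessions each day, for example: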
db.usersessions.insertMany([
{ userid: "a", start: ISODate('2020-03-03 14:17:00'), length: 95 },
{ userid: "b", start: ISODate('2020-03-03 14:23:00'), length: 110 },
{ userid: "c", start: ISODate('2020-03-03 15:02:00'), length: 120 },
{ userid: "d", start: ISODate('2020-03-03 16:45:00'), length: 45 },
{ userid: "a", start: ISODate('2020-03-04 11:05:00'), length: 105 },
{ userid: "b", start: ISODate('2020-03-04 13:14:00'), length: 120 },
{ userid: "c", start: ISODate('2020-03-04 17:00:00'), length: 130 },
{ userid: "d", start: ISODate('2020-03-04 15:37:00'), length: 65 }
])
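2 Initial Map-Reduce of Current Collection
Run the first map-reduce operation as follows:
- Define the map function that maps the userid to an object containing the fields total_time, count, and avg_time: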
var mapFunction = function() {
   var key = this.userid;
   var value = { total_time: this.length, count: 1, avg_time: 0 };
   emit( key, value );
};
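- Define the corresponding reduce function with two arguments, key and values, to calculate the total time and the count. The key corresponds to the userid, and values is an array whose elements are the individual objects that mapFunction mapped to that userid.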
var reduceFunction = function(key, values) {
   var reducedObject = { total_time: 0, count: 0, avg_time: 0 };
   values.forEach(function(value) {
      reducedObject.total_time += value.total_time;
      reducedObject.count += value.count;
   });
   return reducedObject;
};
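- Define the finalize function with two arguments, key and reducedValue. The function sets the avg_time field of the reducedValue document to total_time divided by count and returns the modified document.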
var finalizeFunction = function(key, reducedValue) {
   if (reducedValue.count > 0)
      reducedValue.avg_time = reducedValue.total_time / reducedValue.count;
   return reducedValue;
};
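- Perform map-reduce on the usersessions collection using mapFunction, reduceFunction, and finalizeFunction. Output the results to the collection session_stats. If the session_stats collection already exists, the operation replaces its contents: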
db.usersessions.mapReduce(
mapFunction,
reduceFunction,
{
out: "session_stats",
finalize: finalizeFunction
}
)
- Query the session_stats collection to verify the results:
db.session_stats.find().sort( { _id: 1 } )
The operation returns the following documents:
{ "_id" : "a", "value" : { "total_time" : 200, "count" : 2, "avg_time" : 100 } }
{ "_id" : "b", "value" : { "total_time" : 230, "count" : 2, "avg_time" : 115 } }
{ "_id" : "c", "value" : { "total_time" : 250, "count" : 2, "avg_time" : 125 } }
{ "_id" : "d", "value" : { "total_time" : 110, "count" : 2, "avg_time" : 55 } }
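To see how the map, reduce, and finalize steps produce these numbers, the following sketch replays them in plain JavaScript without a database. `simulateMapReduce` is a hypothetical, simplified stand-in for the server's behavior, and here `emit` is passed to the map function as an argument, whereas in MongoDB it is available globally inside the map function.

```javascript
// Sample documents; the dates are omitted because the map function ignores them.
const sessions = [
  { userid: "a", length: 95 },  { userid: "b", length: 110 },
  { userid: "c", length: 120 }, { userid: "d", length: 45 },
  { userid: "a", length: 105 }, { userid: "b", length: 120 },
  { userid: "c", length: 130 }, { userid: "d", length: 65 }
];

// Hypothetical, simplified model of a map-reduce run.
function simulateMapReduce(docs, mapFn, reduceFn, finalizeFn) {
  const grouped = new Map();
  const emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };
  // Map phase: call mapFn with `this` bound to each document.
  for (const doc of docs) mapFn.call(doc, emit);

  const results = [];
  for (const [key, values] of grouped) {
    // Reduce is only needed for keys with more than one emitted value.
    const reduced = values.length > 1 ? reduceFn(key, values) : values[0];
    results.push({ _id: key, value: finalizeFn(key, reduced) });
  }
  return results.sort((x, y) => x._id.localeCompare(y._id));
}

const mapFn = function (emit) {
  emit(this.userid, { total_time: this.length, count: 1, avg_time: 0 });
};
const reduceFn = function (key, values) {
  const out = { total_time: 0, count: 0, avg_time: 0 };
  values.forEach((v) => {
    out.total_time += v.total_time;
    out.count += v.count;
  });
  return out;
};
const finalizeFn = function (key, reduced) {
  if (reduced.count > 0) reduced.avg_time = reduced.total_time / reduced.count;
  return reduced;
};

const sessionStats = simulateMapReduce(sessions, mapFn, reduceFn, finalizeFn);
sessionStats.forEach((doc) => console.log(JSON.stringify(doc)));
```

For user "a", the two sessions of 95 and 105 give a total_time of 200, a count of 2, and an avg_time of 100, matching the first document above.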
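3 Subsequent Incremental Map-Reduce
Later, as the usersessions collection grows, you can run additional map-reduce operations. For example, add new documents to the usersessions collection: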
db.usersessions.insertMany([
{ userid: "a", ts: ISODate('2020-03-05 14:17:00'), length: 130 },
{ userid: "b", ts: ISODate('2020-03-05 14:23:00'), length: 40 },
{ userid: "c", ts: ISODate('2020-03-05 15:02:00'), length: 110 },
{ userid: "d", ts: ISODate('2020-03-05 16:45:00'), length: 100 }
])
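At the end of the day, perform incremental map-reduce on the usersessions collection, using the query field to select only the new documents. The new documents store their timestamp in ts, whereas the initial documents use start, so the ts filter matches only the new documents. Output the results to the session_stats collection, and use the reduce action to merge them with the collection's existing contents: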
db.usersessions.mapReduce(
mapFunction,
reduceFunction,
{
query: { ts: { $gte: ISODate('2020-03-05 00:00:00') } },
out: { reduce: "session_stats" },
finalize: finalizeFunction
}
);
Query the session_stats collection to verify the results:
db.session_stats.find().sort( { _id: 1 } )
The operation returns the following documents:
{ "_id" : "a", "value" : { "total_time" : 330, "count" : 3, "avg_time" : 110 } }
{ "_id" : "b", "value" : { "total_time" : 270, "count" : 3, "avg_time" : 90 } }
{ "_id" : "c", "value" : { "total_time" : 360, "count" : 3, "avg_time" : 120 } }
{ "_id" : "d", "value" : { "total_time" : 210, "count" : 3, "avg_time" : 70 } }
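The merged values can be checked by hand with a small plain JavaScript model of the reduce output action. `mergeWithReduce`, `reduceStats`, and `finalizeStats` are hypothetical names, and the model is a simplification that assumes an existing value and a new value for the same key are reduced together and then finalized.

```javascript
// Values already in session_stats after the initial run.
const existingStats = new Map([
  ["a", { total_time: 200, count: 2, avg_time: 100 }],
  ["b", { total_time: 230, count: 2, avg_time: 115 }],
  ["c", { total_time: 250, count: 2, avg_time: 125 }],
  ["d", { total_time: 110, count: 2, avg_time: 55 }]
]);

// Values mapped from the documents added on 2020-03-05.
const newValues = new Map([
  ["a", { total_time: 130, count: 1, avg_time: 0 }],
  ["b", { total_time: 40, count: 1, avg_time: 0 }],
  ["c", { total_time: 110, count: 1, avg_time: 0 }],
  ["d", { total_time: 100, count: 1, avg_time: 0 }]
]);

function reduceStats(key, values) {
  const out = { total_time: 0, count: 0, avg_time: 0 };
  values.forEach((v) => {
    out.total_time += v.total_time;
    out.count += v.count;
  });
  return out;
}

function finalizeStats(key, reduced) {
  if (reduced.count > 0) reduced.avg_time = reduced.total_time / reduced.count;
  return reduced;
}

// Simplified model of the reduce output action: reduce the existing and the
// new value for a key together, then finalize; keys seen for the first time
// are finalized and stored as they are.
function mergeWithReduce(existing, incoming) {
  const merged = new Map(existing);
  for (const [key, value] of incoming) {
    const combined = merged.has(key)
      ? reduceStats(key, [merged.get(key), value])
      : value;
    merged.set(key, finalizeStats(key, combined));
  }
  return merged;
}

const mergedStats = mergeWithReduce(existingStats, newValues);
for (const [key, value] of mergedStats) console.log(key, JSON.stringify(value));
```

The stale avg_time of the existing value is ignored by the reduce step and recomputed by the finalize step, which is why user "b" ends at 270 / 3 = 90 rather than at an average of averages.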
4 Aggregation Alternative
Prerequisite: reset the collection to its original state:
db.usersessions.drop();
db.usersessions.insertMany([
{ userid: "a", start: ISODate('2020-03-03 14:17:00'), length: 95 },
{ userid: "b", start: ISODate('2020-03-03 14:23:00'), length: 110 },
{ userid: "c", start: ISODate('2020-03-03 15:02:00'), length: 120 },
{ userid: "d", start: ISODate('2020-03-03 16:45:00'), length: 45 },
{ userid: "a", start: ISODate('2020-03-04 11:05:00'), length: 105 },
{ userid: "b", start: ISODate('2020-03-04 13:14:00'), length: 120 },
{ userid: "c", start: ISODate('2020-03-04 17:00:00'), length: 130 },
{ userid: "d", start: ISODate('2020-03-04 15:37:00'), length: 65 }
])
Using the available aggregation pipeline operators, you can rewrite the map-reduce example without defining custom functions:
db.usersessions.aggregate([
{ $group: { _id: "$userid", total_time: { $sum: "$length" }, count: { $sum: 1 }, avg_time: { $avg: "$length" } } },
{ $project: { value: { total_time: "$total_time", count: "$count", avg_time: "$avg_time" } } },
{ $merge: {
into: "session_stats_agg",
whenMatched: [ { $set: {
"value.total_time": { $add: [ "$value.total_time", "$$new.value.total_time" ] },
"value.count": { $add: [ "$value.count", "$$new.value.count" ] },
"value.avg_time": { $divide: [ { $add: [ "$value.total_time", "$$new.value.total_time" ] }, { $add: [ "$value.count", "$$new.value.count" ] } ] }
} } ],
whenNotMatched: "insert"
}}
])
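- The $group stage groups by userid and calculates:
* total_time, using the $sum operator;
* count, using the $sum operator;
* avg_time, using the $avg operator.
The stage outputs the following documents: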
{ "_id" : "c", "total_time" : 250, "count" : 2, "avg_time" : 125 }
{ "_id" : "d", "total_time" : 110, "count" : 2, "avg_time" : 55 }
{ "_id" : "a", "total_time" : 200, "count" : 2, "avg_time" : 100 }
{ "_id" : "b", "total_time" : 230, "count" : 2, "avg_time" : 115 }
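- The $project stage reshapes the output documents to mirror the map-reduce output, which has the two fields _id and value. The stage is optional if you do not need to mirror the _id and value structure.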
{ "_id" : "a", "value" : { "total_time" : 200, "count" : 2, "avg_time" : 100 } }
{ "_id" : "d", "value" : { "total_time" : 110, "count" : 2, "avg_time" : 55 } }
{ "_id" : "b", "value" : { "total_time" : 230, "count" : 2, "avg_time" : 115 } }
{ "_id" : "c", "value" : { "total_time" : 250, "count" : 2, "avg_time" : 125 } }
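- The $merge stage outputs the results to the session_stats_agg collection. If an existing document has the same _id as a new result, the operation applies the specified pipeline to compute total_time, count, and avg_time from the result and the existing document. If no document with the same _id exists in session_stats_agg, the operation inserts the document.
- Query the session_stats_agg collection to verify the results:
db.session_stats_agg.find().sort( { _id: 1 } )
The operation returns the following documents: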
{ "_id" : "a", "value" : { "total_time" : 200, "count" : 2, "avg_time" : 100 } }
{ "_id" : "b", "value" : { "total_time" : 230, "count" : 2, "avg_time" : 115 } }
{ "_id" : "c", "value" : { "total_time" : 250, "count" : 2, "avg_time" : 125 } }
{ "_id" : "d", "value" : { "total_time" : 110, "count" : 2, "avg_time" : 55 } }
- Add new documents to the usersessions collection:
db.usersessions.insertMany([
{ userid: "a", ts: ISODate('2020-03-05 14:17:00'), length: 130 },
{ userid: "b", ts: ISODate('2020-03-05 14:23:00'), length: 40 },
{ userid: "c", ts: ISODate('2020-03-05 15:02:00'), length: 110 },
{ userid: "d", ts: ISODate('2020-03-05 16:45:00'), length: 100 }
])
- Add a $match stage at the start of the pipeline to specify the date filter:
db.usersessions.aggregate([
{ $match: { ts: { $gte: ISODate('2020-03-05 00:00:00') } } },
{ $group: { _id: "$userid", total_time: { $sum: "$length" }, count: { $sum: 1 }, avg_time: { $avg: "$length" } } },
{ $project: { value: { total_time: "$total_time", count: "$count", avg_time: "$avg_time" } } },
{ $merge: {
into: "session_stats_agg",
whenMatched: [ { $set: {
"value.total_time": { $add: [ "$value.total_time", "$$new.value.total_time" ] },
"value.count": { $add: [ "$value.count", "$$new.value.count" ] },
"value.avg_time": { $divide: [ { $add: [ "$value.total_time", "$$new.value.total_time" ] }, { $add: [ "$value.count", "$$new.value.count" ] } ] }
} } ],
whenNotMatched: "insert"
}}
])
- Query the session_stats_agg collection to verify the results:
db.session_stats_agg.find().sort( { _id: 1 } )
The operation returns the following documents:
{ "_id" : "a", "value" : { "total_time" : 330, "count" : 3, "avg_time" : 110 } }
{ "_id" : "b", "value" : { "total_time" : 270, "count" : 3, "avg_time" : 90 } }
{ "_id" : "c", "value" : { "total_time" : 360, "count" : 3, "avg_time" : 120 } }
{ "_id" : "d", "value" : { "total_time" : 210, "count" : 3, "avg_time" : 70 } }
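The two branches of the $merge stage used here can be modeled in a few lines of plain JavaScript. `mergeStage` is a hypothetical helper, not a MongoDB API: `existing` plays the role of the fields referenced as "$value...", `incoming` plays the role of "$$new", and the user "e" is invented to exercise the insert branch.

```javascript
// Hypothetical, simplified model of
//   $merge: { into: ..., whenMatched: [ { $set: ... } ], whenNotMatched: "insert" }
function mergeStage(target, incomingDocs) {
  for (const incoming of incomingDocs) {
    const existing = target.get(incoming._id);
    if (existing === undefined) {
      // whenNotMatched: "insert"
      target.set(incoming._id, incoming);
      continue;
    }
    // whenMatched: every expression reads the values as they were before the update.
    const total = existing.value.total_time + incoming.value.total_time;
    const count = existing.value.count + incoming.value.count;
    target.set(incoming._id, {
      _id: incoming._id,
      value: { total_time: total, count: count, avg_time: total / count }
    });
  }
  return target;
}

// Stand-in for part of session_stats_agg after the first aggregation run.
const statsAgg = new Map([
  ["a", { _id: "a", value: { total_time: 200, count: 2, avg_time: 100 } }],
  ["b", { _id: "b", value: { total_time: 230, count: 2, avg_time: 115 } }]
]);

// Output of the $group and $project stages for the new day, plus a new user "e".
const newDayDocs = [
  { _id: "a", value: { total_time: 130, count: 1, avg_time: 130 } },
  { _id: "e", value: { total_time: 50, count: 1, avg_time: 50 } }
];

mergeStage(statsAgg, newDayDocs);
for (const doc of statsAgg.values()) console.log(JSON.stringify(doc));
```

User "a" is matched and updated to a total_time of 330 over 3 sessions, user "b" is left untouched, and user "e" is inserted as it arrived.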
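- Optional. To avoid having to modify the $match date condition in the aggregation pipeline each time you run it, you can wrap the aggregation in a helper function: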
updateSessionStats = function(startDate) {
db.usersessions.aggregate([
{ $match: { ts: { $gte: startDate } } },
{ $group: { _id: "$userid", total_time: { $sum: "$length" }, count: { $sum: 1 }, avg_time: { $avg: "$length" } } },
{ $project: { value: { total_time: "$total_time", count: "$count", avg_time: "$avg_time" } } },
{ $merge: {
into: "session_stats_agg",
whenMatched: [ { $set: {
"value.total_time": { $add: [ "$value.total_time", "$$new.value.total_time" ] },
"value.count": { $add: [ "$value.count", "$$new.value.count" ] },
"value.avg_time": { $divide: [ { $add: [ "$value.total_time", "$$new.value.total_time" ] }, { $add: [ "$value.count", "$$new.value.count" ] } ] }
} } ],
whenNotMatched: "insert"
}}
]);
};
Then, to run it, pass the start date to the updateSessionStats() function:
updateSessionStats(ISODate('2020-03-05 00:00:00'))
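If the job runs on a schedule, the start date can be computed rather than typed in. The sketch below uses a hypothetical helper, `startOfUtcDay`, and assumes the day boundary should fall at midnight UTC, in line with the ISODate values used in this example.

```javascript
// Hypothetical helper: midnight (UTC) of the day that contains `date`.
function startOfUtcDay(date) {
  return new Date(Date.UTC(
    date.getUTCFullYear(),
    date.getUTCMonth(),
    date.getUTCDate()
  ));
}

// A session that started in the afternoon of 2020-03-05 belongs to that day.
const cutoff = startOfUtcDay(new Date("2020-03-05T16:45:00Z"));
console.log(cutoff.toISOString()); // 2020-03-05T00:00:00.000Z
```

In the shell, the result could then be passed to the helper defined above, for example updateSessionStats(startOfUtcDay(new Date())).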