Original English text: http://www.mongodb.org/display/DOCS/MapReduce
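In MongoDB, MapReduce is used mainly for batch data processing and aggregation. It is somewhat like Hadoop: all input comes from a collection and all output goes to a collection. Where you would use GROUP BY aggregation in a traditional relational database, map/reduce is often the right tool in MongoDB.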
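Indexing and standard queries in MongoDB are separate from map/reduce. If you have used CouchDB in the past, note that this is a big difference between CouchDB and MongoDB: indexing and querying in MongoDB work more like they do in MySQL.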
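map/reduce is exposed as a MongoDB database command, and its output is normally written to a collection. The map and reduce functions are written in JavaScript and executed on the server. The command syntax is as follows: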
db.runCommand(
{ mapreduce : <collection>,
map : <mapfunction>,
reduce : <reducefunction>
[, query : <query filter object>]
[, sort : <sorts the input objects using this key. Useful for optimization, like sorting by the emit key for fewer reduces>]
[, limit : <number of objects to return from collection>]
[, out : <see output options below>]
[, keeptemp: <true|false>]
[, finalize : <finalizefunction>]
[, scope : <object where fields go into javascript global scope >]
[, jsMode : true]
[, verbose : true]
}
);
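To make the shape of the command easier to see, here is a minimal sketch that assembles such a command document as a plain Python dict. The helper name build_mapreduce_command is hypothetical and not part of any driver, and it covers only some of the optional fields. The assumption is that with pymongo the resulting document would be passed to db.command(), where the command name has to be the first key; Python 3.7+ dicts keep insertion order, which the sketch relies on.

```python
def build_mapreduce_command(collection, map_js, reduce_js, out=None,
                            query=None, sort=None, limit=None,
                            finalize_js=None, scope=None):
    """Assemble a mapreduce command document (hypothetical helper)."""
    # The command name comes first, followed by the two required functions.
    cmd = {"mapreduce": collection, "map": map_js, "reduce": reduce_js}
    optional = [("query", query), ("sort", sort), ("limit", limit),
                ("out", out), ("finalize", finalize_js), ("scope", scope)]
    for name, value in optional:
        if value is not None:  # leave out options the caller did not set
            cmd[name] = value
    return cmd


cmd = build_mapreduce_command(
    "things",
    "function () { this.tags.forEach(function (z) { emit(z, 1); }); }",
    "function (key, values) {"
    "  var total = 0;"
    "  for (var i = 0; i < values.length; i++) { total += values[i]; }"
    "  return total;"
    "}",
    out={"replace": "myresults"},
    query={"x": {"$lt": 3}},
)
```

Only the options that were actually supplied end up in the document, which mirrors the square brackets in the syntax above.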
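Incremental map-reduce
If the data set you aggregate over keeps growing, incremental map/reduce is worth using: it saves you from re-aggregating the entire data set every time you want to see results. An incremental map/reduce takes the following steps:
1. First, run a job over the existing collection and output the result to a collection.
2. When you have more data, run a second job, using the query option to restrict the input to the new documents.
3. Use the reduce output option, which uses your reduce function to merge the new data into the existing output collection.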
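The three steps above can be sketched in pure Python, without a server. This is only a model of the behaviour, not MongoDB's implementation: map_tags and reduce_sum are stand-ins for the JavaScript functions, run_job is a hypothetical helper, and the output collection is modelled as a dict from key to value.

```python
from collections import defaultdict


def map_tags(doc):
    """Python stand-in for the JavaScript map: one (tag, 1) pair per tag."""
    return [(tag, 1) for tag in doc["tags"]]


def reduce_sum(key, values):
    """Python stand-in for the JavaScript reduce: add up the counts."""
    return sum(values)


def run_job(docs, output, predicate=lambda doc: True):
    """Map-reduce the docs that pass `predicate`, then fold the results into
    `output` the way the {reduce: ...} output option is described to."""
    grouped = defaultdict(list)
    for doc in docs:
        if predicate(doc):
            for key, value in map_tags(doc):
                grouped[key].append(value)
    for key, values in grouped.items():
        new_value = reduce_sum(key, values)
        if key in output:
            # Key already in the output: reduce the old and new values together.
            output[key] = reduce_sum(key, [output[key], new_value])
        else:
            output[key] = new_value
    return output


docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
]
totals = run_job(docs, {})                             # step 1: first job
docs.append({"x": 3, "tags": ["mouse", "cat", "dog"]})  # more data arrives
totals = run_job(docs, totals, lambda d: d["x"] > 2)   # steps 2 and 3
```

The second call only looks at the new document, yet totals ends up covering all three documents because the earlier counts are reduced together with the new ones.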
Output options
"collectionName" - By default the output will be of type "replace".
{ replace : "collectionName" } - the output will be inserted into a collection which will atomically replace any existing collection with the same name.
{ merge : "collectionName" } - This option will merge new data into the old output collection. In other words, if the same key exists in both the result set and the old collection, the new key will overwrite the old one.
{ reduce : "collectionName" } - If documents exist for a given key in the result set and in the old collection, then a reduce operation (using the specified reduce function) will be performed on the two values and the result will be written to the output collection. If a finalize function was provided, this will be run after the reduce as well.
{ inline : 1} - With this option, no collection will be created, and the whole map-reduce operation will happen in RAM. Also, the results of the map-reduce will be returned within the result object. Note that this option is possible only when the result set fits within the 16MB limit of a single document.
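The difference between the replace, merge, and reduce options is easiest to see side by side. The sketch below models collections as plain dicts from key to value; apply_output is a hypothetical helper written for this comparison, not a driver or server function, and it ignores finalize.

```python
def apply_output(mode, old, new, reduce_fn=None):
    """Return the output collection after a job. `old` is the existing
    output collection and `new` is the result set of the job just run."""
    if mode == "replace":
        return dict(new)           # old contents are discarded
    if mode == "merge":
        merged = dict(old)
        merged.update(new)         # the new value wins on a shared key
        return merged
    if mode == "reduce":
        combined = dict(old)
        for key, value in new.items():
            if key in combined:
                # Shared key: reduce the old and new values together.
                combined[key] = reduce_fn(key, [combined[key], value])
            else:
                combined[key] = value
        return combined
    raise ValueError("unknown output mode: %r" % mode)


def reduce_sum(key, values):
    return sum(values)


old = {"cat": 2, "dog": 1}
new = {"cat": 1, "mouse": 1}
replaced = apply_output("replace", old, new)
merged = apply_output("merge", old, new)
reduced = apply_output("reduce", old, new, reduce_sum)
```

With these inputs, "cat" ends up as 1 under replace and merge but as 3 under reduce, and "dog" survives only under merge and reduce.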
Result object
{
[results : <document_array>,]
[result : <collection_name> | {db: <db>, collection: <collection_name>},]
timeMillis : <job_time>,
counts : {
input : <number of objects scanned>,
emit : <number of times emit was called>,
output : <number of items in output collection>
} ,
ok : <1_if_ok>
[, err : <errmsg_if_error>]
}
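Map function
Inside the map function, the variable this refers to the current document. The map function calls emit(key, value) any number of times to feed data to the reduce function. In most cases it emits once per document, but in some cases a single document may lead to several emit calls.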
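As a rough model of this behaviour, the sketch below runs a map function over a list of documents in plain Python. The names run_map and map_tags are made up for the illustration, and the document is passed as an argument here, whereas the real JavaScript map function reads it from this.

```python
def run_map(map_fn, docs):
    """Call map_fn once per document and collect everything it emits."""
    emitted = []

    def emit(key, value):
        emitted.append((key, value))

    for doc in docs:
        map_fn(doc, emit)
    return emitted


def map_tags(doc, emit):
    # One emit per tag, so a document may emit zero, one, or many pairs.
    for tag in doc["tags"]:
        emit(tag, 1)


docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 4, "tags": []},
]
pairs = run_map(map_tags, docs)
```

The first document emits two pairs and the second, whose tag list is empty, emits none.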
Reduce function
When a map/reduce operation runs, the reduce function receives the values that map emitted for a given key and combines them into a single value.
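A minimal Python model of this step is shown below. group_and_reduce is a hypothetical helper, and reduce_sum mirrors a summing reduce function. The last lines illustrate a property worth keeping in mind: MongoDB may call reduce more than once for the same key and feed earlier results back in, so reducing partial results should give the same answer as reducing all values at once.

```python
from collections import defaultdict


def reduce_sum(key, values):
    """Python stand-in for a JavaScript reduce that adds up the counts."""
    total = 0
    for value in values:
        total += value
    return total


def group_and_reduce(pairs, reduce_fn):
    """Group emitted (key, value) pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


pairs = [("dog", 1), ("cat", 1), ("cat", 1),
         ("mouse", 1), ("cat", 1), ("dog", 1)]
counts = group_and_reduce(pairs, reduce_sum)

# Reduce two of the "cat" values first, then reduce that partial result
# together with the remaining value.
partial = reduce_sum("cat", [reduce_sum("cat", [1, 1]), 1])
```

Because the output of reduce_sum has the same shape as its inputs (a number), the partial result can be passed back in without special handling.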
Here is a map-reduce example that uses the Python MongoDB client (pymongo):
#!/usr/bin/env python
# coding=utf-8
from bson.code import Code
# MongoClient replaces the Connection class, which was removed in pymongo 3.0.
# insert(), remove() and map_reduce() below exist in pymongo 2.x/3.x only.
from pymongo import MongoClient

connection = MongoClient('localhost', 27017)
db = connection.map_reduce_example

# Start from an empty collection, then insert four sample documents.
db.things.remove({})
db.things.insert({"x": 1, "tags": ["dog", "cat"]})
db.things.insert({"x": 2, "tags": ["cat"]})
db.things.insert({"x": 3, "tags": ["mouse", "cat", "dog"]})
db.things.insert({"x": 4, "tags": []})

# map: emit (tag, 1) once for every tag in the document.
mapfun = Code("function () {this.tags.forEach(function(z) {emit(z, 1);});}")
# reduce: add up the counts emitted for one tag.
reducefun = Code("function (key, values) {"
                 "  var total = 0;"
                 "  for (var i = 0; i < values.length; i++) {"
                 "    total += values[i];"
                 "  }"
                 "  return total;"
                 "}")

# Count tags over the whole collection; the output goes to "myresults".
result = db.things.map_reduce(mapfun, reducefun, "myresults")
for doc in result.find():
    print(doc)
print("#################################################################")

# Same job, restricted to documents with x < 3.
result = db.things.map_reduce(mapfun, reducefun, "myresults", query={"x": {"$lt": 3}})
for doc in result.find():
    print(doc)
print("#################################################################")
The output is:
{u'_id': u'cat', u'value': 3.0}
{u'_id': u'dog', u'value': 2.0}
{u'_id': u'mouse', u'value': 1.0}
#################################################################
{u'_id': u'cat', u'value': 2.0}
{u'_id': u'dog', u'value': 1.0}
#################################################################