二.聚合操作之group,distinct,count
count()方法可以查询统计符合条件的集合的总数
db.COLLECTION_NAME.count(<query>) // 此方法等价于 db.COLLECTION_NAME.find(<query>).count()
在分布式集合中,会出现计算错误的情况,这个时候推荐使用aggregate;
distinct命令可以找出给定键的所有去重之后的值。使用时也必须指定集合和键
db.runCommand({ distinct: "<collection>", key: "<field>", query: <query> }) //此方法等价于 db.collection.distinct(field, query)
参数说明:
collection :要查询的集合的名称
key:需要去重的字段的名称
query :可选参数, 指明查询条件,相当于SQL中的where语句
Group操作:mongodb2.2版本对于返回数据最多只包涵20000个元素,最多支持20000独立分组;对于超过20000的独立分组建议采用mapreduce;
db.runCommand({group:{ns: <namespace>,
key: <key>,
$reduce: <reduce function>,
$keyf: <key function>,
cond: <query>,
finalize: <finalize function>
}
}) //该方法等价于 db.collection.group({ key, reduce, initial [, keyf] [, cond] [, finalize] })
参数说明
ns:集合名称
key:用来分组文档的字段。和keyf两者必须有一个
keyf:可以接受一个javascript函数。用来动态的确定分组文档的字段。和key两者必须有一个
initial:reduce中使用变量的初始化
reduce:执行的reduce函数。函数需要返回值。
cond:执行过滤的条件。
finallize:在reduce执行完成,结果集返回之前对结果集最终执行的函数。可选的。
三.Mongodb数据统计分析:group by、count(distinct userId)
select
date_format(gmtCreate,'%Y%m%d') day,
count(1) shareCount,
count(distinct userId) shareUserCount
from t_share_log
where gmtCreate between str_to_date(20161001, '%Y%m%d') and str_to_date(20161231, '%Y%m%d');
mongodb命令行:
db.runCommand({group:
{
ns:"t_share_log",
cond : { "$and":[{"gmtCreate":{"$gt":new ISODate("2016-10-01T00:00:00.000Z")}}, {"gmtCreate":{"$lt":new ISODate("2016-12-31T23:59:59.999Z")}}]},
$keyf:function(doc){
var myDate = new Date(doc.gmtCreate);
var mm = '0'+(myDate.getMonth()+1);
var dd = '0'+myDate.getDate();
return {day:myDate.getFullYear()+''+mm.substring(mm.length-2)+''+dd.substring(dd.length-2)};
},
initial:{"shareCount" : 0 , "shareUserCount" : 0 , "userIdMap" : {}},
$reduce:function(doc, prev){
if(doc.userId != null){
prev.shareCount ++;
if(prev.userIdMap[doc.userId] == null) {
prev.shareUserCount ++;
prev.userIdMap[doc.userId] = 1;
}
}
},
finalize: function(doc){ delete doc.userIdMap; }
}
});
通过定义keyf的function格式化group by的日期值,再通过定义reduce:function函数借助userIdMap去重userId(相当于dintinct userId)
或者
db.col1.aggregate(
[{$group: {
_id: "$field1",
count: { $sum : 1 }
}}, {
$group: {
_id: "$_id",
count: { $sum : "$count" }
}},{
$out: "distinctCount"
}],
{allowDiskUse:true}
)
解释:The second $group
is not actually doing anything here since it’s just using the same _id value, as well as passing through the same output value to $sum
. Also $out
was introduced in the same version where the default .aggregate() output is actually a “cursor”, so previous problems with a result in a single BSON document are no longer a problem here. Other API’s than the shell might “default” to a BSON document, but they can easily be altered to return a “cursor” instead. You only need to use $out where you actually “intend” to create a new collection for the results.