Part 3: MongoDB Aggregation Operations (Notes)

Latest recommended article published on 2024-02-01 23:38:01

weljy_sun

Category: MongoDB. Tags: mongodb, database, nosql

Article link: https://blog.csdn.net/weixin_39085641/article/details/129245835

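MongoDB aggregation operations process data and return computed results. They come in three forms: single-purpose aggregation, the aggregation pipeline, and MapReduce. The aggregation pipeline is a data processing pipeline that transforms data through multiple stages such as $match (filter), $group (group), $unwind (expand), and $project (projection). The article also works through several examples that show how to use aggregation for statistics and analysis.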


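1. MongoDB Aggregation

In MongoDB, aggregation (aggregate) processes data (for example, computing averages or sums) and returns the computed result.

It plays the role that aggregate functions such as count(*) play in SQL.

Aggregation operations process data records and return computed results. They group values from multiple documents and can perform a variety of operations on the grouped data to return a single result.
MongoDB offers three kinds of aggregation: single-purpose aggregation, the aggregation pipeline, and MapReduce.
- Single-purpose aggregation: simple access to common aggregation tasks; each method aggregates documents from a single collection.
- Aggregation pipeline: a data aggregation framework modeled on the concept of a data processing pipeline. Documents enter a multi-stage pipeline that transforms them into aggregated results.
- MapReduce: has two phases, a map phase that processes each input document and emits one or more objects for it, and a reduce phase that combines the output of the map phase.

Syntax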

db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

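2. Single-Purpose Aggregation

MongoDB provides single-purpose aggregation methods such as db.collection.estimatedDocumentCount(), db.collection.count(), and db.collection.distinct().
All of these operations aggregate documents from a single collection.
They give simple access to common aggregation tasks, but they lack the flexibility and power of the aggregation pipeline and MapReduce.

Method	Purpose
db.collection.estimatedDocumentCount()	Returns an estimated count of all documents in a collection or view, based on collection metadata
db.collection.count()	Returns the number of documents that match a find() query on a collection or view; equivalent to db.collection.find(query).count(). Deprecated in favor of countDocuments() and estimatedDocumentCount()
db.collection.distinct()	Finds the distinct values of a given field in a single collection or view and returns them in an array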

// Count all documents in the collection
testdb> db.testCollection.estimatedDocumentCount();
49
// Count the documents that match a query
testdb> db.testCollection.count({favCount:{$gt:50}});
DeprecationWarning: Collection.count() is deprecated. Use countDocuments or estimatedDocumentCount.
32
testdb> db.testCollection.countDocuments({favCount:{$gt:50}});
32
// Return the array of distinct type values
testdb> db.testCollection.distinct("type")
[ 'literature', 'none', 'novel', 'sociality', 'technology', 'travel' ]
// Return the array of distinct type values among documents with favCount less than or equal to 80
testdb> db.testCollection.distinct("type",{favCount:{$lte:80}})
[ 'literature', 'none', 'novel', 'sociality', 'technology', 'travel' ]

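3. Aggregation Pipeline

3.1 What Is the MongoDB Aggregation Framework

The MongoDB Aggregation Framework is a computation framework that:

- operates on one or more collections
- applies a series of operations to the data in those collections, transforming it into the desired shape

In effect, the aggregation framework covers what GROUP BY, LEFT OUTER JOIN, AS, and similar constructs do in a SQL query.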

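3.2 Pipelines and Stages

In Unix and Linux, a pipe feeds the output of one command to the next command as its input.

A MongoDB aggregation pipeline works the same way: the documents produced by one stage are passed on to the next stage. The same kind of stage can appear more than once in a pipeline.

Expressions process input documents and produce output. An expression is stateless: it is evaluated only against the document currently flowing through the pipeline and cannot refer to other documents.

The whole aggregation computation is called a pipeline, and it consists of multiple stages:

- the pipeline accepts a series of documents (the raw data)
- each stage performs a series of operations on those documents
- the resulting documents are output to the next stage

Aggregation pipeline syntax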

pipeline = [$stage1, $stage2, ...$stageN];
db.collection.aggregate(pipeline, {options})

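pipeline: an array of aggregation stages. Every stage except $out, $merge, and $geoNear can appear more than once in a pipeline.
options: optional, additional settings for the aggregation, including explain (query plan), allowDiskUse (temporary files), cursor, maxTimeMS (maximum execution time), read and write concern, hint (forcing an index), and so on.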
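
To make the stage-by-stage flow concrete, here is a minimal plain-JavaScript sketch. It is only a mental model, not how MongoDB executes a pipeline: `match`, `project`, and `runPipeline` are hypothetical helpers invented for this illustration, and each stage is modeled as a function from an array of documents to an array of documents.

```javascript
// Illustrative model only: each stage is a function docs -> docs.
// match, project and runPipeline are hypothetical helpers, not MongoDB APIs.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (mapper) => (docs) => docs.map(mapper);

// Feed the output of each stage into the next one, like a Unix pipe.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Made-up sample documents.
const sampleBooks = [
  { title: "book-0", type: "technology", favCount: 10 },
  { title: "book-1", type: "travel", favCount: 70 },
  { title: "book-2", type: "technology", favCount: 90 },
];

const pipelineResult = runPipeline(sampleBooks, [
  match((doc) => doc.type === "technology"),   // plays the role of $match
  project((doc) => ({ name: doc.title })),      // plays the role of $project
]);
console.log(pipelineResult);
```

The point of the sketch is only the data flow: the second stage never sees the travel document, because the first stage already dropped it.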

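3.3 Common Aggregation Pipeline Stages

Stage	Description	SQL equivalent
$match	Filter documents	WHERE
$project	Projection	AS
$lookup	Left outer join	LEFT OUTER JOIN
$sort	Sort	ORDER BY
$group	Group	GROUP BY
$skip/$limit	Pagination
$unwind	Expand an array
$graphLookup	Graph search
$facet/$bucket	Faceted search

The following operators are accumulators used inside $group rather than stages of their own:

Accumulator	Description	Example
$sum	Computes a sum	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg	Computes an average	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min/$max	Returns the minimum/maximum of the corresponding values across the documents in each group	db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$push	Appends the value to an array without checking for duplicates	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet	Appends the value to an array only if the array does not already contain it	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first/$last	Returns the value from the first/last document in each group, following the order of the incoming documents	db.mycol.aggregate([{$group : {_id : "$by_user", url : {$first : "$url"}}}])

For the complete list of stages, see the aggregation pipeline reference in the MongoDB manual.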

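3.3.1 Aggregation Expressions

Field path expressions

$<field>: use $ to refer to a field path
$<field>.<subfield>: use $ and . to refer to a field inside an embedded document

Constant expressions

{ $literal: <value> }: denotes a constant value

System variable expressions

$$<variable>: use $$ to refer to a system variable
$$CURRENT: refers to the document currently being processed in the pipeline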
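
As a rough illustration of how a field path such as "$author.name" is read, the following plain-JavaScript sketch walks a document one key at a time. `resolveFieldPath` is a hypothetical helper written for this note; it does not model system variables ($$...) or arrays, which the real expression language also handles.

```javascript
// Illustrative only: resolveFieldPath is a hypothetical helper that mimics
// how a "$a.b" field path is looked up in a document.
function resolveFieldPath(doc, path) {
  if (!path.startsWith("$")) return path; // no leading $: treat it as a constant
  return path
    .slice(1)        // drop the leading $
    .split(".")      // "author.name" -> ["author", "name"]
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

const pathDoc = { title: "book-1", author: { name: "xx001", age: 35 } };
const topLevel = resolveFieldPath(pathDoc, "$title");
const nested = resolveFieldPath(pathDoc, "$author.name");
const missing = resolveFieldPath(pathDoc, "$author.email");
console.log(topLevel, nested, missing);
```

A path that does not exist simply resolves to nothing, which matches the empty documents that show up in the $project output when a source document lacks the projected field.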

Preparing the data

var tags = ["nosql","mongodb","document","developer","popular"];
var types = ["technology","sociality","travel","novel","literature"];
var books=[];
for(var i=0;i<500;i++){
    var typeIdx = Math.floor(Math.random()*types.length);
    var tagIdx = Math.floor(Math.random()*tags.length);
    var tagIdx2 = Math.floor(Math.random()*tags.length);
    var favCount = Math.floor(Math.random()*100);
    var username = "xx00"+Math.floor(Math.random()*10);
    var age = 30 + Math.floor(Math.random()*15);
    var book = {
        title: "book-"+i,
        type: types[typeIdx],
        tag: [tags[tagIdx],tags[tagIdx2]],
        favCount: favCount,
        author: {name:username,age:age}
    };
    books.push(book)
}
db.books.insertMany(books);

// Load the script
testdb> load("C:\\xx\\xxxx\\xxx\\test.js")

3.3.2 $project

A projection stage that maps original fields to the names you specify, for example projecting the title field of the collection to name.

testdb> db.testCollection.aggregate([{$project:{name:"$title"}}])
[
  { _id: ObjectId("63f884d67b9fcb1f8bf84df1") },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df4"), name: 'book-2' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df5"), name: 'book-3' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df6"), name: 'book-4' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df7"), name: 'book-5' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df8"), name: 'book-6' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84df9"), name: 'book-7' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dfa"), name: 'book-8' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dfb"), name: 'book-9' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dfc"), name: 'book-10' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dfd"), name: 'book-11' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dfe"), name: 'book-12' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84dff"), name: 'book-13' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e00"), name: 'book-14' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e01"), name: 'book-15' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e02"), name: 'book-16' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e03"), name: 'book-17' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e04"), name: 'book-18' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e05"), name: 'book-19' },
  { _id: ObjectId("63f88c0b7b9fcb1f8bf84e06"), name: 'book-20' }
]
Type "it" for more

$project gives flexible control over the shape of the output documents and can also drop fields you do not need.

testdb> db.testCollection.aggregate([{$project:{name:"$title",_id:0,type:1,author:1}}])
[
  {},
  { type: 'novel', author: 'xxw2', name: 'book-2' },
  { type: 'travel', author: 'xxw3', name: 'book-3' },
  { type: 'novel', author: 'xxw4', name: 'book-4' },
  { type: 'sociality', author: 'xxw5', name: 'book-5' },
  { type: 'novel', author: 'xxw6', name: 'book-6' },
  { type: 'novel', author: 'xxw7', name: 'book-7' },
  { type: 'sociality', author: 'xxw8', name: 'book-8' },
  { type: 'sociality', author: 'xxw9', name: 'book-9' },
  { type: 'travel', author: 'xxw10', name: 'book-10' },
  { type: 'travel', author: 'xxw11', name: 'book-11' },
  { type: 'literature', author: 'xxw12', name: 'book-12' },
  { type: 'travel', author: 'xxw13', name: 'book-13' },
  { type: 'sociality', author: 'xxw14', name: 'book-14' },
  { type: 'travel', author: 'xxw15', name: 'book-15' },
  { type: 'novel', author: 'xxw16', name: 'book-16' },
  { type: 'travel', author: 'xxw17', name: 'book-17' },
  { type: 'technology', author: 'xxw18', name: 'book-18' },
  { type: 'novel', author: 'xxw19', name: 'book-19' },
  { type: 'novel', author: 'xxw20', name: 'book-20' }
]
Type "it" for more

Keeping only selected fields of an embedded document (the other embedded fields are excluded)

testdb> db.books.aggregate([
...     {$project:{name:"$title",_id:0,type:1,"author.name":1}}
... ])
[
  { type: 'literature', author: { name: 'xx000' }, name: 'book-0' },
  { type: 'literature', author: { name: 'xx009' }, name: 'book-1' },
  { type: 'technology', author: { name: 'xx007' }, name: 'book-2' },
  { type: 'technology', author: { name: 'xx007' }, name: 'book-3' },
  { type: 'sociality', author: { name: 'xx007' }, name: 'book-4' },
  { type: 'travel', author: { name: 'xx002' }, name: 'book-5' },
  { type: 'travel', author: { name: 'xx007' }, name: 'book-6' },
  { type: 'literature', author: { name: 'xx004' }, name: 'book-7' },
  { type: 'travel', author: { name: 'xx002' }, name: 'book-8' },
  { type: 'travel', author: { name: 'xx002' }, name: 'book-9' },
  { type: 'sociality', author: { name: 'xx009' }, name: 'book-10' },
  { type: 'novel', author: { name: 'xx003' }, name: 'book-11' },
  { type: 'travel', author: { name: 'xx008' }, name: 'book-12' },
  { type: 'novel', author: { name: 'xx006' }, name: 'book-13' },
  { type: 'novel', author: { name: 'xx007' }, name: 'book-14' },
  { type: 'travel', author: { name: 'xx001' }, name: 'book-15' },
  { type: 'technology', author: { name: 'xx001' }, name: 'book-16' },
  { type: 'literature', author: { name: 'xx004' }, name: 'book-17' },
  { type: 'novel', author: { name: 'xx008' }, name: 'book-18' },
  { type: 'travel', author: { name: 'xx003' }, name: 'book-19' }
]


testdb> db.books.aggregate([
...     {$project:{name:"$title",_id:0,type:1,author:{name:1}}}
... ])
[
  { type: 'literature', author: { name: 'xx000' }, name: 'book-0' },
  { type: 'literature', author: { name: 'xx009' }, name: 'book-1' },

3.3.2 $match

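$match filters documents so that later stages aggregate over the resulting subset.
$match accepts the regular query operators, with a few exceptions such as $where and the geospatial $near and $nearSphere.

In practice, place $match as early in the pipeline as possible.
This has two benefits:

- documents that are not needed are filtered out quickly, which reduces the work done by the rest of the pipeline;
- when $match runs before projection and grouping, the query can use indexes.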

db.books.aggregate([{$match:{type:"technology"}}])

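When a filtering stage is combined with other stages, put it as close to the start as possible; this reduces the number of documents the later stages have to handle and improves efficiency.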

db.books.aggregate([
    {$match:{type:"technology"}},
    {$project:{name:"$title",_id:0,type:1,author:{name:1}}}
])

3.3.2 $count

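Counts the documents that reach this stage and returns the count

testdb> db.books.aggregate([
{$match:{type:"travel"}}, // the $match stage keeps documents whose type is travel and passes them on
{$count:"travel_count"}]) // the $count stage counts the remaining documents and assigns the value to travel_count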
[ { travel_count: 99 } ]

3.3.3 $group

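Groups documents by the specified expression and outputs one document per distinct group to the next stage.
Each output document has an _id field that holds the group key.
Output documents can also contain computed fields that hold the values of accumulator expressions evaluated over each _id group.
$group does not output the original documents, only the computed statistics.

{ $group: { _id: <expression>, <field1>: { <accumulator1> : <expression1> }, ... } }

- The _id field is required; however, you can set _id to null to compute accumulated values over all input documents as a whole.
- The remaining computed fields are optional and are calculated with accumulator operators.
- _id and the accumulator arguments accept any valid expression.

The $group stage has a memory limit of 100 MB. By default, if the stage exceeds this limit, $group raises an error. To process large data sets, set the allowDiskUse option to true so that $group can write to temporary files. (Starting with MongoDB 6.0, stages that need more than 100 MB write temporary files to disk by default.)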

Number of books, total favorites, and average favorites

testdb> db.books.aggregate([
...     {$group:{_id:null,count:{$sum:1},pop:{$sum:"$favCount"},avg:{$avg:"$favCount"}}}
... ])
[ { _id: null, count: 500, pop: 25999, avg: 51.998 } ]
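
What $group computes can be pictured with a small in-memory sketch in plain JavaScript. `groupStats` is a hypothetical helper and the three documents are made up; the sketch only mirrors the semantics of `$sum: 1`, `$sum: "$favCount"`, and `$avg: "$favCount"`, not MongoDB's actual execution.

```javascript
// Illustrative in-memory model of $group with $sum and $avg.
// groupStats is a hypothetical helper, not a MongoDB API.
function groupStats(docs, keyOf) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    const g = groups.get(key) ?? { _id: key, count: 0, pop: 0 };
    g.count += 1;           // { $sum: 1 }
    g.pop += doc.favCount;  // { $sum: "$favCount" }
    groups.set(key, g);
  }
  // { $avg: "$favCount" } is the sum divided by the number of documents.
  return [...groups.values()].map((g) => ({ ...g, avg: g.pop / g.count }));
}

// Made-up sample documents.
const groupDocs = [
  { author: { name: "xx001" }, favCount: 10 },
  { author: { name: "xx002" }, favCount: 30 },
  { author: { name: "xx001" }, favCount: 50 },
];

const overall = groupStats(groupDocs, () => null);              // _id: null
const perAuthor = groupStats(groupDocs, (d) => d.author.name);  // _id: "$author.name"
console.log(overall, perAuthor);
```

With a constant key (null) every document lands in one group, which is why `_id: null` yields a single summary document for the whole collection.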

Total favorites per author

testdb> db.books.aggregate([
{
$group:{_id:"$author.name",pop:{$sum:"$favCount"}}
}
])
[
  { _id: 'xx004', pop: 2320 },
  { _id: 'xx003', pop: 3199 },
  { _id: 'xx001', pop: 2572 },
  { _id: 'xx007', pop: 2825 },
  { _id: 'xx005', pop: 2578 },
  { _id: 'xx009', pop: 2755 },
  { _id: 'xx006', pop: 2698 },
  { _id: 'xx002', pop: 2093 },
  { _id: 'xx008', pop: 2269 },
  { _id: 'xx000', pop: 2690 }
]

Favorites for each book of each author

testdb> db.books.aggregate({$group:{_id:{name:"$author.name",title:"$title"},pop:{$sum:"$favCount"}}})
[
  { _id: { name: 'xx003', title: 'book-19' }, pop: 84 },
  { _id: { name: 'xx008', title: 'book-450' }, pop: 28 },
  { _id: { name: 'xx006', title: 'book-201' }, pop: 18 },
  { _id: { name: 'xx004', title: 'book-207' }, pop: 89 },
  { _id: { name: 'xx009', title: 'book-400' }, pop: 47 },
  { _id: { name: 'xx004', title: 'book-330' }, pop: 80 },
  { _id: { name: 'xx005', title: 'book-448' }, pop: 24 },
  { _id: { name: 'xx001', title: 'book-159' }, pop: 50 },
  { _id: { name: 'xx002', title: 'book-63' }, pop: 19 },
  { _id: { name: 'xx003', title: 'book-301' }, pop: 94 },
  { _id: { name: 'xx000', title: 'book-308' }, pop: 82 },
  { _id: { name: 'xx005', title: 'book-395' }, pop: 37 },
  { _id: { name: 'xx002', title: 'book-220' }, pop: 5 },
  { _id: { name: 'xx007', title: 'book-127' }, pop: 39 },
  { _id: { name: 'xx008', title: 'book-50' }, pop: 90 },
  { _id: { name: 'xx003', title: 'book-303' }, pop: 12 },
  { _id: { name: 'xx003', title: 'book-37' }, pop: 38 },
  { _id: { name: 'xx004', title: 'book-17' }, pop: 82 },
  { _id: { name: 'xx007', title: 'book-252' }, pop: 76 },
  { _id: { name: 'xx001', title: 'book-445' }, pop: 29 }
]

Set of book types per author

testdb> db.books.aggregate(
... [
... {$group:{_id:"$author.name",typeSet:{$addToSet:"$type"}}}
... ]
... )
[
  {
    _id: 'xx005',
    typeSet: [ 'technology', 'literature', 'novel', 'sociality', 'travel' ]
  }
]
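
The difference between $addToSet and $push can be sketched in plain JavaScript with a Set and an array. The data is made up, and one caveat applies: MongoDB does not guarantee the order of elements produced by $addToSet, whereas a JavaScript Set keeps insertion order.

```javascript
// Illustrative only: contrast $push (keeps duplicates) with $addToSet (drops them).
const typedBooks = [
  { author: "xx005", type: "novel" },
  { author: "xx005", type: "travel" },
  { author: "xx005", type: "novel" },
];

const pushed = [];        // models { $push: "$type" }
const asSet = new Set();  // models { $addToSet: "$type" }
for (const book of typedBooks) {
  pushed.push(book.type);
  asSet.add(book.type);
}
const typeSet = [...asSet];
console.log(pushed, typeSet);
```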

3.3.4 $unwind

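Splits an array into separate documents, one per element.
Version 3.2 and later support the following syntax:

{
  $unwind:
    {
      // The field path: prefix the field name with $ and enclose it in quotes.
      path: <field path>,
      // Optional. Name of a new field that holds the array index of the element. The name cannot start with $.
      includeArrayIndex: <string>,
      // Optional, default false. If true, $unwind still outputs the document when path is null, missing, or an empty array.
      preserveNullAndEmptyArrays: <boolean>
    }
}

Split the tag array of the books by author xx006 into multiple documents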

db.books.aggregate([
    {$match:{"author.name":"xx006"}},
    {$unwind:"$tag"}
])

db.books.aggregate([
    {$match:{"author.name":"xx006"}}
])

Set of tags per author

db.books.aggregate([
    {$unwind:"$tag"},
    {$group:{_id:"$author.name",types:{$addToSet:"$tag"}}}
])

Example

db.books.insertMany([
{
	"title" : "book-51",
	"type" : "technology",
	"favCount" : 110,
     "tag":[],
	"author" : {
		"name" : "weljy",
		"age" : 30
	}
},{
	"title" : "book-52",
	"type" : "technology",
	"favCount" : 150,
	"author" : {
		"name" : "weljy",
		"age" : 30
	}
},{
	"title" : "book-53",
	"type" : "technology",
	"tag" : [
		"nosql",
		"document"
	],
	"favCount" : 20,
	"author" : {
		"name" : "weljy",
		"age" : 30
	}
}])

Test

// Use the includeArrayIndex option to output the array index of each element
testdb> db.books.aggregate([{$match:{"author.name":"weljy"}},{$unwind:{path:"$tag",includeArrayIndex:"arrayIndex"}}])
[
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf8501a"),
    title: 'book-53',
    type: 'technology',
    tag: 'nosql',
    favCount: 20,
    author: { name: 'weljy', age: 30 },
    arrayIndex: Long("0")
  },
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf8501a"),
    title: 'book-53',
    type: 'technology',
    tag: 'document',
    favCount: 20,
    author: { name: 'weljy', age: 30 },
    arrayIndex: Long("1")
  }
]

// Use the preserveNullAndEmptyArrays option to keep documents whose tag field is missing, null, or an empty array
testdb> db.books.aggregate([
...     {$match:{"author.name":"weljy"}},
...     {$unwind:{path:"$tag", preserveNullAndEmptyArrays: true}}
... ])
[
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf85018"),
    title: 'book-51',
    type: 'technology',
    favCount: 110,
    author: { name: 'weljy', age: 30 }
  },
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf85019"),
    title: 'book-52',
    type: 'technology',
    favCount: 150,
    author: { name: 'weljy', age: 30 }
  },
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf8501a"),
    title: 'book-53',
    type: 'technology',
    tag: 'nosql',
    favCount: 20,
    author: { name: 'weljy', age: 30 }
  },
  {
    _id: ObjectId("63fd6e507b9fcb1f8bf8501a"),
    title: 'book-53',
    type: 'technology',
    tag: 'document',
    favCount: 20,
    author: { name: 'weljy', age: 30 }
  }
]
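
The behavior of both options can be reproduced with a small in-memory sketch in plain JavaScript. `unwind` is a hypothetical helper that follows the semantics described above for a top-level field; it is a simplification (for example, it stores the index as a plain number rather than a Long) and not MongoDB's implementation.

```javascript
// Illustrative in-memory model of $unwind; unwind is a hypothetical helper.
function unwind(docs, field, options = {}) {
  const { includeArrayIndex, preserveNullAndEmptyArrays = false } = options;
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (Array.isArray(value) && value.length > 0) {
      // One output document per array element.
      value.forEach((element, index) => {
        const copy = { ...doc, [field]: element };
        if (includeArrayIndex) copy[includeArrayIndex] = index;
        out.push(copy);
      });
    } else if (Array.isArray(value) || value == null) {
      if (!preserveNullAndEmptyArrays) continue; // dropped by default
      const copy = { ...doc };
      if (Array.isArray(value)) delete copy[field]; // an empty array field is omitted
      out.push(copy);
    } else {
      out.push({ ...doc }); // a non-array value is treated as a single element
    }
  }
  return out;
}

// Trimmed-down versions of the three example documents above.
const weljyBooks = [
  { title: "book-51", tag: [] },
  { title: "book-52" },
  { title: "book-53", tag: ["nosql", "document"] },
];

const unwoundDefault = unwind(weljyBooks, "tag", { includeArrayIndex: "arrayIndex" });
const unwoundPreserved = unwind(weljyBooks, "tag", { preserveNullAndEmptyArrays: true });
console.log(unwoundDefault.length, unwoundPreserved.length);
```

Without the preserve option only book-53 survives (as two documents); with it, book-51 and book-52 are kept as well, which matches the two shell outputs above.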

3.3.5 $limit

Limits the number of documents passed to the next stage in the pipeline

db.books.aggregate([
    {$limit : 5 }
])

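This operation returns only the first 5 documents passed to it by the pipeline. $limit has no effect on the content of the documents it passes along.
Note: when $sort immediately precedes $limit in a pipeline, the $sort operation only keeps the top n results as it progresses, where n is the specified limit, so MongoDB only needs to hold n items in memory.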

3.3.6 $skip

Skips the specified number of documents that enter the stage and passes the remaining documents to the next stage in the pipeline

db.books.aggregate([
    {$skip : 5 }
])

This operation skips the first 5 documents passed to it by the pipeline. $skip has no effect on the content of the documents it passes along.

3.3.7 $sort

Sorts all input documents and returns them to the pipeline in sorted order.
Syntax:

{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }

To sort on a field, set its sort order to 1 or -1 for ascending or descending order respectively, as in the following example:

db.books.aggregate([
    {$sort : {favCount:-1,title:1}}
])
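
$sort, $skip, and $limit are typically combined for pagination. The following plain-JavaScript sketch models that combination on made-up data; `pageOf` is a hypothetical helper, and the comparator mirrors the sort specification { favCount: -1, title: 1 } used above.

```javascript
// Illustrative in-memory model of [$sort, $skip, $limit]; pageOf is hypothetical.
function pageOf(docs, pageNumber, pageSize) {
  const sorted = [...docs].sort(
    // { favCount: -1, title: 1 }: favCount descending, then title ascending
    (a, b) => b.favCount - a.favCount || a.title.localeCompare(b.title)
  );
  const skipped = sorted.slice((pageNumber - 1) * pageSize); // $skip
  return skipped.slice(0, pageSize);                         // $limit
}

const pagedBooks = [
  { title: "book-3", favCount: 40 },
  { title: "book-1", favCount: 90 },
  { title: "book-2", favCount: 90 },
  { title: "book-4", favCount: 10 },
];
const secondPage = pageOf(pagedBooks, 2, 2);
console.log(secondPage.map((b) => b.title));
```

The order of the stages matters: sorting must come first, and skipping must come before limiting, otherwise a page would be cut from the wrong slice of the data.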

3.3.8 $lookup

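Added in MongoDB 3.2, $lookup is mainly used for joins across collections, comparable to a multi-table join in a relational database. Each input document that passes through the $lookup stage comes out with a new array field (you choose its name). The array holds the matching documents from the joined collection; if there are none, the array is empty ([]).

Syntax:

db.collection.aggregate([{
      $lookup: {
             from: "<collection to join>",
             localField: "<field from the input documents>",
             foreignField: "<field from the documents of the from collection>",
             as: "<output array field>"
           }
  }])

from: the collection to join, in the same database.
localField: the field of the input documents to match on. If an input document has no localField, $lookup treats it as if it contained localField: null.
foreignField: the field of the joined collection to match on. If a document in the joined collection has no foreignField, $lookup treats it as if it contained foreignField: null.
as: the name of the new array field in the output documents. If the input document already has a field with that name, it is overwritten.

Note: null = null evaluates to true here, so a document missing localField matches joined documents missing foreignField.
It works much like the following pseudo-SQL: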

SELECT *, <output array field>
FROM collection
WHERE <output array field> IN (SELECT *
                               FROM <collection to join>
                               WHERE <foreignField>= <collection.localField>);

Example data

db.customer.insertOne({customerCode:1,name:"customer1",phone:"13112345678",address:"test1"})
db.customer.insertOne({customerCode:2,name:"customer2",phone:"13112345679",address:"test2"})

db.order.insertOne({orderId:1,orderCode:"order001",customerCode:1,price:200})
db.order.insertOne({orderId:2,orderCode:"order002",customerCode:2,price:400})

db.orderItem.insertOne({itemId:1,productName:"apples",quantity:2,orderId:1})
db.orderItem.insertOne({itemId:2,productName:"oranges",quantity:2,orderId:1})
db.orderItem.insertOne({itemId:3,productName:"mangoes",quantity:2,orderId:1})
db.orderItem.insertOne({itemId:4,productName:"apples",quantity:2,orderId:2})
db.orderItem.insertOne({itemId:5,productName:"oranges",quantity:2,orderId:2})
db.orderItem.insertOne({itemId:6,productName:"mangoes",quantity:2,orderId:2})

Join queries

testdb> db.customer.aggregate([{$lookup:{from:"order",localField:"customerCode",foreignField:"customerCode",as:"customerOrder"}}])
[
  {
    _id: ObjectId("63fd78e27b9fcb1f8bf8501b"),
    customerCode: 1,
    name: 'customer1',
    phone: '13112345678',
    address: 'test1',
    customerOrder: [
      {
        _id: ObjectId("63fd78e27b9fcb1f8bf8501d"),
        orderId: 1,
        orderCode: 'order001',
        customerCode: 1,
        price: 200
      }
    ]
  },
  {
    _id: ObjectId("63fd78e27b9fcb1f8bf8501c"),
    customerCode: 2,
    name: 'customer2',
    phone: '13112345679',
    address: 'test2',
    customerOrder: [
      {
        _id: ObjectId("63fd78e27b9fcb1f8bf8501e"),
        orderId: 2,
        orderCode: 'order002',
        customerCode: 2,
        price: 400
      }
    ]
  }
]

testdb> db.customer.aggregate([{$lookup:{from:"order",localField:"customerCode",foreignField:"customerCode",as:"customerOrder"}},{$lookup:{from:"orderItem",localField:"orderId",foreignField:"orderId",as:"orderItem"}}])
[
  {
    _id: ObjectId("63fd78e27b9fcb1f8bf8501b"),
    customerCode: 1,
    name: 'customer1',
    phone: '13112345678',
    address: 'test1',
    customerOrder: [
      {
        _id: ObjectId("63fd78e27b9fcb1f8bf8501d"),
        orderId: 1,
        orderCode: 'order001',
        customerCode: 1,
        price: 200
      }
    ],
    orderItem: []
  },
  {
    _id: ObjectId("63fd78e27b9fcb1f8bf8501c"),
    customerCode: 2,
    name: 'customer2',
    phone: '13112345679',
    address: 'test2',
    customerOrder: [
      {
        _id: ObjectId("63fd78e27b9fcb1f8bf8501e"),
        orderId: 2,
        orderCode: 'order002',
        customerCode: 2,
        price: 400
      }
    ],
    orderItem: []
  }
]
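
In the second query above, orderItem comes back empty for both customers. The likely reason is that the customer documents have no top-level orderId field (it only exists inside the customerOrder array), so localField is treated as null and no orderItem document has a null orderId. The plain-JavaScript sketch below models a simple equality $lookup to show the effect; `lookup` is a hypothetical helper and the data is a trimmed copy of the example. In a real pipeline, one option that should work is localField: "customerOrder.orderId", or starting the join from the order collection.

```javascript
// Illustrative in-memory model of an equality $lookup; lookup is hypothetical.
// A missing field is treated as null on both sides.
function lookup(docs, { from, localField, foreignField, as }) {
  return docs.map((doc) => {
    const localValue = doc[localField] ?? null;
    const matches = from.filter(
      (other) => (other[foreignField] ?? null) === localValue
    );
    return { ...doc, [as]: matches };
  });
}

const customers = [{ customerCode: 1, name: "customer1" }];
const orders = [{ orderId: 1, orderCode: "order001", customerCode: 1 }];
const orderItems = [
  { itemId: 1, productName: "apples", orderId: 1 },
  { itemId: 2, productName: "oranges", orderId: 1 },
];

// Customers have no top-level orderId, so nothing matches: orderItem is [].
const fromCustomers = lookup(customers, {
  from: orderItems, localField: "orderId", foreignField: "orderId", as: "orderItem",
});
// Starting from orders, where orderId is a top-level field, the join finds the items.
const fromOrders = lookup(orders, {
  from: orderItems, localField: "orderId", foreignField: "orderId", as: "orderItem",
});
console.log(fromCustomers[0].orderItem.length, fromOrders[0].orderItem.length);
```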

3.4 Aggregation Example 1

Count the book documents in each type

testdb> db.books.aggregate([{$group:{_id:"$type",total:{$sum:1}}},{$sort:{total:-1}}])
[
  { _id: 'technology', total: 112 },
  { _id: 'literature', total: 105 },
  { _id: 'travel', total: 99 },
  { _id: 'novel', total: 99 },
  { _id: 'sociality', total: 88 }
]

Tag popularity ranking, where a tag's popularity is the total favorites (favCount) of the book documents that carry it

testdb> db.books.aggregate([
{$match:{favCount:{$gt:0}}},
{$unwind:"$tag"},
{$group:{_id:"$tag",total:{$sum:"$favCount"}}},
{$sort:{total:-1}}])
[
  { _id: 'nosql', total: 11861 },
  { _id: 'document', total: 10766 },
  { _id: 'popular', total: 10141 },
  { _id: 'developer', total: 10016 },
  { _id: 'mongodb', total: 9254 }
]

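$match stage: filters out documents whose favCount is 0.
$unwind stage: expands the tag array, so a document with 3 tags is split into 3 entries.
$group stage: groups the expanded documents and accumulates; $sum: "$favCount" adds up the favCount field.
$sort stage: takes the grouped output and sorts it by total in descending order.

Count book documents by favorites in the ranges [0,10), [10,60), [60,80), [80,100), [100,+∞)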

testdb> db.books.aggregate({$bucket:{groupBy:"$favCount",boundaries:[0,10,60,80,100],default:"other",output:{"count":{$sum:1}}}})
[
  { _id: 0, count: 48 },
  { _id: 10, count: 234 },
  { _id: 60, count: 103 },
  { _id: 80, count: 116 },
  { _id: 'other', count: 2 }
]
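
How $bucket assigns values to ranges can be sketched in plain JavaScript. `bucketCounts` is a hypothetical helper and the favCount values are made up; the sketch relies on the rule that each bucket includes its lower boundary and excludes its upper boundary, with everything outside the boundaries going to the default bucket.

```javascript
// Illustrative in-memory model of $bucket with a count output.
// bucketCounts is a hypothetical helper, not a MongoDB API.
function bucketCounts(values, boundaries, defaultId) {
  const counts = new Map();
  for (const value of values) {
    // Bucket i covers [boundaries[i], boundaries[i + 1]).
    const i = boundaries.findIndex(
      (lower, idx) =>
        idx < boundaries.length - 1 && value >= lower && value < boundaries[idx + 1]
    );
    const id = i === -1 ? defaultId : boundaries[i];
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
}

const favCounts = [0, 9, 10, 59, 60, 99, 100, 150];
const buckets = bucketCounts(favCounts, [0, 10, 60, 80, 100], "other");
console.log(buckets);
```

Here 100 and 150 end up in "other" because the last bucket is [80,100); this is how the [100,+∞) range in the example above is covered by the default option.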

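3.5 Aggregation Example 2

ZIP code data set
MongoDB Database Tools

Import the data with the mongoimport tool

mongoimport -h 192.168.65.174 -d test -u weljy -p weljy --authenticationDatabase=admin -c zips --file C:\ProgramData\zips.json

-h, --host: address of the database to connect to; defaults to the local MongoDB instance;
--port: port of the database to connect to; defaults to 27017;
-u, --username: account used to connect; required when authentication is enabled on the database;
-p, --password: password for that account;
-d, --db: database to import into;
-c, --collection: collection to import into;
-f, --fields: field names for the imported data;
--type: type of the input file: csv, json, or tsv; defaults to json;
--file: name of the file to import;
--headerline: for csv files, treats the first line as column names rather than data.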

3.5.1 States with a Population of at Least 5 Million

test> db.zips.aggregate([
{$group:{_id:"$state",totalPop:{$sum:"$pop"}}},
{$match:{totalPop:{$gte:10000*500}}}
])
[
  { _id: 'MI', totalPop: 9295297 },
  { _id: 'PA', totalPop: 11881643 },
  { _id: 'NC', totalPop: 6628637 },
  { _id: 'OH', totalPop: 10846517 },
  { _id: 'VA', totalPop: 6181479 },
  { _id: 'IL', totalPop: 11427576 },
  { _id: 'FL', totalPop: 12686644 },
  { _id: 'IN', totalPop: 5544136 },
  { _id: 'GA', totalPop: 6478216 },
  { _id: 'MO', totalPop: 5110648 },
  { _id: 'NJ', totalPop: 7730188 },
  { _id: 'NY', totalPop: 17990402 },
  { _id: 'MA', totalPop: 6016425 },
  { _id: 'TX', totalPop: 16984601 },
  { _id: 'CA', totalPop: 29754890 }
]

The equivalent SQL for this aggregation is:

SELECT state, SUM(pop) AS totalPop
FROM zips
GROUP BY state
HAVING totalPop >= (10000*500)

3.5.2 Average City Population per State

db.zips.aggregate( [
   { $group: { _id: { state: "$state", city: "$city" }, cityPop: { $sum: "$pop" } } }
] )
[
  { _id: { state: 'MI', city: 'BELLEVILLE' }, cityPop: 35436 },
  { _id: { state: 'MI', city: 'GRAND JUNCTION' }, cityPop: 2100 },
  { _id: { state: 'CA', city: 'SEIAD VALLEY' }, cityPop: 311 },
  { _id: { state: 'OR', city: 'NORTH POWDER' }, cityPop: 571 },
  { _id: { state: 'WY', city: 'COLTER BAY' }, cityPop: 9078 },
  { _id: { state: 'GA', city: 'CRAWFORDVILLE' }, cityPop: 1915 },
  { _id: { state: 'CA', city: 'DUNLAP' }, cityPop: 94 },
  { _id: { state: 'IL', city: 'COLLISON' }, cityPop: 421 },
  { _id: { state: 'MI', city: 'PLEASANT RIDGE' }, cityPop: 2895 },
  { _id: { state: 'NJ', city: 'ROSELLE' }, cityPop: 20159 },
  { _id: { state: 'PA', city: 'ALUM BANK' }, cityPop: 2175 },
  { _id: { state: 'WI', city: 'IRON BELT' }, cityPop: 265 },
  { _id: { state: 'NY', city: 'HEWLETT' }, cityPop: 8023 },
  { _id: { state: 'OH', city: 'MONTPELIER' }, cityPop: 7569 },
  { _id: { state: 'WA', city: 'TENINO' }, cityPop: 6451 },
  { _id: { state: 'MI', city: 'BENTON HARBOR' }, cityPop: 37550 },
  { _id: { state: 'KY', city: 'BRANDENBURG' }, cityPop: 6480 },
  { _id: { state: 'NJ', city: 'CRANFORD' }, cityPop: 22866 },
  { _id: { state: 'MO', city: 'BERKELEY' }, cityPop: 20546 },
  { _id: { state: 'AR', city: 'NATURAL DAM' }, cityPop: 497 }
]


test> db.zips.aggregate([{$group:{_id:{state:"$state",city:"$city"},cityPop:{$sum:"$pop"}}},{$group:{_id:"$_id.state",avgCityPop:{$avg:"$cityPop"}}}])
[
  { _id: 'NV', avgCityPop: 18209.590909090908 },
  { _id: 'OK', avgCityPop: 6155.743639921722 },
  { _id: 'MI', avgCityPop: 12087.512353706112 },
  { _id: 'PA', avgCityPop: 8679.067202337472 },
  { _id: 'OH', avgCityPop: 12700.839578454332 },
  { _id: 'NC', avgCityPop: 10622.815705128205 },
  { _id: 'SC', avgCityPop: 11139.626198083068 },
  { _id: 'VA', avgCityPop: 8526.177931034483 },
  { _id: 'LA', avgCityPop: 10465.496277915632 },
  { _id: 'NM', avgCityPop: 5872.360465116279 },
  { _id: 'AZ', avgCityPop: 20591.16853932584 },
  { _id: 'OR', avgCityPop: 8262.561046511628 },
  { _id: 'SD', avgCityPop: 1839.6746031746031 },
  { _id: 'IA', avgCityPop: 3123.0821147356583 },
  { _id: 'WV', avgCityPop: 2771.4775888717154 },
  { _id: 'WI', avgCityPop: 7323.00748502994 },
  { _id: 'VT', avgCityPop: 2315.8765432098767 },
  { _id: 'HI', avgCityPop: 15831.842857142858 },
  { _id: 'KS', avgCityPop: 3819.884259259259 },
  { _id: 'DE', avgCityPop: 14481.91304347826 }
]
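
The reason for two $group stages is that a city can span several ZIP codes, so city totals have to be built before they are averaged. The plain-JavaScript sketch below reproduces both steps on made-up rows (the state and city names are placeholders, not real zips data).

```javascript
// Illustrative model of the two $group stages, on made-up numbers.
const zipRows = [
  { state: "AA", city: "X", pop: 100 },
  { state: "AA", city: "X", pop: 300 }, // second ZIP code of the same city
  { state: "AA", city: "Y", pop: 200 },
  { state: "BB", city: "Z", pop: 50 },
];

// Stage 1: sum pop per (state, city).
const cityPop = new Map();
for (const { state, city, pop } of zipRows) {
  const key = `${state}|${city}`;
  cityPop.set(key, (cityPop.get(key) ?? 0) + pop);
}

// Stage 2: average the city totals per state.
const perState = new Map();
for (const [key, pop] of cityPop) {
  const state = key.split("|")[0];
  const acc = perState.get(state) ?? { sum: 0, n: 0 };
  perState.set(state, { sum: acc.sum + pop, n: acc.n + 1 });
}
const avgCityPop = [...perState].map(([state, { sum, n }]) => ({
  _id: state,
  avgCityPop: sum / n,
}));
console.log(avgCityPop);
```

For state AA the city totals are 400 and 200, so the average city population is 300; averaging the three raw rows directly would give 200, which is the average per ZIP code rather than per city.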

Largest and smallest city in each state

test> db.zips.aggregate([
...     {$group:{_id:{state:"$state",city:"$city"},pop:{$sum:"$pop"}}},
...     {$sort:{pop:1}},
...     {$group:
...             {
...                     _id:"$_id.state",
...                     biggestCity:{$last:"$_id.city"},
...                     biggestPop:{ $last: "$pop" },
...                     smallestCity:{ $first: "$_id.city" },
...                     smallestPop:{ $first: "$pop" }
...             }
...     },
...     { $project:
...             { _id: 0,
...               state: "$_id",
...               biggestCity:  { name: "$biggestCity",  pop: "$biggestPop" },
...               smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
...             }
...     }
... ])
[
  {
    biggestCity: { name: 'DES MOINES', pop: 148155 },
    smallestCity: { name: 'DOUDS', pop: 15 },
    state: 'IA'
  },
  {
    biggestCity: { name: 'HUNTINGTON', pop: 75343 },
    smallestCity: { name: 'MOUNT CARBON', pop: 0 },
    state: 'WV'
  },
  {
    biggestCity: { name: 'SIOUX FALLS', pop: 102046 },
    smallestCity: { name: 'ZEONA', pop: 8 },
    state: 'SD'
  },
  {
    biggestCity: { name: 'PORTLAND', pop: 518543 },
    smallestCity: { name: 'ODELL', pop: 0 },
    state: 'OR'
  },
  {
    biggestCity: { name: 'MANCHESTER', pop: 106452 },
    smallestCity: { name: 'WEST NOTTINGHAM', pop: 27 },
    state: 'NH'
  },
  {
    biggestCity: { name: 'BALTIMORE', pop: 733081 },
    smallestCity: { name: 'ANNAPOLIS JUNCTI', pop: 32 },
    state: 'MD'
  },
  {
    biggestCity: { name: 'ALBUQUERQUE', pop: 449584 },
    smallestCity: { name: 'ALGODONES', pop: 0 },
    state: 'NM'
  },
  {
    biggestCity: { name: 'VIRGINIA BEACH', pop: 385080 },
    smallestCity: { name: 'WALLOPS ISLAND', pop: 0 },
    state: 'VA'
  },
  {
    biggestCity: { name: 'NEW ORLEANS', pop: 496937 },
    smallestCity: { name: 'FORDOCHE', pop: 0 },
    state: 'LA'
  },
  {
    biggestCity: { name: 'COLUMBIA', pop: 269521 },
    smallestCity: { name: 'QUINBY', pop: 0 },
    state: 'SC'
  },
  {
    biggestCity: { name: 'PHILADELPHIA', pop: 1610956 },
    smallestCity: { name: 'HAMILTON', pop: 0 },
    state: 'PA'
  },
  {
    biggestCity: { name: 'CHARLOTTE', pop: 465833 },
    smallestCity: { name: 'GLOUCESTER', pop: 0 },
    state: 'NC'
  },
  {
    biggestCity: { name: 'DETROIT', pop: 963243 },
    smallestCity: { name: 'LELAND', pop: 0 },
    state: 'MI'
  },
  {
    biggestCity: { name: 'PHOENIX', pop: 890853 },
    smallestCity: { name: 'HUALAPAI', pop: 2 },
    state: 'AZ'
  },
  {
    biggestCity: { name: 'LAS VEGAS', pop: 597557 },
    smallestCity: { name: 'TUSCARORA', pop: 1 },
    state: 'NV'
  },
  {
    biggestCity: { name: 'TULSA', pop: 389072 },
    smallestCity: { name: 'SOUTHARD', pop: 8 },
    state: 'OK'
  },
  {
    biggestCity: { name: 'CLEVELAND', pop: 536759 },
    smallestCity: { name: 'ISLE SAINT GEORG', pop: 38 },
    state: 'OH'
  },
  {
    biggestCity: { name: 'ANCHORAGE', pop: 183987 },
    smallestCity: { name: 'SELAWIK', pop: 0 },
    state: 'AK'
  },
  {
    biggestCity: { name: 'INDIANAPOLIS', pop: 348868 },
    smallestCity: { name: 'WESTPOINT', pop: 145 },
    state: 'IN'
  },
  {
    biggestCity: { name: 'MIAMI', pop: 825232 },
    smallestCity: { name: 'CECIL FIELD NAS', pop: 0 },
    state: 'FL'
  }
]

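4. MapReduce

A MapReduce operation splits a large data processing job into parts that are processed in parallel and then merges the results. MongoDB's Map-Reduce is very flexible and is also practical for large-scale data analysis. Note that mapReduce has been deprecated since MongoDB 5.0 in favor of the aggregation pipeline.

MapReduce has two phases:

- a map phase that brings together the document data sharing the same key
- a reduce phase that combines the results of the map phase and outputs the statistics

Basic MapReduce syntax

db.collection.mapReduce(
   function() {emit(key,value);},  // map function
   function(key,values) {return reduceFunction},   // reduce function
   {
      out: <collection>,
      query: <document>,
      sort: <document>,
      limit: <number>,
      finalize: <function>,
      scope: <document>,
      jsMode: <boolean>,
      verbose: <boolean>,
      bypassDocumentValidation: <boolean>
   }
)

map: splits the data into key-value pairs and hands them to the reduce function
reduce: performs the statistical computation on the values for each key
out: the collection the results are written to
query: optional, a filter; only matching documents are sent to map
sort: optional, sorts the documents before they are sent to map
limit: optional, limits the number of documents sent to map
finalize: optional, modifies the reduce output before it is returned
scope: optional, global variables visible to map, reduce, and finalize
jsMode: optional, default false. Whether intermediate data stays as JavaScript objects between map and reduce (true) instead of being converted to BSON (false)
verbose: optional, whether to include timing information in the result; default false
bypassDocumentValidation: optional, whether to skip document validation

Total favorites per author for book documents whose type is travel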

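db.books.mapReduce(
    // map: emit one (author name, favCount) pair per matching document
    function(){emit(this.author.name,this.favCount)},
    // reduce: add up the favCount values emitted for each author
    function(key,values){return Array.sum(values)},
    {
        query:{type:"travel"},
        out: "books_favCount"
    }
 )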
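
To see what the map and reduce phases do without a server, here is a small in-memory sketch in plain JavaScript. `runMapReduce` is a hypothetical helper and the documents are made up. Two simplifications are assumed: `emit` is passed to the map function as an argument (in MongoDB it is provided globally), and reduce is called for every key, even keys with a single value.

```javascript
// Illustrative in-memory model of mapReduce; runMapReduce is a hypothetical helper.
function runMapReduce(docs, mapFn, reduceFn, { query = () => true } = {}) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Map phase: call mapFn with `this` bound to each matching document.
  docs.filter(query).forEach((doc) => mapFn.call(doc, emit));
  // Reduce phase: combine all values emitted for the same key.
  return [...emitted].map(([key, values]) => ({
    _id: key,
    value: reduceFn(key, values),
  }));
}

// Made-up sample documents.
const mrBooks = [
  { type: "travel", favCount: 10, author: { name: "xx001" } },
  { type: "travel", favCount: 20, author: { name: "xx002" } },
  { type: "novel", favCount: 99, author: { name: "xx001" } },
  { type: "travel", favCount: 5, author: { name: "xx001" } },
];

const favByAuthor = runMapReduce(
  mrBooks,
  function (emit) { emit(this.author.name, this.favCount); },
  (key, values) => values.reduce((sum, v) => sum + v, 0),
  { query: (doc) => doc.type === "travel" }
);
console.log(favByAuthor);
```

The same result can be expressed as an aggregation pipeline with a $match on type followed by a $group on "$author.name" with a $sum of "$favCount", which is the approach to prefer now that mapReduce is deprecated.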