管道在Unix和Linux中一般用于将当前命令的输出结果作为下一个命令的参数。
MongoDB的聚合管道将MongoDB文档在一个管道处理完毕后将结果传递给下一个管道处理。管道操作是可以重复的。
表达式:处理输入文档并输出。表达式是无状态的,只能用于计算当前聚合管道的文档,不能处理其它的文档。
这里我们介绍一下聚合框架中常用的几个操作:
- $project:修改输入文档的结构。可以用来重命名、增加或删除域,也可以用于创建计算结果以及嵌套文档。
- $match:用于过滤数据,只输出符合条件的文档。$match使用MongoDB的标准查询操作。
- $limit:用来限制MongoDB聚合管道返回的文档数。
- $skip:在聚合管道中跳过指定数量的文档,并返回余下的文档。
- $unwind:将文档中的某一个数组类型字段拆分成多条,每条包含数组中的一个值。
- $group:将集合中的文档分组,可用于统计结果。
- $sort:将输入文档排序后输出。
- $geoNear:输出接近某一地理位置的有序文档。
Aggregation Pipleline
关于Aggregation,官方文档在这里。我这里就半翻译半扯淡(当然大部分是扯淡)说一下Aggregation Pipleline。
管道的概念
管道在*nix中将上一个命令输出的数据作为下一个命令的参数。MongoDB中的管道操作是可以重复的,基于这个概念,MongoDB中的管道聚合可以有非常实用的玩法,比如对聚合的结果进行排序。
那么,如何使用呢

借用一下官方的图,这里aggregate的命令为:
[
{$match: {status: "A"}},
{$group: {_id: "$cust_id", total: {$sum: "$amount"}}}
]
aggreagte是一个数组,其中包含多个对象(命令),通过遍历Pipleline数组对collection中的数据进行操作。
解释一下例子中的配置项意思。
$match:查询条件
- 经常使用正则表示的人肯定对match不陌生,很明显
$match是配置查询数据时的条件。这里的语法和查询数据库的时候一毛一样,也就不再赘述了,看一下MongoDB的CRUD文档即可。
$group:聚合的配置
_id代表你想聚合的数据的主键,例如上述数据中,你想聚合所有cust_id相同的条目的amount的总和,那_id即被设置为cust_id。_id为必须,但是你可以填写一个空值。total代表你最后想输出的数据之一,这里total是每条结果中amount的总和。$sum是一个聚合的操作符,另外的操作符你可以在官方文档中找到。上图中的命令表示对相同主键(_id)下的amount进行求和。如果你想要计算主键出现的次数,可以把命令写成如下的形式
{$sum: 1}
聚合的过程
看一下图例,所有的数据先经过$match命令,只留下了status为A的数据,接着,对筛选出的数据进行聚合操作,对相同cust_id的数据进行计算amount总和的操作,最后输出结果。
其他管道表达式与操作符
这里说一下$sort操作符。前面说过,利用管道的特性,可以做到对结果进行排序。所以只需要在$group操作之后使用$sort操作即可,这点在报表的制作上十分实用。况且使用MongoDB对结果进行排序也可以尽量优化性能。
$first 从分组中取出第一个值
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
{ "_id" : 6, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-15T12:05:10Z") }
{ "_id" : 7, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T14:12:12Z") }
db.sales.aggregate(
[
{ $sort: { item: 1, date: 1 } },
{
$group:
{
_id: "$item",
firstSalesDate: { $first: "$date" }
}
}
]
)
结果
{ "_id" : "xyz", "firstSalesDate" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : "jkl", "firstSalesDate" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : "abc", "firstSalesDate" : ISODate("2014-01-01T08:00:00Z") }
$unwind 将数组打散
{ "_id" : 1, "item" : "ABC1", sizes: [ "S", "M", "L"] }
db.inventory.aggregate( [ { $unwind : "$sizes" } ] )
{ "_id" : 1, "item" : "ABC1", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "L" }
db.inventory.aggregate( [ { $unwind: "$sizes" } ] )
db.inventory.aggregate( [ { $unwind: { path: "$sizes" } } ] )这两种用法是一样的
还有另外一个参数preserveNullAndEmptyArrays,true时当path为空或者不存在也会返回,false时不返回
{ "_id" : 1, "item" : "ABC", "sizes": [ "S", "M", "L"] }
{ "_id" : 2, "item" : "EFG", "sizes" : [ ] }
{ "_id" : 3, "item" : "IJK", "sizes": "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }
db.inventory.aggregate( [ { $unwind: "$sizes" } ] )
db.inventory.aggregate( [ { $unwind: { path: "$sizes" } } ] ) preserveNullAndEmptyArrays默认时false
{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }
db.inventory.aggregate( [
{ $unwind: { path: "$sizes", preserveNullAndEmptyArrays: true } }
] )
结果是都返回了:
{ "_id" : 1, "item" : "ABC", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC", "sizes" : "L" }
{ "_id" : 2, "item" : "EFG" }
{ "_id" : 3, "item" : "IJK", "sizes" : "M" }
{ "_id" : 4, "item" : "LMN" }
{ "_id" : 5, "item" : "XYZ", "sizes" : null }
$group 进行分组
_id代表分组使用的键,可以指定一个或者多个,如果指定为null就是对所有的文档进行处理,不进行分组
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "456"
}
对三个字段进行分组,($match是对文档的筛选)。然后再对第一个group的结果进行分组
db.campaigns.aggregate([
{ "$match": { "subscriber_id": { "$ne": null }}},
// Count all occurrences
{ "$group": {
"_id": {
"campaign_id": "$campaign_id",
"campaign_name": "$campaign_name",
"subscriber_id": "$subscriber_id"
},
"count": { "$sum": 1 }
}},
// Sum all occurrences and count distinct
{ "$group": {
"_id": {
"campaign_id": "$_id.campaign_id",
"campaign_name": "$_id.campaign_name"
},
"totalCount": { "$sum": "$count" },
"distinctCount": { "$sum": 1 }
}}
])
第一个分组后的结果
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "456"
},
"count" : 1
}
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "123"
},
"count" : 2
}
第二个分组后的结果:
-
{
-
"_id" : {
-
"campaign_id" : "A",
-
"campaign_name" : "A"
-
},
-
"totalCount" : 3,
-
"distinctCount" : 2
- }
$push 将分组后的每一个组中的指定item添加到一个数组中
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
{ "_id" : 6, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-15T12:05:10Z") }
{ "_id" : 7, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T14:12:12Z") }
db.sales.aggregate(
[
{
$group:
{
_id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
itemsSold: { $push: { item: "$item", quantity: "$quantity" } }
}
}
]
)
{
"_id" : { "day" : 46, "year" : 2014 },
"itemsSold" : [
{ "item" : "abc", "quantity" : 10 },
{ "item" : "xyz", "quantity" : 10 },
{ "item" : "xyz", "quantity" : 5 },
{ "item" : "xyz", "quantity" : 10 }
]
}
{
"_id" : { "day" : 34, "year" : 2014 },
"itemsSold" : [
{ "item" : "jkl", "quantity" : 1 },
{ "item" : "xyz", "quantity" : 5 }
]
}
{
"_id" : { "day" : 1, "year" : 2014 },
"itemsSold" : [ { "item" : "abc", "quantity" : 2 } ]
}
db.sales.aggregate(
[
{
$group:
{
_id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
itemsSold: { $addToSet: "$item" } //数组中的元素不重复
}
}
]
)
#result
{ "_id" : { "day" : 46, "year" : 2014 }, "itemsSold" : [ "xyz", "abc" ] }
{ "_id" : { "day" : 34, "year" : 2014 }, "itemsSold" : [ "xyz", "jkl" ] }
{ "_id" : { "day" : 1, "year" : 2014 }, "itemsSold" : [ "abc" ] }
db.sales.aggregate(
[
{
$group : {
_id : null, # 如果为null,就统计出全部
totalPrice: { $sum: { $multiply: [ "$price", "$quantity" ] } },
averageQuantity: { $avg: "$quantity" },
count: { $sum: 1 }
}
}
]
)
$lookup 用于表的连接
两个collection分别是order 和 Info,数据结构如下


我们基于order表的item字段进行联表查询,也就是以order为左表(基准表),Info为右表(附表)
结果如下:
结果集是将order表中的所有数据列出,再添加上连接表Info中对应的数据(紫色,红色部分)
{
"_id" : 1,
"item" : "Jack",
"price" : 12,
"Info_doc" : [
{
"_id" : 1,
"name" : "Jack",
"city" : "Tianjin",
"sex" : 1
}
]
},
/* 2 */
{
"_id" : 2,
"item" : "Johnny",
"price" : 33,
"Info_doc" : [
{
"_id" : 2,
"name" : "Johnny",
"city" : "Xingtai",
"sex" : 1
}
]
},
/* 3 */
{
"_id" : 3,
"Info_doc" : [ ---请注意为什么_id:3的数据也会联表带出来呢?下面有分析
{
"_id" : 5,
"name" : null,
"city" : "yyyy"
},
{
"_id" : 6
}
]
}
官网有句原话描述: If an input document does not contain the localField, the $lookup treats the field as having a value of null for matching purposes.
If a document in the from collection does not contain the foreignField, the $lookup treats the value as null for matching purposes.
意思就是说如果无论左表,右表,只要文档中不包含连接字段,则就会假设此文档中有此字段且值为null对待
那么又一个问题来了,我只想联表查询后显示某些字段值,不想全部都显示出来,怎么办?
答案:
使用$project,代码如下:

连表操作在mongoose中还可以使用populate()
Story.
find().
populate({ path: 'fans', select: 'name' }).
populate({ path: 'fans', select: 'email' });
// The above is equivalent to:
Story.find().populate({ path: 'fans', select: 'email' });
Story.
find(...).
populate({
path: 'fans',
match: { age: { $gte: 21 }},可以实现对引用过来的文档的筛选
// Explicitly exclude `_id`, see http://bit.ly/2aEfTdB
select: 'name -_id',
options: { limit: 5 }
})
User.
findOne({ name: 'Val' }).
populate({
path: 'friends',
// Get friends of friends - populate the 'friends' array for every friend
populate: { path: 'friends' }
});
aggregate可以多个lookup
const ret = await ctx.model.BinLocation.aggregate([
{
$match: { isDeleted: false, blueprintId: { $in: blueprints }, binLocation: billData.binLocation, productCode: billData.productCode }
},
{
$lookup: { from: 'bin_polygon', localField: 'binLocation', foreignField: 'binLocation', as: 'binInfo' }
},
{
$lookup: { from: 'bin_meta', localField: 'productCode', foreignField: 'productCode', as: 'binMeta' }
},
{
$lookup: { from: 'shipment', localField: 'productCode', foreignField: 'productCode', as: 'shipmentData' }
},
{
$match: { 'binInfo.blueprintId': { $in: blueprints } }
},
{
$match: { 'shipmentData.blueprintId': { $in: blueprints }, 'shipmentData.binLocation': billData.binLocation, 'shipmentData.createdAt': { $gte: startTime, $lt: endTime } }
},
{
$project: { binLocation: 1, productCode: 1, num: 1, volume: 1, binInfo: { floor: 1, blueprintId: 1 }, binMeta: { projectName: 1, volume: 1 }, shipmentData: 1 }
}
]);
还可以支持pipeline和let
let用来定义变量,用于查询条件中做帅选,pipeline用于做管道,对被关联表进行管道操作(一般就是match筛选)
db.orders.insert([
{ "_id" : 1, "item" : "almonds", "price" : 12, "ordered" : 2 },
{ "_id" : 2, "item" : "pecans", "price" : 20, "ordered" : 1 },
{ "_id" : 3, "item" : "cookies", "price" : 10, "ordered" : 60 }
])
db.warehouses.insert([
{ "_id" : 1, "stock_item" : "almonds", warehouse: "A", "instock" : 120 },
{ "_id" : 2, "stock_item" : "pecans", warehouse: "A", "instock" : 80 },
{ "_id" : 3, "stock_item" : "almonds", warehouse: "B", "instock" : 60 },
{ "_id" : 4, "stock_item" : "cookies", warehouse: "B", "instock" : 40 },
{ "_id" : 5, "stock_item" : "cookies", warehouse: "A", "instock" : 80 }
])
db.orders.aggregate([
{
$lookup:
{
from: "warehouses",
let: { order_item: "$item", order_qty: "$ordered" },
pipeline: [
{ $match:
{ $expr:
{ $and:
[
{ $eq: [ "$stock_item", "$$order_item" ] },
{ $gte: [ "$instock", "$$order_qty" ] }
]
}
}
},
{ $project: { stock_item: 0, _id: 0 } }
],
as: "stockdata"
}
}
])
{ "_id" : 1, "item" : "almonds", "price" : 12, "ordered" : 2,
"stockdata" : [ { "warehouse" : "A", "instock" : 120 }, { "warehouse" : "B", "instock" : 60 } ] }
{ "_id" : 2, "item" : "pecans", "price" : 20, "ordered" : 1,
"stockdata" : [ { "warehouse" : "A", "instock" : 80 } ] }
{ "_id" : 3, "item" : "cookies", "price" : 10, "ordered" : 60,
"stockdata" : [ { "warehouse" : "A", "instock" : 80 } ] }
通过实践证明,aggregate操作在数据量很大的情况下是一个大坑!
bulkwrite
bulkWrite()接受一个写操作数组并执行其中的每个操作。默认情况下,按顺序执行操作
{ "_id" : 1, "char" : "Brisbane", "class" : "monk", "lvl" : 4 },
{ "_id" : 2, "char" : "Eldon", "class" : "alchemist", "lvl" : 3 },
{ "_id" : 3, "char" : "Meldane", "class" : "ranger", "lvl" : 3 }
try {
db.characters.bulkWrite([
{ insertOne: { "document": { "_id": 4, "char": "Dithras", "class": "barbarian", "lvl": 4 } } },
{ insertOne: { "document": { "_id": 4, "char": "Taeln", "class": "fighter", "lvl": 3 } } },
{ updateOne : {
"filter" : { "char" : "Eldon" },
"update" : { $set : { "status" : "Critical Injury" } }
} },
{ deleteOne : { "filter" : { "char" : "Brisbane"} } },
{ replaceOne : {
"filter" : { "char" : "Meldane" },
"replacement" : { "char" : "Tanys", "class" : "oracle", "lvl": 4 }
} }
], { ordered : false } );
} catch (e) {
print(e);
}
经常遇到一个需求就是,批量更新操作,不同的条件,不同的更新数据,实现可以使用bulk
var bulk = db.items.initializeUnorderedBulkOp();
bulk.find( { status: "D" } ).update( { $set: { status: "I", points: "0" } } );
bulk.find( { item: null } ).update( { $set: { item: "TBD" } } );
bulk.execute();
db.collection.initializeUnorderedBulkOp()
db.collection.initializeOrderedBulkOp()
Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list
bulk构造器用无序的会更快
MongoDB的聚合管道通过一系列操作处理文档,如$match筛选数据,$group进行分组,$unwind拆分数组,$sort排序。本文介绍了聚合过程、常用操作符及其应用场景,包括$first和$unwind等,展示了如何使用aggregate命令进行数据聚合和报表制作。
659

被折叠的 条评论
为什么被折叠?



