Original article
https://www.practical-mongodb-aggregations.com/guides/project.html
What follows is a translation with study notes.
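An aggregation pipeline is an ordered sequence of instructions, called stages. The entire output of one stage is the entire input of the next, and so on until the end. Pipelines are highly composable because stages are stateless, self-contained components. An aggregation can therefore break a complex computation into independent stages, each of which can be tested in isolation.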
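To make the composability point concrete, here is a minimal sketch in plain JavaScript. The variable names (`matchReported`, `addCardType`, `dropId`) and the `payments` collection mentioned in the comment are hypothetical, chosen only for illustration; the stage operators themselves are MongoDB's.

```javascript
// Each stage is a standalone object, so it can be defined and reviewed on its own.
const matchReported = {"$match": {"reported": true}};
const addCardType = {"$set": {"card_type": "CREDIT"}};
const dropId = {"$unset": ["_id"]};

// A pipeline is just an ordered array of stages.
const pipeline = [matchReported, addCardType, dropId];

// Any prefix of a pipeline is itself a valid pipeline, which is handy
// for testing stages one at a time.
const firstTwoStages = pipeline.slice(0, 2);

// In mongosh this could be run as, for example (hypothetical collection name):
//   db.payments.aggregate(firstTwoStages);
console.log(JSON.stringify(firstTwoStages));
```

Running a growing prefix of the pipeline is one simple way to check each stage's output before the next stage is added.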
When To Use $set & $unset
Use $set and $unset when you want to keep most of the fields from the input records and only add, modify, or remove a small subset of them.
// INPUT (a record from the source collection to be operated on by an aggregation)
{
  _id: ObjectId("6044faa70b2c21f8705d8954"),
  card_name: "Mrs. Jane A. Doe",
  card_num: "1234567890123456",
  card_expiry: "2023-08-31T23:59:59.736Z",
  card_sec_code: "123",
  card_provider_name: "Credit MasterCard Gold",
  transaction_id: "eb1bd77836e8713656d9bf2debba8900",
  transaction_date: ISODate("2021-01-13T09:32:07.000Z"),
  transaction_curncy_code: "GBP",
  transaction_amount: NumberDecimal("501.98"),
  reported: true
}
// OUTPUT (a record in the results of the executed aggregation)
{
  card_name: "Mrs. Jane A. Doe",
  card_num: "1234567890123456",
  card_expiry: ISODate("2023-08-31T23:59:59.736Z"), // Field type converted from text
  card_sec_code: "123",
  card_provider_name: "Credit MasterCard Gold",
  transaction_id: "eb1bd77836e8713656d9bf2debba8900",
  transaction_date: ISODate("2021-01-13T09:32:07.000Z"),
  transaction_curncy_code: "GBP",
  transaction_amount: NumberDecimal("501.98"),
  reported: true,
  card_type: "CREDIT" // New added literal value field
}
// BAD
[
  {"$project": {
    // Modify a field + add a new field
    "card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}},
    "card_type": "CREDIT",
    // Must now name all the other fields for those fields to be retained
    "card_name": 1,
    "card_num": 1,
    "card_sec_code": 1,
    "card_provider_name": 1,
    "transaction_id": 1,
    "transaction_date": 1,
    "transaction_curncy_code": 1,
    "transaction_amount": 1,
    "reported": 1,
    // Remove _id field
    "_id": 0,
  }},
]
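As shown above, using $project makes the pipeline verbose. Only two fields change, yet every other field must be named explicitly or it is lost in the transformation. And this record has only about ten fields. What if this were a trading system with a hundred?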
// GOOD
[
  {"$set": {
    // Modified + new field
    "card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}},
    "card_type": "CREDIT",
  }},
  {"$unset": [
    // Remove _id field
    "_id",
  ]},
]
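As a rough mental model of why the $set/$unset version stays short, the sketch below mimics the two stages in plain JavaScript for top-level fields only. It is an analogy, not how MongoDB implements these stages: `applySet` and `applyUnset` are hypothetical helpers, the sample document is shortened, and `new Date(...)` merely stands in for `$dateFromString`.

```javascript
// Plain-JavaScript analogy only; MongoDB does not run this code.
const inputDoc = {
  _id: "6044faa70b2c21f8705d8954",
  card_name: "Mrs. Jane A. Doe",
  card_expiry: "2023-08-31T23:59:59.736Z",
  reported: true,
};

// Like $set: keep every existing field, then overwrite or add the named ones.
function applySet(doc, changes) {
  return {...doc, ...changes};
}

// Like $unset: keep every existing field except the named ones.
function applyUnset(doc, fieldNames) {
  const copy = {...doc};
  for (const name of fieldNames) {
    delete copy[name];
  }
  return copy;
}

const afterSet = applySet(inputDoc, {
  card_expiry: new Date(inputDoc.card_expiry), // stands in for $dateFromString
  card_type: "CREDIT",
});
const outputDoc = applyUnset(afterSet, ["_id"]);

console.log(Object.keys(outputDoc));
```

The fields that were never mentioned (`card_name`, `reported`) pass through untouched, which is exactly the behaviour that makes $set/$unset scale to documents with many fields.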
When To Use $project
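When the shape of the output documents differs greatly from that of the input documents, $project is the better choice. This is most likely when you no longer need most of the original fields.
This time, for the same collection, suppose the aggregation must produce results that look very different from the input documents and keep far fewer of the original fields: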
// OUTPUT (a record in the results of the executed aggregation)
{
  transaction_info: {
    date: ISODate("2021-01-13T09:32:07.000Z"),
    amount: NumberDecimal("501.98")
  },
  status: "REPORTED"
}
Here $set/$unset is more verbose than $project, because every field you do not want has to be listed for exclusion.
// BAD
[
  {"$set": {
    // Add some fields
    "transaction_info.date": "$transaction_date",
    "transaction_info.amount": "$transaction_amount",
    "status": {"$cond": {"if": "$reported", "then": "REPORTED", "else": "UNREPORTED"}},
  }},
  {"$unset": [
    // Remove _id field
    "_id",
    // Must name all other existing fields to be omitted
    "card_name",
    "card_num",
    "card_expiry",
    "card_sec_code",
    "card_provider_name",
    "transaction_id",
    "transaction_date",
    "transaction_curncy_code",
    "transaction_amount",
    "reported",
  ]},
]
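So in this case $project is the more flexible way to add the new fields, since every field it does not name is dropped automatically.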
// GOOD
[
  {"$project": {
    // Add some fields
    "transaction_info.date": "$transaction_date",
    "transaction_info.amount": "$transaction_amount",
    "status": {"$cond": {"if": "$reported", "then": "REPORTED", "else": "UNREPORTED"}},
    // Remove _id field
    "_id": 0,
  }},
]
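The contrast with $set can be pictured with another plain-JavaScript analogy: an inclusion-style $project behaves roughly like building a fresh document that contains only what is named. `projectTransaction` is a hypothetical helper, the sample document is shortened, and the string amount stands in for `NumberDecimal`. One difference from this sketch is that a real $project keeps `_id` unless it is excluded, which is why the pipeline above sets `"_id": 0`.

```javascript
// Plain-JavaScript analogy only; MongoDB does not run this code.
const paymentDoc = {
  _id: "6044faa70b2c21f8705d8954",
  card_name: "Mrs. Jane A. Doe",
  card_num: "1234567890123456",
  transaction_date: new Date("2021-01-13T09:32:07.000Z"),
  transaction_amount: "501.98", // stands in for NumberDecimal("501.98")
  reported: true,
};

// Like an inclusion-style $project: start from an empty document and
// add only the fields that are explicitly named.
function projectTransaction(doc) {
  return {
    transaction_info: {
      date: doc.transaction_date,
      amount: doc.transaction_amount,
    },
    // Mirrors the $cond expression in the pipeline above.
    status: doc.reported ? "REPORTED" : "UNREPORTED",
  };
}

const projected = projectTransaction(paymentDoc);
console.log(Object.keys(projected)); // [ 'transaction_info', 'status' ]
```

None of the card fields had to be mentioned in order to be dropped, which is the advantage $project has when the output keeps only a few fields.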
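Another potential downside of $project compared with $set (or $addFields) is that, when you use $project to declare every field to include, it is easy to carelessly name more fields from the source data than you intended. If a later stage such as $group follows, it hides the slip: the final output does not contain the erroneous field. You may ask why that matters. Consider what happens if you wanted the aggregation to use a covered index query, which avoids reading the raw documents. In most cases MongoDB's aggregation engine can track field dependencies across the pipeline and, left to its own devices, work out which fields are not needed. By explicitly requesting an extra field, you override that capability. A common mistake is forgetting to exclude _id in an inclusion-style $project, so it is included by default. That mistake silently kills the potential optimization.
If you must use $project, place it as late in the pipeline as possible, where it is clear exactly what you are asking for as the aggregation's final output. By then, unnecessary fields such as _id may already have been identified by the engine as no longer required, for example because of an earlier $group stage.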
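A sketch of the "use $project as late as possible" advice, assuming a hypothetical requirement to total the reported transaction amounts per currency. The output field names `currency` and `total` and the `payments` collection in the comment are made up for illustration. Whether a covered index can actually be used depends on the indexes that exist and on the server version, so treat this as something to verify with an explain plan rather than a guarantee.

```javascript
const pipeline = [
  // The early stages reference only the fields they need. No $project is
  // used here, so the engine is free to work out the dependencies itself.
  {"$match": {"reported": true}},
  {"$group": {
    "_id": "$transaction_curncy_code",
    "total": {"$sum": "$transaction_amount"},
  }},
  // $project comes last, where it describes exactly the final output shape.
  {"$project": {
    "_id": 0,
    "currency": "$_id",
    "total": 1,
  }},
];

// In mongosh the plan could be inspected with, for example
// (hypothetical collection name):
//   db.payments.explain("executionStats").aggregate(pipeline);
console.log(pipeline.map((stage) => Object.keys(stage)[0]));
```

Because the source document's `_id` is never named before the $group stage, there is no early projection that could accidentally request it.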
Main Takeaway
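In summary, prefer $set/$unset over $project. Reach for $project only when the output documents differ greatly in shape from the input and you need just a small subset of the fields.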
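One rough way to make this rule of thumb concrete is to count how many field names each approach has to spell out. The helper `namesToList` below is hypothetical and purely illustrative; the two calls use the field counts from the two examples in these notes.

```javascript
// Hypothetical helper: counts the field names each approach must list.
function namesToList({computed, keptUnchanged, removed}) {
  return {
    // $project names every computed field and every field kept as-is,
    // plus "_id": 0 when _id has to go. Other removed fields are simply omitted.
    project: computed.length + keptUnchanged.length + (removed.includes("_id") ? 1 : 0),
    // $set names only the computed fields; $unset names only the removed ones.
    setUnset: computed.length + removed.length,
  };
}

const otherFields = [
  "card_name", "card_num", "card_sec_code", "card_provider_name",
  "transaction_id", "transaction_date", "transaction_curncy_code",
  "transaction_amount", "reported",
];

// First example: convert card_expiry, add card_type, drop _id, keep the rest.
const mostlyKept = namesToList({
  computed: ["card_expiry", "card_type"],
  keptUnchanged: otherFields,
  removed: ["_id"],
});
console.log(mostlyKept); // { project: 12, setUnset: 3 }

// Second example: build three new fields and drop every original field.
const mostlyDropped = namesToList({
  computed: ["transaction_info.date", "transaction_info.amount", "status"],
  keptUnchanged: [],
  removed: ["_id", "card_expiry", ...otherFields],
});
console.log(mostlyDropped); // { project: 4, setUnset: 14 }
```

The counts match the entries in the BAD and GOOD pipelines above: whichever approach lists fewer names is usually the one that reads better and is easier to maintain.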