MongoDB Aggregation Study Notes: $set / $unset / $project

Umarudive

Last modified 2022-04-04 23:55:38

First published 2022-04-04 23:54:39

Original article: https://www.practical-mongodb-aggregations.com/guides/project.html

What follows is a translation of the original article together with my study notes.

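An aggregation pipeline is an ordered sequence of instructions called stages. The entire output of one stage becomes the entire input of the next, and so on until the end of the pipeline. Pipelines are highly composable because stages are stateless, self-contained components. An aggregation can therefore break a complex computation into individual stages, each of which can be tested in isolation.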
[Figure: Alternatives for MongoDB aggregation pipelines composability]
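
To make the composability point concrete, here is a minimal sketch in which each stage lives in its own variable and the pipeline is simply the ordered list of those stages. The variable names and the `payments` collection are assumptions for illustration only; they do not come from the original article.

```javascript
// Each stage is a plain object, so it can be built and inspected on its own.
const convertExpiryStage = {
  "$set": {"card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}}},
};
const dropIdStage = {"$unset": ["_id"]};

// The pipeline is just the ordered list of stages.
const pipeline = [convertExpiryStage, dropIdStage];

// In mongosh, a single stage can be run alone to check its output before
// composing the full pipeline (`payments` is a hypothetical collection name):
//   db.payments.aggregate([convertExpiryStage]);
//   db.payments.aggregate(pipeline);
console.log(JSON.stringify(pipeline, null, 2));
```

Because the stages do not share state, commenting one out or reordering the array is enough to test a different combination.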

When To Use $set & $unset

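Use $set and $unset when you want to keep most of the fields from the input documents and only need to add, modify, or remove a small subset of them.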

// INPUT  (a record from the source collection to be operated on by an aggregation)
{
  _id: ObjectId("6044faa70b2c21f8705d8954"),
  card_name: "Mrs. Jane A. Doe",
  card_num: "1234567890123456",
  card_expiry: "2023-08-31T23:59:59.736Z",
  card_sec_code: "123",
  card_provider_name: "Credit MasterCard Gold",
  transaction_id: "eb1bd77836e8713656d9bf2debba8900",
  transaction_date: ISODate("2021-01-13T09:32:07.000Z"),
  transaction_curncy_code: "GBP",
  transaction_amount: NumberDecimal("501.98"),
  reported: true
}

// OUTPUT  (a record in the results of the executed aggregation)
{
  card_name: "Mrs. Jane A. Doe",
  card_num: "1234567890123456",
  card_expiry: ISODate("2023-08-31T23:59:59.736Z"), // Field type converted from text
  card_sec_code: "123",
  card_provider_name: "Credit MasterCard Gold",
  transaction_id: "eb1bd77836e8713656d9bf2debba8900",
  transaction_date: ISODate("2021-01-13T09:32:07.000Z"),
  transaction_curncy_code: "GBP",
  transaction_amount: NumberDecimal("501.98"),
  reported: true,
  card_type: "CREDIT"                               // New added literal value field
}

// BAD
[
  {"$project": {
    // Modify a field + add a new field
    "card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}},
    "card_type": "CREDIT",        

    // Must now name all the other fields for those fields to be retained
    "card_name": 1,
    "card_num": 1,
    "card_sec_code": 1,
    "card_provider_name": 1,
    "transaction_id": 1,
    "transaction_date": 1,
    "transaction_curncy_code": 1,
    "transaction_amount": 1,
    "reported": 1,                
    
    // Remove _id field
    "_id": 0,
  }},
]

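As shown above, using $project makes the pipeline verbose. Only two fields changed, yet every other field has to be named explicitly, otherwise it is dropped from the output. This record has only about ten fields. What if this were a trading system whose documents had a hundred?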

// GOOD
[
  {"$set": {
    // Modified + new field
    "card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}},
    "card_type": "CREDIT",        
  }},
  
  {"$unset": [
    // Remove _id field
    "_id",
  ]},
]
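
As a side note, $set and $unset were introduced in MongoDB 4.2 as aliases: $set for $addFields, and the $unset stage for a $project that only excludes fields. The sketch below writes the GOOD pipeline in that older spelling. The variable name `legacyPipeline` is mine and not from the original article.

```javascript
// Same intent as the $set/$unset pipeline above, written with the stages
// those two are aliases for.
const legacyPipeline = [
  {"$addFields": {
    // Modified + new field
    "card_expiry": {"$dateFromString": {"dateString": "$card_expiry"}},
    "card_type": "CREDIT",
  }},

  // An exclusion-only $project keeps every field that is not listed.
  {"$project": {
    "_id": 0,
  }},
];

console.log(JSON.stringify(legacyPipeline, null, 2));
```

The point is that an exclusion-only $project behaves like $unset: all other fields are retained, so nothing else needs to be named.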

When To Use $project

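Use $project when the output documents need a very different structure from the input documents. This is most likely when you no longer need most of the original fields.
This time, for the same collection, suppose the aggregation must produce result documents shaped very differently from the input and containing far fewer of the original fields: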

// OUTPUT  (a record in the results of the executed aggregation)
{
  transaction_info: { 
    date: ISODate("2021-01-13T09:32:07.000Z"),
    amount: NumberDecimal("501.98")
  },
  status: "REPORTED"
}

Here, using $set and $unset is more verbose than using $project, because every unwanted field has to be listed for exclusion.

// BAD
[
  {"$set": {
    // Add some fields
    "transaction_info.date": "$transaction_date",
    "transaction_info.amount": "$transaction_amount",
    "status": {"$cond": {"if": "$reported", "then": "REPORTED", "else": "UNREPORTED"}},
  }},
  
  {"$unset": [
    // Remove _id field
    "_id",

    // Must name all other existing fields to be omitted
    "card_name",
    "card_num",
    "card_expiry",
    "card_sec_code",
    "card_provider_name",
    "transaction_id",
    "transaction_date",
    "transaction_curncy_code",
    "transaction_amount",
    "reported",         
  ]}, 
]

So in this case, $project is the more concise and flexible way to build the new fields.

// GOOD
[
  {"$project": {
    // Add some fields
    "transaction_info.date": "$transaction_date",
    "transaction_info.amount": "$transaction_amount",
    "status": {"$cond": {"if": "$reported", "then": "REPORTED", "else": "UNREPORTED"}},
    
    // Remove _id field
    "_id": 0,
  }},
]
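
The dotted field names in the GOOD pipeline can also be written as a nested document, which some readers may find closer to the shape of the desired output. This is a sketch of that alternative spelling, not something shown in the original article, and `nestedPipeline` is a name I made up.

```javascript
// Same projection as above, with transaction_info written as a nested
// document instead of two dotted paths.
const nestedPipeline = [
  {"$project": {
    "transaction_info": {
      "date": "$transaction_date",
      "amount": "$transaction_amount",
    },
    "status": {"$cond": {"if": "$reported", "then": "REPORTED", "else": "UNREPORTED"}},

    // Remove _id field
    "_id": 0,
  }},
];

console.log(JSON.stringify(nestedPipeline, null, 2));
```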

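Another potential downside of $project, compared with $set/$addFields, shows up when you use it to declare every field to include: it is easy to carelessly name more fields from the source data than you intended. If a stage such as $group follows later in the pipeline, it hides the slip, because the unwanted field never appears in the final output. You might wonder why that matters if the output is still correct. Consider what happens if you wanted the aggregation to be served by a covered index, so that it never has to read the raw documents. In most cases MongoDB's aggregation engine can track field dependencies through the pipeline and, left to itself, work out which fields are not needed. By explicitly asking for an extra field you override that capability. A common mistake is forgetting to exclude _id in an inclusion $project, so it is included by default. That mistake silently kills the potential optimization.

If you must use $project, place it as late in the pipeline as possible. That way it is clear exactly what you are asking for as the aggregation's final output. Fields such as _id may also already have been identified as unnecessary by then, for example because of an earlier $group.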
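
The following sketch contrasts an early inclusion $project that forgets `_id: 0` with a pipeline that leaves field selection to the engine and only projects at the end. The grouping by currency, the output field names `currency` and `total`, the `payments` collection, and the index in the comment are all assumptions made for illustration. Whether a given pipeline is actually covered by an index depends on the server version and the query planner, so check the explain output rather than taking this as a guarantee.

```javascript
// Hypothetical supporting index, created in mongosh:
//   db.payments.createIndex({transaction_curncy_code: 1, transaction_amount: 1});

// Risky: the early inclusion $project implicitly asks for _id as well,
// a field the index above does not contain. The later $group hides the slip.
const riskyPipeline = [
  {"$project": {"transaction_curncy_code": 1, "transaction_amount": 1}},
  {"$group": {
    "_id": "$transaction_curncy_code",
    "total": {"$sum": "$transaction_amount"},
  }},
];

// Safer: no early $project, so the engine can work out for itself which
// fields the $group needs. The only $project comes last and shapes the output.
const saferPipeline = [
  {"$group": {
    "_id": "$transaction_curncy_code",
    "total": {"$sum": "$transaction_amount"},
  }},
  {"$project": {"_id": 0, "currency": "$_id", "total": 1}},
];

// To compare the plans in mongosh:
//   db.payments.explain("executionStats").aggregate(riskyPipeline);
//   db.payments.explain("executionStats").aggregate(saferPipeline);
console.log(JSON.stringify({riskyPipeline, saferPipeline}, null, 2));
```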