The MongoDB Aggregation Framework Compared with MySQL

大鹏的世界

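MongoDB 2.1 introduced the aggregation framework, which can replace MapReduce for common aggregation operations. If you have read the related documentation, you have probably already noticed this new feature. This article introduces the aggregation framework in MongoDB 2.1.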

　　Pipeline syntax overview

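　　A MongoDB aggregation is a series of special operators applied to a collection. An operator is a JavaScript object with a single property: the property name is the operator name, and its value is an options object: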

{ $name: { /* options */ } }

　　The supported operators are $project, $match, $limit, $skip, $unwind, $group, and $sort, each with its own set of options. A series of operators is called a pipeline:

[{ $project: { /* options */ } }, { $match: { /* options */ } }, { $group: { /* options */ } }]

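　　When executing a pipeline, MongoDB pipes the operators into each other. "Pipe" has the same meaning here as in Linux: the output of one operator becomes the input of the next. The result of each operator is a new collection of documents. So MongoDB executes the previous pipeline as follows: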

collection | $project | $match | $group => result
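　　To make the pipe analogy concrete, here is a minimal in-memory sketch in plain JavaScript. It is not how MongoDB implements aggregation; `match`, `project`, `limit`, and `runPipeline` are hypothetical helpers invented for this illustration, and each one only mimics the general idea of the stage it is named after.

```javascript
// Hypothetical in-memory stand-ins for pipeline stages (not MongoDB APIs):
// each stage is a function that takes an array of documents and returns a new one.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (fields) => (docs) =>
  docs.map((doc) => Object.fromEntries(fields.map((f) => [f, doc[f]])));
const limit = (n) => (docs) => docs.slice(0, n);

// Run the stages in order, feeding each stage the output of the previous one.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const sampleBooks = [
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 },
];

// collection | match | project | limit => result
const piped = runPipeline(sampleBooks, [
  match((doc) => doc.author_id === 72347),
  project(["title"]),
  limit(1),
]);

console.log(piped); // [ { title: 'War and Peace' } ]
```

　　Because every stage has the same shape (documents in, documents out), stages can be reordered or repeated freely, which is the property the pipeline syntax relies on.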

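　　You can add as many operators as you like to a pipeline, and even use the same operator twice at different positions.

　　That explains why a pipeline is written not as a single JavaScript object but as an array of objects: within one object, the same operator cannot appear twice: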

// The first appearance of $match and $group would be ignored with this syntax
{
  $match:   { /* options */ },
  $group:   { /* options */ },
  $match:   { /* options */ },
  $project: { /* options */ },
  $group:   { /* options */ }
}
// So MongoDB imposes a collection of JavaScript objects instead
[
  { $match:   { /* options */ } },
  { $group:   { /* options */ } },
  { $match:   { /* options */ } },
  { $project: { /* options */ } },
  { $group:   { /* options */ } }
]
// That's longer and cumbersome to read, but you'll get used to it
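　　The duplicate-key behavior can be checked in any modern JavaScript engine. The snippet below is a small sketch that uses placeholder option objects (`{ step: n }` is made up for the demonstration and is not a real operator option); it assumes an ES2015 or later runtime, where duplicate keys in an object literal are allowed and the last value wins.

```javascript
// Placeholder options ({ step: n }) are invented for this demonstration.
const asObject = {
  $match: { step: 1 },
  $group: { step: 2 },
  $match: { step: 3 }, // overwrites the first $match
};

console.log(Object.keys(asObject)); // [ '$match', '$group' ]
console.log(asObject.$match.step);  // 3

// An array keeps every stage, in order, including repeated operators.
const asArray = [
  { $match: { step: 1 } },
  { $group: { step: 2 } },
  { $match: { step: 3 } },
];

console.log(asArray.length); // 3
```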

　　To execute a pipeline on a MongoDB collection, call the aggregate() function on the collection:

db.books.aggregate([{ $project: { title: 1 } }]);

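　　Tip: if you use Node.js, both the native driver (since v0.9.9.2) and the ODM (since v3.1.0) support the new aggregation framework. For example, to run the previous pipeline on a MongoDB model, you only need to write: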

Books.aggregate([{ $project: { title: 1 } }], function(err, results) {
  // do something with the result
});

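　　The main benefit of the aggregation framework is that MongoDB runs it without the overhead of the JavaScript engine. It is implemented directly in C++, which makes it very fast. Compared with classic SQL aggregation, its main limitation is that it is restricted to a single collection: in the version described here, you cannot apply join-like operations across several collections in one aggregation (a $lookup stage for that purpose was only added later, in MongoDB 3.2). Apart from that, it is very powerful.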

　　In this article, I will illustrate the power of the pipeline operators with examples and compare them with their SQL equivalents.

　　Select, rename, combine

　　Use the $project operator to select or rename properties of the documents in a collection, much like the SELECT clause in SQL:

// sample data
> db.books.find();
[
  { _id: 147, title: "War and Peace", ISBN: 9780307266934 },
  { _id: 148, title: "Anna Karenina", ISBN: 9781593080273 },
  { _id: 149, title: "Pride and Prejudice", ISBN: 9783526419358 },
]

# sample data
> SELECT * FROM book;
+-----+-----------------------+---------------+
| id  | title                 | ISBN          |
+-----+-----------------------+---------------+
| 147 | 'War and Peace'       | 9780307266934 |
| 148 | 'Anna Karenina'       | 9781593080273 |
| 149 | 'Pride and Prejudice' | 9783526419358 |
+-----+-----------------------+---------------+
> db.books.aggregate([
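  { $project: {
    // title is not listed, so it is dropped from the output
    // (mixing "title: 0" with an included field is rejected by current servers)
    reference: "$ISBN"  // use ISBN as source; _id is kept by default
  } }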
]);
[
  { _id: 147, reference: 9780307266934 },
  { _id: 148, reference: 9781593080273 },
  { _id: 149, reference: 9783526419358 },
]

> SELECT id, ISBN AS reference FROM book;
+-----+---------------+
| id  | reference     |
+-----+---------------+
| 147 | 9780307266934 |
| 148 | 9781593080273 |
| 149 | 9783526419358 |
+-----+---------------+

　　The $project operator can also use any supported expression operator ($and, $or, $gt, $lt, $eq, $add, $mod, $substr, $toLower, $toUpper, $dayOfWeek, $hour, $cond, $ifNull, to name a few) to create computed fields and subdocuments.
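　　As an illustration of computed fields, the following sketch only builds a pipeline definition for the books collection shown earlier. The output field names (`upperTitle`, `shortTitle`, `reference`) and the "unknown" fallback value are assumptions chosen for the example; running the pipeline requires a live MongoDB connection, so no server output is shown.

```javascript
// A $project stage that derives new fields with expression operators.
// Field names on the left are made up for this example.
const projectPipeline = [
  { $project: {
    _id: 0,                                        // drop the _id field
    upperTitle: { $toUpper: "$title" },            // title in upper case
    shortTitle: { $substr: ["$title", 0, 3] },     // first three characters
    reference: { $ifNull: ["$ISBN", "unknown"] },  // fallback when ISBN is missing
  } },
];

// In the mongo shell, this would be run as: db.books.aggregate(projectPipeline);
console.log(JSON.stringify(projectPipeline, null, 2));
```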

　　Grouping documents

　　Grouping documents is the job of the $group operator.

// fastest way
> db.books.count();
3
// if you really want to use aggregation
> db.books.aggregate([
  { $group: {
    // _id is required, so give it a constant value
    // to group all the collection into one result
    _id: null,
    // increment nbBooks for each document
    nbBooks: { $sum: 1 }
  } }
]);
[
  { _id: null, nbBooks: 3 }
]

> SELECT COUNT(*) FROM book;
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+
// sample data
> db.books.find()
[
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 }
]

# sample data
> SELECT * FROM book
+-----+---------------------+-----------+
| id  | title               | author_id |
+-----+---------------------+-----------+
| 147 | War and Peace       | 72347     |
| 148 | Anna Karenina       | 72347     |
| 149 | Pride and Prejudice | 42345     |
+-----+---------------------+-----------+
> db.books.aggregate([
  { $group: {
    // group by author_id
    _id: "$author_id",
    // increment nbBooks for each document
    nbBooks: { $sum: 1 }
  } }
]);
[
  { _id: 72347, nbBooks: 2 },
  { _id: 42345, nbBooks: 1 }
]

> SELECT author_id, COUNT(*)
  FROM book
  GROUP BY author_id;
+-----------+----------+
| author_id | COUNT(*) |
+-----------+----------+
| 72347     | 2        |
| 42345     | 1        |
+-----------+----------+
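
　　For readers who think in code rather than SQL, the grouping above can be sketched in plain JavaScript. This is only an in-memory analogy of what `{ $group: { _id: "$author_id", nbBooks: { $sum: 1 } } }` computes, not MongoDB's implementation; the variable names are made up for the example.

```javascript
const booksByAuthorInput = [
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 },
];

// One counter per distinct author_id; a Map keeps first-seen order.
const counters = new Map();
for (const book of booksByAuthorInput) {
  counters.set(book.author_id, (counters.get(book.author_id) || 0) + 1);
}

// Shape the result like the $group output: the grouping key becomes _id.
const groupedByAuthor = [...counters].map(([_id, nbBooks]) => ({ _id, nbBooks }));

console.log(groupedByAuthor);
// [ { _id: 72347, nbBooks: 2 }, { _id: 42345, nbBooks: 1 } ]
```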

　　Multi-operator pipelines

　　A pipeline can contain more than one operator. Here is a combination of the $group and $project operators:

> db.books.aggregate([
  { $group: {
    _id: "$author_id",
    nbBooks: { $sum: 1 }
  } },
  { $project: {
    _id: 0,
    authorId: "$_id",
    nbBooks: 1
  } }
]);
[
  { authorId: 72347, nbBooks: 2 },
  { authorId: 42345, nbBooks: 1 }
]

> SELECT author_id AS author, COUNT(*) AS nb_books
  FROM book
  GROUP BY author_id;
+--------+----------+
| author | nb_books |
+--------+----------+
| 72347  | 2        |
| 42345  | 1        |
+--------+----------+

　　More complex aggregations

　　$group supports many aggregation functions: $first, $last, $min, $max, $avg, $sum, $push, and $addToSet. See the MongoDB documentation at http://docs.mongodb.org/manual/reference/aggregation for details.

// sample data
> db.reviews.find();
[
  { _id: "455", bookId: "974147",
    date: new Date("2012-07-10"), score: 1 },
  { _id: "456", bookId: "345335",
    date: new Date("2012-07-12"), score: 5 },
  { _id: "457", bookId: "345335",
    date: new Date("2012-07-13"), score: 2 },
  { _id: "458", bookId: "974147",
    date: new Date("2012-07-16"), score: 3 }
]

# sample data
> SELECT * FROM review;
+-----+---------+--------------+-------+
| id  | book_id | date         | score |
+-----+---------+--------------+-------+
| 455 | 974147  | "2012-07-10" | 1     |
| 456 | 345335  | "2012-07-12" | 5     |
| 457 | 345335  | "2012-07-13" | 2     |
| 458 | 974147  | "2012-07-16" | 3     |
+-----+---------+--------------+-------+
> db.reviews.aggregate([
  { $group: {
    _id: "$bookId",
    avgScore:  { $avg: "$score" },
    maxScore:  { $max: "$score" },
    nbReviews: { $sum: 1 }
  } }
]);
[
  { _id: "345335", avgScore: 3.5, maxScore: 5, nbReviews: 2 },
  { _id: "974147", avgScore: 2, maxScore: 3, nbReviews: 2 }
]

> SELECT book_id,
         AVG(score) as avg_score,
         MAX(score) as max_score,
         COUNT(*) as nb_reviews
  FROM review
  GROUP BY book_id ;
+---------+-----------+-----------+------------+
| book_id | avg_score | max_score | nb_reviews |
+---------+-----------+-----------+------------+
| 345335  | 3.5       | 5         | 2          |
| 974147  | 2         | 3         | 2          |
+---------+-----------+-----------+------------+

　　Conditions

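　　You can restrict the set of documents to process by passing a query object to the $match operator. Whether you place this operator before or after the $group operator determines whether its SQL equivalent is WHERE or HAVING.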

> db.reviews.aggregate([
  { $match : {
    date: { $gte: new Date("2012-07-11") }
  } },
  { $group: {
    _id: "$bookId",
    avgScore: { $avg: "$score" }
  } }
]);
[
  { _id: "345335", avgScore: 3.5 },
  { _id: "974147", avgScore: 3 }
]

> SELECT book_id, AVG(score)
  FROM review
  WHERE review.date >= "2012-07-11"
  GROUP BY review.book_id ;
+---------+------------+
| book_id | AVG(score) |
+---------+------------+
| 345335  | 3.5        |
| 974147  | 3          |
+---------+------------+
> db.reviews.aggregate([
  { $group: {
    _id: "$bookId",
    avgScore: { $avg: "$score" }
  } },
  { $match : {
    avgScore: { $gt: 3 }
  } }
]);
[
  { _id: "345335", avgScore: 3.5 }
]

> SELECT book_id, AVG(score) AS avg_score
  FROM review
  GROUP BY review.book_id
  HAVING avg_score > 3;
+---------+-----------+
| book_id | avg_score |
+---------+-----------+
| 345335  | 3.5       |
+---------+-----------+
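
　　The WHERE versus HAVING distinction comes down to whether the filter runs before or after the grouping. The sketch below reproduces both cases in memory with plain JavaScript on the sample reviews; `avgScoreByBook` is a hypothetical helper written for this example and only imitates `$group` with `$avg`.

```javascript
const sampleReviews = [
  { _id: "455", bookId: "974147", date: new Date("2012-07-10"), score: 1 },
  { _id: "456", bookId: "345335", date: new Date("2012-07-12"), score: 5 },
  { _id: "457", bookId: "345335", date: new Date("2012-07-13"), score: 2 },
  { _id: "458", bookId: "974147", date: new Date("2012-07-16"), score: 3 },
];

// Hypothetical helper: average score per bookId, in first-seen order.
function avgScoreByBook(docs) {
  const sums = new Map();
  for (const { bookId, score } of docs) {
    const entry = sums.get(bookId) || { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    sums.set(bookId, entry);
  }
  return [...sums].map(([_id, { total, count }]) => ({ _id, avgScore: total / count }));
}

// Filter first, then group: plays the role of WHERE.
const cutoff = new Date("2012-07-11");
const whereLike = avgScoreByBook(sampleReviews.filter((r) => r.date >= cutoff));
console.log(whereLike);
// [ { _id: '345335', avgScore: 3.5 }, { _id: '974147', avgScore: 3 } ]

// Group first, then filter on the aggregate: plays the role of HAVING.
const havingLike = avgScoreByBook(sampleReviews).filter((g) => g.avgScore > 3);
console.log(havingLike);
// [ { _id: '345335', avgScore: 3.5 } ]
```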

　　Unwinding embedded arrays

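　　If the documents in a collection contain arrays, you can use the $unwind operator to spread each array across several documents, one per array element.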

// sample data
> db.articles.find();
[
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: ["science", "space", "iss"]
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: ["computing", "science"]
  }
]

# sample data
> SELECT * FROM article;
+----------+-----------------------------+
| id       | title                       |
+----------+-----------------------------+
| 12351254 | Space Is Getting Closer     |
| 22956492 | Computer Solves Rubiks Cube |
+----------+-----------------------------+
> SELECT * FROM tag;
+-----+------------+-----------+
| id  | article_id | name      |
+-----+------------+-----------+
| 534 | 12351254   | science   |
| 535 | 12351254   | space     |
| 536 | 12351254   | iss       |
| 816 | 22956492   | computing |
| 817 | 22956492   | science   |
+-----+------------+-----------+
> db.articles.aggregate([
  { $unwind: "$tags" }
]);
[
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: "science"
  },
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: "space"
  },
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: "iss"
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: "computing"
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: "science"
  }
]

> SELECT article.id, article.title, tag.name
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id;
+------------+-----------------------------+-----------+
| article.id | article.title               | tag.name  |
+------------+-----------------------------+-----------+
| 12351254   | Space Is Getting Closer     | science   |
| 12351254   | Space Is Getting Closer     | space     |
| 12351254   | Space Is Getting Closer     | iss       |
| 22956492   | Computer Solves Rubiks Cube | computing |
| 22956492   | Computer Solves Rubiks Cube | science   |
+------------+-----------------------------+-----------+

　　Aggregating unwound arrays

　　The real power of the aggregation framework shows when you pipe $unwind into $group. This is similar to using LEFT JOIN ... GROUP BY in SQL.

> db.articles.aggregate([
  { $unwind: "$tags" },
  { $group: {
    _id: "$tags",
    nbArticles: { $sum: 1 }
  } }
]);
[
  { _id: "science", nbArticles: 2 },
  { _id: "space", nbArticles: 1 },
  { _id: "iss", nbArticles: 1 },
  { _id: "computing", nbArticles: 1 },
]

> SELECT tag.name, COUNT(article.id) AS nb_articles
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id
  GROUP BY tag.name;
+-----------+-------------+
| tag.name  | nb_articles |
+-----------+-------------+
| science   | 2           |
| space     | 1           |
| iss       | 1           |
| computing | 1           |
+-----------+-------------+
> db.articles.aggregate([
  { $unwind: "$tags" },
  { $group: {
    _id: "$tags",
    articles: { $addToSet: "$_id" }
  } }
]);
[
  { _id: "science", articles: [12351254, 22956492] },
  { _id: "space", articles: [12351254] },
  { _id: "iss", articles: [12351254] },
  { _id: "computing", articles: [22956492] },
]

> SELECT tag.name, GROUP_CONCAT(article.id) AS articles
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id
  GROUP BY tag.name;
+-----------+-------------------+
| tag.name  | articles          |
+-----------+-------------------+
| science   | 12351254,22956492 |
| space     | 12351254          |
| iss       | 12351254          |
| computing | 22956492          |
+-----------+-------------------+
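
　　The unwind-then-group combination can also be sketched in memory with plain JavaScript. The code below is an analogy only: `flatMap` stands in for `$unwind`, a `Map` for `$group`, a counter for `$sum`, and a `Set` for `$addToSet`. It is not how MongoDB executes the pipeline, and the variable names are made up for the example.

```javascript
const sampleArticles = [
  { _id: 12351254, title: "Space Is Getting Closer", tags: ["science", "space", "iss"] },
  { _id: 22956492, title: "Computer Solves Rubiks Cube", tags: ["computing", "science"] },
];

// $unwind analogy: one output document per element of the tags array.
const unwound = sampleArticles.flatMap((article) =>
  article.tags.map((tag) => ({ ...article, tags: tag }))
);
console.log(unwound.length); // 5

// $group analogy: one entry per tag, with a count and a set of article ids.
const byTag = new Map();
for (const doc of unwound) {
  const group = byTag.get(doc.tags) || { _id: doc.tags, nbArticles: 0, articles: new Set() };
  group.nbArticles += 1;        // like { $sum: 1 }
  group.articles.add(doc._id);  // like { $addToSet: "$_id" }
  byTag.set(doc.tags, group);
}

// Convert each Set to an array so the result prints like the shell output.
const tagStats = [...byTag.values()].map((g) => ({ ...g, articles: [...g.articles] }));
console.log(tagStats);
```

　　Note that the real $addToSet makes no guarantee about the order of elements in the resulting array, whereas this sketch happens to keep insertion order.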

　　Conclusion

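　　Imagine what you can do with this feature: operators piped one after another can group, sort, limit, and so on. The MongoDB documentation has a telling example that builds a pipeline from two consecutive $group operators, something that requires a subquery in an SQL database.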
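
　　The example in the MongoDB documentation is not reproduced here. As a rough sketch of the same idea on the reviews collection used earlier, the pipeline below first averages scores per book, then averages those averages. The output field names `avgOfAvgs` and `nbBooks` are assumptions made for this illustration, and the pipeline is only defined, not run, since that needs a live MongoDB connection.

```javascript
// Two consecutive $group stages: the second one aggregates the output of the first.
const twoGroupPipeline = [
  // Stage 1: one document per book, with its average score.
  { $group: { _id: "$bookId", avgScore: { $avg: "$score" } } },
  // Stage 2: a single document summarizing all per-book averages.
  { $group: { _id: null, avgOfAvgs: { $avg: "$avgScore" }, nbBooks: { $sum: 1 } } },
];

// In the mongo shell, this would be run as: db.reviews.aggregate(twoGroupPipeline);
// A roughly equivalent SQL query needs a subquery:
//   SELECT AVG(avg_score), COUNT(*)
//   FROM (SELECT AVG(score) AS avg_score FROM review GROUP BY book_id) AS per_book;
console.log(JSON.stringify(twoGroupPipeline, null, 2));
```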

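　　If your MapReduce jobs are simple enough, you can refactor your MongoDB code to use the aggregation framework instead, and it will typically run faster.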