5.MongoDB之正则表达式与聚合框架

焱齿

已于 2022-02-19 10:37:31 修改

阅读量1.7k

点赞数 2

分类专栏： mongodb 文章标签： mongodb

于 2021-02-08 20:37:45 首次发布

本文链接：https://blog.csdn.net/mijichui2153/article/details/113762201

版权

mongodb 专栏收录该内容

20 篇文章 10 订阅

订阅专栏

关于正则表达式这里又称为模糊匹配；

关于聚合框架实际上应该包括聚合管道(aggregation pipeline)和MapReduce。

一、模糊查询与正则表达式

1、测试数据

#首先插入如下几条数据
db.mycollection.save({ "_id" : ObjectId("60176db81d49ac9562d09552"), "index_name" : "robot_date_webim_2852199203_821678010_801211737966432", "uniqueid" : "", "systime" : NumberLong("1612137600000000") })
db.mycollection.save({ "_id" : ObjectId("60176dbb1d49ac9562d09553"), "systime" : NumberLong("1612148154742301"), "time" : { "value" : NumberLong("1612148154742301"), "type" : 3 }, "index_name" : "robot_webim_2852199203_821678010_801211737966432", "session_id" : { "value" : "webim_2852199203_801211737966432_1612148147671", "type" : 2 }, "uniqueid" : "1612148154_8115996", "content" : { "value" : "问题11\n姚明有多高\n姚明有多瘦\n姚明有多矮\n周杰伦会跳舞\n周杰伦会唱歌", "type" : 2 }})
db.mycollection.save({ "_id" : ObjectId("60176dbe1d49ac9562d09554"), "content" : { "value" : "姚明有多高", "type" : 2 }, "uniqueid" : "1612148157_9306913", "session_id" : { "value" : "webim_2852199203_801211737966432_1612148147671", "type" : 2 }, "index_name" : "robot_webim_2852199203_821678010_801211737966432", "systime" : NumberLong("1612148157968382"), "time" : { "value" : NumberLong("1612148157968382"), "type" : 3 } })
db.mycollection.save({ "_id" : ObjectId("60176dbf1d49ac9562d09555"), "session_id" : { "value" : "webim_2852199203_801211737966432_1612148147671", "type" : 2 }, "time" : { "value" : NumberLong("1612148159149489"), "type" : 3 }, "index_name" : "robot_webim_2852199203_821678010_801211737966432", "uniqueid" : "1612148159_9679002", "systime" : NumberLong("1612148159149489"), "content" : { "value" : "高", "type" : 2 } })
db.mycollection.save({ "_id" : ObjectId("60176dc11d49ac9562d09556"), "index_name" : "robot_webim_2852199203_821678010_801211737966432", "uniqueid" : "1612148161_1992952", "systime" : NumberLong("1612148161059899"), "time" : { "type" : 3, "value" : NumberLong("1612148161059899") }, "session_id" : { "value" : "webim_2852199203_801211737966432_1612148147671", "type" : 2 }, "content" : { "value" : "刘翔想打篮球", "type" : 2 }})

2、find常规查询

（1）查询某字段包含关键子"string"的记录，等同于sql的 like ‘%string%’

#查询index_name包含关键字"date_webim"的记录(注:其中的关键词不能用双引号括起来)
db.mycollection.find({"index_name":/date_webim/})
db.mycollection.find({index_name:/date_webim/}) //这样也是可以的

#查询content.value字段包括关键词"姚明"的记录
db.mycollection.find({"content.value":/姚明/})

#再附加其他条件当然也是ok的
db.mycollection.find({systime:{"$gte":1612137600000000},"index_name":/date_webim/})

（2）查询某字段开头是"string"的记录，等同于sql的 like ‘string%’

#查询index_name开头是"robot"的记录
db.mycollection.find({"index_name":/^robot/})
db.mycollection.find({index_name:/^robot/})

#查询content.value开头是"姚明"的记录
db.mycollection.find({"content.value":/^姚明/})

（4）查询某字段结尾是"string"的记录，等同于sql的 like ‘%string’

#查询index_name结尾是"966432"的记录
db.mycollection.find({"index_name":/966432$/})
db.mycollection.find({index_name:/966432$/})

#查询content.value结尾是"篮球"的记录
db.mycollection.find({"content.value":/篮球$/})

3、$regex查询

使用$regex同样可以实现上述效果。

（1）$regex语法如下：

{ <field>: { $regex: /pattern/, $options: '<options>' } }
{ <field>: { $regex: 'pattern', $options: '<options>' } }
{ <field>: { $regex: /pattern/<options> } }

（2）$option选项

① i ：加上 "$options":"$i" 表示忽略大小写；

② m：会更改^和$元字符的默认行为，分别使用与行的开头结尾匹配，而不是与输入字符串的开头和结尾匹配。

③ x：忽略非转义的空白字符。设置x选项后正则表达式中的非转义的空白字符将会被忽略，同时井号(#)被解释为注释的开头注，只能显示位于option选项中。

④ s：单行匹配模式。设置s选项后会改变模式中的点号(.)元字符的默认行为，他会匹配所有字符，包括换行符(/n)，只能显示位于option选项中。

（3）$regex实现上述匹配

1）查询某字段包含关键子"string"的记录，等同于sql的 like ‘%string%’

db.mycollection.find({index_name:{$regex:"date_webim"}})
db.mycollection.find({"content.value":{$regex:"姚明"}})
db.mycollection.find({index_name:{$regex:"WEBIM",$options:$i}})

2）查询某字段开头是"string"的记录，等同于sql的 like ‘string%’

db.mycollection.find({"content.value":{$regex:"^姚明"}})

3）查询某字段结尾是"string"的记录，等同于sql的 like ‘%string’

db.mycollection.find({"content.value":{$regex:"唱歌$"}})

二、mongodb聚合框架参考

0、单一目的的聚合方法

最典型的就是count、distinct。

(1)count：统计满足查询条件的消息条数

(2)distinct：获取满足查询条件的指定字段的不重复值,以数组形式返回。

用法如下:

db.collection_name.distinct(field,query,options)
field:指定返回的字段（string）
query：条件查询（document）
option：其他的选项（document），查看options

使用举例：

db.collection_event_track_96.distinct('kfuin')
db.collection_event_track_96.distinct("kfuin")

db.collection_event_track_96.distinct("kfuin")
db.collection_event_track_96.distinct("kfuin",{systime:{$gt:16319337580000}})
db.collection_event_track_96.distinct("kfuin",{"$and":[{systime:{$gt:1631093568000000}},{systime:{$lt:1631933758000000}}]})

1、聚合管道(aggeregate pipeline)

聚合管道：顾名思义就是基于管道模式的聚合框架,简单来说就是在聚合管道中前一个阶段处理的结果会作为后一阶段处理的输入,documents通过管道中的各个阶段处理后得到最终的聚合结果。

0、一个实例

对于集合mkt_track_detail_878中的如下数据，可以通过如下pipeline语句聚合到

{
    "_id":"b143faa06c500f8c8e6a55c3ccb012f2",
    "kfuin":2852150878,
    "mkt_type":14,
    "sa_list":[
        "3_wx50dedeabf88ac326_o8jkBwHYScxHy-wvBsBZ3Fs4J9BA"
    ],
    "sensor_properties":{
        "$marketing_type":"服务号运营",
        "$name_of_marketing_event":"企点营销测试号",
        "$sdk_type":"openapi",
        "$utm_source":"公众号微信回调",
        "ContentType":"文本",
        "SendContent":"12314",
        "$appid":"wx50dedeabf88ac326",
        "$ip":"9.148.128.234"
    },
    "sensor_event":"粉丝发送消息",
    "sensor_type":"track",
    "session_id":"o8jkBwHYScxHy-wvBsBZ3Fs4J9BA_1619607958907",
    "time":1619608209000000
}

db.mkt_track_detail_878.aggregate(
[
	{
		$match:{
			kfuin:2852150878,
			sa_list:{"$in":["3_wx50dedeabf88ac326_o8jkBwHYScxHy-wvBsBZ3Fs4J9BA","1_1241133123"]},
			"sensor_event" : "粉丝发送消息",
			"sensor_properties.ContentType" : {
               "$in" : [ "文本", "文章" ]
            },
			"time" : {
               "$gte" : 1604061021000000
            }
		}
	},
	{
        "$group" : {
           "_id" : null,
           "count" : {
               "$sum" : 1
           }
        }
  	}
]
)

1、mongo聚合的stage与mysql对应

mongo聚合主要有如下stage。$count , $group, $match, $project, $unwind, $limit, $skip, $sort, $sortByCount, $lookup, $out, $addFields

SQL 操作/函数	mongodb聚合操作
where	$match	用于过滤数据,只输出符合条件的文档;$match使用MongoDB的标准查询操作。
group by	$group	将集合中的文档分组，可用于统计结果。
having	$match	在 SQL 中增加 HAVING 子句原因是，WHERE 关键字无法与合计函数一起使用。
select	$project	修改输入文档的结构。可用来重命令、增加或删除域，也可用于创建计算结果即嵌套文档。
order by	$sort	输出文档排序后输出
limit	$limit	用来限制MongoDB返回的文档树
sum()	$sum	求和累加
count()	$sum	行数汇总
join	$lookup （v3.2 新增）
	$count	返回包含输入到stage的文档的计数，理解为返回与表或试图的find()查询匹配的文档的计数。即不执行find()操作，而是计数并返回与查询匹配的结果数。

2、聚合过程

db.collection.aggregate()是基于数据处理的聚合管道，每个文档通过一个由多个阶段(stage)组成的管道，可以对每个阶段的管道进行分组、过滤等功能；然后通过一些列的处理，输出相应的结果。通过下图可以了解Aggregate处理的过程。

（1）db.collection.aggregate()可以用多个构建创建一个管道，对一连串的文档进行处理。这些构建包括：筛选操作match、映射操作project、分组操作group、排序操作sort、限制操作limit、跳过操作skip等。

（2）db.collection.aggregate()使用了MongoDB内置的原生操作，聚合效率非常高。通过这种方式SQL的group by的功能，即用户不编写JS脚本就可以实现聚合的复杂操作。

（3）每个阶段管道限制为100MB内存。如果一个节点管道超过这个极限，MongoDB会产生错误。为了处理大型数据集用户可以通过设置allowDiskUse为true方方式吧数据写入临时文件。这样就就解决了100M内存的限制。

（4）db.collection.aggregate()可以作用于分片集合，但结果不能输在分片集合；MapReduce可以作用于分片集合，结果也可以输在分片集合。

（5）db.collection.aggregate()方法可以返回一个指针（cursor），数据放在内存中，直接操作。跟Mongo shell 一样指针操作。
（6）db.collection.aggregate()输出的结果只能保存在一个文档中，BSON Document大小限制为16M。可以通过返回指针解决，版本2.6中后面：DB.collect.aggregate()方法返回一个指针，可以返回任何结果集的大小。

3、聚合的基本语法格式如下：

db.collection.aggregate(pipeline, options)

其中：

pipeline	array	在版本2.6中更改:该方法仍然可以接受管道阶段作为独立的参数，而不是数组中的元素;如果不将管道指定为数组，则不能指定options参数。
options	document	可选的只有在将管道指定为数组时才可用。

示例如下：

options选项文档可以包含一下字段和值：

`explain`	boolean	可选。指定返回关于管道处理的信息(ps:这个explain好像远远比不了find语句的explain方法)
`allowDiskUse`	boolean	可选。允许写入临时文件。当设置为true时，聚合操作可以将数据写入dbPath目录中的_tmp子目录
`cursor`	document	可选。指定游标的初始批处理大小。该`cursor` 字段的值是具有该字段的文档`batchSize`。
`maxTimeMS`	非负整数	可选。指定处理游标操作的时间限制（以毫秒为单位）。如果未指定maxTimeMS的值，则操作不会超时。值`0`显式指定默认的无界行为。 MongoDB使用与之相同的机制终止超出其分配时间限制的操作db.killOp()。MongoDB仅终止其指定中断点之一的操作。
`bypassDocumentValidation`	boolean	可选。只有在指定$out聚合操作符时才可用。使db.collection。聚合以在操作期间绕过文档验证。这样可以插入不满足验证要求的文档。
`readConcern`	document	可选的。指定读取问题。 readConcern选项具有以下语法： `readConcern: { level: <value> }` "local"。这是默认的读取关注级别。 "available"。当读取操作和afterClusterTime以及“level”未指定时，这是对辅助节点的读取的默认值。查询返回实例的最新数据。 "majority"。适用于使用WiredTiger存储引擎的副本集。 "linearizable"。仅适用于读取操作 primary。
`collation`	document	可选。指定要用于操作的排序规则。排序规则允许用户为字符串比较指定特定于语言的规则，例如字母和重音标记的规则。排序规则选项具有以下语法： `collation: {` `locale: <string>,` `caseLevel: <boolean>,` `caseFirst: <string>,` `strength: <int>,` `numericOrdering: <boolean>,` `alternate: <string>,` `maxVariable: <string>,` `backwards: <boolean>` `}` 指定排序规则时，该`locale`字段是必填字段; 所有其他校对字段都是可选的。有关字段的说明，请参阅排序文档。如果未指定排序规则但集合具有默认排序规则（请参阅参考资料db.createCollection()），则该操作将使用为集合指定的排序规则。如果没有为集合或操作指定排序规则，MongoDB使用先前版本中使用的简单二进制比较进行字符串比较。您无法为操作指定多个排序规则。例如，您不能为每个字段指定不同的排序规则，或者如果使用排序执行查找，则不能对查找使用一个排序规则，而对排序使用另一个排序规则。
`hint`	string or document	可选。用于聚合的索引。索引位于运行聚合的初始集合/视图上。通过索引名称或索引规范文档指定索引。注意该`hint`不适$lookup和 $graphLookup阶段。
`comment`	string	可选。用户可以指定任意字符串，以帮助通过数据库事件探查器，currentOp和日志跟踪操作。

可能报错：

1、聚合结果限制16M以内，否则报错。

Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in

原因是聚合的结果必须要限制在16M以内操作，（mongodb支持的最大影响信息的大小），否则必须放在磁盘中做缓存（allowDiskUse=True）。其实在[]后面加上",{allowDiskUse:true}"就好了。

应用举例一：

#首先插入如下测试数据：
db.mycollection.save({ "_id" : 1, "subject" : "History", "score" : 88 })
db.mycollection.save({ "_id" : 2, "subject" : "History", "score" : 92 })
db.mycollection.save({ "_id" : 3, "subject" : "History", "score" : 97 })
db.mycollection.save({ "_id" : 4, "subject" : "History", "score" : 71 })
db.mycollection.save({ "_id" : 5, "subject" : "History", "score" : 79 })
db.mycollection.save({ "_id" : 6, "subject" : "History", "score" : 83 })

（1）$count：

#语法
{$count: <string>}  #后面的string你给输出设置的属性

执行：
1）$match 阶段排除score小于等于80的文档，将大于80的文档传到下个阶段
2）$count阶段返回聚合管道中剩余文档的计数，并将该值分配给名为passing_scores的字段。
执行结果：

mongos> db.mycollection.aggregate(
    [
        {$match:{score:{"$gt":80}}},
        {$count:"number_of_pass_80"}
    ]
)
{ "number_of_pass_80" : 4 }

（2）$group

释义：按指定的表达式对文档进行分组，并将不同分组的文档输出到下一个阶段。输出的的文档包含一个_id字段，该字段按键包含不同的组。输出文档还可以包含计算字段，该字段保存由$group的_id字段分布的一些accumulator表达式的值。$group不会输出具体的文档而只是统计信息。

#语法
{ $group: { _id: <expression>, <field1>: { <accumulator1> : <expression1> }, ... } }

1）_id字段是必填的；但是可以通过为其制定为null的方式来为整个输出文档计算累计值。

2）剩余的计算字段是可选的，并使用<accumulator>运算符进行计算。

3）_id和<accumulator>表达式可以接受任何有效的表达式。accumulator支持的操作符如下：

名称	描述	类比sql
$avg	计算均值	avg
$first	返回每组第一个文档，如果有排序，按照排序，如果没有按照默认的存储的顺序的第一个文档。	limit 0,1
$last	返回每组最后一个文档，如果有排序，按照排序，如果没有按照默认的存储的顺序的最后个文档。	-
$max	根据分组，获取集合中所有文档对应值得最大值。	max
$min	根据分组，获取集合中所有文档对应值得最小值。	min
$push	将指定的表达式的值添加到一个数组中。	-
$addToSet	将表达式的值添加到一个集合中（无重复值，无序）。	-
$sum	计算总和	sum
$stdDevPop	返回输入值的总体标准偏差（population standard deviation）	-
$stdDevSamp	返回输入值的样本标准偏差（the sample standard deviation）	-

$group阶段的内存限制为100M。默认情况下，如果stage超过此限制，$group将产生错误。但是，要允许处理大型数据集，请将allowDiskUse选项设置为true以启用$group操作以写入临时文件。

注：如果想让输出好看的话就在结尾加个.pretty()。

#首先插入如下待测数据
db.mycollection.save({ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-03-01T08:00:00Z") })
db.mycollection.save({ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-03-01T09:00:00Z") })
db.mycollection.save({ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-03-15T09:00:00Z") })
db.mycollection.save({ "_id" : 4, "item" : "xyz", "price" : 5, "quantity" : 20, "date" : ISODate("2014-04-04T11:21:39.736Z") })
db.mycollection.save({ "_id" : 5, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-04-04T21:23:13.331Z") })

#一、接下来一月份/日期/年份对文档进行分组，并计算每个分组的总价totalPrice、平均数量average quantity；并且计算每个组的文档数量
mongos> db.mycollection.aggregate(
[
	{
		$group:{
			_id: {
				month:{$month:"$date"},
				day:{$dayOfMonth:"$date"},
				year:{$year:"$date"}
			},
			totalPrice:{$sum:{$multiply:["$price","$quantity"]}},
			averageQuantity:{$avg:"$quantity"},
			count:{$sum:1}
		}
	}
]
).pretty()

{
        "_id" : {
                "month" : null,
                "day" : null,
                "year" : null
        },
        "totalPrice" : 0,
        "averageQuantity" : null,
        "count" : 1
}
{
        "_id" : {
                "month" : 4,
                "day" : 4,
                "year" : 2014
        },
        "totalPrice" : 200,
        "averageQuantity" : 15,
        "count" : 2
}
{
        "_id" : {
                "month" : 3,
                "day" : 15,
                "year" : 2014
        },
        "totalPrice" : 50,
        "averageQuantity" : 10,
        "count" : 1
}
{
        "_id" : {
                "month" : 3,
                "day" : 1,
                "year" : 2014
        },
        "totalPrice" : 40,
        "averageQuantity" : 1.5,
        "count" : 2
}


#二、group null。以下聚合操作将指定组_id为null；其效果是计算集合中所有文档的总价格和平均数量及其技术。

db.mycollection.aggregate(
[
	{
		$group:{
			_id: null,
			totalPrice:{$sum:{$multiply:["$price","$quantity"]}},
			averageQuantity:{$avg:"$quantity"},
			count:{$sum:1}
		}
	}
]
)

{
    "_id" : null, 
    "totalPrice" : 290, 
    "averageQuantity" : 8.6, 
    "count" : 6 
}


#三、使用$group按item字段对文档进行分组(不展示其他字段)
db.mycollection.aggregate(
[
	{
		$group:{
			_id:"$item"
		}

	}
]
)

{ "_id" : null }
{ "_id" : "xyz" }
{ "_id" : "jkl" }
{ "_id" : "abc" }

#四、数据转换——将集合中的数据按price分组转换成item数组(items可以自定义,是分组后的列表)
db.mycollection.aggregate(
[
	{
		$group:{
			_id:"$price",
			items:{$push:"$item"}
		}

	}
]
)

{ "_id" : null, "items" : [ ] }
{ "_id" : 5, "items" : [ "xyz", "xyz" ] }
{ "_id" : 20, "items" : [ "jkl" ] }
{ "_id" : 10, "items" : [ "abc", "abc" ] }

注：效果就是以price进行分组，列出每个price下面都有哪些个item。


#四、数据转换——以item分组,使用$$ROOT系统变量
db.mycollection.aggregate(
[
	{
		$group:{
			_id:"$item",
			items:{$push:"$$ROOT"}
		}

	}
]
)

{
        "_id" : "xyz",
        "items" : [
                {
                        "_id" : 3,
                        "item" : "xyz",
                        "price" : 5,
                        "quantity" : 10,
                        "date" : ISODate("2014-03-15T09:00:00Z")
                },
                {
                        "_id" : 4,
                        "item" : "xyz",
                        "price" : 5,
                        "quantity" : 20,
                        "date" : ISODate("2014-04-04T11:21:39.736Z")
                }
        ]
}
{
        "_id" : "jkl",
        "items" : [
                {
                        "_id" : 2,
                        "item" : "jkl",
                        "price" : 20,
                        "quantity" : 1,
                        "date" : ISODate("2014-03-01T09:00:00Z")
                }
        ]
}
{
        "_id" : "abc",
        "items" : [
                {
                        "_id" : 1,
                        "item" : "abc",
                        "price" : 10,
                        "quantity" : 2,
                        "date" : ISODate("2014-03-01T08:00:00Z")
                },
                {
                        "_id" : 5,
                        "item" : "abc",
                        "price" : 10,
                        "quantity" : 10,
                        "date" : ISODate("2014-04-04T21:23:13.331Z")
                }
        ]
}

注：可见系统变量$$ROOT的效果就是元素所有属性的意思

注：其实仔细对比删除查询语句和输出就知道输出的数据的结构完全是根据查询语句来的。

（3）$match

释义：过滤文档，仅将符合指定条件的文档传递到下一个管道阶段。

$match接受一个指定查询条件的文档；查询语法与读操作查询语法相同。

{ $match: { <query> } }

$match用于对文档进行筛选，之后可以在得到的文档子集上做聚合，$match可以使用除了地理空间以外的所有常规查询操作符。在实际应用中应尽可能的将$match放在管道前面的位置。这样的做的好处和明显。1）可以快速的将不需要的文档过滤掉，减少后续管道的工作量 2）在投射和分值之前执行$match，查询可以使用索引。

注意：由上面可知$match的位置和使用次数都不是定死的。举个例子,你可以先对输入文档进行一个match过滤，然后进行了一个XX操作；操作完后依然可以在使用一次match进行过滤得到最终结果。

限制：

1)不能再$match查询中使用$作为聚合管道的一部分；

2)要在$match阶段使用$text，$match阶段必须是管道的第一阶段。

3)视图不支持文本搜索。

#首先插入如下记录
db.mycollection.save({ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 })
db.mycollection.save({ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 })
db.mycollection.save({ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 })
db.mycollection.save({ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 })
db.mycollection.save({ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 })
db.mycollection.save({ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 })
db.mycollection.save({ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 })

#一、筛选anthor是dove的记录
db.mycollection.aggregate(
[
	{
		$match:{author:"dave"}
	}
]
)

{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }


#二、使用$match选择要处理的文档，然后将结果输出到$group管道以计算文档的数量:
db.mycollection.aggregate(
[
	{
		$match:{$or:[{score:{$gt:70,$lt:90}},{views:{$gte:1000}}]}
	},
	{
		$group:{
			_id:null,
			count:{$sum:1}
		}
	}

]
)

{ "_id" : null, "count" : 5 }

（4）$unwind

释义：从输出文档解构数组字段以输出每个元素的文档。简单来说就是将指定的数组拆分以其中每个元素替换原数组形成len(数组长度)个文档（看个例子就明白了）。

#语法。unwind:解开松开的意思
{ $unwind: <field path> }

#首先插入如下一条数据
db.mycollection.save({ "_id" : 1, "item" : "ABC1", sizes: [ "S", "M", "L"] })


#执行unwind语句
db.mycollection.aggregate(
[
	{$unwind:"$sizes"}
]
)

#效果如下：
{ "_id" : 1, "item" : "ABC1", "sizes" : "S" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "M" }
{ "_id" : 1, "item" : "ABC1", "sizes" : "L" }

由上图可知其效果为：每个文档与输入文档相同，只不过用sizes字段中的每个元素替换原sizes的值形成了一些列的文档。

（5）$project

释义：$project可以从文档中选择想要的字段和不想要的字段。和mongo命令行语句中的projection意思一样。

#语法
{ $project: { <specification(s)> } }

$project管道符的作用是选择字段、重命名字段、派生字段。

specifications有如下形式：

<field>:<1/0 or true、false> 是否包含该字段。1/true表示选择，0/false表示不选择子field。

#数据依然采用前一个延时的数据

#一、通过$project stage仅输出文档的author/views字段,不输出_id字段。
db.mycollection.aggregate(
[
	{$project:{author:1,views:1,_id:0}}
]
)

{ "author" : "dave", "views" : 100 }
{ "author" : "dave", "views" : 521 }
{ "author" : "ahn", "views" : 1000 }
{ "author" : "li", "views" : 5000 }
{ "author" : "annT", "views" : 50 }
{ "author" : "li", "views" : 999 }
{ "author" : "ty", "views" : 1000 }

（6）$limit

释义：限制传到管道中下一阶段的文档数

#语法
{ $limit: <positive integer> }

#同样采用前面已经存在数据

#仅选择3条记录
db.mycollection.aggregate(
[
	{$limit:3}
]
)

{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }

注意：

（7）$skip

释义：跳过进入stage的指定数量的文档，将其余的文档传递到管道中的下一个阶段。

#语法
{ $skip: <positive integer> }

#依然采用上面的数据

#首先输出全部数据
db.mycollection.aggregate()
{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }

#skip前四条数据
db.mycollection.aggregate(
[
	{$skip:4}
]
)
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }

（8）$sort

释义：对所有输入文档进行排序，并按排序顺序将他们返回到管道。

#语法
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }

$sort指定要排序的字段和相应的排序顺序。<sort order>取值如下：

1) 1 指定升序

2) -1 指定降序

3) {$mata:"textScore"} 按照降序排序计算出的textScore数据。

#依然采用上面的测试数据

#一、score第一优先级升序排序;views第二优先级升序。
db.mycollection.aggregate(
[
	{$sort:{score:1,views:1}}
]
)
{ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }
{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }

#二、views第一优先级升序排序;score第二优先级升序。
db.mycollection.aggregate(
[
	{$sort:{views:1,score:1}}
]
)
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b259"), "author" : "annT", "score" : 60, "views" : 50 }
{ "_id" : ObjectId("512bc95fe835e68f199c8686"), "author" : "dave", "score" : 80, "views" : 100 }
{ "_id" : ObjectId("512bc962e835e68f199c8687"), "author" : "dave", "score" : 85, "views" : 521 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25a"), "author" : "li", "score" : 94, "views" : 999 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b257"), "author" : "ahn", "score" : 60, "views" : 1000 }
{ "_id" : ObjectId("55f5a1d3d4bede9ac365b25b"), "author" : "ty", "score" : 95, "views" : 1000 }
{ "_id" : ObjectId("55f5a192d4bede9ac365b258"), "author" : "li", "score" : 55, "views" : 5000 }

注：由上述事例可知$sort后面出现的各个属性的顺序就是就是排序的优先级。

（9）$sortByCount

释义：根据指定表达式的值对传入的文档分组，然后计算每个不同组中文档的数量。

应用举例二：

#首先向mongodb中插入以下记录
db.mycollection.save({ "_id" : 1, "cust_id" : "abc1", "ord_date" : ISODate("2012-11-02T17:04:11.102Z"), "status" : "A", "amount" : 50 })
db.mycollection.save({ "_id" : 2, "cust_id" : "xyz1", "ord_date" : ISODate("2013-10-01T17:04:11.102Z"), "status" : "A", "amount" : 100 })
db.mycollection.save({ "_id" : 3, "cust_id" : "xyz1", "ord_date" : ISODate("2013-10-12T17:04:11.102Z"), "status" : "D", "amount" : 25 })
db.mycollection.save({ "_id" : 4, "cust_id" : "xyz1", "ord_date" : ISODate("2013-10-11T17:04:11.102Z"), "status" : "D", "amount" : 125 })
db.mycollection.save({ "_id" : 5, "cust_id" : "abc1", "ord_date" : ISODate("2013-11-12T17:04:11.102Z"), "status" : "A", "amount" : 25 })

1、统计所有数量amount的和
db.getCollection("mycollection").aggregate(
	[
		{
			$group: {
				_id: null,  
				count: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : null, "count" : 325 }

2、对每个cust_id统计数量amount和
db.getCollection("mycollection").aggregate(
	[
		{
			$group: {
				_id: "$cust_id",  
				count: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : "xyz1", "count" : 250 }
{ "_id" : "abc1", "count" : 75 }

3、对每一个唯一的cust_id和ord_date对进行分组,计算amount总和
(1)根据cust_id和年份唯一对进行分组
db.getCollection("mycollection").aggregate(
	[
		{
			$group: {
				_id: {
					cust_id: "$cust_id",
					ord_date:{
						year: {$year: "$ord_date"}
					}
				},  
				count: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2013 } }, "count" : 25 }
{ "_id" : { "cust_id" : "xyz1", "ord_date" : { "year" : 2013 } }, "count" : 250 }
{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2012 } }, "count" : 50 }



(2)根据cust_id+年份+月份唯一对进行分组
db.getCollection("mycollection").aggregate(
	[
		{
			$group: {
				_id: {
					cust_id: "$cust_id",
					ord_date:{
						year: {$year: "$ord_date"},
						month: {$month: "$ord_date"}
					}
				},  
				count: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "xyz1", "ord_date" : { "year" : 2013, "month" : 10 } }, "count" : 250 }
{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2013, "month" : 11 } }, "count" : 25 }
{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2012, "month" : 11 } }, "count" : 50 }


(3)根据cust_id+年份+月份+天唯一对进行分组
db.getCollection("mycollection").aggregate(
	[
		{
			$group: {
				_id: {
					cust_id: "$cust_id",
					ord_date:{
						year: {$year: "$ord_date"},
						month: {$month: "$ord_date"},
						day: {$dayOfMonth: "$ord_date"}
					}
				},  
				count: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2013, "month" : 11, "day" : 12 } }, "count" : 25 }
{ "_id" : { "cust_id" : "abc1", "ord_date" : { "year" : 2012, "month" : 11, "day" : 2 } }, "count" : 50 }
{ "_id" : { "cust_id" : "xyz1", "ord_date" : { "year" : 2013, "month" : 10, "day" : 11 } }, "count" : 125 }
{ "_id" : { "cust_id" : "xyz1", "ord_date" : { "year" : 2013, "month" : 10, "day" : 1 } }, "count" : 100 }
{ "_id" : { "cust_id" : "xyz1", "ord_date" : { "year" : 2013, "month" : 10, "day" : 12 } }, "count" : 25 }


4、根据cust_id进行分组统计其数量后 刷选数其中数量大于2的统计结果
(1)没过滤的结果
db.getCollection("mycollection").aggregate(
	[
		{
			$group:{
				_id:"$cust_id",
				count:{$sum:1}
			}
		}
	]
)
(2)过滤后的效果
db.getCollection("mycollection").aggregate(
	[
		{
			$group:{
				_id:"$cust_id",
				count:{$sum:1}
			}
		},
		{
			$match:{
				count: {$gt: 2}
			}
		}
	]
)

5、对每个cust_id、年份唯一分组 计算数量总和;然后仅仅返回累加和大于25的记录。

(1)没有过滤的效果
db.getCollection("mycollection").aggregate(
	[
		{
			$group:{
				_id:{
					cust_id: "$cust_id",
					year: {$year: "$ord_date"}
				},
				total:{$sum:"$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "abc1", "year" : 2013 }, "total" : 25 }
{ "_id" : { "cust_id" : "xyz1", "year" : 2013 }, "total" : 250 }
{ "_id" : { "cust_id" : "abc1", "year" : 2012 }, "total" : 50 }

(2)有过滤的效果
db.getCollection("mycollection").aggregate(
	[
		{
			$group:{
				_id:{
					cust_id: "$cust_id",
					year: {$year: "$ord_date"}
				},
				total:{$sum:"$amount"}
			}
		},
		{
			$match:{
				total: {$gt: 25}
			}
		}
	]
)

{ "_id" : { "cust_id" : "xyz1", "year" : 2013 }, "total" : 250 }
{ "_id" : { "cust_id" : "abc1", "year" : 2012 }, "total" : 50 }


6、对status=A的记录根据cust_id分组计算amount总和
(1)未根据status过滤的效果如下
db.getCollection("mycollection").aggregate(
	[
		{
			$group:{
				_id: {      
					cust_id: "$cust_id"
				},
				total: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "xyz1" }, "total" : 250 }
{ "_id" : { "cust_id" : "abc1" }, "total" : 75 }

(2)根据status过滤的效果如下
//这里的_id直接写成 "_id: "$cust_id","也是可以的
db.getCollection("mycollection").aggregate(
	[
		{
			$match:{
				status: {$eq: "A"}
			}
		},
		{
			$group:{
				_id: {      
					cust_id: "$cust_id"
				},
				total: {$sum: "$amount"}
			}
		}
	]
)

{ "_id" : { "cust_id" : "xyz1" }, "total" : 100 }
{ "_id" : { "cust_id" : "abc1" }, "total" : 75 }



7、对status=A的记录根据cust_id分组计算amount总和 仅展示总数大于等于100的记录
/*
    就是说match等这些可以有多个,根据自然语义排布就是可以的。
*/
db.getCollection("mycollection").aggregate(
	[
		{
			$match:{
				status: {$eq: "A"}
			}
		},
		{
			$group:{
				_id: {      
					cust_id: "$cust_id"
				},
				total: {$sum: "$amount"}
			}
		},
		{
			$match:{
				total:{$gte:100}
			}
		}
	]
)

{ "_id" : { "cust_id" : "xyz1" }, "total" : 100 }

对一个集合的数据按照kfuin进行分组：

db.collection_event_track_649.aggregate(
[
	{
        "$group" : {
        	"_id":"$kfuin",
           "count" : {
               "$sum" : 1
           }
        }
  	}
]
)

2、MapReduce

MapReduce：mapreduce是一种编程模型，用于大规模数据集的并行运算。MapReduce采用分而治之的思想，简单来说就是：将待处理的大数据集分为若干个小数据集，对各个小数据集进行计算获取中间结果，最后整合中间结果获取最终的结果。mongodb也实现了mapreduce，用法还是比较简单的，语法如下：

db.collection.mapReduce(
   function() {emit(key,value);},                  //map 函数
   function(key,values) {return reduceFunction},   //reduce 函数
   {
      out: collection,   //输出
      query: document,   //查询条件，在Map函数执行前过滤出符合条件的documents
      sort: document,    //再Map前对documents进行排序
      limit: number      //发送给map函数的document条数
   }
)

mongoDB的MapReduce可以简单分为两个阶段：

Map阶段：

　　栗子中的map函数为 function() {emit(this.cust_id,this.amount)} ，执行map函数前先进行query过滤，找到 status=A 的documents，然后将过滤后的documents发送给map函数，map函数遍历documents，将cust_id作为key，amount作为value，如果一个cust_id有多个amount值时，value为数组[amount1,amount2..]，栗子的map函数获取的key/value集合中有两个key/value对： {“A123”:[500,250]}和{“B212”:200}

Reduce阶段:

　　reduce函数封装了我们的聚合逻辑，reduce函数会逐个计算map函数传过去的key/value对，在上边栗子中的reduce函数的目的是计算amount的总和。

　　上边栗子最终结果存放在集合myresultColl中(out的作用)，通过命令 db.myresultColl.find() 查看结果如下：

[
        {
                "_id" : "A123",
                "value" : 750
        },
        {
                "_id" : "B212",
                "value" : 200
        }
]

　　MapReduce属于轻量级的集合框架，没有聚合管道那么多运算符，聚合的思路也比较简单：先写一个map函数用于确定需要整合的key/value对(就是我们感兴趣的字段)，然后写一个reduce函数，其内部封装着我们的聚合逻辑，reduce函数会逐一处理map函数获取的key/value对，以此获取聚合结果。

小结

　　本节通过几个栗子简单介绍了mongoDB的聚合框架：集合管道和MapReduce，聚合管道提供了十分丰富的运算符，让我们可以方便地进行各种聚合操作，因为聚合管道使用的是mongoDB内置函数所以计算效率一般不会太差。需要注意：①管道处理结果会放在一个document中，所以处理结果不能超过16M(Bson格式的最大尺寸)，②聚合过程中内存不能超过100M(可以通过设置{“allowDiskUse”: True}来解决)；MapReduce的map函数和reduce函数都是我们自定义的js函数，这种聚合方式比聚合管道更加灵活，我们可以通过编写js代码来处理复杂的聚合任务，MapReduce的缺点是聚合的逻辑需要我们自己编码实现。综上，对于一些简单的固定的聚集操作推荐使用聚合管道，对于一些复杂的、大量数据集的聚合任务推荐使用MapReduce。

焱齿

关注

2
点赞
踩
13

收藏

觉得还不错? 一键收藏
打赏
0
评论
5.MongoDB之正则表达式与聚合框架

1、mongo与mysql集合类对比SQL 操作/函数 mongodb聚合操作 where $match 用于过滤数据,只输出符合条件的文档;$match使用MongoDB的标准查询操作。 group by $group 将集合中的文档分组，可用于统计结果。 having $match 在 SQL 中增加 HAVING 子句原因是，WHERE 关键字无法与合计函数一起使用。 select $project 修改输入文档...
复制链接

扫一扫