MongoDB Aggregation Operator: $regexFindAll

原子星

Article link: https://blog.csdn.net/superatom01/article/details/137755869

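$regexFindAll provides regular expression (regex) pattern matching in aggregation expressions. The operator returns an array of documents that contains information on each match. If no match is found, it returns an empty array.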

Before MongoDB 4.2, an aggregation pipeline could use only the query operator $regex, and only in the $match stage.

Syntax

{ $regexFindAll: { input: <expression> , regex: <expression>, options: <expression> } }

The operator takes a single document argument with the fields described below.

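Field descriptions

input: The string on which to apply the regex pattern. Can be a string or any valid expression that resolves to a string.
regex: The regex pattern to apply. Can be any valid expression that resolves to either a string or a regex pattern /<pattern>/. When using the /<pattern>/ form, you can also specify the regex options i and m (but not the s or x options):
- "pattern"
- /<pattern>/
- /<pattern>/<options>
  Alternatively, you can specify the regex options with the options field. To specify the s or x options, you must use the options field.
  You cannot specify options in both the regex and the options fields.
options: Optional. The following <options> are available for use with the regular expression.
- i: Case insensitivity, matching both upper and lower case. You can specify this option in the options field or as part of the regex field.
- m: For patterns that include anchors (that is, ^ for the start and $ for the end), match at the beginning or end of each line for strings with multiline values. Without this option, these anchors match at the beginning or end of the string. If the pattern contains no anchors or if the string value has no newline characters (such as \n), the m option has no effect.
- x: "Extended" capability to ignore all whitespace characters in the pattern unless they are escaped or included in a character class. It also ignores the characters between an unescaped hash/pound (#) character and the next newline, inclusive, so that you can include comments in complicated patterns. This applies only to data characters; whitespace characters may never appear within special character sequences in a pattern. The x option does not affect the handling of the VT character (that is, code 11). You can specify this option only in the options field.
- s: Allows the dot character (that is, .) to match all characters, including newline characters. You can specify this option only in the options field.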

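Return value

The operator returns an array:

If the operator does not find a match, it returns an empty array.
If the operator finds a match, it returns an array of documents that contains the following information for each match:
- the matching string in the input,
- the code point index (not byte index) of the matching string in the input, and
- an array of the strings that correspond to the groups captured by the matching string. Capture groups are specified with unescaped parentheses () in the regex pattern.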

[ { "match" : <string>, "idx" : <num>, "captures" : <array of strings> }, ... ]
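
The return shape can also be illustrated outside the database. The following is a minimal sketch in plain JavaScript, not MongoDB's implementation: `regexFindAllSketch` is a hypothetical helper built on `String.prototype.matchAll`, and JavaScript's regex engine differs from PCRE2 in some details, so it should be read only as a model of the `match` / `idx` / `captures` structure.

```javascript
// Hypothetical helper that mimics the result shape of $regexFindAll.
// Illustration only: it does not call MongoDB.
function regexFindAllSketch(input, pattern, flags = "") {
  // matchAll requires the global flag; "u" makes the regex work on code points.
  const re = new RegExp(pattern, "gu" + flags);
  return Array.from(input.matchAll(re), (m) => ({
    match: m[0],
    // Convert the UTF-16 offset of the match into a code point index.
    idx: Array.from(input.slice(0, m.index)).length,
    // Groups that did not take part in the match become null.
    captures: m.slice(1).map((g) => (g === undefined ? null : g)),
  }));
}

console.log(regexFindAllSketch("First lines\nsecond line", "line"));
console.log(regexFindAllSketch("Colleen", "(C(ar)*)ol"));
console.log(regexFindAllSketch("Luna", "(C(ar)*)ol"));
```

The first call yields two entries (idx 6 and 19) with empty captures, the second yields one entry whose captures are "C" and null, and the third yields an empty array.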

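Behavior

PCRE library

Starting in version 6.1, MongoDB uses the PCRE2 (Perl Compatible Regular Expressions) library to implement regular expression pattern matching.

$regexFindAll and collation

$regexFindAll ignores the collation specified for the collection, for db.collection.aggregate(), and for the index (if used).

For example, create a sample collection with collation strength 1 (that is, compare base characters only and ignore other differences such as case and diacritics):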

db.createCollection( "myColl", { collation: { locale: "fr", strength: 1 } } )

Insert the following documents:

db.myColl.insertMany([
   { _id: 1, category: "café" },
   { _id: 2, category: "cafe" },
   { _id: 3, category: "cafE" }
])

Using the collection's collation, the following operation performs a case-insensitive and diacritic-insensitive match:

db.myColl.aggregate( [ { $match: { category: "cafe" } } ] )

The operation returns the following 3 documents:

{ "_id" : 1, "category" : "café" }
{ "_id" : 2, "category" : "cafe" }
{ "_id" : 3, "category" : "cafE" }

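However, the aggregation expression $regexFindAll ignores collation; that is, the following regex pattern matching examples are case-sensitive and diacritic-sensitive: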

db.myColl.aggregate( [ { $addFields: { results: { $regexFindAll: { input: "$category", regex: /cafe/ }  } } } ] )
db.myColl.aggregate(
   [ { $addFields: { results: { $regexFindAll: { input: "$category", regex: /cafe/ }  } } } ],
   { collation: { locale: "fr", strength: 1 } }     // Ignored by $regexFindAll
)

Both operations return the following:

{ "_id" : 1, "category" : "café", "results" : [ ] }
{ "_id" : 2, "category" : "cafe", "results" : [ { "match" : "cafe", "idx" : 0, "captures" : [ ] } ] }
{ "_id" : 3, "category" : "cafE", "results" : [ ] }

To perform case-insensitive regex pattern matching, use the i option instead.
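
A minimal sketch of that suggestion, assuming the myColl documents above. The pipeline is only defined here (in mongosh it would be passed to db.myColl.aggregate()), and the local preview uses JavaScript's own regex engine as an approximation of PCRE2. The names `pipeline`, `categories` and `matched` are illustrative.

```javascript
// Pipeline using the i option; in mongosh: db.myColl.aggregate(pipeline)
const pipeline = [
  { $addFields: { results: { $regexFindAll: { input: "$category", regex: /cafe/, options: "i" } } } }
];

// Local preview: the i option relaxes case only, not diacritics.
const categories = ["caf\u00e9", "cafe", "cafE"]; // "\u00e9" is a precomposed é
const matched = categories.filter((c) => /cafe/i.test(c));
console.log(matched); // [ 'cafe', 'cafE' ]
```

Even with the i option, "café" is not expected to match, because case-insensitivity does not make the pattern diacritic-insensitive.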

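Capture output

If the regex pattern contains capture groups and the pattern finds a match in the input, the captures array in the results corresponds to the groups captured by the matching string. Capture groups are specified with unescaped parentheses () in the regex pattern. The length of the captures array equals the number of capture groups in the pattern, and the order of the array matches the order in which the capture groups appear.

Create the contacts collection with the following script: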

db.contacts.insertMany([
  { "_id": 1, "fname": "Carol", "lname": "Smith", "phone": "718-555-0113" },
  { "_id": 2, "fname": "Daryl", "lname": "Doe", "phone": "212-555-8832" },
  { "_id": 3, "fname": "Polly", "lname": "Andrews", "phone": "208-555-1932" },
  { "_id": 4, "fname": "Colleen", "lname": "Duncan", "phone": "775-555-0187" },
  { "_id": 5, "fname": "Luna", "lname": "Clarke", "phone": "917-555-4414" }
])

The following pipeline applies the regex pattern /(C(ar)*)ol/ to the fname field:

db.contacts.aggregate([
  {
    $project: {
      returnObject: {
        $regexFindAll: { input: "$fname", regex: /(C(ar)*)ol/ }
      }
    }
  }
])

The regex pattern finds a match in the fname values Carol and Colleen:

{ "_id" : 1, "returnObject" : [ { "match" : "Carol", "idx" : 0, "captures" : [ "Car", "ar" ] } ] }
{ "_id" : 2, "returnObject" : [ ] }
{ "_id" : 3, "returnObject" : [ ] }
{ "_id" : 4, "returnObject" : [ { "match" : "Col", "idx" : 0, "captures" : [ "C", null ] } ] }
{ "_id" : 5, "returnObject" : [ ] }

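The pattern contains the capture group (C(ar)*), which contains the nested group (ar). The elements of the captures array correspond to the two capture groups. If a group captures nothing in a matching document (for example, Colleen and the group (ar)), $regexFindAll replaces that group with a null placeholder.

As the previous example shows, the captures array contains one element for each capture group (null for groups that captured nothing). The following example searches for phone numbers with New York City area codes by applying a logical OR of capture groups to the phone field. Each group represents one New York City area code: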

db.contacts.aggregate([
  {
    $project: {
      nycContacts: {
        $regexFindAll: { input: "$phone", regex: /^(718).*|^(212).*|^(917).*/ }
      }
    }
  }
])

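For documents that match the regex pattern, the captures array contains the matching capture group and null for every group that captured nothing: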

{ "_id" : 1, "nycContacts" : [ { "match" : "718-555-0113", "idx" : 0, "captures" : [ "718", null, null ] } ] }
{ "_id" : 2, "nycContacts" : [ { "match" : "212-555-8832", "idx" : 0, "captures" : [ null, "212", null ] } ] }
{ "_id" : 3, "nycContacts" : [ ] }
{ "_id" : 4, "nycContacts" : [ ] }
{ "_id" : 5, "nycContacts" : [ { "match" : "917-555-4414", "idx" : 0, "captures" : [ null, null, "917" ] } ] }
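
If only the matched area code matters, one possible variation (an assumption, not part of the original example) is to place the alternatives inside a single capture group, so that captures always has exactly one element, and then drop non-matching documents with $match. The pipeline below is only defined, not executed; `nycRegex` and `nycPipeline` are illustrative names.

```javascript
// One capture group holding all three area codes.
const nycRegex = /^(718|212|917).*/;

const nycPipeline = [
  { $project: { nycContacts: { $regexFindAll: { input: "$phone", regex: nycRegex } } } },
  // Keep only documents whose result array is not empty.
  { $match: { nycContacts: { $ne: [] } } }
];
// In mongosh: db.contacts.aggregate(nycPipeline)

// Local check of the pattern with JavaScript's regex engine.
console.log("212-555-8832".match(nycRegex)[1]); // 212
console.log(nycRegex.test("208-555-1932"));     // false
```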

Examples

$regexFindAll and its options

Create the products collection with the following script:

db.products.insertMany([
   { _id: 1, description: "Single LINE description." },
   { _id: 2, description: "First lines\nsecond line" },
   { _id: 3, description: "Many spaces before     line" },
   { _id: 4, description: "Multiple\nline descriptions" },
   { _id: 5, description: "anchors, links and hyperlinks" },
   { _id: 6, description: "métier work vocation" }
])

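By default, $regexFindAll performs a case-sensitive match. For example, the following aggregation performs a case-sensitive $regexFindAll on the description field. The regex pattern /line/ does not specify any grouping: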

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /line/ } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject" : [ ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ { "match" : "line", "idx" : 6, "captures" : [ ]}, { "match" : "line", "idx" : 19, "captures" : [ ] } ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ { "match" : "line", "idx" : 23, "captures" : [ ] } ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ { "match" : "line", "idx" : 9, "captures" : [ ] } ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ ]
}
{
   "_id" : 6,
   "description" : "métier work vocation",
   "returnObject" : [ ]
}

The following regex pattern /lin(e|k)/ specifies the grouping (e|k) in the pattern:

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /lin(e|k)/ } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject": [ ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ { "match" : "line", "idx" : 6, "captures" : [ "e" ] }, { "match" : "line", "idx" : 19, "captures" : [ "e" ] } ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ { "match" : "line", "idx" : 23, "captures" : [ "e" ] } ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ { "match" : "line", "idx" : 9, "captures" : [ "e" ] } ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ { "match" : "link", "idx" : 9, "captures" : [ "k" ] }, { "match" : "link", "idx" : 24, "captures" : [ "k" ] } ]
}
{
   "_id" : 6,
   "description" : "métier work vocation",
   "returnObject" : [ ]
}

In the returned results, the idx field is the code point index, not the byte index. To illustrate, consider the regex pattern /tier/:

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /tier/ } } } }
])

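The operation returns the following, where only the last document matches the pattern and the returned idx is 2 (it would be 3 if a byte index were used):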

{ "_id" : 1, "description" : "Single LINE description.", "returnObject" : [ ] }
{ "_id" : 2, "description" : "First lines\nsecond line", "returnObject" : [ ] }
{ "_id" : 3, "description" : "Many spaces before     line", "returnObject" : [ ] }
{ "_id" : 4, "description" : "Multiple\nline descriptions", "returnObject" : [ ] }
{ "_id" : 5, "description" : "anchors, links and hyperlinks", "returnObject" : [ ] }
{ "_id" : 6, "description" : "métier work vocation",
             "returnObject" : [ { "match" : "tier", "idx" : 2, "captures" : [ ] } ] }

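The difference between the two kinds of index can be reproduced in plain JavaScript. This is only a sketch of the arithmetic, not of how MongoDB computes idx; the variable names are illustrative, and the é is written as the precomposed character U+00E9.

```javascript
const description = "m\u00e9tier work vocation"; // "métier work vocation"
const utf16Offset = description.indexOf("tier");
const prefix = description.slice(0, utf16Offset); // text before the match

// Number of code points before the match (what idx reports).
const codePointIdx = Array.from(prefix).length;
// Number of UTF-8 bytes before the match.
const byteIdx = new TextEncoder().encode(prefix).length;

console.log(codePointIdx, byteIdx); // 2 3
```

The é occupies one code point but two bytes in UTF-8, which accounts for the difference.
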
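**Note:** You cannot specify options in both the regex and the options fields.

i option

To perform case-insensitive pattern matching, include the i option either in the regex field or in the options field:

// Specify i as part of the regex field
{ $regexFindAll: { input: "$description", regex: /line/i } }
// Specify i in the options field
{ $regexFindAll: { input: "$description", regex: /line/, options: "i" } }
{ $regexFindAll: { input: "$description", regex: "line", options: "i" } }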

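For example, the following aggregation performs a case-insensitive $regexFindAll on the description field. The regex pattern /line/ does not specify any grouping: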

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /line/i } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject" : [ { "match" : "LINE", "idx" : 7, "captures" : [ ] } ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ { "match" : "line", "idx" : 6, "captures" : [ ] }, { "match" : "line", "idx" : 19, "captures" : [ ] } ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ { "match" : "line", "idx" : 23, "captures" : [ ] } ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ { "match" : "line", "idx" : 9, "captures" : [ ] } ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ ]
}
{ "_id" : 6, "description" : "métier work vocation", "returnObject" : [ ] }

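m option

To match the specified anchors (for example ^, $) on each line of a multiline string, include the m option either in the regex field or in the options field:

// Specify m as part of the regex field
{ $regexFindAll: { input: "$description", regex: /line/m } }
// Specify m in the options field
{ $regexFindAll: { input: "$description", regex: /line/, options: "m" } }
{ $regexFindAll: { input: "$description", regex: "line", options: "m" } }

The following example includes both the i and the m options to match lines that start with the letter s or S in multiline strings: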

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /^s/im } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject" : [ { "match" : "S", "idx" : 0, "captures" : [ ] } ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ { "match" : "s", "idx" : 12, "captures" : [ ] } ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ ]
}
{ "_id" : 6, "description" : "métier work vocation", "returnObject" : [ ] }
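
The effect of the m option on the ^ anchor can be previewed with JavaScript's regex engine, whose m flag behaves the same way for this simple pattern. This is a sketch for intuition rather than a reproduction of PCRE2, and the variable names are illustrative.

```javascript
const text = "First lines\nsecond line";

// Without m, ^ anchors only at the start of the whole string.
const withoutM = Array.from(text.matchAll(/^s/gi), (m) => m.index);
// With m, ^ also anchors directly after each newline.
const withM = Array.from(text.matchAll(/^s/gim), (m) => m.index);

console.log(withoutM); // []
console.log(withM);    // [ 12 ]
```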

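x option

To ignore all unescaped whitespace characters and comments (denoted by an unescaped hash # character and the next newline) in the pattern, include the x option in the options field:

// Specify x in the options field
{ $regexFindAll: { input: "$description", regex: /line/, options: "x" } }
{ $regexFindAll: { input: "$description", regex: "line", options: "x" } }

The following example includes the x option to skip unescaped whitespace and comments: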

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex: /lin(e|k) # matches line or link/, options:"x" } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject" : [ ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ { "match" : "line", "idx" : 6, "captures" : [ "e" ] }, { "match" : "line", "idx" : 19, "captures" : [ "e" ] } ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ { "match" : "line", "idx" : 23, "captures" : [ "e" ] } ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ { "match" : "line", "idx" : 9, "captures" : [ "e" ] } ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ { "match" : "link", "idx" : 9, "captures" : [ "k" ] }, { "match" : "link", "idx" : 24, "captures" : [ "k" ] } ]
}
{ "_id" : 6, "description" : "métier work vocation", "returnObject" : [ ] }
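
Because a comment runs to the next newline, the x option is most useful when the pattern spans several lines. The following sketch assumes the pattern is passed as a string and only builds the stage; `commentedPattern` and `xStage` are illustrative names, and the stage has not been run against a server here.

```javascript
// Pattern written across two lines; with x, whitespace and # comments are ignored.
const commentedPattern = [
  "lin    # literal prefix shared by both words",
  "(e|k)  # capture the final letter"
].join("\n");

const xStage = {
  $addFields: {
    returnObject: {
      $regexFindAll: { input: "$description", regex: commentedPattern, options: "x" }
    }
  }
};
// In mongosh: db.products.aggregate([ xStage ])
console.log(commentedPattern);
```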

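s option

To allow the dot character (that is, .) in the pattern to match all characters including the newline character, include the s option in the options field:

// Specify s in the options field
{ $regexFindAll: { input: "$description", regex: /m.*line/, options: "s" } }
{ $regexFindAll: { input: "$description", regex: "m.*line", options: "s" } }

The following example includes the s option so that the dot character (that is, .) matches all characters including newlines, and the i option to perform a case-insensitive match: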

db.products.aggregate([
   { $addFields: { returnObject: { $regexFindAll: { input: "$description", regex:/m.*line/, options: "si"  } } } }
])

The operation returns the following:

{
   "_id" : 1,
   "description" : "Single LINE description.",
   "returnObject" : [ ]
}
{
   "_id" : 2,
   "description" : "First lines\nsecond line",
   "returnObject" : [ ]
}
{
   "_id" : 3,
   "description" : "Many spaces before     line",
   "returnObject" : [ { "match" : "Many spaces before     line", "idx" : 0, "captures" : [ ] } ]
}
{
   "_id" : 4,
   "description" : "Multiple\nline descriptions",
   "returnObject" : [ { "match" : "Multiple\nline", "idx" : 0, "captures" : [ ] } ]
}
{
   "_id" : 5,
   "description" : "anchors, links and hyperlinks",
   "returnObject" : [ ]
}
{ "_id" : 6, "description" : "métier work vocation", "returnObject" : [ ] }
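
The role of the s option for document 4 can be checked locally with JavaScript's s (dotAll) flag, which is used here as a stand-in for the PCRE2 option. The sketch only illustrates the dot behaviour, and the variable names are illustrative.

```javascript
const multiline = "Multiple\nline descriptions";

// Without s, "." stops at the newline, so the pattern cannot reach "line".
const withoutS = multiline.match(/m.*line/i);
// With s, "." also matches "\n".
const withS = multiline.match(/m.*line/is);

console.log(withoutS);                 // null
console.log(JSON.stringify(withS[0])); // "Multiple\nline"
```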

Use $regexFindAll to parse email addresses from a string

Create the feedback collection with the following script:

db.feedback.insertMany([
   { "_id" : 1, comment: "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com"  },
   { "_id" : 2, comment: "I wanted to concatenate a string" },
   { "_id" : 3, comment: "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com" },
   { "_id" : 4, comment: "It's just me. I'm testing.  fred@MongoDB.com" }
])

The following aggregation uses $regexFindAll to extract all email addresses from the comment field (case-insensitive):

db.feedback.aggregate( [
    { $addFields: {
       "email": { $regexFindAll: { input: "$comment", regex: /[a-z0-9_.+-]+@[a-z0-9_.+-]+\.[a-z0-9_.+-]+/i } }
    } },
    { $set: { email: "$email.match"} }
] )

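Stage 1

This stage uses $addFields to add a new field, email, to the document. The new field is an array that contains the result of performing $regexFindAll on the comment field: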

{ "_id" : 1, "comment" : "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com", "email" : [ { "match" : "aunt.arc.tica@example.com", "idx" : 38, "captures" : [ ] } ] }
{ "_id" : 2, "comment" : "I wanted to concatenate a string", "email" : [ ] }
{ "_id" : 3, "comment" : "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com", "email" : [ { "match" : "cam@mongodb.com", "idx" : 56, "captures" : [ ] }, { "match" : "c.dia@mongodb.com", "idx" : 75, "captures" : [ ] } ] }
{ "_id" : 4, "comment" : "It's just me. I'm testing.  fred@MongoDB.com", "email" : [ { "match" : "fred@MongoDB.com", "idx" : 28, "captures" : [ ] } ] }

Stage 2

This stage uses $set to reset the email array elements to the "email.match" values. If the current value of email is null, the new value of email is set to null:

{ "_id" : 1, "comment" : "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com", "email" : [ "aunt.arc.tica@example.com" ] }
{ "_id" : 2, "comment" : "I wanted to concatenate a string", "email" : [ ] }
{ "_id" : 3, "comment" : "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com", "email" : [ "cam@mongodb.com", "c.dia@mongodb.com" ] }
{ "_id" : 4, "comment" : "It's just me. I'm testing.  fred@MongoDB.com", "email" : [ "fred@MongoDB.com" ] }

Use captured groupings to parse user names

Create the feedback collection with the following script:

db.feedback.insertMany([
   { "_id" : 1, comment: "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com"  },
   { "_id" : 2, comment: "I wanted to concatenate a string" },
   { "_id" : 3, comment: "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com" },
   { "_id" : 4, comment: "It's just me. I'm testing.  fred@MongoDB.com" }
])

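To reply to the feedback, suppose you want to parse the local part of each email address to use as the name in the greeting. Using the captures field returned in the $regexFindAll results, you can parse out the local part of each email address: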

db.feedback.aggregate( [
    { $addFields: {
       "names": { $regexFindAll: { input: "$comment", regex: /([a-z0-9_.+-]+)@[a-z0-9_.+-]+\.[a-z0-9_.+-]+/i } },
    } },
    { $set: { names: { $reduce: { input:  "$names.captures", initialValue: [ ], in: { $concatArrays: [ "$$value", "$$this" ] } } } } }
] )

Stage 1

This stage uses the $addFields stage to add a new field, names, to the document. The new field contains the result of performing $regexFindAll on the comment field:

{
   "_id" : 1,
   "comment" : "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com",
   "names" : [ { "match" : "aunt.arc.tica@example.com", "idx" : 38, "captures" : [ "aunt.arc.tica" ] } ]
}
{ "_id" : 2, "comment" : "I wanted to concatenate a string", "names" : [ ] }
{
   "_id" : 3,
   "comment" : "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com",
   "names" : [
      { "match" : "cam@mongodb.com", "idx" : 56, "captures" : [ "cam" ] },
      { "match" : "c.dia@mongodb.com", "idx" : 75, "captures" : [ "c.dia" ] }
    ]
}
{
   "_id" : 4,
   "comment" : "It's just me. I'm testing.  fred@MongoDB.com",
   "names" : [ { "match" : "fred@MongoDB.com", "idx" : 28, "captures" : [ "fred" ] } ]
}

Stage 2

This stage uses $set with the $reduce operator to reset names to an array that contains the "$names.captures" elements:

{
   "_id" : 1,
   "comment" : "Hi, I'm just reading about MongoDB -- aunt.arc.tica@example.com",
   "names" : [ "aunt.arc.tica" ]
}
{ "_id" : 2, "comment" : "I wanted to concatenate a string", "names" : [ ] }
{
   "_id" : 3,
   "comment" : "How do I convert a date to string? Contact me at either cam@mongodb.com or c.dia@mongodb.com",
   "names" : [ "cam", "c.dia" ]
}
{
   "_id" : 4,
   "comment" : "It's just me. I'm testing.  fred@MongoDB.com",
   "names" : [ "fred" ]
}
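
What $reduce with $concatArrays does in stage 2 can be mirrored in plain JavaScript. The following is a sketch of the flattening step for document 3, using its stage 1 output from above; `capturesPerMatch` and `flattened` are illustrative names, and the code does not call MongoDB.

```javascript
// Stage 1 result for document 3, copied from the output above.
const names = [
  { match: "cam@mongodb.com", idx: 56, captures: ["cam"] },
  { match: "c.dia@mongodb.com", idx: 75, captures: ["c.dia"] }
];

// "$names.captures" resolves to an array of arrays: [ ["cam"], ["c.dia"] ].
const capturesPerMatch = names.map((n) => n.captures);
// $reduce with $concatArrays flattens it, starting from an empty array.
const flattened = capturesPerMatch.reduce((value, current) => value.concat(current), []);

console.log(flattened); // [ 'cam', 'c.dia' ]
```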