MongoDB Text Indexes
Text search is a very common requirement in most applications, and you would expect most databases to support text search out of the box if you create an index on the field.
But when I tried to implement text search for my app, it turned out to be much more complex. After some research, I’ve uncovered three main ways to implement text search with MongoDB.
1. Create a Text Index
This is the first approach that you’ll find if you Google “full text search in mongo.” It’s the most efficient way to implement text search according to MongoDB’s documentation. As an example, consider the following data:
db.names.insert(
[
{ _id: 1, name: "Army of Ants" },
{ _id: 2, name: "Army Ants" },
{ _id: 3, name: "Ant Man" },
{ _id: 4, name: "Armies" }
]
)
Now create the index because the index will make it happen!
> db.names.createIndex({ name: "text" })
Now try the following queries:
> db.names.find({"$text": {"$search": "Army"}})
{ "_id" : 4, "name" : "Armies" }
{ "_id" : 2, "name" : "Army Ants" }
{ "_id" : 1, "name" : "Army of Ants" }
>
>
> db.names.find({"$text": {"$search": "Arm"}})
As you can see, a search for Army returns every document whose name field contains the exact word Army or a known variation of that word. The same search for Arm returns nothing.
So our text search is smart enough to match Armies when we search for Army, but it will not partially match Arm against Army or Armies.
My product manager was hoping (or rather expecting) that if I searched for Arm, it would bring up the three results that came up when I searched for Army.
To solve this, I first wanted to understand why none of the documents matched when I searched for Arm instead of Army.
Like I suspected, tokenisation! The text index breaks the data (in this case, Army Ants and Army of Ants) by the white space into tokens. So Army Ants becomes [Army, Ants] and Army of Ants becomes [Army, of, Ants]. And when you search for Army, the word matches with one of the tokens in both documents, which is the reason why you see both documents in the results when you search for Army.
Note: I’m oversimplifying here. The actual process does much more: for example, it drops insignificant stop words like of, and it stems each token, which is why Armies matches a search for Army.
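To make the whole-token idea concrete, here is a minimal sketch in plain JavaScript. It is a toy model and not MongoDB's actual tokenizer: `STOP_WORDS`, `tokenize`, and `matchesWholeToken` are hypothetical names, the stop-word list is made up, and stemming is left out entirely (so, unlike the real text index, this toy does not match Armies for Army).

```javascript
// Toy model of whole-token matching; NOT MongoDB's real tokenizer.
// STOP_WORDS, tokenize() and matchesWholeToken() are hypothetical names.
const STOP_WORDS = new Set(["of", "the", "a", "an"]);

// Lowercase, split on whitespace, and drop stop words.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
}

// A name matches only if the search term equals one whole token.
function matchesWholeToken(name, term) {
  return tokenize(name).includes(term.toLowerCase());
}

const names = ["Army of Ants", "Army Ants", "Ant Man", "Armies"];

console.log(tokenize("Army of Ants"));                          // [ 'army', 'ants' ]
console.log(names.filter((n) => matchesWholeToken(n, "Army"))); // [ 'Army of Ants', 'Army Ants' ]
console.log(names.filter((n) => matchesWholeToken(n, "Arm")));  // []
```

Because arm is never a complete token in any of the names, a whole-token comparison has nothing to match, which mirrors the empty result for Arm above.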
So it seems like it would be next to impossible to satisfy our PM’s “hopes” with a text search. You might think we’ll have to venture into the world of autocompletes.
2. Using Plain Old (but Powerful) Regex
Regular expressions tend to be inefficient in most databases. They give us far more flexible ways to search, and technology is always about trade-offs: we trade efficiency for search flexibility.
> db.names.find({"name": {"$regex": "Arm"}})
{ "_id" : 1, "name" : "Army of Ants" }
{ "_id" : 2, "name" : "Army Ants" }
{ "_id" : 4, "name" : "Armies" }
Voilà! However, it isn’t a pretty picture behind the scenes. Let’s put on our explain() glasses.
> db.names.find({"name": {"$regex": "Arm"}}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.names",
"indexFilterSet" : false,
"parsedQuery" : {
"name" : {
"$regex" : "Arm"
}
},
"queryHash" : "420CA52A",
"planCacheKey" : "420CA52A",
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"name" : {
"$regex" : "Arm"
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
...
}
See that COLLSCAN hiding behind the curtains?
Let’s see if creating an index would make this better.
> db.names.createIndex({"name": 1})
...
> db.names.find({"name": {"$regex": "Arm"}}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.names",
"indexFilterSet" : false,
"parsedQuery" : {
"name" : {
"$regex" : "Arm"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"name" : {
"$regex" : "Arm"
}
},
"keyPattern" : {
"name" : 1
},
"indexName" : "name_1",
...
"indexBounds" : {
"name" : [
"[\"\", {})",
"[/Arm/, /Arm/]"
]
}
}
},
"rejectedPlans" : [ ]
},
...
"ok" : 1
}
Well, the IXSCAN looks better at first glance. But if you are quirky enough and your instincts don’t allow you to think that it could be that simple, you are in for a treat. Let’s look at the executionStats.
> db.names.find({"name":{"$regex":"Arm"}}).explain('executionStats')
{
"queryPlanner" : {
...
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"name" : {
"$regex" : "Arm"
}
},
...
"indexBounds" : {
"name" : [
"[\"\", {})",
"[/Arm/, /Arm/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 4,
"totalDocsExamined" : 3,
"executionStages" : {
...
}
},
"ok" : 1
}
See the value of totalKeysExamined? It is 4, the same as the number of keys in the index, so it looks like every key is being scanned. Let’s add a few more documents to see if this value increases proportionally.
> db.names.insert(
[
{ _id: 5, name: "Unrelated name"},
{ _id: 6, name: "Completely unrelated name" },
{ _id: 7, name: "Another Completely unrelated name" },
{ _id: 8, name: "Yet Another Completely unrelated name" }
]
)
> db.names.find({"name":{"$regex":"Arm"}}).explain('executionStats')
{
"queryPlanner" : {
...
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"filter" : {
"name" : {
"$regex" : "Arm"
}
},
...
"indexBounds" : {
"name" : [
"[\"\", {})",
"[/Arm/, /Arm/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 8,
"totalDocsExamined" : 3,
"executionStages" : {
...
}
},
"ok" : 1
}
totalKeysExamined grows with the number of documents: it went from 4 to 8 when the collection doubled. So clearly I can’t deploy this to production, even though it works. Full index scans are not good!
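After looking around a bit, I found that anchoring the regex to the start of the string with ^ lets MongoDB turn it into a range scan on the index: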
> db.names.find({"name":{"$regex":"^Arm"}}).explain("executionStats")
{
"queryPlanner" : {
...
"parsedQuery" : {
"name" : {
"$regex" : "^Arm"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"name" : 1
},
...
"indexBounds" : {
"name" : [
"[\"Arm\", \"Arn\")",
"[/^Arm/, /^Arm/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 4,
"totalDocsExamined" : 3,
"executionStages" : {
...
}
},
"ok" : 1
}
So this is performant enough for production! totalKeysExamined stays at 4 even though the collection now holds 8 documents, because the index bounds ["Arm", "Arn") limit the scan to keys that start with Arm. We gain efficiency by compromising on flexibility: we can now only match from the beginning of the name.
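The difference between the two plans can be sketched with a sorted array standing in for the index. This is a toy model, not MongoDB internals: `prefixBounds`, `scanAll`, and `scanPrefix` are hypothetical helpers, the prefix is assumed to be a non-empty plain string, and the way keys are counted is only meant to resemble what explain() reported above.

```javascript
// Toy model of scanning a sorted single-field index; not MongoDB internals.
// prefixBounds(), scanAll() and scanPrefix() are hypothetical helper names.

// "Arm" -> ["Arm", "Arn"): every string starting with "Arm" sorts in this range.
// Assumes a non-empty prefix.
function prefixBounds(prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  return [prefix, prefix.slice(0, -1) + String.fromCharCode(last + 1)];
}

// Unanchored regex: any key might match, so every key has to be tested.
function scanAll(sortedKeys, regex) {
  const matches = sortedKeys.filter((key) => regex.test(key));
  return { matches, keysExamined: sortedKeys.length };
}

// Prefix regex: walk only the keys inside [lower, upper).
function scanPrefix(sortedKeys, prefix) {
  const [lower, upper] = prefixBounds(prefix);
  const matches = [];
  let keysExamined = 0;
  // A real B-tree seeks straight to `lower`; findIndex stands in for that seek.
  let i = sortedKeys.findIndex((key) => key >= lower);
  if (i === -1) return { matches, keysExamined };
  for (; i < sortedKeys.length; i++) {
    keysExamined++; // the first key past the range is still looked at once
    if (sortedKeys[i] >= upper) break;
    matches.push(sortedKeys[i]);
  }
  return { matches, keysExamined };
}

const keys = [
  "Army of Ants", "Army Ants", "Ant Man", "Armies",
  "Unrelated name", "Completely unrelated name",
  "Another Completely unrelated name", "Yet Another Completely unrelated name",
].sort();

console.log(prefixBounds("Arm"));                  // [ 'Arm', 'Arn' ]
console.log(scanAll(keys, /Arm/).keysExamined);    // 8
console.log(scanPrefix(keys, "Arm").keysExamined); // 4
```

Adding more unrelated names makes the first count grow, while the second stays tied to the number of keys that actually start with the prefix.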
But the task isn’t done yet, is it?
> db.names.find({"name": {"$regex": "^Ant"}})
{ "_id" : 3, "name" : "Ant Man" }
So this is not matching Army of Ants or Army Ants, and the PM is really hoping I can make this work as well.
I researched for a while but could not find anything that would allow me to make this possible out of the box.
3. Hack It
It was clear that it was time for some out-of-the-box thinking.
It seems our problem would be solved if we could split each name on whitespace and then run the ^ regex against each of the resulting words. If we split Army of Ants into [Army, of, Ants] and search for ^Ant in each word, we match Army of Ants.
But how do we leverage the database for most of these tasks? MongoDB will be more efficient than whatever code we write.
Array fields to the rescue. If we just break the names down by whitespace, turn them into a list ([Army, of, Ants]), and store this in an array field, we can do a ^ regex match on each of them. Mongo takes care of it if we create an index on this field (multikey index).
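The splitting step itself is small. A minimal sketch, assuming it runs in application code before a document is written; `toSearchTokens` and `withSearchField` are hypothetical helper names, and only the name and name_search fields come from the example data:

```javascript
// Hypothetical helpers for building the name_search array field.

// Split a name on runs of whitespace, ignoring leading/trailing spaces.
function toSearchTokens(name) {
  return name.trim().split(/\s+/).filter((token) => token.length > 0);
}

// Return a copy of the document with name_search derived from name.
function withSearchField(doc) {
  return { ...doc, name_search: toSearchTokens(doc.name) };
}

console.log(withSearchField({ _id: 1, name: "Army of Ants" }));
// { _id: 1, name: 'Army of Ants', name_search: [ 'Army', 'of', 'Ants' ] }
```

The tokens keep their original case here, so a ^ regex against them stays case-sensitive, just like the queries in this article.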
Let’s see it in action in a new collection.
> db.new_names.insert(
[
{ "_id" : 1, "name" : "Army of Ants", "name_search": ["Army", "of", "Ants"] },
{ "_id" : 2, "name" : "Army Ants", "name_search": ["Army", "Ants"] },
{ "_id" : 3, "name" : "Ant Man", "name_search": ["Ant", "Man"] },
{ "_id" : 4, "name" : "Armies", "name_search": ["Armies"] },
{ "_id" : 5, "name" : "Unrelated name", "name_search": ["Unrelated", "name"] },
{ "_id" : 6, "name" : "Completely unrelated name", "name_search": ["Completely", "unrelated", "name"] },
{ "_id" : 7, "name" : "Another Completely unrelated name", "name_search": ["Another", "Completely", "unrelated", "name"] },
{ "_id" : 8, "name" : "Yet Another Completely unrelated name", "name_search": ["Yet", "Another", "Completely", "unrelated", "name"] }
]
)
> db.new_names.createIndex({"name_search": 1})
>
> db.new_names.find({"name_search": {"$regex": "^Ant"}})
{ "_id" : 3, "name" : "Ant Man", "name_search" : [ "Ant", "Man" ] }
{ "_id" : 1, "name" : "Army of Ants", "name_search" : [ "Army", "of", "Ants" ] }
{ "_id" : 2, "name" : "Army Ants", "name_search" : [ "Army", "Ants" ] }
Well, it works now. Let’s check if we are still efficient.
> db.new_names.find({"name_search": {"$regex": "^Ant"}}).explain('executionStats')
{
"queryPlanner" : {
...
"parsedQuery" : {
"name_search" : {
"$regex" : "^Ant"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"name_search" : 1
},
...
"indexBounds" : {
"name_search" : [
"[\"Ant\", \"Anu\")",
"[/^Ant/, /^Ant/]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 4,
"totalDocsExamined" : 3,
"executionStages" : {
...
}
},
"ok" : 1
}
So, it’s efficient. We are using an index scan and not scanning the entire index.
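Why the multikey index keeps the scan small can be sketched the same way as before: one index entry per array element, sorted by token, then a range walk over the prefix. Again this is a toy model under stated assumptions, not MongoDB's implementation; `buildMultikeyIndex` and `findByPrefix` are hypothetical names, and only the first four documents are used to keep it short.

```javascript
// Toy model of a multikey index: one (token, _id) entry per array element.
// buildMultikeyIndex() and findByPrefix() are hypothetical names.
function buildMultikeyIndex(docs) {
  const entries = [];
  for (const doc of docs) {
    for (const token of new Set(doc.name_search)) {
      entries.push({ key: token, id: doc._id });
    }
  }
  // Sort by token, then by _id, like keys in an ordered index.
  return entries.sort((a, b) =>
    a.key < b.key ? -1 : a.key > b.key ? 1 : a.id - b.id
  );
}

// Walk only the entries whose token lies in [prefix, upper).
// Assumes a non-empty prefix.
function findByPrefix(index, prefix) {
  const last = prefix.charCodeAt(prefix.length - 1);
  const upper = prefix.slice(0, -1) + String.fromCharCode(last + 1);
  const ids = new Set();
  let keysExamined = 0;
  for (const entry of index) {
    if (entry.key < prefix) continue; // stands in for the B-tree seek
    keysExamined++;                   // the first entry past the range counts once
    if (entry.key >= upper) break;
    ids.add(entry.id);                // a document is returned once, however many tokens hit
  }
  return { ids: [...ids].sort((a, b) => a - b), keysExamined };
}

const docs = [
  { _id: 1, name_search: ["Army", "of", "Ants"] },
  { _id: 2, name_search: ["Army", "Ants"] },
  { _id: 3, name_search: ["Ant", "Man"] },
  { _id: 4, name_search: ["Armies"] },
];

console.log(findByPrefix(buildMultikeyIndex(docs), "Ant"));
// { ids: [ 1, 2, 3 ], keysExamined: 4 }
```

The entries Ant, Ants, and Ants sit next to each other in sorted order, so the walk touches them plus the one entry that ends the range.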
Conclusion
That was all! Databases are complex and inefficient if not used properly. Every database has some limitations, and whenever we try something new, we must make sure that it would be efficient. Otherwise, you are just setting yourself up for failure in the future.
Remember that I pointed out that technologies are all about trade-offs? Can you see what the trade-off is in the last approach? Leave me a comment!
Translated from: https://medium.com/@varunb94/text-search-in-mongodb-34c1f70ab86d