MongoDB Text Search Tutorial

In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll have a closer look at the details.

API

You may have noticed that a text search is not executed with a find() command. Instead you call

db.foo.runCommand( "text", {search: "bar"} )
Remember that it is still an experimental feature. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.

I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.
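
For readers on newer servers: text search did later move into find(). From MongoDB 2.6 onward the shell accepts a $text query operator that takes a $search string and an optional $language. The sketch below only builds that filter document as a plain object; the helper name buildTextFilter is hypothetical, and nothing here contacts a server.

```javascript
// Build the filter document for a $text query (MongoDB 2.6 and later).
// buildTextFilter is a hypothetical helper name used only for illustration.
function buildTextFilter(search, language) {
  const text = { $search: search };
  if (language !== undefined) {
    text.$language = language; // optional per-query language override
  }
  return { $text: text };
}

// In a 2.6+ shell this object would be passed as: db.foo.find(filter)
const filter = buildTextFilter("bar");
console.log(JSON.stringify(filter)); // {"$text":{"$search":"bar"}}
```

On the 2.4 release this tutorial was written for, the runCommand() form shown above remains the way to go.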


Text Query Syntax

In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:

db.foo.drop()
db.foo.ensureIndex( {txt: "text"} )
db.foo.insert( {txt: "Robots are superior to humans"} )
db.foo.insert( {txt: "Humans are weak"} )
db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )
A search for “robot” will find two documents; the same is true for “human”:
> db.foo.runCommand("text", {search: "robot"}).results.length
2
> db.foo.runCommand("text", {search: "human"}).results.length
2
When searching for multiple terms, an OR search is performed, yielding three documents in our example:
> db.foo.runCommand("text", {search: "human robot"}).results.length
3
I would have expected the given search words to be AND-ed, not OR-ed.
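
To make the difference concrete, here is a toy model of OR versus AND matching over the three example documents. It is a sketch under loud assumptions: naiveStem is a crude stand-in for the real stemmer (lowercase, drop one trailing "s"), and the search function is hypothetical, not how the server works internally.

```javascript
// Toy model of the OR behaviour shown above; not MongoDB's implementation.
// naiveStem is a crude stand-in for the Snowball stemmer.
function naiveStem(word) {
  const w = word.toLowerCase();
  return w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w;
}

// Split a text into words and return the set of their stems.
function stems(text) {
  return new Set(text.split(/[^A-Za-z]+/).filter(Boolean).map(naiveStem));
}

const docs = [
  "Robots are superior to humans",
  "Humans are weak",
  "I, Robot - by Isaac Asimov",
];

// mode "or": any term may match; mode "and": every term must match.
function search(query, mode) {
  const terms = [...stems(query)];
  return docs.filter((doc) => {
    const s = stems(doc);
    return mode === "and"
      ? terms.every((t) => s.has(t))
      : terms.some((t) => s.has(t));
  });
}

console.log(search("human robot", "or").length);  // 3, as in the shell session
console.log(search("human robot", "and").length); // 1, only the first document
```

With AND semantics only “Robots are superior to humans” would qualify, which is the result I had expected.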

Negation

By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”.

> db.foo.runCommand("text", {search: "robot -humans"})
{
        "queryDebugString" : "robot||human||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                                "txt" : "I, Robot - by Isaac Asimov"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 212
        },
        "ok" : 1
}

Phrase Search

By enclosing multiple words inside quotes (“foo bar”) you perform a phrase search. Inside a phrase, order is important and stop words are also taken into account:

> db.foo.runCommand("text", {search: '"robots are"'})
{
        "queryDebugString" : "robot||||robots are||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                                "txt" : "Robots are superior to humans"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}
Please have a look at the “queryDebugString” field:
"queryDebugString" : "robot||||robots are||"
It tells us that our search string contains one stem “robot” but also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:
> // order matters inside phrase
> db.foo.runCommand("text", {search: '"are robots"'}).results.length
0
> // no phrase search --> OR query
> db.foo.runCommand("text", {search: 'are robots'}).results.length
2
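
The way a search string breaks down into terms, negated terms and phrases can be sketched with a small parser. Everything here is an assumption made for illustration: the four-part layout of the debug string (terms, negated terms, phrases, negated phrases, joined by "||") is inferred from the outputs above, STOP_WORDS is a tiny hand-picked subset, and naiveStem is a crude stand-in for the real stemmer.

```javascript
// Toy parser for the search syntax; the real parsing happens inside the server.
// STOP_WORDS is a tiny illustrative subset, naiveStem a crude stemmer stand-in.
const STOP_WORDS = new Set(["are", "to", "i", "by", "the", "a"]);

function naiveStem(word) {
  const w = word.toLowerCase();
  return w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w;
}

function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [] };
  const addTerm = (list, word) => {
    const w = word.toLowerCase();
    if (w && !STOP_WORDS.has(w)) list.push(naiveStem(w));
  };
  // Pull out quoted phrases first; their words also count as search terms.
  const rest = search.replace(/"([^"]*)"/g, (match, phrase) => {
    parsed.phrases.push(phrase);
    phrase.split(/\s+/).filter(Boolean).forEach((w) => addTerm(parsed.terms, w));
    return " ";
  });
  // Remaining tokens are plain terms, or negated terms when prefixed with "-".
  for (const token of rest.split(/\s+/).filter(Boolean)) {
    if (token.startsWith("-")) addTerm(parsed.negatedTerms, token.slice(1));
    else addTerm(parsed.terms, token);
  }
  return parsed;
}

// Assumed layout: terms || negated terms || phrases || negated phrases (always empty here).
function debugString(p) {
  return [p.terms.join("|"), p.negatedTerms.join("|"), p.phrases.join("|"), ""].join("||");
}

console.log(debugString(parseSearch("robot -humans"))); // robot||human||||
console.log(debugString(parseSearch('"robots are"')));  // robot||||robots are||
```

The phrase query contributes both the stem “robot” and the literal phrase, which matches the explanation of the debug string above.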

 

 

Multi Language Support

Stemming and stop word filtering are both language dependent. So if you want to use a language other than the default, which is English, you have to tell MongoDB what language to use for indexing and searching. MongoDB uses the open source Snowball stemmer, which supports a number of languages.

In order to use another language for indexing and searching, you specify it when creating the index:

db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )
With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:
> db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
> db.de.validate().keysPerIndex["text.de.$txt_text"]
2
As you can see, there are only two index keys, so stop word filtering did occur (this time with a German stop word list; “Vater” is the German word for father, not a typo for Vader). Let’s try some searches:
> db.de.runCommand("text", {search: "ich"}).results.length
0
> db.de.runCommand("text", {search: "Vater"}).results.length
1
> db.de.runCommand("text", {search: "Luke"}).results.length
1
Please note that we don’t have to give the language we are searching for because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).

It is also possible to mix multiple languages in the same index. Each single document can have its own language:

db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )
If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.
// default language: german -> no hits
> db.de.runCommand("text", {search: "ich"})
{
        "queryDebugString" : "||||||",
        "language" : "german",
        "results" : [ ],
        "stats" : {
                "nscanned" : 0,
                "nscannedObjects" : 0,
                "n" : 0,
                "timeMicros" : 96
        },
        "ok" : 1
}
 
// search for English -> one hit
> db.de.runCommand("text", {search: "ich", language: "english"})
{
        "queryDebugString" : "ich||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.625,
                        "obj" : {
                                "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                                "language" : "english",
                                "txt" : "Ich bin ein Berliner"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 161
        },
        "ok" : 1
}
What happened here? The default language for searching is German, so the first search has no result (as before). In the second search we ask for English text (to be more precise: for index keys that were generated with an English stemmer and stop words). That’s why we find the famous sentence from JFK.

What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them depending on their language.
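
The interplay of index default language, per-document “language” field and per-query language can be imitated in a few lines. This is a toy model under stated assumptions: the stop word lists are tiny illustrative subsets rather than the real ones, no stemming is applied, and the function names are hypothetical.

```javascript
// Toy model of per-document language handling; not MongoDB's implementation.
// The stop word lists are tiny illustrative subsets and no stemming is applied.
const STOP_WORDS = {
  german: new Set(["ich", "bin", "dein", "ein"]),
  english: new Set(["i", "am", "your", "a"]),
};
const DEFAULT_LANGUAGE = "german"; // mirrors default_language on the index

// Lowercase, split into words and drop the stop words of the given language.
function tokens(text, language) {
  return text
    .toLowerCase()
    .split(/[^a-zäöüß]+/)
    .filter((w) => w && !STOP_WORDS[language].has(w));
}

// A document's own "language" field overrides the index default.
function indexKeys(doc) {
  return tokens(doc.txt, doc.language || DEFAULT_LANGUAGE);
}

// The query is processed with the search language before matching index keys.
function search(docs, query, language = DEFAULT_LANGUAGE) {
  const terms = tokens(query, language);
  return docs.filter((doc) => indexKeys(doc).some((k) => terms.includes(k)));
}

const docs = [
  { txt: "Ich bin Dein Vater, Luke." },
  { language: "english", txt: "Ich bin ein Berliner" },
];

console.log(indexKeys(docs[0]).length);             // 2 keys: vater, luke
console.log(search(docs, "ich").length);            // 0, "ich" is a German stop word
console.log(search(docs, "ich", "english").length); // 1, the English-tagged document
```

The German search drops “ich” before it ever reaches the index, while the English search keeps it and finds the key that was stored for the English-tagged document.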

Multiple Fields

A text index can span more than one field. If you use more than one field, each field can have its own weight. That lets you index parts of your document that differ in importance.

> db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
> db.mail.getIndices()
[
        ...
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "de.mail",
                "name" : "subject_text_body_text",
                "weights" : {
                        "body" : 1,
                        "subject" : 10
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]
We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:
> db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
> db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
> db.mail.runCommand("text", {search: "robot"})
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck",
                                "prio" : 1
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 148
        },
        "ok" : 1
}
The document with “robot” in the “subject” field has a much higher score because the weight of 10 is taken as a multiplier.
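
The scores printed in this tutorial are all consistent with a simple formula for a term that occurs once in a field: 0.5 * (1 + 1/n) * weight, where n is the number of words left in the field after stop word removal. This formula is reverse-engineered from the numbers shown here and is an assumption, not documented behaviour; the function name singleHitScore is hypothetical.

```javascript
// Reverse-engineered from the scores printed in this article; an assumption,
// not documented MongoDB behaviour. Covers only a term occurring once in a field.
function singleHitScore(tokenCount, weight = 1) {
  // tokenCount = words left in the field after stop word removal
  return 0.5 * (1 + 1 / tokenCount) * weight;
}

// "Robot leader to minions": robot, leader, minions -> 3 tokens, weight 10
console.log(singleHitScore(3, 10)); // about 6.67
// "Robots suck": 2 tokens, default weight 1
console.log(singleHitScore(2));     // 0.75
// "Ich bin ein Berliner" indexed as English: 4 tokens
console.log(singleHitScore(4));     // 0.625
```

Under this reading the weight scales the per-field score linearly, so a hit in “subject” counts ten times as much as the same hit in “body” would.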

Filtering and Projection

You can apply additional search criteria via filtering:

> db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 2,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}
Please note that filtering does not use an index.

If you are interested only in a subset of fields, you can use projection (similar to the aggregation framework):

> db.mail.runCommand("text", {search: "robot", projection: {_id:0, prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 127
        },
        "ok" : 1
}
Filtering and projection can be combined, of course.
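
A combined call would pass both options at once, for example {search: "robot", filter: {prio:0}, projection: {_id:0, prio:0}}. What the two options do to a result list can be imitated in plain JavaScript. This is only a sketch: applyFilter and applyProjection are hypothetical helpers, they handle equality filters and exclusion projections only, and the _id values are placeholders.

```javascript
// Toy model of what filter and projection do to a result list; not server code.
// Only equality filters and exclusion projections ({field: 0}) are handled.
function applyFilter(results, filter) {
  return results.filter((r) =>
    Object.entries(filter).every(([field, value]) => r.obj[field] === value)
  );
}

function applyProjection(results, projection) {
  return results.map((r) => {
    const obj = {};
    for (const [field, value] of Object.entries(r.obj)) {
      if (projection[field] !== 0) obj[field] = value; // keep fields not excluded
    }
    return { score: r.score, obj };
  });
}

// Result list shaped like the output above; the _id values are placeholders.
const results = [
  { score: 6.666666666666666, obj: { _id: 1, subject: "Robot leader to minions", body: "Humans suck", prio: 0 } },
  { score: 0.75, obj: { _id: 2, subject: "Human leader to minions", body: "Robots suck", prio: 1 } },
];

const combined = applyProjection(applyFilter(results, { prio: 0 }), { _id: 0, prio: 0 });
console.log(JSON.stringify(combined));
```

Only the “Robot leader to minions” mail survives the filter, and the projection then strips “_id” and “prio” from it.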

Summary

With this second part on MongoDB text search we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engine. I’m looking forward to your feedback.

Reposted from: https://www.cnblogs.com/HuiLove/p/3966066.html
