sql server的搜索_在SQL Server中进行全文本搜索

sql server的搜索

介绍 (Introduction)

In most cases, we will use clustered and non-clustered indexes to help a query go faster, but these kinds of indexes have their own limitations and cannot be used for fast text lookup. For instance, a LIKE operator will lead SQL Server to scan the whole table in order to pick up values that meet the expression next to this operator. This means it won’t be fast in every case, even if an index is created for considered column.

在大多数情况下,我们将使用聚簇和非聚簇索引来帮助查询更快,但是这些类型的索引有其自身的局限性,不能用于快速文本查找。 例如,一个LIKE运算符将导致SQL Server扫描整个表,以获取满足该运算符旁边表达式的值。 这意味着即使为考虑的列创建索引,它也不会在每种情况下都很快。

Microsoft SQL Server comes up with an answer to part of this issue with a Full-Text Search feature. This feature lets users and application run character-based lookups efficiently by creating a particular type of index referred to as a Full-Text Index. This index can be built on the top of one or more columns for a particular table. These columns can be of following data types:

Microsoft SQL Server通过全文搜索功能为部分问题提供了答案。 通过创建称为全文索引的特定类型的索引,此功能使用户和应用程序可以有效地运行基于字符的查找。 该索引可以建立在特定表的一列或多列的顶部。 这些列可以是以下数据类型:

  • char,
  • 字符
  • varchar,
  • varchar ,
  • nchar,
  • nchar ,
  • nvarchar,
  • nvarchar ,
  • text,
  • 文字
  • ntext,
  • 文字
  • image,
  • 图片
  • xml,
  • xml
  • varbinary(max)
  • varbinary(最大)
  • FILESTREAM
  • 文件流

The building and usage of Full-Text indexes is always performed in a specific language context like English or French.

全文索引的建立和使用始终在特定的语言环境(例如英语或法语)中执行。

In the following sections, we will first take some time to understand overview how a Full-Text Search feature works. In this part, we will define some concepts and use them to understand how a Full-Text Index is built and maintained. We’ll even go through an illustrative example. Once we are done with theoretical aspects, we’ll then focus on some practical aspects in order to use and maintain this feature: we will see how to create a Full-Text indexed table, how to list out which tables have a Full-Text index and on which columns and much more

在以下各节中,我们将首先花一些时间来概述“全文搜索”功能的工作原理。 在这一部分中,我们将定义一些概念,并使用它们来理解如何建立和维护全文索引。 我们甚至将通过一个说明性示例。 一旦完成了理论方面的工作,我们便将重点放在一些实际方面,以便使用和维护此功能:我们将看到如何创建全文索引表,如何列出具有全文索引的表索引以及在哪些列上等等

概念 (Concepts)

Definitions

定义

Now that we know what the purpose of Full-Text Search feature is, let’s invest some time in the understanding of how it works. This will help us manage this feature.

现在我们知道了全文搜索功能的目的是什么,让我们花一些时间来了解它的工作原理。 这将帮助我们管理此功能。

Notice that, already at SQL Server installation, we can tell that this feature is special as the installer defines a daemon service called “fdhost.exe”. This process will be referred in following as the “filter daemon host“.

请注意,在SQL Server安装中,我们已经知道该功能很特殊,因为安装程序定义了一个名为“ fdhost.exe”的守护程序服务。 以下将该过程称为“过滤器守护程序主机”。

It is started by a service launcher called MSSQLFDLauncher for security concerns. It will exchange data with SQL Server service (sqlservr.exe) via shared memory or a named pipe. Fdhost.exe process will access, filter and tokenize user data in order to actually build Full-Text indexes. It’s also called to analyze Full-Text queries, including word breaking and stemming (see below for more info).

出于安全方面的考虑,它由称为MSSQLFDLauncher的服务启动器启动。 它将通过共享内存或命名管道与SQL Server服务(sqlservr.exe)交换数据。 Fdhost.exe进程将访问,过滤和标记用户数据,以便实际构建全文索引。 它也被称为分析全文查询,包括分词和词干分析(更多信息,请参见下文)。

This means that the entire Full-Text Search feature is spread across these two processes: fdhost.exe and sqlserv.exe and that some components of this feature interact with each other’s. Let’s review these components:

这意味着整个全文搜索功能分布在以下两个过程中:fdhost.exe和sqlserv.exe,并且该功能的某些组件相互交互。 让我们回顾一下这些组件:

  • User tables – (sqlserv.exe) – tables for which a full-text index exists. 用户表 –(sqlserv.exe)–存在全文索引的表。
  • Full-Text gatherer – in sqlserv.exe – a thread responsible for scheduling and driving index population so as for monitoring. 全文收集器 –在sqlserv.exe中–一个负责调度和驱动索引填充以便进行监视的线程。
  • Thesaurus files – (sqlserv.exe)– files that contain synonyms of search terms. 同义词库文件 (sqlserv.exe)–包含搜索词同义词的文件。
  • StopLists – in sqlserv.exe – objects that contain list of common words that can be ignored as they are not significant for a lookup (e.g. « and », « or », « but ») StopLists –在sqlserv.exe中–包含可被忽略的常用单词列表的对象,因为它们对于查找而言并不重要(例如«和»,«或»,«但»)
  • Query Processor thread – (sqlserv.exe)– thread that compiles and executes T-SQL queries and send Full-Text search to the Full-Text Engine twice: once at compilation and once during query execution. The query results is matched against the full-text index. 查询处理器线程 –(sqlserv.exe)–编译和执行T-SQL查询并将全文搜索两次发送到全文引擎的线程:一次在编译时,一次在查询执行期间。 查询结果与全文索引匹配。
  • Full-Text Engine – (sqlserv.exe)– can be seen as part of the Query Processor. It compiles and runs full-text queries and takes stoplists and thesaurus files into account before sending back results sets for these queries. 全文引擎 (sqlserv.exe)可以看作是查询处理器的一部分。 它会编译并运行全文查询,并在向回发送这些查询的结果集之前,考虑到停机清单和同义词库文件。
  • Full-Text Indexer – (sqlserv.exe)– This thread builds the structure used to store index tokens. 全文索引器 –(sqlserv.exe)–此线程构建用于存储索引令牌的结构。
  • Filter Daemon Manager – (sqlserv.exe)– this thread monitors the status of the fdhost.exe daemon service. 筛选器后台程序管理器 –(sqlserv.exe)–此线程监视fdhost.exe后台程序服务的状态。
  • Protocol Handler Thread – (fdhost.exe) – this thread pulls data from memory for further processing and accesses data from a user table. 协议处理程序线程 –(fdhost.exe)–该线程从内存中提取数据以进行进一步处理,并从用户表访问数据。
  • Filters – (fdhost.exe) – they are specific by document type and allow the extraction of text data from various data types like varbinary, image or xml. They will be used, for instance, in order to remove any embedded formatting on the text of a MS Word document. You can run following query in order to get an overview of the filters defined by default: 过滤器 –(fdhost.exe)–特定于文档类型,可以从各种数据类型(如varbinary,image或xml)中提取文本数据。 例如,将使用它们来删除MS Word文档文本上的任何嵌入式格式。 您可以运行以下查询,以获取默认情况下定义的过滤器的概述:
     
    EXEC sp_help_fulltext_system_components 'filter';
     
    
  • Word breakers and 分词系统stemmers (fdhost.exe) – Each language has its set of word breakers. These components help to find the boundaries of each word in a sentence based on lexical rules of its associated language. So they help tokenizing sentences. Moreover, each word breaker is used in pair with a stemmer component. This component helps to find the root of a verb (its inflectional form) and conjugates verb, also based on language-specific rules. For instance, it will consider all these forms as being the same: “writing”, “wrote”, “writer” are all forms of the word “write”. Words identified by either of these components are inserted as keywords into a full-text index. 词干分析器 (fdhost.exe)–每种语言都有其分词器集合。 这些组件有助于根据其相关语言的词汇规则找到句子中每个单词的边界。 因此,它们有助于标记化句子。 此外,每个断字器与词干提取器组件一起使用。 该组件还基于特定于语言的规则,帮助找到动词的根(其变形形式)和共轭动词。 例如,它将所有这些形式都视为相同:“写作”,“写”,“作家”都是“写作”一词的所有形式。 由这两个组件之一标识的单词将作为关键字插入全文索引。

Architecture of a Full-Text Index

全文索引的体系结构

First of all, we have to know that any full-text index is stored into what Microsoft calls a “ full-text catalog”. It’s like a container for Full-Text indexes. Why did Microsoft define a logical container for Full-Text indexes? Simply because these indexes are usually split across multiple internal tables that are called full-text index fragments. These fragments are created as we insert or update records.

首先,我们必须知道任何全文索引都存储在Microsoft所谓的“全文目录”中。 就像全文索引的容器一样。 Microsoft为什么要为全文索引定义逻辑容器? 仅仅因为这些索引通常被拆分为多个内部表,这些内部表称为全文索引片段。 这些片段是在我们插入或更新记录时创建的。

We can get back data about a Full-Text Index using dynamic management views and functions. One of these is the sys.dm_fts_index_keywords_by_document function. It returns a data set with the following columns:

我们可以使用动态管理视图和功能来获取有关全文索引的数据。 其中之一是sys.dm_fts_index_keywords_by_document函数。 它返回包含以下列的数据集:

  • A hexadecimal representation of the keyword

    关键字的十六进制表示形式
  • A human-readable representation of the keyword

    关键字的易于理解的表示形式
  • The identifier of the column subject to a Full-Text Index

    接受全文索引的列的标识符
  • The identifier of the document or the row from which the current keyword has been indexed

    索引当前关键字的文档或行的标识符
  • The number of times this keyword has been found in that document or row indicated by previous column

    在该文档或上一列指示的行中找到此关键字的次数

Here is a sample results set:

这是一个示例结果集:

In that results set, we can see that for document with identifier “14536”, there are 3 occurrences o

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值