Creating a custom full-text search dictionary in SQL Server 2008

Original post: https://blogs.msdn.microsoft.com/sqlfts/2009/12/03/creating-custom-dictionaries-for-special-terms-to-be-indexed-as-is-in-sql-server-2008-full-text-indexes/

-------------------------------------------------

Did you know? SQL Server 2008 provides the ability to create a custom dictionary of special terms to be indexed ‘as-is’ (e.g. ‘AT&T’) in the full-text index (FTIndex).



Many times you find yourself in a situation where you would like to index a given word/term ‘as-is’ in your full-text index. However, due to the specific tokenization behavior of a given word breaker (WB), it gets split into several individual terms. This can cause recall issues at query time.



For instance, your data contains the following term “ISBN-7”. Internally, most (if not all) WBs will tokenize this as ‘ISBN’ and ‘7’ separately. These two terms are going to be persisted in the FTIndex instead of the single and original “ISBN-7”. Note that because ‘7’ might be a default noise word for most languages (single digit numbers typically are considered as noise words), what happens is that the FTIndex will only index the term ‘ISBN’.

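You can see this splitting behavior directly with sys.dm_fts_parser (the same DMV used later in this post). This is only a sketch; the exact terms returned depend on the word breaker version and on the LCID you pass in (1033 = English here):

-- Ask the parser how 'ISBN-7' is tokenized; 1033 = English LCID,
-- stoplist_id = 0 (no stoplist applied), accent_sensitivity = 0.
select [display_term] from sys.dm_fts_parser('ISBN-7', 1033, 0, 0)
-- Typically returns 'ISBN' and '7' as separate terms rather than a single 'ISBN-7'.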


Later, at query time, when you search for ‘ISBN-7’ using any of the available predicates (e.g. CONTAINS) or TVFs (e.g. CONTAINSTABLE), the same process will occur and the query will be internally translated into a query for ‘ISBN’ only. You will then receive all rows/documents containing ‘ISBN’, not only the ones you cared about initially: ‘ISBN-7’.

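To make this concrete (dbo.Books, BookId and Title below are hypothetical names, not from the original post), a query like the following effectively behaves as if you had searched for 'ISBN' alone:

-- The search string goes through the same word breaker, so this matches
-- every row whose Title contains 'ISBN', not only rows containing 'ISBN-7'.
SELECT BookId, Title
FROM dbo.Books
WHERE CONTAINS(Title, '"ISBN-7"');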


By disabling the STOPLIST for a given FTIndex, you will solve the missing-results part of the problem, but you will end up with larger FTIndexes (with the performance implications that entails) and still not solve the problem of false positives: you will still find documents containing ‘ISBN’ and ‘7’ somewhere in the body of the document, but not necessarily together (next to each other) as you originally wanted.

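For reference, turning the stoplist off for a full-text index looks roughly like this in SQL Server 2008 (dbo.Books is the same hypothetical table as above, already full-text indexed); as noted, this trades a larger index and false positives for the missing results:

-- Stop applying any stoplist to this full-text index, so terms like '7'
-- are no longer dropped as noise words.
ALTER FULLTEXT INDEX ON dbo.Books SET STOPLIST OFF;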


This is a common problem when querying for email addresses, for example, where the ‘@’ character serves as a separator. In this case you will find documents containing the exact email address you wanted, but also documents containing both parts of the email separated by some special character other than ‘@’, leading to false positives again. There are other variants of this behavior that can lead to false positives as well. While this behavior is expected, as you can see, it can lead to undesired results.

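Again, sys.dm_fts_parser makes this easy to see (a sketch only; someone@example.com is just a placeholder and the exact output depends on the word breaker):

-- The '@' acts as a separator, so the address is split into several terms
-- instead of being kept as one token.
select [display_term] from sys.dm_fts_parser('someone@example.com', 1033, 0, 0)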

 

In order to prevent this, the new SQL Server 2008 WB family introduces the ability to list a set of words/terms that should not be tokenized by a given WB, thus forcing them to be indexed as they are. These lists (called custom dictionaries) are simple text files, one associated with each language supported by SQL Server Full-Text Search.



Here are the details on how to accomplish this:


 

(Reference: http://technet.microsoft.com/en-us/library/cc263242.aspx)



A custom dictionary is a Unicode-encoded file that can be used to specify words that you want the word breaker of the same language to consider as complete words. Custom dictionaries are not provided, by default. To modify the word breaker behavior for more than one language you must create a separate custom dictionary for each language for which you want to modify the word breaker’s behavior. You cannot create a custom dictionary for the language-neutral word breaker.



-Example created by Venkat, one of our team testers:



Consider the term “34590-97-8”. When this is passed to the German wordbreaker (for instance), the output terms are many:



select [display_term] from sys.dm_fts_parser('34590-97-8', 1031, 0, 0)


display_term
------------
34590
nn34590
97
nn97
8
nn8



This output can be seen using:



select * from sys.dm_fts_parser('34590-97-8', 1031, 0, 0)


If you need to keep these numbers together for business reasons (e.g. this is a chemical code, account, phone number, etc.), you then need to create a custom dictionary for the German language in this case. Here are the steps:



1.      Log on to the machine (on which sql server is installed) as a member of the Administrators group.



2.      Start Notepad and type the words you want in your custom dictionary. Be sure to avoid invalid entries as described in the Reference link included above. Remember that each word must be on a separate line, separated by a carriage return (CR) and line feed (LF). (A sample dictionary file is sketched after these steps.)



3.      On the File menu, click Save As.



4.      In the Save as type list, select All Files.



5.      In the Encoding list, select Unicode.



6.      In the File name box, type the file name in the following format: CustomNNNN.lex, where NNNN is the hex language code of the language for which you are creating the custom dictionary (e.g. English = 1033). See Table 1 in the document referenced earlier in this post for a list of valid file names for supported languages and dialects.

(Translator's note: you can run SELECT * FROM sys.fulltext_languages to list all languages supported by full-text search, together with their LCIDs.)


7.      In the Save in list, navigate to the folder that contains the word breakers. By default, this is the sql Binn directory. Example C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn.



8.      Click Save.



9.      Restart fdhost by executing "exec sp_fulltext_service 'restart_all_fdhosts'" on the SQL Server instance.

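For the German example above, the custom dictionary file referenced in step 2 would simply contain the terms you want kept whole, one per line, saved as Unicode (the exact CustomNNNN.lex file name follows the rule in step 6 and Table 1 of the referenced article; 'ISBN-7' from the earlier example is added here only as a second illustrative entry):

34590-97-8
ISBN-7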


Now, with the custom dictionary in place, the word breaker will treat the term as a single word:



select [display_term] from sys.dm_fts_parser('34590-97-8', 1031, 0, 0)


display_term
------------
34590-97-8


Currently, these text files only support a list of words, not more complex expressions such as regular expressions or wildcards (e.g. ISBN-*).



Important Note: This will only work with the newly shipped SQL Server 2008 word breakers, not with the older word breakers that remained unchanged in the 2008 release. These new word breakers are also shipped with the OS on Vista+ operating systems; the word breaker DLLs have names such as NaturalLanguage6.dll, NLS*.dll, etc. As a notable exception, the English word breaker shipped with SQL Server 2008 is identical to the 2005 version, so it does not belong to this new category of word breakers capable of supporting a custom dictionary. In order to use a custom dictionary for English in SQL Server 2008, you will need to change the registry registration and indicate that the English WB that exists in your OS (only if it is Vista+) is the one that should be used by your SQL instance from that moment on. Please follow this article on how to change this registration for a given WB or iFilter:


http://msdn.microsoft.com/en-us/library/ms345188.aspx




The above article shows how to do it for Danish, Polish etc. To do this for the new English word breaker, you need the corresponding values for the new English Word Breaker and stemmer. The wordbreaker files to copy to your sql Binn directory are below. These should be in your windows\system32 folder on Vista+ operating systems.


1.      NlsData0009.dll


2.      NlsLexicons0009.dll


3.      NlsGrammars0009.dll


Other settings needed:



Type            Name            Data
String value    TsaurusFile     tsEnu.xml
DWORD value     Locale          00000409
String value    WBreakerClass   {9DAA54E8-CD95-4107-8E7F-BA3F24732D95}
String value    StemmerClass    {61A48126-EF74-4d4a-9DDA-43FD542CAD1E}
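After making these registry changes, copying the DLLs, and creating your English custom dictionary, you can verify the behavior the same way as in the German example (a sketch; 'ISBN-7' stands in for whatever entry you put in the English dictionary file):

-- Restart the filter daemon hosts so the new word breaker and dictionary are picked up.
exec sp_fulltext_service 'restart_all_fdhosts'
-- 1033 = English LCID; with the custom dictionary in place the term should
-- come back as a single display_term instead of being split.
select [display_term] from sys.dm_fts_parser('ISBN-7', 1033, 0, 0)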



We hope this information is useful to you. We are going to include this information in our official Books Online (BOL) release in upcoming SQL Server 2008 SPs as well as in our next major release.

 


Regards!