Introduction to Full-Text Search in KingbaseES

Kingbase Research Institute

Last modified 2022-04-27 09:19:10

Tags: full-text search, natural language processing

First published 2021-10-12 13:45:21

Article link: https://blog.csdn.net/lyu1026/article/details/120719624

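The default text search parser built into KingbaseES splits text on whitespace. Because Chinese words are not separated by spaces, this approach does not work for Chinese. Full-text search over Chinese text requires an additional Chinese word-segmentation plugin, either zhparser or sys_jieba. zhparser supports the GBK and UTF8 character sets; sys_jieba supports UTF8.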
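One way to observe the limitation is to run the default parser directly on an English phrase and on a Chinese phrase. The sketch below assumes that KingbaseES exposes the PostgreSQL-compatible ts_parse function and a parser named default; both are assumptions carried over from PostgreSQL rather than something this article confirms. The sample strings are illustrative only.

```sql
-- Assumption: ts_parse and the parser name 'default' are available,
-- as they are in PostgreSQL.

-- English: words are separated by spaces, so the parser can emit
-- one token per word.
SELECT tokid, token
FROM ts_parse('default', 'full text search');

-- Chinese: there are no spaces between words, so the parser has no
-- word boundaries to split on.
SELECT tokid, token
FROM ts_parse('default', '全文检索功能介绍');
```

Comparing the two result sets shows whether the Chinese string was broken into words or returned as one undivided run, which is the gap that zhparser and sys_jieba are meant to fill.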

I. Default Whitespace Tokenization

1. tsvector

test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');
                             to_tsvector                              
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
                                                     to_tsvector                                                     
---------------------------------------------------------------------------------------------------------------------
 'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)

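As the results show, the english configuration normalizes words by stemming (and drops stop words), while simple only converts each word to lowercase. The third query passes no configuration and returns the same result as simple, because the default configuration on this instance is simple: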

test=# show default_text_search_config;
 default_text_search_config 
----------------------------
 pg_catalog.simple
(1 row)
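If stemming is wanted without naming the configuration in every call, the default can be changed for the current session. This is a minimal sketch, assuming KingbaseES accepts the same SET and RESET syntax for default_text_search_config that PostgreSQL does; it reuses the sentence from the queries above.

```sql
-- Assumption: default_text_search_config can be set per session,
-- as in PostgreSQL.
SET default_text_search_config = 'pg_catalog.english';

-- With no explicit configuration, to_tsvector now uses the
-- session default.
SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

-- Return to the server-level default for this session.
RESET default_text_search_config;
```

Setting it per session keeps the change local; other sessions continue to use the server default.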

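2. Normalization

Normalization does the following:

It always converts uppercase letters to lowercase.
It often removes suffixes (such as s, es and ing in English), so that a search matches the different forms of a word without you having to tediously enter every possible variant.
The numbers give each lexeme's positions in the original string; for example, "man" appears at positions 6 and 15.
Under the english configuration, to_tsvector ignores English stop words (such as am, is, are, a and an). On the instance shown above the default configuration is simple, which keeps stop words, so english must be specified explicitly to get this behavior.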
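To see these steps token by token, the PostgreSQL functions ts_debug and ts_lexize can be used. The sketch below assumes that KingbaseES provides both functions and a dictionary named english_stem, as PostgreSQL does; the article itself does not confirm this.

```sql
-- Assumption: ts_debug is available with PostgreSQL's column names.
-- For each token it shows the token type (alias), the dictionaries
-- consulted, and the resulting lexemes. Stop words are recognized
-- by a dictionary but produce no lexemes.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'Try not to become a man of value');

-- Assumption: the dictionary english_stem exists.
-- ts_lexize runs a single word through one dictionary, which
-- isolates the stemming step.
SELECT ts_lexize('english_stem', 'become');
```

Running the same ts_debug query with 'simple' in place of 'english' gives a side-by-side view of the difference between the two configurations.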

3. Searching a tsvector

test=
