PostgreSQL as a Real-Time, Efficient Search Engine - Core Full-Text Search Features

jinjiajia95

Category: postgres. Tags: postgres, full-text search, index

Article link: https://blog.csdn.net/weixin_40746796/article/details/89318480

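Search syntax
1.1 A tsquery is the search input. It supports AND, OR, NOT, and distance (phrase) operators:

& (AND), | (OR), ! (NOT), <-> (FOLLOWED BY), and <N> (FOLLOWED BY at a distance of exactly N positions)

Examples:

c occurs at two positions, and either one can satisfy a distance match. The outputs below assume a text search configuration, such as simple, that keeps the word a; the english configuration drops it as a stop word.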
postgres=# select to_tsvector('a b c c');
     to_tsvector     
---------------------
 'a':1 'b':2 'c':3,4
(1 row)

Adjacent:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <-> b');
 ?column? 
----------
 t
(1 row)

Adjacent means a position difference of 1, so <-> and <1> are equivalent:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <1> b');
 ?column? 
----------
 t
(1 row)

Distance 2, that is, a position difference of 2 (a at position 1, the first c at position 3):
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> c');
 ?column? 
----------
 t
(1 row)

Distance 3, that is, a position difference of 3 (a at position 1, the second c at position 4):
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <3> c');
 ?column? 
----------
 t
(1 row)

Distance 2 does not match here, because b is only 1 position after a:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> b');
 ?column? 
----------
 f
(1 row)
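
The distance rules above can be tried in one place. The following is a minimal sketch that passes the simple configuration explicitly (an assumption, so that a is not dropped as a stop word); phraseto_tsquery is used only as a convenient way to build a chain of <-> operators.

```sql
-- phraseto_tsquery turns plain text into a FOLLOWED BY chain: 'a' <-> 'b' <-> 'c'
SELECT phraseto_tsquery('simple', 'a b c');

-- a:1, b:2, c:3 are consecutive, so the phrase matches
SELECT to_tsvector('simple', 'a b c c') @@ phraseto_tsquery('simple', 'a b c');

-- b is at position 2 and the second c at position 4, so <2> matches
SELECT to_tsvector('simple', 'a b c c') @@ to_tsquery('simple', 'b <2> c');

-- no c sits directly after a, so <-> does not match
SELECT to_tsvector('simple', 'a b c c') @@ to_tsquery('simple', 'a <-> c');
```

Naming the configuration in every call keeps the result independent of the session's default_text_search_config setting.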

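1.2 Document structure (weights) is supported, with the following syntax

As with basic tsquery input, weights can be attached to each lexeme to restrict it to match only tsvector lexemes that carry one of those weights.
For example: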

SELECT to_tsquery('english', 'Fat | Rats:AB');  
    to_tsquery      
------------------  
 'fat' | 'rat':AB
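
To see the weight restriction take effect, the sketch below labels a one-word document with setweight and matches it against a weighted query. The simple configuration and the sample word are assumptions chosen for illustration.

```sql
-- The lexeme carries weight A, which is in the allowed set AB: matches
SELECT setweight(to_tsvector('simple', 'rats'), 'A') @@ to_tsquery('simple', 'rats:AB');

-- The lexeme carries weight D, which is not in AB: does not match
SELECT setweight(to_tsvector('simple', 'rats'), 'D') @@ to_tsquery('simple', 'rats:AB');
```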

1.3 Prefix matching is supported, with the following syntax

In addition, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');  
        to_tsquery          
--------------------------  
 'supern':*A & 'star':*AB
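
A prefix query matches any lexeme that begins with the given string. A small sketch, again assuming the simple configuration so that no stemming interferes:

```sql
-- 'supernova' starts with 'supern', so the prefix query matches
SELECT to_tsvector('simple', 'a bright supernova') @@ to_tsquery('simple', 'supern:*');

-- no lexeme in the document starts with 'star'
SELECT to_tsvector('simple', 'a bright supernova') @@ to_tsquery('simple', 'star:*');
```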

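1.4 Thesaurus dictionaries are supported and substitute phrases automatically. For example

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn: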

SELECT to_tsquery('''supernovae stars'' & !crab');  
  to_tsquery  
---------------  
 'sn' & !'crab'
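
The output above only appears when a thesaurus dictionary is part of the active configuration. The following is a rough sketch of such a setup, not the author's actual configuration. The names thesaurus_astro_demo and astro_en are hypothetical, and the sketch assumes that the sample file thesaurus_astro.ths is installed in the server's tsearch_data directory and contains a rule for supernovae stars.

```sql
-- Hypothetical dictionary name; DictFile refers to tsearch_data/thesaurus_astro.ths
CREATE TEXT SEARCH DICTIONARY thesaurus_astro_demo (
    TEMPLATE   = thesaurus,
    DictFile   = thesaurus_astro,
    Dictionary = english_stem    -- subdictionary used to normalize thesaurus entries
);

-- Hypothetical configuration name, copied from the built-in english configuration
CREATE TEXT SEARCH CONFIGURATION astro_en (COPY = english);

-- Consult the thesaurus first, then fall back to the stemmer
ALTER TEXT SEARCH CONFIGURATION astro_en
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro_demo, english_stem;

-- The quoted phrase is now offered to the thesaurus as a unit
SELECT to_tsquery('astro_en', '''supernovae stars'' & !crab');
```

The exact result depends on the rules in the thesaurus file that is actually installed.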

1.5 The search (match) operator is @@

select * from tbl where $tsvector_col @@ $tsquery;

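Ranking
PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is.

You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are: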

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

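ts_rank_cd computes the cover density ranking, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking, except that the proximity of matching lexemes to each other is also taken into consideration.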

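2.1 Both ranking functions support document-structure weights. The weights can be tuned by passing an array argument; if it is omitted, default values are used.

For both functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight array specifies how heavily to weigh each category of word, in this order: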

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, these defaults are used:

{0.1, 0.2, 0.4, 1.0}
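
The effect of the weights array can be checked on inline data. This sketch builds a tiny two-part document (the text and the simple configuration are illustrative assumptions) and ranks it once with the defaults and once with the A weight halved.

```sql
WITH doc AS (
    SELECT setweight(to_tsvector('simple', 'postgres search'), 'A') ||
           setweight(to_tsvector('simple', 'body text about search'), 'D') AS v,
           to_tsquery('simple', 'postgres') AS q
)
SELECT ts_rank(v, q)                         AS default_weights,  -- A counts as 1.0
       ts_rank('{0.1, 0.2, 0.4, 0.5}', v, q) AS a_weight_halved   -- A counts as 0.5
FROM doc;
```

Because the only match for postgres is labeled A, lowering the last array element lowers the rank.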

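2.2 A normalization bit mask handles long documents

Since a longer document has a greater chance of containing a query term, it is reasonable to take document size into account. For example, a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances.

Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank.

The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).

The mask values are: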

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1

If more than one flag bit is specified, the transformations are applied in the order listed.

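It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100%, as is sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.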
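
As a quick illustration of combining flags, the sketch below ranks one inline document (sample text and the simple configuration are assumptions) with no normalization, with flags 2 and 4 combined, and with flag 32.

```sql
WITH d AS (
    SELECT to_tsvector('simple', 'dark matter and more dark matter') AS v,
           to_tsquery('simple', 'dark & matter')                     AS q
)
SELECT ts_rank_cd(v, q)        AS raw_rank,
       ts_rank_cd(v, q, 2 | 4) AS by_length_and_extent_distance,
       ts_rank_cd(v, q, 32)    AS scaled_below_one
FROM d;
```

The last column is always strictly below 1 because it is computed as rank/(rank+1).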

The following example selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query  
WHERE query @@ textsearch  
ORDER BY rank DESC  
LIMIT 10;  
                     title                     |   rank  
-----------------------------------------------+----------  
 Neutrinos in the Sun                          |      3.1  
 The Sudbury Neutrino Detector                 |      2.4  
 A MACHO View of Galactic Dark Matter          |  2.01317  
 Hot Gas and Dark Matter                       |  1.91171  
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953  
 Rafting for Solar Neutrinos                   |      1.9  
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774  
 Hot Gas and Dark Matter                       |   1.6123  
 Ice Fishing for Cosmic Neutrinos              |      1.6  
 Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query  
WHERE  query @@ textsearch  
ORDER BY rank DESC  
LIMIT 10;  
                     title                     |        rank  
-----------------------------------------------+-------------------  
 Neutrinos in the Sun                          | 0.756097569485493  
 The Sudbury Neutrino Detector                 | 0.705882361190954  
 A MACHO View of Galactic Dark Matter          | 0.668123210574724  
 Hot Gas and Dark Matter                       |  0.65655958650282  
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973  
 Rafting for Solar Neutrinos                   | 0.655172410958162  
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637  
 Hot Gas and Dark Matter                       | 0.617195790024749  
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517  
 Weak Lensing Distorts the Universe            | 0.450010798361481
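
The normalized values follow directly from the raw ranks through rank/(rank+1). As a worked check, this sketch recomputes the first and last rows from the raw ranks 3.1 and 0.818218 shown in the unnormalized output:

```sql
SELECT round(3.1 / (3.1 + 1), 6)           AS neutrinos_in_the_sun,  -- 0.756098
       round(0.818218 / (0.818218 + 1), 6) AS weak_lensing;          -- 0.450011
```

Both results agree with the normalized listing to six decimal places, and the order of the rows is unchanged because rank/(rank+1) is monotonically increasing.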

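2.3 Setting weights for structured documents

The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking search results.

Because to_tsvector(NULL) returns NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document: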

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

Here setweight labels each lexeme in the finished tsvector with its source, and the labeled tsvector values are then merged with the tsvector concatenation operator ||.
Example:

1:
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
2:
-- requires the rum extension to be installed
CREATE INDEX ti_idx ON tt USING rum(ti);
3:
SELECT *,
       ts_rank('{1.0, 1.0, 1.0, 0.5}', ti, to_tsquery('testzhcfg', '婚姻|感情破裂'), 32 /* rank/(rank+1) */) AS rank
FROM tt
ORDER BY rank DESC
LIMIT 30;
Note: {1.0, 1.0, 1.0, 0.5} gives the weights for {D, C, B, A}, in that order. testzhcfg is a custom Chinese text search configuration.
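
To see what the setweight and || combination produces, and why coalesce matters, here is a small self-contained sketch on literal values. The sample title and the simple configuration are assumptions for illustration.

```sql
-- to_tsvector is strict: a NULL input yields NULL, which would make the whole
-- concatenation NULL
SELECT to_tsvector('simple', NULL::text) IS NULL;

-- With coalesce, the missing body contributes an empty tsvector instead
SELECT setweight(to_tsvector('simple', coalesce('Dark Matter', '')), 'A') ||
       setweight(to_tsvector('simple', coalesce(NULL::text, '')), 'D');
-- 'dark':1A 'matter':2A
```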

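Special feature - generating document statistics

3.1 ts_stat executes sqlquery, which must return a single tsvector column, and reports which lexemes occur in that column, how many documents each lexeme appears in, and how many times each lexeme occurs in total.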

ts_stat(sqlquery text, [ weights text, ]  
        OUT word text, OUT ndoc integer,  
        OUT nentry integer) returns setof record

Return values

word text — the value of a lexeme  
  
ndoc integer — number of documents (tsvectors) the word occurred in  
  
nentry integer — total number of occurrences of the word

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')  
ORDER BY nentry DESC, ndoc DESC, word  
LIMIT 10;

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')  
ORDER BY nentry DESC, ndoc DESC, word  
LIMIT 10;
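
The difference between ndoc and nentry is easiest to see on a tiny data set. The sketch below uses a hypothetical temporary table demo_docs with two documents and the simple configuration.

```sql
-- Hypothetical demo table; the column name mirrors the apod example
CREATE TEMP TABLE demo_docs (vector tsvector);

INSERT INTO demo_docs VALUES
    (to_tsvector('simple', 'dark matter')),
    (to_tsvector('simple', 'dark dark energy'));

SELECT * FROM ts_stat('SELECT vector FROM demo_docs')
ORDER BY nentry DESC, ndoc DESC, word;
-- dark:   ndoc = 2, nentry = 3
-- energy: ndoc = 1, nentry = 1
-- matter: ndoc = 1, nentry = 1
```

dark appears in both documents (ndoc = 2) and three times overall (nentry = 3).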

Full-text search (using the text search types together with GIN or RUM indexes)

4.1 Creating indexes

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));

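In the statement above, config_name is a column of the pgweb table that holds the configuration to use for each row. Indexes can even concatenate columns: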

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
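
An expression index is only considered when the query repeats the indexed expression, including the explicit configuration name. A minimal sketch on a hypothetical temporary table pgweb_demo (the table and its row are made up for illustration):

```sql
CREATE TEMP TABLE pgweb_demo (title text, body text);
INSERT INTO pgweb_demo VALUES
    ('Create table guide', 'How to create a table in PostgreSQL');

CREATE INDEX pgweb_demo_idx ON pgweb_demo
    USING gin (to_tsvector('english', title || ' ' || body));

-- Same configuration and same expression as the index definition,
-- so the planner is able to use pgweb_demo_idx
SELECT title
FROM pgweb_demo
WHERE to_tsvector('english', title || ' ' || body)
      @@ to_tsquery('english', 'create & table');
```

On a table this small the planner will usually prefer a sequential scan; the point is only that the WHERE clause has a form the index can serve.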

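Another approach is to create a separate tsvector column to hold the output of to_tsvector. This example concatenates title and body, using coalesce to ensure that one field is still indexed when the other is NULL: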

ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector;
UPDATE pgweb SET textsearchable_index_col =
     to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));

Then we create a GIN index to speed up the search:

CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);

Now we are ready to perform a fast full-text search:

SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;

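When a separate column stores the tsvector representation, it is necessary to create a trigger that keeps the tsvector column current whenever title or body changes. The trigger below is written for a table named messages with a tsvector column tsv; adapt the names to your own table.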

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
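
On PostgreSQL 12 or later, a stored generated column is an alternative to the trigger, since the server recomputes it on every insert and update. The following is a sketch under that version assumption; the table pgweb_gen_demo and the column search_vec are hypothetical names.

```sql
CREATE TEMP TABLE pgweb_gen_demo (
    title text,
    body  text,
    -- recomputed automatically whenever title or body changes
    search_vec tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED
);

-- body is NULL, but coalesce keeps the title searchable
INSERT INTO pgweb_gen_demo (title, body) VALUES ('Create table', NULL);

SELECT title
FROM pgweb_gen_demo
WHERE search_vec @@ to_tsquery('english', 'create & table');
```

The generation expression must be immutable, which is why the configuration name is spelled out instead of relying on default_text_search_config.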

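Query rewriting

The ts_rewrite family of functions searches a given tsquery for occurrences of a target subquery and replaces each occurrence with a substitute subquery. In essence, this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule, and a collection of such rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries. However, you can modify a set of rewrite rules on the fly without reindexing, whereas updating a thesaurus requires reindexing to take effect.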

ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery
This form of ts_rewrite applies a single rewrite rule: target is replaced by substitute wherever it appears in query. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

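Rewriting can be slow when there are many rewrite rules, since every rule is checked for a possible match. To filter out obvious non-candidate rules, we can use the containment operators for the tsquery type. In the example below, we select only those rules that might match the original query: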

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 ts_rewrite
------------
 'b' & 'c'
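
The query above reads its rules from a table named aliases whose definition is not shown. A minimal setup that reproduces the result could look like this; the temporary table and the single rule a -> c are assumptions inferred from the output.

```sql
-- One row per rewrite rule: t is the target, s is the substitute
CREATE TEMP TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES ('a', 'c');

-- Apply every rule in the table
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t, s FROM aliases');

-- Apply only rules whose target is contained in the original query
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM aliases WHERE ''a & b''::tsquery @> t');
-- both return 'b' & 'c'
```

Because the rules live in an ordinary table, they can be changed with INSERT, UPDATE, or DELETE and take effect on the next query without reindexing.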
