PostgreSQL里的17种文本相似算法与GIN索引 - pg_similarity

标签

PostgreSQL , 文本相似 , pg_similarity , pg_trgm , rum , fuzzymatch gin , smlar


背景

文本相似算法,结合PostgreSQL的开放索引框架GIN,可以实现各种相似算法的文本高效检索。

PostgreSQL中常见的文本相似搜索插件:rum, pg_trgm, fuzzymatch, pg_similarity, smlar。

其中pg_similarity支持的算法达到了17种。

Introduction

pg_similarity is an extension to support similarity queries on PostgreSQL.

The implementation is tightly integrated in the RDBMS in the sense that it defines operators
so instead of the traditional operators (= and <>) you can use ~~~ and ! (any of these
operators represents a similarity function).

pg_similarity has three main components:

Functions:

a set of functions that implements similarity algorithms available in the literature.

These functions can be used as UDFs and, will be the base for implementing the similarity operators;

Operators:

a set of operators defined at the top of similarity functions.

They use similarity functions to obtain the similarity threshold and,
compare its value to a user-defined threshold to decide if it is a match or not;

Session Variables:

a set of variables that store similarity function parameters. Theses variables can be defined at run time.

  • L1 Distance (as known as City Block or Manhattan Distance);
  • Cosine Distance;
  • Dice Coefficient;
  • Euclidean Distance;
  • Hamming Distance;
  • Jaccard Coefficient;
  • Jaro Distance;
  • Jaro-Winkler Distance;
  • Levenshtein Distance;
  • Matching Coefficient;
  • Monge-Elkan Coefficient;
  • Needleman-Wunsch Coefficient;
  • Overlap Coefficient;
  • Q-Gram Distance;
  • Smith-Waterman Coefficient;
  • Smith-Waterman-Gotoh Coefficient;
  • Soundex Distance.

用法如下

AlgorithmFunctionOperatorUse Index?Parameters
L1 Distanceblock(text, text) returns float8~++yespg_similarity.block_tokenizer (enum)
pg_similarity.block_threshold (float8)
pg_similarity.block_is_normalized (bool)
Cosine Distancecosine(text, text) returns float8~##yespg_similarity.cosine_tokenizer (enum)
pg_similarity.cosine_threshold (float8)
pg_similarity.cosine_is_normalized (bool)
Dice Coefficientdice(text, text) returns float8~-~yespg_similarity.dice_tokenizer (enum)
pg_similarity.dice_threshold (float8)
pg_similarity.dice_is_normalized (bool)
Euclidean Distanceeuclidean(text, text) returns float8~!!yespg_similarity.euclidean_tokenizer (enum)
pg_similarity.euclidean_threshold (float8)
pg_similarity.euclidean_is_normalized (bool)
Hamming Distancehamming(bit varying, bit varying) returns float8
hamming_text(text, text) returns float8
~@~nopg_similarity.hamming_threshold (float8)
pg_similarity.hamming_is_normalized (bool)
Jaccard Coefficientjaccard(text, text) returns float8~??yespg_similarity.jaccard_tokenizer (enum)
pg_similarity.jaccard_threshold (float8)
pg_similarity.jaccard_is_normalized (bool)
Jaro Distancejaro(text, text) returns float8~%%nopg_similarity.jaro_threshold (float8)
pg_similarity.jaro_is_normalized (bool)
Jaro-Winkler Distancejarowinkler(text, text) returns float8~@@nopg_similarity.jarowinkler_threshold (float8)
pg_similarity.jarowinkler_is_normalized (bool)
Levenshtein Distancelev(text, text) returns float8~==nopg_similarity.levenshtein_threshold (float8)
pg_similarity.levenshtein_is_normalized (bool)
Matching Coefficientmatchingcoefficient(text, text) returns float8~^^yespg_similarity.matching_tokenizer (enum)
pg_similarity.matching_threshold (float8)
pg_similarity.matching_is_normalized (bool)
Monge-Elkan Coefficientmongeelkan(text, text) returns float8~||nopg_similarity.mongeelkan_tokenizer (enum)
pg_similarity.mongeelkan_threshold (float8)
pg_similarity.mongeelkan_is_normalized (bool)
Needleman-Wunsch Coefficientneedlemanwunsch(text, text) returns float8~#~nopg_similarity.nw_threshold (float8)
pg_similarity.nw_is_normalized (bool)
Overlap Coefficientoverlapcoefficient(text, text) returns float8~**yespg_similarity.overlap_tokenizer (enum)
pg_similarity.overlap_threshold (float8)
pg_similarity.overlap_is_normalized (bool)
Q-Gram Distanceqgram(text, text) returns float8~~~yespg_similarity.qgram_threshold (float8)
pg_similarity.qgram_is_normalized (bool)
Smith-Waterman Coefficientsmithwaterman(text, text) returns float8~=~nopg_similarity.sw_threshold (float8)
pg_similarity.sw_is_normalized (bool)
Smith-Waterman-Gotoh Coefficientsmithwatermangotoh(text, text) returns float8~!~nopg_similarity.swg_threshold (float8)
pg_similarity.swg_is_normalized (bool)
Soundex Distancesoundex(text, text) returns float8~*~no

参考

http://pgsimilarity.projects.pgfoundry.org/

https://github.com/eulerto/pg_similarity

https://www.pgcon.org/2009/schedule/attachments/108_pg_similarity.pdf

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

https://github.com/postgrespro/rum

https://www.postgresql.org/docs/9.6/static/fuzzystrmatch.html

https://www.postgresql.org/docs/9.6/static/pgtrgm.html

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值