对盗图、盗文、盗墓深恶痛绝吗？PostgreSQL结合余弦、线性相关算法在文本、图片、数组相似等领域的应用 - 1 理论

背景

很多网站有标签的功能，会根据网页自动生成标签，标签实际上就是该网页的关键词，比如一个卖手机的网页，那么标签是如何生成的呢？

在一篇文档里面，是不是出现越多的词，就越是关键词呢？

比如在中文里面的、是、我、你可能出现次数是比较多的，它们很显然不是关键词，这些属于stop word，是需要被忽略的。

另外还有一些词，虽然不是stop word，但是也可能不算关键词，比如应用、科学、普通话，等等一些在很多文章中都可能出现的词，可能不是关键词。

有什么好的算法能提取文章的关键词呢？

TF-IDF算法

TF-IDF是一种统计方法，用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。字词的重要性随着它在文件中出现的次数成正比增加，但同时会随着它在语料库中出现的频率成反比下降。

TF-IDF加权的各种形式常被搜索引擎应用，作为文件与用户查询之间相关程度的度量或评级。除了TF-IDF以外，因特网上的搜索引擎还会使用基于链接分析的评级方法，以确定文件在搜寻结果中出现的顺序。

1. tf比较好理解，就是一个词在当前文本的出现频率，后面会有例子。

2. idf是一个加权，逆向文件频率（inverse document frequency，IDF）是一个词语普遍重要性的度量。也就是说一些普通词，因此越普通的词，它的IDF越低。

某一特定词语的IDF，可以由总文件数目除以包含该词语之文件的数目，再取对数得到IDF。

3. 计算一个词在文档中的关键性，计算TF与IDF的乘积即得到。因此，TF-IDF倾向于过滤掉常见的词语，保留重要的词语。

举例

例1

有很多不同的数学公式可以用来计算TF-IDF。

这边的例子以上述的数学公式来计算。

1. 词频 (TF) 是一词语出现的次数除以该文件的总词语数。

假如一篇文件的总词语数是100个，而词语“母牛”出现了3次，那么“母牛”一词在该文件中的词频就是3/100=0.03。

2. 一个计算文件频率 (IDF) 的方法是文件集里包含的文件总数除以测定有多少份文件出现过“母牛”一词。

所以，如果“母牛”一词在1,000份文件出现过，而文件总数是10,000,000份的话，其逆向文件频率就是 log(10,000,000 / 1,000)=4。

3. 最后“母牛”的TF-IDF的分数为0.03 * 4=0.12。

TF-IDF越高，说明这个词在该文本中越关键。

例2

在某个一共有一千词的网页中“原子能”、“的”和“应用”分别出现了 2 次、35 次和 5 次，那么它们的词频就分别是 0.002、0.035 和 0.005。

我们将这三个数相加，其和 0.042 就是相应网页和查询“原子能的应用” 相关性的一个简单的度量。

概括地讲，如果一个查询包含关键词 w1,w2,...,wN, 它们在一篇特定网页中的词频分别是: TF1, TF2, ..., TFN。（TF: term frequency)。那么，这个查询和该网页的相关性就是:TF1 + TF2 + ... + TFN。

读者可能已经发现了又一个漏洞。

在上面的例子中，词“的”占了总词频的 80% 以上，而它对确定网页的主题几乎没有用。我们称这种词叫“应删除词”（Stopwords)，也就是说在度量相关性是不应考虑它们的频率。在汉语中，应删除词还有“是”、“和”、“中”、“地”、“得”等等几十个。

忽略这些应删除词后，上述网页的相似度就变成了0.007，其中“原子能”贡献了 0.002，“应用”贡献了 0.005。

细心的读者可能还会发现另一个小的漏洞。

在汉语中，“应用”是个很通用的词，而“原子能”是个很专业的词，后者在相关性排名中比前者重要。因此我们需要给汉语中的每一个词给一个权重(IDF)，这个权重的设定必须满足下面两个条件：

1. 一个词预测主题能力越强，权重就越大，反之，权重就越小。我们在网页中看到“原子能”这个词，或多或少地能了解网页的主题。我们看到“应用”一次，对主题基本上还是一无所知。因此，“原子能“的权重就应该比应用大。

2. 应删除词的权重应该是零。

我们很容易发现，如果一个关键词只在很少的网页中出现，我们通过它就容易锁定搜索目标，它的权重也就应该大。

反之如果一个词在大量网页中出现，我们看到它仍然不是很清楚要找什么内容，因此它应该小。

概括地讲，假定一个关键词 w 在 Dw 个网页中出现过，那么 Dw 越大，w的权重越小，反之亦然。

在信息检索中，使用最多的权重是“逆文本频率指数” （Inverse document frequency 缩写为IDF），它的公式为log（D/Dw）其中D是全部网页数。

比如，我们假定中文网页数是D=10亿，应删除词“的”在所有的网页中都出现，即Dw=10亿，那么它的IDF=log(10亿/10亿）= log (1) = 0。

假如专用词“原子能”在两百万个网页中出现，即Dw=200万，则它的权重IDF=log(500) =2.7。

又假定通用词“应用”，出现在五亿个网页中，它的权重IDF = log(2)则只有 0.3。

也就是说，在网页中找到一个“原子能”的匹配相当于找到九个“应用”的匹配。

利用 IDF，上述相关性计算的公式就由词频的简单求和变成了加权求和，即 TF1IDF1 +　TF2IDF2 +... + TFN*IDFN。

在上面的例子中，该网页和“原子能的应用”的相关性为 0.0069，其中“原子能”贡献了 0.0054，而“应用”只贡献了0.0015。

这个比例和我们的直觉比较一致了。

词库

词库与idf

在输入法中，有词库的概念，比如你从事的是IT行业，和同行沟通时，会用到很多IT行业的术语，词库的词组可用于切词。

同一个词语，在不同行业的词库中IDF应该是不一样的，因为他们的集合不一样。比如IT行业，总的文档数是100亿。医疗行业总的文档数是10亿。

“生物”这个词在这两个行业中的IDF会一样吗？

计算IDF

就像人口普查一样，IDF的计算也可以做采样，当然也可以全网计算，特别是搜索引擎，每天都在爬各种网站的内容，搜索引擎的内容应该算是最大的。

从搜索引擎计算出来的IDF相对会比较准确，不过前面也说了，如果做垂直行业的关键字提取，需要建立行业自己的IDF，这样提取的关键字准确度可以做到更高。

那么如何计算IDF呢？实际上很简单，以数据库为例。

《聊一聊双十一背后的技术 - 分词和搜索》

MADlib TF 统计接口

MADlib是Pivotal与伯克利大学合作的一个开源机器学习库，可以用在Greenplum, PostgreSQL, HAWQ等数据库中。

这里面也有一个TF统计的接口

http://madlib.incubator.apache.org/docs/latest/group__grp__text__utilities.html

内容如下，调用term_frequency这个函数可以对存储文本的表进行统计，得到每条记录所有WORD的TF

Term frequency tf(t,d) is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t occurs in document d.

For this function, 'word' and 'term' are used interchangeably.

Note: the term frequency is not normalized by the document length.

    term_frequency(input_table,    
                   doc_id_col,    
                   word_col,    
                   output_table,    
                   compute_vocab)

参数解释如下

Arguments:    

input_table    
  TEXT.     
  The name of the table storing the documents.     
  Each row is in the form <doc_id, word_vector> where doc_id is an id, unique to each document, and word_vector is a text array containing the words in the document.     
  The word_vector should contain multiple entries of a word if the document contains multiple occurrence of that word.    

id_col    
  TEXT.     
  The name of the column containing the document id.    

word_col    
  TEXT.     
  The name of the column containing the vector of words/terms in the document.     
  This column should of type that can be cast to TEXT[], 例如tsvector.    

output_table    
  TEXT.     
  The name of the table to store the term frequency output.     
  The output table contains the following columns:    
    id_col:     
      This the document id column (same as the one provided as input).    
    word:     
      A word/term present in a document. This is either the original word present in word_col or an id representing the word (depending on the value of compute_vocab below).    
    count:     
      The number of times this word is found in the document.    
    compute_vocab:    
      BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary is to be created.     
      If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word.     
      The table is called output_table_vocabulary (suffix added to the output_table name) and contains the following columns:    
    wordid:     
      An id assignment for each word    
    word:     
      The word/term

PostgreSQL 全文统计

PostgreSQL支持全文检索类型(tsvector)，使用ts_stat函数可以对全表统计，word出现的总次数，word出现的文本数。

可以用于提取全表的关键词，也可以用于检查分词的有效性，如果发现文本中有大量的stop-word出现，说明分词效果不好。

https://www.postgresql.org/docs/9.6/static/textsearch-features.html#TEXTSEARCH-STATISTICS

ts_stat函数用法如下

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.    

ts_stat(sqlquery text, [ weights text, ]    
        OUT word text, OUT ndoc integer,    
        OUT nentry integer) returns setof record    

sqlquery is a text value containing an SQL query which must return a single tsvector column.     

ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data.     

The columns returned are    

  word text — the value of a lexeme    

  ndoc integer — number of documents (tsvectors) the word occurred in    

  nentry integer — total number of occurrences of the word    

  If weights is supplied, only occurrences having one of those weights are counted.    

For example, to find the ten most frequent words in a document collection:

例子

SELECT * FROM ts_stat('SELECT vector FROM apod')    
ORDER BY nentry DESC, ndoc DESC, word    
LIMIT 10;    
The same, but counting only word occurrences with weight A or B:    

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')    
ORDER BY nentry DESC, ndoc DESC, word    
LIMIT 10;

例子

postgres=# SELECT * from (values (tsvector 'a b c'),(to_tsvector( 'b c d b'))) t(word);    
        word             
---------------------    
 'a' 'b' 'c'    
 'b':1,4 'c':2 'd':3    
(2 rows)    

postgres=# SELECT * FROM ts_stat($$SELECT * from (values (tsvector 'a b c'),(to_tsvector( 'b c d b'))) t(word)$$);    
 word | ndoc | nentry     
------+------+--------    
 d    |    1 |      1    
 c    |    2 |      2    
 b    |    2 |      3    
 a    |    1 |      1    
(4 rows)

在数据库中如何训练生成IDF

计算所有词（包括stop word）的idf

测试表，每条记录包含一个PK，同时包含一个文本

create table doc(id int primary key, info text);    

postgres=# insert into doc values (1,'hi i am digoal');    
INSERT 0 1    

postgres=# insert into doc values (2,'hi i am abc');    
INSERT 0 1

使用对应的分词(ts_config)配置，对每个文本进行分词，并计算出word在整表的idf，记录数越多，IDF越准确（类似文本训练）

我们还需要用到ts_debug这个函数，它会得到这个词的token type。

postgres=# select * from  ts_debug('a b c hello i am digoal');    
   alias   |   description   | token  |  dictionaries  |  dictionary  | lexemes      
-----------+-----------------+--------+----------------+--------------+----------    
 asciiword | Word, all ASCII | a      | {english_stem} | english_stem | {}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | b      | {english_stem} | english_stem | {b}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | c      | {english_stem} | english_stem | {c}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | hello  | {english_stem} | english_stem | {hello}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | i      | {english_stem} | english_stem | {}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | am     | {english_stem} | english_stem | {}    
 blank     | Space symbols   |        | {}             |              |     
 asciiword | Word, all ASCII | digoal | {english_stem} | english_stem | {digoal}    
(13 rows)

https://www.postgresql.org/docs/9.6/static/textsearch-parsers.html

默认的parser包括如下token type

Alias	Description	Example
asciiword	Word, all ASCII letters	elephant
word	Word, all letters	ma?ana
numword	Word, letters and digits	beta1
asciihword	Hyphenated word, all ASCII	up-to-date
hword	Hyphenated word, all letters	lógico-matemática
numhword	Hyphenated word, letters and digits	postgresql-beta1
hword_asciipart	Hyphenated word part, all ASCII	postgresql in the context postgresql-beta1
hword_part	Hyphenated word part, all letters	lógico or matemática in the context lógico-matemática
hword_numpart	Hyphenated word part, letters and digits	beta1 in the context postgresql-beta1
email	Email address	foo@example.com
protocol	Protocol head	http://
url	URL	example.com/stuff/index.html
host	Host	example.com
url_path	URL path	/stuff/index.html, in the context of a URL
file	File or path name	/usr/local/foo.txt, if not within a URL
sfloat	Scientific notation	-1.234e56
float	Decimal notation	-1.234
int	Signed integer	-1234
uint	Unsigned integer	1234
version	Version number	8.3.0
tag	XML tag
entity	XML entity	&
blank	Space symbols	(any whitespace or punctuation not otherwise recognized)

统计IDF

with t1 as (    
  select count(*) as cnt from doc    
),    
t2 as (    
  select id, alias, token from     
    (    
      select id,(ts_debug(info)).* from doc    
    ) t    
  group by id, alias, token    
)    
select t2.token, t2.alias, log(t1.cnt/count(t2.*)) as idf from t1,t2 group by t2.token,t2.alias,t1.cnt;    


 token  |   alias   |        idf            
--------+-----------+-------------------    
        | blank     |                 0    
 hi     | asciiword |                 0    
 abc    | asciiword | 0.301029995663981    
 am     | asciiword |                 0    
 i      | asciiword |                 0    
 digoal | asciiword | 0.301029995663981    
(6 rows)

使用PostgreSQL提取关键词

如何提取每篇文档的关键词？

1. 计算tf

计算每条记录（假设每篇文本一条记录）有多少词

set default_text_search_config='pg_catalog.english';  

select id, length(to_tsvector(info)) as cnt from doc;

计算每篇文档，每个词出现了多少次

select id, (ts_stat('select to_tsvector(info) from doc where id='||id)).* from doc;

还有一种方法计算tf

https://www.postgresql.org/docs/9.6/static/textsearch-controls.html#TEXTSEARCH-RANKING

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
  Ranks vectors based on the frequency of their matching lexemes.

normalization
0 (the default) ignores the document length

1 divides the rank by 1 + the logarithm of the document length

2 divides the rank by the document length

4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)

8 divides the rank by the number of unique words in document

16 divides the rank by 1 + the logarithm of the number of unique words in document

32 divides the rank by itself + 1

源码

src/backend/utils/adt/tsrank.c

Datum
ts_rank_ttf(PG_FUNCTION_ARGS)
{
        TSVector        txt = PG_GETARG_TSVECTOR(0);
        TSQuery         query = PG_GETARG_TSQUERY(1);
        int                     method = PG_GETARG_INT32(2);
        float           res;

        res = calc_rank(getWeights(NULL), txt, query, method);

        PG_FREE_IF_COPY(txt, 0);
        PG_FREE_IF_COPY(query, 1);
        PG_RETURN_FLOAT4(res);
}

Datum
ts_rank_tt(PG_FUNCTION_ARGS)
{
        TSVector        txt = PG_GETARG_TSVECTOR(0);
        TSQuery         query = PG_GETARG_TSQUERY(1);
        float           res;

        res = calc_rank(getWeights(NULL), txt, query, DEF_NORM_METHOD);

        PG_FREE_IF_COPY(txt, 0);
        PG_FREE_IF_COPY(query, 1);
        PG_RETURN_FLOAT4(res);
}


static float4
calc_rank_cd(const float4 *arrdata, TSVector txt, TSQuery query, int method)
{
        DocRepresentation *doc;
        int                     len,
                                i,
                                doclen = 0;
        CoverExt        ext;
        double          Wdoc = 0.0;
        double          invws[lengthof(weights)];
        double          SumDist = 0.0,
                                PrevExtPos = 0.0,
                                CurExtPos = 0.0;
        int                     NExtent = 0;
        QueryRepresentation qr;


        for (i = 0; i < lengthof(weights); i++)
        {
                invws[i] = ((double) ((arrdata[i] >= 0) ? arrdata[i] : weights[i]));
                if (invws[i] > 1.0)
                        ereport(ERROR,
                                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                                         errmsg("weight out of range")));
                invws[i] = 1.0 / invws[i];
        }

        qr.query = query;
        qr.operandData = (QueryRepresentationOperand *)
                palloc0(sizeof(QueryRepresentationOperand) * query->size);

        doc = get_docrep(txt, &qr, &doclen);
        if (!doc)
        {
                pfree(qr.operandData);
                return 0.0;
        }

        MemSet(&ext, 0, sizeof(CoverExt));
        while (Cover(doc, doclen, &qr, &ext))
        {
                double          Cpos = 0.0;
                double          InvSum = 0.0;
                int                     nNoise;
                DocRepresentation *ptr = ext.begin;

                while (ptr <= ext.end)
                {
                        InvSum += invws[WEP_GETWEIGHT(ptr->pos)];
                        ptr++;
                }

                Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

                /*
                 * if doc are big enough then ext.q may be equal to ext.p due to limit
                 * of posional information. In this case we approximate number of
                 * noise word as half cover's length
                 */
                nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
                if (nNoise < 0)
                        nNoise = (ext.end - ext.begin) / 2;
                Wdoc += Cpos / ((double) (1 + nNoise));

                CurExtPos = ((double) (ext.q + ext.p)) / 2.0;
                if (NExtent > 0 && CurExtPos > PrevExtPos               /* prevent devision by
                                                                                                                 * zero in a case of
                                multiple lexize */ )
                        SumDist += 1.0 / (CurExtPos - PrevExtPos);

                PrevExtPos = CurExtPos;
                NExtent++;
        }

        if ((method & RANK_NORM_LOGLENGTH) && txt->size > 0)
                Wdoc /= log((double) (cnt_length(txt) + 1));

        if (method & RANK_NORM_LENGTH)
        {
                len = cnt_length(txt);
                if (len > 0)
                        Wdoc /= (double) len;
        }

        if ((method & RANK_NORM_EXTDIST) && NExtent > 0 && SumDist > 0)
                Wdoc /= ((double) NExtent) / SumDist;

        if ((method & RANK_NORM_UNIQ) && txt->size > 0)
                Wdoc /= (double) (txt->size);

        if ((method & RANK_NORM_LOGUNIQ) && txt->size > 0)
                Wdoc /= log((double) (txt->size + 1)) / log(2.0);

        if (method & RANK_NORM_RDIVRPLUS1)
                Wdoc /= (Wdoc + 1);

        pfree(doc);

        pfree(qr.operandData);

        return (float4) Wdoc;
}

2. 计算idf

with t1 as (    
  select count(*) as cnt from doc    
),    
t2 as (    
  select id, alias, token from     
    (    
      select id,(ts_debug(info)).* from doc    
    ) t    
  group by id, alias, token    
)    
select t2.token, t2.alias, log(t1.cnt/count(t2.*)) as idf from t1,t2 group by t2.token,t2.alias,t1.cnt;

3. 计算每个词的tf-idf