Elasticsearch term vector

爱喝咖啡的程序员

Article link: https://blog.csdn.net/miaomiao19971215/article/details/105880195

1. Concepts
2. When term vector data is generated
3. Exploring the data

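1. Concepts

A term vector exposes statistics about each term (an indivisible token produced by analysis) in a given field of a document. It includes the following:

term information: term frequency in the field, that is, the number of times the term occurs in the field.
term positions: the position (token index) of each occurrence of the term in the field.
start and end offsets: the character offsets at which each occurrence starts and ends; the start is inclusive and the end is exclusive. For example, if a field of a document holds "abc def ghi", then abc has start offset 0 and end offset 3.
term payloads: arbitrary binary data attached to each term position, returned as base64-encoded bytes. They exist only if a token filter in the analyzer sets them, for example type_as_payload, which stores the token type.
term statistics: statistics about the term, returned only when term_statistics is set to true. They comprise total term frequency (how many times the term occurs across all documents) and document frequency (how many documents contain the term).
field statistics: statistics about the field, comprising document count (how many documents contain this field), sum of document frequencies (the sum of the document frequencies of all terms in this field), and sum of total term frequencies (the sum of the total term frequencies of all terms in this field).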
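
To make positions and offsets concrete, here is a minimal Python sketch that mimics a whitespace tokenizer followed by lowercasing. The function name `term_vector` is hypothetical, and this is not how Elasticsearch implements analysis; it only reproduces the per-document numbers described above.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Return {term: {"term_freq": n, "tokens": [...]}} for one field value.

    Assumption: tokens are runs of non-whitespace characters, lowercased,
    which approximates a whitespace tokenizer plus a lowercase filter.
    """
    terms = defaultdict(lambda: {"term_freq": 0, "tokens": []})
    for position, match in enumerate(re.finditer(r"\S+", text)):
        entry = terms[match.group().lower()]
        entry["term_freq"] += 1
        entry["tokens"].append({
            "position": position,           # index of the token in the field
            "start_offset": match.start(),  # inclusive character offset
            "end_offset": match.end(),      # exclusive character offset
        })
    return dict(terms)


vector = term_vector("abc def ghi")
print(vector["abc"]["tokens"])
# [{'position': 0, 'start_offset': 0, 'end_offset': 3}]
```

A term that occurs several times gets one entry in `tokens` per occurrence, and its `term_freq` equals the length of that list.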

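The Elasticsearch documentation notes that term statistics and field statistics are not exact, because deleted documents are not taken into account. When Elasticsearch receives a delete request, it only marks the document as deleted; the data itself is physically removed later, when segments are merged.

Term vectors are rarely needed in day-to-day use and mostly serve data exploration. For example, a service such as Meituan can surface the terms that customers search for most often and use them for search suggestions and term recommendations.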

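2. When term vector data is generated

Term vectors involve many statistics about terms and fields, and Elasticsearch offers two ways to collect them.

index time
Term vectors are enabled through the term_vector setting in the mapping when the index is created. Elasticsearch then stores the term vector data for each document as it is indexed. This mode suits indices whose term vectors are inspected frequently.

For example, compare the definitions of my_text and fullname: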

PUT index_name
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom", 
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": { # index time
        "type": "text",
        "term_vector": "with_positions_offsets_payloads", # other values: no, yes, with_positions, with_offsets, with_positions_offsets, with_positions_payloads
        "store": true, 
        "analyzer": "fulltext_analyzer"
      },
      "fullname": { # query time
        "type": "text",
        "analyzer": "fulltext_analyzer"
      }
    }
  }
}

query time
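Elasticsearch computes the term vectors when the request is made. This approach, also called "on the fly", suits indices whose term vectors are rarely inspected. In the example above, the fullname field uses query time. The request syntax is identical for both modes. Unless there is a specific requirement, query time is sufficient, and indexing is faster than with index time because no term vectors have to be stored.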
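
The difference between the two field definitions comes down to whether `term_vector` is present in the mapping. The following sketch builds both definitions in Python; the helper `text_field` is hypothetical, and the mapping is reduced to the keys relevant here (the `store` setting from the example above is left out).

```python
import json

# Values the `term_vector` mapping parameter accepts.
TERM_VECTOR_OPTIONS = {
    "no", "yes", "with_positions", "with_offsets",
    "with_positions_offsets", "with_positions_payloads",
    "with_positions_offsets_payloads",
}


def text_field(analyzer, term_vector=None):
    """Build a text field mapping; term_vector=None leaves it to query time."""
    field = {"type": "text", "analyzer": analyzer}
    if term_vector is not None:
        if term_vector not in TERM_VECTOR_OPTIONS:
            raise ValueError(f"unsupported term_vector value: {term_vector!r}")
        field["term_vector"] = term_vector
    return field


properties = {
    # index time: term vectors are stored while documents are indexed
    "my_text": text_field("fulltext_analyzer", "with_positions_offsets_payloads"),
    # query time: nothing stored, term vectors are computed per request
    "fullname": text_field("fulltext_analyzer"),
}
print(json.dumps({"mappings": {"properties": properties}}, indent=2))
```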

Test data:

POST /index_name/_doc/1
{
  "fullname" : "Kerwin Kim",
  "my_text" : "hello test test test "
}

PUT /index_name/_doc/2
{
  "fullname" : "Kerwin Kim",
  "my_text" : "other hello test ..."
}

3. Exploring the data

3.1 Basic exploration

Use the termvectors API to inspect the term vector statistics of a single document.

GET /index_name/_termvectors/1
{
  "fields" : ["my_text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Result:

{
  "_index" : "index_name",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "my_text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "hello" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 5,
              "payload" : "d29yZA=="
            }
          ]
        },
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 6,
              "end_offset" : 10,
              "payload" : "d29yZA=="
            },
            {
              "position" : 2,
              "start_offset" : 11,
              "end_offset" : 15,
              "payload" : "d29yZA=="
            },
            {
              "position" : 3,
              "start_offset" : 16,
              "end_offset" : 20,
              "payload" : "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
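
The statistics in the response above can be reproduced by hand from the two test documents. The sketch below does this in Python, assuming that the analyzer behaves like lowercasing plus splitting on whitespace; it also decodes the payload value, which is base64.

```python
import base64
from collections import Counter

# my_text values of documents 1 and 2 from the test data
docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}


def analyze(text):
    # Assumption: whitespace tokenizer + lowercase filter
    return text.lower().split()


ttf = Counter()       # total term frequency across all documents
doc_freq = Counter()  # number of documents containing each term
for text in docs.values():
    tokens = analyze(text)
    ttf.update(tokens)
    doc_freq.update(set(tokens))

field_statistics = {
    "sum_doc_freq": sum(doc_freq.values()),
    "doc_count": sum(1 for text in docs.values() if analyze(text)),
    "sum_ttf": sum(ttf.values()),
}
print(field_statistics)  # {'sum_doc_freq': 6, 'doc_count': 2, 'sum_ttf': 8}
print(doc_freq["test"], ttf["test"])  # 2 4

payload = base64.b64decode("d29yZA==")
print(payload)  # b'word'
```

The decoded payload is the token type, which is what the type_as_payload filter in fulltext_analyzer stores for each token.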

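3.2 Term vectors for specified terms

In real projects, looking at the term vectors of a single indexed document is too narrow; usually we want index-wide statistics for a handful of terms.
To do so, omit the document ID and supply text containing those terms under "doc". Elasticsearch analyzes this artificial document on the fly, so term_freq, positions, and offsets describe the supplied text, while doc_freq, ttf, and the field statistics come from the index.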

GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello test"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Result:

{
  "_index" : "index_name",
  "_type" : "_doc",
  "_version" : 0,
  "found" : true,
  "took" : 8,
  "term_vectors" : {
    "fullname" : {
      "field_statistics" : {
        "sum_doc_freq" : 4,
        "doc_count" : 2,
        "sum_ttf" : 4
      },
      "terms" : {
        "kerwin" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 6
            }
          ]
        },
        "kim" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 7,
              "end_offset" : 10
            }
          ]
        }
      }
    },
    "my_text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "hello" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 5
            }
          ]
        },
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 6,
              "end_offset" : 10
            }
          ]
        }
      }
    }
  }
}
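
The response above mixes two sources: per-document values from the text supplied in "doc" and index-wide values from the stored documents. A minimal Python sketch of that combination for my_text, again assuming lowercasing plus whitespace splitting as the analyzer:

```python
from collections import Counter

indexed = ["hello test test test ", "other hello test ..."]  # my_text of docs 1 and 2
artificial = "hello test"  # my_text supplied under "doc" in the request


def analyze(text):
    # Assumption: whitespace tokenizer + lowercase filter
    return text.lower().split()


index_ttf = Counter()
index_doc_freq = Counter()
for text in indexed:
    tokens = analyze(text)
    index_ttf.update(tokens)
    index_doc_freq.update(set(tokens))

# term_freq comes from the artificial document, the rest from the index
local_freq = Counter(analyze(artificial))
result = {
    term: {
        "doc_freq": index_doc_freq[term],
        "ttf": index_ttf[term],
        "term_freq": freq,
    }
    for term, freq in local_freq.items()
}
print(result["test"])  # {'doc_freq': 2, 'ttf': 4, 'term_freq': 1}
```

This matches the entry for test in the response: term_freq is 1 because the supplied text contains the term once, while ttf is 4 because the indexed documents contain it four times.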

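3.3 Term vectors with a specified analyzer

If the text in doc should not be analyzed with the analyzer configured when the index was created, use per_field_analyzer to specify an analyzer for each field in doc.

In the request below, my_text is analyzed with the "english" analyzer instead of the "fulltext_analyzer" configured in the mapping. (english applies stemming, so testing becomes test.)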

GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello testing"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer": {
    "my_text": "english",
    "fullname": "standard"
  }
}
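
The override logic can be pictured as a per-field lookup that falls back to the analyzer from the mapping. In the sketch below every name is hypothetical, and `toy_english` is a crude stand-in that only strips a trailing "ing"; the real english analyzer uses a proper stemmer and also removes stop words.

```python
import re


def whitespace_lowercase(text):
    # Stand-in for fulltext_analyzer: whitespace tokenizer + lowercase
    return text.lower().split()


def toy_english(text):
    # Toy stemmer, NOT the real english analyzer: strips a trailing "ing"
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]


# Analyzers configured in the mapping
MAPPING_ANALYZERS = {"my_text": whitespace_lowercase, "fullname": whitespace_lowercase}


def analyze_doc(doc, per_field_analyzer=None):
    """Analyze each field, preferring the per-request override if present."""
    overrides = per_field_analyzer or {}
    return {
        field: overrides.get(field, MAPPING_ANALYZERS[field])(text)
        for field, text in doc.items()
    }


doc = {"fullname": "Kerwin Kim", "my_text": "hello testing"}
print(analyze_doc(doc)["my_text"])                            # ['hello', 'testing']
print(analyze_doc(doc, {"my_text": toy_english})["my_text"])  # ['hello', 'test']
```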

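3.4 term vector filter

The result of an exploration often contains terms we are not interested in, and Elasticsearch can filter them out.
The filter accepts the following parameters, among others:

max_num_terms: the maximum number of terms returned per field.
min_term_freq: the minimum number of times a term must occur in the field.
min_doc_freq: the minimum number of documents a term must occur in.
max_doc_freq: the maximum number of documents a term may occur in.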

GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello test"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer": {
    "my_text": "english"
  }, 
  "filter": {
    "max_doc_freq": 1,
    "min_term_freq": 2,
    "max_num_terms": 3
  }
}
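
To see how these thresholds interact, here is a small Python sketch of the filtering step. The function `filter_terms` and its default values are assumptions made for illustration, and the ranking is simplified to term frequency, whereas Elasticsearch ranks the remaining terms by a tf-idf score.

```python
def filter_terms(terms, min_term_freq=1, min_doc_freq=1,
                 max_doc_freq=None, max_num_terms=25):
    """Keep terms that pass every threshold, then cap the result size."""
    kept = [
        (name, stats)
        for name, stats in terms.items()
        if stats["term_freq"] >= min_term_freq
        and stats["doc_freq"] >= min_doc_freq
        and (max_doc_freq is None or stats["doc_freq"] <= max_doc_freq)
    ]
    # Simplification: order by term_freq instead of a tf-idf score
    kept.sort(key=lambda item: item[1]["term_freq"], reverse=True)
    return dict(kept[:max_num_terms])


# Statistics of "hello test" against the two indexed documents
terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 1, "doc_freq": 2},
}
filtered = filter_terms(terms, min_term_freq=2, max_doc_freq=1, max_num_terms=3)
print(filtered)  # {}
```

Under these semantics the request above would return no terms for my_text: each term occurs only once in the supplied text (below min_term_freq) and in two documents (above max_doc_freq).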

3.5 multi term vector

Inspects several documents in a single request; it can be seen as an extension of section 3.1.

GET _mtermvectors
{
  "docs": [
    {
      "_index": "index_name",
      "_id": 1,
      "term_statistics": true,
      "offsets": false
    },
    {
      "_index": "index_name",
      "_id": 2,
      "fields": [
        "my_text"  
      ],
      "offsets": true
    }
  ]
}

Result:

{
  "docs" : [
    {
      "_index" : "index_name",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "my_text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "hello" : {
              "doc_freq" : 2,
              "ttf" : 2,
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "payload" : "d29yZA=="
                }
              ]
            },
            "test" : {
              "doc_freq" : 2,
              "ttf" : 4,
              "term_freq" : 3,
              "tokens" : [
                {
                  "position" : 1,
                  "payload" : "d29yZA=="
                },
                {
                  "position" : 2,
                  "payload" : "d29yZA=="
                },
                {
                  "position" : 3,
                  "payload" : "d29yZA=="
                }
              ]
            }
          }
        }
      }
    },
    {
      "_index" : "index_name",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "found" : true,
      "took" : 0,
      "term_vectors" : {
        "my_text" : {
          "field_statistics" : {
            "sum_doc_freq" : 6,
            "doc_count" : 2,
            "sum_ttf" : 8
          },
          "terms" : {
            "..." : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 3,
                  "start_offset" : 17,
                  "end_offset" : 20,
                  "payload" : "d29yZA=="
                }
              ]
            },
            "hello" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 1,
                  "start_offset" : 6,
                  "end_offset" : 11,
                  "payload" : "d29yZA=="
                }
              ]
            },
            "other" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 0,
                  "start_offset" : 0,
                  "end_offset" : 5,
                  "payload" : "d29yZA=="
                }
              ]
            },
            "test" : {
              "term_freq" : 1,
              "tokens" : [
                {
                  "position" : 2,
                  "start_offset" : 12,
                  "end_offset" : 16,
                  "payload" : "d29yZA=="
                }
              ]
            }
          }
        }
      }
    }
  ]
}
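
When the list of documents is assembled programmatically, the request body can be generated instead of written by hand. A minimal sketch, assuming a hypothetical helper `mtermvectors_body` that takes per-document options keyed by document ID:

```python
import json


def mtermvectors_body(index, requests):
    """Build the body of a _mtermvectors request.

    `requests` maps a document ID to the options for that document,
    for example {"term_statistics": True, "offsets": False}.
    """
    docs = []
    for doc_id, options in requests.items():
        entry = {"_index": index, "_id": str(doc_id)}
        entry.update(options)  # per-document flags such as fields or offsets
        docs.append(entry)
    return {"docs": docs}


body = mtermvectors_body("index_name", {
    1: {"term_statistics": True, "offsets": False},
    2: {"fields": ["my_text"], "offsets": True},
})
print(json.dumps(body, indent=2))
```

Because the flags are set per document, the two entries in the response differ: document 1 carries doc_freq and ttf but no offsets, while document 2 carries offsets but no term statistics.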