进阶-第72__elasticsearch高手进阶_基于term vector深入探查数据的情况-CSDN博客

本文链接：https://blog.csdn.net/qq_35524586/article/details/88669949

课程大纲

1、term vector介绍

获取document中的某个field内的各个term的统计信息

term information: term frequency in the field, term positions, start and end offsets, term payloads

term statistics: 设置term_statistics=true; total term frequency, 一个term在所有document中出现的频率; document frequency，有多少document包含这个term

field statistics: document count，有多少document包含这个field; sum of document frequency，一个field中所有term的df之和; sum of total term frequency，一个field中的所有term的tf之和

GET /twitter/tweet/1/_termvectors

GET /twitter/tweet/1/_termvectors?fields=text

term statistics和field statistics并不精准，不会被考虑有的doc可能被删除了

我告诉大家，其实很少用，用的时候，一般来说，就是你需要对一些数据做探查的时候。比如说，你想要看到某个term，某个词条，大话西游，这个词条，在多少个document中出现了。或者说某个field，film_desc，电影的说明信息，有多少个doc包含了这个说明信息。

2、index-time term vector实验

term vector，涉及了很多的term和field相关的统计信息，有两种方式可以采集到这个统计信息

（1）index-time，你在mapping里配置一下，然后建立索引的时候，就直接给你生成这些term和field的统计信息了

（2）query-time，你之前没有生成过任何的Term vector信息，然后在查看term vector的时候，直接就可以看到了，会on the fly，现场计算出各种统计信息，然后返回给你

这一讲，不会手敲任何命令，直接copy我做好的命令，因为这一讲的重点，不是掌握什么搜索或者聚合的语法，而是说，掌握，如何采集term vector信息，然后如何看懂term vector信息，你能掌握利用term vector进行数据探查

创建mappings

PUT /my_index

{

"mappings": {

"my_type": {

"properties": {

"text": {

"type": "text",

"term_vector": "with_positions_offsets_payloads",

"store" : true,

"analyzer" : "fulltext_analyzer"//使用自定义的分词器

"fullname": {

"type": "text",

"analyzer" : "fulltext_analyzer"

}

"settings" : {

"index" : {

"number_of_shards" : 1,

"number_of_replicas" : 0

"analysis": {

"analyzer": {

"fulltext_analyzer": {//自定义的分词器

"type": "custom",

"tokenizer": "whitespace",

"filter": [

"lowercase",

"type_as_payload"

]

}

结果：

{

"acknowledged": true,

"shards_acknowledged": true

}

插入测试数据

PUT /my_index/my_type/1

{

"fullname" : "Leo Li",

"text" : "hello test test test "

}

PUT /my_index/my_type/2

{

"fullname" : "Leo Li",

"text" : "other hello test ..."

}

结果：

{

"_index": "my_index",

"_type": "my_type",

"_id": "2",

"_version": 1,

"result": "created",

"_shards": {

"total": 1,

"successful": 1,

"failed": 0

"created": true

}

深入探查测试

GET /my_index/my_type/1/_termvectors

{

"fields" : ["text"],

"offsets" : true,

"payloads" : true,

"positions" : true,

"term_statistics" : true,

"field_statistics" : true

}

结果

{

"_index": "my_index",

"_type": "my_type",

"_id": "1",

"_version": 1,

"found": true,

"took": 10,

"term_vectors": {

"text": {

"field_statistics": {

"sum_doc_freq": 6, //所有的doc 中的doc_freq加起来

"doc_count": 2,//意思是：test 这个field 存在多少doc中

"sum_ttf": 8//所有的doc 中的ttf加起来

"terms": {

"hello": {

"doc_freq": 2,//有两个doc 包含这个term

"ttf": 2,//hello 这个词在所有doc中出现了几次

"term_freq": 1,//这个hello在当前这个doc中出现了几次

"tokens": [//一个hello 在一个field 中可能出现多次，每出现一次就是一个token

{

"position": 0,//位于第0个词

"start_offset": 0,//开始位置

"end_offset": 5//结束位置,

"payload": "d29yZA=="//这是一个编码

}

]

"test": {

"doc_freq": 2, //有两个doc 包含这个term

"ttf": 4, //test 这个词在所有doc中出现了4次

"term_freq": 3, //这个hello在当前这个doc中出现了3次

"tokens": [[//一个test 在一个field 中可能出现多次，每出现一次就是一个token

{

"position": 1,

"start_offset": 6,

"end_offset": 10,

"payload": "d29yZA=="

{

"position": 2,

"start_offset": 11,

"end_offset": 15,

"payload": "d29yZA=="

{

"position": 3,

"start_offset": 16,

"end_offset": 20,

"payload": "d29yZA=="

}

]

}

3、query-time term vector实验

GET /my_index/my_type/1/_termvectors

{

"fields" : ["fullname"],

"offsets" : true,

"positions" : true,

"term_statistics" : true,

"field_statistics" : true

}

结果：

{

"_index": "my_index",

"_type": "my_type",

"_id": "1",

"_version": 1,

"found": true,

"took": 13,

"term_vectors": {

"fullname": {

"field_statistics": {

"sum_doc_freq": 4,

"doc_count": 2,

"sum_ttf": 4

"terms": {

"leo": {

"doc_freq": 2,

"ttf": 2,

"term_freq": 1,

"tokens": [

{

"position": 0,

"start_offset": 0,

"end_offset": 3

}

]

"li": {

"doc_freq": 2,

"ttf": 2,

"term_freq": 1,

"tokens": [

{

"position": 1,

"start_offset": 4,

"end_offset": 6

}

]

}

一般来说，如果条件允许，你就用query time的term vector就可以了，你要探查什么数据，现场去探查一下就好了

4、手动指定doc的term vector

GET /my_index/my_type/_termvectors

{

"doc" : {

"fullname" : "Leo Li",

"text" : "hello test test test"

"fields" : ["text"],

"offsets" : true,

"payloads" : true,

"positions" : true,

"term_statistics" : true,

"field_statistics" : true

}

结果：

{

"_index": "my_index",

"_type": "my_type",

"_version": 0,

"found": true,

"took": 1,

"term_vectors": {

"text": {

"field_statistics": {

"sum_doc_freq": 6,

"doc_count": 2,

"sum_ttf": 8

"terms": {

"hello": {

"doc_freq": 2,

"ttf": 2,

"term_freq": 1,

"tokens": [

{

"position": 0,

"start_offset": 0,

"end_offset": 5

}

]

"test": {

"doc_freq": 2,

"ttf": 4,

"term_freq": 3,

"tokens": [

{

"position": 1,

"start_offset": 6,

"end_offset": 10

{

"position": 2,

"start_offset": 11,

"end_offset": 15

{

"position": 3,

"start_offset": 16,

"end_offset": 20

}

]

}

手动指定一个doc，实际上不是要指定doc，而是要指定你想要安插的词条，hello test，那么就可以放在一个field中

将这些term分词，然后对每个term，都去计算它在现有的所有doc中的一些统计信息

这个挺有用的，可以让你手动指定要探查的term的数据情况，你就可以指定探查“大话西游”这个词条的统计信息

5、手动指定analyzer来生成term vector

GET /my_index/my_type/_termvectors

{

"doc" : {

"fullname" : "Leo Li",

"text" : "hello test test test"

"fields" : ["text"],

"offsets" : true,

"payloads" : true,

"positions" : true,

"term_statistics" : true,

"field_statistics" : true,

"per_field_analyzer" : {

"text": "standard"

}

结果：

{

"_index": "my_index",

"_type": "my_type",

"_version": 0,

"found": true,

"took": 0,

"term_vectors": {

"text": {

"field_statistics": {

"sum_doc_freq": 6,

"doc_count": 2,

"sum_ttf": 8

"terms": {

"hello": {

"doc_freq": 2,

"ttf": 2,

"term_freq": 1,

"tokens": [

{

"position": 0,

"start_offset": 0,

"end_offset": 5

}

]

"test": {

"doc_freq": 2,

"ttf": 4,

"term_freq": 3,

"tokens": [

{

"position": 1,

"start_offset": 6,

"end_offset": 10

{

"position": 2,

"start_offset": 11,

"end_offset": 15

{

"position": 3,

"start_offset": 16,

"end_offset": 20

}

]

}

6、terms filter

GET /my_index/my_type/_termvectors

{

"doc" : {

"fullname" : "Leo Li",

"text" : "hello test test test"

"fields" : ["text"],

"offsets" : true,

"payloads" : true,

"positions" : true,

"term_statistics" : true,

"field_statistics" : true,

"filter" : {

"max_num_terms" : 3,

"min_term_freq" : 1,

"min_doc_freq" : 1

}

结果：

{

"_index": "my_index",

"_type": "my_type",

"_version": 0,

"found": true,

"took": 1,

"term_vectors": {

"text": {

"field_statistics": {

"sum_doc_freq": 6,

"doc_count": 2,

"sum_ttf": 8

"terms": {

"hello": {

"doc_freq": 2,

"ttf": 2,

"term_freq": 1,

"tokens": [

{

"position": 0,

"start_offset": 0,

"end_offset": 5

}

"score": 1

"test": {

"doc_freq": 2,

"ttf": 4,

"term_freq": 3,

"tokens": [

{

"position": 1,

"start_offset": 6,

"end_offset": 10

{

"position": 2,

"start_offset": 11,

"end_offset": 15

{

"position": 3,

"start_offset": 16,

"end_offset": 20

}

"score": 3

}

这个就是说，根据term统计信息，过滤出你想要看到的term vector统计结果

也挺有用的，比如你探查数据把，可以过滤掉一些出现频率过低的term，就不考虑了

7、multi term vector

GET _mtermvectors

{

"docs": [

{

"_index": "my_index",

"_type": "my_type",

"_id": "2",

"term_statistics": true

{

"_index": "my_index",

"_type": "my_type",

"_id": "1",

"fields": [

"text"

]

}

]

}

GET /my_index/_mtermvectors

{

"docs": [

{

"_type": "test",

"_id": "2",

"fields": [

"text"

"term_statistics": true

{

"_type": "test",

"_id": "1"

}

]

}

GET /my_index/my_type/_mtermvectors

{

"docs": [

{

"_id": "2",

"fields": [

"text"

"term_statistics": true

{

"_id": "1"

}

]

}

GET /_mtermvectors

{

"docs": [

{

"_index": "my_index",

"_type": "my_type",

"doc" : {

"fullname" : "Leo Li",

"text" : "hello test test test"

}

{

"_index": "my_index",

"_type": "my_type",

"doc" : {

"fullname" : "Leo Li",

"text" : "other hello test ..."

}

]

}