Elasticsearch 2.2.0 Document APIs: Term Vectors (Term Frequency)

weixin_33721427

Original link: https://my.oschina.net/secisland/blog/614627

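A term vector is a Lucene concept: for a text field of a document, such as title or body, it builds a multi-dimensional vector space of term frequencies. Each term is one dimension, and the value along that dimension is the term's frequency in the field. In Elasticsearch, the termvectors API returns information and statistics about the terms in the fields of a particular document in an index. Term vectors are real-time by default; to turn this off, set the realtime parameter to false. Storing term vectors is disabled by default and has to be enabled explicitly in the mapping when the index is created.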
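To make the "one term, one dimension" idea concrete, here is a minimal Python sketch that builds such a frequency vector for a single field. It assumes terms are produced by splitting on whitespace and lowercasing, which is a simplification and not Lucene's actual implementation; the function name term_frequency_vector is made up for illustration.

```python
from collections import Counter


def term_frequency_vector(field_text):
    """Return {term: frequency} for one field of one document.

    Assumption: terms come from a whitespace split plus lowercasing,
    a simplified stand-in for what a Lucene analyzer does.
    """
    terms = field_text.lower().split()
    return dict(Counter(terms))


if __name__ == "__main__":
    # Each key is one dimension, each value the frequency along it.
    print(term_frequency_vector("secilog test test test "))
```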

Note: in Elasticsearch 2.0 and later, use _termvectors instead of _termvector.

Below we create an index with term vectors enabled.

Request: PUT http://localhost:9200/secilog/

Body:

{
  "mappings": {
    "log": {
      "properties": {
        "type": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "message": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
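For readers who prefer to script this step, the following is a rough sketch of how the same request might be prepared with Python's standard library. BASE_URL, INDEX_BODY and the two helper functions are illustrative names that do not appear in the original article, and a local node on port 9200 is assumed. The request is only built here; sending it requires a running Elasticsearch node.

```python
import json
import urllib.request

# Assumption: a local single-node cluster, as in the article.
BASE_URL = "http://localhost:9200"

# Settings shared by both fields; the "type" field additionally stores its value.
TEXT_WITH_TERM_VECTORS = {
    "type": "string",
    "term_vector": "with_positions_offsets_payloads",
    "analyzer": "fulltext_analyzer",
}

INDEX_BODY = {
    "mappings": {
        "log": {
            "properties": {
                "type": dict(TEXT_WITH_TERM_VECTORS, store=True),
                "message": dict(TEXT_WITH_TERM_VECTORS),
            }
        }
    },
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "type_as_payload"],
                }
            }
        },
    },
}


def build_create_index_request(index_name, body=INDEX_BODY):
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        url="{}/{}/".format(BASE_URL, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running, send(build_create_index_request("secilog")) would issue the request shown above.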

Then we index two documents:

Request: PUT http://localhost:9200/secilog/log/1/?pretty

Body:

{
  "type" : "syslog",
  "message" : "secilog test test test "
}

Request: PUT http://localhost:9200/secilog/log/2/?pretty

Body:

{
  "type" : "file",
  "message" : "Another secilog test "
}

Once both log entries have been created, we query the statistics with _termvectors.

Request: GET http://localhost:9200/secilog/log/1/_termvectors?pretty=true

The response is as follows:

{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 2,
        "sum_ttf" : 7
      },
      "terms" : {
        "secilog" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7,
            "payload" : "d29yZA=="
          } ]
        },
        "test" : {
          "term_freq" : 3,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12,
            "payload" : "d29yZA=="
          }, {
            "position" : 2,
            "start_offset" : 13,
            "end_offset" : 17,
            "payload" : "d29yZA=="
          }, {
            "position" : 3,
            "start_offset" : 18,
            "end_offset" : 22,
            "payload" : "d29yZA=="
          } ]
        }
      }
    },
    "type" : {
      "field_statistics" : {
        "sum_doc_freq" : 2,
        "doc_count" : 2,
        "sum_ttf" : 2
      },
      "terms" : {
        "syslog" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 6,
            "payload" : "d29yZA=="
          } ]
        }
      }
    }
  }
}
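The position, start_offset, end_offset and payload values in a response like this can be reproduced by hand. The sketch below is a simplified stand-in for the whitespace tokenizer plus lowercase filter, not the real analyzer, and both function names are hypothetical. The payload arrives base64-encoded; decoding it yields the text "word", which appears to be the token type that the type_as_payload filter stored.

```python
import base64
import re


def whitespace_tokens(text):
    """Yield one dict per token, mimicking a whitespace tokenizer.

    Positions count tokens from 0. Offsets are character indexes into
    the original text, and end_offset is exclusive.
    """
    for position, match in enumerate(re.finditer(r"\S+", text)):
        yield {
            "term": match.group().lower(),  # mirrors the lowercase filter
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        }


def decode_payload(payload_b64):
    """Decode a base64 payload string from the response into text."""
    return base64.b64decode(payload_b64).decode("utf-8")


if __name__ == "__main__":
    for token in whitespace_tokens("secilog test test test "):
        print(token)
    print(decode_payload("d29yZA=="))
```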

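The _termvectors response shows, for each field, how many times each term occurs and where. Note that the term and field statistics are not exact: deleted documents are not taken into account, and the statistics are gathered only from the shard that holds the requested document unless dfs is set to true. The numbers are therefore useful as a relative measure of term frequency rather than as absolute values. By default, when term vectors are requested for an artificial document (one supplied in the request rather than stored in the index), the shard used for the statistics is chosen at random; use routing to hit a particular shard. A term vectors request can also carry parameters, for example: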

Request: POST http://localhost:9200/secilog/log/1/_termvectors?pretty=true

Body:

{
  "fields" : ["message"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
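As a sketch of how this parameterized request could be assembled in Python, the helper below builds the URL and JSON body using only the standard library. The name build_termvectors_request and the BASE_URL constant are illustrative assumptions, and the request is only constructed, not sent. Extra keyword arguments become URL query parameters, which is one way to pass options such as realtime or routing.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the article


def build_termvectors_request(index, doc_type, doc_id, fields, **query_params):
    """Build (but do not send) a POST .../_termvectors request.

    query_params are appended to the URL in sorted order,
    for example realtime="false" or pretty="true".
    """
    body = {
        "fields": fields,
        "offsets": True,
        "payloads": True,
        "positions": True,
        "term_statistics": True,
        "field_statistics": True,
    }
    url = "{}/{}/{}/{}/_termvectors".format(BASE_URL, index, doc_type, doc_id)
    if query_params:
        url += "?" + urllib.parse.urlencode(sorted(query_params.items()))
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it to a running node.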

Response:

{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 2,
        "sum_ttf" : 7
      },
      "terms" : {
        "secilog" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7,
            "payload" : "d29yZA=="
          } ]
        },
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12,
            "payload" : "d29yZA=="
          }, {
            "position" : 2,
            "start_offset" : 13,
            "end_offset" : 17,
            "payload" : "d29yZA=="
          }, {
            "position" : 3,
            "start_offset" : 18,
            "end_offset" : 22,
            "payload" : "d29yZA=="
          } ]
        }
      }
    }
  }
}

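As the response shows, the request filtered the statistics: only the message field was returned, and term_statistics added doc_freq and ttf for each term.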
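The doc_freq, ttf and field_statistics values for the message field can be checked by hand from the two indexed documents. The following sketch does that under stated assumptions: a single shard, no deleted documents, and a whitespace split plus lowercasing standing in for fulltext_analyzer. The function names are hypothetical and only loosely mirror the field names in the response.

```python
from collections import Counter

# The "message" field of the two documents indexed earlier.
MESSAGES = {
    "1": "secilog test test test ",
    "2": "Another secilog test ",
}


def analyze(text):
    """Assumed stand-in for fulltext_analyzer: whitespace split + lowercase."""
    return text.lower().split()


def field_statistics(docs):
    """Compute doc_count, sum_doc_freq and sum_ttf for one field."""
    per_doc = [Counter(analyze(text)) for text in docs.values()]
    return {
        # documents that contain at least one term in this field
        "doc_count": sum(1 for counts in per_doc if counts),
        # distinct terms per document, summed over all documents
        "sum_doc_freq": sum(len(counts) for counts in per_doc),
        # all term occurrences, summed over all documents
        "sum_ttf": sum(sum(counts.values()) for counts in per_doc),
    }


def term_statistics(docs, doc_id, term):
    """Compute term_freq (in one document), doc_freq and ttf (across all)."""
    per_doc = {key: Counter(analyze(text)) for key, text in docs.items()}
    return {
        "term_freq": per_doc[doc_id][term],
        "doc_freq": sum(1 for counts in per_doc.values() if term in counts),
        "ttf": sum(counts[term] for counts in per_doc.values()),
    }


if __name__ == "__main__":
    print(field_statistics(MESSAGES))
    print(term_statistics(MESSAGES, "1", "test"))
```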

Note that enabling term vectors adds overhead to the system, so turn them on only when they are really needed.

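Secisland (赛克蓝德) will continue to analyze the features of the latest Elasticsearch releases step by step, so stay tuned. You are also welcome to follow the secisland official account.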
