Elasticsearch from 0 to 1: A Problem Log

三棵树杨

Article link: https://blog.csdn.net/yang52017/article/details/88555378

Problem 1: How to delete an index whose name contains special characters

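The cluster contained an index named %{log_type}-2019-01-21. The name was produced by the Logstash configuration setting index => "%{log_type}-%{+YYYY-MM-dd}". Some log events have no log_type field, so Logstash emitted the placeholder literally. Our index cleanup policy had no special handling for such a name, so the index was never deleted.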

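DELETE %{log_type}-2019-01-21 failed because no matching index was found, while DELETE %%%7Blog_type%7D-2019-01-21 succeeded. Here % is the escape character, 7B is the hexadecimal ASCII code of {, and 7D is the hexadecimal ASCII code of }. In standard percent-encoding a literal % is itself written as %25.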
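As a cross-check on the escaping described above, here is a minimal Python sketch that percent-encodes an index name with the standard library. The helper name encode_index_name is hypothetical. Note that the standard encoding it produces is not character-for-character the same as the request the author reports as successful, so treat this as one way to derive a URL-safe name rather than a record of what was actually run.

```python
from urllib.parse import quote


def encode_index_name(name: str) -> str:
    """Percent-encode every reserved character so the name is safe in a URL path."""
    # safe="" tells quote() to encode "/" as well (it is left untouched by default).
    return quote(name, safe="")


# The index name from the text; "%" -> %25, "{" -> %7B, "}" -> %7D.
encoded = encode_index_name("%{log_type}-2019-01-21")
print("DELETE " + encoded)
```

The encoded name can then be placed after DELETE in the console or in the path of an HTTP DELETE request.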

Problem 2: Aggregate on a keyword field and return one full document from each bucket

# The top_hits sub-aggregation brings back one full document per bucket.
POST <index_name>/log/_search
{
  "size": 0,
  "aggs": {
    "request_count": {
      "terms": {
        "field": "request.keyword",
        "size": 10,
        "order": [{"_count": "desc"}]
      },
      "aggs": {
        "userdetail": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

Corresponding response
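To show how such a response is typically read, here is a small Python sketch that pulls the single top hit out of every bucket. The aggregation names request_count and userdetail come from the request above; the sample response is a hand-written, hypothetical fragment (including the client_ip field), not output from a real cluster.

```python
from typing import Any, Dict


def first_hit_per_bucket(response: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each terms-bucket key to the _source of its single top hit."""
    buckets = response["aggregations"]["request_count"]["buckets"]
    result = {}
    for bucket in buckets:
        # top_hits nests its results under <agg name>.hits.hits
        hits = bucket["userdetail"]["hits"]["hits"]
        if hits:
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Hypothetical response fragment, trimmed to the fields the function reads.
sample_response = {
    "aggregations": {
        "request_count": {
            "buckets": [
                {
                    "key": "/get/user1",
                    "doc_count": 3,
                    "userdetail": {
                        "hits": {
                            "hits": [
                                {"_source": {"request": "/get/user1", "client_ip": "10.0.0.1"}}
                            ]
                        }
                    },
                }
            ]
        }
    }
}

details = first_hit_per_bucket(sample_response)
print(details)
```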

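One more thing that has long puzzled me: Elasticsearch does not seem to offer a direct equivalent of MySQL's GROUP BY (column1, column2). The closest I found is nesting one bucket aggregation inside another, which reads like a chain of qualifiers, for example the number of requests for a given interface under a given domain.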
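One way to get GROUP BY-like rows out of nested buckets is to flatten them on the client. The sketch below assumes a hypothetical outer terms aggregation on domain.keyword with an inner terms aggregation named by_request; the field name domain.keyword, the aggregation names, and the sample data are all illustrative.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical request body: outer buckets per domain, inner buckets per request.
nested_request = {
    "size": 0,
    "aggs": {
        "by_domain": {
            "terms": {"field": "domain.keyword"},
            "aggs": {"by_request": {"terms": {"field": "request.keyword"}}},
        }
    },
}


def flatten_nested_terms(outer_agg: Dict[str, Any], inner_name: str) -> List[Tuple[str, str, int]]:
    """Turn two-level terms buckets into (outer_key, inner_key, count) rows."""
    rows = []
    for outer in outer_agg["buckets"]:
        for inner in outer[inner_name]["buckets"]:
            rows.append((outer["key"], inner["key"], inner["doc_count"]))
    return rows


# Hand-written sample of the "by_domain" part of a response.
sample_agg = {
    "buckets": [
        {
            "key": "a.example.com",
            "doc_count": 5,
            "by_request": {
                "buckets": [
                    {"key": "/get/user1", "doc_count": 3},
                    {"key": "/get/user2", "doc_count": 2},
                ]
            },
        }
    ]
}

for row in flatten_nested_terms(sample_agg, "by_request"):
    print(row)
```

Elasticsearch also provides a composite aggregation that buckets on combinations of several sources; whether it is available and suitable depends on the version in use.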

Problem 3: Custom analysis rules

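One business requirement was to count requests per interface and report the 10 most frequently requested interfaces from the logs. The field that carries the interface looks like "POST /get/user1?sss=ss HTTP1.1", but we only need the URI, /get/user1, and do not care about the HTTP method, the protocol, or the query parameters. Mapping the field as keyword, which indexes the whole string without analysis, therefore does not fit this scenario.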

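Some background: a field can be made available to a terms aggregation in two ways. The first is to map the field as keyword. The second is to map it as text with fielddata enabled when the index is created:

"<field_name>": {
  "type": "text",
  "fielddata": true
}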

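The first approach does not fit our scenario, so we chose the second. We also do not need POST and HTTP1.1 from the field, so no terms should be created for them. Elasticsearch's default analyzer simply splits text into words, so a custom analyzer is required. The steps are as follows: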

Step 1: Create the index

# std_folded is the custom analyzer, my_char_filter holds the matching rule,
# and the my_white field references the custom analyzer.
PUT ay_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "[A-Z]*\\s|[?].*",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_white": {
          "type": "text",
          "analyzer": "std_folded",
          "fielddata": true
        },
        "my_text": {
          "type": "text"
        }
      }
    }
  }
}
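To see what the pattern in my_char_filter does to a request line, the following Python sketch approximates the analyzer: it applies the same regular expression and then splits on whitespace. This is only an approximation, since Elasticsearch uses Java regular expressions and its own whitespace tokenizer, and the function name std_folded_tokens is hypothetical.

```python
import re

# Same pattern as my_char_filter: uppercase letters followed by whitespace,
# or a "?" and everything after it.
CHAR_FILTER = re.compile(r"[A-Z]*\s|[?].*")


def std_folded_tokens(text: str) -> list:
    """Approximate the analyzer: pattern_replace char filter, then whitespace split."""
    filtered = CHAR_FILTER.sub(" ", text)
    return filtered.split()


print(std_folded_tokens("POST /get/user1?sss=ss HTTP1.1"))
# A request line without a query string keeps the protocol token,
# because nothing in the pattern matches "HTTP1.1" on its own.
print(std_folded_tokens("GET /get/user1 HTTP1.1"))
```

In this approximation the protocol is removed only because it follows the "?", so request lines without query parameters would likely need an extended pattern.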

Step 2: Insert a document for testing

POST ay_index/_doc
{
  "my_text":"this is a test1" ,
  "my_white":"POST /get/user1?sss=ss HTTP1.1"
}

Step 3: Run the aggregation query

Request
POST /ay_index/_doc/_search
{
  "aggs": {
    "request_count": {
      "terms": { 
        "field": "my_white",
        "size":10
      }
    }
  }
}

Response

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "ay_index",
        "_type": "_doc",
        "_id": "-MQEfGkBlcAzC212ObJr",
        "_score": 1,
        "_source": {
          "my_text": "this is a test1",
          "my_white": "POST /get/user1?sss=ss HTTP1.1"
        }
      }
    ]
  },
  "aggregations": {
    "request_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "/get/user1",
          "doc_count": 1
        }
      ]
    }
  }
}

Step 4: Check an ordinary search

Request
POST /ay_index/_doc/_search
{
  "query": {
    "match":{
        "my_white": "/get/user1"
    }
  }
}

Response
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "ay_index",
        "_type": "_doc",
        "_id": "-MQEfGkBlcAzC212ObJr",
        "_score": 0.2876821,
        "_source": {
          "my_text": "this is a test1",
          "my_white": "POST /get/user1?sss=ss HTTP1.1"
        }
      }
    ]
  }
}

Check whether searching the my_white field for POST or HTTP returns any data

POST /ay_index/_doc/_search
{
  "query": {
    "match":{
        "my_white": "POST"
    }
  }
}

Response
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Query for HTTP
POST /ay_index/_doc/_search
{
  "query": {
    "match":{
        "my_white": "HTTP"
    }
  }
}

Response
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

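Neither POST nor HTTP in the my_white field can be searched. The custom analyzer filtered them out when the document was indexed, so no inverted-index terms were created for them.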
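The effect can be illustrated with a toy inverted index in Python. This is a simplified model, not how Elasticsearch is implemented: analyze approximates the custom analyzer with the same regular expression, and match assumes the query text goes through the same analysis and that any matching term is enough. All function names and the document id doc1 are hypothetical.

```python
import re
from collections import defaultdict

CHAR_FILTER = re.compile(r"[A-Z]*\s|[?].*")


def analyze(text):
    """Approximate the custom analyzer: char filter, then whitespace split."""
    return CHAR_FILTER.sub(" ", text).split()


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Return ids of documents containing any term of the analyzed query."""
    found = set()
    for term in analyze(query):
        found |= index.get(term, set())
    return found


index = build_inverted_index({"doc1": "POST /get/user1?sss=ss HTTP1.1"})
print(sorted(index))                 # the only indexed term is the URI
print(match(index, "/get/user1"))    # finds doc1
print(match(index, "POST"))          # empty: no term was created for POST
```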

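In the end we did not adopt this approach. It did not meet our final requirements, and changing the index mapping could affect other teams' analysis of the data in that index. A script-based approach might solve the problem, but I am not yet familiar with that area.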
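For comparison, the same top-10 count can be computed outside Elasticsearch without touching the mapping. The sketch below is a client-side alternative, not an Elasticsearch script and not what the author ended up doing. It assumes the raw request strings have already been fetched into a Python list; extract_uri and top_uris are hypothetical helper names.

```python
import re
from collections import Counter

# Skip the HTTP method, then capture the path up to "?" or whitespace.
REQUEST_LINE = re.compile(r"^\S+\s+([^?\s]+)")


def extract_uri(request_line):
    """Return the URI part of a request line, or None if it does not parse."""
    m = REQUEST_LINE.match(request_line)
    return m.group(1) if m else None


def top_uris(request_lines, n=10):
    """Count requests per URI and return the n most frequent as (uri, count) pairs."""
    counts = Counter(uri for uri in map(extract_uri, request_lines) if uri)
    return counts.most_common(n)


sample_lines = [
    "POST /get/user1?sss=ss HTTP1.1",
    "GET /get/user1 HTTP1.1",
    "GET /get/user2 HTTP1.1",
]
print(top_uris(sample_lines))
```

This trades cluster-side aggregation for client-side work, so it is only practical when the number of fetched log lines is manageable.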

For more information on analysis, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
