Using the Elasticsearch regexp query

yingchenwy

Article link: https://blog.csdn.net/u010483897/article/details/90485332

I recently wanted to use Elasticsearch's regexp query, so I read the official documentation on it:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-regexp-query.html#regexp-syntax

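It describes the regular expression syntax that Elasticsearch accepts and gives a simple regexp query example. It looked straightforward enough, and fairly easy to follow for anyone familiar with Python's re module.

So I tried it in Kibana against an existing index.

I tested with the simplest kind of pattern, a plain literal like "abc", to find items whose name field contains "人字梯" (A-frame ladder). The index definitely contains such items, yet strangely this seemingly valid regexp query returned 0 results.

However, dropping the character "梯" and searching for "人字" returned plenty of results.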
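The query itself is not reproduced in this post. As a rough sketch, assuming it had the same shape as the regexp queries shown later in the article, the two request bodies could be built as follows. The helper name `regexp_query` is made up for illustration; `name` is the field mentioned above.

```python
import json

def regexp_query(field, pattern):
    """Build the body of a regexp query on a single field (illustrative helper)."""
    return {"query": {"regexp": {field: pattern}}}

# Pattern that returned 0 hits on the analyzed field.
no_hits_body = regexp_query("name", "人字梯")
# Pattern that returned many hits.
many_hits_body = regexp_query("name", "人字")

# ensure_ascii=False keeps the Chinese characters readable in the JSON text.
no_hits_json = json.dumps(no_hits_body, ensure_ascii=False, indent=2)
```

The resulting JSON text is what would be pasted under a `GET <index>/_search` line in the Kibana console.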

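Going back to the official documentation and reading the English more carefully, I realized I had skipped the introductory part, which is the key point: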

The regexp query allows you to use regular expression term queries. See Regular expression syntax for details of the supported regular expression language. The "term queries" in that first sentence means that Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.

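In short, the regular expression is matched against each term produced by analyzing the field content, not against the original text. I then checked how the field is defined in the mapping and found that it is tokenized with the ik_max_word analyzer (provided by the IK analysis plugin, not built into Elasticsearch).

Before building the index, I had defined the name_ini field in the mapping so that it is analyzed with ik_max_word automatically. Running a sample value through the analyzer confirms the resulting tokens: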

{
  "tokens": [
    {
      "token": "奥",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "鹏",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "家用",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "家",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "用",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 4
    },
    {
      "token": "四",
      "start_offset": 4,
      "end_offset": 5,
      "type": "TYPE_CNUM",
      "position": 5
    },
    {
      "token": "层",
      "start_offset": 5,
      "end_offset": 6,
      "type": "COUNT",
      "position": 6
    },
    {
      "token": "彩色",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "铝",
      "start_offset": 8,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 8
    },
    {
      "token": "梯",
      "start_offset": 9,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 9
    },
    {
      "token": "ap-2554",
      "start_offset": 11,
      "end_offset": 18,
      "type": "LETTER",
      "position": 10
    },
    {
      "token": "ap",
      "start_offset": 11,
      "end_offset": 13,
      "type": "ENGLISH",
      "position": 11
    },
    {
      "token": "2554",
      "start_offset": 14,
      "end_offset": 18,
      "type": "ARABIC",
      "position": 12
    },
    {
      "token": "梯子",
      "start_offset": 19,
      "end_offset": 21,
      "type": "CN_WORD",
      "position": 13
    },
    {
      "token": "梯",
      "start_offset": 19,
      "end_offset": 20,
      "type": "CN_WORD",
      "position": 14
    },
    {
      "token": "子",
      "start_offset": 20,
      "end_offset": 21,
      "type": "CN_CHAR",
      "position": 15
    },
    {
      "token": "人字",
      "start_offset": 22,
      "end_offset": 24,
      "type": "CN_WORD",
      "position": 16
    },
    {
      "token": "梯",
      "start_offset": 24,
      "end_offset": 25,
      "type": "CN_WORD",
      "position": 17
    }
  ]
}

As the output shows, there is indeed no term "人字梯", only "人字" and "梯", so the regexp query above could not return what I expected.
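The effect can be imitated offline with a minimal sketch. It assumes Python's `re.fullmatch` as a stand-in for Elasticsearch's whole-term matching, which is reasonable only for simple literal patterns like the ones here because the two regex dialects differ. The function name `matching_terms` is hypothetical.

```python
import re

# Tokens taken from the analyzer output above (duplicates removed).
tokens = ["奥", "鹏", "家用", "家", "用", "四", "层", "彩色", "铝", "梯",
          "ap-2554", "ap", "2554", "梯子", "子", "人字"]

def matching_terms(pattern, terms):
    """Return the terms that the pattern matches in full, one term at a time."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

missing = matching_terms("人字梯", tokens)  # [] because no single term equals "人字梯"
found = matching_terms("人字", tokens)      # ["人字"]
```

Because every term is tested separately, a pattern can never match text that the analyzer has split across two terms.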

In short, to apply a regexp to the whole value, the field definition in the mapping has to change. The field can be defined as:


    "name_ini": {
      "type": "string",
      "index":"not_analyzed" 
    }

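That is, the content of the name_ini field is not analyzed, so the whole value is indexed as a single term.

So I will probably have to change the mapping and rebuild the index ~~~~(>_<)~~~~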
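The `string` / `not_analyzed` form belongs to the Elasticsearch 2.x documentation linked at the top. From 5.0 on, the `keyword` field type plays the same role. A small sketch of both variants as Python dicts follows; the function name `exact_value_mapping` and its `major_version` parameter are assumptions for illustration, not part of any client library.

```python
def exact_value_mapping(field, major_version):
    """Return a field mapping that keeps the value as one unanalyzed term."""
    if major_version < 5:
        # Elasticsearch 2.x style, as in the snippet above.
        return {field: {"type": "string", "index": "not_analyzed"}}
    # From 5.0 on, the keyword type stores the value as a single term.
    return {field: {"type": "keyword"}}

legacy = exact_value_mapping("name_ini", 2)
modern = exact_value_mapping("name_ini", 7)
```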

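Testing also showed that the regular expressions Elasticsearch supports behave a little differently from Python's re:

Suppose you want to check whether a string contains something like "2cd5a". In Python you can match it directly with re:

pattern = r"2cd5a"  # with re.search, this matches anywhere in the string

In Elasticsearch, however, writing it as below finds nothing: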

GET ***/_search
{
  "query": {
    "regexp":{
      "model":"2cd5a"
    }
  }, 
  "highlight": {
    "fields": {
      "model": {}
    }
  }
}

It has to be written like this to find the matches:

GET ***/_search
{
  "query": {
    "regexp":{
      "model":"2cd5a[a-z0-9]+"
    }
  }, 
  "highlight": {
    "fields": {
      "model": {}
    }
  }
}

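Thinking it over, this again comes down to the English passage above: the regexp is matched at the term level, against the terms produced by analyzing the field, so a document is a hit only when an entire term matches the regular expression. Something to keep in mind.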
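The difference is the same as the one between `re.search` and `re.fullmatch` in Python. The sketch below shows it, along with one possible way to build a "contains" pattern by wrapping the literal in `.*`. The helpers `escape_lucene` and `contains_query` and the sample term are hypothetical, and the reserved-character set follows the list in the Elasticsearch regexp syntax documentation.

```python
import re

# Characters the Elasticsearch regexp syntax docs list as reserved.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')

def escape_lucene(literal):
    """Backslash-escape reserved characters so the text is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in literal)

def contains_query(field, literal):
    """Build a regexp query whose pattern matches any term containing `literal`."""
    return {"query": {"regexp": {field: ".*" + escape_lucene(literal) + ".*"}}}

term = "2cd5a9x"  # hypothetical indexed term
print(bool(re.search("2cd5a", term)))         # True: search accepts a substring match
print(bool(re.fullmatch("2cd5a", term)))      # False: the whole term must match
print(bool(re.fullmatch(".*2cd5a.*", term)))  # True: the wildcards cover the rest
print(contains_query("model", "2cd5a"))
```

A pattern that begins with `.*` may have to be tested against a large share of the terms in the index, so a long literal prefix such as `2cd5a[a-z0-9]+` is likely the cheaper choice when the position of the text is known.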