Background
Our project uses Elasticsearch (ES) as its search engine. As is well known, the query_string query supports Lucene syntax, so we used it to let users build their own personalized searches. After the project went live, users found that searches containing special characters returned no results, and the special characters could not be highlighted either.
Cause and Solution
Because our index did not specify an analyzer, the default standard analyzer was used. The standard analyzer splits a string on special characters and whitespace and stores the resulting terms individually. So let's see which terms the standard analyzer actually produces for a string that contains special characters:
GET _analyze
{
"analyzer": "standard",
"text": ["A2654|10|09|022"]
}
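For reference, the response looks roughly like the following (a sketch; the standard analyzer also lowercases the terms, and the exact type labels can vary by ES version):
//expected response (sketch)
{
  "tokens" : [
    { "token" : "a2654", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "10", "start_offset" : 6, "end_offset" : 8, "type" : "<NUM>", "position" : 1 },
    { "token" : "09", "start_offset" : 9, "end_offset" : 11, "type" : "<NUM>", "position" : 2 },
    { "token" : "022", "start_offset" : 12, "end_offset" : 15, "type" : "<NUM>", "position" : 3 }
  ]
}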
It's obvious: after the standard analyzer processes "A2654|10|09|022", the string has been split into 4 terms and the special characters are gone. In other words, if my index uses the standard analyzer, the special characters are already lost at indexing time, so any later search on those special characters is guaranteed to find nothing.
// create the index
PUT test003
{
"mappings": {
"doc": {
"properties": {
"text": {
"analyzer": "standard",
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
// index a document
POST test003/doc
{
"text":"A2654|10|09|022"
}
// query with the special character
GET test003/_search
{
"query": {
"query_string": {
"query": "\\|"
}
}
}
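The response comes back with no hits, roughly like this (a sketch; on 6.x hits.total is still a plain number, on 7.x it is an object):
//response (sketch), took/_shards fields omitted
{
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}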
Everything behaves as expected, and this is exactly why our production searches could not find the data.
So how can we make special-character search work?
After reading up on a few tokenizers, I found the ngram tokenizer.
The NGram Tokenizer
Let's look at the description in the official documentation. What does it say, roughly? The tokenizer slides a window of a specified length across the text and cuts it into grams of that length, which is very effective for languages that have no spaces between words or have very long compound words, such as German.
Seen this way it is quite clear: NGram splits the text according to the window size (gram length) we specify rather than on special characters or whitespace, which means the special characters themselves also become terms.
Since the gram length is something we specify, how do we set it, and are there other parameters? The main ones are min_gram and max_gram (the minimum and maximum gram length); there is also token_chars, which restricts which character classes are allowed inside a gram.
Now let's try it out.
// create the index with an ngram tokenizer
PUT specialchar001
{
"settings": {
"analysis": {
"analyzer": {
"specialchar_analyzer": {
"tokenizer": "specialchar_tokenizer"
}
},
"tokenizer": {
"specialchar_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 2
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"analyzer": "specialchar_analyzer",
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
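One caveat, in case you later want wider grams: starting with ES 7.x the difference between max_gram and min_gram may not exceed index.max_ngram_diff (default 1), so a configuration such as min_gram 1 / max_gram 3 also needs that setting raised. A sketch, using a hypothetical index specialchar002:
//only needed when max_gram - min_gram > 1 (enforced since ES 7.x)
PUT specialchar002
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "specialchar_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      }
    }
  }
}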
// index a document
POST specialchar001/_doc
{
"text": "A2654|10|09|022"
}
// search for the special character
GET specialchar001/_search
{
"query": {
"query_string": {
"query": "\\|"
}
}
}
We can see that the document containing the special character is now returned.
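And because the pipe is now an actual indexed term, highlighting works again as well. A minimal sketch, reusing the query above and adding the standard highlight option on the text field:
//query with highlighting (sketch)
GET specialchar001/_search
{
  "query": {
    "query_string": {
      "query": "\\|"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}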
Now let's look at what terms this string is actually split into:
GET specialchar001/_analyze
{
"analyzer": "specialchar_analyzer",
"text": ["A2654|10|09|022"]
}
{
"tokens" : [
{
"token" : "A",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "A2",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "2",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 2
},
{
"token" : "26",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 3
},
{
"token" : "6",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "65",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "5",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 6
},
{
"token" : "54",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 7
},
{
"token" : "4",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "4|",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 9
},
{
"token" : "|",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 10
},
{
"token" : "|1",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 11
},
{
"token" : "1",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 12
},
{
"token" : "10",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 13
},
{
"token" : "0",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 14
},
{
"token" : "0|",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 15
},
{
"token" : "|",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 16
},
{
"token" : "|0",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 17
},
{
"token" : "0",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 18
},
{
"token" : "09",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 19
},
{
"token" : "9",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 20
},
{
"token" : "9|",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 21
},
{
"token" : "|",
"start_offset" : 11,
"end_offset" : 12,
"type" : "word",
"position" : 22
},
{
"token" : "|0",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 23
},
{
"token" : "0",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 24
},
{
"token" : "02",
"start_offset" : 12,
"end_offset" : 14,
"type" : "word",
"position" : 25
},
{
"token" : "2",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 26
},
{
"token" : "22",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 27
},
{
"token" : "2",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 28
}
]
}
The result matches expectations, and it also exposes the trade-off: with the standard analyzer I had only 4 terms, while with ngram there are 29 (for a 15-character string: 15 unigrams plus 14 bigrams). In other words, using ngram is bound to take up more index space.
Indexing the same data with the two different analyzers and comparing the indices confirms exactly that.
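A quick way to compare, assuming the two example indices above, is the _cat API, which shows the document count and on-disk size side by side:
//compare document count and store size of the two indices
GET _cat/indices/test003,specialchar001?v&h=index,docs.count,store.size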