Is Your Elasticsearch Chinese Analyzer Really in Effect?

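Installing the IK Chinese analysis plugin (which provides two analyzers, ik_max_word and ik_smart; the former is the more commonly used) does not mean it is applied right away, either when text is tokenized for an index's inverted index or when the query keywords of a full-text search are analyzed. An index uses IK only if it is configured to. The following experiment demonstrates this: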

Create the following documents:

POST /student/_doc/1
{
  "name":"徐小小",
  "address":"杭州",
  "age":3,
  "interests":"唱歌 画画  跳舞",
  "birthday":"2017-06-19"
}

POST /student/_doc/2
{
  "name":"刘德华",
  "address":"香港",
  "age":28,
  "interests":"演戏 旅游 小",
  "birthday":"1980-06-19"
}


POST /student/_doc/3
{
  "name":"张小斐",
  "address":"北京",
  "age":28,
  "interests":"小品 旅游 小米手机",
  "birthday":"1990-06-19"
}

POST /student/_doc/4
{
  "name":"王小宝",
  "address":"德州",
  "age":63,
  "interests":"演戏 小品 打牌 小米电视",
  "birthday":"1956-06-19"
}

POST /student/_doc/5
{
  "name":"向华强",
  "address":"香港",
  "age":31,
  "interests":"演戏",
  "birthday":"1958-06-19"
}

Running the following query returns three hits rather than the expected two:

GET student/_search
{
  "query": {
    "match": {
      "interests": {
        "query": "小米"

      }
    }
  }
}

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.8391738,
    "hits": [
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.8391738,
        "_source": {
          "name": "王小宝",
          "address": "德州",
          "age": 63,
          "interests": "演戏 小品 打牌 小米电视",
          "birthday": "1956-06-19"
        }
      },
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.68324494,
        "_source": {
          "name": "张小斐",
          "address": "北京",
          "age": 28,
          "interests": "小品 旅游 小米手机",
          "birthday": "1990-06-19"
        }
      },
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.21110918,
        "_source": {
          "name": "刘德华",
          "address": "香港",
          "age": 28,
          "interests": "演戏 旅游 小",
          "birthday": "1980-06-19"
        }
      }
    ]
  }
}

When ik_max_word is explicitly specified as the query analyzer, searching for "小米" returns no hits at all:

GET student/_search
{
  "query": {
    "match": {
      "interests": {
        "query": "小米",
        "analyzer": "ik_max_word"

      }
    }
  }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

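These two results together suggest what is happening. In the first query, "小米" was split into the two single characters "小" and "米", and the documents stored in ES were also tokenized one character at a time when the inverted index was built, so every document containing "小" or "米" matched (including document 2, whose interests contain a lone "小"). In the second query, ik_max_word presumably kept "小米" as a single term, and no such term exists in an index made of single characters, hence zero hits. The cause is that the student index uses the default standard analyzer, which is not Chinese-aware and treats each Chinese character as a separate token.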
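
One way to check this inference directly, rather than deducing it from search hits, is the _analyze API, which returns the tokens an analyzer produces for a piece of text. The sketch below assumes the student index from the experiment above and that the IK plugin is installed; the sample text is taken from the test documents. The first request analyzes the text the same way the interests field does, and the second uses ik_max_word explicitly.

```console
# Tokens produced by the analyzer that the interests field actually uses
GET /student/_analyze
{
  "field": "interests",
  "text": "小米手机"
}

# Tokens produced by ik_max_word for the same text (requires the IK plugin)
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "小米手机"
}
```

If the first response lists one token per character while the second contains multi-character terms, the index is not using IK.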

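Next, rebuild the student index, this time setting ik_max_word (the finest-grained segmentation mode) as the index's default analyzer at creation time, and repeat the experiment. "小米" is no longer split into two tokens.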

DELETE /student
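# Create the student index and specify its analyzer. This analyzer appears to be applied both
# when building the inverted index and when analyzing query keywords, so there is no need to
# additionally specify a default search analyzer:
#        "default_search": {
#          "type": "ik_max_word"
#        }
# The official documentation does mention, however, that a default search analyzer can be set separately:
# https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html

# Create the student index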
PUT student
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        }
      }
    }
  }
}
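
Before loading data, it may be worth confirming that the setting took effect. As a sketch: the first request returns the index settings, where the analysis.analyzer.default entry should appear; the second analyzes a sample string without naming an analyzer or a field, in which case the _analyze API falls back to the index's default analyzer.

```console
# Show the index settings, including the analysis section
GET /student/_settings

# Analyze sample text with the index's default analyzer
GET /student/_analyze
{
  "text": "小米电视"
}
```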

# Insert the data

POST /student/_doc/1
{
  "name":"徐小小",
  "address":"杭州",
  "age":3,
  "interests":"唱歌 画画  跳舞",
  "birthday":"2017-06-19"
}

POST /student/_doc/2
{
  "name":"刘德华",
  "address":"香港",
  "age":28,
  "interests":"演戏 旅游 小",
  "birthday":"1980-06-19"
}


POST /student/_doc/3
{
  "name":"张小斐",
  "address":"北京",
  "age":28,
  "interests":"小品 旅游 小米手机",
  "birthday":"1990-06-19"
}

POST /student/_doc/4
{
  "name":"王小宝",
  "address":"德州",
  "age":63,
  "interests":"演戏 小品 打牌 小米电视",
  "birthday":"1956-06-19"
}

POST /student/_doc/5
{
  "name":"向华强",
  "address":"香港",
  "age":31,
  "interests":"演戏",
  "birthday":"1958-06-19"
}

# Search for "小米"; this time the result is correct

GET student/_search
{
  "query": {
    "match": {
      "interests": "小米"
    }
  }
}

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6288345,
    "hits": [
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.6288345,
        "_source": {
          "name": "王小宝",
          "address": "德州",
          "age": 63,
          "interests": "演戏 小品 打牌 小米电视",
          "birthday": "1956-06-19"
        }
      },
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "name": "张小斐",
          "address": "北京",
          "age": 28,
          "interests": "小品 旅游 小米手机",
          "birthday": "1990-06-19"
        }
      }
    ]
  }
}

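If we explicitly specify the default standard analyzer as the search analyzer for the query, only one hit comes back: the query is split into "小" and "米", and as the result shows, the only document matched is document 2, whose interests contain a lone "小". This indirectly confirms that ik_max_word is being used both for tokenizing the inverted index and, by default, for tokenizing the search keywords.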

GET student/_search
{
  "query": {
    "match": {
      "interests": {
        "query": "小米",
        "analyzer": "standard"

      }
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.7721133,
    "hits": [
      {
        "_index": "student",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.7721133,
        "_source": {
          "name": "刘德华",
          "address": "香港",
          "age": 28,
          "interests": "演戏 旅游 小",
          "birthday": "1980-06-19"
        }
      }
    ]
  }
}
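
For completeness, here is a sketch of how separate index-time and search-time defaults could be declared, using the default_search key mentioned in the comment above. The index name student_split is hypothetical, and pairing ik_max_word for indexing with ik_smart for searching is only an illustration, not something tested in this article.

```console
# Hypothetical index with different default analyzers for indexing and searching
PUT /student_split
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "ik_max_word" },
        "default_search": { "type": "ik_smart" }
      }
    }
  }
}
```

A query-level "analyzer" parameter, as used in the experiments above, still overrides these defaults.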

 
