Note: the built-in analyzers below are largely unsuitable for Chinese; it is enough to know they exist. The IK analyzer, covered in the next post, is the one actually used in real development.
1. What is analysis (tokenization)
Analysis is the process of converting a text into a series of tokens, also called text analysis; in Elasticsearch this is called Analysis.
Example: 我是中国人 ("I am Chinese") --> 我/是/中国人
2. The analyze API
Analyze text with a specified analyzer:
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}
Analyze text against a field of an index:
POST http://192.168.142.128:9200/itcast/_analyze
{
  "analyzer": "standard",
  "field": "hobby",
  "text": "听音乐"
}
The standard analyzer is not very friendly to Chinese: it splits Chinese text into individual characters.
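This single-character behavior can be illustrated in Python. This is only a rough analogy of what the standard analyzer produces for Chinese input, not the analyzer itself:

```python
# For Chinese text, the standard analyzer's output looks roughly like
# iterating over the string character by character.
text = "听音乐"  # "listen to music"
tokens = list(text)
print(tokens)  # ['听', '音', '乐']
```

Each character becomes its own token, which is why search quality for Chinese suffers and a dedicated analyzer such as IK is needed.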
3. Built-in analyzers
1) Standard analyzer
The standard analyzer splits text on word boundaries and lowercases the tokens.
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "standard",
  "text": "A man becomes learned by asking questions."
}
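For plain English text, the standard analyzer's output can be approximated with a simple sketch. The real analyzer uses Unicode text segmentation and is more nuanced; this is only an approximation for illustration:

```python
import re

def standard_like(text):
    # Rough approximation of the standard analyzer for plain English:
    # split on runs of letters/digits and lowercase each token.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

print(standard_like("A man becomes learned by asking questions."))
# ['a', 'man', 'becomes', 'learned', 'by', 'asking', 'questions']
```

Note that the trailing period is dropped and every token is lowercased, matching what the _analyze request above returns for this sentence.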
2) Simple analyzer
The simple analyzer splits text on any non-letter character and lowercases the tokens.
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "simple",
  "text": "If the document doesn't already exist"
}
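The split-on-non-letters behavior can be sketched as follows (an approximation, not the analyzer itself). Note how the apostrophe in "doesn't" causes the word to break into two tokens:

```python
import re

def simple_like(text):
    # Approximation of the simple analyzer: keep only runs of letters,
    # lowercase them. Digits and punctuation act as separators.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(simple_like("If the document doesn't already exist"))
# ['if', 'the', 'document', 'doesn', 't', 'already', 'exist']
```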
3) Whitespace analyzer
The whitespace analyzer splits text on whitespace only; case and punctuation are preserved.
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "whitespace",
  "text": "If the document doesn't already exist"
}
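This analyzer's behavior is essentially a whitespace split, which can be sketched directly:

```python
def whitespace_like(text):
    # The whitespace analyzer only breaks on whitespace:
    # no lowercasing, and "doesn't" stays a single token.
    return text.split()

print(whitespace_like("If the document doesn't already exist"))
# ['If', 'the', 'document', "doesn't", 'already', 'exist']
```

Compare with the simple analyzer above: here "If" keeps its capital letter and "doesn't" is not split.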
4) Stop analyzer
The stop analyzer works like the simple analyzer but additionally removes stop words such as "the" and "an".
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "stop",
  "text": "If the document doesn't already exist"
}
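A sketch of the two-step behavior: tokenize like the simple analyzer, then filter out stop words. The stop-word set below is a subset of Elasticsearch's default English list, included only for illustration:

```python
import re

# A sample of the English stop words removed by default
# (the real default list in Elasticsearch is this _english_ set).
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it", "no", "not", "of",
              "on", "or", "such", "that", "the", "their", "then",
              "there", "these", "they", "this", "to", "was", "will",
              "with"}

def stop_like(text):
    # Tokenize like the simple analyzer, then drop stop words.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_like("If the document doesn't already exist"))
# ['document', 'doesn', 't', 'already', 'exist']
```

"if" and "the" are filtered out; "t" survives because it is a fragment of "doesn't", not a stop word.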
5) Keyword analyzer
The keyword analyzer treats the entire input as a single keyword; no tokenization is performed.
POST http://192.168.142.128:9200/_analyze
{
  "analyzer": "keyword",
  "text": "If the document doesn't already exist"
}
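The keyword analyzer's behavior is the simplest of all, and can be sketched in one line:

```python
def keyword_like(text):
    # The keyword analyzer emits the entire input as a single token,
    # unchanged.
    return [text]

print(keyword_like("If the document doesn't already exist"))
# ["If the document doesn't already exist"]
```

This is useful for fields such as IDs or tags that must be matched exactly rather than searched word by word.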