24. Analyzers

tianlan996

Category: Elasticsearch核心知识 (Elasticsearch Core Knowledge)

Article link: https://blog.csdn.net/tianlan996/article/details/94591971


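1. The default analyzer: standard

Components:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing (it was a placeholder and was removed in Elasticsearch 7.0)
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it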
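
The pipeline described above can be approximated in a few lines of Python. This is only a sketch, not how Elasticsearch implements it: the function name standard_analyze is made up for illustration, and the regular expression \w+ is a rough stand-in for the Unicode word-boundary rules that the real standard tokenizer follows.

```python
import re


def standard_analyze(text, stopwords=None):
    """Rough imitation of the standard analyzer (hypothetical helper)."""
    # Tokenizer step: \w+ only approximates real word-boundary splitting.
    tokens = re.findall(r"\w+", text)
    # Lowercase token filter.
    tokens = [t.lower() for t in tokens]
    # Stop token filter: skipped unless a stop word set is supplied,
    # mirroring "disabled by default".
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


print(standard_analyze("The QUICK brown-fox"))
# ['the', 'quick', 'brown', 'fox']
```

With no stop words supplied, "the" survives, which matches the default behaviour described in the list.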

2. Changing the analyzer settings

Enable the English stop word token filter

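The request below defines an analyzer named es_std. Its type is standard, and setting stopwords to _english_ enables the built-in English stop word list (a, is, in and so on).

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}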

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
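
To see what the two _analyze requests are expected to return, here is a small Python imitation. It is a sketch under assumptions: ENGLISH_STOPWORDS is written out by hand to mirror the default English stop word set and should be checked against the Elasticsearch documentation for your version, and the analyze function is a hypothetical helper that uses \w+ in place of the real tokenizer.

```python
import re

# Assumed copy of the default English stop word list (_english_).
ENGLISH_STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
})


def analyze(text, stopwords=frozenset()):
    """Tokenize, lowercase, then drop stop words (hypothetical helper)."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stopwords]


text = "a dog is in the house"
print(analyze(text))                     # like "standard": all six tokens
print(analyze(text, ENGLISH_STOPWORDS))  # like "es_std": ['dog', 'house']
```

The standard analyzer keeps every word because its stop filter is off, while es_std drops a, is, in and the.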

3. Building your own custom analyzer

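The request below defines three things: a character filter &_to_and that maps & to and, a token filter my_stopwords that holds a custom stop word list (the, a), and an analyzer my_analyzer that combines them. my_analyzer first strips HTML tags (html_strip) and applies &_to_and, then tokenizes with the standard tokenizer, and finally applies lowercase and my_stopwords. If my_index already exists from the previous section, delete it first (DELETE /my_index), because creating an index that already exists fails.

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}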

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
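
The order of the stages in my_analyzer (character filters, then tokenizer, then token filters) can be traced with a short Python imitation. This is a sketch, not the Elasticsearch implementation: the function my_analyzer and its regular expressions are illustrative assumptions, the tag-stripping regex is far cruder than the real html_strip, and it assumes the mapping rule replaces & with and without surrounding spaces.

```python
import re

MY_STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def my_analyzer(text):
    """Imitate char filters -> tokenizer -> token filters (hypothetical)."""
    # Char filter 1: crude stand-in for html_strip, removes <...> tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Char filter 2: stand-in for &_to_and (assumes "&" becomes "and").
    text = text.replace("&", "and")
    # Tokenizer: \w+ approximates the standard tokenizer.
    tokens = re.findall(r"\w+", text)
    # Token filter 1: lowercase.
    tokens = [t.lower() for t in tokens]
    # Token filter 2: custom stop words.
    return [t for t in tokens if t not in MY_STOPWORDS]


print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

Because the character filters run before tokenization, the & is rewritten while "tom&jerry" is still one piece of text, so in this imitation it ends up as a single token. Compare this with the actual response from the _analyze request, since the real filters may differ in detail.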

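Finally, assign the custom analyzer to a field. The request below uses the typed mapping endpoint of Elasticsearch 6.x and earlier; from 7.0 on, mapping types were removed and the path becomes PUT /my_index/_mapping.

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}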
