Analysis pipeline
- Character Filters: pre-process the raw text, e.g. stripping HTML tags
- Tokenizer: splits the text into tokens according to rules, e.g. on whitespace
- Token Filters: post-process the tokens, e.g. lowercasing
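The three stages above can be sketched in Python. This is only an illustration of the pipeline order (the function names are hypothetical, and it assumes an HTML-stripping character filter, a whitespace tokenizer, and a lowercase token filter), not how Elasticsearch implements it:

```python
import re

def char_filter(text):
    # Character filter: strip HTML-like tags before tokenization
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split on whitespace
    return text.split()

def token_filters(tokens):
    # Token filter: lowercase each token
    return [t.lower() for t in tokens]

def analyze(text):
    # The stages always run in this order: char filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Quick</b> Brown FOX"))  # ['quick', 'brown', 'fox']
```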
Built-in analyzers
Use GET /_analyze to inspect how a given analyzer tokenizes text.
Standard
The default analyzer: splits on word boundaries and lowercases tokens.
GET /_analyze
{
"analyzer": "standard",
"text":"1 A 2 b"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "a",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
Simple
Splits on any non-letter character, so digits and symbols are dropped (Chinese characters count as letters); tokens are lowercased (Chinese characters are unchanged by lowercasing).
GET /_analyze
{
"analyzer": "simple",
"text":"1 好 ef A-B 2 c"
}
# result
{
"tokens" : [
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 4
}
]
}
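The simple analyzer's behavior can be approximated in Python: keep runs of letters and lowercase everything. This is an illustrative sketch matching the output above, not the actual Elasticsearch implementation:

```python
import re

def simple_analyze(text):
    # [^\W\d_]+ matches runs of letters only: digits, punctuation and
    # whitespace act as delimiters. Python's \w is Unicode-aware, so
    # CJK characters like 好 count as letters and survive.
    return re.findall(r"[^\W\d_]+", text.lower())

print(simple_analyze("1 好 ef A-B 2 c"))  # ['好', 'ef', 'a', 'b', 'c']
```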
Stop
Builds on Simple by also filtering out stop words (the, a, is, etc.).
GET /_analyze
{
"analyzer": "stop",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 6
}
]
}
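The stop analyzer is the same tokenization followed by a stop-word filter. A sketch (the stop list here is a small subset of the default English one, just enough for this example):

```python
import re

STOPWORDS = {"the", "a", "is"}  # small subset of the default English stop list

def stop_analyze(text):
    # Same letter-run tokenization as the simple analyzer sketch,
    # then drop any token found in the stop list.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(stop_analyze("1 好 ef A-B is the 2 c"))  # ['好', 'ef', 'b', 'c']
```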
Whitespace
Splits on whitespace only; does not lowercase.
GET /_analyze
{
"analyzer": "whitespace",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "A-B",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "the",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 5
},
{
"token" : "2",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 7
}
]
}
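Whitespace tokenization maps directly onto Python's str.split, with no lowercasing, which reproduces the token list above:

```python
text = "1 好 ef A-B is the 2 c"
# split() with no argument splits on runs of whitespace and keeps case
print(text.split())  # ['1', '好', 'ef', 'A-B', 'is', 'the', '2', 'c']
```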
Keyword
No tokenization: the entire input is emitted as a single token.
GET /_analyze
{
"analyzer": "keyword",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1 好 ef A-B is the 2 c",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 0
}
]
}
Pattern
Splits on a regular expression; the default pattern is \W+ (any run of non-word characters). Note that 好 is dropped below: the underlying Java regex \w only matches [A-Za-z0-9_] by default, so CJK characters act as delimiters.
GET /_analyze
{
"analyzer": "pattern:\d+",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "the",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 5
},
{
"token" : "2",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 7
}
]
}
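The default \W+ split can be reproduced with Python's re module, using re.ASCII so that \w means [A-Za-z0-9_] as in Java's default (which is why 好 is treated as a delimiter and dropped). Again a sketch, not the ES implementation:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the pattern after lowercasing; re.ASCII mimics Java's
    # ASCII-only \w, so CJK characters are delimiters, not token characters.
    # The filter drops empty strings produced at the edges of the split.
    return [t for t in re.split(pattern, text.lower(), flags=re.ASCII) if t]

print(pattern_analyze("1 好 ef A-B is the 2 c"))
# ['1', 'ef', 'a', 'b', 'is', 'the', '2', 'c']
```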
Chinese analysis
Pick the release that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik. My current ES version is v7.1.0, so I install the 7.1.0 plugin as well:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
If you run an ES cluster, install the IK plugin on every node; otherwise Kibana may behave abnormally. IK provides two analyzers: ik_smart (coarse-grained) and ik_max_word (fine-grained).
GET /_analyze
{
"analyzer": "ik_smart",
"text":"苹果电脑是比较适合程序员的电脑"
}
# result
{
"tokens" : [
{
"token" : "苹果电脑",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "比较",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "适合",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "程序员",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "的",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "电脑",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 6
}
]
}