Tokenization
1. Exact values vs. full text
Elasticsearch builds an inverted index for each field; an exact-value (keyword) field is indexed as a single untokenized term rather than being analyzed.
1. keyword (exact value): matched exactly and never analyzed; numbers, dates, or precise strings (IDs, status codes) are typically mapped to this type
2. text: full-text search field; its value is analyzed (tokenized) at index time
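As a sketch, an explicit mapping using both types might look like this (the index and field names are assumptions for illustration):

```json
PUT demo
{
  "mappings": {
    "properties": {
      "status":  { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
```

A term query on `status` must match the stored value exactly, while a match query on `message` runs against the analyzed terms.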
Analyzer
1. Processing pipeline
Character Filters --> Tokenizer --> Token Filters
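All three stages can be exercised in one `_analyze` call; the request below (a sketch using only built-in components) strips HTML, tokenizes, then lowercases and removes stopwords, leaving quick, brown, fox:

```json
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The QUICK brown fox</b>"
}
```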
Character Filter
Preprocesses the text before it reaches the Tokenizer, e.g. adding or replacing characters. Multiple character filters can be configured; they affect the position and offset information the Tokenizer produces.
Built-in character filters:
- HTML strip – removes HTML tags
- Mapping – string replacement
- Pattern replace – regex-based replacement
Tokenizer
Splits the raw text into terms according to certain rules.
Built-in tokenizers:
whitespace/standard/uax_url_email/pattern/keyword/path_hierarchy
Custom tokenizers can also be implemented in Java.
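For example, `uax_url_email` keeps URLs and e-mail addresses as single tokens where `standard` would split them (the sample text is an assumption):

```json
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "contact admin@example.com or visit https://www.elastic.co"
}
```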
Token Filters
Adds, modifies, or removes the terms produced by the Tokenizer.
Built-in filters: lowercase / stop / synonym (adds synonyms)
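The `synonym` filter can be defined inline for a quick test; a sketch (the synonym pair is made up):

```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick test"
}
```

Both quick and fast appear in the output at the same position, so a search for either term matches.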
POST users/_analyze
{
"text": "我是中国人",
"analyzer": "ik_max_word"
}
# dynamic mapping: "level" is indexed as text with a keyword sub-field
PUT logs/_doc/1
{"level":"DEBUG"}
GET /logs/_mapping
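Because dynamic mapping indexes the string as text with a keyword sub-field, `level` supports both query styles; a sketch assuming the document above:

```json
# exact, un-analyzed match on the sub-field
GET logs/_search
{
  "query": { "term": { "level.keyword": "DEBUG" } }
}

# full-text match against the analyzed terms
GET logs/_search
{
  "query": { "match": { "level": "debug" } }
}
```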
# html_strip removes HTML markup
POST _analyze
{
"tokenizer":"keyword",
"char_filter":["html_strip"],
"text": "<b>hello world</b>"
}
# split a path into hierarchy tokens
POST _analyze
{
"tokenizer":"path_hierarchy",
"text":"/user/ymruan/a/b/c/d/e"
}
# use a mapping char filter to replace characters
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ "- => _"]
}
],
"text": "123-456, I-test! test-990 650-555-1234"
}
# mapping char filter replacing emoticons
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ ":) => happy", ":( => sad"]
}
],
"text": ["I am feeling :)", "Feeling :( today"]
}
# whitespace tokenizer with stop and snowball filters
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop","snowball"],
"text": ["The girls in China are playing this game! zw"]
}
# whitespace with stop: "The" survives because stop is case-sensitive
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop","snowball"],
"text": ["The rain in Spain falls mainly on the plain."]
}
# with lowercase added, "The" is recognized as a stopword and removed
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["lowercase","stop","snowball"],
"text": ["The girls in China are playing this game! zw"]
}
# regex replacement with pattern_replace
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "pattern_replace",
"pattern" : "http://(.*)",
"replacement" : "$1"
}
],
"text" : "http://www.elastic.co"
}