GET _analyze
{
"analyzer":"standard",
"text":"中国棒棒的"
}
Result:
{
"tokens" : [
{
"token" : "中",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "国",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "棒",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "棒",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "的",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
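Conclusion: with the default analyzer, every single Chinese character becomes its own token, which is a poor fit for searching Chinese text.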
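To make the per-character behavior concrete, here is a minimal Python sketch that reproduces the token list from the response above. It is not how Elasticsearch implements the standard analyzer; the function name `per_character_tokens` is made up for illustration, and the sketch assumes the input only contains characters from the basic CJK Unified Ideographs block.

```python
def per_character_tokens(text):
    """Emit one token per CJK ideograph, mimicking the response shown above."""
    tokens = []
    for offset, char in enumerate(text):
        # Assumption: only the basic CJK Unified Ideographs block is handled.
        if "\u4e00" <= char <= "\u9fff":
            tokens.append({
                "token": char,
                "start_offset": offset,
                "end_offset": offset + 1,
                "position": len(tokens),
            })
    return tokens


standard_like = per_character_tokens("中国棒棒的")
print([t["token"] for t in standard_like])  # ['中', '国', '棒', '棒', '的']
```

A query for the word 中国 would therefore have to be matched as two unrelated single-character tokens, which is why a dictionary-based analyzer is preferable for Chinese.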
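So let's see what we get once we switch to the ik analyzer.
Concepts:
- ik_smart: coarsest granularity; keeps chunks as large as possible and produces the fewest tokens
- ik_max_word: finest granularity; produces the most tokens
- stop: stop-word analyzer; splits on non-letter characters and removes stop words (English ones by default)
- keyword: keyword analyzer; the whole input is kept as a single token
Examples: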
ik_smart:
GET _analyze
{
"analyzer":"ik_smart",
"text":"中国棒棒的,北京大学毕业生"
}
Result:
{
"tokens" : [
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "棒棒",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "的",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "北京大学",
"start_offset" : 6,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "毕业生",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 4
}
]
}
ik_max_word:
GET _analyze
{
"analyzer":"ik_max_word",
"text":"中国棒棒的,北京大学毕业生"
}
Tokenization result:
{
"tokens" : [
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "棒棒",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "的",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "北京大学",
"start_offset" : 6,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "北京大",
"start_offset" : 6,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "北京",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "大学毕业",
"start_offset" : 8,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "大学",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "毕业生",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "毕业",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "生",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_CHAR",
"position" : 10
}
]
}
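The difference between the two ik modes can be checked directly from the two responses above. The following Python sketch copies the tokens and offsets from those responses and compares them; the helper name `has_overlaps` is made up for illustration, and nothing here calls Elasticsearch.

```python
# (token, start_offset, end_offset), copied from the ik_smart response above.
IK_SMART = [
    ("中国", 0, 2), ("棒棒", 2, 4), ("的", 4, 5),
    ("北京大学", 6, 10), ("毕业生", 10, 13),
]

# (token, start_offset, end_offset), copied from the ik_max_word response above.
IK_MAX_WORD = [
    ("中国", 0, 2), ("棒棒", 2, 4), ("的", 4, 5),
    ("北京大学", 6, 10), ("北京大", 6, 9), ("北京", 6, 8),
    ("大学毕业", 8, 12), ("大学", 8, 10),
    ("毕业生", 10, 13), ("毕业", 10, 12), ("生", 12, 13),
]


def has_overlaps(tokens):
    """Return True if any two tokens cover a shared character range."""
    spans = sorted((start, end) for _, start, end in tokens)
    # After sorting by start offset, checking neighbours is sufficient.
    return any(
        next_start < prev_end
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    )


print(len(IK_SMART), len(IK_MAX_WORD))   # 5 11
print(has_overlaps(IK_SMART))            # False
print(has_overlaps(IK_MAX_WORD))         # True
extra = [tok for tok, _, _ in IK_MAX_WORD if tok not in {t for t, _, _ in IK_SMART}]
print(extra)  # ['北京大', '北京', '大学毕业', '大学', '毕业', '生']
```

For this input, ik_smart yields one non-overlapping segmentation, while ik_max_word additionally emits the shorter dictionary words nested inside it, so more queries can match at the cost of a larger index.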
stop (stop-word) analyzer:
GET _analyze
{
"analyzer":"stop",
"text":"中国棒棒的,北京大学毕业生"
}
Result of the stop analyzer:
{
"tokens" : [
{
"token" : "中国棒棒的",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "北京大学毕业生",
"start_offset" : 6,
"end_offset" : 13,
"type" : "word",
"position" : 1
}
]
}
GET _analyze
{
"analyzer":"stop",
"text":"汽车制造/维修/零配件"
}
{
"tokens" : [
{
"token" : "汽车制造",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "维修",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "零配件",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 2
}
]
}
keyword analyzer:
GET _analyze
{
"analyzer":"keyword",
"text":"汽车制造/维修/零配件"
}
{
"tokens" : [
{
"token" : "汽车制造/维修/零配件",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
}
]
}
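The stop and keyword results above can be approximated with a few lines of Python. This is only a rough simulation under stated assumptions, not the Elasticsearch implementation: `stop_like` and `keyword_like` are hypothetical names, Python's `str.isalpha()` stands in for the analyzer's notion of a letter, and `STOP_WORDS` is a small illustrative subset of English stop words rather than the real default list.

```python
# Assumption: a tiny illustrative subset; the real default list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def stop_like(text):
    """Split at non-letter characters, lowercase, drop stop words."""
    tokens, start = [], None
    # The trailing space is a sentinel that flushes the last run of letters.
    for i, char in enumerate(text + " "):
        if char.isalpha():
            if start is None:
                start = i
        elif start is not None:
            word = text[start:i].lower()
            if word not in STOP_WORDS:
                tokens.append((word, start, i))
            start = None
    return tokens


def keyword_like(text):
    """Keep the whole input as one token."""
    return [(text, 0, len(text))]


print(stop_like("汽车制造/维修/零配件"))
# [('汽车制造', 0, 4), ('维修', 5, 7), ('零配件', 8, 11)]
print(keyword_like("汽车制造/维修/零配件"))
# [('汽车制造/维修/零配件', 0, 11)]
print(stop_like("The quick fox"))
# [('quick', 4, 9), ('fox', 10, 13)]
```

The sketch also suggests why 的 survives in the stop result for 中国棒棒的: the text is only cut at the punctuation, and an English stop-word list contains no Chinese words to remove.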