Elasticsearch provides a tokenizer called the N-gram tokenizer. The official documentation introduces it as follows:
N-gram tokenizer
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.
Example output
With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2:
POST _analyze
{
"tokenizer": "ngram",
"text": "Quick Fox"
}
The above sentence would produce the following terms:
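To make the sliding-window behavior concrete, here is a small Python sketch of what the default ngram tokenizer does. It is a simplification: the real tokenizer is configurable via min_gram, max_gram, and token_chars, and with default settings it does not split on any character, so spaces end up inside grams too.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Emit every substring of length min_gram..max_gram,
    sliding one character at a time, in the same order the
    Elasticsearch ngram tokenizer emits them with defaults."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

# Same input as the _analyze request above; note the grams
# that contain a space ("k ", " ", " F") because nothing is split.
print(ngrams("Quick Fox"))
```

Running this prints the 17 terms [Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x], matching what the _analyze request returns.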