standard
The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
PUT standard_tokenizer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
POST standard_tokenizer_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
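With max_token_length set to 5, any token longer than five characters is split at five-character intervals, so "jumped" becomes "jumpe" and "d". The sketch below approximates this behavior in Python; it is not the real tokenizer (which follows the full UAX #29 word-boundary rules and, for example, keeps "dog's" as one token), and the function name is ours.

```python
import re

def standard_like_tokenize(text, max_token_length=5):
    # Rough approximation of the standard tokenizer: runs of word
    # characters form tokens (the real tokenizer follows Unicode
    # UAX #29 word boundaries, so e.g. "dog's" would stay intact).
    tokens = re.findall(r"\w+", text)
    # max_token_length chops any longer token into fixed-size chunks.
    out = []
    for tok in tokens:
        for i in range(0, len(tok), max_token_length):
            out.append(tok[i:i + max_token_length])
    return out

print(standard_like_tokenize(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

Note how "jumped" is emitted as the two tokens "jumpe" and "d", mirroring the effect of the max_token_length setting above.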
letter
The letter tokenizer breaks text into terms whenever it encounters a character which is not a letter.
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
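The effect is easy to reproduce: split on every run of non-letter characters and drop the empty pieces. A minimal Python sketch (our own helper, not the Elasticsearch implementation):

```python
import re

def letter_tokenize(text):
    # Split on any run of non-letter characters (digits, punctuation,
    # whitespace, underscore) and drop empty strings.
    return [t for t in re.split(r"[\W\d_]+", text) if t]

print(letter_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The digit "2" disappears, "Brown-Foxes" splits at the hyphen, and "dog's" splits at the apostrophe into "dog" and "s".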
lowercase
The lowercase tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
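In other words, it is the letter behavior plus a lowercase step. A sketch under the same assumptions as above (our own helper, not the real tokenizer):

```python
import re

def lowercase_tokenize(text):
    # Same splitting rule as the letter tokenizer, then lowercase
    # every resulting term.
    return [t.lower() for t in re.split(r"[\W\d_]+", text) if t]

print(lowercase_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

"QUICK" comes out as "quick", and the split points are identical to the letter tokenizer's.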
whitespace
The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
uax_url_email
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "my home page is www.zhengcj01.com and the email is [email protected]"
}
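The idea can be illustrated with a two-pass sketch: first pull out anything matching a (deliberately simplified) URL or email pattern as a single token, then tokenize the remaining stretches of text word by word. The regex below is a crude stand-in for the real UAX #29 grammar, and the sample address is a hypothetical one of ours:

```python
import re

# Very simplified patterns -- the real uax_url_email tokenizer uses the
# full UAX #29 grammar extended with URL and email rules.
URL_OR_EMAIL = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"      # email address
    r"|(?:https?://|www\.)[\w./-]+"      # URL
)

def uax_like_tokenize(text):
    tokens = []
    pos = 0
    # Emit each URL/email as one token; word-tokenize the text between them.
    for m in URL_OR_EMAIL.finditer(text):
        tokens.extend(re.findall(r"\w+", text[pos:m.start()]))
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(re.findall(r"\w+", text[pos:]))
    return tokens

print(uax_like_tokenize(
    "my home page is www.example.com and the email is john@example.com"))
```

Here "www.example.com" and "john@example.com" each come out as one token, whereas the standard tokenizer would break them apart at the dots and the @ sign.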