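In elasticsearch-x.x.x/config, create a folder named analysis, and create a txt file inside it (other file extensions also work). For Chinese synonyms, the file must be encoded as UTF-8; otherwise Elasticsearch reports the exception "type": "malformed_input_exception", "reason": "Input length = 1".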
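One way to be sure the file really is UTF-8 is to write it from a script that states the encoding explicitly. The following is only a minimal sketch: the helper name `write_synonyms` and the example path are assumptions made for illustration, not part of Elasticsearch.

```python
from pathlib import Path


def write_synonyms(path, rules):
    """Write one synonym rule per line, explicitly encoded as UTF-8.

    Hypothetical helper for illustration; `path` should point into the
    analysis folder under your own Elasticsearch config directory.
    """
    target = Path(path)
    # Create the analysis folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stating the encoding avoids depending on the platform default.
    target.write_text("\n".join(rules) + "\n", encoding="utf-8")
    return target
```

For example, `write_synonyms("config/analysis/synonyms.txt", ["番茄,西红柿"])` would produce a one-line synonym file, assuming the script runs from the Elasticsearch installation directory.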
Create the index with custom analyzers that reference the synonym file:
PUT /test_indexone
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["by_tfr", "by_sfr"],
          "char_filter": ["by_cfr"]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["by_tfr", "by_sfr"],
          "char_filter": ["by_cfr"]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [" "]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": ["| => |"]
        }
      }
    }
  }
}
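char_filter preprocesses the original search text before tokenization;
tokenizer splits the search text into tokens;
filter post-processes the tokens emitted by the tokenizer, for example by removing, modifying, or adding tokens.
Synonym search works through a custom filter: as it processes the tokens emitted by the tokenizer, it looks up the synonyms of each token and adds them to the token stream.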
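The three stages can be mimicked in a few lines of Python to make the order of operations concrete. This is a toy model under loud assumptions, not how Elasticsearch is implemented: the tokenizer here simply splits on whitespace (IK would segment Chinese text), and the function names, the `|` to space mapping, and the synonym table are all invented for illustration.

```python
def char_filter(text, mappings):
    """Stage 1: rewrite characters in the raw text before tokenization."""
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text


def tokenizer(text):
    """Stage 2: stand-in tokenizer that splits on whitespace only."""
    return text.split()


def synonym_filter(tokens, synonyms):
    """Stage 3: keep each token and append any synonyms registered for it."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.extend(s for s in synonyms.get(tok, []) if s != tok)
    return out


def analyze(text, mappings, synonyms):
    """Run the stages in order: char_filter, then tokenizer, then filter."""
    return synonym_filter(tokenizer(char_filter(text, mappings)), synonyms)


# Illustrative data: "|" is mapped to a space, and the two words are synonyms.
synonyms = {"番茄": ["西红柿"], "西红柿": ["番茄"]}
print(analyze("番茄|炒蛋", {"|": " "}, synonyms))
# prints ['番茄', '西红柿', '炒蛋']
```

The point of the sketch is only the ordering: the synonym lookup sees tokens, not raw text, so it depends on what the tokenizer produces.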
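Each synonym group in the file takes one of the following two formats (番茄 and 西红柿 both mean "tomato"):
- 番茄,西红柿
- 番茄,西红柿 => 西红柿
With the first format, whether 番茄 or 西红柿 is indexed, the analyzer outputs the tokens ['番茄', '西红柿'].
With the second format, whether 番茄 or 西红柿 is indexed, the analyzer outputs the tokens ['西红柿'].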
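The difference between the two formats can be reproduced with a small parser. This is a simplified sketch that assumes half-width commas as separators and ignores token positions; `parse_rules` and `apply_rules` are hypothetical names, not Elasticsearch APIs.

```python
def parse_rules(lines):
    """Build a token -> output-tokens table from synonym rule lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every left-hand term is replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent terms: every term expands to the whole group.
            sources = targets = [t.strip() for t in line.split(",") if t.strip()]
        for src in sources:
            entry = table.setdefault(src, [])
            for tgt in targets:
                if tgt not in entry:
                    entry.append(tgt)
    return table


def apply_rules(tokens, table):
    """Replace each token by its table entry; unknown tokens pass through."""
    out = []
    for tok in tokens:
        out.extend(table.get(tok, [tok]))
    return out


equivalent = parse_rules(["番茄,西红柿"])
explicit = parse_rules(["番茄,西红柿 => 西红柿"])
print(apply_rules(["西红柿"], equivalent))  # prints ['番茄', '西红柿']
print(apply_rules(["番茄"], explicit))      # prints ['西红柿']
```

With the first format both words stay searchable under either spelling; with the second, everything is normalized to the single term on the right-hand side.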