Elastic Stack: Analyzers

  1. Document normalization
  2. Character filters: preprocessing applied to the raw text before tokenization
  • HTML Strip
# Every example recreates test_index; run DELETE test_index first if it already exists
PUT test_index
{
	"settings":{
		"analysis":{
			"char_filter":{
				"my_html_filter":{
					"type":"html_strip",
					"escaped_tags":["a"]
				}
			},
			"analyzer":{
				"my_analyzer":{
					"tokenizer":"keyword",
					"char_filter":["my_html_filter"]
				}
			}
		}
	}
}

GET test_index/_analyze
{
	"text":"<p>U r an <a>idiot</a>!</p>",
	"analyzer":"my_analyzer"
}
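
To make the effect of `escaped_tags` concrete, here is a rough Python sketch of what the character filter does to the sample text. It is an approximation for illustration only, not Elasticsearch's implementation: the helper name `strip_html` is made up here, and the real `html_strip` filter also decodes HTML entities and replaces block-level tags such as `<p>` with newlines, which this sketch skips.

```python
import re

# Matches an opening or closing tag and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text, escaped_tags=()):
    """Remove HTML tags from text, keeping tags whose name is in escaped_tags."""
    keep = {name.lower() for name in escaped_tags}

    def repl(match):
        # Escaped tags are kept verbatim; every other tag is dropped.
        return match.group(0) if match.group(1).lower() in keep else ""

    return TAG.sub(repl, text)


print(strip_html("<p>U r an <a>idiot</a>!</p>", escaped_tags=["a"]))
```

Because the analyzer uses the `keyword` tokenizer, the filtered text is emitted as a single token, so whatever survives the character filter is exactly what gets indexed.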
  • Mapping
PUT test_index
{
	"settings":{
		"analysis":{
			"char_filter":{
				"my_mapping_filter":{
					"type":"mapping",
					"mappings":[
						"小 => ^",
						"阳 => %",
						"人 => &"
					]
				}
			},
			"analyzer":{
				"my_analyzer":{
					"tokenizer":"keyword",
					"char_filter":["my_mapping_filter"]
				}
			}
		}
	}
}
GET test_index/_analyze
{
	"text":"你这个小阳人,哼!",
	"analyzer":"my_analyzer"
}
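
The mapping filter is a plain character-for-character substitution. The Python sketch below mimics it for the three rules above; the names `MAPPINGS` and `apply_mappings` are invented for this illustration, and it only handles single-character keys, whereas the real `mapping` filter also accepts multi-character keys and prefers the longest match.

```python
# Stand-in for the "mappings" list in the index settings above.
MAPPINGS = {"小": "^", "阳": "%", "人": "&"}


def apply_mappings(text, mappings):
    """Replace each single-character key with its mapped value."""
    return text.translate(str.maketrans(mappings))


print(apply_mappings("你这个小阳人,哼!", MAPPINGS))
```

Characters without a rule pass through untouched, so only the three mapped characters change.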
  • Pattern Replace
PUT test_index
{
	"settings":{
		"analysis":{
			"char_filter":{
				"my_pattern_filter":{
					"type":"pattern_replace",
					"pattern":"(\\d{3})\\d+(\\d{3})",
					"replacement":"$1######$2"
				}
			},
			"analyzer":{
				"my_analyzer":{
					"tokenizer":"keyword",
					"char_filter":["my_pattern_filter"]
				}
			}
		}
	}
}
GET test_index/_analyze
{
	"text":"My ID Card No is 31098978786798",
	"analyzer":"my_analyzer"
}
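
The regular expression keeps the first and last three digits of a long number and masks everything in between. As a quick way to check the pattern outside Elasticsearch, the same expression can be run with Python's `re` module. This is a sketch under the assumption that Java and Python regex semantics agree for this simple pattern; `mask_digits` is a hypothetical helper name.

```python
import re


def mask_digits(text):
    """Keep the first and last three digits of a digit run, mask the middle."""
    # Java's $1/$2 group references are written \1/\2 in Python.
    return re.sub(r"(\d{3})\d+(\d{3})", r"\1######\2", text)


print(mask_digits("My ID Card No is 31098978786798"))
```

Note that the replacement always inserts six `#` characters, regardless of how many digits were matched in the middle.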

  1. Token filters
PUT test_index
{
	"settings":{
		"analysis":{
			"filter":{
				"my_synonym_filter":{
					"type":"synonym_graph",
					"synonyms_path":"analysis/synonyms.txt"
					
				}
			},
			"analyzer":{
				"my_analyzer":{
					"tokenizer":"ik_max_word",
					"filter":["my_synonym_filter"]
				}
			}
		}
	}
}
GET test_index/_analyze
{
	"text":"123456789",
	"analyzer":"my_analyzer"
}


PUT test_index
{
	"settings":{
		"analysis":{
			"filter":{
				"my_synonym_filter":{
					"type":"synonym",
					"synonyms":["中国移动,新疆联通=>电信运营商","湖南电信=>南方运营商"]
				}
			},
			"analyzer":{
				"my_analyzer":{
					"tokenizer":"standard",
					"filter":["my_synonym_filter"]
				}
			}
		}
	}
}
GET /test_index/_analyze
{
	"text":["中国移动,新疆联通","湖南电信"],
	"analyzer":"my_analyzer"
}
# Uppercase token filter
GET test_index/_analyze
{
	"tokenizer":"standard",
	"filter":["uppercase"],
	"text":""
}

# Lowercase token filter
GET test_index/_analyze
{
	"tokenizer":"standard",
	"filter":["lowercase"],
	"text":""
}

# Conditional token filter: uppercase only the tokens shorter than 5 characters
GET test_index/_analyze
{
	"tokenizer":"standard",
	"filter":[
		{
			"type":"condition",
			"filter":["uppercase"],
			"script":{
				"source":"token.getTerm().length() < 5"
			}
		}
	],
	"text":"qazwsx edcrfvgb yhnuj iklopqasde qwert"
}
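
The logic of the conditional filter can be sketched in a few lines of Python. This is only an illustration of the predicate, with a whitespace split standing in for the `standard` tokenizer and `conditional_uppercase` as a made-up helper name. It also shows something easy to miss: every token in the sample text has at least five characters, so with the `< 5` predicate nothing is uppercased.

```python
def conditional_uppercase(tokens, max_len=5):
    """Uppercase only the tokens shorter than max_len; pass the rest through."""
    return [t.upper() if len(t) < max_len else t for t in tokens]


text = "qazwsx edcrfvgb yhnuj iklopqasde qwert"
tokens = text.split()  # whitespace split stands in for the standard tokenizer

# Same predicate as the request above: length < 5, so no token changes.
print(conditional_uppercase(tokens))

# Loosening the predicate to length < 6 uppercases the two 5-letter tokens.
print(conditional_uppercase(tokens, max_len=6))
```

To see the filter take effect in Elasticsearch, either add shorter words to the sample text or raise the threshold in the script.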

# Stop words
PUT test_index
{
	"settings":{
		"analysis":{
			"analyzer":{
				"my_analyzer":{
					"type":"standard",
					"stopwords":"_english_"
				}
			}
		}
	}
}

GET test_index/_analyze
{
	"analyzer":"my_analyzer",
	"text":" U and me is dating!"
}
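
As a rough picture of what the `_english_` setting does to this sentence, the Python sketch below lowercases the text, splits it into word tokens, and drops stop words. The word list is Lucene's default English stop set as best I recall it, so treat it as an assumption and check the Elasticsearch stop token filter documentation; the regex split is a simplification of the `standard` tokenizer, and `analyze` is a hypothetical helper name.

```python
import re

# Assumed contents of the "_english_" stop word list (Lucene's default set).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze(text, stopwords):
    """Lowercase, split into word tokens, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(analyze(" U and me is dating!", ENGLISH_STOP_WORDS))
```

Under these assumptions only "and" and "is" are removed, because "u" and "me" are not in the English list.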

PUT test_index
{
	"settings":{
		"analysis":{
			"analyzer":{
				"my_analyzer":{
					"type":"standard",
					"stopwords":["U","me"]
				}
			}
		}
	}
}

GET test_index/_analyze
{
	"analyzer":"my_analyzer",
	"text":"U and me is beautiful"
}

  1. Tokenizers
  • standard
  • ik_max_word
  1. Custom analyzers
  2. Chinese analyzers
  • ik_max_word
  • ik_smart
  • Custom Chinese word segmentation
  1. Hot updates
  • Hot updates from a remote dictionary
  • Hot updates backed by MySQL