ES synonym filter
To expand recall, an effective approach is to add synonyms. Adding synonyms widens the search scope, but it also introduces two problems:
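- With a term query, the original term should score higher than its synonyms.
# In the results, the original term and its synonyms carry the same weight
GET learning_test_03/_search
{
"_source": "post_title",
"explain": true,
"query": {
"term": {
"post_title.jieba_dic_all_synonym": {
"value": "视图"
}
}
}
}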
- match_phrase has the same problem: synonyms should score lower than the original term, but they do not.
GET learning_test_03/_search
{
"_source": "post_title",
"explain": true,
"query": {
"match_phrase": {
"post_title.jieba_dic_all_synonym": "插入流程图"
}
}
}
# result:
# "description" : """weight(post_title.jieba_dic_all_synonym:"插入 (流程图 visio 略图 视图)" in 20533) [PerFieldSimilarity], result of:"""
# As shown, 流程图 and its synonyms (visio, 略图, 视图) are treated identically. In a query, however, the original term should usually rank above the expanded synonyms.
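How synonyms interfere with scoring
Using an analyzer with a synonym filter:
The official documentation provides the synonym filter and shows an example of applying it at index time. After investigation, however, I concluded that an analyzer with a synonym filter is suited to search rather than index.
- Synonyms increase the number of terms in a field (which inflates the scoring parameter avgdl). More importantly, with a match query the matched termFreq rises to the number of synonyms, which distorts the score.
- If the synonyms change, every document that involves those synonyms has to be updated as well.
- When a query matches both the original term and its synonyms, the original term should usually score higher, but ES treats them all alike. You can define two fields, one with plain tokenization and one with synonyms, and give the plainly tokenized field a higher weight at search time, but that increases the storage needed for terms.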
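The two-field workaround in the last bullet can be sketched as a request body built in Python. This is only an illustration: the field names `title.plain` and `title.synonym` and the boost value 3.0 are hypothetical and do not appear in the demo below, which uses a single `title` field.

```python
def build_boosted_query(text, plain_field="title.plain",
                        synonym_field="title.synonym", plain_boost=3.0):
    """Build a bool query whose plainly tokenized field outweighs the synonym field.

    Field names and the boost value are assumptions made for this sketch.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact (non-synonym) matches get the higher weight.
                    {"match": {plain_field: {"query": text, "boost": plain_boost}}},
                    # Synonym matches still contribute, with the default boost.
                    {"match": {synonym_field: {"query": text}}},
                ]
            }
        }
    }


body = build_boosted_query("学校")
```

A document that contains the original term matches both clauses and collects both scores, while a document that only contains a synonym matches the second clause alone. The cost is the one already noted: every term is stored twice.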
A demo to illustrate:
Contents of the synonym file:
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生
PUT /test_synonym_1
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"jieba_synonym": {
"tokenizer": "jieba_search",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms/synonyms.txt"
}
}
}
}
}
}
PUT test_synonym_1/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer" :"jieba_synonym"
},
"content" :{
"type" :"text",
"analyzer" :"jieba_search"
}
}
}
POST test_synonym_1/_doc/1
{
"title" :"插入流程图时怎么编辑工作",
"content":"插入流程图时怎么编辑工作"
}
POST test_synonym_1/_doc/2
{
"title" :"怎么自定义功能区的学校",
"content":"怎么自定义功能区的学校"
}
POST test_synonym_1/_doc/3
{
"title" :"如何在表格中加医院",
"content":"如何在表格中加医院"
}
POST test_synonym_1/_doc/4
{
"title" :"首页怎么关闭?",
"content":"首页怎么关闭?"
}
POST test_synonym_1/_doc/5
{
"title" :"修改的关系图怎么做成一整个图",
"content":"修改的关系图怎么做成一整个图"
}
POST test_synonym_1/_doc/6
{
"title" :"在哪里给文档命名",
"content":"在哪里给文档命名"
}
GET test_synonym_1/_search
{
"explain": true,
"query": {
"match": {
"title": {
"query": "表格学生",
"analyzer": "jieba_synonym"
}
}
}
}
# result
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.7425265,
"hits" : [
{
"_shard" : "[test_synonym_1][0]",
"_node" : "EyNKn90XS1Otize_1yE7-w",
"_index" : "test_synonym_1",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.7425265,
"_source" : {
"title" : "怎么自定义功能区的学校",
"content" : "怎么自定义功能区的学校"
},
"_explanation" : {
"value" : 2.7425265,
"description" : "sum of:",
"details" : [
{
"value" : 2.7425265,
"description" : "weight(Synonym(title:学校 title:学生 title:操场 title:老师) in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 2.7425265,
"description" : "score(freq=4.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.5404451,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 6,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.80924857,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 4.0,
"description" : "termFreq=4.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 7.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[test_synonym_1][0]",
"_node" : "EyNKn90XS1Otize_1yE7-w",
"_index" : "test_synonym_1",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7443275,
"_source" : {
"title" : "如何在表格中加医院",
"content" : "如何在表格中加医院"
},
"_explanation" : {
"value" : 1.7443275,
"description" : "sum of:",
"details" : [
{
"value" : 1.7443275,
"description" : "weight(title:表格 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.7443275,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.5404451,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 6,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.5147059,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 7.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
]
}
}
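# The first hit scores higher, and termFreq=4.0 is exactly the number of synonyms. This is because the synonym filter was also applied at search time, expanding the original query.
# A term query does not have this problem, because a term query does not run the search term through an analyzer. But there is still no way to guarantee that an exact match on the original term scores higher than a match on one of its synonyms. For example, for the query 学校, the results may be ordered as (a document matching 学生, then a document about a 学校 with a large campus) rather than with the exact match first.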
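To make the effect of termFreq concrete, the two scores can be recomputed from the formulas printed in the explain output. The sketch below is plain Python arithmetic, not an Elasticsearch API. The function names are made up for illustration, and the boost of 2.2 is simply copied from the explain output.

```python
import math


def bm25_idf(n, N):
    """idf as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq, n, N, dl, avgdl, boost=2.2):
    """Product of boost, idf and tf, mirroring the 'product of' node in the explanation."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document 2: the synonym group matched with termFreq=4.0
score_doc2 = bm25_score(freq=4.0, n=1, N=6, dl=5.0, avgdl=7.0)
# Document 3: the single term 表格 matched with freq=1.0
score_doc3 = bm25_score(freq=1.0, n=1, N=6, dl=5.0, avgdl=7.0)

print(round(score_doc2, 4), round(score_doc3, 4))
```

Everything except freq is identical for the two documents, so the whole gap between 2.7425265 and 1.7443275 comes from the tf factor (0.80924857 versus 0.5147059), that is, from the synonym-inflated termFreq.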
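Thoughts on solving the problems above:
Do not use an analyzer with a synonym filter for index operations; use it for query operations.
# analyzer takes effect when data is indexed
# search_analyzer takes effect at request time; if it is not set, it defaults to analyzer
# note: the analyzer of an existing field cannot be changed in place, so apply this mapping to a newly created index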
PUT test_synonym_1/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer": "jieba_search",
"search_analyzer": "jieba_synonym"
},
"content" :{
"type" :"text",
"analyzer" :"jieba_search"
}
}
}
# A more flexible approach is to specify the analyzer in the match query itself
GET test_synonym_1/_search
{
"explain": true,
"query": {
"match": {
"title": {
"query": "表格学生",
"analyzer": "jieba_synonym"
}
}
}
}
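Why query-only expansion removes the inflated termFreq can be shown with a small simulation. It is a simplified model, not how Lucene is implemented: it assumes the synonym filter emits every member of a group, that the synonym query sums the frequencies of all expanded terms, and that document 2 is tokenized into the five tokens shown (consistent with dl=5 in the explain output, but not verified against jieba).

```python
from collections import Counter

# Synonym groups copied from the demo synonym file.
SYNONYM_GROUPS = [
    ["工作", "简历", "招聘", "入职"],
    ["学校", "老师", "学生", "操场"],
    ["医院", "护士", "医生"],
]
LOOKUP = {term: group for group in SYNONYM_GROUPS for term in group}


def expand(tokens):
    """Replace each token with its whole synonym group (tokens without a group pass through)."""
    expanded = []
    for token in tokens:
        expanded.extend(LOOKUP.get(token, [token]))
    return expanded


def summed_freq(query_token, indexed_tokens):
    """Expand the query token, then sum the occurrences of every expanded term in the index."""
    counts = Counter(indexed_tokens)
    return sum(counts[term] for term in expand([query_token]))


# Assumed tokenization of document 2, "怎么自定义功能区的学校".
doc_tokens = ["怎么", "自定义", "功能区", "的", "学校"]

# Synonyms applied at index time and at search time: all four group members are indexed.
freq_index_and_search = summed_freq("学生", expand(doc_tokens))
# Synonyms applied at search time only: the index holds just the original term.
freq_search_only = summed_freq("学生", doc_tokens)

print(freq_index_and_search, freq_search_only)
```

Under these assumptions the first value is 4, matching termFreq=4.0 in the result above, and the second is 1. The document is still found through the expanded query, but its frequency is no longer multiplied by the size of the synonym group.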
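Finally, if the synonym filter itself supports a remote dictionary, then updates to the remote dictionary take effect automatically at search time.
# Search with an analyzer assembled from a synonym filter that uses a remote dictionary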
PUT /test_synonym_1
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"jieba_synonym": {
"tokenizer": "jieba_search",
"filter": [
"remote_synonym"
]
}
},
"filter": {
"remote_synonym": {
"type": "dynamic_synonym",
"synonyms_path": "http://localhost:8080/synonym.txt",
"interval": "60"
}
}
}
}
}
}
PUT test_synonym_1/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer": "jieba_search",
"search_analyzer": "jieba_synonym"
},
"content" :{
"type" :"text",
"analyzer" :"jieba_search"
}
}
}