Thoughts on using the Elasticsearch synonym filter

ES synonym filter

To broaden recall, one effective approach is to add synonyms. Synonyms widen the search scope, but they also introduce two problems:

  1. With a term query, the original term should score higher than its synonyms

    # In the results, the original term and its synonyms turn out to carry the same weight
    GET learning_test_03/_search
    {
      "_source": "post_title", 
      "explain": true, 
      "query": {
        "term": {
          "post_title.jieba_dic_all_synonym": {
            "value": "视图"
          }
        }
      }
    }
    
  2. match_phrase has the same problem: synonyms should score lower than the original term, but they do not

GET learning_test_03/_search
{
  "_source": "post_title",
  "explain": true, 
  "query": {
      "match_phrase": {
          "post_title.jieba_dic_all_synonym": "插入流程图"
      }
  }
}

# result:
#  "description" : """weight(post_title.jieba_dic_all_synonym:"插入 (流程图 visio 略图 视图)" in 20533) [PerFieldSimilarity], result of:"""
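# As shown, 流程图 and its synonyms (visio, 略图, 视图) are treated identically. In a query, however, the original term should usually rank above the synonyms added by expansion.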


How synonyms interfere with scoring

Using an analyzer that contains a synonym filter:

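The official documentation provides the synonym filter and shows an example of applying it when indexing data. After some investigation and analysis, however, my conclusion is that an analyzer with a synonym filter is suited to search rather than to indexing: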

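  1. Synonyms increase the number of terms in a field (which raises the avgdl scoring parameter). More importantly, with a match query the matched termFreq grows to the number of synonyms, which distorts the score.
  2. If the synonyms change, every document that involves those synonyms has to be updated to stay in sync.
  3. When comparing a match on the original term with a match on one of its synonyms, the original term should usually score higher, but ES treats them identically. One workaround is to define two fields, one with plain tokenization and one with synonyms, and to give the plain-tokenization field a higher weight at search time. That, in turn, increases the storage needed for terms.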
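The two-field workaround from point 3 could look roughly like the request body built below. This is only a sketch: the sub-field names `title.exact` and `title.synonym` and the boost of 2.0 are assumptions made up for illustration, and the body is built as a plain Python dict so it can be pasted into any client.

```python
import json


def two_field_query(text, exact_field="title.exact",
                    synonym_field="title.synonym", exact_boost=2.0):
    """Build a multi_match body that weights the synonym-free field higher.

    Field names and the boost are hypothetical; adjust them to the real mapping.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                # most_fields adds up the scores of every field that matches
                "type": "most_fields",
                "fields": [f"{exact_field}^{exact_boost}", synonym_field],
            }
        }
    }


body = two_field_query("学校")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

A document that contains the literal query term matches both fields and collects the boosted score, while a document that only contains a synonym matches the synonym field alone. The cost is the one already noted: the same text is indexed twice.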

A demo to illustrate:

Contents of the synonym file:
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生
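To make the file format concrete, here is a small Python model of how such comma-separated lines behave when every line is treated as one equivalence group, so that any member expands to the whole group. It is a simplified sketch, not the filter's real implementation: explicit `=>` mappings are ignored, and the helper names are invented for this example.

```python
def parse_synonym_groups(text):
    """Return one list of terms per non-empty, non-comment line."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        groups.append([term.strip() for term in line.split(",") if term.strip()])
    return groups


def build_expansion(groups):
    """Map every term to the full set of terms in its group."""
    expansion = {}
    for group in groups:
        for term in group:
            expansion.setdefault(term, set()).update(group)
    return expansion


SYNONYMS = """\
工作,简历,招聘,入职
学校,老师,学生,操场
医院,护士,医生
"""

expansion = build_expansion(parse_synonym_groups(SYNONYMS))
# The query term 学生 expands to all four members of its group
print(sorted(expansion["学生"]))
```

This is why a query for 学生 later shows up in the explain output as a single synonym clause over four terms.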

PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms/synonyms.txt"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer" :"jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

POST test_synonym_1/_doc/1
{
  "title" :"插入流程图时怎么编辑工作",
  "content":"插入流程图时怎么编辑工作"
}

POST test_synonym_1/_doc/2
{
  "title" :"怎么自定义功能区的学校",
  "content":"怎么自定义功能区的学校"
}
POST test_synonym_1/_doc/3
{
  "title" :"如何在表格中加医院",
  "content":"如何在表格中加医院"
}
POST test_synonym_1/_doc/4
{
  "title" :"首页怎么关闭?",
  "content":"首页怎么关闭?"
}
POST test_synonym_1/_doc/5
{
  "title" :"修改的关系图怎么做成一整个图",
  "content":"修改的关系图怎么做成一整个图"
}
POST test_synonym_1/_doc/6
{
  "title" :"在哪里给文档命名",
  "content":"在哪里给文档命名"
}

GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}

# result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.7425265,
    "hits" : [
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.7425265,
        "_source" : {
          "title" : "怎么自定义功能区的学校",
          "content" : "怎么自定义功能区的学校"
        },
        "_explanation" : {
          "value" : 2.7425265,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 2.7425265,
              "description" : "weight(Synonym(title:学校 title:学生 title:操场 title:老师) in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 2.7425265,
                  "description" : "score(freq=4.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.80924857,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 4.0,
                          "description" : "termFreq=4.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[test_synonym_1][0]",
        "_node" : "EyNKn90XS1Otize_1yE7-w",
        "_index" : "test_synonym_1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7443275,
        "_source" : {
          "title" : "如何在表格中加医院",
          "content" : "如何在表格中加医院"
        },
        "_explanation" : {
          "value" : 1.7443275,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 1.7443275,
              "description" : "weight(title:表格 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.7443275,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5404451,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 6,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.5147059,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}


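# The first hit scores higher, and termFreq=4.0 is exactly the number of synonyms in the group. This is because the synonym filter was also applied at search time and expanded the original query.
# A term query does not have this problem, because a term query does not run the search text through an analyzer. Even so, nothing guarantees that an exact match on the original term scores higher than a match on one of its synonyms. For example, for the query 学校 the results may come back as ("一个学生", "一个拥有大量面积的学校"), with a synonym match ahead of the exact match.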
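The two scores can be recomputed by hand from the formulas printed in the explain output. The Python sketch below plugs in the values shown above (boost 2.2, n=1, N=6, dl=5, avgdl=7, k1=1.2, b=0.75); the function names are mine, and only the formulas and numbers come from the output.

```python
import math


def bm25_idf(n, N):
    # idf = log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(freq, n, N, dl, avgdl, boost=2.2):
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Hit 1: synonym clause, termFreq summed over four synonyms
synonym_hit = bm25_score(freq=4.0, n=1, N=6, dl=5.0, avgdl=7.0)
# Hit 2: plain term 表格, which occurs once
plain_hit = bm25_score(freq=1.0, n=1, N=6, dl=5.0, avgdl=7.0)

# Should agree with the _score values 2.7425265 and 1.7443275 up to float rounding
print(round(synonym_hit, 4), round(plain_hit, 4))
```

Everything except freq is identical for the two hits, so the whole gap between them comes from the tf factor rising from about 0.515 to about 0.809 when freq goes from 1 to 4.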

Thoughts on solving the problems above:

Do not use an analyzer with a synonym filter for indexing; use it for querying instead.

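# analyzer applies when documents are indexed
# search_analyzer applies at query time; if it is not set, it defaults to analyzer
# Note: the index-time analyzer of an existing field cannot be changed in place, so apply this mapping to a newly created index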
PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}

# A more flexible option is to specify the analyzer directly in the match query
GET test_synonym_1/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": {
        "query": "表格学生",
        "analyzer": "jieba_synonym"
      }
    }
  }
}
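Because the analyzer can be chosen per query, one possible way to let the original term outrank its synonyms without a second field is to send two match clauses against the same field: one analyzed without synonyms and boosted, one analyzed with synonyms. This is my own sketch rather than something tested in the demo above; the boost of 2.0 is an arbitrary assumption, and it presumes `title` was indexed with `jieba_search` only.

```python
import json


def exact_first_query(text, field="title", exact_analyzer="jieba_search",
                      synonym_analyzer="jieba_synonym", exact_boost=2.0):
    """Combine a boosted synonym-free match with a synonym-expanded match."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Matches only the literal query terms, with a higher weight
                    {"match": {field: {"query": text,
                                       "analyzer": exact_analyzer,
                                       "boost": exact_boost}}},
                    # Matches the query terms and their synonyms
                    {"match": {field: {"query": text,
                                       "analyzer": synonym_analyzer}}},
                ]
            }
        }
    }


body = exact_first_query("表格学生")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Documents that contain the literal term satisfy both clauses, while documents that only contain a synonym satisfy the second one, so exact matches should generally end up ahead. The right boost depends on the data and needs tuning.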

Finally, if the synonym filter itself supports a remote dictionary, then an update to the remote dictionary takes effect at search time automatically.

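# Search with an analyzer assembled from a synonym filter that reads a remote dictionary
# (the dynamic_synonym type comes from a third-party plugin, not from core Elasticsearch)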
PUT /test_synonym_1
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "jieba_synonym": {
            "tokenizer": "jieba_search",
            "filter": [
              "remote_synonym"
            ]
          }
        },
        "filter": {
          "remote_synonym": {
            "type": "dynamic_synonym",
            "synonyms_path": "http://localhost:8080/synonym.txt",
            "interval": "60"
          }
        }
      }
    }
  }
}

PUT test_synonym_1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "jieba_search",
      "search_analyzer": "jieba_synonym"
    },
    "content" :{
      "type" :"text",
      "analyzer" :"jieba_search" 
    }
  }
}