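On a recent project I needed to periodically update the weight of a suggestion field in an index according to hot-search term frequencies, so I wrote a script to do it.
The logic is simple: first fetch the hot search terms, then use update_by_query to update the matching items in the index. When I single-stepped through the code in a debugger, everything ran fine, but as soon as the script ran on its own it failed with the following error: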
elasticsearch.exceptions.ConflictError: ConflictError(409, u'{"took":1,"timed_out":false,"total":2,"updated":0,"deleted":0,"batches":1,"version_conflicts":2,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"index":"my_index_name","type":"my_index_name","id":"1946","cause":{"type":"version_conflict_engine_exception","reason":"[my_index_name][1946]: version conflict, current version [10] is different than the one provided [9]","index_uuid":"nrK2h7uxSNaI2EGsFH-RNQ","shard":"0","index":"my_index_name"},"status":409},{"index":"my_index_name","type":"my_index_name","id":"1981","cause":{"type":"version_conflict_engine_exception","reason":"[my_index_name][1981]: version conflict, current version [17] is different than the one provided [16]","index_uuid":"nrK2h7uxSNaI2EGsFH-RNQ","shard":"0","index":"my_index_name"},"status":409}]}')
Here is the business logic of my code:
for top_query in top_queries:
    query = top_query["key"]
    new_weight = top_query["doc_count"]
    # First pass: update the items matched by the prefix query
    prefix_body = self.fill_update_query_prefix_body(query, new_weight)
    update_prefix_res = self.es_client.update_by_query(index=index_name, body=prefix_body)
    updated_prefix_num = update_prefix_res["updated"]
    # Second pass: update the items matched by the exact-match query
    body = self.fill_update_query_body(query, new_weight)
    update_res = self.es_client.update_by_query(index=index_name, body=body)
    updated_num = update_res["updated"]
    print("updated %s of query: %s" % (updated_num + updated_prefix_num, query))
Each hot search term triggers two updates: one based on the results of a prefix query and one based on the results of an exact-match query.
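After some investigation, I found that the error occurs because the second update_by_query starts before the effects of the first one have fully settled in the index, so the same item is touched by two updates in quick succession and their versions conflict. I first took this to mean that update_by_query returns before the task finishes and keeps running in the Elasticsearch background. Note, however, that wait_for_completion defaults to true, so the call itself blocks until the update is done. The more likely lag is the index refresh: the versions written by the first update are not visible to search until the next refresh (every second by default), so the second update_by_query can read an item at its old version (say 9) and try to write it when the current version is already 10.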
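To make the failure mode concrete, here is a toy, in-memory model of a version check like the one in the error message. It is only a sketch under my own assumptions: ToyIndex, VersionConflict, put, read and write are made-up names, and none of this is Elasticsearch's actual implementation. It shows a writer that read an item at version 9 being rejected once another update has already moved that item to version 10.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version (hypothetical name)."""


class ToyIndex(object):
    """Toy in-memory store with per-document versions; NOT Elasticsearch."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=1):
        self._docs[doc_id] = (version, source)

    def read(self, doc_id):
        """Return (version, copy of source) as seen at read time."""
        version, source = self._docs[doc_id]
        return version, dict(source)

    def write(self, doc_id, source, expected_version):
        """Apply the write only if the doc is unchanged since it was read."""
        current_version, _ = self._docs[doc_id]
        if current_version != expected_version:
            raise VersionConflict(
                "[%s]: version conflict, current version [%d] is different "
                "than the one provided [%d]"
                % (doc_id, current_version, expected_version))
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


index = ToyIndex()
index.put("1946", {"weight": 1}, version=9)

# Both updates read the item while it is still at version 9.
v_first, doc_first = index.read("1946")
v_second, doc_second = index.read("1946")

# The first update succeeds and bumps the version to 10.
doc_first["weight"] = 5
new_version = index.write("1946", doc_first, expected_version=v_first)

# The second update still carries version 9 and is rejected.
doc_second["weight"] = 7
try:
    index.write("1946", doc_second, expected_version=v_second)
    conflict_message = None
except VersionConflict as exc:
    conflict_message = str(exc)
print(conflict_message)
```

The second write fails for the same reason as in the real error: the version it was based on is no longer the current one.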
Solutions:
1. Quick and dirty: add a suitable sleep between the first update and the second one, as follows:
import time

for top_query in top_queries:
    query = top_query["key"]
    new_weight = top_query["doc_count"]
    prefix_body = self.fill_update_query_prefix_body(query, new_weight)
    update_prefix_res = self.es_client.update_by_query(index=index_name, body=prefix_body)
    updated_prefix_num = update_prefix_res["updated"]
    # Pause so that the previous update has fully taken effect;
    # without it, the update_by_query below fails with a version conflict
    time.sleep(3)
    body = self.fill_update_query_body(query, new_weight)
    update_res = self.es_client.update_by_query(index=index_name, body=body)
    updated_num = update_res["updated"]
    print("updated %s of query: %s" % (updated_num + updated_prefix_num, query))
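2. Pass wait_for_completion=false when calling update_by_query. With this parameter, the request returns immediately with the id of the task that carries out the update. Before issuing the next query, you can use that id to check whether the task has completed, and only continue once it has.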
update_prefix_res = self.es_client.update_by_query(index=index_name, body=prefix_body, wait_for_completion=False)
This statement returns:
{
"task": "dXKIRFb1Txy-HA3YjRuBog:1442048"
}
You can then use that id to query the completion status of the task:
prefix_task_id = self.es_client.update_by_query(index=index_name, body=prefix_body, wait_for_completion=False)
# Instead of sleeping for a fixed time (time.sleep(3)), poll the task until it has completed
while True:
    status = self.es_client.tasks.get(task_id=prefix_task_id["task"])
    if status["completed"]:
        break
    time.sleep(0.1)  # short pause between polls so the loop does not hammer the cluster
updated_prefix_num = status["task"]["status"]["updated"]
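The polling loop can also be wrapped in a small helper so that a stuck task cannot block the script forever. The following is a minimal sketch, not code from the project: wait_for_task, poll_interval and timeout are names I made up, and it only assumes a client object exposing tasks.get(task_id=...) that returns a dict with a "completed" flag, as in the snippet above.

```python
import time


def wait_for_task(es_client, task_id, poll_interval=0.5, timeout=60.0):
    """Poll tasks.get until the task reports completed, then return the response.

    Raises RuntimeError if the task has not completed within `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        status = es_client.tasks.get(task_id=task_id)
        if status["completed"]:
            return status
        if time.time() >= deadline:
            raise RuntimeError("task %s did not complete within %s seconds"
                               % (task_id, timeout))
        time.sleep(poll_interval)
```

With this helper, the loop above would shrink to status = wait_for_task(self.es_client, prefix_task_id["task"]), followed by reading status["task"]["status"]["updated"] as before.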
In practice, the second approach turned out to be much faster than the first.
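As a further option that I have not tested in this project, update_by_query also takes the request parameters refresh and conflicts. The sketch below assumes a client whose update_by_query accepts them as keyword arguments, and update_twice is a hypothetical helper name. refresh=True asks Elasticsearch to refresh the affected shards when the request finishes, so that the next query sees the new versions; conflicts="proceed" makes the request count version conflicts instead of aborting on the first one.

```python
def update_twice(es_client, index_name, prefix_body, body):
    """Run the prefix update, then the exact-match update, without sleeping.

    Assumes es_client.update_by_query accepts `refresh` and `conflicts`
    keyword arguments (hypothetical helper, not the original project's code).
    """
    first = es_client.update_by_query(
        index=index_name, body=prefix_body,
        refresh=True,          # make this update visible to the next search
        conflicts="proceed")   # count conflicts instead of aborting
    second = es_client.update_by_query(
        index=index_name, body=body,
        refresh=True, conflicts="proceed")
    updated = first["updated"] + second["updated"]
    # Documents that hit a conflict are skipped, so report how many there were.
    skipped = (first.get("version_conflicts", 0)
               + second.get("version_conflicts", 0))
    return updated, skipped
```

Because conflicting documents are skipped rather than updated, the skipped count is worth logging if every item must receive the new weight.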