ES Matrix Query (Adjacency matrix aggregation)

Latest recommended article published on 2023-11-15 09:25:14

shen198623

Category: elasticsearch  Tags: elasticsearch

Article link: https://blog.csdn.net/shen198623/article/details/123289005

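Adjacency matrix aggregation

Definition

Builds a matrix from a set of named filters over a field's values and returns the number of documents that match a single filter, or two filters at the same time.

(Each cell of the matrix is the intersection of two groups.)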

	A组 (group A) ["a","b"]	B组 (group B) ["c","e"]	C组 (group C) ["d","f"]
A组 ["a","b"]	A组 & A组 = A组: ["a","b"]	A组 & B组: ["a","b"] & ["c","e"]	A组 & C组: ["a","b"] & ["d","f"]
B组 ["c","e"]		B组 & B组 = B组: ["c","e"]	B组 & C组: ["c","e"] & ["d","f"]
C组 ["d","f"]			C组 & C组 = C组: ["d","f"]

Test data

PUT emails/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["a", "b"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["a", "c"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["d", "c"]}

Test query

GET emails/_search
{
  "size": 0,
  "aggs": {
    "临接矩阵": {
      "adjacency_matrix": {
        "filters": {
          "A组": {"terms": {"accounts": ["a", "b"]}},
          "B组": {"terms": {"accounts": ["c", "e"]}},
          "C组": {"terms": {"accounts": ["d", "f"]}}
        }
      }
    }
  }
}

Test result

... (other fields omitted)
"aggregations" : {
  "临接矩阵" : {
    "buckets" : [
      {
        "key" : "A组",
        "doc_count" : 2
      },
      {
        "key" : "A组&B组",
        "doc_count" : 1
      },
      {
        "key" : "B组",
        "doc_count" : 2
      },
      {
        "key" : "B组&C组",
        "doc_count" : 1
      },
      {
        "key" : "C组",
        "doc_count" : 1
      }
    ]
  }
}
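
As a sanity check, the buckets above can be reproduced without Elasticsearch. The sketch below is plain Python, not the Elasticsearch implementation: it treats each terms filter as "the document's accounts array shares at least one value with the filter's list" and counts documents for every single filter and every pair of filters. The names docs, filters, matching_ids and adjacency_matrix are illustrative only.

```python
from itertools import combinations

# The three test documents indexed above (_id -> accounts array).
docs = {
    1: ["a", "b"],
    2: ["a", "c"],
    3: ["d", "c"],
}

# The three named filters from the query; each maps to its terms list.
filters = {
    "A组": ["a", "b"],
    "B组": ["c", "e"],
    "C组": ["d", "f"],
}


def matching_ids(terms):
    """IDs of documents whose accounts contain at least one of the terms."""
    return {doc_id for doc_id, accounts in docs.items() if set(accounts) & set(terms)}


def adjacency_matrix(filters, separator="&"):
    """Count documents per filter and per pair of filters, skipping empty buckets."""
    matched = {name: matching_ids(terms) for name, terms in filters.items()}
    buckets = {name: len(ids) for name, ids in matched.items() if ids}
    for left, right in combinations(sorted(matched), 2):
        both = matched[left] & matched[right]
        if both:
            buckets[f"{left}{separator}{right}"] = len(both)
    return buckets


print(adjacency_matrix(filters))
```

No document matches both A组 and C组, so that pair produces no bucket, which is why A组&C组 is absent from the test result.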

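Manual calculation

Following the matrix at the top of this article, each cell of the adjacency matrix can be computed by hand with an ordinary query.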

# Test A组 & A组 = A组
# Check the document count in hits.total.value
GET emails/_search
{
  "query": {
    "bool": {
      "filter": [
        {"terms": {"accounts": ["a", "b"]}}
      ]
    }
  },
  "size": 0
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Testing A组 & B组

# Test A组 & B组
# Check the document count in hits.total.value
GET emails/_search
{
  "query": {
    "bool": {
      "filter": [
        {"terms": {"accounts": ["a", "b"]}},
        {"terms": {"accounts": ["c", "e"]}}
      ]
    }
  },
  "size": 0
}

Test result

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

The remaining cells of the matrix can be verified in the same way.
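
To avoid writing each manual check by hand, the request bodies can be generated. This is a minimal sketch, assuming the same three groups as above; manual_query is a hypothetical helper, and each generated body would be sent with GET emails/_search exactly like the two examples above.

```python
import json
from itertools import combinations_with_replacement

groups = {
    "A组": ["a", "b"],
    "B组": ["c", "e"],
    "C组": ["d", "f"],
}


def manual_query(left, right):
    """Build the bool/filter body that checks one cell of the matrix."""
    names = [left] if left == right else [left, right]
    return {
        "query": {
            "bool": {
                "filter": [{"terms": {"accounts": groups[name]}} for name in names]
            }
        },
        "size": 0,
    }


# One body per cell of the upper triangle, diagonal included.
cells = {
    f"{left} & {right}": manual_query(left, right)
    for left, right in combinations_with_replacement(groups, 2)
}

for label, body in cells.items():
    print(label, json.dumps(body, ensure_ascii=False))
```

The hits.total.value of each response is then the count for that cell.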

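Summary and caveats

Here the adjacency matrix aggregation cross-compares and aggregates different values of the same field (in Elasticsearch, any field can hold an array of values). That is why the result set, like the matrix at the top of this article, contains buckets for single labels as well as for combinations of labels.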

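The filters section is required. Its contents are ordinary query DSL; each named filter defines one group of documents.
Keys of combined groups are joined with & by default (configurable through the separator parameter). The response is a flat list of buckets, so the result matrix has to be built by the caller.
If the labels do not overlap, only single-group buckets are returned and the result looks like that of an ordinary terms aggregation.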
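
Building the result matrix from the flat bucket list can be sketched as follows. This is an illustration in plain Python, assuming the default & separator and that no filter name itself contains &; buckets is copied from the test result above, and build_matrix is a hypothetical helper.

```python
buckets = [
    {"key": "A组", "doc_count": 2},
    {"key": "A组&B组", "doc_count": 1},
    {"key": "B组", "doc_count": 2},
    {"key": "B组&C组", "doc_count": 1},
    {"key": "C组", "doc_count": 1},
]


def build_matrix(buckets, names, separator="&"):
    """Turn the flat bucket list into a symmetric names x names matrix of counts."""
    matrix = {row: {col: 0 for col in names} for row in names}
    for bucket in buckets:
        parts = bucket["key"].split(separator)
        row, col = parts[0], parts[-1]  # a single-filter key fills the diagonal
        matrix[row][col] = bucket["doc_count"]
        matrix[col][row] = bucket["doc_count"]
    return matrix


names = ["A组", "B组", "C组"]
matrix = build_matrix(buckets, names)
for row in names:
    print(row, [matrix[row][col] for col in names])
```

Pairs that have no bucket in the response, such as A组 and C组, simply stay at 0.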

Use cases

The official documentation suggests combining it with date_histogram for dynamic network analysis.
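
A request of that shape might look like the sketch below. The emails test data above has no date field, so the field name date, the monthly interval, and the aggregation names per_month and interactions are assumptions made purely for illustration; calendar_interval also assumes a 7.x or later cluster. The body is written as a Python dict and serialized to JSON for use with GET emails/_search.

```python
import json

body = {
    "size": 0,
    "aggs": {
        "per_month": {
            # Assumed date field; the test documents above do not have one.
            "date_histogram": {"field": "date", "calendar_interval": "month"},
            "aggs": {
                "interactions": {
                    "adjacency_matrix": {
                        "filters": {
                            "A组": {"terms": {"accounts": ["a", "b"]}},
                            "B组": {"terms": {"accounts": ["c", "e"]}},
                            "C组": {"terms": {"accounts": ["d", "f"]}},
                        }
                    }
                }
            },
        }
    },
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```

Each date bucket would then carry its own adjacency matrix, which shows how the overlaps between groups change over time.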

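Other uses worth considering:

User or feature segmentation, e.g. how many users are aged 20 to 25, and how many of them are in Beijing or Shanghai
Root-cause analysis, e.g. how many servers belong to both groupA and groupB…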
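
The first idea could be expressed with one range filter and one terms filter. This is a sketch only: the index name users, the fields age and city, and the filter names are hypothetical, since this article does not define such an index.

```python
import json

body = {
    "size": 0,
    "aggs": {
        "segments": {
            "adjacency_matrix": {
                "filters": {
                    # Hypothetical numeric field holding the user's age.
                    "age_20_25": {"range": {"age": {"gte": 20, "lte": 25}}},
                    # Hypothetical keyword field holding the user's city.
                    "beijing_or_shanghai": {"terms": {"city": ["Beijing", "Shanghai"]}},
                }
            }
        }
    },
}

# Would be sent as: GET users/_search (hypothetical index)
print(json.dumps(body, indent=2))
```

The bucket that combines both filter names would hold the users who satisfy both conditions, while the two single-name buckets hold each condition on its own.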

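As I understand it, the difference from an ordinary terms aggregation is this: some labels are stored directly as arrays because that is convenient for storage and retrieval, and aggregating them with terms alone loses the relationship between the different labels that sit on the same document.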
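
The loss of that relationship can be illustrated on the three test documents. The sketch below is plain Python and only mimics the counting: per_value is what a per-value count such as terms yields, and per_pair is the co-occurrence information it does not carry. Both variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# The accounts arrays of the three test documents.
docs = [["a", "b"], ["a", "c"], ["d", "c"]]

# One count per value: how many documents contain it.
per_value = Counter(value for accounts in docs for value in set(accounts))

# One count per pair of values that appear on the same document.
per_pair = Counter(
    pair for accounts in docs for pair in combinations(sorted(set(accounts)), 2)
)

print(per_value)
print(per_pair)
```

From per_value alone there is no way to tell that a and c share a document while a and d do not.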

Reference: https://blog.csdn.net/weixin_40601534/article/details/122366515
