WINDOW 操作
利用WINDOW操作对相邻文档或基于文档属性相对值范围的值区间进行滑动窗口聚合,以计算累计总额、滚动平均值以及其他统计属性。
WINDOW操作主要用于对查询结果中相邻文档或者说是前后行进行聚合操作。它也可以基于文档属性的值或持续时间范围进行聚合。
该操作类似于COLLECT AGGREGATE操作对一组查询行进行处理。然而,与将多行查询结果合并成单个结果组的COLLECT操作不同,WINDOW操作为每行查询结果生成一个独立的结果:
- 函数评估发生的行被称为当前行。
- 与当前行相关的、函数在其上进行评估的所有查询行构成了当前行的窗口帧。
窗口帧相对于当前行的定义如下:
- 通过定义窗口帧为从查询开始到当前行的所有行,可以计算每行的累计总额。
- 通过定义窗口帧为当前行两侧延伸N行,可以计算滚动平均值。
语法
WINDOW 操作有两种用法
基于行(相邻文档):
WINDOW { preceding: numPrecedingRows, following: numFollowingRows } AGGREGATE variableName = aggregateExpression
基于范围(值或持续时间范围):
WINDOW rangeValue WITH { preceding: offsetPreceding, following: offsetFollowing } AGGREGATE variableName = aggregateExpression
聚合表达式中支持对以下函数的调用:
- LENGTH() / COUNT()
- MIN()
- MAX()
- SUM()
- AVERAGE() / AVG()
- STDDEV_POPULATION() / STDDEV()
- STDDEV_SAMPLE()
- VARIANCE_POPULATION() / VARIANCE()
- VARIANCE_SAMPLE()
- UNIQUE()
- SORTED_UNIQUE()
- COUNT_DISTINCT() / COUNT_UNIQUE()
- BIT_AND()
- BIT_OR()
- BIT_XOR()
基于行的聚合
WINDOW的第一种语法形式允许在当前行之后或之前的固定数量的行上进行聚合。也可以定义所有前面或后面的行都应该聚合(“unbounded”)。行数必须在查询编译时确定。
下面的查询演示了使用窗口框架来计算运行总计以及根据当前行及其前后的行计算的滚动平均值:
FOR t IN observations
SORT t.time
WINDOW { preceding: 1, following: 1 }
AGGREGATE rollingAverage = AVG(t.val), rollingSum = SUM(t.val)
WINDOW { preceding: "unbounded", following: 0}
AGGREGATE cumulativeSum = SUM(t.val)
RETURN {
time: t.time,
subject: t.subject,
val: t.val,
rollingAverage, // average of the window's values
rollingSum, // sum of the window's values
cumulativeSum // running total
}
[
{
"time" : "2021-05-25 07:00:00",
"subject" : "st113",
"val" : 10,
"rollingAverage" : 5,
"rollingSum" : 10, // 此处的10就是 10 + 0
"cumulativeSum" : 10
},
{
"time" : "2021-05-25 07:00:00",
"subject" : "xh458",
"val" : 0,
"rollingAverage" : 6.333333333333333,
"rollingSum" : 19, // 此处的19就是 10 + 0 + 9
"cumulativeSum" : 10
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "st113",
"val" : 9,
"rollingAverage" : 6.333333333333333,
"rollingSum" : 19,
"cumulativeSum" : 19
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "xh458",
"val" : 10,
"rollingAverage" : 14.666666666666666,
"rollingSum" : 44, // 此处的44就是 9 + 10 + 25
"cumulativeSum" : 29
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "st113",
"val" : 25,
"rollingAverage" : 13.333333333333334,
"rollingSum" : 40,
"cumulativeSum" : 54
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "xh458",
"val" : 5,
"rollingAverage" : 16.666666666666668,
"rollingSum" : 50,
"cumulativeSum" : 59
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "st113",
"val" : 20,
"rollingAverage" : 18.333333333333332,
"rollingSum" : 55,
"cumulativeSum" : 79
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "xh458",
"val" : 30,
"rollingAverage" : 25,
"rollingSum" : 75,
"cumulativeSum" : 109
},
{
"time" : "2021-05-25 08:00:00",
"subject" : "xh458",
"val" : 25,
"rollingAverage" : 27.5,
"rollingSum" : 55,
"cumulativeSum" : 134
}
]
行顺序由时间属性上的SORT操作控制。
第一个窗口操作聚合前一行、当前行和下一行(前一行和后一行设置为1),并计算这三个值的平均值和总和。在第一行的情况下,没有前一行而是后一行,因此将值10和0相加以计算总和,该总和除以2以计算平均值。对于第二行,将值10、0和9相加并除以3,依此类推。
第二个WINDOW操作聚合所有以前的值(“unbounded”)以计算一个运行的和。对于第一行,这仅仅是10,对于第二行,它是10+0,对于第三行,10+0+9,依此类推。
time | subject | val | rollingAverage | rollingSum | cumulativeSum |
---|---|---|---|---|---|
2021-05-25 07:00:00 | st113 | 10 | 5 | 10 | 10 |
2021-05-25 07:00:00 | xh458 | 0 | 6.333… | 19 | 10 |
2021-05-25 07:15:00 | st113 | 9 | 6.333… | 19 | 19 |
2021-05-25 07:15:00 | xh458 | 10 | 14.666… | 44 | 29 |
2021-05-25 07:30:00 | st113 | 25 | 13.333… | 40 | 54 |
2021-05-25 07:30:00 | xh458 | 5 | 16.666… | 50 | 59 |
2021-05-25 07:45:00 | st113 | 20 | 18.333… | 55 | 79 |
2021-05-25 07:45:00 | xh458 | 30 | 25 | 75 | 109 |
2021-05-25 08:00:00 | xh458 | 25 | 27.5 | 55 | 134 |
下面的查询演示了使用窗口框架来计算按时间排序的查询行的每个主题组内的运行总数,以及根据当前行及其前后的行计算的滚动总和和平均值,以及按主题组并按时间排序:
FOR t IN observations
COLLECT subject = t.subject INTO group = t
LET subquery = (FOR t2 IN group
SORT t2.time
WINDOW { preceding: 1, following: 1 }
AGGREGATE rollingAverage = AVG(t2.val), rollingSum = SUM(t2.val)
WINDOW { preceding: "unbounded", following: 0 }
AGGREGATE cumulativeSum = SUM(t2.val)
RETURN {
time: t2.time,
subject: t2.subject,
val: t2.val,
rollingAverage,
rollingSum,
cumulativeSum
}
)
// flatten subquery result
FOR t2 IN subquery
RETURN t2
[
{
"time" : "2021-05-25 07:00:00",
"subject" : "st113",
"val" : 10,
"rollingAverage" : 9.5,
"rollingSum" : 19,
"cumulativeSum" : 10
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "st113",
"val" : 9,
"rollingAverage" : 14.666666666666666,
"rollingSum" : 44,
"cumulativeSum" : 19
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "st113",
"val" : 25,
"rollingAverage" : 18,
"rollingSum" : 54,
"cumulativeSum" : 44
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "st113",
"val" : 20,
"rollingAverage" : 22.5,
"rollingSum" : 45,
"cumulativeSum" : 64
},
{
"time" : "2021-05-25 07:00:00",
"subject" : "xh458",
"val" : 0,
"rollingAverage" : 5,
"rollingSum" : 10,
"cumulativeSum" : 0
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "xh458",
"val" : 10,
"rollingAverage" : 5,
"rollingSum" : 15,
"cumulativeSum" : 10
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "xh458",
"val" : 5,
"rollingAverage" : 15,
"rollingSum" : 45,
"cumulativeSum" : 15
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "xh458",
"val" : 30,
"rollingAverage" : 20,
"rollingSum" : 60,
"cumulativeSum" : 45
},
{
"time" : "2021-05-25 08:00:00",
"subject" : "xh458",
"val" : 25,
"rollingAverage" : 27.5,
"rollingSum" : 55,
"cumulativeSum" : 70
}
]
基于范围的聚合
WINDOW的第二种语法形式允许对值范围内的所有文档进行聚合。偏移是当前文档中属性值的差异。
属性值必须是数字。偏移量计算是通过添加或减去在以下和前面的属性中指定的数字偏移量来执行的。偏移量必须是正数,并且必须在查询编译时确定。默认偏移量为0。
基于范围的窗口语法要求按行值对输入行进行排序。为了确保结果的正确性,AQL优化器将自动在WINDOW语句前面的查询中插入SORT语句。如果组标准上存在排序索引,优化器稍后可能能够优化掉SORT语句。
以下查询演示了使用窗口框架来计算当前文档以及t.val中属性值在[-10,+5](包括)范围内的文档的总计和平均值,包括前面和下面的文档:
FOR t IN observations
WINDOW t.val WITH { preceding: 10, following: 5 }
AGGREGATE rollingAverage = AVG(t.val), rollingSum = SUM(t.val)
RETURN {
time: t.time,
subject: t.subject,
val: t.val,
rollingAverage,
rollingSum
}
[
{
"time" : "2021-05-25 07:00:00",
"subject" : "xh458",
"val" : 0,
"rollingAverage" : 2.5,
"rollingSum" : 5
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "xh458",
"val" : 5,
"rollingAverage" : 6.8,
"rollingSum" : 34
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "st113",
"val" : 9,
"rollingAverage" : 6.8,
"rollingSum" : 34
},
{
"time" : "2021-05-25 07:00:00",
"subject" : "st113",
"val" : 10,
"rollingAverage" : 6.8,
"rollingSum" : 34
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "xh458",
"val" : 10,
"rollingAverage" : 6.8,
"rollingSum" : 34
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "st113",
"val" : 20,
"rollingAverage" : 18,
"rollingSum" : 90
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "st113",
"val" : 25,
"rollingAverage" : 25,
"rollingSum" : 100
},
{
"time" : "2021-05-25 08:00:00",
"subject" : "xh458",
"val" : 25,
"rollingAverage" : 25,
"rollingSum" : 100
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "xh458",
"val" : 30,
"rollingAverage" : 25,
"rollingSum" : 100
}
]
由于val为0,第一行的值范围为[-10,5],因此第一行和第二行的值加起来为5,平均值为2.5。最后一行的值范围是[20,35],因为val是30,这意味着最后四行的总和为100,平均值为25(该范围包括在内,即val落在值为20的范围内)。
time | subject | val | rollingAverage | rollingSum |
---|---|---|---|---|
2021-05-25 07:00:00 | xh458 | 0 | 2 | 5 |
2021-05-25 07:30:00 | xh458 | 5 | 6.8 | 34 |
2021-05-25 07:15:00 | st113 | 9 | 6.8 | 34 |
2021-05-25 07:00:00 | st113 | 10 | 6.8 | 34 |
2021-05-25 07:15:00 | xh458 | 10 | 6.8 | 34 |
2021-05-25 07:45:00 | st113 | 20 | 18 | 90 |
2021-05-25 07:30:00 | st113 | 25 | 25 | 100 |
2021-05-25 08:00:00 | xh458 | 25 | 25 | 100 |
2021-05-25 07:45:00 | xh458 | 30 | 25 | 100 |
基于持续时间的聚合
按时间间隔聚合是基于范围的聚合的一个子类型,它使用第二种语法形式的WINDOW,但具有ISO 标准时间。
为了支持时间序列数据上的窗口帧,窗口操作可以使用正的ISO 8601持续时间字符串来计算时间戳偏移,如P1Y6M(1年6个月)或PT12H30M(12小时30分钟)。另请参阅日期函数。与ISO 8601标准不同,周组件可以与其他组件自由组合。例如,P1WT1H和P1M1W都是有效的。小数仅支持秒数,分隔符后最多支持三位小数,即毫秒精度。例如,PT0.123S是有效的持续时间,而PT0.5H和PT0.1234S不是。
工期可以在下面和前面分别指定。如果使用这样的持续时间,则当前文档的属性值必须是一个数字,并被视为以毫秒为单位的数字时间戳。范围包括在内。如果未指定任一绑定,则将其视为空持续时间(即P0D)。
以下查询演示了如何使用窗口框架,根据从日期时间字符串转换为数字时间戳的文档属性时间,计算最后30分钟(包括30分钟)内观测值的滚动和和平均值:
FOR t IN observations
WINDOW DATE_TIMESTAMP(t.time) WITH { preceding: "PT30M" }
AGGREGATE rollingAverage = AVG(t.val), rollingSum = SUM(t.val)
RETURN {
time: t.time,
subject: t.subject,
val: t.val,
rollingAverage,
rollingSum
}
[
{
"time" : "2021-05-25 07:00:00",
"subject" : "st113",
"val" : 10,
"rollingAverage" : 5,
"rollingSum" : 10
},
{
"time" : "2021-05-25 07:00:00",
"subject" : "xh458",
"val" : 0,
"rollingAverage" : 5,
"rollingSum" : 10
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "st113",
"val" : 9,
"rollingAverage" : 7.25,
"rollingSum" : 29
},
{
"time" : "2021-05-25 07:15:00",
"subject" : "xh458",
"val" : 10,
"rollingAverage" : 7.25,
"rollingSum" : 29
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "st113",
"val" : 25,
"rollingAverage" : 9.833333333333334,
"rollingSum" : 59
},
{
"time" : "2021-05-25 07:30:00",
"subject" : "xh458",
"val" : 5,
"rollingAverage" : 9.833333333333334,
"rollingSum" : 59
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "st113",
"val" : 20,
"rollingAverage" : 16.5,
"rollingSum" : 99
},
{
"time" : "2021-05-25 07:45:00",
"subject" : "xh458",
"val" : 30,
"rollingAverage" : 16.5,
"rollingSum" : 99
},
{
"time" : "2021-05-25 08:00:00",
"subject" : "xh458",
"val" : 25,
"rollingAverage" : 21,
"rollingSum" : 105
}
]
时间为07:30:00,从07:00:00到07:30:0的所有内容都在前面的持续时间范围内:“PT30M”,因此将前六行合计为59,平均9.8333…。
time | subject | val | rollingAverage | rollingSum |
---|---|---|---|---|
2021-05-25 07:00:00 | st113 | 10 | 5 | 10 |
2021-05-25 07:00:00 | xh458 | 0 | 5 | 10 |
2021-05-25 07:15:00 | st113 | 9 | 7.25 | 29 |
2021-05-25 07:15:00 | xh458 | 10 | 7.25 | 29 |
2021-05-25 07:30:00 | st113 | 25 | 9.8333… | 59 |
2021-05-25 07:30:00 | xh458 | 5 | 9.8333… | 59 |
2021-05-25 07:45:00 | st113 | 20 | 16.5 | 99 |
2021-05-25 07:45:00 | xh458 | 30 | 16.5 | 99 |
2021-05-25 08:00:00 | xh458 | 25 | 21 | 105 |