SQL ——利用窗口函数的T-SQL解决方案

目录

1、数字辅助表

2、日期和时间值序列

3、序列键——更新列中的值为唯一值

4、分页

5、删除重复

6、数据透视

7、每组前N行

8、模式

9、统计总和

10、最大并发间隔

11、包装间隔

12、数据差距和数据岛

12.1、数据差距

12.2、数据岛

12.3、中位数

13、条件聚合

14、层次结构排序


说在前面

我所用的SQL版本为2017,而知识点基于SQL SERVER 2012说明,所以建议使用2012或以上版本测试。如果要完成下面案例的测试,请自行创建测试数据库(如TSQL2012)。有的解决方案不止此文所写,只是主要为了介绍窗口函数的优化解决方案而已。但是并不是说窗口函数的解决方案都是最优的(此篇所介绍的基本都是最优),而且限于篇幅,很多案例的更多解决方案并没有给出,有兴趣的也可以自行多研究。

此文涵盖的解决方案包含: Virtual Auxiliary Table of Numbers(虚拟数字辅動表)、 Sequences of Date and Time Values(日期和时间值序列)、 Sequences ofKeys(序列键)、 Paging(分页)、 Removing Duplicates(删除重复)、 Pivoting(数据透视)、 Top N Per Group(每组前N行)、Mode(模式)、 Running Totals(统计总和)、 Max Concurrent Intervals(最大并发间隔)、 Packing Intervals(包装间隔)、 Gaps and Islands(数据差距和数据岛)、 Median(中位数)、 ConditionalAggregate(条件聚合)和 Sorting Hierarchies(层次结构排序)。

如果对窗口函数的概念不是很熟悉,可以查看我之前的文章:SQL ——窗口函数简介 。

 

1、数字辅助表

数字辅助表是一个整数序列,可以用它来完成多种不同的查询任务。数字表有多种用途,如生成日期和实际值序列,以及分裂值表。

使用窗口函数实现的解决方案,以函数的形式完成,完整代码如下:

--创建一个函数,在要求的范围内产生一个整数序列

IF OBJECT_ID('dbo.GetNums','IF') IS NOT NULL
DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT ,@high AS BIGINT) RETURNS TABLE
AS
RETURN
    WITH
    L0 AS (SELECT C FROM (VALUES(1),(1)) AS D(c)),
    L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5)
SELECT @low +rownum -1 AS N
FROM Nums
ORDER BY rownum
OFFSET 0 ROWS FETCH FIRST @high -@low +1 ROWS ONLY;
GO

 

注意,直到SQL SERVER 2012才兼容了OFFSET/FETCH选项,如果要兼容之前的版本,需要使用TOP选项。

修改如下:

--创建一个函数,在要求的范围内产生一个整数序列
IF OBJECT_ID('dbo.GetNums','IF') IS NOT NULL
DROP FUNCTION dbo.GetNums;
GO

CREATE FUNCTION dbo.GetNums(@low AS BIGINT ,@high AS BIGINT) RETURNS TABLE
AS
RETURN
    WITH
    L0 AS (SELECT C FROM (VALUES(1),(1)) AS D(c)),
    L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum FROM L5)
SELECT TOP(@high-@low+1) @low +rownum -1 AS N
FROM Nums
ORDER BY rownum;
GO

 

在测试之前需要做些配置工作:

注意,如果要生成此函数,此操作可忽略!

测试结果如下:

 

2、日期和时间值序列

与数据操作相关的各种场景要求生成一个日期和时间序列,序列的范围是从输入值@start到@end,且有一定的时间间隔(如1天,12小时等)。这样的场景包括填充数据仓库的时间维度、应用程序的运行时间安排以及其他。使用上面生成的GetNums函数实现。

 

下面生成一个2012年2月1日到2012年02月12日之间的一个日期序列:

DECLARE
    @start AS DATE ='20120201',
    @end   AS DATE ='20120212';

SELECT DATEADD(day,n ,@start) AS dt
FROM dbo.GetNums(0,DATEDIFF(day,@start,@end)) AS Nums;

 

结果如下:

 

如果时间的间隔是12小时,可以参考如下案例:

DECLARE
    @start AS DATETIME ='20120201 00:00:00',
    @end   AS DATETIME ='20120212 12:00:00';

SELECT DATEADD(hour,n*12 ,@start) AS dt
FROM dbo.GetNums(0,DATEDIFF(hour,@start,@end)/12) AS Nums;

 

结果如下:

 

3、序列键——更新列中的值为唯一值

当需要更新表中的数据或向表中插入数据时,在很多情况下需要产生唯一的整数序列键。本节将要描述的场景涉及如何处理数据质量问题。首先创建测试用例表:

IF OBJECT_ID('MyOrders','U')  IS NOT NULL
    DROP TABLE MyOrders;
GO

SELECT 0 AS orderid,custid,empid,orderdate
INTO MyOrders
FROM dbo.Orders;

--查看生成的表及相关数据
SELECT * FROM MyOrders;

 

生成数据如下:

注意,dbo.Orders表的结构和数据可以通过下面链接得到。 链接:https://pan.baidu.com/s/1_bHE3C6H589PXy-QDWdoDA 
提取码:yi6i 

假设由于数据质量问题,MyOrders表的orderid列的值不是唯一的。我的任务就是将orderid的值修改为唯一的值。

--修改orderid为唯一值
WITH C AS
(
    SELECT orderid,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
    FROM dbo.MyOrders
)

UPDATE C
    SET orderid = rownum;

 

结果如下:

 

4、分页

在应用程序中经常用到分页(paging)。我们想让用户一次得到查询结果行的一部分,这样的结果可以更冗余大偶然目标网页、用户界面(UI)或屏幕中。ROW_NUMBER函数可用于此目的。根据所需的排序为结果行指定行号,然后根据给定的页码和页面大小参数筛选出正确范围内的行号。为了获得最佳性能,要定义窗口排序元素为索引键的索引,索引键还包括出现的查询中的其余列。

以表MyOrders为例,我们先给表创建索引:

--索引
CREATE UNIQUE INDEX idx_od_oid_i_cid_eid
    ON MyOrders(orderdate,orderid)
    INCLUDE(custid,empid);

 

具体实现如下:(获取第三页的25条记录,即51到75行)

--分页处理,具体可放到存储过程中,或者其他需要的地方
DECLARE
    @pagenum  AS INT =3,
    @pagesize AS INT=25;

WITH C AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY orderdate,orderid) AS rownum,
         orderid,orderdate,custid,empid
    FROM MyOrders
)

SELECT orderid,orderdate,custid,empid
FROM C
WHERE rownum BETWEEN (@pagenum-1)*@pagesize-1 AND @pagenum * @pagesize
ORDER BY rownum;

 

执行计划如下:

 

因为有索引支持ROW_NUMBER的计算,所以SQL SERVER 不需要扫描表中所有行 。相反,它只扫描索引中的前75行,然后筛选出行号为51到75的行。

在SQL SERVER 2005及更高版本中可以使用上述技术,如果是SQL SERVER 2012及以上版本还可以一个代替方案,即OFFSET/FETCH筛选选项。

--分页处理,具体可放到存储过程中,或者其他需要的地方  OFFSET/FETCH
DECLARE
    @pagenum  AS INT =3,
    @pagesize AS INT=25;

SELECT orderid,orderdate,custid,empid
FROM MyOrders
ORDER BY orderdate,orderid
OFFSET (@pagenum-1)*@pagesize-1 ROWS FETCH NEXT @pagenum * @pagesize ROWS ONLY;

 

其执行计划如下:

 

5、删除重复

删除重复数据是一种常见的需求,尤其是由于缺乏强制执行约束的唯一性,致使数据存在重复行。

首先处理样本数据:

--样本数据
IF OBJECT_ID('dbo.MyOrders') IS NOT NULL
    DROP TABLE dbo.MyOrders;
GO

SELECT * INTO dbo.MyOrders
FROM Orders
UNION ALL
SELECT * FROM Orders
UNION ALL
SELECT * FROM Orders;

 

假设需要删除重复数据,每个ORDERID值只保留唯一的一行。

--标记重复行
SELECT orderid,
    ROW_NUMBER() OVER(PARTITION BY orderid
         ORDER BY (SELECT NULL)) AS n
FROM dbo.MyOrders;

 

结果如下:

 

如果是少量的重复数据,可以移除所有行号大于1的行,如下:

--重复数据少的时候使用如下方法
WITH C AS
(
    SELECT orderid,
        ROW_NUMBER() OVER(PARTITION BY orderid
             ORDER BY (SELECT NULL)) AS n
    FROM dbo.MyOrders
)

DELETE FROM C
WHERE n>1;

 

但是如果需要删除大量的行,特别是当此数字代表的行在表中占很大比例时,完全记录的删除操作会很慢。这样可以考虑以下方法:

WITH C AS
(
    SELECT orderid,
        ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
        RANK() OVER(ORDER BY orderid) AS rnk
    FROM dbo.MyOrders
)

DELETE FROM C
WHERE rownum<>rnk;

 

上述方法只能在SQL SERVER 2012 及以上版本可用,且上述的解决方案不是唯一的。例如在有些时候可以使用TOP选项分批完成一个大型删除。但在这里我想重点介绍使用窗口函数的解决方案。

 

6、数据透视

透视是一种通过聚合和旋转把数据行转换成数据列的技术。当透视数据时,需要确定3个要素:要在行(分组元素)中看到的元素,要在列(扩展元素)上看到的元素,要在数据部分看到的元素(聚合元素)。

例如,假设我们需要查询OrderValues表,并未每个订单年返回一行,每个订单月为一列,年份和月份相交的每一个地方是订单金额总和。基于此请求,行或分组的元素是YEAR(orderdate);列或展开的元素是MONTH(orderdate);唯一扩展值是1,2到12;数据或聚合的元素是SUM(val)。

如下:

WITH C AS
(
    SELECT YEAR(orderdate) AS orderyear,MONTH(orderdate) AS ordermonth,val
    FROM OrderValues
)

SELECT *
FROM C  
    PIVOT(SUM(val)
     FOR ordermonth IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])) AS P;

注意,此节使用的表及数据可通过下面链接获取。链接:https://pan.baidu.com/s/1Zr1PHO1hfCSUxCR_04Yh7Q   提取码:vg27 

但有些情况下,扩展元素不存在数据源中,需要进行计算。例如为每一个客户返回最近5次订单的订单ID。我们希望在行上看到客户ID,在数据部分看到订单ID,但不同客户的订单ID之间没有共同点可以用来作为扩展元素。

解决方案是使用ROW_NUMBER函数在每个客户分区内为每个订单ID分配序号,根据所需的排序——这里是orderdate DESC、orderid DESC。然后代表该行的行号的列可以用作扩展元素,并作为扩展值计算的行号。

如下:

WITH C AS
(
    SELECT custid, val,
        ROW_NUMBER() OVER(PARTITION BY custid
                          ORDER BY orderdate DESC,orderid DESC) AS rownum
    FROM OrderValues
)

SELECT *
FROM C
    PIVOT(MAX(val) FOR rownum IN([1],[2],[3],[4],[5])) AS P;

 

结果如下:

 

如果我们需要把每个客户最近的5个订单ID连成一个字符串,可以使用CONCAT函数(SQL SERVER 2012或者以上版本提供),如下:

WITH C AS
(
    SELECT custid, CAST(orderid AS NVARCHAR(20)) AS sorderid,
        ROW_NUMBER() OVER(PARTITION BY custid
                          ORDER BY orderdate DESC,orderid DESC) AS rownum
    FROM OrderValues
)

SELECT custid,CONCAT([1],','+[2],','+[3],','+[4],','+[5]) AS orderids
FROM C
    PIVOT(MAX(sorderid) FOR rownum IN([1],[2],[3],[4],[5])) AS P;

 

结果如下:

 

CONCAT函数会自动用一个空字符串体会NULL。要在SQL SERVER 2012之前的版本中达到相同的目的,可以使用‘+’连接运算符和COALESCE函数来用空字符串替换NULL,如下:

WITH C AS
(
    SELECT custid, CAST(orderid AS NVARCHAR(20)) AS sorderid,
        ROW_NUMBER() OVER(PARTITION BY custid
                          ORDER BY orderdate DESC,orderid DESC) AS rownum
    FROM OrderValues
)

SELECT custid,
    [1]+COALESCE(','+[2],'')
       +COALESCE(','+[3],'')
       +COALESCE(','+[4],'')
       +COALESCE(','+[5],'') AS orderids
FROM C
    PIVOT(MAX(sorderid) FOR rownum IN([1],[2],[3],[4],[5])) AS P;

 

7、每组前N行

当需要从每个组或分区中更加某种指定拍下筛选出一定数量的行时,通常都会用到每组前N行查询。例如,查询MyOrders表,并为每个客户返回最近的3个订单。

首先我们需要根据POC(Partioning、Ordering、Covering)理念来创建索引:

--索引
CREATE UNIQUE INDEX idx_cid_odD_oidD_i_empid
    ON dbo.MyOrders(custid,orderdate DESC,orderid DESC)
    INCLUDE(empid);

 

有两种策略来完成该任务:一种是使用ROW_NUMBER函数;而另一种使用APPLY运算符和OFFSET/FETCH或TOP。哪一种策略更有效由分区列(这里是custid)的密度决定。低密度——意味着有大量不同的客户,每个客户的订单都很小——基于ROW_NUMBER函数的解决方案是最佳的。

如下:

WITH C AS
(
    SELECT custid,orderdate,orderid,empid,
         ROW_NUMBER() OVER(PARTITION BY custid
                           ORDER BY orderdate DESC,orderid DESC) AS rownum
    FROM MyOrders
)

SELECT *
FROM C
WHERE rownum<=3
ORDER BY custid,rownum;

 

执行计划如下:

 

而如果是高密度——少量不同客户,每个客户都有大量订单——最好使用APPLY运算符为每个客户调用带OFFSET/FETCH或TOP的查询。如下:

SELECT C.custid,A.*
FROM dbo.Customers  AS C
    CROSS APPLY (SELECT orderdate,orderid,empid
                 FROM MyOrders AS O
                 WHERE O.custid = C.custid
                 ORDER BY orderdate DESC,orderid DESC
                 OFFSET 0 ROWS FETCH FIRST 3 ROWS ONLY) AS A;

 

执行计划如下:

 

记住,OFFSET/FETCH是在SQL SERVER 2012 中引入的,早期版本用TOP代替,如下:

SELECT C.custid,A.*
FROM dbo.Customers  AS C
    CROSS APPLY (SELECT TOP(3) orderdate,orderid,empid
                 FROM MyOrders AS O
                 WHERE O.custid = C.custid
                 ORDER BY orderdate DESC,orderid DESC) AS A;

 

现在我们删除索引,如下:

DROP INDEX idx_cid_odD_oidD_i_empid ON dbo.MyOrders;

 

以上实现都是基于有POC索引的,如果不能创建此索引,那接下来的解决方案是最好选择。

如下:

--不使用索引的解决方案
WITH C AS
(
    SELECT custid,
         MAX(CONVERT(NVARCHAR(150),orderdate,112)
             + STR(orderid,10)
             +STR(empid,10) COLLATE Latin1_General_BIN2) AS mx
    FROM MyOrders
    GROUP BY custid
)

SELECT custid,
    CAST(SUBSTRING(mx,1,8) AS DATETIME) AS orderdate,
    CAST(SUBSTRING(mx,9,10) AS INT) AS custid,
    CAST(SUBSTRING(mx,19,10) AS INT) AS empid
FROM C;

 

这个查询不是很完美,但是它的执行计划在POC索引不存在时只是进行了一次数据扫描,这种解决方案往往比其他的解决方案更高效。请记住,如果系统允许创建POC索引,就不需要使用此解决方案;相反,应该根据分区列的密度使用两外两种策略之一。

 

8、模式

模式是返回全体中最频繁出现个例的统计计算。例如,Orders表中保存订单信息。每个订单都是由客户下单,并且都是被某个员工处理。对于每个客户,假设哪个员工处理了最多的订单,该员工就是模式,因为她对客户的订单来说出现得最频繁。

如果有多个员工都处理了给定客户的最多订单,则存在并列。可以根据需要返回所有并列的员工或打破平手。这两种情况的解决方案都将介绍。如果我们想打破平手,假设决胜者是雇员ID号是最高的——如果有多个员工对于给定客户处理最多的订单,就返回雇员里ID号最高的雇员。

首先,我们创建一个索引:

--索引
CREATE INDEX idx_custid_empid
    ON dbo.Orders(custid,empid);

 

下面将从ROW_NUMBER函数的解决方案开始。

如下:

WITH C AS
(
    SELECT custid,empid,COUNT(*) AS cnt,
         ROW_NUMBER() OVER(PARTITION BY custid
                           ORDER BY COUNT(*) DESC,empid DESC) AS rn
    FROM Orders
    GROUP BY custid,empid
)

SELECT custid,empid,cnt
FROM C
WHERE rn=1;

 

结果如下:

 

执行计划如下:

 

因为窗口排序说明包含empid DESC作为一个决胜点,最终结果只会为每个客户返回一行记录。如果不想打破平手,可以使用RANK函数,而不是ROW_NUMBER,并从窗口排序子句中删除empid,如下:

WITH C AS
(

    SELECT custid,empid,COUNT(*) AS cnt,
         RANK() OVER(PARTITION BY custid
                     ORDER BY COUNT(*) DESC) AS rn
    FROM Orders
    GROUP BY custid,empid
)

SELECT custid,empid,cnt
FROM C
WHERE rn=1;

 

结果如下:

 

执行计划如下:

 

记住,RANK函数对平手敏感,不像ROW_NUMBER函数。这也就是说,对于给定的相同排序值——这里是COUNT(*)——会得到相同的排名。因此每位客户具有最大计数的所有的行的排名都是1,这样的行都保存。

删除索引:

--删除索引
DROP INDEX idx_custid_empid ON Orders;

 

最后让我们来介绍第三个解决方案(和上面的每组前N行类似)。如下:

--无索引的解决方案
WITH C AS
(
    SELECT custid,
         STR(COUNT(*),10)+STR(empid,10) COLLATE Latin1_General_BIN2 AS cntemp
    FROM Orders
    GROUP BY custid,empid
)

SELECT custid,
    CAST(SUBSTRING(MAX(cntemp),11,10) AS INT) AS empid,
    CAST(SUBSTRING(MAX(cntemp),1,10) AS INT) AS cnt
FROM C
GROUP BY custid;

 

正如上节提到的,当有索引时,因为基于窗口的解决发难执行效率高,所以没有理由不适用。但当没有索引,后者往往会表现得更好。

 

9、统计总和

计算总和是很普遍的需求。基本思想是,基于在一列或多列中定义的排序(排序元素),以及在一列或多列定义的分区行内(分区元素),累加另一个列中的值(聚合元素)。在生活中有许多计算总和的例子,包括计算银行账户余额,跟踪仓库中产品的库存水平,跟踪累计销售额,如此等等。

在SQL SERVER 2012之前的版本中,用于计算总和的基于集合的解决方案都非常昂贵。因为人们往往使用迭代的解决方案,速度很慢,但对于某种数据分布情况,也有可能迭代的解决方案比基于集合的解决方案速度更快。随着SQL SERVER 2012对窗口函数的支持增强,可以用基于集合的简单代码来计算总和,这比所有旧的T-SQL的解决方案都更有效——也就是基于集合的迭代方案。

我们将使用银行账户余额来掩饰不同的解决方案。用以下代码创建Transactions表并填充数据。

--创建测试数据表
IF OBJECT_ID('dbo.Transactions','U') IS NOT NULL
    DROP TABLE dbo.Transactions;
GO

CREATE TABLE dbo.Transactions
(
    actid INT NOT NULL,
    tranid INT NOT NULL,
    val MONEY NOT NULL,
    CONSTRAINT PK_Transactions PRIMARY KEY(actid,tranid)
);
GO

 

使用下面的代码来填充数据:

-- 定义变量,添加指定量的测试数据
DECLARE
    @num_partitions AS INT =100,
    @rows_per_partitions AS INT =20000;

TRUNCATE TABLE dbo.Transactions;

INSERT INTO dbo.Transactions WITH(TABLOCK) (actid,tranid,val)
    SELECT NP.n,RPP.n,
        (ABS(CHECKSUM(NEWID())%2)*2-1) * (1+ABS(CHECKSUM(NEWID())%5))
    FROM dbo.GetNums(1,@num_partitions) AS NP
    CROSS JOIN dbo.GetNums(1,@rows_per_partitions) AS RPP;

 

表和数据都有了,首先我们来说下基于窗口的解决方案,如下:

SELECT actid,tranid,val,
    SUM(val) OVER(PARTITION BY actid
                  ORDER BY tranid
                  ROWS BETWEEN UNBOUNDED PRECEDING
                       AND CURRENT ROW) AS balance
FROM dbo.Transactions;

 

执行计划如下:

 

在SQL SERVER 2012之前的版本中,基于集合计算总和的传统解决方案要么使用子查询要么使用联接。先看下子查询的代码:

SELECT  actid,tranid,val,
    (SELECT SUM(T2.val)
     FROM dbo.Transactions AS T2
     WHERE T2.actid = T1.actid
     AND T2.tranid<=T1.tranid) AS balance
FROM dbo.Transactions AS T1;

 

执行计划如下:

 

而联接实现如下:

SELECT T1.actid,T1.tranid,T1.val,
    SUM(T2.val) AS balance
FROM dbo.Transactions AS T1
    JOIN dbo.Transactions AS T2
    ON T2.actid = T1.actid
    AND T2.tranid<=T1.tranid
GROUP BY T1.actid,T1.tranid,T1.val;

 

执行计划如下:

 

注:窗口函数我的笔记本跑了8s,子查询快两分钟没跑完,我放弃了~~~

注意,在最后介绍的两种情况下都对聚簇索引进行全表扫描得到实例T1。换句话说,这两种解决方案具有二次方扩展特性,但窗口函数的解决方案具有线性的扩展特性。

 

10、最大并发间隔

观察一系列的时间间隔,如会议、项目、电话等。有一个经典的问题就是最大并发间隔(maximun concurrent interval),用于计算有效的最大时间间隔。例如,我们有一张表SessionsTem,该表包含不同应用程序的用户会话的数据。我们的任务是编写代码计算每个应用程序中同时活跃的最大会话数。假设不需要考虑它们的并发性,则一个会话结束时,另一个开始。

首先我们需要准备这张表,并添加一些测试数据。

--创建表
IF OBJECT_ID('dbo.SessionsTem') IS NOT NULL
    DROP TABLE dbo.SessionsTem;
GO

CREATE TABLE dbo.SessionsTem
(
    keycol    INT         NOT NULL,
    app       VARCHAR(10) NOT NULL,
    usr       VARCHAR(10) NOT NULL,
    host      VARCHAR(10) NOT NULL,
    starttime DATETIME    NOT NULL,
    endtime   DATETIME    NOT NULL,
    CONSTRAINT PK_SessionsTem PRIMARY KEY(keycol),
    CHECK(endtime > starttime)
);
GO

--创建符合poc要求的索引
CREATE UNIQUE INDEX idx_nc_app_st_et
    ON dbo.SessionsTem(app,starttime,keycol) INCLUDE(endtime);

CREATE UNIQUE INDEX idx_nc_app_et_st
    ON dbo.SessionsTem(app,endtime,keycol) INCLUDE(starttime);
GO

 

接下来生成测试数据:

--插入数据
TRUNCATE TABLE dbo.SessionsTem;

DECLARE
    @numrows AS INT = 100000, -- TOATL NUMBER OF ROWS
    @numapps AS INT =10;      -- NUMBER OF APPLICATIONS   

INSERT INTO dbo.SessionsTem WITH(TABLOCK)
    (keycol,app,usr,host,starttime,endtime)
    SELECT
         ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS keycol,
         D.*,
         DATEADD(SECOND,1+ABS(CHECKSUM(NEWID()))%(20*60),starttime)AS endtime
    FROM
    (
         SELECT
             'app'+CAST(1+ABS(CHECKSUM(NEWID()))%@numapps AS VARCHAR(10)) AS app,
             'usr1' AS usr,
             'host1' AS host,
             DATEADD(SECOND,1+ABS(CHECKSUM(NEWID()))%(30*24*60*60),'20120101')AS starttime
         FROM dbo.GetNums(1,@numrows) AS Nums
    ) AS D;

 

先看下传统的基于集合的解决方案,完整代码如下:

WITH TimePoints AS
 (
    SELECT app,starttime as st
    FROM dbo.SessionsTem
 ),
 Counts AS
 (
    SELECT app,st,
         (
             SELECT COUNT(*)
             FROM dbo.SessionsTem AS S
             WHERE P.app = S.app
                  AND P.st>= S.starttime
                  AND P.st < S.endtime
         ) AS concurrent
    FROM TimePoints AS P
 )

 SELECT app,MAX(concurrent) AS mx
 FROM Counts
 GROUP BY app;

 

该解决方案看起来非常简单,而且不会立即表现性能问题。但是,当运行在大样本数据集中时,就需要很长的时间才能完成(以上面的数据量,我的笔记本执行时间是5s),因为它是按照二次方扩展的。要理解它为什么如此缓慢,检查查询执行计划如下:

 

接下来介绍两种基于窗口函数的解决方案——第一种仅适合SQL SERVER 2012及以上版本,因为它依赖新的窗口聚合技术;而第二种从SQL SERVER 2005开始的版本都可以用,它依赖于ROW_NUMBER函数。

使用窗口聚合函数的解决方案的初始化查询和一般原理与基于游标的解决方案是类似的——只是没有游标以及使用游标而带来的开销。下面是代码:

--窗口函数 2012以上
 WITH C1 AS
 (
    SELECT app,starttime AS ts,+1 AS type
    FROM dbo.SessionsTem

    UNION ALL

    SELECT app,endtime ,-1
    FROM dbo.SessionsTem
 ),
 C2 AS
 (
    SELECT *,
         SUM(type) OVER(PARTITION BY app
                          ORDER BY ts,type
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cnt
    FROM C1
 )

 SELECT app,MAX(cnt) AS mx
 FROM C2
 GROUP BY app;

 

观察这个解决方案是多么简单而优雅。它也是高效的并且具有线性扩展特性。执行计划如下:

 

下面是使用ROW_NUMBER函数的代码:

--窗口函数 2012之前兼容
  WITH C1 AS
 (
    SELECT app,starttime AS ts,+1 AS type,keycol,
         ROW_NUMBER() OVER(PARTITION BY app ORDER BY starttime,keycol) as start_ordinal
    FROM dbo.SessionsTem

    UNION ALL

    SELECT app,endtime ,-1,keycol,NULL
    FROM dbo.SessionsTem
 ),
 C2 AS
 (
    SELECT *,
         ROW_NUMBER() OVER(PARTITION BY app
                           ORDER BY ts,type,keycol) AS start_or_end_ordinal
    FROM C1
 )

 SELECT app,MAX(start_ordinal -(start_or_end_ordinal - start_ordinal)) AS mx
 FROM C2
 GROUP BY app;

 

执行计划如下:

 

在这个执行计划中我们可以看到两个ROW_NUMBER的计算——一个用于计算开始序号,另一个用于计算开始或结束序号——都依赖于索引排序。这同样适合用于聚合操作。因此,在这个计划中一个排序操作都不需要。

 

11、包装间隔

包装间隔(Packing interval)就是对每组连续的间隔进行分组,中间没有其他的间隔重叠或相邻(对接),并返回每组的最早开始时间和最迟结束时间。通常在SQL中,包装问题也涉及分区元素(例如,用户和应用程序),包装是在每个分区内独立完成。

为了演示,我们需要先创建Users和Sessions表,并填充样本数据来测试解决方案的有效性:

--创建表
 IF OBJECT_ID('dbo.Sessions') IS NOT NULL
    DROP TABLE dbo.Sessions;
GO

CREATE TABLE dbo.Sessions
(
    id          INT         NOT NULL IDENTITY(1,1),
    username    VARCHAR(14) NOT NULL,
    starttime   DATETIME    NOT NULL ,
    endtime     DATETIME    NOT NULL,
    CONSTRAINT PK_Sessions PRIMARY KEY(id),
    CONSTRAINT CHK_endtime_gteq_starttime CHECK(endtime>=starttime)
); 
GO

IF OBJECT_ID('dbo.Users') IS NOT NULL
    DROP TABLE dbo.Users;
GO

CREATE TABLE dbo.Users
(
    username VARCHAR(14) NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY(username)
);
GO

 

接下来添加数据:

--添加数据
DECLARE
    @num_users AS INT =2000,
    @intervals_per_user AS INT =2500,
    @start_period AS DATETIME ='20120101',
    @end_period AS DATETIME ='20120107',
    @max_duration_in_ms AS INT =3600000;   

--先清除数据
TRUNCATE TABLE dbo.Users;
TRUNCATE TABLE dbo.Sessions;

INSERT INTO dbo.Users(username)
    SELECT 'User'+RIGHT('000000000'+CAST(U.n AS VARCHAR(10)),10) AS username
    FROM dbo.GetNums(1,@num_users) AS U;

WITH C AS
(
    SELECT 'User'+RIGHT('000000000'+CAST(U.n AS VARCHAR(10)),10) AS username,
         DATEADD(ms,ABS(CHECKSUM(NEWID()))%86400000,
              DATEADD(DAY,ABS(CHECKSUM(NEWID()))%DATEDIFF(DAY,@start_period,@end_period),
              @start_period)) AS starttime
    FROM dbo.GetNums(1,@num_users) AS U
         CROSS JOIN dbo.GetNums(1,@intervals_per_user) AS I
)

INSERT INTO dbo.Sessions WITH(TABLOCK) (username,starttime,endtime)
    SELECT username,starttime,
         DATEADD(ms,ABS(CHECKSUM(NEWID()))%(@max_duration_in_ms+1),starttime) AS endtime
    FROM C;

 

此段代码为Sessions表填充5000000行数据。数据包含2000名用户,每个用户一周之内有2500个会话,每个会话长达一个小时。我们可以改变代码中任何想要更改的元素来测试解决方案的性能。

首先来介绍传统的基于集合的解决方案,为提高性能,我们先添加以下索引。

--索引  以提高性能
CREATE INDEX idx_user_start_end ON dbo.Sessions(username,starttime,endtime);
CREATE INDEX idx_user_end_start ON dbo.Sessions(username,endtime,starttime);

 

解决方案的完整代码如下:

--传统的基于集合的解决方案
WITH StartTimes AS
(
    SELECT DISTINCT username,starttime
    FROM dbo.Sessions AS S1
    WHERE NOT EXISTS
    (
         SELECT *
         FROM dbo.Sessions AS S2
         WHERE S2.username = s1.username
             AND S2.starttime<S1.starttime
             AND s2.endtime >=s1.endtime
    )
),
EndTimes AS
(
    SELECT DISTINCT username,endtime
    FROM dbo.Sessions AS S1
    WHERE NOT EXISTS
    (
         SELECT *
         FROM dbo.Sessions AS S2
         WHERE S2.username = s1.username
             AND S2.endtime<S1.endtime
             AND s2.starttime >=s1.starttime
    )
)

SELECT username,starttime,
    (SELECT MIN(endtime) FROM EndTimes AS E
    WHERE E.username= S.username
         AND endtime >= starttime) AS endtime
FROM StartTimes AS S;

 

看下其执行计划:

 

在有5000000行记录的Sessions上运行了好几个小时(原文所说,没有亲自尝试)。

在继续之前,我们先删除前面创建的索引:

--删除索引
DROP INDEX idx_user_start_end ON dbo.Sessions;
DROP INDEX idx_user_end_start ON dbo.Sessions;

 

接下来介绍两个比传统的解决方案速度更快的基于窗口函数的解决方案。创建以下索引以支持新的解决方案:

--窗口函数的  索引
CREATE INDEX idx_user_start_id ON dbo.Sessions(username,starttime,id);
CREATE INDEX idx_user_end_id ON dbo.Sessions(username,endtime,id);

 

前两个新策略是依靠ROW_NUMBER函数,所以在SQL SERVER 2005及以上版本都可以运行。解决方案代码如下:

--窗口函数   利用行号来打包间隔
WITH C1 AS
--let e = end ordinals,let s = start ordinals
(
    SELECT id,username,starttime AS ts, +1 AS type,NULL AS e,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY starttime,id) AS s
    FROM  dbo.Sessions   

    UNION ALL
 
    SELECT id,username,endtime AS ts, -1 AS type,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY endtime,id) AS e,
                          NULL AS s
    FROM  dbo.Sessions
),
C2 AS
-- let se = start or end ordinal, namely ,how many events ( start or end) happended so far
(
    SELECT C1.*,
         ROW_NUMBER() OVER(PARTITION BY username
                           ORDER BY ts,type DESC,id) AS se
    FROM C1
),
C3 AS
(
    SELECT username,ts,
         FLOOR((ROW_NUMBER() OVER(PARTITION BY username
                          ORDER BY ts)-1)/2+1) AS grpnum
    FROM  C2
    WHERE COALESCE(s-(se-s)-1,(se-s)-s) = 0
)

SELECT username,MIN(ts) AS starttime,MAX(ts) AS endtime
FROM C3
GROUP BY username,grpnum;

 

注意,一个包装的间隔总是在开始事件之前的活动会话的数量是零时开始,结束事件之后的活动会话的数量是零时结束。因此,对于每个开始事件,我们需要知道在它之前有多少活动会话,并且对于每个结束事件,我们需要知道在它之后有多少个活动会话。

其执行计划如下:

总体而言,这个计划只是对数据执行了两次扫描(每个索引一次),并按照索引的顺序扫描。该解决方案在我的笔记本上执行了6s。这个解决方案没有利用好的一个地方是并行性。就是下面介绍的这个解决方案的过人之处:

--窗口函数 ROW_NUMBER 解决并行性
IF OBJECT_ID('dbo.UserIntervals1') IS NOT NULL
    DROP FUNCTION dbo.UserIntervals1;
GO

CREATE FUNCTION dbo.UserIntervals1(@user AS VARCHAR(14)) RETURNS TABLE
AS
RETURN
    WITH C1 AS
    (
         SELECT id,starttime AS ts, +1 AS type,NULL AS e,
             ROW_NUMBER() OVER(PARTITION BY username
                               ORDER BY starttime,id) AS s
         FROM  dbo.Sessions
         WHERE username=@user  

         UNION ALL   

         SELECT id,endtime AS ts, -1 AS type,
             ROW_NUMBER() OVER(PARTITION BY username
                               ORDER BY endtime,id) AS e,
                               NULL AS s
         FROM  dbo.Sessions
         WHERE username=@user
    ),
    C2 AS
    (
         SELECT C1.*,
             ROW_NUMBER() OVER(ORDER BY ts,type DESC,id) AS se
         FROM C1
    ),
    C3 AS

    (
         SELECT ts,
             FLOOR((ROW_NUMBER() OVER(ORDER BY ts)-1)/2+1) AS grpnum
         FROM  C2
         WHERE COALESCE(s-(se-s)-1,(se-s)-s) = 0
    )

    SELECT MIN(ts) AS starttime,MAX(ts) AS endtime
    FROM C3
    GROUP BY grpnum;
GO

 

最后,利用CROSS APPLY运算符对Users表中的每个用户调用这个函数,如下所示:

--利用CROSS APPLY解决
SELECT U.username,A.starttime,A.endtime
FROM dbo.Users AS U
    CROSS APPLY dbo.UserIntervals1(U.username) AS A;

 

执行计划如下(该方案在我的笔记本上执行了2s):

 

第二个基于窗口函数的新解决方案。它是利用SUM窗口集合函数,并依靠SQL SERVER 2012 中引入的窗口规范中的元素。

完整的代码如下:

--SUM 函数解决方案
WITH C1 AS
(
    SELECT username,starttime AS ts, +1 AS type,1 AS sub
    FROM dbo.Sessions

    UNION ALL   

    SELECT username,endtime AS ts, -1 AS type,0 AS sub
    FROM dbo.Sessions
),
C2 AS
(
    SELECT C1.*,
         SUM(type) OVER(PARTITION BY username
                          ORDER BY ts,type DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING
                          AND CURRENT ROW) - sub AS cnt
    FROM C1
),
C3 AS
(
    SELECT username,ts,
         FLOOR((ROW_NUMBER() OVER(PARTITION BY username
                                  ORDER BY ts)-1)/2+1) AS grpnum
    FROM C2
    WHERE cnt=0
)

SELECT username,MIN(ts) AS starttime,MAX(ts) AS endtime
FROM C3
GROUP BY username,grpnum;

 

执行计划如下,而在我的笔记本上运行了13s:

 

类似于把针对单个用户的基于行号的解决方案的逻辑封装在内联表函数中,以及使用APPLY运算符对Users中的每个用户调用该函数,也可以对SUM窗口聚合采用同样的思路。下面是使用内联函数的代码:

-- 封装到函数
IF OBJECT_ID('dbo.UserIntervals2') IS NOT NULL
    DROP FUNCTION dbo.UserIntervals2;
GO

CREATE FUNCTION dbo.UserIntervals2(@user AS VARCHAR(14)) RETURNS TABLE
AS
RETURN
    WITH C1 AS
    (
         SELECT starttime AS ts, +1 AS type,1 AS sub
         FROM dbo.Sessions
         WHERE username = @user

         UNION ALL   

         SELECT endtime AS ts, -1 AS type,0 AS sub
         FROM dbo.Sessions
         WHERE username = @user
    ),
    C2 AS
    (
         SELECT C1.*,
             SUM(type) OVER(ORDER BY ts,type DESC
                            ROWS BETWEEN UNBOUNDED PRECEDING
                            AND CURRENT ROW) - sub AS cnt
         FROM C1
    ),
    C3 AS
    (
         SELECT ts,
             FLOOR((ROW_NUMBER() OVER(ORDER BY ts)-1)/2+1) AS grpnum
         FROM C2
         WHERE cnt=0
    )

    SELECT MIN(ts) AS starttime,MAX(ts) AS endtime
    FROM C3
    GROUP BY grpnum;
GO

 

查询代码如下:

--查询
SELECT U.username,A.starttime,A.endtime
FROM dbo.Users AS U
    CROSS APPLY dbo.UserIntervals2(U.username) AS A;

 

执行计划如下,并在我的笔记本上执行了7s:

 

12、数据差距和数据岛

数据差距(Gap)和数据岛(Island)是经典的SQL问题,在实践中用多种形式来体现它们。其基本概念是,我们有一些数字、日期或时间值序列,其中序列值之间应该是有固定的间隔,但有些序列值可能会丢失。那么数据差距问题就是识别序列中缺失值的所有范围,数据岛问题涉及识别现有值的所有范围。为了演示找出数据差距和数据岛的技术,下面我们使用名为T1的表,该表的col1列为一个间隔为整数1的数字序列;另一个表为T2,该表的col1列为日期和时间序列,间隔为1天。代码如下:

--dbo.T1(numeric sequence with unique values,interval:1)
IF OBJECT_ID('dbo.T1','U') IS NOT NULL
    DROP TABLE dbo.T1;
GO

CREATE TABLE dbo.T1
(
    col1 INT NOT NULL PRIMARY KEY
);
GO

INSERT INTO dbo.T1(col1)
    VALUES(2),(3),(7),(8),(11),(15),(16),(17),(28);

--dbo.T2(temporal sequence with unique values,interval: 1 day)
IF OBJECT_ID('dbo.T2','U') IS NOT NULL
    DROP TABLE dbo.T2;
GO

CREATE TABLE dbo.T2
(
    col1 DATE NOT NULL PRIMARY KEY
);
GO

INSERT INTO dbo.T2(col1) VALUES
    ('20120202'),
    ('20120203'),
    ('20120207'),
    ('20120208'),
    ('20120209'),
    ('20120211'),
    ('20120215'),
    ('20120216'),
    ('20120217'),
    ('20120228');

 

12.1、数据差距

在SQL SERVER 2012 之前的版本,用来处理数据差距的技术工作量很大,有时还很复杂。但是随着引入LAG和LEAD函数,我们现在可以简单高效的处理这方面的需求。

针对T1的解决方案:

--数据差距 T1
WITH C AS
(
    SELECT col1 AS cur, LEAD(col1) OVER(ORDER BY col1) AS nxt
    FROM dbo.T1
)

SELECT cur+1 AS rangestart,nxt-1 AS rangeend
FROM C
WHERE nxt-cur>1;

 

结果如下:

 

针对T2的解决方案:

--数据差距 T2
WITH C AS
(
    SELECT col1 AS cur, LEAD(col1) OVER(ORDER BY col1) AS nxt
    FROM dbo.T2
)

SELECT DATEADD(DAY,1,cur) AS rangestart, DATEADD(DAY,-1,nxt) AS rangeend
FROM C
WHERE DATEDIFF(DAY,cur,nxt)>1;

 

结果如下:

 

12.2、数据岛

针对T1的解决方案:

--数据岛  T1
WITH C AS
(
    SELECT col1,col1- DENSE_RANK() OVER(ORDER BY col1) AS grp
    FROM dbo.T1
)

SELECT MIN(col1) AS start_range,MAX(col1) AS end_range
FROM C
GROUP BY grp;

 

结果如下:

 

执行计划如下:

 

也许你会奇怪,为什么我们使用DENSE_RANK函数,而不是ROW_NUMBER函数。这是因为需要支持那些不能保证序列值是唯一的情况。对于ROW_NUMBER函数,该技术只对当序列值是唯一的情况下有效(这恰好是样本数据的情况),但有重复时,它会失败。使用DENSE_RANK函数,该技术对唯一和非唯一的值都有效。

下面来看下针对T2的解决方案:

--数据岛  T2
WITH C AS
(
    SELECT col1,DATEADD(DAY,-1* DENSE_RANK() OVER(ORDER BY col1),col1) AS grp
    FROM dbo.T2
)

SELECT MIN(col1) AS start_range,MAX(col1) AS end_range
FROM C
GROUP BY grp;

 

结果如下:

 

有些情况下,我们需要使用这些数据岛技术,其中包括可用性报告、活动周期,以及其他技术。我们甚至可以使用这些数据岛技术来处理一个涉及打包日期间隔的经典问题。下面先来创建日期间隔表:

--创建日期间隔表
IF OBJECT_ID('dbo.Intervals','U') IS NOT NULL
    DROP TABLE dbo.Intervals;
GO

CREATE TABLE dbo.Intervals
(
    id        INT  NOT NULL,
    starttime DATE NOT NULL,
    endtime   DATE NOT NULL
);
GO

INSERT INTO dbo.Intervals(id,starttime,endtime) VALUES
(1,'20120212','20120220'),
(1,'20120214','20120312'),
(1,'20120124','20120201');

 

解决方案如下:

--时间间隔 数据岛 解决方案
DECLARE
    @from AS DATE ='20120101',
    @to AS DATE = '20121231';

WITH Dates AS
(
    SELECT DATEADD(DAY,n-1,@from) AS dt
    FROM dbo.GetNums(1,DATEDIFF(DAY,@from,@to)+1) AS Nums
),
Groups AS
(
    SELECT D.dt,
         DATEADD(DAY,-1*DENSE_RANK() OVER(ORDER BY D.dt),D.dt) AS grp
    FROM dbo.Intervals AS I
         JOIN Dates AS D
         ON D.dt BETWEEN I.starttime AND I.endtime
)

SELECT MIN(dt) AS rangestart,MAX(dt) AS endrange
FROM Groups
GROUP BY grp;

 

结果如下:

 

请注意,此解决方案对于时间间隔跨度很长的一段时间效果不是很好。该解决方案需要解开各个周期所涉及的个别日期,这是可以理解的。

还有比基本版本更加复杂的数据岛问题的版本。例如,假设忽略达到一定规模的数据差距——例如,数字序列,假设忽略达到2的差距。解决方案如下:

--数字序列,忽略大于2的差距 T1 数据岛 解决方案如下
WITH C1 AS
(
    SELECT col1,
         CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <=2 THEN 0 ELSE 1 END AS isstart,
         CASE WHEN LEAD(col1) OVER(ORDER BY col1) - col1 <=2 THEN 0 ELSE 1 END AS isend
    FROM dbo.T1
),
C2 AS
(
    SELECT col1 AS rangestart, LEAD(col1,1-isend) OVER (ORDER BY col1) AS rangeend,isstart
    FROM C1
    WHERE isstart=1 OR isend =1
)

SELECT rangestart,rangeend
FROM C2
WHERE isstart =1;

 

结果如下:

 

现在来看最后一个版本的数据岛问题,这个涉及识别ID的范围,首先创建演示用表:

--创建表
IF OBJECT_ID('dbo.T3','U') IS NOT NULL
    DROP TABLE dbo.T3;
GO

CREATE TABLE dbo.T3
(
    id  INT         NOT NULL PRIMARY KEY,
    val VARCHAR(10) NOT NULL
);
GO

INSERT INTO dbo.T3(id,val) VALUES
(2,'a'),
(3,'a'),
(5,'a'),
(7,'b'),
(11,'b'),
(13,'a'),
(17,'a'),
(19,'a'),
(23,'c'),
(29,'c'),
(31,'a'),
(37,'a'),
(41,'a'),
(43,'a'),
(47,'c'),
(53,'c'),
(59,'c');

 

解决方案如下:

--解决方案
WITH C AS
(
    SELECT id,val,
         ROW_NUMBER() OVER(ORDER BY id) -
         ROW_NUMBER() OVER(ORDER BY val,id) AS grp
    FROM dbo.T3
)

SELECT MIN(id) AS mn,MAX(id) AS mx,val
FROM C
GROUP BY val ,grp
ORDER BY mn;

 

结果如下:

 

12.3、中位数

中位数——严格来说,代表整体的50%都小于该值。此节,对于样本,书中提到是 dbo.Scores表,它保存学生的考试成绩。首先我们来创建测试需要的表及数据:

--创建表和添加数据
IF OBJECT_ID('dbo.Scores') IS NOT NULL
	DROP TABLE dbo.Scores;
GO

CREATE TABLE dbo.Scores
(
	id         INT         NOT NULL PRIMARY KEY,
	testid     VARCHAR(14) NOT NULL,
	studentid  VARCHAR(14) NOT NULL,
	score      FLOAT       NOT NULL
);

INSERT INTO dbo.Scores(id,testid,studentid,score) VALUES
(1,'Test ABC','Student A',95),
(2,'Test ABC','Student B',80),
(3,'Test ABC','Student C',55),
(4,'Test ABC','Student D',55),
(5,'Test ABC','Student E',50),
(6,'Test ABC','Student F',80),
(7,'Test ABC','Student G',95),
(8,'Test ABC','Student H',65),
(9,'Test ABC','Student I',75),
(10,'Test XYZ','Student A',95),
(11,'Test XYZ','Student B',80),
(12,'Test XYZ','Student C',55),
(13,'Test XYZ','Student D',55),
(14,'Test XYZ','Student E',50),
(15,'Test XYZ','Student F',80),
(16,'Test XYZ','Student G',95),
(17,'Test XYZ','Student H',65),
(18,'Test XYZ','Student I',75),
(19,'Test XYZ','Student J',95);

以下提供了三个解决方案,有SQL SERVER 2012之前版本的,也有以上版本的。详细的完整代码如下:

--SQL SERVER 2012及以上版本
WITH C AS
(
    SELECT testid,
         ROW_NUMBER() OVER(PARTITION BY testid ORDER BY (SELECT NULL)) AS rownum,
         PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY score) OVER(PARTITION BY testid) AS median
    FROM Stats.Scores
)
SELECT testid,median
FROM C
WHERE rownum =1;

--SQL SERVER 2012之前版本的解决方案
WITH C AS
(
    SELECT testid,score,
         ROW_NUMBER() OVER(PARTITION BY score) AS pos,
         COUNT(*) OVER(PARTITION BY testid) AS cnt
    FROM Stats.Scores
)

SELECT testid,AVG(1.*score) AS median
FROM C
WHERE pos IN((cnt+1)/2,(cnt+2)/2)
GROUP BY testid;

--SQL SERVER 2012之前版本中另一个有趣的解决方案
WITH C AS
(
    SELECT testid,score,
         ROW_NUMBER() OVER(PARTITION BY testid,ORDER BY score,studentid) AS rna,
         ROW_NUMBER() OVER(PARTITION BY testid,ORDER BY score DESC,studentid DESC) AS rnd
    FROM Stats.Scores
)

SELECT testid,AVG(1.*score) AS median
FROM C
WHERE ABS(rna - rnd)<=1
GROUP BY testid;

 

13、条件聚合

我们的下一个任务涉及技术总和,它总是返回一个非负数。也就是说,如果总和在一个点上为负数,返回零来代替。然后,当我们移动到序列中的下一位时,要从0开始继续。下面是产生演示用的表和数据代码:

--创建表和数据
IF OBJECT_ID('dbo.T4') is not null
    DROP TABLE dbo.T4;
GO

CREATE TABLE dbo.T4
(
    ordcol  INT NOT NULL PRIMARY KEY,
    datacol INT NOT NULL
);


INSERT INTO dbo.T4 VALUES
(1,10),
(5,5),
(4,-15),
(6,-10),
(8,-15),
(10,20),
(17,10),
(18,-10),
(20,-30),
(31,20);

 

以下是提供的解决方案:

-- 解决方案
WITH C1 AS
(
    SELECT ordcol,datacol,
         SUM(datacol) OVER(ORDER BY ordcol
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS partsum
    FROM dbo.T4
),
C2 AS
(
    SELECT *,
         MIN(partsum) OVER (ORDER BY ordcol
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as adjust
    FROM C1
)

SELECT *,
    partsum - CASE WHEN adjust <0 THEN adjust ELSE 0 END AS nonnegativesum
FROM C2;

 

结果如下:

PS:介绍相对比较少,只能多看看代码,比较结果自行理解吧

 

14、层次结构排序

假设我们需要以排序的方式来呈现某种层次结构中的信息。假设需要把父母元素程序在儿子元素之前。此外,还需要能够控制兄弟姐妹之间的顺序。使用下面的代码产生样本数据,它创建并填充了一个名为dbo.Employees表,完整代码如下:

--创建数据表
IF OBJECT_ID('dbo.Employees') IS NOT NULL
    DROP TABLE dbo.Employees;
GO

CREATE TABLE dbo.Employees
(
    empid INT NOT NULL PRIMARY KEY,
    mgrid INT NULL REFERENCES dbo.Employees,
    empname VARCHAR(25) NOT NULL,
    salary MONEY NOT NULL,
    CHECK (empid<> mgrid)
);

INSERT INTO dbo.Employees(empid,mgrid,empname,salary) VALUES
(1,NULL,'David',$10000.00),
(2,1,'Eitan',$7000.00),
(3,1,'Ina',$7500.00),
(4,2,'Seraph',$5000.00),
(5,2,'Jiru',$5500.00),
(6,2,'Steve',$4500.00),
(7,3,'Aaron',$5000.00),
(8,5,'Lilach',$3500.00),
(9,7,'Rita',$3000.00),
(10,5,'Sean',$3000.00),
(11,7,'Gabriel',$3000.00),
(12,9,'Emilia',$2000.00),
(13,9,'Michael',$2000.00),
(14,9,'Didi',$1500.00);

 

解决方案如下:

--解决方案
WITH EmpsRN AS
(

    SELECT *,
         ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname,empid) AS n
    FROM dbo.Employees
),
EmpsPath
AS
(
    SELECT empid,empname,salary,0 AS lvl,
         CAST(0X AS VARBINARY(MAX)) AS sortpath
    FROM dbo.Employees
    WHERE mgrid IS NULL

    UNION ALL

    SELECT C.empid,C.empname,C.salary,P.lvl+1,P.sortpath+CAST(n AS BINARY(2))
    FROM EmpsPath AS P
         JOIN EmpsRN AS C
         ON C.mgrid = P.empid
)

SELECT empid,salary, REPLICATE('  |  ',lvl) +empname AS empname
FROM EmpsPath
ORDER BY sortpath;

 

结果如下:

 

如果需要平均的员工按不同方式进行排序——例如,按薪金,只需要更改ROW_NUMBER函数的窗口相应的排序子句:

--平级排序
WITH EmpsRN AS
(
    SELECT *,
         ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY salary,empid) AS n
    FROM dbo.Employees
),
EmpsPath
AS
(
    SELECT empid,empname,salary,0 AS lvl,
         CAST(0X AS VARBINARY(MAX)) AS sortpath
    FROM dbo.Employees
    WHERE mgrid IS NULL

    UNION ALL

    SELECT C.empid,C.empname,C.salary,P.lvl+1,P.sortpath+CAST(n AS BINARY(2))
    FROM EmpsPath AS P
         JOIN EmpsRN AS C
         ON C.mgrid = P.empid
)

SELECT empid,salary, REPLICATE('  |  ',lvl) +empname AS empname
FROM EmpsPath
ORDER BY sortpath;

 

结果如下:

 

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值