命令格式:
boolean cluster_sample(bigint x[, bigint y])
over(partition by col1[, col2..])
用途:
分组抽样参数说明:
x:Bigint类型常量,x>=1。若指定参数y,x表示将一个窗口分为x份;否则,x表示在一个窗口中抽取x行记录(即有x行返回值为true)。x为NULL时,返回值为NULL。
y:Bigint类型常量,y>=1,y<=x。表示从一个窗口分的x份中抽取y份记录(即y份记录返回值为true)。y为NULL时,返回值为NULL。
partition by col1[, col2]:指定开窗口的列。
返回值:Boolean类型。
示例,如表test_tbl中有key,value两列,key为分组字段,值有groupa,groupb两组,value为值,如下
+------------+--------------------+
| key | value |
+------------+--------------------+
| groupa | -1.34764165478145 |
| groupa | 0.740212609046718 |
| groupa | 0.167537127858695 |
| groupa | 0.630314566185241 |
| groupa | 0.0112401388646925 |
| groupa | 0.199165745875297 |
| groupa | -0.320543343353587 |
| groupa | -0.273930924365012 |
| groupa | 0.386177958942063 |
| groupa | -1.09209976687047 |
| groupb | -1.10847690938643 |
| groupb | -0.725703978381499 |
| groupb | 1.05064697475759 |
| groupb | 0.135751224393789 |
| groupb | 2.13313102040396 |
| groupb | -1.11828960785008 |
| groupb | -0.849235511508911 |
| groupb | 1.27913806620453 |
| groupb | -0.330817716670401 |
| groupb | -0.300156896191195 |
| groupb | 2.4704244205196 |
| groupb | -1.28051882084434 |
+------------+--------------------+
想要从每组中抽取约10%的值,可以用以下ODPS SQL完成:
select key, value
from (
select key, value, cluster_sample(10, 1) over(partition by key) as flag
from tbl
) sub
where flag = true;
+--------+--------------------+
| key | value |
+--------+--------------------+
| groupa | -1.34764165478145 |
| groupb | -0.725703978381499 |
| groupb | 2.4704244205196 |
+-----+-----------------------+
窗口函数cluster_sample
最新推荐文章于 2021-04-16 18:25:00 发布