场景介绍
现在要对目标关系型数据库的所有共T张表中的所有数据进行抽样,为了保证样本的分布均匀,采用分层随机抽样的方式。
把目标表的总体数据X,平均分为N份,并从每一份里面随机取出n条数据,所以最终的样本集x = nN
,并且里面不存在重复数据
其中已知:
-
存在百万,甚至千万级以上的数据表
-
存在主键字段不为
id
的情况 -
存在主键不连续的情况
抽样方式
直接RAND
select * from t1 order by RAND() limit 5000 + i*10000, 100000
优点:
1. 可以快速获取随机样本
缺点:
-
当数据量很大时,RAND的效率会直线下降,百万级时就已经很慢了
-
同样的,limit后面的offset越大,查询效率也会越慢
-
如果需要多次抽取,无法保证其中不存在重复数据
实例:
百万级平均耗时8.04秒
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 8.026s
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 7.981s
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 8.135s
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 8.034s
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 7.767s
SELECT * FROM `task_detail` order by rand() limit 900010, 100000
> OK
> 时间: 8.273s
千万级平均耗时91.82秒
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 91.811s
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 90.371s
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 91.49s
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 92.401s
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 92.909s
SELECT * FROM `test_table_thousand` order by rand() limit 9000010, 1000000
> OK
> 时间: 90.255s
基于子查询的RAND
select * from (
select * from t1 limit 5000 + i*10000, 100000
) a order by RAND()
优点:
-
可以保证多次抽样时,不存在重复的数据样本
-
由于先由limit划分出了子集,所以优化了
ORDER BY RAND()
在总体很大时的慢速问题
缺点:
- limit后面的offset越大,查询效率也会越慢
实例:
百万级平均耗时2.57秒
select * from (
SELECT * FROM `task_detail` limit 900010, 100000
) a order by rand() limit 10
> OK
> 时间: 2.604s
select * from (
SELECT * FROM `task_detail` limit 900010, 100000
) a order by rand() limit 10
> OK
> 时间: 2.889s
select * from (
SELECT * FROM `task_detail` limit 900010, 100000
) a order by rand() limit 10
> OK
> 时间: 2.397s
select * from (
SELECT * FROM `task_detail` limit 900010, 100000
) a order by rand() limit 10
> OK
> 时间: 2.46s
select * from (
SELECT * FROM `task_detail` limit 900010, 100000
) a order by rand() limit 10
> OK
> 时间: 2.481s
千万级平均耗时36.32秒
select * from (
SELECT * FROM `test_table_thousand` limit 9000010, 1000000
) a order by rand() limit 10
> OK
> 时间: 36.008s
select * from (
SELECT * FROM `test_table_thousand` limit 9000010, 1000000
) a order by rand() limit 10
> OK
> 时间: 35.711s
select * from (
SELECT * FROM `test_table_thousand` limit 9000010, 1000000
) a order by rand() limit 10
> OK
> 时间: 36.975s
select * from (
SELECT * FROM `test_table_thousand` limit 9000010, 1000000
) a order by rand() limit 10
> OK
> 时间: 37.33s
select * from (
SELECT * FROM `test_table_thousand` limit 9000010, 1000000
) a order by rand() limit 10
> OK
> 时间: 35.537s
取出数据,在代码中随机抽样
select * from t1 limit 5000 + i*10000, 100000
优点:
-
能够将
ORDER BY RAND()
的消耗移至代码系统 -
更加灵活
缺点:
- 当数据量过大时,对IO、网络、内存等会产生很大的开销
随机ID
select * from t1 where id in (id1, id2, id3, ... , id50)
通过在代码中随机生成ID来进行范围查询,但无法满足已知的(2)和(3)两个场景