Hive: Three Common Data Sampling Methods (Random / Block / Bucket)

韩家小志

Article link: https://blog.csdn.net/qq_46893497/article/details/110161920

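Three common data sampling methods

1. Random sampling (the rand() function)
- Method 1: order by combined with rand()
- Method 2: distribute by and sort by combined with rand()
2. Block sampling (the tablesample() clause)
3. Bucket sampling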

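1. Random sampling (the rand() function)

Method 1: order by combined with rand()

The limit keyword caps the number of sampled rows returned.
Example (note that order by performs a global sort, which is slow on large tables):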

select * 
from app.table_name 
order by rand() 
limit 100;

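Method 2: distribute by and sort by combined with rand()

The limit keyword caps the number of sampled rows returned.
Example: placing distribute by and sort by in front of rand() ensures the data is randomly distributed in both the mapper and reducer stages.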

select * 
from app.table_name
where datekey='2020-11-26' 
distribute by rand() sort by rand() 
limit 100;
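
A common refinement of Method 2, sketched here as an assumption rather than something taken from the example above, is to thin the rows with a rand() filter before shuffling, so that less data reaches the reducers. The 0.001 threshold is an illustrative value and should be chosen so that clearly more than 100 rows survive the filter.

```sql
-- Sketch: keep roughly 0.1% of the rows, then shuffle them and take 100.
-- The 0.001 threshold is a hypothetical value; tune it to the partition size.
select *
from app.table_name
where datekey = '2020-11-26'
  and rand() <= 0.001
distribute by rand() sort by rand()
limit 100;
```

If the filter lets through fewer than 100 rows, the query simply returns fewer rows, so a larger threshold is the safer choice on small partitions.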

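2. Block sampling (the tablesample() clause)

Percentage (percent)

Syntax: tablesample(n percent)
Purpose: samples a proportion of the Hive table based on data size (not row count), for example about 10% of the original table's data. Sampling works at block granularity, so the amount actually read can exceed n percent.
Example:
Adding a where clause may cause an error here, so it is best to leave it out.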

select * 
from dwd.hr_employee 
tablesample(10 percent);
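
Because the percentage applies to data size rather than to rows, one way to see what a given sample actually returns is to compare row counts. This is only a sketch that reuses the table from the example above; on a small table stored in a single block, the sample may well cover the whole table.

```sql
-- Rows returned by the 10 percent block sample
select count(*)
from dwd.hr_employee
tablesample(10 percent);

-- Rows in the full table, for comparison
select count(*)
from dwd.hr_employee;
```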

Size (M)

Syntax: tablesample(n M)
Purpose: specifies the size of the sampled data in megabytes.
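
No example is given for the size-based form, so here is a minimal sketch that reuses the table from the percent example. The value 100M is an arbitrary illustrative size, not a recommendation.

```sql
-- Sketch: sample about 100 MB of input from the table.
-- 100M is a hypothetical size chosen only for illustration.
select *
from dwd.hr_employee
tablesample(100M);
```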

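Row count (rows)

Syntax: tablesample(n rows)
Purpose: specifies the number of rows to sample, where n is the number of rows taken by each map task, so the total can reach n times the number of mappers. The mapper count can be read from the log of a simple query against the table (look for "number of mappers: x").
Example:
Without a where clause (took 374 ms):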

select * 
from dwd.hr_employee 
tablesample(5 rows) ;

name	gender
吴**	F
张**	F
孙**	M
林**	F
李**	M

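With a where clause (took 36 s). The result also shows that tablesample takes effect before the where condition: 5 rows are sampled first and then filtered, leaving only 3.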

select * 
from dwd.hr_employee 
tablesample(5 rows) 
where gender='F';

name	gender
吴**	F
张**	F
林**	F
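
If the goal is 5 rows drawn from the filtered data, rather than a filter applied to 5 sampled rows, one option is to filter in a subquery and sample the result with rand(), as in section 1. This is a sketch of that idea, not part of the timed test above; the alias t is arbitrary.

```sql
-- Sketch: filter first, then draw 5 random rows from the filtered set.
select *
from (
  select *
  from dwd.hr_employee
  where gender = 'F'
) t
distribute by rand() sort by rand()
limit 5;
```

This trades the speed of tablesample(n rows) for a sample that respects the filter, since every matching row has to be read and shuffled.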

3. Bucket sampling

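Bucketing in Hive means hashing a chosen column and taking the result modulo the number of buckets to decide which bucket a row goes into. For example, if table table_1 is split into 100 buckets by ID, the rule is hash(id) % 100: rows with hash(id) % 100 = 0 go into the first bucket, rows with hash(id) % 100 = 1 go into the second bucket, and so on. The key clause for creating a bucketed table is CLUSTERED BY ... INTO n BUCKETS.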
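
As a minimal sketch of the table described above, the DDL might look like the following. Only the table name, the id column, and the bucket count of 100 come from the description; the column types and the name column are assumptions added for illustration.

```sql
-- Hypothetical schema: only id and the 100 buckets come from the text above.
create table table_1 (
  id   bigint,
  name string
)
clustered by (id) into 100 buckets;
```

On older Hive releases (before 2.0), hive.enforce.bucketing typically has to be set to true before inserting, so that the data is actually written into 100 bucket files.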
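Bucket sampling syntax:

Syntax: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
Purpose: bucket sampling, where x is the number of the bucket to sample (bucket numbers start at 1), y is the number of buckets, and colname is the column to sample on. colname may also be rand(), which samples over the whole row instead of a single column.
Example: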

select * 
from table_01 
tablesample(bucket 1 out of 10 on rand());
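
Sampling on rand() works on any table but has to scan all of it. When the table is already bucketed on the sampled column, Hive can read only the matching bucket files. The sketch below assumes table_1 was created with clustered by (id) into 100 buckets, as in the description above.

```sql
-- Assumes table_1 is clustered by (id) into 100 buckets.
-- Reads a single bucket, about 1/100 of the table.
select *
from table_1
tablesample(bucket 1 out of 100 on id);

-- With y = 50, half the bucket count, two buckets should be read
-- (the 1st and the 51st), about 1/50 of the table.
select *
from table_1
tablesample(bucket 1 out of 50 on id);
```

Choosing y as a factor or a multiple of the table's bucket count is what lets the sample line up with whole buckets.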
