[Hive SQL] Data Exploration: Data Sampling

一年又半

Last modified: 2024-08-02 13:46:50

First published: 2024-07-26 17:03:48

Article link: https://blog.csdn.net/qq_34446614/article/details/140719909

Random data sampling

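In large-scale data analysis and modeling tasks, mining the full data set is often slow and ties up cluster resources, so in most cases it is enough to extract a small portion of the data for analysis and modeling. Some commonly used sampling methods are listed below.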

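1. Sampling by sorting on a random number (rand())

Combining order by with rand()
- Notes: limit caps the number of sampled rows; order by performs a global sort, which is slow.
- Example: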
```
select
    *
from
    table_name 
order by rand() 
limit 1000;
```
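Combining distribute by, sort by, and rand()
- Notes: limit caps the number of sampled rows; distribute by rand() spreads rows randomly across reducers and sort by rand() orders them randomly within each reducer, so the data is randomly distributed through the map and reduce stages.
- Example: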
```
select
    *
from
    table_name 
distribute by rand() 
sort by rand() 
limit 1000;
```
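
On a very large table, shuffling and sorting every row just to keep 1000 of them is still expensive. A commonly used variant, sketched here rather than taken from the original example, first thins the input with a rand() filter and then shuffles only the surviving rows. The 0.001 threshold is an assumption chosen for a table of roughly 10 million rows; it needs to be tuned so that clearly more than 1000 rows pass the filter.

```sql
-- Assumption: table_name holds roughly 10 million rows, so keeping about 0.1%
-- of them leaves around 10,000 candidates, well above the 1000 rows needed.
select
    *
from
    table_name
where
    rand() <= 0.001      -- cheap row-level pre-filter applied in the map stage
distribute by rand()     -- spread the remaining rows randomly across reducers
sort by rand()           -- random order within each reducer
limit 1000;
```

If the filter lets fewer than 1000 rows through, the query simply returns fewer rows, so the threshold should err on the generous side.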

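Combining row_number() with rand()

Notes: this approach can extract a percentage of the data for a specific business scenario. The row_number() window is partitioned according to the business requirement and ordered by rand(), so the rank values are random; count() over() gives the total number of rows in the window. A threshold on rank / total rows then selects the sample.

Example:

```
-- For each customer type and registration date, randomly sample 20% of the customers.
select
      t1.cust_id
     ,t1.nums
     ,t1.rnk
from 
    (
        select 
             cust_id
            ,count(cust_id) over(partition by cust_type,register_date) as nums
            ,row_number() over(partition by cust_type,register_date order by rand()) as  rnk
        from
            table_name
    ) t1
where
    t1.rnk/t1.nums <= 0.2;
```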
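
Because the threshold is applied to rank / group size, the number of rows kept per group is rounded down: a group of 7 rows keeps only rank 1 (1/7 ≈ 0.14 passes, 2/7 ≈ 0.29 does not), and a group with fewer than 5 rows contributes nothing. The following sketch is one way to check the ratio actually achieved in each group. It reuses the table and column names from the example above; the output aliases group_size, sampled_rows and sampled_ratio are hypothetical names introduced for illustration.

```sql
-- Sanity check: compare the sampled row count with the group size per partition.
select
     t1.cust_type
    ,t1.register_date
    ,max(t1.nums)            as group_size      -- identical for every row of a group
    ,count(*)                as sampled_rows
    ,count(*) / max(t1.nums) as sampled_ratio   -- should be at most 0.2
from
    (
        select
             cust_id
            ,cust_type
            ,register_date
            ,count(cust_id) over(partition by cust_type, register_date) as nums
            ,row_number() over(partition by cust_type, register_date order by rand()) as rnk
        from
            table_name
    ) t1
where
    t1.rnk / t1.nums <= 0.2
group by
     t1.cust_type
    ,t1.register_date;
```

Groups that are too small to yield a row do not appear in this result at all, which is worth keeping in mind when every group must be represented.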

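2. Block sampling (tablesample())

Purpose: extract data in proportion to the size of the Hive table's data, for example 10% of the original table. The proportion applies to data size rather than row count, so the number of rows returned is approximate.
Examples: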

```
-- tablesample(n percent): sample by percentage of the data size
-- Syntax: tablesample(n percent)
select 
    * 
from 
    table_name 
tablesample(10 percent);


--------------------------------------------------------
-- tablesample(nM): sample a given amount of data, here in megabytes
-- Syntax: tablesample(nM), with the unit written directly after the number
-- Samples by number of bytes
-- Supported unit suffixes: b/B, k/K, m/M, g/G
select 
    * 
from 
    table_name
tablesample(1M);


--------------------------------------------------------
-- tablesample(n rows): sample by row count; n rows are taken from each map task,
-- so the total depends on the number of mappers, which the log of a simple
-- query on the table reports (look for "number of mappers: x")
-- Syntax: tablesample(n rows)
select 
    * 
from 
    table_name 
tablesample(10 rows);
```
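
Block sampling picks whole chunks of the underlying files, so it helps to check how many rows a given setting really returns. The sketch below does that for the 10 percent case. It rests on two assumptions drawn from the Hive documentation rather than from this article: that block sampling relies on CombineHiveInputFormat, and that hive.sample.seednumber changes which blocks are picked. The seed value 7 is arbitrary.

```sql
-- Assumption: block sampling works with CombineHiveInputFormat.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Assumption: a different seed selects a different set of blocks; 7 is arbitrary.
set hive.sample.seednumber=7;

-- Count the rows the 10 percent block sample returns; compare the result
-- with a plain count(*) on table_name to see the effective ratio.
select
    count(*) as sampled_rows
from
    table_name
tablesample(10 percent);
```

On a small table that fits in a single block, the sample may cover far more than 10% of the rows, since a block is the smallest unit that can be picked.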

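3. Bucket sampling

Bucketing in Hive takes the hash of a column modulo the number of buckets and places each row in the corresponding bucket. For example, if table table_1 is split into 100 buckets on id, the rule is hash(id) % 100: rows with hash(id) % 100 = 0 go into the first bucket, rows with hash(id) % 100 = 1 go into the second, and so on. A bucketed table is created with the CLUSTERED BY ... INTO n BUCKETS clause of CREATE TABLE (not CLUSTER BY, which is a query clause).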
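
As a minimal sketch of such a definition, the statement below creates table_1 with 100 buckets on id. The name column and the ORC storage format are assumptions added only to make the statement complete.

```sql
-- table_1 bucketed on id into 100 buckets, matching the description above.
-- Assumption: the name column and "stored as orc" are illustrative choices.
create table table_1 (
    id   bigint,
    name string
)
clustered by (id) into 100 buckets
stored as orc;
```

Rows then need to be loaded through Hive (for example with insert ... select) so that they are actually written into the right bucket files.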

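Syntax: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
Notes: x is the number of the bucket to sample, counted from 1 up to y; colname is the column to sample on (rand() can be used instead to sample over whole rows); y is the number of buckets the data is divided into for sampling, so each query returns roughly 1/y of the rows.
Example: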

```
-- Example 1
select 
    * 
from 
    table_name 
tablesample(bucket 1 out of 10 on rand());

-- Example 2
-- Sampling is more efficient when the sampled column is the same as the CLUSTERED BY column (the bucketing column).
select 
    name
from 
    employee
tablesample(bucket 1 out of 2 on emp_id) a;
```
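
The value of y does not have to equal the number of buckets the table was created with. The sketch below assumes table_1 is bucketed on id into 100 buckets, as in the description earlier in this section, and follows the rule given in the Hive documentation for how sampling buckets map onto the table's stored buckets.

```sql
-- y = 50: each sampling bucket spans 100 / 50 = 2 stored buckets,
-- so this reads the 3rd and the 53rd bucket, about 1/50 of the rows.
select
    *
from
    table_1
tablesample(bucket 3 out of 50 on id);

-- y = 200: each sampling bucket covers 100 / 200 = half of a stored bucket,
-- so this returns about half of the 3rd bucket, roughly 1/200 of the rows.
select
    *
from
    table_1
tablesample(bucket 3 out of 200 on id);
```

Choosing y as a divisor or a multiple of the table's bucket count keeps this mapping simple and lets Hive restrict the scan to the relevant bucket files.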