Bucketing in Hive

杨大大慌

Article link: https://blog.csdn.net/e3hhhh/article/details/100717722

1. Overview of bucketing

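Why bucket
- Partitioning can leave some partitions holding far too much data and others very little. Bucketing is another technique for breaking a data set into several parts (data files).
- Partitioning and bucketing both manage data at a finer granularity. When the data in a single partition or table keeps growing and partitioning can no longer divide it finely enough, bucketing is used to divide and manage the data at a finer granularity.
- Clause in the create table statement: [CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]
- Example of sorting within each bucket: sorted by (uid desc)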
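How bucketing works
- It works the same way as HashPartitioner in MapReduce
  - MapReduce: takes the hash of the key modulo the number of reducers
  - Hive: takes the hash of the bucketing column modulo the number of buckets. Bucketed storage is defined on a column: for every record, the hash of the bucketing column's value modulo the number of buckets determines which bucket the record goes into.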
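The hash-modulo rule above can be previewed with a query. This is a minimal sketch, assuming a table named student with an int column sno and a target of 4 buckets; hash() and pmod() are Hive built-in functions. It only approximates the assignment: newer Hive versions may use a different hash function for bucketed tables, so treat the result as illustrative rather than as the exact file each row lands in.

```sql
-- Sketch: show the 0-based bucket index each row would map to
-- under a plain "hash modulo bucket count" rule with 4 buckets.
select sno,
       hash(sno)          as hash_value,   -- hash of the bucketing column
       pmod(hash(sno), 4) as bucket_index  -- non-negative remainder, 0..3
from student;
```

Rows that share a bucket_index would be written to the same bucket file under this rule.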
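Why bucketing matters
1. It stores the data in a bucket structure that bucket-based queries can rely on (the data has already been hash-distributed on the bucketing column)
2. Bucketed tables are well suited to data sampling
  
  Sampling becomes more efficient. When working with large data sets, it is very convenient to test a query on just a portion of the data.
3. It can make joins run more efficiently on MR
  
  Two tables that are bucketed on the same column can be joined efficiently on the map side.
  Take a join as an example: if two tables share a column and both are bucketed on it, Hive only has to join the buckets that hold the same column values, which is very efficient.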
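A join of two bucketed tables might look like the sketch below. The score table and its columns are hypothetical and exist only for illustration; student is assumed to be clustered by (sno) into 4 buckets. hive.optimize.bucketmapjoin is a real Hive property, but whether the optimizer actually picks a bucket map join depends on the Hive version, on map-join hint handling, and on other settings.

```sql
-- Hypothetical second table, bucketed on the same column as student.
create table score(
sno int,
course string,
grade int
)
clustered by (sno) into 4 buckets
row format delimited
fields terminated by ',';

-- Ask the optimizer to consider a bucket map join.
set hive.optimize.bucketmapjoin=true;

-- The MAPJOIN hint marks score as the table to load bucket by bucket on the map side.
select /*+ MAPJOIN(c) */ s.sno, s.name, c.course, c.grade
from student s
join score c on s.sno = c.sno;
```

Running explain on the select is one way to check which join strategy was chosen.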

2. Bucketing operations

Creating a bucketed table and loading data
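1. The wrong way:
  - Create-table statement: the statement itself is correct
    
    create table student(
    sno int,
    name string,
    sex string,
    age int,
    academy string
    )
    clustered by (sno) into 4 buckets -- specifies the bucketing column only, no sort column
    row format delimited
    fields terminated by ','
    ;
  - Loading the data: this method is wrong. load simply copies the file into the table directory, so the rows are not distributed into buckets.
    
    load data local inpath './data/students.txt' into table student;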
2. The correct way:
  1. Create-table statement: the statement is correct
    
    create table student(
    sno int,
    name string,
    sex string,
    age int,
    academy string
    )
    clustered by (sno) sorted by (age desc) into 4 buckets -- the bucketing column and the sort column may differ
    row format delimited
    fields terminated by ','
    ;
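  2. Loading the data: two steps
    - Step 1: create a temporary table and load the file into it
      
      -- a plain staging table is enough: load only copies the file, so no clustered by is needed here
      create table temp_student(
      sno int,
      name string,
      sex string,
      age int,
      academy string
      )
      row format delimited
      fields terminated by ','
      ;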
```
load data local inpath './data/students.txt' into table temp_student;
```
    - Step 2: query the temporary table and insert the rows into the bucketed table
      
      insert overwrite table student -- or: insert into table student
      select * from temp_student
      distribute by sno
      sort by age desc
      ;
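3. Tips:
  - The number of reducers needs to match the number of buckets:
    
    set mapreduce.job.reduces=4;
  - If the data volume is small, MR local mode can be enabled so that small jobs run locally:
    
    set hive.exec.mode.local.auto=true;
  - Enforce bucketing (a common setting):
    
    set hive.enforce.bucketing=true; -- the default is false in Hive 1.x; the property was removed in Hive 2.0, where bucketing is always enforced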
```
-- Test
insert overwrite table student
select * from temp_student
distribute by sno
sort by age desc
;
```
  - Enforce sorting (a common setting):
    
    set hive.enforce.sorting=true;
```
-- Test:
insert overwrite table student
select * from temp_student
distribute by sno
sort by age desc
;
```
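After the insert has run, it is worth checking that the table really is bucketed. The sketch below assumes the student table with 4 buckets from this post; the dfs path assumes the default warehouse directory and the default database, so adjust it to your environment.

```sql
-- Table metadata: the output should include the bucket count,
-- the bucketing column, and the sort column.
describe formatted student;

-- List the table directory from inside the Hive CLI.
-- The path is an assumption (default warehouse location, default database).
dfs -ls /user/hive/warehouse/student;

-- Count the rows that fall into the first of the 4 buckets.
select count(*) from student tablesample(bucket 1 out of 4 on sno);
```

If bucketing took effect, the directory listing is expected to show one data file per bucket, that is, four files for this table.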
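Querying a bucketed table
1. Syntax:
  
  Syntax: tablesample(bucket x out of y on sno)
  x: the bucket to start sampling from; x cannot be greater than y
  y: the number of buckets the data is divided into for sampling; y can be a factor or a multiple of the table's bucket count
2. Query everything
  
  select * from student;
  select * from student tablesample(bucket 1 out of 1);
3. Query specific buckets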
```
  -- Query the first bucket
  select * from student tablesample(bucket 1 out of 4 on sno);
```
  Query the first and third buckets
  select * from student tablesample(bucket 1 out of 2 on sno);
  Query the second and fourth buckets
  select * from student tablesample(bucket 2 out of 2 on sno);
  Query the first bucket when the hash is taken modulo 8:
  select * from student tablesample(bucket 1 out of 8 on sno);
4. Other queries
  
  Query three rows
  select * from student limit 3;
  select * from student tablesample(3 rows);
  Query a percentage of the data
  select * from student tablesample(13 percent);
  Query a fixed size of data
  select * from student tablesample(68b); -- size suffixes such as b, k, m, g
  Pick three rows at random
  select * from student order by rand() limit 3;
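Returning to bucket sampling: tablesample(bucket x out of y on col) can be read as a filter on the hash of col modulo y. The sketch below restates the modulo-8 example as an explicit filter, assuming the student table from this post and the classic hash scheme; on Hive versions that hash bucketed tables differently, the two queries may not return identical rows.

```sql
-- Bucket sampling: the first of 8 hash buckets on sno.
select * from student tablesample(bucket 1 out of 8 on sno);

-- Roughly equivalent explicit filter: bucket x corresponds to remainder x - 1.
select * from student where pmod(hash(sno), 8) = 0;
```

The same reading explains why bucket 1 out of 2 on a 4-bucket table returns the first and third buckets: a hash that is 0 modulo 2 is either 0 or 2 modulo 4.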