Hive Bucketing Explained

数仓大山哥

Article link: https://blog.csdn.net/panfelix/article/details/107433442

Syntax

CREATE [EXTERNAL] TABLE <table_name>
(<col_name> <data_type> [, <col_name> <data_type> ...])
[PARTITIONED BY ...]
CLUSTERED BY (<col_name>)
[SORTED BY (<col_name> [ASC|DESC] [, <col_name> [ASC|DESC]...])]
INTO <num_buckets> BUCKETS
CLUSTERED BY (<col_name>): the column to bucket on
SORTED BY (<col_name> [ASC|DESC]): how rows are sorted within each bucket
INTO <num_buckets> BUCKETS: the number of buckets

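Details:
Bucketing is usually done on a single column, although CLUSTERED BY also accepts a comma-separated list of columns. A table can be both partitioned and bucketed; in that case each partition contains <num_buckets> buckets. The SORTED BY columns need not be the same as the bucketing columns. ASC sorts ascending and DESC descending; ascending is the default. <num_buckets> sets the number of buckets, which is also the number of data files written under the table (or partition) directory.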

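How bucketing works:

The principle is the same as that of HashPartitioner in MapReduce.
In MapReduce: the hash of the key is taken modulo the number of reduce tasks.
In Hive: the hash of the bucketing column is taken modulo the number of buckets.
The bucket a record lands in is therefore determined entirely by the value of its bucketing column.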

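When the hash of the join column is not evenly distributed, add another column to the bucketing columns.
Bucketing formula:
bucket num = hash_function(bucketing_column) mod num_buckets
That is, the hash of the column value modulo the number of buckets decides which bucket a row is stored in.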
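
As a rough illustration of the formula above, the following Python sketch assigns values to buckets. It assumes a Java-style string hash (the same recurrence as String.hashCode()) as a stand-in for hash_function; the hash Hive actually applies depends on the column type and the Hive version, so the bucket numbers here are illustrative only. The names java_string_hash and bucket_for are hypothetical helpers, not Hive APIs.

```python
from collections import defaultdict


def java_string_hash(s: str) -> int:
    """Java-style string hash: h = 31 * h + code, wrapped to a signed 32-bit int.

    Assumption: used here only as a stand-in for Hive's hash_function.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as signed


def bucket_for(value: str, num_buckets: int) -> int:
    """bucket num = hash_function(bucketing_column) mod num_buckets (0-based)."""
    # Clear the sign bit first so the remainder is never negative.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    uids = ["1001", "1002", "1003", "1004", "1005", "1006"]  # made-up sample uids
    buckets = defaultdict(list)
    for uid in uids:
        buckets[bucket_for(uid, 4)].append(uid)
    for b in sorted(buckets):
        print(f"bucket {b}: {buckets[b]}")
```

The same value always maps to the same bucket, which is what later makes bucket-wise joins and sampling possible.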

1. Creating the bucketed table

drop table xxxxxx_uid_online_buck;
create table xxxxxx_uid_online_buck(
  `datehour` string, 
  `halfhourtype` string, 
  `uid` string, 
  `roomid` string, 
  `roomcreatoruid` string, 
  `staytime` string)
clustered by(uid) 
sorted by(uid ASC)
into 4 buckets
row format delimited
fields terminated by ',';

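2. Set the bucketing parameters (two options: 1) enforce bucketing, or 2) set the number of reducers; option 1 is recommended)
1) set hive.enforce.bucketing = true; (This parameter was removed in Hive 2.0, where bucketing is always enforced, so newer versions reject it with: Query returned non-zero code: 1, cause: hive configuration hive.enforce.bucketing does not exists.)
2) set mapreduce.job.reduces=4; (not recommended)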
3. Inserting data into the bucketed table

insert into table xxxxxx_uid_online_buck
select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online distribute by(uid) sort by(uid asc);
 
insert overwrite table xxxxxx_uid_online_buck
select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online distribute by(uid) sort by(uid asc);
 
insert overwrite table xxxxxx_uid_online_buck
select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online cluster by(uid);
 
-- Fails: CLUSTER BY and SORT BY cannot be used together
insert overwrite table xxxxxx_uid_online_buck
select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online cluster by(uid) sort by(uid);

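Data inserted into the bucketed table must already be distributed into buckets and sorted.
Use distribute by(uid) sort by(uid asc) for this.
When the sort column and the bucketing column are the same, cluster by(column) also works.
Note that cluster by is equivalent to distribute by plus sort by on the same column, in ascending order only.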

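Note: there are two ways to bring data in. Loading from a file does not actually bucket the data; inserting from another table is the only way that really buckets it.
Why does load data into xxxxxx_uid_online_buck not bucket the data? Because load data only moves the file into the table's HDFS directory and does nothing else.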

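Before inserting, set hive.enforce.bucketing=true so that Hive forces the number of reducers to equal the number of buckets. Without it, the number of output files may differ from the bucket count in the CREATE TABLE statement. Alternatively, set mapred.reduce.tasks to the number of buckets and add CLUSTER BY (or DISTRIBUTE BY) after the SELECT so that rows are routed to reducers by the bucketing column. The first approach is recommended.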
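
The following Python sketch is a simplified model, not Hive code, of why the reducer count matters. It assumes that each reducer writes exactly one output file and that a row is routed to reducer key mod num_reducers; integer keys stand in for hashed uid values. The function name files_written is hypothetical.

```python
def files_written(keys, num_reducers):
    """Group keys by the reducer that would receive them.

    Simplified model (an assumption): one output file per non-empty reducer,
    and a row with key k goes to reducer k % num_reducers.
    """
    files = {}
    for k in keys:
        files.setdefault(k % num_reducers, []).append(k)
    return files


if __name__ == "__main__":
    keys = list(range(20))  # made-up integer keys
    for reducers in (4, 2):
        out = files_written(keys, reducers)
        print(f"{reducers} reducers -> {len(out)} files")
```

Under this model, a table declared into 4 buckets gets one file per bucket only when 4 reducers run; with 2 reducers it gets 2 files that no longer correspond to the declared buckets.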

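Method 1 (recommended)

-- Turn on enforced bucketing:
hive (myhive)> set hive.enforce.bucketing=true;
-- Set the number of reducers to -1 so that Hive derives it from the bucket count:
hive (myhive)> set mapreduce.job.reduces=-1;
-- Insert data from another table:
hive (myhive)> insert into table xxxxxx_uid_online_buck select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online;
(With this method alone, the rows inside each bucket file are unsorted: the table's sorted by clause has no effect unless hive.enforce.sorting=true is also set on Hive 1.x.)
If hive.enforce.bucketing is not set, we must set a number of reducers that matches the number of buckets ourselves.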

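Method 2 (not recommended)

-- Turn off enforced bucketing:
hive (myhive)> set hive.enforce.bucketing=false;
-- Set the number of reducers equal to the number of buckets (4 for this table):
hive (myhive)> set mapreduce.job.reduces=4;
-- Insert data from another table, adding distribute by and sort by on the bucketing column:
hive (myhive)> insert into table xxxxxx_uid_online_buck select datehour,halfhourtype,uid,roomid,roomcreatoruid,staytime from xxxxxx_uid_online distribute by(uid) sort by(uid asc);
Note: when hive.enforce.bucketing is true, set the number of reducers to -1;
when hive.enforce.bucketing is false, set the number of reducers equal to the number of buckets;
if hive.enforce.bucketing is true and the number of reducers is set to a value greater than 1, two jobs are executed.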

4. Viewing the bucketed table's data files
hive> dfs -ls /user/hive/warehouse/xxxxxx_uid_online_buck;
5. Sampling queries
SELECT * FROM xxxxxx_uid_online_buck TABLESAMPLE(bucket 1 out of 2 on uid);
SELECT * FROM xxxxxx_uid_online_buck TABLESAMPLE(bucket 1 out of 4 on uid) limit 100;
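
To make the two queries above concrete: TABLESAMPLE(bucket x out of y on uid) keeps the rows whose hashed uid falls in the x-th of y groups. When the table is bucketed on uid and y divides the table's bucket count, that corresponds to whole bucket files. The Python sketch below works out which ones; sampled_buckets is a hypothetical helper, and it deliberately rejects the case where y does not divide the bucket count.

```python
def sampled_buckets(x: int, y: int, num_buckets: int) -> list:
    """1-based bucket numbers covered by TABLESAMPLE(BUCKET x OUT OF y).

    Assumes the table is bucketed on the sampled column and that y divides
    num_buckets, so the sample consists of whole buckets.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_buckets % y != 0:
        raise ValueError("this sketch only handles y dividing num_buckets")
    # Bucket b (1-based) holds rows with hash % num_buckets == b - 1;
    # those rows satisfy hash % y == x - 1 exactly when (b - 1) % y == x - 1.
    return [b for b in range(1, num_buckets + 1) if (b - 1) % y == x - 1]


if __name__ == "__main__":
    print(sampled_buckets(1, 2, 4))  # first query above, 4-bucket table
    print(sampled_buckets(1, 4, 4))  # second query above
```

For the 4-bucket table in this article, bucket 1 out of 2 covers buckets 1 and 3 (about half the data), while bucket 1 out of 4 covers bucket 1 only.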
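6. Understanding bucketed tables
For each table or partition, Hive can further organize the data into buckets, which divide the data at a finer granularity. Buckets are organized by a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record belongs to.
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Buckets add extra structure to a table, and Hive can exploit that structure for some queries. In particular, a join of two tables that are bucketed on the same columns, which include the join columns, can be implemented efficiently as a map-side join. Rows with the same join-column value fall into corresponding buckets of both tables, so only corresponding buckets need to be joined, which greatly reduces the amount of data each join task handles.
(2) More efficient sampling. When working with large datasets, it is very convenient to try out a query on a small fraction of the data while developing or revising it.
Bucketing: bucketing on a column means hashing the column's value and taking the remainder modulo the number of buckets to decide which bucket the row goes into. For a query with where bucket_column = "...", the condition value can be hashed the same way to locate its bucket, so the other buckets need not be scanned; this avoids a brute-force scan and speeds up the query.
Difference from partitioning: a partition adds a real directory, and each new partition adds one more directory; bucketing splits one large file into multiple smaller files.
Purpose of bucketed tables: the main benefit is more efficient joins, but the bucket counts of the two tables must be equal or one must be a multiple of the other.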
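
The "equal or a multiple" requirement follows from modular arithmetic: if one bucket count divides the other, the rows of a given bucket in the smaller table can only match rows in a fixed set of buckets in the larger table. A small Python sketch of this, with matching_buckets as a hypothetical helper and plain integers standing in for hash values:

```python
def matching_buckets(small_bucket: int, small_n: int, large_n: int) -> list:
    """0-based buckets of the large_n-bucket table that can contain join keys
    found in bucket small_bucket of the small_n-bucket table."""
    if large_n % small_n != 0:
        raise ValueError("large_n must be a multiple of small_n")
    # If h % small_n == small_bucket and small_n divides large_n, then
    # (h % large_n) % small_n == small_bucket as well.
    return [b for b in range(large_n) if b % small_n == small_bucket]


if __name__ == "__main__":
    small_n, large_n = 4, 8
    # Check the claim for a range of stand-in hash values.
    for h in range(1000):
        assert h % large_n in matching_buckets(h % small_n, small_n, large_n)
    print(matching_buckets(1, small_n, large_n))
```

With 4 and 8 buckets, bucket 1 (0-based) of the smaller table only needs to be joined against buckets 1 and 5 of the larger one. When neither count divides the other, no such fixed pairing exists, which is why the sketch raises an error in that case.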

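Some drawbacks of bucketing:
Data files cannot be bucketed by LOAD alone; populating a bucketed table from files costs an extra MapReduce job to rewrite the rows with INSERT ... SELECT.
In production, bucketing is used relatively rarely; partitioning is far more common.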

#! /bin/bash
 
set -o errexit
 
source /etc/profile
source ~/.bashrc
 
ROOT_PATH=$(dirname $(readlink -f $0))
echo $ROOT_PATH
 
date_pattern_old='^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}$'
# yyyy-MM-dd with month 01-12 and day 01-31
date_pattern='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
 
# Number of arguments
argsnum=$#
 
# Default values
curDate=`date +%Y%m%d`
partitionDate=`date -d '-1 day' +%Y-%m-%d`
fileLocDate=`date -d '-1 day' +%Y-%m-%d`
 
# Log directory
logdir=insert_bucket_logs
 
function tips() { 
    echo "Usage : insert_into_bucket.sh [date]"
    echo "Args :"
    echo "date"
    echo "    date use this format yyyy-MM-dd , ex : 2018-06-02"
    echo "============================================================"
    echo "Example :"
    echo "    example1 : sh insert_into_bucket.sh"
    echo "    example2 : sh insert_into_bucket.sh 2018-06-02"
}
 
if [ $argsnum -eq 0 ] ; then
    echo "No argument, use default value"
elif [ $argsnum -eq 1 ] ; then
    echo "One argument, check date pattern"
    arg1=$1
    if ! [[ "$arg1" =~ $date_pattern ]] ; then
        echo -e "\033[31m Please specify valid date in format like 2018-06-02"
        echo -e "\033[0m"
        tips
        exit 1
    fi
    dateArr=($(echo $arg1 |tr "-" " "))
    echo "dateArr length is "${#dateArr[@]}
    partitionDate=${dateArr[0]}-${dateArr[1]}-${dateArr[2]}
else 
    echo -e "\033[31m Not valid num of arguments"
    echo -e "\033[0m"
    tips
    exit 1
fi
 
 
cd $ROOT_PATH
 
# Create the log directory next to the script, where the redirect below expects it
if [ ! -d "$logdir" ]; then
    mkdir -p $logdir
fi
 
#nohup hive -hivevar p_date=${partitionDate} -hivevar f_date=${fileLocDate} -f  hdfs_add_partition_dmp_clearlog.hql  >> $logdir/load_${curDate}.log
 
nohup beeline -u jdbc:hive2://master:10000 -n root --color=true --silent=false  --hivevar p_date=${partitionDate} -i insert_into_bucket.init -f insert_into_bucket.hql  >> $logdir/insert_bucket_${curDate}.log
