Hive 常用案例分析

最新推荐文章于 2024-05-31 09:28:08 发布

MusicDancing

最新推荐文章于 2024-05-31 09:28:08 发布

阅读量5.4k

点赞数

分类专栏： hive 文章标签： hive 大数据

本文链接：https://blog.csdn.net/MusicDancing/article/details/110430861

版权

hive 专栏收录该内容

14 篇文章 0 订阅

订阅专栏

1. 数据分箱占比统计

select a.bin
,a.bin_cnt
,concat(round(cast(a.bin_cnt*1.0/b.total_num as decimal(19,4))*100,2),'%') as bin_rate
from (
    select case
        when a1.exchange_rate<0.2 then '[0.0-0.2)'
        when a1.exchange_rate<0.5 then '[0.2-0.5)'
        when a1.exchange_rate<0.8 then '[0.5-0.8)'
        else '[0.8-1]'
    end as bin            --换血率分箱
    ,count(1) as bin_cnt  --分箱数量
    ,'11' as type                      --用于匹配总数的一个临时标志
    from (
        select uid
        ,exchange_rate
        from XX  
    ) a1
    group by case
        when a1.exchange_rate<0.2 then '[0.0-0.2)'
        when a1.exchange_rate<0.5 then '[0.2-0.5)'
        when a1.exchange_rate<0.8 then '[0.5-0.8)'
        else '[0.8-1.0]'
    end
) a
left join (    --通过与原表left join 获得总样本数
    select count(1) as total_num
    ,'11' as type
    from XX
) b
on a.type=b.type
;

2. Hive实现样本随机采样

order by rand() 函数可以对样本进行随机排序，进而实现hive表数据随机取样。

2.1 rand()

产生一个介于0和1之间的随机数

select rand();
-- 0.5523740163966064

-- 指定随机种子，以后使用这个种子每次产生的随机数都是一样的。
select rand(1234);
-- 0.6465821602909256

2.2 order by rand()

对结果进行随机排序

-- 数据随机排序后取前100条
select uid from tablename order by rand(1234) limit 100;
-- 备注： 这个不是分组，只是排序，rand()只是生成一个随机数。
-- 因为要扫描整个表，然后返回，一旦记录超过30W行，这个处理过程就会变得非常缓慢,可能造成服务器死机,因为排序时需要大量的磁盘空间.

为每一行生成一个随机数random，选择指定区间的random,进而实现随机采样

select a.* 
from (
    select *
    ,rand(12345) as random
    from tablename 
) a
where random between 0.3 and 0.5
-- limit 100
;

测试集采样(采样300w，正负样本1:2)

hive -e "
    select t1.*
    from (
        select a.*
        from (
            select *
            ,rand(12345) as rn
            from table1
            where label=1
        ) a
        where rn between 0.8 and 1.0
        limit 1000000
    ) t1

    union all

    select t2.*
    from (
        select a.*
        from (
            select *
            ,rand(12345) as rn
            from table1
            where label=0
        ) a
        where rn between 0.8 and 1.0
        limit 2000000
    ) t2
    ;
"> te.txt

2.3 order by

hive中的order by和传统sql中的order by 一样，对数据做全局排序，会新启动一个jod进行排序，不管数据多少，都会把所有数据放到同一个reduce中进行处理。

备注：因为执行order by的时候只能启动单个reduce，如果排序的结果集过大，那么执行时间会非常漫长。

2.4 sort by

sort by是局部排序，会在每个reduce端做排序，也就是每个reduce出来的数据是有序的，但是全部不一定有序，除非一个reduce，一般情况下可以先进行局部排序完成后，再进行全局排序，会提高不少效率。

sort by不会增加job，它在reduce端直接进行排序。

2.5 distribute by

distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列，对应reduce的个数进行分发，默认是采用hash算法。

-- 将数据打乱，避免数据倾斜的一个方式
distribute by pmod(cast(rand()*1000 as int),50);

Distribute by和sort by的使用场景

1. Map输出的文件大小不均。

2. Reduce输出文件大小不均。

3. 小文件过多。

4. 文件超大。

2.6 cluster by

cluster by 是distribute by和sort by功能的结合，cluster by只能指定倒序排列。

3. 统计hive表字段覆盖度

3.1 临时表

tablename='zz.table_name'

# 1.统计hive表数据总量
total=`/app/hadoop/hive/bin/hive -e "
    set hive.cli.print.header=false;
    select count(1) from $tablename;
"`
echo "数据总量:    $total" >result.txt

# 2.统计hive表字段名称(expand 将tab键转成空格，cut以空格为分隔符，结果取第1个字段)
hive -e " 
    set hive.cli.print.header=false;
    desc $tablename;" | expand | cut -d ' ' -f 1 > feature_list.txt

# 3. 对于每一个字段计算非0值覆盖度
for i in $(cat feature_list.txt)
do
    cnt=`/app/hadoop/hive/bin/hive -e "
        set hive.cli.print.header=false;
        select count(1) from $tablename where $i<>0;
    "`
    cnt=$(($cnt*100))
    # rate=`echo "scale=2; $cnt/$total" | bc`
    # bc 计算的小数会出现将"0.12"-->".12"情况
    rate=$(echo "$cnt $total" | awk '{printf ("%.2f\n",$1/$2)}')
    echo  "$i:    $rate%" >>result.txt
done

3.2 分区表

tablename='zz.table_name'
# 以pt分区
pt='2021-01-07'

total=`/app/hadoop/hive/bin/hive -e "
    set hive.cli.print.header=false;
    select count(1) from $tablename where pt='$pt';
"`
echo "数据总量:    $total" >result.txt

hive -e " 
    set hive.cli.print.header=false;
    desc $tablename;" | expand | cut -d ' ' -f 1 > feature_list.txt

# 删除后6行，无效字段
for((i=0;i<6;i++)); 
do 
    sed -i '$d' feature_list.txt; 
done  

for i in $(cat feature_list.txt)
do
    cnt=`/app/hadoop/hive/bin/hive -e "
        set hive.cli.print.header=false;
        select count(1) from $tablename where pt='$pt' and $i<>0;
    "`
    cnt=$(($cnt*100))
    #rate=`echo "scale=2; $cnt/$total" | bc`
    rate=$(echo "$cnt $total" | awk '{printf ("%.2f\n",$1/$2)}')
    echo  "$i:    $rate%" >>result.txt
done

4. hive新集群批量建表

工作中，遇到hive集群搬迁，从A集群，迁到B集群，相应的数据表也要随之迁移，

但hive表一般都比较多，下面实现批量处理方式

1. 获得待搬迁的数据表清单table_list.txt。

hive -e "use zx_zz; show tables;" > table_list.txt

2. 处理建表语句。

# 执行命令，在每个建表语句后换行追加“;”
for i in `cat table_list.txt`
do
    hive -e"use zx_zz;show create table ${i}">>create_table_sql.txt
    echo \; >>create_table_sql.txt
done

# 去掉“createtab_stmt”
sed -i 's/createtab_stmt//g' create_table_sql.txt
# 将A集群替换成B集群
sed -i 's/A/B/g' create_table_sql.txt

3. 打开文件create_table_sql.txt，在首行加上"use zx_zz;"，导出文件

4. 导入文件，在B集群服务器上执行命令"hive -f create_table_sql.txt"

5. 修复表

hive -e "use zx_zz; show tables;" > tables.txt
for i in $(cat tables.txt)
do
    name='zx_zz.'$i
    hive -e "msck repair table $name;"
    echo  "msck repair table $name"
done

5. 获取具有指定字段的所有表

获得某一数据库(zz)下，具有指定字段(phone或mobile)的所有表。

# 获取该数据库下所有表名
hive -e "use zz; show tables;" >table_list.txt

for i in `cat table_list.txt`
do
    table_name=${i}
    result=`hive -e "use zz;desc ${table_name}" | grep -E 'phone|mobile'`
    if [ "$result" ];then
        echo ${i} >>phone_table.txt
        # 获取表的时间戳
        # table_name_new=${table_name#*.}  # 去掉库名
        # hadoop fs -ls "hdfs://wxlx/user/hive/warehouse/zz.db/"${table_name_new} | awk -F' ' '{print $6,$7,$8}' | tail -n 1 >>phone_table_ts.txt
    fi
done

MusicDancing

关注

0
点赞
踩
15

收藏

觉得还不错? 一键收藏
1
评论
Hive 常用案例分析

得到一个不重复的数据样本1. rand()产生一个介于0和1之间的随机数select rand();# 0.5523740163966064# 指定随机种子，以后使用这个种子每次产生的随机数都是一样的。select rand(1234);# 0.64658216029092562. order by rand()对结果进行随机排序# 数据随机排序后取前100条select uid from tablename order by rand(1234) limit 10.
复制链接

扫一扫