Hive day04

姚circle

Last modified: 2022-12-05 09:00:55


Category: hive. Tags: hive, hadoop, data warehouse

First published: 2022-12-04 17:34:34

Article link: https://blog.csdn.net/qq_53822083/article/details/128144572


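Contents

1. Dimension-combination analysis

2. Row/column conversion:

3. Field type conversion

4. The four BY clauses

1. order by

2. sort by

3. Distribute By (data distribution):

4. Cluster By

Example

5. File storage formats: compression

1. Row-oriented storage

2. Column-oriented storage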
1. Dimension-combination analysis

SQL keyword: GROUPING SETS

Example

Data

u1,a,app,andriod
u2,b,h5,andriod
u1,b,h5,ios
u1,a,h5,andriod
u3,c,app,ios
u4,b,app,ios
u1,a,app,ios
u2,c,app,ios
u5,b,app,ios
u4,b,h5,andriod
u6,c,h5,andriod
u2,c,h5,andriod
u1,b,xiao,ios
u2,a,xiao,andriod
u2,a,xiao,ios
u3,a,xiao,ios
u5,a,xiao,andriod
u5,a,xiao,ios
u5,a,xiao,ios

Create the table

create table user_shop_log(
user_id string,
shop string,
channle string ,
os string 
)
row format delimited fields terminated by ',';

Load the data: load data local inpath "/home/hadoop/tmp/data/user_log.txt" into table user_shop_log;

Questions

1. Number of visits per shop

select 
shop,
count(user_id) as cnt 
from user_shop_log
group by 
shop ;

2. Number of visits per shop and user

select
shop,
user_id,
count(user_id) as cnt
from user_shop_log
group by shop,user_id;

3. Number of visits per shop, user and channel

select
shop,
user_id,
channle,
count(user_id) as cnt
from user_shop_log
group by shop,user_id,channle;

4. Number of visits per shop, user, channel and operating system

select
shop,
user_id,
channle,
os,
count(user_id) as cnt
from user_shop_log
group by shop,user_id,channle,os;

5. Number of logins per user and operating system

select
user_id,
os,
count(user_id) as cnt
from user_shop_log
group by user_id,os;

6. Number of views per channel and operating system

select
channle,
os,
count(user_id) as cnt
from user_shop_log
group by channle,os;

Dimension-combination analysis with GROUPING SETS (one query computes every listed grouping):

select 
user_id,
shop ,
channle  ,
os  ,
count(1)
from user_shop_log
group by 
user_id,
shop ,
channle  ,
os 
grouping sets(
(user_id),
(user_id,shop),
(user_id,channle),
(user_id,os),
(user_id,shop,channle),
(user_id,shop,os),
(user_id,shop,channle,os),
(shop),
(shop,channle),
(shop,os),
(shop,channle,os),
(channle),
(channle,os),
(os)
);
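
As a rough illustration of what GROUPING SETS does, the sketch below computes only two of the groupings from the questions above, (shop) and (shop, user_id), in one statement, and then shows a UNION ALL form that should yield the same shop, user_id and cnt rows. Columns that are not part of a grouping set come back as NULL. grouping__id is Hive's built-in virtual column for telling the grouping sets apart; its numbering differs between Hive versions, so the concrete values are best treated as version-dependent.

```sql
-- One pass over the table, two grouping sets.
select
    shop,
    user_id,
    grouping__id,        -- identifies which grouping set produced the row
    count(1) as cnt
from user_shop_log
group by shop, user_id
grouping sets (
    (shop),
    (shop, user_id)
);

-- Same counts without GROUPING SETS: one query per grouping, glued together.
select shop, cast(null as string) as user_id, count(1) as cnt
from user_shop_log
group by shop
union all
select shop, user_id, count(1) as cnt
from user_shop_log
group by shop, user_id;
```

The GROUPING SETS form scans the table once, while the UNION ALL form scans it once per branch, which is the main reason to prefer it when many dimension combinations are needed.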

2. Row/column conversion:

Collapsing several rows into one value

xxxx => array

Example

Data

zuan,王者荣耀
zuan,吃饭
zuan,rap
zuan,唱歌
甜甜,王者荣耀
甜甜,哥
甜甜,吃鸡

Create the table

create table t1(
name string ,
interesting string 
)
row format delimited fields terminated by ',';

Load the data: load data local inpath "/home/hadoop/tmp/data/t1.txt" into table t1;

Query

select 
name,
collect_list(interesting) as interestings,
concat_ws("|",collect_list(interesting)) as interestings_blk
from t1
group by name ;
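
For the opposite direction, turning an array back into one row per element, Hive provides explode together with LATERAL VIEW. The sketch below is not from the original post: it reuses the aggregation above as a subquery, and the aliases agg, e and hobby are illustrative names.

```sql
-- Expand each element of the collected array back into its own row.
select
    agg.name,
    hobby
from (
    select
        name,
        collect_list(interesting) as interestings
    from t1
    group by name
) agg
lateral view explode(agg.interestings) e as hobby;
```

If duplicate values should be dropped while collecting, collect_set can be used in place of collect_list.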

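3. Field type conversion
- Premise:
  Any data type can be converted to string.
  When numeric values are stored as string:
  - 1. Arithmetic still works, because Hive converts the values implicitly
  - 2. Sorting is affected
- Example
  - Data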
```
sal :  
1000
1500
100
900
9000
```
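  - Strings are sorted in dictionary (lexicographic) order, character by character.
    Sorting the string column (descending here) produces the following problematic order:
    9000
    900
    1500
    1000
    100
  - Possible fixes:
    - 1. Change the column type in the table
    - 2. Cast the type in the query
      - Create the table
        
        create table t2( sql string );
      - Load the data:
        load data local inpath "/home/hadoop/tmp/data/t2.txt" into table t2;
      - Query
        
        select cast(sql as bigint ) as sql_alias from t2 order by sql_alias; -- cast the string to bigint, then sort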
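
To make the difference concrete, the two queries below can be run side by side on the t2 table from this section. This is a minimal sketch; sal_num is an illustrative alias that does not appear in the original.

```sql
-- String comparison, ascending: 100, 1000, 1500, 900, 9000
-- ('1500' sorts before '900' because '1' < '9').
select sql from t2 order by sql;

-- Numeric comparison after an explicit cast, ascending: 100, 900, 1000, 1500, 9000
select
    sql,
    cast(sql as bigint) as sal_num
from t2
order by sal_num;
```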
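4. The four BY clauses
- 1. order by
  - 1. Global sort
  - 2. Uses only one reducer
  - Usage
    - Strict mode (usually turned off)
      Strict mode rejects certain risky queries.
      - Turn on: set hive.mapred.mode=strict;
        Turn off: set hive.mapred.mode=nonstrict;
      - With strict mode on, select * from emp_p; is rejected
        (emp_p is a partitioned table; non-partitioned tables are unaffected).
        The query must filter on a partition, for example select * from emp_p where deptno=20;
    - select * from emp order by empno limit 10;
      Sorts by empno in ascending order. Strict mode also requires a limit on order by queries.
- 2. sort by
  - 1. Sorts within each partition (each reducer's output)
  - 2. Uses the default number of reduce tasks
  - 3. Does not guarantee a global order
  - With a single reduce task, order by and sort by give the same result
    - Set the number of reduce tasks: set mapred.reduce.tasks=2;
      Run the query again; the output is now split into two parts
  - Inspect the result data
    insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/sortby'
    select * from emp sort by empno;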
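
The contrast between the two clauses can be sketched as follows. The block assumes the emp table used above has a numeric empno column, which the post does not show.

```sql
-- Ask for two reducers so the effect of SORT BY becomes visible.
set mapred.reduce.tasks=2;

-- Each reducer sorts only its own share of the rows, so the combined
-- output is two sorted runs rather than one globally sorted list.
select empno from emp sort by empno;

-- ORDER BY sends every row through a single reducer and returns one
-- globally sorted result, regardless of the setting above.
select empno from emp order by empno limit 10;

-- Restore the default so later queries choose the reducer count themselves.
set mapred.reduce.tasks=-1;
```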
- 3. Distribute By (data distribution):
  - Data
    Quarterly earnings per year
```
2020,1w
2020,2w
2020,1w
2020,0.5w
2021,10w
2021,20w
2021,19w
2021,1.5w
2022,1.3w
2022,2w
2022,1w
2022,0.5w
```
  - Create the table
```
create table hive_distribute(
year string,
earning string
)
row format delimited fields terminated by ',';
```
  - Load the data: load data local inpath "/home/hadoop/tmp/data/exemple/distribute.txt" into table hive_distribute;
  - Query
    - With 2 reduce tasks
      
      insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/distribute'
      select * from hive_distribute distribute by year sort by earning;
    - With 3 reduce tasks
    - Reset the number of reduce tasks to the default: set mapred.reduce.tasks=-1;
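
A possible way to try the three-reducer case is sketched below. The output directory distribute3 is a hypothetical path chosen for illustration. Which reducer receives which year depends on Hive's hashing, so the sketch makes no claim about the contents of individual files.

```sql
-- Three reducers, so the output directory holds up to three files.
set mapred.reduce.tasks=3;

-- Rows with the same year always land in the same file; within each file
-- rows are ordered by earning. earning is a string column, so the order is
-- lexicographic: '10w' sorts before '1w'.
insert overwrite local directory '/home/hadoop/tmp/data/exemple/distribute3'
select * from hive_distribute distribute by year sort by earning;

-- Back to the default reducer count.
set mapred.reduce.tasks=-1;
```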
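- 4. Cluster By
  - Overview
    - Cluster By and Distribute By are mainly used to distribute data.
      Cluster By is shorthand for Distribute By plus Sort By on the same column.
    - Usage
      distribute by year sort by year <=> Cluster By year
  - Bucketed tables
    - The buckets of a bucketed table are files on HDFS.
      
      Bucketing subdivides the data further, below the dt partition level.
    - Example
      - Data
        1,name1
        2,name2
        3,name3
        4,name4
        5,name5
        6,name6
        7,name7
        8,name8
      - Create the table
        create table hive_bucket( id int, name string ) clustered by (id) into 4 buckets row format delimited fields terminated by ","; -- 4 buckets keyed on id
      - Load the data: load data local inpath "/home/hadoop/tmp/data/exemple/bucket.txt" into table hive_bucket;
        Remember to reset the number of reduce tasks to the default value first.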
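
The sketch below tries out the Cluster By shorthand and shows one common way to populate the bucketed table through a staging table. hive_bucket_stage is a hypothetical table name that is not in the original post. On Hive releases before 2.0, set hive.enforce.bucketing=true; is usually needed before the insert; later releases always enforce bucketing.

```sql
-- CLUSTER BY year is shorthand for DISTRIBUTE BY year SORT BY year.
select * from hive_distribute cluster by year;

-- Hypothetical plain-text staging table with the same columns as hive_bucket.
create table hive_bucket_stage(
    id int,
    name string
)
row format delimited fields terminated by ',';

load data local inpath "/home/hadoop/tmp/data/exemple/bucket.txt" into table hive_bucket_stage;

-- Writing through INSERT lets Hive hash each row on id into one of the 4 buckets.
insert into table hive_bucket select * from hive_bucket_stage;

-- Read back a single bucket.
select * from hive_bucket tablesample(bucket 1 out of 4 on id);
```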
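5. File storage formats
- 1. Row-oriented storage
  - 1. Meaning
    - 1. All columns of a row are stored together in the same block
    - 2. A block therefore mixes many data types
  - 2. Query behavior
    - A row-oriented reader loads every column of each row and then filters out the columns the user asked for.
      If a query touches only a few fields, this causes considerable disk I/O overhead.
  - 3. Row-oriented formats
    - 1. TextFile: plain text files
    - 2. SequenceFile: binary key-value files
- 2. Column-oriented storage
  - 1. Meaning: data is stored column by column
  - 2. Column-oriented file formats
    - 1. RCFile: an early product of the move from row to column storage (rarely used today)
    - 2. ORC Files
    - 3. Parquet
  - 3. Best suited to: queries that read only a few columns
  - 4. Drawback: reading every field of the table (for example select *) is comparatively expensive
  - 5. Advantage: columnar files are smaller than row-oriented files [provided both are compressed]
  - 6. Example
    - Create the table
      create table hive_distribute_col( year string, earning string ) row format delimited fields terminated by ',' stored as orc;
    - Load the data
      insert into table hive_distribute_col select * from hive_distribute;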
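
Since the advantage of columnar files depends on compression, the following sketch shows how an ORC table can name its codec explicitly and how to inspect the result. hive_distribute_orc_snappy is a hypothetical table name, and the choice of SNAPPY is only an example.

```sql
-- ORC table with an explicit compression codec.
create table hive_distribute_orc_snappy(
    year string,
    earning string
)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

insert into table hive_distribute_orc_snappy select * from hive_distribute;

-- A query that touches a single column only needs that column's data.
select year, count(1) as cnt
from hive_distribute_orc_snappy
group by year;

-- Shows the input/output format, SerDe and table parameters,
-- which helps when comparing this table with the text table.
describe formatted hive_distribute_orc_snappy;
```

With only twelve rows the size difference will not be meaningful; the comparison becomes interesting on larger tables.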