Hive Study Notes (to be continued)

1 Hive Basic Concepts

1.1 Hive Overview

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capability.
It is used as a tool for running statistics over massive amounts of structured log data.

1.2 The Essence of Hive

Hive translates HQL into MapReduce programs (see the EXPLAIN sketch after this list):

  • The data Hive processes is stored in HDFS
  • The underlying implementation of Hive's data analysis is MapReduce
  • The resulting jobs run on YARN
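
A quick way to see this translation is Hive's EXPLAIN statement, which prints the execution plan generated for a query. A minimal sketch; the emp table and its dept column are illustrative and are not created anywhere in these notes:

-- emp/dept are illustrative names; EXPLAIN shows the stages Hive plans for the query
explain
select dept, count(*) as cnt
from emp
group by dept;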

1.3 Pros and Cons

1.3.1 Advantages

1.3.2 Disadvantages

1.4 Hive Architecture

2 Hive Data Types

2.1 Primitive Data Types


2.2 Collection Types


2.3 Code Test

Requirement:
store JSON data like the following in Hive

{
	"name": "songsong",
	"friends": ["bingbing", "lili"],       // list -> Array
	"children": {                          // key/value -> Map
		"xiao song": 18,
		"xiaoxiao song": 19
	},
	"address": {                           // struct -> Struct
		"street": "hui long guan",
		"city": "beijing"
	}
}

Local txt file:

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing
show databases;

create database lya_test;

use lya_test;

show tables;

create table test_create_table(id int,name string);

create table test(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
-- add the following clauses when loading from the corresponding text file
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

load data local inpath '/opt/module/hive/datas/test.txt' into table test;

Field explanations (see the query sketch after this list):
row format delimited fields terminated by ','  -- column separator
collection items terminated by '_'             -- separator between elements of MAP, STRUCT and ARRAY values
map keys terminated by ':'                     -- separator between key and value inside a MAP
lines terminated by '\n';                      -- row separator
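
A minimal sketch that reads the collection columns back from the test table loaded above, using ARRAY index, MAP key and STRUCT field access:

select
  name,
  friends[1]            as second_friend,   -- ARRAY index access
  children['xiao song'] as xiao_song_age,   -- MAP key access
  address.city          as city              -- STRUCT field access
from test
where name = 'songsong';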

2.3.1 Partition Test

 create table dept_partition
 (deptno int, dname string, loc string)
partitioned by (day string)
row format delimited fields terminated by '\t';

load data local inpath 
'/hive/datas/dept_20200401.log' 
into table dept_partition 
partition(day='20200401');
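
Once loaded, the partition is registered in the metastore and can be read back with a filter on the partition column, so only that partition's directory is scanned. A minimal sketch:

show partitions dept_partition;

select * from dept_partition where day = '20200401';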

 create table dept_partition_dy(id int, name string) 
partitioned by (loc int) row format delimited fields terminated by '\t';
insert into table dept_partition_dy select id, name, loc from dept;
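
The dynamic-partition insert above usually needs the following settings first, because Hive's strict mode requires at least one static partition column. A minimal sketch:

set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- allow inserts with no static partition column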

3. Functions

3.1 Window Functions

  • Function description:

OVER(): specifies the size of the data window the analytic function works on; the window size may change from row to row.

CURRENT ROW: the current row
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
UNBOUNDED: no boundary,

  • UNBOUNDED PRECEDING marks the start of the window,
  • UNBOUNDED FOLLOWING marks the end of the window

LAG(col,n,default_val): the value of col n rows before the current row
LEAD(col,n,default_val): the value of col n rows after the current row
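
A minimal sketch of how these keywords combine in a window clause, using the business table created in the sample code below:

-- running total of cost per customer, from the first row up to the current row
select
  name,
  orderdate,
  cost,
  sum(cost) over (
    partition by name
    order by orderdate
    rows between unbounded preceding and current row
  ) as running_total
from business;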

3.1.1 Sample Data

jack,2022-01-01,10
tony,2022-01-02,15
jack,2022-02-03,23
tony,2022-01-04,29
jack,2022-01-05,46
jack,2022-04-06,42
tony,2022-01-07,50
jack,2022-01-08,55
mart,2022-04-08,62
mart,2022-04-09,68
neil,2022-05-10,12
mart,2022-04-11,75
neil,2022-06-12,80
mart,2022-04-13,94
孙悟空,语文,87
孙悟空,数学,95
孙悟空,英语,68
大海,语文,94
大海,数学,56
大海,英语,84
宋宋,语文,64
宋宋,数学,86
宋宋,英语,84
婷婷,语文,65
婷婷,数学,85
婷婷,英语,78

3.1.2 Sample Code

  1. Initialize the data
    private static void initData(SparkSession spark) {
       spark.sql("drop table `business`");
       spark.sql("create table `business`(" +
               "`name` string," +
               "`orderdate` string," +
               "`cost` int)" +
               "row format delimited fields terminated by ','");
       spark.sql("load data local inpath 'datas/business.txt' into table business");
       spark.sql("create table score(" +
               "name string,subject string, score int) " +
               "row format delimited fields terminated by ',' ");
       spark.sql("load data local inpath 'datas/rank.txt' into table score");
   }
  2. Query the data
        spark.sql("select name , count(*) from business group by name").show();
        spark.sql("select name , count(*) over() from business").show();


// 1 Customers who made a purchase in April 2022, and the total number of such customers
        spark.sql("select name , count(*)" +
                "from business " +
                "where substring(orderdate,1,7) = '2022-04' " +
                "group by name").show();

        spark.sql("select name , count(*)  over()" +
                "from business " +
                "where substring(orderdate,1,7) = '2022-04' " +
                "group by name").show();

Summary:
group by collapses rows with the same key into one group (one output row per group);
over() opens a window for every single row, so every row is kept and gets the aggregate alongside it.

        // 2 Each customer's purchase details together with that customer's monthly purchase total
        spark.sql("select name, orderdate,cost, sum(cost) over(partition by name,month(orderdate))" +
                "from business").show();


         // 3 Same scenario as above, but accumulate each customer's cost in date order
        spark.sql("select name, orderdate,cost, sum(cost) over(partition by name,month(orderdate) order by orderdate) " +
                "from business").show();


        // 4 Each customer's previous purchase date
        spark.sql("select name, orderdate, lag(orderdate,1) over (partition by name order by orderdate) " +
                "from business").show();


		spark.sql("select * ,rank() over(order by score) from score ").show();
		spark.sql("select * ,dense_rank() over(order by score) from score ").show();
        //5. Top 3 students in each subject
        spark.sql("select * ,rank() over(partition by subject order by score desc ) rank from score ").createOrReplaceTempView("t");
        spark.sql("select * from t where t.rank <=3").show();


3.2 Row-Column Conversion

3.2.1 Rows to Columns

  • Function description
    CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is supported (if any one of them is NULL, the whole result is NULL);
    CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string of the same kind as the remaining arguments. If the separator is NULL, the result is also NULL. The function skips any NULL and empty strings that follow the separator argument. The separator is inserted between the strings being concatenated;
    Note: the arguments of CONCAT_WS must be string or array<string>
    COLLECT_SET(col): accepts only primitive data types; its main purpose is to deduplicate and aggregate the values of a column, producing an Array-typed field.
    (A short sketch of the NULL handling follows the code below.)
      //Query which students each subject contains
        spark.sql("use test");
        spark.sql("insert into table score values ('小明',\"语文\",95)");
        spark.sql("select subject ,concat_ws(\",\",collect_set(name)) from score group by subject").show();
        System.out.println("===============");
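
A minimal sketch of the NULL handling described above (plain literals; assumes a Hive version that allows SELECT without a FROM clause):

-- CONCAT returns NULL as soon as one argument is NULL
select concat('a', null, 'c');           -- NULL
-- CONCAT_WS skips NULL arguments after the separator
select concat_ws('-', 'a', null, 'c');   -- 'a-c'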


3.2.2 Columns to Rows

  • Function description
    EXPLODE(col): splits a complex Array or Map stored in one Hive column into multiple rows.
    LATERAL VIEW
    Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
    Explanation: used together with UDTFs such as split and explode; it splits one column of data into multiple rows, and aggregations can then be applied to the exploded data.
3.2.2.1 Sample Data
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼 2》 战争,动作,灾难
3.2.2.2 Sample Code
        spark.sql("create table movie_info(" +
                " movie string," +
                " category string)" +
                "row format delimited fields terminated by \"\\t\"");
        spark.sql("insert into table movie_info values ('《疑犯追踪》',\"悬疑,动作,科幻,剧情\")," +
                "('《Lie to me》',\"悬疑,警匪,动作,心理,剧情\"),('《战狼 2》',\"战争,动作,灾难\")");
        spark.sql("select movie, category_name from " +
                "movie_info " +
                "lateral view explode(split(category,\",\")) movie_info_tmp as category_name").show();

                


4. Demo

4.1 Data Preparation

Table structure

  1. Video table
  2. User table

SQL statements
create table gulivideo_ori(
videoId string, 
 uploader string, 
 age int, 
 category array<string>, 
 length int, 
 views int, 
 rate float, 
 ratings int, 
 comments int,
 relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;


create table gulivideo_user_ori(
 uploader string,
 videos int,
 friends int)
row format delimited 
fields terminated by "\t" 
stored as textfile;
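
The notes do not show the load step for these two tables; a minimal sketch, with purely illustrative local paths:

-- adjust the paths to wherever the raw gulivideo files actually live
load data local inpath '/opt/module/data/gulivideo/video' into table gulivideo_ori;
load data local inpath '/opt/module/data/gulivideo/user' into table gulivideo_user_ori;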

4.2 Implementing the Requirements

4.2.1 Top 10 Video Categories by Popularity

Analysis:

  1. That is, count how many videos each category has and return the 10 categories with the most videos
  2. Group by category and count the number of videoId values
SELECT
	t.category_name,
	count( t.videoId ) num 
FROM
	( SELECT videoId, category_name 
	  FROM gulivideo_ori lateral VIEW explode ( category ) gulivideo_ori_tmp AS category_name ) t 
GROUP BY
	t.category_name 
ORDER BY
	num DESC 
	LIMIT 10


4.2.2 Categories of the 20 Most-Viewed Videos, and the Number of Top-20 Videos per Category

  1. Query the 20 most-viewed videos
  2. Explode the category column into rows
  3. Query each category name and how many of the Top-20 videos fall into it
SELECT
 t2.category_name,
 COUNT(t2.videoId) video_sum
FROM 
( SELECT
	t1.videoId,
	category_name 
FROM
( SELECT videoId, views, category 
  FROM gulivideo_ori ORDER BY views DESC LIMIT 20 ) t1 
	lateral VIEW explode ( t1.category ) t1_tmp AS category_name
	) t2 
GROUP BY
	t2.category_name


4.2.3 Rank the Categories of the Videos Related to the Top 50 Most-Viewed Videos

  1. Query the related video ids of the Top 50 most-viewed videos, exploding relatedId into rows
  2. Look up the categories of those ids and explode the categories into rows
  3. Count the number of videos per category
  4. Rank the counts with a window function
SELECT
t6.category_name,
t6.video_sum,
rank() over ( ORDER BY t6.video_sum DESC ) rk 
FROM
	(
	SELECT
		t5.category_name,
		COUNT( t5.relatedid_id ) video_sum 
	FROM
		(
		SELECT
			t4.relatedid_id,
			category_name 
		FROM
			(
			SELECT
				t2.relatedid_id,
				t3.category 
			FROM
				(
				SELECT
					relatedid_id 
				FROM
					( SELECT videoId, views, relatedid FROM gulivideo_ori ORDER BY views DESC LIMIT 50 ) t1 lateral VIEW explode ( t1.relatedid ) t1_tmp AS relatedid_id 
				) t2
				JOIN gulivideo_ori t3 ON t2.relatedid_id = t3.videoId 
			) t4 lateral VIEW explode ( t4.category ) t4_tmp AS category_name 
		) t5 
	GROUP BY
		t5.category_name 
	ORDER BY
	video_sum DESC 
	) t6


4.2.4 Top 3 Most-Viewed Videos in Each Category

SELECT 
 t2.videoId,
 t2.views,
 t2.category_name,
 t2.rk
FROM 
(
SELECT 
 t1.videoId,
 t1.views,
 t1.category_name,
 rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC ) rk
FROM 
(
SELECT
 videoId,
 views,
 category_name
FROM gulivideo_ori
lateral VIEW explode(category) gulivideo_ori_tmp AS category_name
)t1
)t2
WHERE t2.rk <= 3


4.2.5 Top 3 Users by Number of Uploaded Videos, and Their 3 Most-Viewed Videos

spark.sql("SELECT  " +
                " uploader, " +
                " videos " +
                "FROM gulivideo_user_ori  " +
                "ORDER BY  " +
                " videos " +
                "DESC " +
                "LIMIT 3  ").createOrReplaceTempView("t1");
        spark.sql("SELECT  " +
                " t2.videoId, " +
                " t2.views, " +
                " t2.uploader " +
                "FROM " +
                "t1 " +
                "JOIN gulivideo_ori t2  " +
                "ON t1.uploader = t2.uploader " +
                "ORDER BY t2.views DESC LIMIT 3").show();

5. Optimization

5.1 in/exists Is the Least Efficient

Test case: big table with 1,000,000 rows, small table with 100,000 rows

select t1.id , t1.uid
        from bigtable1 t1
where t1.id in (select t.id as id from smalltable t limit 10);
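
For reference, the exists form of the same filter (a sketch; Hive supports correlated EXISTS subqueries, and the limit used in the in-subquery is dropped here):

select t1.id, t1.uid
from bigtable1 t1
where exists (select 1 from smalltable t where t.id = t1.id);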

Best alternative

select t1.id from bigtable1 t1
    left semi join (select t.id as id from smalltable t limit 10) as temp on temp.id = t1.id;
    

For comparison

select temp.id from
    (select t.id as id from smalltable t limit 10) as temp
    left semi join bigtable1 t1 on temp.id = t1.id;
    


5.2 MapJoin

MapJoin distributes the smaller table of the join directly into the memory of each Map task and performs the join inside the Map task, so the Reduce step is skipped entirely, which speeds things up.

  1. MapJoin parameter settings
  • Enable automatic MapJoin conversion
set hive.auto.convert.join=true; # default is true
  • Size threshold for the small table (below 25 MB is treated as a small table by default)
set hive.mapjoin.smalltable.filesize=25000000;
  2. Demo test
    insert overwrite table jointable
    select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
    from smalltable s
    left join bigtable b
    on s.id = b.id;
    MapJoin was not applied here.
    Reason: the small table is on the preserved (left) side of the left join, so all of its rows must be written out; the query therefore has to go through the reduce stage and the MapJoin conversion does not take effect (see the variant sketched after this list).
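
A variant where the conversion can happen, under the same assumed tables: putting the big table on the preserved side lets Hive hash the small table into each map task and keep the join map-only. Note that this changes the semantics (all rows of bigtable are kept instead of all rows of smalltable), so it only illustrates when the conversion applies:

insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;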

5.3 Join Between Two Big Tables

  • Select only the columns that are actually needed
  • Check whether where filters can be applied earlier, to reduce the data that has to be joined
  • Use bucketed tables

Bucketed table test

Create the original tables

create table bigtable1(
	 id bigint,
	 t bigint,
	 uid string,
	 keyword string,
	 url_rank int,
	 click_num int,
	 click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable1;

create table bigtable2(
	 id bigint,
	 t bigint,
	 uid string,
	 keyword string,
	 url_rank int,
	 click_num int,
	 click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;

Create the bucketed tables

create table bigtable_buck1(
	 id bigint,
	 t bigint,
	 uid string,
	 keyword string,
	 url_rank int,
	 click_num int,
	 click_url string)
	clustered by(id)
	into 6 buckets
	row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck1;


create table bigtable_buck2(
	 id bigint,
	 t bigint,
	 uid string,
	 keyword string,
	 url_rank int,
	 click_num int,
	 click_url string)
	clustered by(id)
	into 6 buckets
	row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck2;

Speed with the original tables

select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable1 a
join bigtable2 b
on a.id = b.id;

Speed with the bucketed tables

  • Parameter settings
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
  • SQL
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;