【Hive_04】

走多远才算远

Published 2022-12-05 17:58:56

Tags: hive, hadoop, data warehouse

Article link: https://blog.csdn.net/weixin_47922102/article/details/128190823

Hive_04

[hadoop@bigdata32 exemple]$ cat user_shop.txt
user_id shop
u1,a
u2,b
u1,b
u1,a
u3,c
u4,b
u1,a
u2,c
u5,b
u4,b
u6,c
u2,c
u1,b
u2,a
u2,a
u3,a
u5,a
u5,a
u5,a

pv => page views: if 3 users each visit a page 10 times, pv = 30
uv => unique visitors: if 3 users each visit a page 10 times, uv = 3
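
To make the pv/uv distinction concrete, here is a small Python sketch (plain Python, not Hive) that counts both per shop. The `visits` list is simply the content of user_shop.txt typed in by hand; pv counts every row, uv counts distinct user_id values.

```python
from collections import defaultdict

# Rows of user_shop.txt as (user_id, shop) pairs.
visits = [
    ("u1", "a"), ("u2", "b"), ("u1", "b"), ("u1", "a"), ("u3", "c"),
    ("u4", "b"), ("u1", "a"), ("u2", "c"), ("u5", "b"), ("u4", "b"),
    ("u6", "c"), ("u2", "c"), ("u1", "b"), ("u2", "a"), ("u2", "a"),
    ("u3", "a"), ("u5", "a"), ("u5", "a"), ("u5", "a"),
]

pv = defaultdict(int)     # shop -> number of visits (every row counts)
users = defaultdict(set)  # shop -> set of distinct visitors

for user_id, shop in visits:
    pv[shop] += 1
    users[shop].add(user_id)

# uv is the size of each shop's visitor set.
uv = {shop: len(ids) for shop, ids in users.items()}

for shop in sorted(pv):
    print(shop, "pv =", pv[shop], "uv =", uv[shop])
```

For this file, shop a has pv 9 and uv 4, shop b has pv 6 and uv 4, and shop c has pv 4 and uv 3.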

create table user_shop(
user_id string,
shop string
)
row format delimited fields terminated by ',';

load data local inpath "/home/hadoop/tmp/data/exemple/user_shop.txt" into table user_shop;

Requirements:
1. uv of each shop

Dimension: shop
Metric: uv (number of users), i.e. count() plus deduplication

select
shop,
count(distinct user_id) as uv
from user_shop
group by shop;
==> deduplication can also be done with group by

select
*
from user_shop;

1. Data cleaning (etl):
	1. Select the fields you need
	2. Transform or clean the selected columns
	3. Deduplicate
2. Data computation
	Dimensions
	Metrics

select
shop,
count(user_id) as uv
from
(
select
shop,
user_id
from user_shop
group by
shop,
user_id
) as a
group by
shop;

=> How many shops are there in total?
select
count(distinct shop) as cnt
from user_shop;

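2. For each shop, the records of the top 3 users by visit count
	Output: shop name, visitor id, visit count

1. Visit count of each shop

2. Rank by visit count

3. top3: shop name, visit count, rank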

select
shop,
cnt ,
rk
from
(
select
shop,
cnt ,
rank() over( order by cnt desc ) as rk
from
(
-- visit count of each shop
select
shop,
count(1) as cnt
from user_shop
group by shop
) as a
) as a
where rk <4;

select
shop,
count(1) as cnt
from user_shop
group by shop;

Shop name, visitor id, visit count: top3

1. Visit count of each visitor id in each shop

2. Rank by visit count

3. top3: shop name, visitor id, visit count

select
shop,
user_id,
cnt,
rk
from
(
select
shop,
user_id,
cnt,
rank() over(partition by shop order by cnt desc ) as rk
from
(
-- 1. visit count of each visitor id in each shop
select
shop,
user_id,
count(1) as cnt
from user_shop
group by
shop,
user_id
) as a
) as a
where rk <4;
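
One detail worth seeing with real numbers: rank() gives tied rows the same rank, so `rk < 4` can return more than 3 rows per shop. The Python sketch below mimics `rank() over(partition by shop order by cnt desc)` on the user_shop.txt data. The helper name `rank_desc` is made up for this illustration, and ties are ordered by user_id only to keep the output stable (Hive does not promise an order among ties).

```python
from collections import Counter, defaultdict

# Rows of user_shop.txt as (user_id, shop) pairs.
visits = [
    ("u1", "a"), ("u2", "b"), ("u1", "b"), ("u1", "a"), ("u3", "c"),
    ("u4", "b"), ("u1", "a"), ("u2", "c"), ("u5", "b"), ("u4", "b"),
    ("u6", "c"), ("u2", "c"), ("u1", "b"), ("u2", "a"), ("u2", "a"),
    ("u3", "a"), ("u5", "a"), ("u5", "a"), ("u5", "a"),
]

def rank_desc(counts):
    """Mimic rank() over (order by cnt desc): ties share a rank, the next rank skips."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    for position, (user_id, cnt) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == cnt:
            rk = ranked[-1][2]   # same cnt as the previous row: reuse its rank
        else:
            rk = position        # otherwise the rank is the row position
        ranked.append((user_id, cnt, rk))
    return ranked

# Step 1: visit count per shop per user (the inner group by).
per_shop = defaultdict(Counter)
for user_id, shop in visits:
    per_shop[shop][user_id] += 1

# Steps 2 and 3: rank inside each shop, then keep rk < 4.
top3 = {
    shop: [row for row in rank_desc(counts) if row[2] < 4]
    for shop, counts in per_shop.items()
}

for shop in sorted(top3):
    print(shop, top3[shop])
```

Shop b comes back with four rows, because two users tie at rank 1 and two more tie at rank 3. If exactly three rows per shop are required, row_number() is the usual alternative.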

1 2 3 4 5 6 7: seven dimensions, 2 metrics

1: 2 metrics
1 3: 2 metrics
123
124
145
156
1-7
357

Dimension-combination analysis uses the sql keyword grouping sets

create table user_shop_log(
user_id string,
shop string,
channle string ,
os string
)
row format delimited fields terminated by ',';

u1,a,h5,andriod
u2,b,h5,andriod
u1,b,h5,andriod
u1,a,h5,andriod
u3,c,app,ios
u4,b,app,ios
u1,a,app,ios
u2,c,app,ios
u5,b,app,ios
u4,b,h5,andriod
u6,c,h5,andriod
u2,c,h5,andriod
u1,b,xiao,ios
u2,a,xiao,ios
u2,a,xiao,ios
u3,a,xiao,ios
u5,a,xiao,ios
u5,a,xiao,ios
u5,a,xiao,ios

load data local inpath "/home/hadoop/tmp/data/exemple/user_log.txt" into table user_shop_log;

1. Visit count of each shop
create table rpt_shop_cnt as
select
shop,
count(1) as cnt
from user_shop_log
group by
shop;
2. Visit count of each user in each shop
create table rpt_shop_user_cnt as
select
shop,user_id,
count(1) as cnt
from user_shop_log
group by
shop,user_id;
3. Visit count of each user in each shop per channel
create table rpt_shop_channle_cnt as
select
shop,user_id,channle,
count(1) as cnt
from user_shop_log
group by
shop,user_id,channle;
4. Visit count of each user in each shop per channel per operating system
5. Login count of each user per operating system
create table rpt_user_os_cnt as
select
user_id,os,
count(1) as cnt
from user_shop_log
group by
user_id,os;
6. View count of each channel per operating system
create table rpt_channle_os_cnt as
select
channle,os,
count(1) as cnt
from user_shop_log
group by
channle,os;

Dimension-combination analysis:
GROUPING SETS

select
user_id,
shop ,
channle ,
os ,
count(1)
from user_shop_log
group by
user_id,
shop ,
channle ,
os
grouping sets(
(user_id),
(user_id,shop),
(user_id,channle),
(user_id,os),
(user_id,shop,channle),
(user_id,shop,os),
(user_id,shop,channle,os),
(shop),
(shop,channle),
(shop,os),
(shop,channle,os),
(channle),
(channle,os),
(os)
);
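
To show what a single grouping sets query produces, here is a rough Python emulation (not Hive itself). For every grouping set, each row is counted under a key that keeps only the columns in that set; the other columns become None, which plays the role of NULL in the Hive result. The function name `grouping_sets` and the three-row sample are choices made for this sketch.

```python
from collections import Counter

COLUMNS = ("user_id", "shop", "channle", "os")

def grouping_sets(rows, sets):
    """Count rows once per grouping set; columns outside the set become None."""
    result = Counter()
    for row in rows:
        record = dict(zip(COLUMNS, row))
        for cols in sets:
            key = tuple(record[c] if c in cols else None for c in COLUMNS)
            result[key] += 1
    return result

# Three rows taken from the user_shop_log sample data.
rows = [
    ("u1", "a", "h5", "andriod"),
    ("u2", "b", "h5", "andriod"),
    ("u1", "a", "app", "ios"),
]

# Equivalent in spirit to: grouping sets((shop), (user_id, os))
counts = grouping_sets(rows, [("shop",), ("user_id", "os")])

for key, cnt in sorted(counts.items(), key=lambda kv: str(kv[0])):
    print(key, cnt)
```

One query therefore replaces the separate rpt_* tables, at the price of NULL-padded rows that the consumer has to filter by grouping set.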

Columns to rows and rows to columns:

create table t1(
name string ,
interesting string
)
row format delimited fields terminated by ',';

zuan,王者荣耀
zuan,吃饭
zuan,rap
zuan,唱歌
chaofeng,王者荣耀
chaofeng,睡觉
chaofeng,方亚

load data local inpath "/home/hadoop/tmp/data/exemple/t1.txt" into table t1;

Rows to columns:
array => individual items, using explode

Columns to rows:
xxxx => array

select
name,
collect_list(interesting) as interestings,
concat_ws("|",collect_list(interesting)) as interestings_blk
from t1
group by name ;

concat_ws changes the separator between the array elements
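
As a rough analogy in Python (not Hive), collect_list behaves like appending every value of a group to a list, and concat_ws like joining that list with a separator. The `rows` list is the t1.txt data typed in by hand; note that Hive does not guarantee the element order inside collect_list, whereas this sketch keeps file order.

```python
from collections import defaultdict

# Rows of t1.txt as (name, interesting) pairs.
rows = [
    ("zuan", "王者荣耀"), ("zuan", "吃饭"), ("zuan", "rap"), ("zuan", "唱歌"),
    ("chaofeng", "王者荣耀"), ("chaofeng", "睡觉"), ("chaofeng", "方亚"),
]

# collect_list(interesting) ... group by name: one list per name, duplicates kept.
interestings = defaultdict(list)
for name, interesting in rows:
    interestings[name].append(interesting)

# concat_ws("|", ...): join the list elements with the chosen separator.
interestings_blk = {name: "|".join(items) for name, items in interestings.items()}

for name, joined in interestings_blk.items():
    print(name, joined)
```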

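Field type conversion:
Premise:
Any data type can be converted to string
Numeric values stored as string =>
1. The four arithmetic operations still work, because hive converts the values implicitly
2. Sorting is affected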

sal :
1000
1500
100
900
9000

Descending numeric sort:
9000
1500
1000
900
100

String sort (descending): compared character by character in dictionary order, a-z
9000
900
1500
1000
100
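
The same two orderings can be reproduced with a few lines of Python, which compares strings character by character just as described above. This is only an illustration of the comparison rule, not of Hive internals.

```python
# The sal values, stored as strings the way a string column would hold them.
sal = ["1000", "1500", "100", "900", "9000"]

# Descending sort on the raw strings: compared character by character.
as_strings = sorted(sal, reverse=True)

# Descending sort after converting each value to a number (what cast(... as bigint) achieves).
as_numbers = sorted(sal, key=int, reverse=True)

print(as_strings)  # ['9000', '900', '1500', '1000', '100']
print(as_numbers)  # ['9000', '1500', '1000', '900', '100']
```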

create table t2(
sql string
);

load data local inpath "/home/hadoop/tmp/data/exemple/t2.txt" into table t2;

Possible fixes:
1. Alter the table
2. Cast the type

select
cast(sql as bigint ) as sql_alias
from t2
order by sql_alias;

2. The four "by" clauses
Order By: (global sort)
1. Global sort
2. Only 1 reduce task is used

select * from emp order by empno;

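hive.mapred.mode => in strict mode, some risky queries are not allowed to run [turned off here]
			1. order by => must be used with limit
			2. partitioned tables => must be queried with a partition filter

Sort By: (per-partition sort)
1. Sorts within each partition
2. Each reduce task sorts only its own data
3. Global order is not guaranteed
If there is only 1 reduce task, order by and sort by give the same result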

reduce 1:
10
9
8

reduce 2:
2
1

select * from emp sort by empno;

Adjust the number of reduce tasks:
	mapred.reduce.tasks 
	set mapred.reduce.tasks;

insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/sortby'
select * from emp sort by empno;

Distribute By: (data distribution, unrelated to sorting)
Data distribution

Data:
2020,1w
2020,2w
2020,1w
2020,0.5w
2021,10w
2021,20w
2021,19w
2021,1.5w
2022,1.3w
2022,2w
2022,1w
2022,0.5w

create table hive_distribute(
year string,
earning string
)
row format delimited fields terminated by ',';

load data local inpath "/home/hadoop/tmp/data/exemple/distribute.txt" into table hive_distribute;

-- Data is distributed during the query [like the mr partitioner: rows that match the same rule always end up in the same reducer]

With 2 reduce tasks:

insert OVERWRITE LOCAL DIRECTORY '/home/hadoop/tmp/data/exemple/distribute'
select * from hive_distribute distribute by year sort by earning;

Cluster By :
Cluster By is a short-cut for both Distribute By and Sort By.

distribute by year sort by earning => Cluster By: wrong, these are not equivalent

distribute by year sort by year <=> Cluster By year: correct
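
The Python sketch below simulates `distribute by year sort by earning` with 2 reducers, to show why it is not the same as Cluster By year. The partitioner `reducer_for` is a toy stand-in (year modulo 2) chosen for this illustration, not Hive's actual hash; what matters is only that equal keys always land on the same reducer. Because earning is a string column, each reducer sorts its rows as strings.

```python
from collections import defaultdict

# Rows of distribute.txt as (year, earning) pairs, both strings.
rows = [
    ("2020", "1w"), ("2020", "2w"), ("2020", "1w"), ("2020", "0.5w"),
    ("2021", "10w"), ("2021", "20w"), ("2021", "19w"), ("2021", "1.5w"),
    ("2022", "1.3w"), ("2022", "2w"), ("2022", "1w"), ("2022", "0.5w"),
]
NUM_REDUCERS = 2

def reducer_for(year):
    # Toy partitioner (an assumption for illustration, not Hive's real hash).
    return int(year) % NUM_REDUCERS

# distribute by year: every row goes to the reducer chosen by its year.
reducers = defaultdict(list)
for year, earning in rows:
    reducers[reducer_for(year)].append((year, earning))

# sort by earning: each reducer sorts only its own rows, as strings.
output = {r: sorted(part, key=lambda row: row[1]) for r, part in reducers.items()}

for r in sorted(output):
    print("reducer", r, output[r])
```

In this run reducer 0 holds both 2020 and 2022, and after sorting by earning the two years are interleaved. Sorting by year instead would keep each year together, which is what Cluster By year does.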

Bucketed tables:

Ordinary table: a path (directory)
Partitioned table: a path (directory)
Bucketed table:
	files on hdfs

1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
7,name7
8,name8

[CLUSTERED BY (col_name, col_name, …)
INTO num_buckets BUCKETS]

create table hive_bucket(
id int,
name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ",";

load data local inpath "/home/hadoop/tmp/data/exemple/bucket.txt" into table hive_bucket;

mapreduce:
hash % number of reduce tasks
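
Applied to the hive_bucket table, the same rule decides which bucket file a row belongs to. The sketch below assumes that the hash of an int column is the value itself and that the result is kept non-negative before taking the modulo; treat both as assumptions of this illustration rather than a statement of Hive's exact implementation. The function name `bucket_for` is made up here.

```python
NUM_BUCKETS = 4

def bucket_for(id_value, num_buckets=NUM_BUCKETS):
    # Assumption: an int hashes to itself; the mask keeps the value non-negative.
    return (id_value & 0x7FFFFFFF) % num_buckets

# ids 1..8 from the sample data, grouped by the bucket they would fall into.
buckets = {}
for id_value in range(1, 9):
    buckets.setdefault(bucket_for(id_value), []).append(id_value)

for bucket in sorted(buckets):
    print("bucket", bucket, buckets[bucket])
```

Under these assumptions ids 4 and 8 share bucket 0, 1 and 5 share bucket 1, and so on, giving four files with two rows each.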

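File storage formats and compression
1. Row-oriented storage:
	1. All columns of a row are stored in the same block
	2. The columns inside a block mix many data types
2. Column-oriented storage
	Data is stored column by column

Premise: enterprise tables have tens to hundreds of fields

Row-oriented storage:
	1. A read loads all columns first and then filters out the columns the user needs
	2. If the user queries only a few fields => disk io overhead is high

Formats:
	1. text file: plain text file
	2. SequenceFile: binary key-value file

Column-oriented storage:
	1. RCFile => rows first, then columns (row groups stored column by column)
	2. ORC Files + Parquet
	Good for:
		queries that read only a few columns

	Drawback:
		queries that load every field of the table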

udf: covered later in the project

create table hive_distribute_col(
year string,
earning string
)
row format delimited fields terminated by ','
stored as orc;

load data local inpath "/home/hadoop/tmp/data/exemple/distribute.txt" into table hive_distribute_col;

insert into table hive_distribute_col
select
*
from hive_distribute;

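Column-oriented files take up less space than row-oriented files [assuming both are compressed]

Homework:
1. Learn about file storage formats vs compression in hive
2. Summarize the hive topics covered so far
3. https://blog.csdn.net/qq_37064763/article/details/120673550 (50 exercises)
Try not to look at the answers; check them only if you are truly stuck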

hadoop
hive

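Big data: three things
Data collection: log data: flume, logstash; business data: sqoop, datax; real-time collection: maxwell, flinkcdc
Data storage: Hadoop [hdfs], hive, hbase [big data]; result data after analysis: mysql, clickhouse, doris
Data analysis: mapreduce [no longer used], hive, hbase, spark, flink
Data visualization: front-end development (build it yourself), or open-source frameworks: superset, dataease
echarts env, anv
[rarely used] paid: quickbi, sugar

Message middleware: kafka, pulsar
Ad hoc queries: temporary queries
		sparksql, presto, druid, clickhouse, kylin [cube]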

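Types of data:
1. Business data [MySQL, es] from apps
2. Log data [log] on linux disks: the main focus at work
impression logs, click logs, redirect logs, ...
3. Other data

Architecture:
1. Business data [MySQL] => sqoop, datax => hdfs/hive
2. Log data [log files] => flume => hdfs/hive

3. hive: build the offline data warehouse:
	1. Data layering
	2. Dimensional modeling
	3. Metric output
4. Data visualization
	hive => sqoop => mysql/clickhouse => data visualization => for the boss, product managers, and team members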

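Big data infrastructure platform [building the big data environment from 0 to 1]

Big data data platform [an upgrade built on the infrastructure platform], for awareness:
1.80-99.953 => sql

Big data platform development team
Big data development: 90%
	hive
	1. Offline data warehouse
	2. Real-time data warehouse
	3. adhoc (temporary) queries
Big data etl engineer:
	data extraction, data cleaning, data transformation
Big data operations engineer:
	1. Deployment and ongoing maintenance
	2. Cloud native: docker, k8s
Big data algorithm team: [data analysts (sql + mathematical statistics), data scientists]
	1. User profiling
	2. Data mining:
		python
		the machine learning components bundled with spark and flink

Newer topics:
Data lake =>
Cloud native => docker, k8s
job => yarn resource isolation => k8s; custom development of the yarn resource allocation algorithm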
