Apache Hive: DML

Author: 王二小

Category: Big Data; tags: Big Data

Article link: https://blog.csdn.net/weixin_43056275/article/details/99439723

Class notes

Apache Hive: DDL & DML

Focus: learning the SQL

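Apache Hive–DDL table creation–internal and external tables

Internal table: a table managed by Hive. The data it maps to must sit in the table's own folder under the default warehouse path.
Create the table:
create table student(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';
Load data:
load data local inpath '/root/hivedata/students.txt' overwrite into table student;
When an internal table loads data that is already on HDFS, the data is moved from wherever it sits on HDFS into the folder the table maps to.
External table: the data can sit anywhere on HDFS; the location clause in the table definition says where.
Create the table:
create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' location '/stu';
Load data:
load data inpath '/stu' into table student_ext;
Summary:
1. An internal table has a folder under the database path, and the folder's name is the table's name. An external table has no such folder there.
2. If the data is already on HDFS, loading it into an internal table moves it. An external table leaves the data exactly where it is; a rename marks the mapping as successful.
3. drop table on an internal table deletes the table's metadata and also the structured data files. On an external table it deletes only the metadata (the mapping information); the structured files are left untouched.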
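The drop table difference in point 3 can be modeled outside Hive. The Python sketch below is only an illustration of the behavior described above, not Hive code: the metastore is a plain dict, drop_table and demo are hypothetical helpers, and the sample row is made up.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(metastore: dict, name: str) -> None:
    """Toy model of drop table (hypothetical helper, not a Hive API)."""
    table = metastore.pop(name)          # the metadata always goes away
    if not table["external"]:            # only an internal table loses its files
        shutil.rmtree(table["location"])


def demo() -> dict:
    root = Path(tempfile.mkdtemp())
    managed_dir = root / "warehouse" / "student"   # folder named after the table
    external_dir = root / "stu"                    # arbitrary location
    for directory in (managed_dir, external_dir):
        directory.mkdir(parents=True)
        (directory / "students.txt").write_text("1,Tom,M,20,CS\n")  # made-up row

    metastore = {
        "student": {"location": managed_dir, "external": False},
        "student_ext": {"location": external_dir, "external": True},
    }
    drop_table(metastore, "student")
    drop_table(metastore, "student_ext")

    result = {
        "metastore": dict(metastore),
        "managed_files_exist": managed_dir.exists(),
        "external_files_exist": (external_dir / "students.txt").exists(),
    }
    shutil.rmtree(root)                  # clean up the temporary directory
    return result


if __name__ == "__main__":
    print(demo())
```

After both drops the toy metastore is empty in either case; what differs is whether the files are still on disk.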

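Apache Hive–DDL syntax for altering tables (overview)

STORED AS: by default Hive stores a table as text files.
Hive does not usually load data through insert. Instead, a table is mapped onto structured files that already exist, and once that mapping is defined the table has data.
If during development the table definition turns out not to match the actual requirements:
1. Change it with the alter syntax Hive supports.
2. Or define a new table whose metadata matches the actual situation.
Drop a partition:
ALTER TABLE t_user2 DROP IF EXISTS PARTITION (country='USA');
Add a column:
t_1(id int,name string) partitioned by(country string)
Fields: id name country
After alter table t_1 add columns (age string)
the fields are: id name age country
This shows indirectly that a partition field is not a real field: the data files hold no values for it, and it only identifies the partition.
Change a column:
ALTER TABLE test_change CHANGE a a1 STRING AFTER b;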

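Apache Hive–viewing table information & exploring the metadata

Viewing partition information requires that the table be a partitioned table.
About Hive's metadata (tables in the metastore database):
1. DBS records the mapping between each database and its path on HDFS.
2. TBLS records the mapping between tables and databases, that is, which database a table belongs to.
3. COLUMNS_V2 records the mapping between columns and tables; for each column it stores the type and also the position.
These three hold the so-called mapping information, that is, the table metadata.
You have to make sure yourself that the number and types of the columns you define match the structured data. If they do not match, Hive tries to convert each value; a value that converts is shown converted, and one that does not is shown as null.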
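The "try to convert, otherwise null" rule can be pictured with a small Python model. This is a sketch under assumptions, not Hive's reader: read_row and SCHEMA are hypothetical names, the rows are made up, and the handling of missing or extra fields is my assumption about the default delimited text format.

```python
# Column names and converters, in table order (mirrors the student table).
SCHEMA = [("Sno", int), ("Sname", str), ("Sex", str), ("Sage", int), ("Sdept", str)]


def read_row(line: str, schema=SCHEMA, sep: str = ",") -> list:
    """Apply a table definition to one line of a delimited file (toy model)."""
    fields = line.rstrip("\n").split(sep)
    row = []
    for position, (_name, convert) in enumerate(schema):
        if position >= len(fields):      # the file has fewer fields than the table
            row.append(None)
            continue
        try:
            row.append(convert(fields[position]))
        except ValueError:               # conversion failed: the cell reads as null
            row.append(None)
    return row                           # extra fields in the file are ignored


if __name__ == "__main__":
    print(read_row("1,Tom,M,20,CS"))
    print(read_row("2,Ann,F,twenty,MA"))  # Sage cannot be converted to int
```

The file itself is never changed; the mismatch only shows up when a query reads the data through the table definition.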

Apache Hive–DML–loading data with load–key point: what local refers to

The data comes from local: ask yourself where "local" is. It is the file system of the machine where the Hive service that executes the statement runs (the hiveserver2 host when you connect through beeline), not necessarily the machine you are typing on.
create table student_5(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';
Load local data:
load data local inpath '/root/hivedata/students.txt' into table student_5;
Log output:
Loading data to table itcast.student_5 from file:/root/hivedata/students.txt
Takeaway: loading from local is a pure copy. The data is copied from file:/root/hivedata/students.txt to /user/hive/warehouse/itcast.db/student_5.
In this case load data is equivalent to running
hadoop fs -put file:/root/hivedata/students.txt hdfs://node-1:9000/user/hive/warehouse/itcast.db/student_5
Load HDFS data:
create table student_6(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';
load data inpath '/stu' into table student_6;
Log output:
Loading data to table itcast.student_6 from hdfs://node-1:9000/stu
Takeaway: loading from HDFS is a pure move. The data is moved from hdfs://node-1:9000/stu to hdfs://node-1:9000/user/hive/warehouse/itcast.db/student_6.
In this case load data is equivalent to
hadoop fs -mv hdfs://node-1:9000/stu hdfs://node-1:9000/user/hive/warehouse/itcast.db/student_6
Summary: if inpath is a file, that file is loaded; if it is a directory, everything under it is loaded.
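The copy versus move distinction is easy to try on an ordinary file system. The Python sketch below stands in for both cases with temporary directories; load_data and demo are hypothetical helpers, the file names and rows are made up, and nothing here talks to Hive or HDFS.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(inpath: Path, table_dir: Path, local: bool) -> None:
    """Toy model of load data [local] inpath (hypothetical helper)."""
    table_dir.mkdir(parents=True, exist_ok=True)
    # A file means that one file; a directory means every file under it.
    if inpath.is_dir():
        files = sorted(p for p in inpath.iterdir() if p.is_file())
    else:
        files = [inpath]
    for source in files:
        target = table_dir / source.name
        if local:
            shutil.copy2(source, target)            # local: copy, the source stays
        else:
            shutil.move(str(source), str(target))   # HDFS: move, the source is gone


def demo() -> dict:
    root = Path(tempfile.mkdtemp())
    local_file = root / "hivedata" / "students.txt"
    hdfs_dir = root / "stu"
    local_file.parent.mkdir()
    hdfs_dir.mkdir()
    local_file.write_text("1,Tom,M,20,CS\n")
    (hdfs_dir / "part1.txt").write_text("2,Ann,F,19,MA\n")
    (hdfs_dir / "part2.txt").write_text("3,Bob,M,21,IS\n")

    student_5 = root / "warehouse" / "student_5"
    student_6 = root / "warehouse" / "student_6"
    load_data(local_file, student_5, local=True)
    load_data(hdfs_dir, student_6, local=False)

    result = {
        "local_source_kept": local_file.exists(),
        "student_5": sorted(p.name for p in student_5.iterdir()),
        "hdfs_source_left": sorted(p.name for p in hdfs_dir.iterdir()),
        "student_6": sorted(p.name for p in student_6.iterdir()),
    }
    shutil.rmtree(root)
    return result


if __name__ == "__main__":
    print(demo())
```

The practical consequence: after loading from HDFS the files are no longer at their old path, so anything else that read them from there will stop finding them.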

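Apache Hive–DML–insert–usage conventions

mysql: insert into table values(1,'asd')
hive: insert + select, where the inserted data comes from the query that follows
insert into table a select id from t_source
Summary:
1. The current Hive version also supports insert + values, but it runs very slowly and is almost never used in practice.
2. The usual way is still to map existing structured data onto a table and then analyze it with SQL.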

Apache Hive–DML–insert–multi-insert

Insert from a query
Multi-insert: the source table is named once in from and feeds several insert clauses.
create table source_table (id int, name string) row format delimited fields terminated by ',';
create table test_insert1 (id int) row format delimited fields terminated by ',';
create table test_insert2 (name string) row format delimited fields terminated by ',';
from source_table
insert overwrite table test_insert1
select id
insert overwrite table test_insert2
select name;

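Apache Hive–DML–insert–dynamic partition insert

Dynamic partition insert
-- Whether dynamic partitioning is enabled (default false before Hive 0.9.0, true from 0.9.0 on).
set hive.exec.dynamic.partition=true;
-- Dynamic partition mode. The default, strict, requires at least one partition field to be static; nonstrict allows every partition field to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
Requirement:
Insert the data in dynamic_partition_table into the matching partitions of the target table d_p_t according to the date (day).
Source table:
create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ",";
load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;
2015-05-10,ip1
2015-05-10,ip2
2015-06-14,ip3
2015-06-14,ip4
2015-06-15,ip1
2015-06-15,ip2
Target table:
create table d_p_t(ip string) partitioned by (month string,day string);
Dynamic insert:
insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day from dynamic_partition_table;
Benefit of dynamic partitioning: the partition field values are not hard-coded in the SQL. They are determined at insert time from what the query returns, so whatever the query returns for the partition fields becomes the partition value.
This spares the user from querying the data first to work out each partition value, as static partitioning requires.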
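To see which partitions the dynamic insert above would produce, the routing can be replayed in plain Python on the six sample rows. This is a sketch of the idea only: dynamic_partition_insert is a hypothetical helper, and the month=.../day=... keys follow Hive's usual partition directory naming.

```python
from collections import defaultdict

# The six sample rows from dynamic_partition_table: (day, ip).
ROWS = [
    ("2015-05-10", "ip1"), ("2015-05-10", "ip2"),
    ("2015-06-14", "ip3"), ("2015-06-14", "ip4"),
    ("2015-06-15", "ip1"), ("2015-06-15", "ip2"),
]


def dynamic_partition_insert(rows) -> dict:
    """Route each row to a partition derived from the row itself."""
    partitions = defaultdict(list)
    for day, ip in rows:
        month = day[:7]                  # same as substr(day,1,7)
        partitions[f"month={month}/day={day}"].append(ip)
    return dict(partitions)


if __name__ == "__main__":
    for partition, ips in dynamic_partition_insert(ROWS).items():
        print(partition, ips)
```

Three partitions come out of one statement, and none of their values appears in the SQL text.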

Apache Hive–DML–select

Basic select operations
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, … FROM table_reference
JOIN table_other ON expr
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [ DISTRIBUTE BY col_list ] [SORT BY | ORDER BY col_list] ]
[LIMIT number]
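cluster by: a bucketed query. It names the field to bucket by; the number of buckets comes from the number of reduce tasks.
By default the number of buckets is undefined in the environment: mapreduce.job.reduces=-1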
Number of reduce tasks not specified. Estimated from input data size: 1
mapreduce.job.reduces=2
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
mapreduce.job.reduces=3
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
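Which field a bucketed query splits on is given by cluster by.
How many parts it splits into: if nothing is set, Hive estimates the number from the input data size (1 in the log above); if mapreduce.job.reduces is set, that value decides.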

create table stu_buck2(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) into 4 buckets
row format delimited
fields terminated by ',';

insert overwrite table stu_buck2
select * from student cluster by(Sno);

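Summary: cluster by (xxx) splits the rows on that field and sorts them on the same field in ascending order.
Example requirement: split by student number and, within each bucket, sort by age in descending order.
select * from student cluster by(sno) sort by(sage desc);
This fails with:
cannot have both cluster by and sort by clauses
select * from student distribute by(sno) sort by(sage desc);
distribute by (split) + sort by (sort)
Exporting query results to the file system: with local, think about which machine "local" is, and be careful with overwrite, which replaces whatever is in the target directory.
insert overwrite local directory '/root/aaa' select * from student distribute by(sno) sort by(sage desc);
If the split field is the same as the sort field:
cluster by = distribute by + sort by
order by sorts globally, so there can be only one reduce task.
Number of reduce tasks determined at compile time: 1
To guarantee a global order, Hive ignores the number of reduce tasks set in the environment at compile time and uses 1.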
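The relationship between distribute by, sort by, cluster by and order by can be sketched in Python. Everything here is illustrative: the function names and the student rows are made up, and bucket_of assumes that an int key hashes to itself before the modulo, which is my assumption about the partitioner rather than something stated in these notes.

```python
def bucket_of(key: int, num_reducers: int) -> int:
    # Assumption: an int key hashes to itself, then modulo picks the reducer.
    return (key & 0x7FFFFFFF) % num_reducers


def distribute_sort(rows, distribute_key, sort_key, num_reducers, descending=False):
    """distribute by one field, then sort by another inside each reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[bucket_of(row[distribute_key], num_reducers)].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda row: row[sort_key], reverse=descending)
    return buckets


def cluster_by(rows, key, num_reducers):
    """cluster by = distribute by + sort by on the same field, ascending."""
    return distribute_sort(rows, key, key, num_reducers)


def order_by(rows, sort_key):
    """A global order needs everything in one reducer."""
    return distribute_sort(rows, sort_key, sort_key, 1)[0]


# Made-up rows: student number and age.
STUDENTS = [
    {"sno": 5, "sage": 18}, {"sno": 2, "sage": 19}, {"sno": 6, "sage": 23},
    {"sno": 1, "sage": 20}, {"sno": 4, "sage": 21}, {"sno": 3, "sage": 22},
]

if __name__ == "__main__":
    for number, bucket in enumerate(cluster_by(STUDENTS, "sno", 2)):
        print(number, [row["sno"] for row in bucket])
```

Each bucket is sorted on its own, so reading the buckets one after another does not give a globally sorted result; only the single-reducer order_by does.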

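Apache Hive–automatic switching to local mode

Automatic local mode:
set hive.exec.mode.local.auto=true;
Turning it on does not mean every query runs in local mode; Hive decides per query, based on conditions, whether to run locally or on the cluster.
Cluster mode: the SQL is compiled to MapReduce and submitted to YARN.
Local mode: the SQL is compiled to MapReduce and run by local threads that simulate MapReduce. The log shows: Job running in-process (local Hadoop)
Switching conditions: the amount of input data and the number of reduce tasks.
Large input with several reduce tasks -> tends toward cluster mode
Small input with one reduce task -> tends toward local mode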
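The decision can be written down as a small predicate. This is a hedged sketch, not Hive's code: runs_in_local_mode is a hypothetical function, and the two limits are assumptions based on my recollection of the defaults for hive.exec.mode.local.auto.inputbytes.max (128 MB) and hive.exec.mode.local.auto.input.files.max (4); the exact comparisons Hive makes at the boundaries are not modeled.

```python
def runs_in_local_mode(input_bytes: int, input_files: int, reduce_tasks: int,
                       auto: bool = True,
                       max_bytes: int = 128 * 1024 * 1024,   # assumed default
                       max_files: int = 4) -> bool:          # assumed default
    """Guess whether a query would run in local mode (toy model)."""
    if not auto:                         # hive.exec.mode.local.auto=false
        return False
    return (input_bytes <= max_bytes
            and input_files <= max_files
            and reduce_tasks <= 1)


if __name__ == "__main__":
    print(runs_in_local_mode(10 * 1024, 1, 1))        # small input, one reducer
    print(runs_in_local_mode(2 * 1024 ** 3, 1, 1))    # 2 GB of input
```

Setting mapreduce.job.reduces to 2 or 3, as in the select experiments, is therefore enough to push even a tiny query back to the cluster in this model.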

Apache Hive–join syntax

The various joins in Hive
Prepare the data (a.txt first, then b.txt):
1,a
2,b
3,c
4,d
7,y
8,u

2,bb
3,cc
7,yy
9,pp

Create the tables:

create table a(id int,name string)
row format delimited fields terminated by ',';

create table b(id int,name string)
row format delimited fields terminated by ',';

Load the data:

load data local inpath '/root/hivedata/a.txt' into table a;

load data local inpath '/root/hivedata/b.txt' into table b;

Experiments:
inner join

select * from a inner join b on a.id=b.id;

select a.id,a.name from a join b on a.id = b.id;

select a.* from a join b on a.id = b.id;

left join
select * from a left join b on a.id=b.id;
right join
select * from a right join b on a.id=b.id;

select * from b right join a on b.id=a.id;

full outer
select * from a full outer join b on a.id=b.id;
A join specific to Hive: left semi join
select * from a left semi join b on a.id = b.id;

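Equivalent to:
select a.id,a.name from a where a.id in (select b.id from b); (very inefficient in Hive)
select a.id,a.name from a join b on (a.id = b.id);
select * from a inner join b on a.id=b.id;
The two join forms return the same rows as the semi join only while b.id has no duplicates, and the select * form also returns b's columns.

cross join (use with caution)
Returns the Cartesian product of the two tables; no join key is needed.
select a.*,b.* from a cross join b;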
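Since the result screenshots are not available, the joins above can be replayed on the sample data in plain Python. This is a sketch of the join semantics only: the function names are hypothetical, the lists A and B hold the a.txt and b.txt rows, and Hive does not promise the row order shown here.

```python
A = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
B = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]


def inner_join(a, b):
    return [ra + rb for ra in a for rb in b if ra[0] == rb[0]]


def left_join(a, b):
    out = []
    for ra in a:
        matches = [rb for rb in b if rb[0] == ra[0]]
        if matches:
            out.extend(ra + rb for rb in matches)
        else:
            out.append(ra + (None, None))   # no partner: b's columns are null
    return out


def right_join(a, b):
    # Keep every row of b: run left_join(b, a) and swap the column halves back.
    return [row[2:] + row[:2] for row in left_join(b, a)]


def full_outer_join(a, b):
    a_ids = {ra[0] for ra in a}
    return left_join(a, b) + [(None, None) + rb for rb in b if rb[0] not in a_ids]


def left_semi_join(a, b):
    b_ids = {rb[0] for rb in b}
    return [ra for ra in a if ra[0] in b_ids]   # only a's columns, each row once


def cross_join(a, b):
    return [ra + rb for ra in a for rb in b]    # every pairing, no join key


if __name__ == "__main__":
    for row in full_outer_join(A, B):
        print(row)
```

On this data the inner join keeps ids 2, 3 and 7, the left join adds 1, 4 and 8 with nulls on the b side, the right join adds 9 with nulls on the a side, and the cross join returns 6 x 4 = 24 rows.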

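Startup command sequence

The commands, in order:
1. In except/script, run sh zk.sh (makes sure zk is started).
2. Still in that directory, run sh journalnode.sh (starts the journalnode; this cluster is high-availability, as production clusters usually are).
3. From any directory, run start-all.sh (starts hadoop; it is already set up globally).

4. From anywhere, type hive and press Enter (starts hive; it is also set up globally).
5. node1: /export/servers/hive-1.2.1/bin/hiveserver2
node3: /export/servers/hive-1.2.1/bin/beeline
!connect jdbc:hive2://node1:10000
6. Switch the nn state: hdfs haadmin -transitionToActive --forcemanual nn1
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode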

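Apache Hive–built-in operators and functions & testing with dual

dual comes from Oracle: it is a dummy table with no real meaning and no meaningful data (a single placeholder row).
Oracle ships with it by default, and it is used for all kinds of quick tests.
create temporary function itcastfunc as 'cn.itcast.hive.udf.ItcastUDF';
Once the custom function has been written, a temporary function name has to be created and bound to it.
The function belongs to the session that created it: whoever creates it calls it, and when that session disconnects the function stops working.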

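Website traffic log analysis system–the clickstream model & how to build it

A click stream is the trail a user leaves while continuously visiting a website.
Web access logs are the site's traffic logs, written from the website's point of view: they record visits in raw form, as scattered points.
The clickstream model takes its data from the web access logs: stringing each user's records together in time order turns the points into a line.
Click after click, these actions make up the click stream.
A click stream is seen from the user's point of view.
Note: a user may come to the site several times a day, and each visit is a session.
How one session is told apart from the next is defined by the business; here, by whether the time gap between two consecutive access records is within 30 minutes.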
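The 30-minute rule translates directly into code. The sketch below splits one user's access times into sessions; split_sessions is a hypothetical helper, the timestamps are made up, and treating a gap of exactly 30 minutes as the same session is my assumption.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group one user's access times into sessions by the gap rule."""
    sessions = []
    for ts in sorted(timestamps):        # string the points together in time order
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)      # close enough to the previous record
        else:
            sessions.append([ts])        # too far apart: a new session starts
    return sessions


if __name__ == "__main__":
    visits = [datetime(2019, 8, 13, h, m)            # made-up access times
              for h, m in [(10, 0), (10, 10), (10, 50), (11, 5), (15, 0)]]
    for session in split_sessions(visits):
        print([ts.strftime("%H:%M") for ts in session])
```

Here 10:10 to 10:50 is a 40-minute gap, so the second session starts at 10:50, and the 15:00 visit forms a third session on its own.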

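Website traffic log analysis system–common analysis models

Traffic quality analysis (traffic analysis)
Multi-dimensional traffic breakdown (traffic analysis)
Content and navigation analysis (content analysis)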

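Website traffic log analysis system–common analysis metrics

Classic metrics
IP: the number of distinct IP addresses that visit the site within one day. The same IP visiting many times in a day counts once. IP used to stand in for visitor identity; today it is used more to obtain the visitor's geographic location.
PageView: the PV value. Each time a user opens a page of the site, one PV is recorded, and opening the same page repeatedly adds to the count each time. Put simply, it is the total number of page loads.
Unique Visitor (UV): the number of distinct users who visit the site within one day, identified by browser cookie. The same visitor coming many times in a day counts once.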
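The three counts differ only in what is deduplicated, which a few lines of Python make concrete. This is a sketch with assumptions: daily_metrics is a hypothetical helper, and the record layout (ip, cookie id, url) and the sample log are made up.

```python
def daily_metrics(records) -> dict:
    """records: (ip, cookie_id, url) tuples for one day (hypothetical layout)."""
    records = list(records)
    return {
        "pv": len(records),                                    # every page load counts
        "uv": len({cookie for _ip, cookie, _url in records}),  # distinct cookies
        "ip": len({ip for ip, _cookie, _url in records}),      # distinct IP addresses
    }


# Made-up log: two browsers share one IP, and one page is opened twice.
LOG = [
    ("1.1.1.1", "c1", "/index"),
    ("1.1.1.1", "c1", "/index"),
    ("1.1.1.1", "c2", "/list"),
    ("2.2.2.2", "c3", "/index"),
]

if __name__ == "__main__":
    print(daily_metrics(LOG))
```

In Hive the same counts would typically come from count(*) and count(distinct ...) over one day's records.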
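Summary: website analysis -> carry out the analysis around business metrics (reading the business metrics correctly matters a lot).
Getting the metrics out of Hive is not hard; put more thought into how to present the data more intuitively.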