Hive核心

最新推荐文章于 2023-10-04 11:55:01 发布

XiuL

最新推荐文章于 2023-10-04 11:55:01 发布

阅读量146

点赞数

分类专栏：大数据组件命令介绍

本文链接：https://blog.csdn.net/qq_35310348/article/details/115130801

版权

大数据组件命令介绍专栏收录该内容

12 篇文章 0 订阅

订阅专栏

Hive的客户端和服务端

客户端

bin/hive

服务端

元数据服务：nohup hive --service metastore 2>&1 &
jdbc服务：nohup bin/hive --service hiveserver2 2>&1 &

通过命令执行

  使用-e参数执行hql语句：
	bin/hive -e "use test;select * from student;"
  使用-f参数通过指定文本文件执行hql语句：
	bin/hive -f hive.sql
	hive.sql: usr test;select * from student;

Hive数据库/表描述命令

查看所有数据库

show databases like '*hive*';

创建一个数据库

create databse lwg;(未指定存储路径，数据存储在默认的配置hive.metastore.warehouse.dir中)
create databse lwg01 location '/lwg03';自己指定hdfs路径

查看数据库详细信息

desc database [extended] lwg;  主要是看路径location

删库

drop database lwg; 如果数据库下有表，会报错
强制删库： drop database lwg cascade;

查看表的类型

desc formatted tablename; 可以查看表是内部表还是外部表

查看分区

show partitions score;

清空表数据

truncate table student; 不能清空外部表

Hive源数据库相关表介绍

DBS

主要记录hive库的信息，例如：库再hdfs的路径，库名，相关权限和角色。

TBLS

主要记录所有表的信息，例如：表名，所属库名，用户信息，内部表还是外部表。

TABLE_PARAMS

主要记录某个表，对应的文件数numFiles，对应的行记录数numRows。
问题： select count(*) from table1的几种结果：
			A）count(*)直接查询的就是TABLE_PARAMS中的行记录数numRows,不走MR.
			B）直接通过hdfs dfs -put把文件加载到对应的hive的location目录，这样查询的话：numFiles和numRows两个值都是0.
			C）通过insert into table values加载数据，numFiles和numRows两个值都会改变，查询的值是真实值。
			D）通过load data hdfs路径把文件加载到对应的表，这样的话：numFiles会修改，但是numRows不会读取文件内部具体信息。

PARTITIONS

主要是记录表和分区的信息。

Hive建表语法

关键字介绍

   external：表示是一个外部表。注意：内部表数据会移入真实数仓路径，删除表会删除元数据和数据。外部表指向了hdfs一个路径，删除表只删元数据，不删数据。例如：日志数据定期上传到hdfs，基于外部表做分析，用到的中间表，结果表使用功能内部存储。数据通过select + insert 进入内部表。
   like：复制表结构，不复制数据
   As：通过查询其他表(部分字段)构建表
   row format delimited fields terminated by :每行数据字段间隔符号，包括基本类型，map类型，集合类型
   stored as textfile|rcfile|sequencefile|...：如果数据文件是纯文本，用textfile，如果需要压缩，用sequencefiel
   partition by:分区
   clustered by:分桶
   location：指定表的存储位置
	 1）什么时候用内部表，什么时候用外部表：一般都使用内部表，数据可以共用，更加安全。只有中间表才使用内部表。
	 2）内外表的修改：alter table student2 set tblproperties('EXTERNAL'='TRUE'); TRUE:外表。

创建普通表

create table stu(id int,name string);

创建表指定分隔符号

create table if not exists stu2(id int,name string) row format delimited fields terminated by '\t' stored as textfile location 'user/stu2';

复制表结构和数据

create table stu3 as select * from stu2;

复制表结构，不要数据

create table stu4 like stu2;

创建分区表

一个分区：
create table sc(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';
多级分区：create table sc2(s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';
对应在hdfs中是  year目录-->里面有month目录-->里面有day目录

修改表的操作

	1）修改表名：ALTER TABLE table_name RENAME TO new_table_name
	2）添加多个分区: alter table score add partition(month='xx') partition(month='xx1');
	3）删除分区：alter table score drop partition(month='xx');
	4）添加列：alter table dept add columns(deptdesc string);
	5）更新列名称：alter table dept change column deptdesc desc string;类型不能变
	6）全部更新列：alter table dept replace columns(deptno string, dname string, loc string); 字段可多可少。
	7）删除表： drop table o;

Hive加载数据语法

关键字介绍

    1）load data:加载数据
	2）local：本地数据（加载后本地文件还存在），不加表示是hdfs数据（加载后hdfs原路径不会存在，类似于剪切。只把元数据信息修改了，真实数据的存储位置不会修改。）
	3）inpath: 加载数据的路径
	4）overwrite: 覆盖表中已有数据，否则追加
	5）into table：加载到哪个表
	6）partition：指定数据分区

insert语句

  insert into stu values(1,'zs');底层执行的是mapreduce任务
  添加数据到某个分区：
	 insert into table score partition(month='211') values('11','11','11');
  读取某张表的数据，插入到另张表的某个分区：
	 insert overwrite(into) table score partition(month='22') select id,uid,sc from score3; 
	 overwrite:只会覆盖对应分区的数据
	 into:追加方式
  读取某张表的数据，插入到其他两个表的某个分区：
	 from score insert overwrite table score_first partition(month='11') select id,uid,sc insert overwrite table score_second partition(month='22') select id,uid,sc;

加载本地服务器文件数据到表

 load data local inpath '/服务器路径/a.csv' overwrite into table teacher;

从hdfs加载数据到表

 load data inpath '/hdfs/stu.csv' into table stu;

根据查询结果创建表和加载数据

create table if not exists student3 as select id, name from student;

建表的时候通过location指定一个已经存在的hdfs路径的文件，就可以直接加载数据

create external table if not exists student5( id int, name string) row format delimited fields terminated by '\t' location '/student;

Hive导出数据语法

将查询结果导出到本地：目录可以不存在

insert overwrite local directory '/xx/data' select * from score;

将查询结果格式化导出到本地

insert overwrite local directory '/xx/data' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

将查询的结果导出到hdfs上

 insert overwrite directory '/hdfs/data' row format delimited fields terminated by '\t' collection iterm terminated by '#' select * from stu;

hdfs命令导出到本地

 hdfs dfs -get /hdfs/xx/data/000_0 /local/data/stu.txt

shell命令导出

bin/hive -e 'select * from default.student;' >/data/export/student4.txt;

导出到hdfs上

  (注意：export导出的会自动生成一个_meta的文件夹)
  export table score to '/hdfs/data/score';
  import table xxx  from '/student1';
	1）import导入的路径，必须是export导出的路径(因为这里面有_meta的元数据文件夹)
	2）import导入的表，要么不存在，如果存在，数据必须为空。
	3）import和export主要用户两套hadoop集群间，hive的数据迁移。

Hive查询语法

官网： https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

基本查询

逻辑运算

and  or  not

是否空值

is null / is not null

模糊查询

like和rlike   _代表一个字符，%代表零个或者多个字符. rlike可以接正则表达式

聚合函数

count()  max()  min()   sum()  avg()  limit(limit只能接一个参数，返回多少行数据)

范围查询

between 80 and 100（都包含）     where score in(80,90)

分组

group by   分组后的条件：having
select deptno,avg(sal) avg_sal from emp group by deptno having avg_sal>2000.
	注意：为什么不能用where,因为sql的执行顺序是，from=>where=>select.别名都没有生成，就用了这个别名所以报错。只能用嵌套where。

Join查询

内连接

两个表必须有全部相同的一个值，才会被连接上。
	select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

左外连接

JOIN 操作符左边表中符合 WHERE 子句的所有记录将会被返回,如果右表没有，使用NULL。
	select e.empno, e.ename, d.deptno, d.dname from emp e left join dept d on e.deptno = d.deptno;

右外连接

JOIN 操作符右边表中符合 WHERE 子句的所有记录将会被返回
	select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

满外连接

将会返回所有表中符合 WHERE 语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用 NULL 值替代。
	select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

多表连接

	SELECT e.ename, d.dname, l.loc_name FROM	emp e JOIN	dept d ON	d.deptno = e.deptno JOIN		location l ON	d.loc = l.loc;

笛卡尔积的产生条件

	1）省略连接条件
	2）连接条件无效
	3）所有表中的所有行互相连接
	select empno, dname from emp, dept;

排序

order by

全局排序，只有一个 Reducer，不适合数据量大
	select * from emp order by sal desc;

sort by

每个reducer都排序。mapred.reduce.tasks>1
	select * from emp sort by deptno desc;
	查看文件内容：insert overwrite local directory '/opt/module/data/sortby-result' select * from emp sort by deptno desc;
	数据是随机分散的，有一个seed种子。为了数据分散，避免数据倾斜。

distribute by

将数据分散到不同reducer，hash算法
	用法：不同的部门，划分到不同的分区中，可以使用自定义分区。一个分区保证数据在一个reducer中执行。然后对这个分区内部的数据使用sort by。
	insert overwrite local directory	'/opt/module/data/distribute-result' select * from emp distribute by deptno sort by empno desc;

cluster by

将数据分散到不同reducer，还会对该字段排序
	当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式。但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC。
	select * from emp cluster by deptno;
	等价于：select * from emp distribute by deptno sort by deptno;

Hive的分区表

本质

本质：分区表实际上就是对应一个 HDFS 文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive 中的分区就是分目录。查询时通过 WHERE 子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多。
避免了全表扫描，生成环境使用：按天生成分区。

一个分区的基本操作

   1）创建分区：
		create table dept_partition(deptno int, dname string, loc string) partitioned by (day string)
        row format delimited fields terminated by ' ';
	2）加载数据：分区表加载数据必须要指定分区。
		load data local inpath '/opt/module/datas/dept1.txt' into table dept_partition partition(day='20200401');
		load data local inpath '/opt/module/datas/dept2.txt' into table dept_partition partition(day='20200402');
		load data local inpath '/opt/module/datas/dept3.txt' into table dept_partition partition(day='20200403');
	3）查询分区表数据：
		单分区：select * from dept_partition where day='20200401';
		多分区：select * from dept_partition where day='20200401' union select * from dept_partition where day='20200402' union select * from dept_partition where day='20200403';
		等价于：select * from dept_partition where day='20200401' or day='20200402' or day='20200403';
	4）新增一个或多个分区：alter table dept_partition add partition(day='20200405') partition(day='20200406');
	5）删除一个或多个分区：alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
	6）查看表的分区：show partitions dept_partition;
	7）查看分区表结构：desc formatted dept_partition;

二级分区

如果一天的日志数据量也很大，需要对每个小时也进行分区。
	1）创建二级分区：create table dept_partition2( deptno int, dname string, loc string) partitioned by (day string, hour string) row format delimited fields terminated by ' ';
	注意：在hdfs上的目录结构是，先存在day的分区目录，里面再有hour的分区目录。
	2）正常加载数据：
		A）load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');
		B）
	3）查询：select * from dept_partition2 where day='20200401' and hour='12';

分区数据加载-把数据直接上传到分区目录，分区表和数据产生关联的方法

    1）上传数据后修复
		A）先去hdfs上创建一个对应分区的目录：hdfs dfs -mkdir /user/hive/warehouse/dept_partition2/day=20200401/hour=13
		B）把本地的数据或者hdfs的数据上传到该目录：hdfs dfs -put dept1.txt /user/hive/warehouse/dept_partition2/day=20200401/hour=13
		C）直接查询数据，无法查到：select * from dept_partition2 where day='20200401' and hour='13';
		D）修复数据：msck repair table dept_partition2;（原因是没有创建这个分区）
		E）检测hdfs目录，把分区信息写入到mysql
	2）上传数据后添加分区
		A）先去hdfs上创建一个对应分区的目录：hdfs dfs -mkdir /user/hive/warehouse/dept_partition2/day=20200401/hour=14
		B）把本地的数据或者hdfs的数据上传到该目录：hdfs dfs -put dept1.txt /user/hive/warehouse/dept_partition2/day=20200401/hour=14
		C）添加分区：alter table dept_partition2 add partition(day='20200401',hour='14');
		D）查询数据：select * from dept_partition2 where day='20200401' and hour='14';
	3）创建文件夹后load数据到分区	
		A）先去hdfs上创建一个对应分区的目录：hdfs dfs -mkdir /user/hive/warehouse/dept_partition2/day=20200401/hour=15
		B）load data local inpath '/opt/module/datas/dept1.txt' into table dept_partition2 partition(day='20200401',hour='15');
		C）load会去修改元数据信息。
		D）如果load的时候不指定分区。因为读取的数据是本地的，而load会触发MR任务，会在某一个节点上启动任务，如果刚好数据和启动任务节点在一个服务器上，会执行成功。否则会失败。并且默认会在hdfs上生称一个HIVE_DEFAULT_PARTITION的分区。

动态分区

1）开启动态分区功能：默认开启
		set hive.exec.dynamic.partition=true
	2）设置为非严格模式：
		set hive.exec.dynamic.partition.mode=nonstrict
	3）创建目标分区表：create table dept_partition_dy(id int, name string) partitioned by (loc int) row format 	delimited fields terminated by '\t';
	4）加载数据，必须用select colum1,clomu2...的方式
		insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
		其中：deptno, dname对应了目标表中的字段。 loc对应了动态分区的字段。
	5）Hive3.x新特性。
		A）加载数据的时候，可以不指定分区字段的名称。默认使用最后一个字段。
			insert into table dept_partition_dy partition select deptno, dname, loc from dept;
		B）如果使用上述的加载数据方式，可以不用设置为非严格模式。

Hive的函数

查看所有函数

show functions;

查看函数的用法

desc function [extended] upper;

函数的类型

函数的类型： （输入数据的行数）
	UDF: 一进一出
	UDTF: 多进一出
	UDAF: 一进多出

内置函数

	1）nvl(字段名,'默认值')  如果字段值不为空，返回字段值。如果为空，返回默认值。
	2）case when then else end: 分组后，对分组内部按照其他条件选择的场景。
		select dept_id,sum(case sex when '男' then 1 else 0 end) male_count, sum(case sex when '女' then 1 else 0 end) female_count from emp_sex group by dept_id;

XiuL

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive核心

Hive的客户端和服务端客户端bin/hive 服务端元数据服务：nohup hive --service metastore 2>&1 &jdbc服务：nohup bin/hive --service hiveserver2 2>&1 &通过命令执行使用-e参数执行hql语句： bin/hive -e "use test;select * from student;" 使用-f参数通过指定文本文件执行hql语句： bin
复制链接

扫一扫