Hive Basic Operations

Latest recommended article published 2024-05-15 02:32:10

方兵兵


Article link: https://blog.csdn.net/u010800708/article/details/86619182

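1) Hive overview

Apache Hive is data warehouse software that makes it easier to read, write, and manage large data sets residing in distributed storage using SQL.
Structure can be projected onto data that is already in storage, and a command-line tool and a JDBC driver are provided for connecting users to Hive.

Data computation: MapReduce distributed computing -> hard to write by hand
Hive -> SQL statements (similar to MySQL), which simplifies development and lowers the learning cost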

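2) Pros and cons

Pros:  
	(1) The interface is SQL, which simplifies development and lowers the learning cost  
	(2) No need to hand-write MapReduce programs  
	(3) Hive has high execution latency, so it is mostly used where real-time response is not required   

	(4) Its strength is processing large data sets  
	(5) Supports user-defined functions  

Cons:  
	(1) Hive's SQL dialect (HQL) has limited expressive power  
	(2) Hive's execution efficiency is low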

3) Hive architecture

4) Configure Hive
Configure hive-env.sh
HADOOP_HOME: the Hadoop installation path

HADOOP_HOME=/root/hd/hadoop-2.8.5  
export HIVE_CONF_DIR=/root/hd/hive/conf

5) Create two directories in HDFS
These two directories are Hive's default directories

hdfs dfs -mkdir /tmp  
hdfs dfs -mkdir -p /user/hive/warehouse

6) Configure MySQL as Hive's metastore database
After installing MySQL, create a hive-site.xml file in Hive's conf directory

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
     <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://bigdata121:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
</configuration>

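Hive's metastore database is switched to MySQL because the bundled Derby database does not support several terminals accessing it at the same time, whereas MySQL does.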

Hive data types

Java type   Hive type   Length
byte        TINYINT     1 byte
short       SMALLINT    2 bytes
int         INT         4 bytes
long        BIGINT      8 bytes
float       FLOAT       4 bytes
double      DOUBLE      8 bytes
String      STRING      variable length
            TIMESTAMP   timestamp
            BINARY      byte array
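As a rough illustration of the types listed above, the sketch below declares one column per type and shows a type conversion with cast. The table name type_demo and its column names are hypothetical and are not part of the original tutorial.

```sql
-- Hypothetical table, used only to show one column of each type.
create table if not exists type_demo(
    tiny_col   tinyint,
    small_col  smallint,
    int_col    int,
    big_col    bigint,
    float_col  float,
    double_col double,
    str_col    string,
    ts_col     timestamp,
    bin_col    binary
)
row format delimited fields terminated by '\t';

-- cast() converts between types; a string that is not a number becomes NULL.
select cast('123' as int), cast('abc' as int);
```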

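DDL (data definition)

1) Create a database
If no path is specified, the database is created under the default directory, /user/hive/warehouse/

List databases  
show databases;  

Create a database  	
create database hive_db;

Recommended form for creating a database 
create database if not exists db_hive;

Create a database at a specified path  
create database hive1_db location '/hive1_db';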

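2) Alter a database

Show the database description  
desc database hive_db;

Add descriptive properties 
alter database hive_db set dbproperties('data'='tony');

Show the database description including the added properties
desc database extended hive_db;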

4) Query and drop databases

List databases  
show databases;

Filter the listed databases by pattern  
show databases like 'hive*';

Drop a database  
drop database hive_db;
Recommended form  
drop database if exists hive_db;

5) Create a table

Create a table  
create table db_h(id int,name string)
row format
delimited fields
terminated by "\t";

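6) Managed tables

Not well suited to sharing data.
Dropping a managed table in Hive also deletes its data.

If a Hive SQL statement involves no computation, no MapReduce job is needed.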
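The difference can be observed from the Hive CLI. The sketch below assumes a table named emp already exists; which queries skip MapReduce depends on the hive.fetch.task.conversion setting and on the Hive version, so treat the comments as typical behavior rather than a guarantee.

```sql
-- A plain scan is typically answered by a fetch task, without a MapReduce job.
select * from emp limit 5;

-- An aggregation has to compute over all rows, so a job is launched.
select count(*) from emp;

-- Print the setting that controls which queries may skip MapReduce.
set hive.fetch.task.conversion;
```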

Load data
load data local inpath '/root/itstar.txt' into table emp;

Query and save the result to a new table
create table if not exists emp2 as select * from emp where name='hunter';

Show the table structure
desc formatted emp;
Table Type 		MANAGED_TABLE

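7) External tables

Hive does not consider itself the owner of the data, so dropping the table does not delete the data.
Well suited to sharing data.  

Create an external table

create external table if not exists emptable(empno int,ename string,job string,mgr int,birthdate string,sal double,comm double,deptno int)
row format
delimited fields
terminated by "\t";

Load data 
load data local inpath '/root/emp.txt' into table emptable;

Show the table structure
desc formatted emptable;
Table Type:	EXTERNAL_TABLE

Drop the table
drop table emptable;

Note: if a table with the same name and the same columns is created again, it is automatically associated with the existing data.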
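The drop-and-recreate behavior described in the note can be sketched as follows. The table name emp_ext and the HDFS directory /data/emp_ext are hypothetical; the explicit location clause is used here only to make it obvious where the files live.

```sql
-- Hypothetical HDFS directory; replace it with a directory that holds your files.
create external table if not exists emp_ext(empno int, ename string, deptno int)
row format delimited fields terminated by '\t'
location '/data/emp_ext';

-- Removes only the metadata; the files under /data/emp_ext stay in HDFS.
drop table emp_ext;

-- Recreating the table over the same location makes the data queryable again.
create external table if not exists emp_ext(empno int, ename string, deptno int)
row format delimited fields terminated by '\t'
location '/data/emp_ext';

select * from emp_ext limit 5;
```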

8) Partitioned tables

Create a partitioned table

create table dept_partitions(deptno int,deptname string,loc string)
partitioned by(day string)
row format delimited fields terminated by '\t';

Load data
load data local inpath '/root/depart.txt' into table dept_partitions;
Note: loading the usual way fails with an error, because the target partition must be specified

load data local inpath '/root/depart.txt' into table dept_partitions partition(day='1112');

Add a partition
alter table dept_partitions add partition(day='1113');

Query a single partition
select * from dept_partitions where day='1112';
Query all partitions
select * from dept_partitions;

Show the table structure
desc formatted dept_partitions;

Drop a partition
alter table dept_partitions drop partition(day='1112');
Drop multiple partitions
alter table dept_partitions drop partition(day='1112'),partition(day='1113');

Rename a table
alter table emptable rename to empt;
Add a column
alter table dept_partitions add columns(desc string);
Change a column's name and type
alter table dept_partitions change column desc desccc int;
Replace all columns
alter table dept_partitions replace columns(desc int);
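Coming back to the partitions themselves: each partition of dept_partitions corresponds to a subdirectory of the table's directory. The sketch below shows one way to inspect this, assuming the table sits in the default database under the default warehouse path /user/hive/warehouse (an assumption; the Location field of desc formatted gives the real path).

```sql
-- List the partitions that the metastore knows about.
show partitions dept_partitions;

-- List the table directory from inside the Hive CLI. Each partition is
-- expected to appear as a subdirectory such as day=1112.
-- The path assumes the default warehouse location.
dfs -ls /user/hive/warehouse/dept_partitions;
```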

DML (data manipulation)

1) Load local data into a table  
load data local inpath '/root/hunter.txt' into table hunter;
2) Load data from HDFS
load data inpath '/hunter.txt' into table hunter;
Note: this is equivalent to a cut; the source file is moved into the table's directory
3) Overwrite existing data
load data local inpath '/root/hunter' overwrite into table hunter;

4) Create a partitioned table  
create table hunter_partitions(id int,name string) partitioned by (month string) row format delimited fields terminated by '\t';

5) Insert data into a partitioned table
insert into table hunter_partitions partition(month='201811') values(1,'tongliya');

6) Store query results in a new table
create table if not exists hunter_pp as select * from hunter_partitions;

create table if not exists hh(id int,name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/hunter.txt';

7) Load data when creating a table
create table db_h(id int,name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/hunter.txt';

8) Export query results to the local file system
insert overwrite local directory '/root/datas/yangmi.txt' select * from hh where name='yangmi';
Note: here yangmi.txt is a directory; the results are written inside it.

bin/hive -e "select * from hh where name='yanmi'" > /root/yangmi.txt
Note: here yangmi.txt is a file. To add a where condition, use double quotes on the outside and single quotes on the inside.

dfs -get /user/hive/warehouse/hunter/hunter.txt /root;
Hadoop's hdfs dfs -get works the same way inside Hive

Queries

1) Configure header display for query output
In hive-site.xml

  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property> 
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>

Resulting prompt
hive (default)>

show tables; now displays the tab_name header
tab_name
db_h
dept_partitions

Querying a table now displays the column names
hive (default)> select * from hunter;

hunter.id   hunter.name
1       	hunter
2       	zhangsan
3       	delireba
4       	yanmi
5       	baby

2) Basic queries
Select all columns
select * from empt;
Select specific columns
select empt.ename,empt.empno from empt;
Column alias
select ename name,empno from empt;

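Arithmetic operators

Operator	Description
+	addition
-	subtraction
*	multiplication
/	division
%	modulo (remainder)
&	bitwise AND
|	bitwise OR
^	bitwise XOR
~	bitwise NOT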
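A quick way to try these operators is to evaluate constant expressions. This is a small sketch; it relies on Hive accepting a select without a from clause, which holds for reasonably recent versions.

```sql
-- Arithmetic: the columns are 9, 5, 14, 3.5, 1 (division returns a double).
select 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 % 2;

-- Bitwise: the columns are 2, 7, 5, -7.
select 6 & 3, 6 | 3, 6 ^ 3, ~6;
```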

Functions

(1) Count rows
select count(*) from emptable;
(2) Maximum
select max(emptable.sal) sal_max from emptable;
(3) Minimum
select min(emptable.sal) sal_min from emptable;
(4) Sum
select sum(emptable.sal) sum_sal from emptable;
(5) Average
select avg(emptable.sal) avg_sal from emptable;
(6) First two rows
select * from emptable limit 2;

where clause
(1) Employees whose salary is greater than 1700
select * from emptable where emptable.sal>1700;
(2) Employees whose salary is less than 1800
select * from emptable where emptable.sal<1800;
(3) Employees whose salary is between 1500 and 1800 (inclusive)
select * from emptable where emptable.sal between 1500 and 1800;
(4) Employees who have a bonus
select * from emptable where emptable.comm is not null;
(5) Employees who have no bonus
select * from emptable where emptable.comm is null;
(6) Employees whose salary is 1700 or 1900
select * from emptable where emptable.sal in(1700,1900);

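Like
Use the like operator to select similar values
The pattern may contain letters and digits
(1) Employees whose salary has 6 as its second digit
select * from emptable where emptable.sal like '_6%';
_: matches exactly one character
%: matches zero or more characters

(2) Employees whose salary contains a 7
select * from emptable where emptable.sal like '%7%';

rlike
rlike matches the value against a Java regular expression
select * from emptable where emptable.sal rlike '[7]';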
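A few more rlike patterns, sketched against the ename column of emptable. The patterns themselves are illustrative choices, not taken from the original; note that rlike succeeds when the pattern matches any part of the value, so anchors are needed for prefix or suffix matches.

```sql
-- Names that start with 'h'.
select * from emptable where ename rlike '^h';

-- Names that end with a digit.
select * from emptable where ename rlike '[0-9]$';

-- Names that contain 'an' anywhere; same rows as: ename like '%an%'
select * from emptable where ename rlike 'an';
```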

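Grouping
(1) group by clause
Compute the average salary of each department in emptable
select avg(emptable.sal) sal_avg,emptable.deptno from emptable group by deptno;
(2) Compute the highest salary in each department
select max(emptable.sal) sal_max,emptable.deptno from emptable group by deptno;
(3) Departments whose average salary is greater than 1700
select deptno,avg(sal) avg_sal from emptable group by deptno having avg_sal > 1700;
Note: having is used only to filter on aggregate results produced by group by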
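The difference between where and having is easiest to see when both appear in one statement. In this sketch the threshold 1000 is an arbitrary value chosen for illustration.

```sql
select deptno, avg(sal) avg_sal
from emptable
where sal > 1000          -- filters individual rows before grouping
group by deptno
having avg(sal) > 1700;   -- filters whole groups after aggregation
```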

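join operations

(1) Equi-join
Match the department number in the employee table and the department table, and query the employee number, employee name, and department name
select e.empno,e.ename,d.deptname 
from emptable e join dept_partitions d on e.deptno=d.deptno;

(2) Left outer join: left join
All rows of the left table are kept; where the right table has no match, its columns are filled with null
select e.empno,e.ename,d.deptname from emptable e left join dept_partitions d on e.deptno=d.deptno;

(3) Right outer join: right join
All rows of the right table are kept; where the left table has no match, its columns are filled with null
select e.empno,e.ename,d.deptname from emptable e right join dept_partitions d on e.deptno=d.deptno;

(4) Multi-table join
Query the employee name, department name, and employee location
select e.ename,d.deptname,l.locname from emptable e join dept_partitions d on e.deptno=d.deptno join location l on d.loc=l.locno;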

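(5) Cartesian product
Enable strict mode to avoid accidental Cartesian products
Show Hive's current query mode
set hive.mapred.mode;
Set Hive's mode to strict
set hive.mapred.mode=strict;
Note: this setting applies only to the current Hive session; it no longer takes effect after you exit and re-enter Hive.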
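The sketch below lists the kinds of statements that strict mode is generally expected to reject, followed by a form that passes. The exact checks depend on the Hive version (newer releases split them into separate hive.strict.checks.* settings), so treat this as an illustration rather than a specification.

```sql
set hive.mapred.mode=strict;

-- Expected to be rejected: a join without an on condition (Cartesian product).
select e.ename, d.deptname from emptable e join dept_partitions d;

-- Expected to be rejected: order by without limit.
select * from emptable order by sal;

-- Expected to be rejected: scanning a partitioned table without a partition filter.
select * from dept_partitions;

-- Accepted: join condition present and partition filter supplied.
select e.ename, d.deptname
from emptable e join dept_partitions d on e.deptno = d.deptno
where d.day = '1112';
```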

To make strict mode permanent, configure hive-site.xml
 <property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
</property>

Sorting

(1) Global sort: order by
List employees sorted by salary.
Ascending (the default)
select * from emptable order by sal asc;
Descending
select * from emptable order by sal desc;

(2) List the employee number and salary, sorted by twice the salary
select emptable.empno,sal*2 sal_double from emptable order by sal_double;

(3) Sort within partitions
select * from emptable distribute by deptno sort by empno asc;
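distribute by only has a visible effect when more than one reducer runs. A minimal sketch, where the reducer count 3 is an arbitrary choice for illustration:

```sql
-- Use several reducers so that the distribution is observable.
set mapreduce.job.reduces=3;

-- Rows with the same deptno go to the same reducer;
-- each reducer sorts its own rows by sal, descending.
select * from emptable distribute by deptno sort by sal desc;

-- cluster by x is shorthand for distribute by x sort by x (ascending only).
select * from emptable cluster by deptno;
```

Unlike order by, this does not produce one globally sorted result; each reducer's output is sorted independently.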

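Bucketing
Partitioning splits data by storage path
Bucketing splits the data files themselves   

Truncate a table
truncate table emp_buck;

(1) Create a bucketed table
create table emp_buck(id int,name string)
clustered by(id) into 4 buckets
row format
delimited fields
terminated by '\t';
(2) Set the property
set hive.enforce.bucketing=true;
(3) Load data
insert into table emp_buck select * from emp_b;
Note: partitions are directories, whereas buckets are files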
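One common use of a bucketed table is sampling. The sketch below reads a single bucket of emp_buck with tablesample; it assumes the table was populated through the insert shown above so that the four bucket files actually exist.

```sql
-- Read only the first of the 4 buckets, bucketed on id.
select * from emp_buck tablesample(bucket 1 out of 4 on id);

-- Read buckets 1 and 3 (every second bucket, starting from the first).
select * from emp_buck tablesample(bucket 1 out of 2 on id);
```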

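User-defined functions
So far we have used Hive's built-in functions: sum/avg/max/min...

Three kinds of user-defined functions:
UDF: one row in, one row out (User-Defined Function)
UDAF: many rows in, one row out (like count, max, min)
UDTF: one row in, many rows out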
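Hive's built-in functions fall into the same three shapes, which gives a quick feel for the distinction before writing a custom one. The sketch uses only built-in functions and the emptable table from earlier.

```sql
-- UDF shape: one value in, one value out for each row.
select lower('HIVE');

-- UDAF shape: many rows in, one value out.
select count(*) from emptable;

-- UDTF shape: one row in, several rows out (here three rows: 1, 2, 3).
select explode(array(1, 2, 3));
```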

(1) Import the Hive dependency jars
the jars under hive/lib
Write the user-defined function
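package com.tony.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;

// UDF that converts a string to lower case
public class Lower extends UDF{

	public String evaluate(final String s) {
		
		// Pass null through unchanged
		if(s == null) {
			return null;
		}
		
		return s.toLowerCase();
	}
}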
(2) Package the project as a jar and upload it to Linux
alt+p

(3) Add it to Hive
add jar /root/lower.jar;
(4) Register the function
create temporary function my_lower as "com.tony.hiveudf.Lower";
(5) Use it
select ename,my_lower(ename) lowername from emptable;

Hive optimization

Compression

(1) Enable compression of map-stage output
Enable compression of intermediate data:
set hive.exec.compress.intermediate=true;

Enable map output compression:
set mapreduce.map.output.compress=true;
Set the compression codec:
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

(2) Enable compression of reduce output
Enable compression of Hive's final output
set hive.exec.compress.output=true;

Enable compression of the final MapReduce output
set mapreduce.output.fileoutputformat.compress=true;

Set the compression codec
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Set block compression
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Storage
Hive storage formats:
TextFile/SequenceFile/ORC/Parquet

Compression ratio
ORC > Parquet > TextFile

Query speed:
ORC > TextFile
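A storage format is chosen per table with the stored as clause. The sketch below creates an ORC copy of part of emptable; the table name emp_orc and the choice of Snappy compression are assumptions made for illustration.

```sql
-- Hypothetical ORC table holding three columns of emptable.
create table if not exists emp_orc(empno int, ename string, sal double)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

-- Populate it with insert ... select so that Hive writes real ORC files.
-- Loading a text file with load data would only move the file, not convert it.
insert into table emp_orc select empno, ename, sal from emptable;
```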

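Group by optimization

Grouping runs as a MapReduce job: the map stage sends all rows with the same key to one reducer, so a key with a very large number of rows overloads that reducer.
Solutions:
Aggregate on the map side (combiner)
set hive.map.aggr=true;

Enable load balancing
set hive.groupby.skewindata=true;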
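Put together, the two settings are applied in the session before the skewed aggregation runs. As far as I understand the behavior, with hive.groupby.skewindata enabled Hive plans the aggregation as two jobs: the first spreads rows randomly across reducers for partial aggregation, and the second merges the partial results by key. The query below is only an illustrative example on emptable.

```sql
-- Partial aggregation on the map side.
set hive.map.aggr=true;
-- Two-stage aggregation to spread a hot key over several reducers.
set hive.groupby.skewindata=true;

select deptno, count(*) cnt from emptable group by deptno;
```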

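Data skew
(1) Reasonable measures to keep data skew from arising
	Set a reasonable number of map tasks
	Merge small files
	set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

	Set a reasonable number of reduce tasks

	
(2) Resolving data skew
	
Pre-aggregate on the map side (combiner)
set hive.map.aggr=true;

Enable load balancing
set hive.groupby.skewindata=true;

(3) JVM reuse
	Edit mapred-site.xml
	mapreduce.job.jvm.numtasks
	10-20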
