Installing Hive and Learning Its Basic Operations

BUG世界中的killer

Category: hadoop从0开始    Tags: hive, big data

Article link: https://blog.csdn.net/qq_32695789/article/details/98469770

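What Is Hive

 Official site: http://hive.apache.org/
 The Apache Hive™ data warehouse software facilitates reading, writing, and
 managing large datasets residing in distributed storage using SQL. Structure
 can be projected onto data already in storage. A command-line tool and a JDBC
 driver are provided to connect users to Hive.
 Hive supplies the SQL query layer; HDFS supplies the distributed storage.
 In essence, Hive translates HQL into MapReduce programs.
Prerequisites: 1) the HDFS cluster is running
               2) the YARN cluster is running
To use Hive, you need a Hadoop cluster installed and deployed beforehand.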
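To make "HQL is translated into MapReduce" a little more concrete, here is a conceptual sketch of how a query such as select name, count(*) from stu group by name; breaks down into map, shuffle, and reduce steps. It is plain Python, not Hive's actual planner; the sample rows and the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (key, 1) for every input row, keyed by the grouping column."""
    for row in rows:
        yield row["name"], 1


def shuffle(pairs):
    """Group the mapper output by key, as the shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, producing one output row per group."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical contents of a table stu(id, name)
stu = [
    {"id": 1, "name": "reba"},
    {"id": 2, "name": "tom"},
    {"id": 3, "name": "reba"},
]

counts = reduce_phase(shuffle(map_phase(stu)))
print(counts)
```

Writing these three steps by hand for every query is the work that Hive takes over when you type the one-line HQL statement.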

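Why Learn Hive

Main reason: it simplifies development.

Advantages:

The interface uses SQL-like syntax, e.g. select * from stu;
It is simple and quick to pick up.
Hive can replace simple MR programs and sqoop jobs;
Hive can process very large datasets;
Hive supports UDFs (user-defined functions).
Disadvantages:
Because it is an offline (batch) engine, query latency is high.
Execution engine: Hive 1.x (including 1.2.2) runs on MapReduce by default;
from 2.x on, MapReduce is deprecated and Tez or Spark is recommended instead.
HQL has limited expressive power;
scenarios that SQL cannot handle still require hand-written MapReduce.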

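Installing and Deploying Hive

Preparation

Install MySQL (https://blog.csdn.net/qq_32695789/article/details/98470323)
A Hadoop environment (omitted: I have not yet written up a zookeeper+hadoop guide and will add it later)
Installation package and related jar (I used version 1.2.2):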
- apache-hive-1.2.2-bin.tar.gz
- mysql-connector-java-5.1.39.jar

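Installation

Download, upload, and extract the package (skipped here)
Edit the configuration under the /conf directory

mv hive-env.sh.template hive-env.sh
vi hive-env.sh
# use your own paths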
HADOOP_HOME=/xx/xx/xx
export HIVE_CONF_DIR=/xx/hive/conf

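Startup

# you can also add Hive to your environment variables
bin/hive    (this step may fail)
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Cause: an old version of jline exists under the hadoop directory
Fix:
Stop hadoop
Delete the old jline.0.9xx.jar or rename it with a .bak suffix
cp -r /hive/lib/jline-2.12.jar /XX/hadoop/share/hadoop/yarn/lib
Start hadoop and hive again; the error should be gone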

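Configuring MySQL as the Metastore Database

	1) Copy the driver
	Copy mysql-connector-java-5.1.39-bin.jar into /hive/lib/
	
	2) Point the Metastore at MySQL
	-> Create a file named hive-site.xml in the /hive/conf directory
	-> Following the official documentation, copy the parameters below into hive-site.xml
	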
	<?xml version="1.0"?>
	<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
	<configuration>
		<property>
		  <name>javax.jdo.option.ConnectionURL</name>
		  <value>jdbc:mysql://hadoop1:3306/metastore?createDatabaseIfNotExist=true</value>
		  <description>JDBC connect string for a JDBC metastore</description>
		</property>

		<property>
		  <name>javax.jdo.option.ConnectionDriverName</name>
		  <value>com.mysql.jdbc.Driver</value>
		  <description>Driver class name for a JDBC metastore</description>
		</property>

		<property>
		  <name>javax.jdo.option.ConnectionUserName</name>
		  <value>root</value>
		  <description>username to use against metastore database</description>
		</property>

		<property>
		  <name>javax.jdo.option.ConnectionPassword</name>
		  <value>root</value>
		  <description>password to use against metastore database</description>
		</property>
		</configuration>
     3) Note: restart the hadoop cluster
     4) Start hive
     bin/hive
     At this point the metastore database is created in MySQL
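A missing or misspelled property in hive-site.xml is an easy mistake to make in this step. The sketch below is one possible way to check a file for the four connection properties used above; the helper names (read_properties, missing_keys) are hypothetical, and the sample deliberately leaves out the password property to show what a failed check looks like.

```python
import xml.etree.ElementTree as ET

# The four metastore connection properties configured in this article
REQUIRED_KEYS = [
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
]


def read_properties(xml_text):
    """Parse hive-site.xml text into a {name: value} dictionary."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_keys(xml_text):
    """Return the required keys that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [key for key in REQUIRED_KEYS if not props.get(key)]


# Sample with the password property left out on purpose
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop1:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
</configuration>"""

print(missing_keys(SAMPLE))
```

To check a real installation, pass the text of your own hive-site.xml to missing_keys instead of SAMPLE.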

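Hive Data Types

Java type	Hive type	Size / description
byte	TINYINT	1-byte signed integer
short	SMALLINT	2-byte signed integer
int	INT	4-byte signed integer
long	BIGINT	8-byte signed integer
boolean	BOOLEAN	true/false
float	FLOAT	single-precision floating point
double	DOUBLE	double-precision floating point
String	STRING	character string
byte[]	BINARY	byte array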
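The byte widths in the table determine the value range of each integer type. As a quick worked example, the sketch below derives those ranges from the widths using the usual two's-complement formula; the function name signed_range is only illustrative.

```python
# Byte widths of Hive's integer types, taken from the table above
INTEGER_TYPES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) of a two's-complement signed integer of this width."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, size in INTEGER_TYPES.items():
    low, high = signed_range(size)
    print(f"{name:<8} {size} byte(s): {low} .. {high}")
```

A value outside the range of the declared column type will not fit, so it is worth picking the narrowest type that covers your data.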

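Common Hive Operations

Loading Data

load data [local] inpath '/xx/xx.txt' into table xx;
load data: loads data into a table
local: optional. With local, the file is read from the local Linux filesystem; without it,
the path refers to data already in HDFS.
inpath: the path of the data to load
into table: the target table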

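DDL (Data Definition)

1) List databases
show databases;
2) Create a database
create database hive_db;
3) Create a database, the safe form
create database if not exists hive_db;
4) Create a database at a given HDFS path
create database hive_db location '/hive_db';
5) Create a table
If the database was given an HDFS path,
tables created in it are stored under that path.
6) Show the database definition
desc database hive_db;
7) Add extra descriptive properties
alter database hive_db set dbproperties('created'='hunter');
Note: to see these properties, use desc database extended hive_db;
8) List databases matching a wildcard pattern
show databases like 'i*';
9) Drop an empty database
drop database hive_db;
10) Drop a non-empty database
drop database hive_db2 cascade;
11) Drop a non-empty database, the safe form
drop database if exists hive_db cascade;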

Creating Tables

create [external] table [if not exists] table_name (column definitions)
[partitioned by (column definitions)]
[clustered by (columns) [sorted by (columns)] into N buckets]
row format delimited
fields terminated by 'delimiter';

Managed Tables

A table created without external is a managed table, also called an internal table.
Its type is MANAGED_TABLE:
Table Type:MANAGED_TABLE
To check a table's type:
desc formatted test;

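External Tables

An external table has type EXTERNAL_TABLE.
To create one:
create external table student(id int,name string);
Difference: dropping a managed table also deletes its data in HDFS; dropping an external table removes only the metadata and leaves the data in HDFS.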

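Partitioned Tables

1) Create a partitioned table
-- the columns below are placeholders; the partition column day matches the queries that follow
create table dept_partitions(id int, name string)
partitioned by (day string)
row format
delimited fields
terminated by '\t';
2) Query
Full query:
select * from dept_partitions;
Note: this reads the data of every partition in the table
Single-partition query:
select * from dept_partitions where day = '0804';
Note: this reads only the data in the specified partition
Union query:
select * from dept_partitions where day = '0804' union select * from dept_partitions where day = '0803';
Add a single partition:
alter table dept_partitions add partition(day = '0805');
Note: to add several partitions at once, list the partition(...) clauses separated by spaces
List partitions:
show partitions dept_partitions;
Drop a partition:
alter table dept_partitions drop partition(day='0805');
In HDFS, each partition of the table is stored as its own directory.
If data is placed into a partition directory directly on HDFS, the metastore does not know about that partition yet.
Repair: msck repair table dept_partitions;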
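Since each partition is a directory, it helps to know what path to expect. The sketch below builds the conventional path of a partition, assuming the default warehouse root /user/hive/warehouse and no custom location on the database or table; partition_path is a hypothetical helper, not part of Hive.

```python
def partition_path(table, partition_spec,
                   warehouse="/user/hive/warehouse", database="default"):
    """Build the conventional HDFS directory of one partition.

    partition_spec is an ordered list of (column, value) pairs.
    Assumes no custom LOCATION was set on the database or table.
    """
    parts = [warehouse.rstrip("/")]
    # Tables of the default database sit directly under the warehouse root;
    # other databases get a "<name>.db" directory.
    if database != "default":
        parts.append(f"{database}.db")
    parts.append(table)
    parts.extend(f"{col}={val}" for col, val in partition_spec)
    return "/".join(parts)


print(partition_path("dept_partitions", [("day", "0804")]))
print(partition_path("dept_partitions", [("day", "0804")], database="hive_db"))
```

A directory that follows this column=value naming but was created outside Hive is the kind of partition that msck repair table is meant to register.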

DML (Data Manipulation)

1) Load data
load data [local] inpath '' into table ;
2) Insert rows into a table
insert into table student_partitions partition(age = 18) values(1,'reba');
Insert the result of a query into a table:
insert overwrite table student_partitions partition(age = 18) select * from itstar where id<3;
Create a table from a query (CTAS):
create table if not exists student_partitions1 as select * from student_partitions where id = 2;
3) Create a table on top of data that already exists
create table student_partitions3(id int,name string)
row format
delimited fields
terminated by '\t'
location '';
Note: the location path is an HDFS path.
4) Export query results to the local Linux filesystem
insert overwrite local directory '/root/datas' select * from itstar;
5) Export a Hive table's data to HDFS
export table itstar to '/test';
Import the exported data from HDFS back into Hive
import table itstar2 from '/test/';
6) Remove all rows from a table
truncate table test2;

Common queries work the same as in MySQL (omitted)

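Bucketing (Sampling Queries)

Creating a Bucketed Table

-- clustered by(id): the column to bucket on; into 4 buckets: the number of buckets
create table stu_buck(id int,  name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';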

Inserting Data

(1) First create an ordinary table stu
create table stu(id int, name string)
row format delimited fields terminated by '\t';

(2) Load data into the ordinary stu table
load data local inpath '/opt/module/datas/student.txt' into table stu;

(3) Load the bucketed table from it with a query
insert into table stu_buck
select id, name from stu;

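Note

Note: for rows to be split into the right bucket files, the number of reducers must match the number of buckets.
Enforcing bucketing and setting the reducer count to -1 lets Hive pick it from the table definition:
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
hive (default)> insert into table stu_buck
select id, name from stu;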
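To see which rows end up in which bucket, the sketch below models bucket assignment for an INT column. It assumes the classic rule used by Hive 1.x, where the hash of an int is the value itself and the bucket is (hash & Integer.MAX_VALUE) % number_of_buckets; newer Hive versions may hash differently. The ids are made-up sample data, and the values are assumed to fit in 32 bits.

```python
def bucket_for(value, num_buckets):
    """Bucket index (0-based) of an int under the assumed Hive 1.x rule."""
    # Masking with 0x7FFFFFFF mirrors Java's (hash & Integer.MAX_VALUE),
    # which keeps the result non-negative.
    return (value & 0x7FFFFFFF) % num_buckets


def distribute(ids, num_buckets):
    """Group ids by the bucket they would be written to."""
    buckets = {i: [] for i in range(num_buckets)}
    for value in ids:
        buckets[bucket_for(value, num_buckets)].append(value)
    return buckets


# Hypothetical student ids 1001..1016 spread over 4 buckets
layout = distribute(range(1001, 1017), 4)
for index, members in layout.items():
    print(index, members)
```

With 4 buckets, each bucket receives every fourth id, which is why the data ends up in four files of similar size.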

Query

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

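Explanation

tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).
y must be a multiple or a factor of the table's total bucket count. Hive uses y to decide the sampling ratio. For example, if the table has 4 buckets, then y=2 samples (4/2=) 2 buckets of data, and y=8 samples (4/8=) 1/2 of one bucket.
x is the bucket to start from. When more than one bucket is needed, each following bucket number is the previous one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: the 1st (x) and the 3rd (x+y).
Note: x must be less than or equal to y. Otherwise Hive fails with:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck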
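The x and y rules above can be checked with a small model. The sketch below treats tablesample(bucket x out of y on id) as keeping the rows whose hash modulo y equals x-1, again assuming the Hive 1.x convention that the hash of an int is the value itself. The function names and the ids 1..16 are illustrative only; buckets are numbered from 1 to match the text.

```python
def sampled_ids(ids, x, y):
    """Ids kept by tablesample(bucket x out of y on id) in this model."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [v for v in ids if (v & 0x7FFFFFFF) % y == x - 1]


def touched_buckets(ids, x, y, num_buckets):
    """1-based numbers of the table buckets that the sample reads from."""
    return sorted({(v & 0x7FFFFFFF) % num_buckets + 1
                   for v in sampled_ids(ids, x, y)})


ids = range(1, 17)                      # hypothetical ids in a 4-bucket table
print(touched_buckets(ids, 1, 2, 4))    # y=2: two whole buckets
print(sampled_ids(ids, 1, 8))           # y=8: half of one bucket
```

With y=2 the sample covers buckets 1 and 3, as described above; with y=8 it returns 2 of the 4 ids stored in bucket 1, i.e. half a bucket.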

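UDF (User-Defined Functions)

UDF: one row in, one value out
UDAF: aggregate function, many rows in, one value out, e.g. count/max/avg
UDTF: one row in, many rows out
Java implementation:
package com.itstaredu.com;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyConcat extends UDF {
    // Joins the two arguments with "*****" in between
    public String evaluate(String a, String b) {
        return a + "*****" + String.valueOf(b);
    }
}
Temporary registration:
add jar /xx/Myconcat.jar;
create temporary function my_cat as "com.itstaredu.com.MyConcat";
Permanent registration: hive-site.xml
<property>
<name>hive.aux.jars.path</name>
<value>file:///xxx/xx/hive/lib/hive.jar</value>
</property>

Once registered, the function is called the same way as built-ins such as count() and sum() in MySQL.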
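The three function kinds differ only in how many rows go in and how many come out. The sketch below shows those three shapes with plain Python functions; it does not use any Hive API. my_cat mirrors the concatenation in MyConcat.evaluate, while my_max and my_split are made-up stand-ins for an aggregate and a table-generating function.

```python
def my_cat(a, b):
    """UDF shape: one row in, one value out."""
    return a + "*****" + str(b)


def my_max(values):
    """UDAF shape: many rows in, one value out."""
    return max(values)


def my_split(line, sep=","):
    """UDTF shape: one row in, several rows out."""
    return line.split(sep)


rows = [("reba", 18), ("tom", 20)]      # hypothetical (name, age) rows

print([my_cat(name, age) for name, age in rows])   # one result per row
print(my_max(age for _, age in rows))              # one result for all rows
print(my_split("a,b,c"))                           # several results for one row
```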

Compression in Hive

Storage: HDFS
Compute: MapReduce
Compressing map output:
Enable compression of Hive's intermediate data:
set hive.exec.compress.intermediate=true;
Enable map output compression:
set mapreduce.map.output.compress=true;
Use the Snappy codec:
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
Compressing reduce output:
Enable compression of Hive's final output:
set hive.exec.compress.output=true;
Enable compression of the MR job output:
set mapreduce.output.fileoutputformat.compress=true;
Set the codec:
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
Set the compression type to block compression:
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
Testing (omitted)

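Other Hive Operations

1) Run a statement without entering the Hive client
bin/hive -e "select * from xxx;"
2) Run SQL stored in a file
bin/hive -f /xx/hived.sql
3) Browse HDFS files from inside the Hive client
dfs -ls /;
dfs -cat /wc/in/words.txt;
4) View the command history
cat ~/.hivehistory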
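The -e and -f options make it easy to drive Hive from a script. The sketch below is one possible wrapper: hive_command only builds the argument list, and run_hive actually executes it, which requires a working Hive installation. Both function names and the default bin/hive path are assumptions for illustration.

```python
import subprocess


def hive_command(query=None, script=None, hive_bin="bin/hive"):
    """Build the argument list for `hive -e <query>` or `hive -f <script>`."""
    if (query is None) == (script is None):
        raise ValueError("pass exactly one of query or script")
    if query is not None:
        return [hive_bin, "-e", query]
    return [hive_bin, "-f", script]


def run_hive(**kwargs):
    """Run the command and return its stdout; needs a working Hive install."""
    result = subprocess.run(hive_command(**kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Only builds and prints the commands; nothing is executed here.
print(hive_command(query="show databases;"))
print(hive_command(script="/xx/hived.sql"))
```

Passing the arguments as a list avoids shell quoting problems when the query itself contains quotes.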
