Hive深入详解

最新推荐文章于 2024-07-02 03:45:00 发布

小江_xiaojiang

最新推荐文章于 2024-07-02 03:45:00 发布

阅读量2.3k

点赞数

分类专栏： Hive

本文链接：https://blog.csdn.net/jiangsanfeng1111/article/details/52841126

版权

Hive 专栏收录该内容

8 篇文章 0 订阅

订阅专栏

Hive表的创建

方法一：同sql

create table if not exists deault.bf_log_20150913(
ip string comment 'remot ip address',    ---字段注释
user string ,
req_url string comment 'user request url')
comment 'Beifeng Web Access Logs'   --数据库注释
row format delimited fields terminated by ' '
stored as textfile          --默认为文本格式textfile
location '/user/beifeng/hive/warehouse/bf_log_20150913'; （内部表一般不指定）
--默认在/user/beifeng/hive/warehouse/数据库/表 
注意：default数据库为/user/beifeng/hive/warehouse/表

方法二：创建包含另一张表字段的字表

create table if not exists default.bf_log_20150913_sa
as select ip,user from deault.bf_log_20150913 ;

方法三：创建同另一张表格式一样的表，数据为空

create table if not exists default.bf_log_20150914
Like deault.bf_log_20150913;

其他常见操作：

查看数据库
	show databases;
	show databases like ‘db_hive*’;
使用表
	use db_hive_01
查看数据库详细信息
	desc database db_hive_03;
	desc database extended db_hive_03;
	desc formatted 表名
删除一个数据库
	drop database if exists db_hive_03; --如果数据库中有表存在则报错
	drop database if exists db_hive_03 cascade; --级联删除 同时删除数据库中的表

Hive数据类型

Numeric Types
	TINYINT  (1-byte signed integer, from -128 to 127)
	SMALLINT  (2-byte signed integer, from -32,768 to 32,767)
	INT  (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
	BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
	FLOAT  (4-byte single precision floating point number)
	DOUBLE  (8-byte double precision floating point number)
	DOUBLE PRECISION  (alias for DOUBLE, only available starting with Hive 2.2.0)
	DECIMAL
		Introduced in Hive 0.11.0 with a precision of 38 digits
		Hive 0.13.0 introduced user definable precision and scale
Date/Time Types
	TIMESTAMP  (Note: Only available starting with Hive 0.8.0)
	DATE  (Note: Only available starting with Hive 0.12.0)
String Types (常用)
	STRING（常用）
	VARCHAR (Note: Only available starting with Hive 0.12.0)
	CHAR (Note: Only available starting with Hive 0.13.0)
Misc Types
	BOOLEAN
	BINARY (Note: Only available starting with Hive 0.8.0)
Complex Types
	arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
	maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)
	structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
	union: UNIONTYPE<data_type, data_type, ...> (Note: Only available starting with Hive 0.7.0.)

Hive中外部表

内部表也称为managed_table，默认存储在/user/hive/warehouse下，也可以通过location指定（一般不指定），删除表时，会删除表数据以及元数据
外部表称之为external_table，在创建表时可以自己指定目录位置location（一般都要指定）；删除表时，只会删除元数据不会删除表数据

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';

分区表

和外部表联合使用企业常用。分区表实际上就是对应一个HDFS文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive中的分区就是分目录，把一个大的数据集根据业务需要分割成更小的数据集。在查询时通过where子句中的表达式来选择查询所需要的指定的分区，这样的查询效率会提高很多。

create external table if not exists default.emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
partitioned by (month string)  --一级分区
----partitioned by (month string,day string)  --二级分区
row format delimited fields terminated by '\t'

加载数据：
Load data local inpath ‘/opt/datas/emp.txt’ into table default.emp_partition partition(month=201509,day=30)

查询：
Select * from emp_partition where month = ‘201509’ ; ----and day =’30’; --二级分区

合并：
Select * from emp_partition where month = ‘201509’
Union
Select * from emp_partition where month = ‘201508’
Union
Select * from emp_partition where month = ‘201507’; 3+2个MR

注意事项：内部表数据直接上传至/user/hive/warehouse/中，可以查询到结果；但是分区表则不行，因为分区表记录了分区的元数据信息。请看如下实例。

create table if not exists default.emp_nopart(
deptno int,
dname string,
loc string)
row format delimited fields terminated by '\t'
默认存在/user/hive/warehouse/dept_nopart
	上传数据：
		dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_nopart
	查询：有数据

create talbe if not exists default.emp_nopart(
deptno int,
dname string,
loc string)
	partition by (day string)
row format delimited fields terminated by '\t'
创建目录
		dfs –mkdir –p /user/hive/warehouse/dept_part/day=20150913
上传数据
		dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150913
查询：无数据 （没有分区的元数据）
	修复表信息：
		msck repair table dept_part;   或者
		alter table dept_part partition(day=’20150913’)
	查询：
		Select * from dept_part; 有数据

查看表的分区数：

Show partitions dept_part;

导入数据到Hive

方法一：按如下格式导入数据

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

原始文件存储的位置：[LOCAL]
本地: load data local …
Hdfs : load data …
对表中的数据是否覆盖：[OVERWRITE]，否则为追加

eg：
加载本地文件到hive表
	Load data local inpath ‘/opt/datas/emp.txt’ into table default.emp;
	
加载hdfs文件到hive
	Load data inpath ‘/user/beifeng/hive/datas/emp.txt’ into table defaut.emp;（加载完成后hdfs上数据被删除）

方法二：创建表的时候通过insert加载数据

create table default.emp_ci like default.emp;
insert into table default.emp_ci select * from default.emp;
或者：
create table if not exits default.bf_log_20150913_sa
As  select ip,user from deault.bf_log_20150913 ;

方法三：创建表的时候通过location指定加载（外部表常用）

create external talbe if not exists default.emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t'
location '/user/beifeng/hive/warehouse/emp_ext2'

Hive数据导出

方法一：

insert overwrite  local directory '/opt/datas/hive_exp_emp'
select * from default.emp;(默认分隔符存到本地)

insert overwrite  local directory '/opt/datas/hive_exp_emp'
row format delimited fields terminated by '\t'
collection items terminated by '\n'
select * from default.emp;(自定义分隔符)

insert overwrite  directory '/user/beifeng/hive/hive_exp_emp'
select * from default.emp;(默认分隔符存储到hdfs)

方法二：

bin/hive -e "select * from default.emp;" > /opt/datas/hive_exp_emp

方法三：sqoop(常用，后面章节讲)
sqoop
hdfs/hive -> rdbms
rdbms -> hdfs/hive/hbase

方法四：export

export 将hive中数据导出 hdfs
	export table default.emp to '/user/beifeng/hive/export/emp_exp';
import 将外部数据hdfs导入到hive
	create table db_hive.emp like default.emp;
	import table db_hive.emp from '/user/beifeng/hive/export/emp_exp'

Hive常见查询

简单查询

select * from emp;
select t.empno,t.enname,t.deptno  from emp t;
select * from emp limit 5;

= >= <= between and ..... is null / is not null / in /not in .... max/min/count/sum/avg

select t.empno,t.enname,t.deptno  from emp t where t.sal between 800 and 1500;
select t.empno,t.enname,t.deptno  from emp t where t.comm is null;

show function ;---查询函数
desc fucntion extended max;----查询函数的意义和用法
select count(*) cnt from emp;

group by

----查询每个部门的平均工资
select t.deptno,avg(t.sal) avg_sal from emp t group by t.deptno;
----查询每个部门中每个岗位的最高薪水
select t.deptno,t.job,max(t.sal) avg_sal from emp t group by t.deptno t.job;

having
where 是针对单挑记录进行筛选
having 是针对分组结果进行筛选

----求每个部门的平均薪水大于2000的部门
select deptno,avg(sal) avg_sal from emp group by deptno having avg_sal>2000;

join 两个表连接

等值连接join on
	select e.empno,e.enname,d.deptno,d.name from emp e join dept d on e.deptno = d.deptno;
左连接 left join 与左表为准
	select e.empno,e.enname,d.deptno,d.name from emp e left join dept d on e.deptno = d.deptno;
右连接  right join 与右表为准
	select e.empno,e.enname,d.deptno,d.name from emp e right join dept d on e.deptno = d.deptno;
全链接full join 
	select e.empno,e.enname,d.deptno,d.name from emp e full join dept d on e.deptno = d.deptno;

order by

对每个reduce内部进行排序，对全局结果来说没有排序

设置reduce个数
set mapreduce.job.reduces = 3
select * from emp sort by empno desc; 
insert overwrite local directory '/opt/datas/sortby-res' select * from emp sort by empno asc;

distribute by
类似于mapreduce中分区partition，对数据进行分区，结合sort by进行使用

insert overwrite local directory '/opt/datas/distby-res' 
select * from emp distribute by deptno sort by empno asc; 
注意：distribute by必须在sort by之前

cluster by
当distribute by和sort by字段相同时可以用cluster by代替

insert overwrite local directory '/opt/datas/clustby-res' 
select * from emp cluster by empno;

Hive UDF编程

UDF(user definition function)用户自定义函数，允许用户扩展HiveQL功能

show functions; 查看hive中内部函数
desc function split; 函数说明
desc fucntion extended split; 函数使用具体说明

编程步骤：
1、继承org.apache.hadoop.hive.ql.exec.UDF
2、需要实现evaluate函数；evaluate函数支持重载
注意：
1、UDF必须要有返回类型，可以返回null，但是返回类型不能为void
2、UDF中常用Text/LongWritable等类型，不推荐使用java类型

package com.example.hive.udf;
 
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
 
public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}

将自定义的UDF函数导入到hive中
方法一：

1、export导出jar包hiveudf.jar
2、在hive中加载执行add jar /opt/datas/hiveudf.jar
3、注册：create temporary function my_lower as "com.example.hive.udf"  ----为这里用hive2.1的版本 com.example.hive.udf.Lower 否则一直报找不到类!坑了我一下午
show functions;可以查看到my_lower

使用
select ename,my_lower(ename) lowername from emp limit 5;

方法二：

方法二：
CREATE FUNCTION self_lower AS 'com.example.hive.udf' 
USING JAR 'hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/hive/jars/hiveudf.jar';

add jar hdfs://inveno/third_party_jars/fastjson-1.2.4.jar;
create temporary function GetCategory as 'cn.inveno.hive.udf.GetCategory' using jar 'hdfs://inveno/hive-udf/GetCategory.jar';

工作中遇到的坑：

建立外表

create external table if not exists db_offline.tmp_spark_streaming_redis_gmp(  
data string
)  
partitioned by (statdate string)
location '/inveno-projects/article-gmp-sparkstreaming/data/redis';
加载数据：
for((i=0;i<144;i++))
do
	#echo ${arr[$i]}
	min=`date -d @${arr[$i]} "+%Y%m%d%H%M"`
	#echo $min
	#hive -e "load data inpath '/inveno-projects/article-gmp-sparkstreaming/data/redis/redis-${arr[$i]}000/part*' into table db_offline.tmp_spark_streaming_redis_gmp partition(statdate=${min})"
	hive -e "
	ALTER TABLE db_offline.tmp_spark_streaming_redis_gmp ADD if not exists PARTITION (statdate=${min}) 
	location '/inveno-projects/article-gmp-sparkstreaming/data/redis/redis-${arr[$i]}000/';
	"
done

坑1：如果使用load data 将会把数据移动到hive的默认地址中

坑2： alter tables ***** location 后面接的地址必须是文件或者文件夹，不允许使用通配符来匹配文件

创建外表时，数据对应的location位置和每个分区的位置不一定相同

查看表的详细属性：

desc formatted default.t_daily_article_operate);

查看partition属性：

desc formatted default.t_daily_article_operate partition(date='2017-09-06');

小江_xiaojiang

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Hive深入详解

Hive表的创建方法一：同sqlcreate table if not exits deault.bf_log_20150913(ip string comment 'remot ip address', ---字段注释user string ,req_url string comment 'user request url')comment 'Beifeng
复制链接

扫一扫