hive安装及基本操作

最新推荐文章于 2021-10-11 16:12:57 发布

孤狼龙哥

最新推荐文章于 2021-10-11 16:12:57 发布

阅读量294

点赞数

文章标签： hive spark

Hive安装部署

下载安装包并解压

apache-hive-0.13.1-bin.tar.gz

解压：tar -zxf apache-hive-0.13.1-bin.tar.gz -C /opt/modules/

配置hive-env.sh文件

设置hadoop路径：HADOOP_HOME=/opt/modules/hadoop-2.5.0

设置hive配置文件路径： export HIVE_CONF_DIR=/opt/modules/hive-0.13.1/conf

创建hive在hdfs文件系统存储数据的默认路径并修改权限

bin/hdfs dfs -mkdir -p /user/hive/warehouse

bin/hdfs dfs -chmod g+w /user/hive/warehouse

bin/hdfs dfs -mkdir /tmp

bin/hdfs dfs -chmod 770 /tmp

复制mysql驱动包

cp mysql-connector-java-5.1.27-bin.jar /opt/modules/hive-0.13.1/lib/

修改hive-site.xml文件

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>

</property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

</property>

<name>javax.jdo.option.ConnectionUserName</name>

</property>

<name>javax.jdo.option.ConnectionPassword</name>

</property>

<name>hive.cli.print.header</name>

</property>

<name>hive.cli.print.current.db</name>

</property>

<name>hive.metastore.warehouse.dir</name>

<value>/user/hive/warehouse</value>

</property>

配置日志路径hive-log4j.properties

hive.log.dir=/opt/modules/hive-0.13.1/logs

hive.root.logger=INFO,DRFA

hive cli命令窗口operation

查看所有函数：show functions;

创建数据库：create database db_hive;

使用数据库：use db_hive;

创建student表：create table student(id int,name string) row format delimited fields terminated by '\t';

将本地数据导入student表：load data local inpath '/opt/datas/student.txt' into table student;

查询表中数据：select * from db_hive.student;

查看表结构：desc student ; desc formatted student;

查看配置信息：set ; set system:user.name ; set system:user.name=hive ;

查看hdfs文件系统：dfs -ls / ;

查看本地文件系统：!ls /opt/datas ;

HIVE_HOME目录下operation

bin/hive使用帮助：bin/hive -help

查询数据库：bin/hive -e <quoted-query-string>

Eg：bin/hive -e "select * from db_hive.student";

调用文件的sql查询数据库: bin/hive -f <filename>

启动hive时设置配置属性: bin/hive --hiveconf <property=value>

Database operation

创建数据库

create database db_hive_01 ;

create database if not exists db_hive_02 ;

create database if not exists db_hive_03 location '/user/hive/warehouse/db_hive_03.db' ;

查看数据库

show databases ;

show databases like 'db_hive*' ;

选择数据库

use db_hive ;

数据库描述

desc database db_hive_03 ;

desc database extended db_hive_03 ;

删除数据库

drop database db_hive_03 ;

drop database if exists db_hive_03 ;

数据库有表，使用cascade级联删除：drop database db_hive_03 cascade;

Table operation

表创建

创建管理表：

create table if not exists student(

name string comment 'name',

sex string,

age int)

comment 'student'

row format delimited fields terminated by ' '

stored as textfile;

comment ‘’：注释关键字

stored as ：文件存储类型

row format delimited fields terminated by ‘’: 指定字段分隔符

创建外部表：

create external table if not exists student_ext(

name string comment 'name',

sex string,

age int)

row format delimited fields terminated by '\t'

location '/user/hive/warehouse/student_ext';

创建分区表：

create external table if not exists student_part(

name string comment 'name',

sex string,

age int)

partitioned by (month string,day string)

row format delimited fields terminated by '\t';

加载数据到分区表：load data local inpath '/opt/datas/student.txt' into table student_part partition (month=201509,day=13);

增加分区元数据命令：alter table student_part add partition(month='201509',day=’13’);

使用正则表达式分隔符创建表：

CREATE TABLE apachelog (

host STRING,

identity STRING,

user STRING,

time STRING,

request STRING,

status STRING,

size STRING,

referer STRING,

agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"

)

STORED AS TEXTFILE;

子查询方式创建表

create table if not exists student1 as select name from student;

普通表：insert into table student2 select name from student;

分区表静态添加：insert into table student2 partition(date=’20180803’) select name from student;

分区表动态添加：insert into table student2 partition(date) select name,date from student;

create table if not exists student3 like student;

表修改和删除

表重命名：alter table student3 rename to student4;

表删除：drop table if exists student4;

加载数据

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename

[PARTITION (partcol1=val1, partcol2=val2 ...)]

导出查询结果

导出到本地：

insert overwrite local directory '/opt/datas/result.txt' select * from student;

insert overwrite local directory '/opt/datas/result.txt' row format delimited fields terminated by '\t' collection items terminated by '\n' select * from student;

导出到HDFS：

insert overwrite directory '/user/hive/result.txt' select * from student;

表查询

查询语法：

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list]

[CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY col_list]

]

[LIMIT number]

between A and B 使用：select t.empno,t.ename,t.deptno from emp t where t.sal between 800 and 1500;

查询条件is null, is not null, in, not in：select t.empno, t.ename, t.deptno from emp t where comm is null ;

聚合函数max/min/count/sum/avg：

select count(*) cnt from emp ;

select max(sal) max_sal from emp ;

select sum(sal) from emp ;

select avg(sal) from emp ;

group by /having 查询：

select t.deptno,avg(t.sal) avg_sal from default.emp t group by t.deptno;

select t.deptno,max(t.sal) max_sal from default.emp t group by t.deptno,t.job;

select t.deptno,avg(t.sal) avg_sal from default.emp t group by t.deptno having avg_sal > 2000;

Join 查询：select e.empno,e.ename,d.deptno,d.dname from emp e join dept d on e.deptno = d.deptno;

左连接left join: select e.empno,e.ename,d.deptno,d.dname from emp e left join dept d on e.deptno = d.deptno;

右连接right join : select e.empno,e.ename,d.deptno,d.dname from emp e right join dept d on e.deptno=d.deptno;

全连接 full join: select e.empno,e.ename,d.deptno,d.dname from emp e full join dept d on e.deptno=d.deptno;

导入import导出export

Export导出

导出的数据包括元素据和表数据:export table student to '/user/hive/export/student';

Import导入

将外部数据导入Hive表中:import table student_im from '/user/hive/export/student';

Order by, sort by, distribute by, cluster by使用

order by

order by全局有序，只有一个reduce：select * from emp order by sal desc;

sort by

sort by：每一个reduce内部数据有序，全局结果无序。

set mapreduce.job.reduces= 3;

insert overwrite local directory '/opt/datas/sortby' select * from emp sort by empno asc;

distribute by

distribute by类似于MR中partition,对数据进行分区，结合sort by进行使用

insert overwrite local directory ‘/opt/datas/distby’ select * from emp distribute by deptno sort by empno asc;

cluster by

cluster by：当distribute by和sort by字段相同时，使用cluster by ;

insert overwrite local directory '/opt/datas/cluster' select * from emp cluster by empno;

表存储格式orc、parquet

创建表，文件存储格式为orc，压缩为snappy：

create external table if not exists student_part(

name string comment 'name',

sex string,

age int)

partitioned by (month string,day string)

row format delimited fields terminated by '\t'

stored as orc tblproperties("orc.compress"="SNAPPY");

创建表，文件存储格式为parquet，压缩为snappy：

create external table if not exists student_part(

name string comment 'name',

sex string,

age int)

partitioned by (month string,day string)

row format delimited fields terminated by '\t'

stored as parquet tblproperties("parquet.compress"="SNAPPY");

在实际开发中，表文件的存储格式为：orc或者parquet，压缩为：snappy。

hiveserver2使用

将hive变成一种server服务端对外开放，多个客户端可以连接

启动server：bin/hiveserver2

hiveserver2的默认端口号：10000

进入hiveserver2客户端：

bin/beeline -u jdbc:hive2://master:10000 -n root -p 123456
bin/beeline

!connect jdbc:hive2://master:10000

通过帮助命令获取常用选项参数：bin/beeline --help

自定义UDF

Jar包在本地

添加jar包：add jar /opt/datas/hiveudf.jar ;

创建临时函数：create temporary function my_lower as "com.dragon.hive.udf.LowerUDF" ;

显示所有函数：show functions;

删除自定义函数：drop temporary function my_lower;

自定义udf函数使用：select ename, my_lower(ename) lowername from emp limit 5 ;

Jar包在hdfs

创建临时函数：CREATE FUNCTION self_lower AS 'com.dragon.hive.udf.LowerUDF' USING JAR 'hdfs://master:8020/hiveudf.jar';

自定义udf函数使用：select ename, self_lower(ename) lowername from emp limit 5 ;

UDF函数代码

Pom依赖：

<groupId>org.apache.hive</groupId>

</dependency>

<groupId>org.apache.hive</groupId>

</dependency>

程序代码：

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.Text;

public class IpCutUDF extends UDF {

public Text evaluate(Text str) {

Text output = new Text();

if(str == null) {

return output;

}

if(str.toString() == null) {

return output;

}

//按.分割

String[] ipAll = str.toString().split("\\.");

if(ipAll.length != 4) {

return output;

}

output.set(ipAll[0]+"."+ipAll[1]+"."+ipAll[2]);

return output;

}

public static void main(String[] args) {

System.out.println(new IpCutUDF().evaluate(new Text("27.38.5.159")));

}

Hive数据倾斜

map/reduce程序执行时，reduce节点大部分执行完毕，但是有一个或者几个reduce节点运行很慢，导致整个程序的处理时间很长。这是因为某一个key的条数比其他key多很多（有时是百倍或者千倍之多），这条key所在的reduce节点所处理的数据量比其他节点就大很多，从而导致某几个节点迟迟运行不完，此称之为数据倾斜。

万能膏药：hive.groupby.skewindata=true

set hive.groupby.skewindata=true，生成的查询计划会有两个MR Job。

第一个MR Job中，Map的输出结果集合会随机分布到 Reduce中，每个 Reduce 做部分聚合操作，并输出结果，这样处理的结果是相同的Group By Key有可能被分发到不同的 Reduce中，从而达到负载均衡的目的

第二个MR Job再根据预处理的数据结果按照Group By Key分布到Reduce中（这个过程可以保证相同的Group By Key被分布到同一个Reduce中），最后完成最终的聚合操作。

参数调优：hive.map.aggr=true

map端部分聚合，相当于MR的combiner

大小表关联

可以使用map Join让小的维度表（1000条以下的记录条数）先进内存。在map端完成reduce。

大表和大表关联

把空值null的key变成一个字符串加上随机数，把倾斜的数据分到不同的reduce上，由于null值关联不上，处理后并不影响最终结果。

count distinct大量相同特殊值

count distinct时，将值为空的情况单独处理。如果是计算count distinct，可以不用处理，直接过滤，在最后结果中加1。如果还有其他计算，需要进行group by，可以先将值为空的记录单独处理，再和其他计算结果进行union。

空值数据倾斜

场景：如日志中，常会有信息丢失的问题，比如全网日志中的user_id，如果取其中的user_id和bmw_users关联，会碰到数据倾斜的问题。

解决方法1： user_id为空的不参与关联

select * from log a join bmw_users b on a.user_id is not null and a.user_id = b.user_id union all select * from log a where a.user_id is null;

解决方法2 ：赋予空值新的key值

select * from log a left outer Join bmw_users b on case when a.user_id is null then concat(‘dp_hive’,rand()) else a.user_id end = b.user_id;

结论：方法2比方法1效率更好，不但io少了，而且作业数也少了。方法1的log读取两次，jobs是2。方法2的job数是1。这个优化适合无效id(比如-99,’’,null等)产生的倾斜问题。把空值的key变成一个字符串加上随机数，就能把倾斜的数据分到不同的reduce上 ,解决数据倾斜问题。

不同数据类型关联产生数据倾斜

场景：一张表s8的日志，每个商品一条记录，要和商品表关联。但关联却碰到倾斜的问题，s8的日志中有字符串商品id,也有数字的商品id,类型是string的，但商品中的数字id是bigint的。

问题原因：把s8的商品id转成数字id做hash来分配reduce，所以字符串id的s8日志，都到一个reduce上了，解决的方法验证了这个猜测。

解决方法：把数字类型转换成字符串类型

select * from s8_log a left outer join r_auction_auctions b on a.auction_id = cast(b.auction_id as string);

大表Join的数据偏斜

数据偏斜的原因包括以下两点：

1. Map输出key数量极少，导致reduce端退化为单机作业。

2. Map输出key分布不均，少量key对应大量value，导致reduce端单机瓶颈。

Hive中我们使用map join解决数据偏斜的问题，即将其中的某个小表（全量）分发到所有map端的内存进行join，从而避免了reduce。这要求分发的表可以被全量载入内存。

极限情况下，join两边的表都是大表，就无法使用MapJoin。这种问题最为棘手，目前已知的解决思路有两种

上述情况1：考虑先对Join中的一个表去重，以此结果过滤无用信息。这样一般会将其中一个大表转化为小表，再使用MapJoin 。一个实例是广告投放效果分析。

例如将广告投放者信息表i中的信息填充到广告曝光日志表w中，使用投放者id关联。因为实际广告投放者数量很少（但是投放者信息表i很大），因此可以考虑先在w表中去重查询所有实际广告投放者id列表，以此Join过滤表i，这一结果必然是一个小表，就可以使用MapJoin。

select /*+mapjoin(x)*/ * from log a left outer join (

　　select /*+mapjoin(c)*/ d.*

　　from ( select distinct user_id from log ) c

　　join users d on c.user_id = d.user_id

) x on a.user_id = x.user_id;

上述情况2：考虑切分Join中的一个表为多片，以便将切片全部载入内存，然后采用多次MapJoin得到结果。一个实例是商品浏览日志分析。

例如将商品信息表i中的信息填充到商品浏览日志表w中，使用商品id关联。但是某些热卖商品浏览量很大，造成数据偏斜。

select * from(

　　select w.id, w.time, w.amount, i1.name, i1.loc, i1.cat

　　　　from w left outer join i sampletable(1 out of 2 on id) i1)

union all(

　　select w.id, w.time, w.amount, i2.name, i2.loc, i2.cat

　　　　from w left outer join i sampletable(1 out of 2 on id) i2);

控制hive中map和reduce个数

合并小文件，减少map数

set mapred.max.split.size=100000000;

set mapred.min.split.size.per.node=100000000;

set mapred.min.split.size.per.rack=100000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;这个参数表示执行前进行小文件合并，100000000表示100M,前面三个参数确定合并文件块的大小，大于文件块大小128m的，按照128m来分隔，小于128m,大于100m的，按照100m来分隔，把那些小于100m的（包括小文件和分隔大文件剩下的），进行合并。

适当的增加map数

当input的文件都很大，任务逻辑复杂，map执行非常慢的时候，可以考虑增加Map数，来使得每个map处理的数据量减少，从而提高任务的执行效率。

假设有这样一个任务：

Select data_desc,

count(1),

count(distinct id),

sum(case when …),

sum(case when ...),

sum(…)

from a group by data_desc

如果表a只有一个文件，大小为120M，但包含几千万的记录，如果用1个map去完成这个任务，肯定是比较耗时的，这种情况下，我们要考虑将这一个文件合理的拆分成多个，这样就可以用多个map任务去完成。

set mapred.reduce.tasks=10;

create table a_1 as select * from a distribute by rand(123);

这样会将a表的记录，随机的分散到包含10个文件的a_1表中，再用a_1代替上面sql中的a表，则会用10个map任务去完成。

每个map任务处理大于12M（几百万记录）的数据，效率肯定会好很多。

控制map数量需要遵循两个原则：使大数据量利用合适的map数；使单个map任务处理合适的数据量。

控制reduce数目

Hive自己如何确定reduce数：

reduce个数的设定极大影响任务执行效率，不指定reduce个数的情况下，Hive会猜测确定一个reduce个数，基于以下两个设定：

hive.exec.reducers.bytes.per.reducer（每个reduce任务处理的数据量，默认为1000^3=1G）
hive.exec.reducers.max（每个任务最大的reduce数，默认为999）

调整reduce个数方法一：

调整每个reduce处理的数据量，设置hive.exec.reducers.bytes.per.reducer参数的值；

set hive.exec.reducers.bytes.per.reducer=500000000; （500M）

调整reduce个数方法二：

set mapred.reduce.tasks = 15;

reduce个数并不是越多越好：

同map一样，启动和初始化reduce也会消耗时间和资源；另外，有多少个reduce,就会有多少个输出文件，如果生成了很多个小文件，那么如果这些小文件作为下一个任务的输入，则也会出现小文件过多的问题。

什么情况下只有一个reduce：

数据量小于hive.exec.reducers.bytes.per.reducer参数值。
没有group by的汇总，比如把select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; 写成 select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
用了order by全局有序，只有一个reduce。
有笛卡尔积

在设置reduce个数的时候也需要考虑两个原则：使大数据量利用合适的reduce数；使单个reduce任务处理合适的数据量。