Hive学习笔记_hive select * 和 select 计算执行流程的区别-CSDN博客

本文链接：https://blog.csdn.net/lxhandlbb/article/details/56479897

Hive的使用

Hive 中数据库Database基本操作

创建数据库：

(1)create database db_hive_01;
(2)create database if not exists db_hive_02;//标准的写法
(3)create database if not exists db_hive_03 location '/user/beifeng/hive/warehouse/db_hive_03.db';

显示数据库：

show databases;
show databases like 'db_hive*' //模糊显示数据库

查看数据库的信息：

desc database db_hive_03;
desc database extended db_hive_03;

创建表的三种方式：

(1)直接创建

create table IF NOT EXISTS default.bf_log_20150913 
(
    ip string COMMENT 'remote ip address',
    user string,
    req_url string COMMENT 'user request url'
)
COMMENT 'BeiFeng Web Access Logs'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/beifeng/hive/warehouse/bf_log_20150913'

(2)基于某些表查询出结果来建表

create table IF NOT EXISTS default.bf_log_20150913_sa
AS select ip,user from default.bf_log_20150913;

(3)基于已经存在的表和视图来构建新的表

create table IF NOT EXISTS default.bf_log_20150912
like default.bf_log_20150913;

查看表结构信息

desc student；
desc extended student；//查看详细信息
desc formatted student；//开发常用

load data local inpath '/opt/data/student.txt' into table db_hive.student;

查看Hive的函数

show function;

查看函数的使用

desc function upper;

查看函数的详细使用：

desc function extended upper;

select id,upper(name) uname from db_hive.student；

删除数据库：

drop database db_hive_03;
drop database db_hive_03 cascade;
drop database if exists db_hive_03;

以【雇员表和部门表】为例创建讲解Hive中表的操作

员工表

create table IF NOT EXISTS default.emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

部门表

create table IF NOT EXISTS default.dept(
    deptno int,
    dname string,
    loc string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

加载数据

load data local inpath '/opt/data/emp.txt' overwrite into table emp;
load data local inpath '/opt/data/dept.txt' overwrite into table dept;

overwrite关键字表示覆盖表中的数据

创建子表

create table if not exists default.dept_cats
as 
select * from dept;

清除表中的数据

truncate table dept_cats;

create table if not exists default.dept_like
like default.dept;

修改表的名称

alter table dept_like rename to dept_like_rename;

删除表

drop table if exists dept_like_rename;

Hive中外部表的讲解（对比管理表）

在hive中表的类型有两种
(1)管理表
(2)托管表(外部表) 加external关键字

create external table IF NOT EXISTS default.emp_ext(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

/hdfs/weblogs/
20150910.txt
20150911.txt
20150912.txt

外部表的区别在于删除表是不删除表所对应的数据文件，而管理表会删除对应的数据文件
企业中80%会使用外部表。

总结：

内部表也称为managed_table
默认存储在/user/hive/warehouse下，也可以在创建表的时候通过location来指定
删除表时，会删除表数据以及元数据

外部表称为external_table
在创建表时可以自己指定目录位置(location) 通常必须指定
删除表时，只会删除元数据不会删除表数据

create external table IF NOT EXISTS default.emp_ext2(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/beifeng/hive/warehouse/emp_ext2';

load data [LOCAL] INPATH 'filepath'
[OVERWRITE] into table tablename
[PARTITION (partcol1=val1,...)]

原始文件存储的位置
(1)本地 local
(2)hdfs

对表中的数据是否覆盖
(1)覆盖 OVERWRITE
(2)追加

分区表加载，特殊性
partition (partcol1=val1,…)

(1)加载本地文件到hive表

load data local inpath '/opt/datas/emp.txt' into table default.emp;

(2)加载hdfs文件到hive中

load data inpath '/user/beifeng/hive/datas/emp.txt' into table default.emp;

(3)加载数据覆盖表中已有的数据

load data inpath '/user/beifeng/hive/datas/emp.txt' overwrite into table default.emp;

(4)创建表的时候通过insert加载

create table default.emp_ci like emp;
insert into table default.emp_ci select * from default.emp;

(5)创建表的时候通过location指定加载

导出Hive表数据的几种方式讲解

(1)insert overwrite local directory '/opt/datas/hive_exp_emp'
   select * from default.emp;

(2)insert overwrite local directory '/opt/datas/hive_exp_emp'
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
   COLLECTION ITEMS TERMINATED BY '\n'
   select * from default.emp;

(3)bin/hive -e "select * from default.emp;" > /opt/data/exp_res.txt

(4)insert overwrite directory '/user/beifeng/hive/hive_exp_emp'
   select * from default.emp;

(5)sqoop方式

Hive中常见的查询讲解

select * from emp;
select t.empno,t.ename, t.deptno from emp t;
select * from emp limit 5; 
select t.empno,t.ename, t.deptno from emp t where t.sal between 800 and 1500;
select t.empno,t.ename, t.deptno from emp t where comm is null;

show functions;

查看函数的使用：

desc function extended min; 

select count(*)cnt from emp ;
select max(sal) max_sal from emp ;
select sum(sal) sum_sal from emp ;
select avg(sal) avg_sal from emp ;

Hive中数据导入导出Import和Export使用讲解

export 将hive表中的数据导出到外部
import 将外部数据导入到hive表中

export table tablename to 'export_target_path'

注意：这里的export_target_path指的是HDFS上的路劲

export table emp to '/user/beifeng/hive/export/emp_ext';

通过已有的表创建表结构类似的空表（不导入数据）

create table db_hive.emp like default.emp

import table db_hive.emp from '/user/beifeng/hive/export/emp_ext';

其他

order by 全局排序一个Reduce
sort by 每个Reduce内部进行排序，全局不是排序
distribute by 类似MR中partition，进行分区，结合sort by使用
cluster by 当distribute和sort字段相同时，使用方式

(1)order by是对全局数据进行排序，仅仅只有一个reduce，当数据量非常大的时候，很容易会出现问题
select * from emp order by empno desc;

(2)sort by是对每一个reduce内部进行排序，全局结果集来说不是排序的
设置job的reduce数量
set mapreduce.job.reduces=3;每个reduce会生成一个文件
select * from emp sort by empno asc;
insert overwrite local directory ‘/opt/datas/sortby-res’ select * from emp sort by empno asc;

(3)distribute by
分区partition 类似于MapReduce中分区partition，
对数据进行分区，结合sort by进行使用。
insert overwrite local directory ‘/opt/datas/disby-res’
select * from emp distribute by deptno sort by empno asc;
注意：distribute by必须在sort by前面，必须先分区才可以排序

(4)cluster by 当distribute和sort字段相同时，使用cluster by
insert overwrite local directory ‘/opt/datas/disby-res’
select * from emp cluster by empno asc;

操作

group by /having

每个部门的平均工资

select 
    t.deptno,
    avg(t.sal) avg_sal
from emp t 
group by t.deptno;

每个部门中最高岗位的薪水

select 
    t.deptno,
    t.job,
    max(t.sal) max_sal
from emp t 
group by t.deptno,t.job

having 是针对分组结果进行筛选
where是针对单挑记录进行筛选

求每个部门的平均薪水大于2000的部门

select 
    t.deptno,
    avg(t.sal) avg_sal
from emp t 
group by t.deptno
having avg_sal>2000;

join

等值join … on

select
    e.empno,e.ename, d.deptno, d.dname
from emp e join dept d 
on e.deptno=d.deptno

左右接连

select
    e.empno,e.ename, d.deptno, d.dname
from emp e left join dept d 
on e.deptno=d.deptno

Hive高级

HiveServer2、Beeline、JDBC使用

启动HiveServer2

bin/hiveserver2

启动Beeline
bin/beeline

通过Beeline连接
!connect jdbc:hive2://Master:10000 beifeng beifeng org.apache.hive.jdbc.HiveDriver
进入之后和在bin/hive是一样的，在beeline中操作只看到结果，看不到MapReduce的打印信息。

还有一种连接方式
bin/beeline -u jdbc:hive2://Master:10000/default

HiveServer2 JDBC
将分析的结果存储在hive表（result），前段通过DAO代码，进行数据的查询

Hive中常见的数据压缩讲解

(1)安装sanppy
(2)编译hadoop 2.x源码
mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
/opt/modules/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar
wordcount -Dmapreduce.map.output.compress=true
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
/user/beifeng/mapreduce/wordcount/input
/user/beifeng/mapreduce/wordcount/output2

在hive中配置
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

配置演示讲解mapreduce和hive中使用snappy压缩

input -> map -> shuffle -> reduce -> output

数据压缩的好处
数据量小，减少本地磁盘IO和网络IO

压缩格式：bzip2,gzip,lzo,snappy
压缩比：bzip2>gzip>lzo bzip2最节省存储空间
解压速度：lzo>gzip>bzip2 lzo解压速度是最快的

通常情况下
block -> map
10G ,10 block
压缩
5G ,5 block

Hive Storage Format讲解

数据存储
按行存储
按列存储

ORC PARQUET用的最多

create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;


load data local inpath '/opt/datas/page_views.data' into table page_views;

create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;

insert into table page_views_orc select * from page_views;

create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;

insert into table page_views_parquet select * from page_views;

hdfs中查看文件的大小

hdfs dfs -du -h path

10000条日志数据文件大小
textfile 18.1M
orc 2.6M
parquet 13.1M

查询测试

select session_id,count(*) cnt from page_views group by session_id order by cnt desc limit 30;

62s

select session_id,count(*) cnt from page_views_orc group by session_id order by cnt desc limit 30;

61s

同时指定数据存储格式和压缩方式：

create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

加载数据之后 NONE为7.6M SNAPPY方式压缩后3.8M

总结：在实际项目开发中，hive表的数据
存储格式：orcfile/parquet
数据压缩：snappy

Hive 企业使用优化

select * from emp；这里的select * 不走MapReduce
这是Hive本身的优化
select id from emp就会执行MapReduce,
其实这样也不应该去走MapReduce
这是FetchTask机制，配置如下：

hive.fetch.task.conversion
minimal

minimal是默认的配置也可以配置more
minimal情况下：(1)select * (2)分区字段的where (3)limit 不执行MapReduce
more情况下：select 某个字段不执行MapReduce

Hive的高级优化：

(1)大表拆分为子表

create [temporary] [external] table [if not exists] [db_name.]table_name
[as select_statement]

(2)外部表，分区表

结合使用
多级分区
create [temporary] [external] table [if not exists] [db_name.]table_name
[(col_name data_type [comment col_name],…)]
[partition by (col_name data_type [comment col_comment],…)]
[row format row_format]

(3)数据

存储格式(textfile,orcfile,parquet)
数据压缩(snappy)

(4)SQL

优化SQL语句
join/filter

(5)MapReduce

Reduce Number
JVM重用
推测执行

Join优化：

Commmon/Shuffle/Reduce Join
连接发生的阶段，发生在Reduce Task
大表对大表
每条数据都是从文件中读取的

Map Join

连接发生的阶段，发生在Map Task
用于大表对小表
*大表的数据放从文件中读取 cid
*小表的数据内存中 id
DistributedCache

可以设置 hive.auto.convert.join=true

SMB Join

Sort-Merge-Bucklet Join
通常用于用大表对大表的Join
通常需要设置：

set hive.auto.convert.sortmerge.join=true
set hive.optimize.bucketmap.join=true
set hive.optimize.bucketmap.sortedmerge=true

customer
3 bucket
lst 1001-1101
2nd 1201-1401
3rd 1501-1901
order
3 bucket
lst 1001-1101
2nd 1201-1401
3rd 1501-1901
customer的1st桶join order的1st桶
分桶Join

查看hive的执行计划
explain select * from emp;
explain select deptno,avg(sal) avg_sal from emp group by deptno;

并行执行：
hive.exec.parallel.thread.number=8 最好不要超过20
hive.exec.parallel=false

JVM重用
mapreduce.job.jvm.numtasks=1

Reduce数目
mapreduce.job.reduce=1

推测执行
mapreduce.map.speculative=true
hive.mapred.reduce.tasks.speculative.execution=true
mapreduce.reduce.speculative=true

Map数目
hive.merge.size.per.task=256000000

group by /count(distinct)会导致数据倾斜

动态分区调整
动态分区属性：设置为true表示开启动态分区功能(默认为false)
hive.exec.dynamic.partition=true

动态分区属性：设置为nonstrict，表示允许所有分区都是动态的(默认为strict)
设置为strict，表示必须保证至少有一个分区是静态的
hive.exec.dynamic.partition.mode=strict

动态分区属性：每个mapper或reducer可以创建的最大动态分区个数
hive.exec.max.dynamic.partition.pernode=100

动态分区属性：一个动态分区创建语句可以创建的最大动态分区个数
hive.exec.max.dynamic.partition=1000

动态分区属性：全局可以创建的最大文件个数
hive.exec.max.create.files=100000

strict mode
对分区表进行查询，在where子句中没有加分区过滤的话，讲禁止提交任务(默认：nostrict)
set hive.mapred.mode=strict
注意：使用严格模式可以禁止3种类型的查询
(1)对于分区表，不加分区字段过滤条件，不能执行
(2)对于order by 语句，必须使用limit语句
(3)限制笛卡尔积德查询(join的时候不使用on，而使用where的)

Hive项目实战

Hive 项目实战一创建表并导入日志数据，引出问题

create table IF NOT EXISTS default.bf_log_src(
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forward_for string,
host string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
store as textfile;

load data local inpath '/opt/data/moodle.ibeifeng.access.log' into table bf_log_src;

思路
*原表
*针对不同的业务创建不同的子表
*数据存储格式 orcfile/parquet
*数据压缩 snappy
*map output 数据压缩 snappy
*外部表
*分区表(演示)

Hive 项目实战二使用RegexSerDe处理Apache或者Ngnix日志文件

在创建表的时候可以采用hive提供的正则校验数据格式

drop table if not exists default.bf_log_src;
create table IF NOT EXISTS default.bf_log_src(
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forward_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex"=".............."
)
store as textfile;

Hive 项目实战三依据原表创建子表及设置orcfile存储和snappy压缩数据

drop table if not exists default.bf_log_comm;
create table IF NOT EXISTS default.bf_log_comm(
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

加载数据

insert into table default.bf_log_comm
select remote_addr,time_local,request,http_referer from default.bf_log_src;

Hive 项目实战四数据清洗之自定义UDF去除数据双引号

去除字段的” e.g. “xxxx”
定义UDF对原表数据进行ETL清洗

第一个UDF去除引号

public class RemoveQuotesUDF extends UDF {
    public Text evaluate(Text str) {
        //validate
        if (null == str) {
            return null;
        }
        if (null == str.toString()) {
            return new Text();
        }
        //remove
        return new Text(str.toString().replaceALL("\"", ""));
    }
}

add jar /opt/datas/hiveud2.jar

create temporary function my_removequotes 
as "com.beifeng.senior.hive.udf.RemoveQuotesUDF";

list jar

再次覆盖数据

insert overwrite into table default.bf_log_comm
select 
my_removequotes(remote_addr),
my_removequotes(time_local),
my_removequotes(request),
my_removequotes(http_referer) 
from default.bf_log_src;

=======3.19、 Hive 项目实战五数据清洗之自定义UDF转换日期时间数据
重点：第二个UDF处理日期时期字段
31/Aug/2015:00:04:37 +0000 => 20150831000437

public class DateTransformUDF extends UDF {

    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);

    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyyMMddHHmmss");

    public Text evaluate(Text input) {
        Text pouput = new Text();
        //validate
        if (null == input) {
            return null;
        }
        if (null == input.toString()) {
            return null;
        }
        String inputDate = input.toString().trim();
        if (null == inputDate) {
            return null;
        }

        try {
            //parse
            Date parseDate = inputFormat.parse(inputDate);
            //transform
            String outputDate = outputFormat.format(parseDate);
            //set
            output.set(outputDate);
        } catch(Exception e) {
            e.printStackTrace();
            return output;
        } 
        return output;
    }
}

将程序打成jar包

add jar /opt/datas/hiveud2.jar

create temporary function my_datetransform
as "com.beifeng.senior.hive.udf.DateTransformUDF";

insert overwrite into table default.bf_log_comm
select 
my_removequotes(remote_addr),
my_removequotes(my_datetransform(time_local)),
my_removequotes(request),
my_removequotes(http_referer) 
from default.bf_log_src;

select * from default.bf_log_comm limit 5;

Hive 项目实战六依据业务编写HiveQL分析数据

查看内置函数的s使用

desc function extended substring;
substring('Facebook', 5, 1) => 'b'  下标从1开始截取

select 
    t.hour,
    count(*) cnt
from
(
    select
    substring(time_local,9,2) hour 
    from default.bf_log_comm
) t
group by t.hour
order by cnt desc;
-----

select 
    t.prex_ip,
    count(*) cnt
from
(
    select
    substring(remote_addr,1,7) prex_ip 
    from default.bf_log_comm
) t
group by t.prex_ip
order by cnt desc;