Hive笔记（一）建表，加载数据，分区、桶，JDBC

最新推荐文章于 2022-11-25 10:13:18 发布

落幕7

最新推荐文章于 2022-11-25 10:13:18 发布

阅读量908

点赞数

分类专栏： # Hive 文章标签： hive big data sql

本文链接：https://blog.csdn.net/qq_43278189/article/details/120514978

版权

Hive 专栏收录该内容

8 篇文章 1 订阅

订阅专栏

一、Hive 查看SQL解析计划：

// extended 可选，可以打印更多细节
    explain  select  a.id
            ,a.name
            ,a.clazz
            ,t1.sum_score
    from(
        select  id
                ,sum(score) as sum_score
        from score 
        group by id
    )t1 right join (
        select  id
                ,name
                ,'文科一班' as clazz
        from students
        where clazz = '文科一班'
    ) a
    on t1.id = a.id
    order by t1.sum_score desc
    limit 10;

二、Hive完整建表

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name(
    [(col_name data_type [COMMENT col_comment], ...)]
    )
    [COMMENT table_comment]
    [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    [CLUSTERED BY (col_name, col_name, ...) 
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
    [
    [ROW FORMAT row_format] 
    [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [ WITH SERDEPROPERTIES (...) ]  (Note:  only available starting with 0.6.0)
      ]
   [LOCATION hdfs_path]
   [TBLPROPERTIES (property_name=property_value, ...)]  (Note:  only available starting with 0.6.0)
   [AS select_statement]  (Note: this feature is only available starting with 0.5.0.)
    注意:
      []:表示可选 
      EXTERNAL:外部表
      (col_name data_type [COMMENT col_comment],...:定义字段名，字段类型
      COMMENT col_comment:给字段加上注释
      COMMENT table_comment:给表加上注释
      PARTITIONED BY (col_name data_type [COMMENT col_comment],...):分区 分区字段注释
      CLUSTERED BY (col_name, col_name,...):分桶
      SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS:设置排序字段 升序、降序
      ROW FORMAT row_format:指定设置行、列分隔符(默认行分隔符为\n)
      STORED AS file_format:指定Hive储存格式：textFile、rcFile、SequenceFile 默认为：textFile
      LOCATION hdfs_path:指定储存位置(默认位置在hive.warehouse目录下)
      TBLPROPERTIES (property_name=property_value, ...):跟外部表配合使用，比如：映射HBase表，然后可以使用HQL对hbase数据进行查询，当然速度比较慢
      AS select_statement:从别的表中加载数据 select_statement=sql语句

建表1：全部使用默认建表方式

    例如:
        create table students
        (
            id bigint,
            name string,
            age int,
            gender string,
            clazz string
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 
    注意:
        分割符不指定,默认不分割
        通常指定列分隔符,如果字段只有一列可以不指定分割符

建表2：指定location （这种方式也比较常用）

    create table students2
        (
            id bigint,
            name string,
            age int,
            gender string,
            clazz string
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/input1'; 
     注意:
        指定Hive表的数据的存储位置，一般在数据已经上传到HDFS，想要直接使用，会指定Location，通常Locaion会跟外部表一起使用，内部表一般使用默认的location

建表3：指定存储格式

create table student_rc
        (
            id bigint,
            name string,
            age int,
            gender string,
            clazz string
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS rcfile; 

    注意:
        指定储存格式为rcfile，inputFormat:RCFileInputFormat,outputFormat:RCFileOutputFormat，如果不指定，默认为textfile.
    注意：除textfile以外，其他的存储格式的数据都不能直接加载，需要使用从表加载的方式。

建表4：从其他表中加载数据

格式:
   create table xxxx as select_statement(SQL语句) (这种方式比较常用)
    例子:
   create table students4 as select * from students2;

建表5：从其他表中获取表结构

格式:
  create table xxxx like table_name  只想建表，不需要加载数据
例子:
  create table students5 like students;

三、Hive加载数据

1、使用hadoop dfs -put '本地数据' 'hive表对应的HDFS目录下'

hdfs dfs -put 'usr/local/data/students.txt' /......

2、使用 load data inpath
方式1和方式2的区别:
1.上传数据到hdfs目录和hive表没有任何关系(不需要数据格式进行匹配,hive读取数据还是需要数据格式的匹配)
2.上传数据到hive表和hive表有关系(需要数据格式进行匹配)

下列命令需要在hive shell里执行

// 将HDFS上的/input1目录下面的数据 移动至 students表对应的HDFS目录下，注意是 移动、移动、移动
// 实际上就是hdfs执行了mv的操作
load data inpath '/input1/students.txt' into table students;

// 清空表
truncate table student_test;
// 加上 local 关键字 可以将Linux本地目录下的文件 上传到 hive表对应HDFS 目录下 原文件不会被删除
// 实际上就是hadoop执行了put操作
load data local inpath '/usr/local/soft/data/students.txt' into table students;
// overwrite 覆盖加载
// 实际上就是hadoop执行了rmr然后put操作
load data local inpath '/usr/local/soft/data/students.txt' overwrite into table students;

在执行命令truncate 清空hive表时，

报错如下：
FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table mtime_qa.userorder_logs

因为我清空的是一个外部表。
外部表不被hive管理，hive只是维护了一个到外部表的引用关系，并不能清空外部的数据

3、create table xxx as SQL语句

4、insert into table xxxx SQL语句（没有as）传输给别的格式的hive table

// 将 students表的数据插入到students2 这是复制 不是移动 students表中的表中的数据不会丢失
insert into table students2 select * from students;

// 覆盖插入 把into 换成 overwrite
insert overwrite table students2 select * from students;

四、Hive 内部表（Managed tables）vs 外部表（External tables）

建表：

外部表和普通表(内部表)的区别
1.外部表的路径可以自定义，内部表的路径需要在 hive/warehouse/目录下
2.删除表后，普通表数据文件和表信息都删除。外部表仅删除表信息

注意:公司中实际应用场景为外部表,为了避免表意外删除数据也丢失不能通过路径来判断是目录还是hive表(是内部表还是外部表)

// 内部表
create table students_managed01
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

//内部表指定location
create table students_managed02
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/students_managed';

// 外部表
create external table students_external01
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOCATION '/students_external';

// 外部表不指定location
create external table students_external02
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

加载数据：

hive> dfs -put /usr/local/soft/data/students.txt /input2/;
hive> dfs -put /usr/local/soft/data/students.txt /input3/;

删除表：

hive> drop table students_internal;
Moved: 'hdfs://master:9000/input2' to trash at: hdfs://master:9000/user/root/.Trash/Current
OK
Time taken: 0.474 seconds
hive> drop table students_external;
OK
Time taken: 0.09 seconds
hive>

可以看出，删除内部表的时候，表中的数据（HDFS上的文件）会被同表的元数据一起删除删除外部表的时候，只会删除表的元数据，不会删除表中的数据（HDFS上的文件）一般在公司中，使用外部表多一点，因为数据可以需要被多个程序使用，避免误删，通常外部表会结合location一起使用外部表还可以将其他数据源中的数据映射到 hive中，比如说：hbase，ElasticSearch......

ACID(重点):

1.A:原子性（Atomicity）
原子性是指事务是一个不可分割的工作单位，事务中的操作要么都发生，要么都不发生。
2.C:一致性（Consistency）
事务前后数据的完整性必须保持一致。
3.I:隔离性（Isolation）
事务的隔离性是多个用户并发访问数据库时，数据库为每一个用户开启的事务，不能被其他事务的操作数据所干扰，多个并发事务之间要相互隔离。
4.D:持久性（Durability）
持久性是指一个事务一旦被提交，它对数据库中数据的改变就是永久性的，接下来即使数据库发生故障也不应该对其有任何影响

设计外部表的初衷就是 让 表的元数据 与 数据 解耦
Managed tables are Hive owned tables where the entire lifecycle of the tables’ data are 
managed and controlled by Hive.External tables    are tables where Hive has loose 
coupling with  the data.All the write operations to the Managed tables are performed 
using Hive SQL commands. If a Managed table or partition is dropped, the    data and  metadata associated with that table or  partition are deleted. The transactional semantics (ACID) are also supported only  on Managed tables.

五、Hive 分区

分区表指的是在创建表时指定分区空间，实际上就是在hdfs上表的目录下再创建子目录

作用：进行分区裁剪，避免全表扫描，减少MapReduce处理的数据量，提高查询效率

一般在公司的hive中，所有的表基本上都是分区表，通常按日期分区、地域分区

分区表在使用的时候记得加上分区字段

分区也不是越多越好，一般不超过3级，根据实际业务衡量

建立分区表：

create external table students_pt1
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(pt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

增加一个分区：

alter table students_pt1 add partition(pt='20210904');

删除一个分区：

alter table students_pt drop partition(pt='20210904');

查看某个表的所有分区

show partitions students_pt; // 推荐这种方式（直接从元数据中获取分区信息）

select distinct pt from students_pt; // 不推荐

往分区中插入数据：

insert into table students_pt partition(pt='20210902') select * from students;

load data local inpath '/usr/local/soft/data/students.txt' into table students_pt partition(pt='20210902');

查询某个分区的数据：

// 全表扫描，不推荐，效率低
select count(*) from students_pt;

// 使用where条件进行分区裁剪，避免了全表扫描，效率高
select count(*) from students_pt where pt='20210101';

// 也可以在where条件中使用非等值判断
select count(*) from students_pt where pt<='20210112' and pt>='20210110';

Hive动态分区

有的时候我们原始表中的数据里面包含了 ‘‘日期字段 dt’’，我们需要根据dt中不同的日期，分为不同的分区，将原始表改造成分区表。

hive默认不开启动态分区

动态分区：根据数据中某几列的不同的取值划分不同的分区

开启Hive的动态分区支持

# 表示开启动态分区
hive> set hive.exec.dynamic.partition=true;
# 表示动态分区模式：strict（需要配合静态分区一起使用）、nostrict
# strict： insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nostrict;
# 表示支持的最大的分区数量为1000，可以根据业务自己调整
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

建立原始表并加载数据

create table students_dt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    dt string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

建立分区表并加载数据

create table students_dt_p
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

使用动态分区插入数据

// 分区字段需要放在 select 的最后，如果有多个分区字段 同理，它是按位置匹配，不是按名字匹配
insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;
// 比如下面这条语句会使用age作为分区字段，而不会使用student_dt中的dt作为分区字段
insert into table students_dt_p partition(dt) select id,name,age,gender,dt,age from students_dt;

多级分区

create table students_year_month
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string,
    year string,
    month string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

create table students_year_month_pt
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
PARTITIONED BY(year string,month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

insert into table students_year_month_pt partition(year,month) select id,name,age,gender,clazz,year,month from students_year_month;

多级分区

六、Hive分桶

分桶实际上是对文件（数据）的进一步切分

Hive默认关闭分桶

作用：在往分桶表中插入数据的时候，会根据 clustered by 指定的字段进行hash分区对指定的buckets个数进行取余，进而可以将数据分割成buckets个数个文件，以达到数据均匀分布，可以解决Map端的“数据倾斜”问题，方便我们取抽样数据，提高Map join效率

分桶字段需要根据业务进行设定

开启分桶开关

hive> set hive.enforce.bucketing=true;

建立分桶表

create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

往分桶表中插入数据

// 直接使用load data 并不能将数据打散
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

// 需要使用下面这种方式插入数据，才能使分桶表真正发挥作用
insert into students_buks select * from students;

Hive分桶表的使用场景以及优缺点分析

七、Hive JDBC

启动hiveserver2

hive --service metastore

hive --service hiveserver2 &

或者
hiveserver2 &

新建maven项目并添加两个依赖

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>2.3.8</version>
    </dependency>

编写JDBC代码

import java.sql.*;

public class HiveJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://master:10000/test3");
        Statement stat = conn.createStatement();
        ResultSet rs = stat.executeQuery("select * from students limit 10");
        while (rs.next()) {
            int id = rs.getInt(1);
            String name = rs.getString(2);
            int age = rs.getInt(3);
            String gender = rs.getString(4);
            String clazz = rs.getString(5);
            System.out.println(id + "," + name + "," + age + "," + gender + "," + clazz);
        }
        rs.close();
        stat.close();
        conn.close();
    }
}

八、Hive的数据类型

基本数据类型
数值型
TINYINT — 微整型，只占用1个字节，只能存储0-255的整数。
SMALLINT– 小整型，占用2个字节，存储范围–32768 到 32767。
INT– 整型，占用4个字节，存储范围-2147483648到2147483647。
BIGINT– 长整型，占用8个字节，存储范围-2^63到2^63-1。
布尔型BOOLEAN — TRUE/FALSE
浮点型FLOAT– 单精度浮点数。
DOUBLE– 双精度浮点数。
字符串型STRING– 不设定长度。
日期类型：
1，Timestamp 格式“YYYY-MM-DD HH：MM：SS.fffffffff”（9位小数位精度） 
2，Date DATE值描述特定的年/月/日，格式为YYYY-MM-DD。
复杂数据类型：
Structs，Maps，Arrays

复杂数据类型参考

九、Hive–HQL语法-DDL

创建数据库 create database xxxxx;
查看数据库 show databases；
删除数据库 drop database tmp;
强制删除数据库：drop database tmp cascade;
查看表：SHOW TABLES；
查看表的元信息：
    desc test_table;
    describe extended test_table;
    describe formatted test_table;
查看建表语句：show create table table_XXX
重命名表：
    alter table test_table rename to new_table;
修改列数据类型：alter table lv_test change column colxx string;
增加、删除分区：
    alter table test_table add partition (pt=xxxx) 
    alter table test_table drop if exists partition(...);

在这里插入图片描述

十、Hive 函数使用

1.主要常用函数

if函数 if（，，）
case when 函数：case when 。。。end
日期函数：to_date…
字符串函数：concat，concat_ws， split
聚合函数：sum，count，avg，min，max
null值判断：is null ，is not null
其他：round，floor，……

高级函数

窗口函数（开窗函数）：用户分组中开窗

row_number() 等
select * from 
      (select name,date_time,row_number() 
       over(partition by name order by cost desc) as rn 
       from window_t)a 
       where rn<=3;


一般用于分组中求 TopN

dense_rank()
rank()

参考教程

lateral view—行转列

lateral view用于和split, explode等UDTF一起使用，它能够将一行数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。lateral view首先为原始表的每行调用UDTF，UTDF会把一行拆分成一或者多行，lateral view再把结果组合，产生一个支持别名表的虚拟表。
explode函数参数仅接受array和map类型，不支持两个一起用。所以lateral view可以解决

创建表

create table laterv_tb(
	name string,
	weight array<int>) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
COLLECTION ITEMS TERMINATED BY ',' ;

SELECT name,new_num 
	FROM laterv_tb LATERAL 
	VIEW explode(weight) num
	AS new_num;

单词统计：

select w.word,count(*) from (select explode(split(line,' ')) word 
from wc_test)w group by w.word;

2.Hive-自定义函数UDF

1.UDF函数可以直接应用于select语句，对查询结构做格式化处理后，再输出内容。
2.编写UDF函数的时候需要注意一下几点：
a）自定义UDF需要继承org.apache.hadoop.hive.ql.exec.UDF。
b）需要evaluate函数。
3.步骤
a）把程序打包放到目标机器上去；
b）进入hive客户端，添加jar包： add jar /usr/local/testdata/hive_UP.jar;
c）创建临时函数：

hive>CREATE TEMPORARY FUNCTION f_up as 'hive_demo.hive_udf';

查询HQL语句：

select f_up(line) from wc_test;

销毁临时函数：

DROP TEMPORARY FUNCTION f_up;

注：UDF只能实现一进一出的操作，
如果需要实现多进一出，则需要实现UDAF
如果需要实现一进多出，则需要实现UDTF

3.Hive结合shell脚本企业实战用法

hive -e “select * from wc_test”
hive -f   /usr/local/testdata/test.hql

落幕7

关注

0
点赞
踩
7

收藏

觉得还不错? 一键收藏
0
评论
Hive笔记（一）建表，加载数据，分区、桶，JDBC

一、Hive 查看SQL解析计划：// extended 可选，可以打印更多细节 explain select a.id ,a.name ,a.clazz ,t1.sum_score from( select id ,sum(score) as sum_score from score group by id )t1 rig
复制链接

扫一扫