Hive: DML Operations and Querying Tables

CoderBoom

Categories: Big Data, hive. Tags: Hive, DML operations

Article link: https://blog.csdn.net/CoderBoom/article/details/84311791


Hive DML Operations

1. Load

A load operation is a pure copy/move operation: it moves data files into the location that corresponds to a Hive table.

Syntax:

load data [local] inpath 'filepath' [overwrite] into table tablename [partition (partcol1=val1, partcol2=val2 ...)]

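**Notes:**

filepath :
- Relative path: hivedata/1.txt
- Absolute path: /user/hive/hivedata/1.txt
- Full URI: hdfs://namenode:9000/user/hive/hivedata/1.txt
This is the location of the data to be loaded. load is usually the preferred way to import data, but pay close attention to the overwrite keyword.
local
- If local is specified, the load command looks for the file path on the Linux filesystem of the machine where the Hive service runs.
- If local is not specified, the path refers to a file on HDFS.
overwrite
- If overwrite is used, the existing contents of the target table (or partition) are deleted first. Use it with caution!

Further notes:

Loading data with Hive load

The following applies to managed (internal) tables.

If the data is on the local Linux filesystem of the Hive server (local), the load is a file copy: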

load data local inpath '/root/hivedata/z.txt' into table t_z;
INFO  : Loading data to table itcast.t_z from file:/root/hivedata/z.txt

If the data is already on HDFS (no local keyword), the load is a file move:

load data inpath '/z.txt' into table t_z1;
INFO  : Loading data to table itcast.t_z1 from hdfs://node-1:9000/z.txt

Under the hood

load data local inpath '/root/hivedata/z.txt' into table t_z;
is equivalent to  hadoop fs -put /root/hivedata/z.txt  /user/hive/warehouse/itcast.db/t_z
----------------------------------
load data inpath '/z.txt' into table t_z1;
is equivalent to  hadoop fs -mv /z.txt  /user/hive/warehouse/itcast.db/t_z1
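
The copy-versus-move distinction and the effect of overwrite can be illustrated with a small Python sketch. This is only a toy model on the local filesystem, not Hive code: `load_data`, `demo`, and the file names are hypothetical, and a temporary directory stands in for both the Hive server's disk and HDFS.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src: Path, table_dir: Path, local: bool, overwrite: bool = False) -> None:
    """Toy model of `load data [local] inpath ... [overwrite] into table ...`."""
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # overwrite: files already in the table directory are deleted first
        for old in table_dir.iterdir():
            old.unlink()
    target = table_dir / src.name
    if local:
        shutil.copy(src, target)            # local: copy, the source stays in place
    else:
        shutil.move(str(src), str(target))  # non-local: move, the source disappears


def demo() -> dict:
    root = Path(tempfile.mkdtemp())
    try:
        local_file = root / "z_local.txt"   # stands in for a file on the Hive server
        hdfs_file = root / "z_hdfs.txt"     # stands in for a file already on HDFS
        local_file.write_text("1,a\n")
        hdfs_file.write_text("2,b\n")
        table = root / "warehouse" / "t_z"

        load_data(local_file, table, local=True)
        load_data(hdfs_file, table, local=False)
        after_append = sorted(p.name for p in table.iterdir())

        # Loading again with overwrite wipes the table directory first
        load_data(local_file, table, local=True, overwrite=True)
        after_overwrite = sorted(p.name for p in table.iterdir())

        return {
            "local_source_kept": local_file.exists(),
            "hdfs_source_kept": hdfs_file.exists(),
            "after_append": after_append,
            "after_overwrite": after_overwrite,
        }
    finally:
        shutil.rmtree(root)
```

Calling `demo()` shows that the "local" source file survives the load while the "HDFS" source file is gone, and that an overwrite load leaves only the newly loaded file in the table directory.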

[Figure: diagram of load with and without local]

2. Insert

In Hive, insert is mainly used together with a select query to insert the query result into a table, for example:

insert overwrite table stu_buck select * from student cluster by(Sno);

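The number of columns in the query result must match the number of columns in the target table.

If the data type of a query column differs from that of the corresponding target column, Hive attempts a conversion. The conversion is not guaranteed to succeed, and values that fail to convert become NULL.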
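
The "failed conversion becomes NULL" rule can be pictured with a short Python sketch. It is a simplified model, not Hive's actual conversion logic: `cast_to_int` is a hypothetical helper, `None` stands in for NULL, and Python's `int()` parsing differs from Hive in edge cases.

```python
from typing import List, Optional


def cast_to_int(value: Optional[str]) -> Optional[int]:
    """Toy model: a value that cannot be converted becomes NULL (None)."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        # The row is still inserted; only this column value turns into NULL
        return None


source_column: List[Optional[str]] = ["12", "abc", None, "7"]
converted = [cast_to_int(v) for v in source_column]
print(converted)  # [12, None, None, 7]
```

The point of the sketch is that a bad value does not abort the insert; it silently becomes NULL, so it is worth checking for unexpected NULLs after inserting across different column types.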

Multi-insert

Create a source table and two test tables
create table source_table (id int, name string) row format delimited fields terminated by ',';
create table test_insert1 (id int) row format delimited fields terminated by ',';
create table test_insert2 (name string) row format delimited fields terminated by ',';

Ordinary inserts: the source table is scanned twice, once per statement
insert into table test_insert1 select id from source_table;
insert into table test_insert2 select name from source_table;

Multi-insert: one scan, several inserts
from source_table                     
insert overwrite table test_insert1 
select id
insert overwrite table test_insert2
select name;
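
The benefit of multi-insert is that the source is read once instead of once per target. The following Python sketch models that idea only; `CountingSource`, `separate_inserts`, and `multi_insert` are hypothetical names, and lists stand in for the tables.

```python
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


class CountingSource:
    """Holds the source rows and counts how many full scans are performed."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)
        self.scans = 0

    def scan(self) -> Iterator[Row]:
        self.scans += 1
        return iter(self.rows)


def separate_inserts(src: CountingSource) -> Tuple[List[int], List[str]]:
    # Two independent statements: each one scans the source table
    ids = [row[0] for row in src.scan()]
    names = [row[1] for row in src.scan()]
    return ids, names


def multi_insert(src: CountingSource) -> Tuple[List[int], List[str]]:
    # One pass over the source feeds both target tables
    ids: List[int] = []
    names: List[str] = []
    for id_, name in src.scan():
        ids.append(id_)
        names.append(name)
    return ids, names


rows = [(1, "tom"), (2, "jerry")]  # hypothetical sample rows
src_a, src_b = CountingSource(rows), CountingSource(rows)
result_a = separate_inserts(src_a)
result_b = multi_insert(src_b)
```

Both approaches produce the same target contents, but `src_a.scans` ends at 2 while `src_b.scans` ends at 1.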

Dynamic partition insert

Step 1: enable dynamic partitioning
-- Whether dynamic partitioning is enabled (default: false before Hive 0.9.0, true from 0.9.0 on).
set hive.exec.dynamic.partition=true;
Step 2: set the dynamic partition mode
-- The default mode, strict, requires at least one static partition; nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
Requirement:
Insert the data in dynamic_partition_table into the matching partitions of the target table d_p_t, based on the date (day).

Source table:
create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ",";

Data
2015-05-10,ip1
2015-05-10,ip2
2015-06-14,ip3
2015-06-14,ip4
2015-06-15,ip1
2015-06-15,ip2

Load the data:
load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;

Target table:
create table d_p_t(ip string) partitioned by (month string,day string);

Dynamic insert:
insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day from dynamic_partition_table;

Query:
select * from d_p_t;
select * from dynamic_partition_table;

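Dynamic partition values are matched by position, not by name: the last columns of the select list, in order, supply the values for the partition columns.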
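
A small Python sketch can make the positional mapping concrete. It is a model of the idea rather than of Hive internals: `dynamic_insert` is a hypothetical function, and the `month=.../day=...` strings imitate partition directory names. The rows are the sample data above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (day, ip) rows from dynamic_partition_table
source_rows = [
    ("2015-05-10", "ip1"), ("2015-05-10", "ip2"),
    ("2015-06-14", "ip3"), ("2015-06-14", "ip4"),
    ("2015-06-15", "ip1"), ("2015-06-15", "ip2"),
]


def dynamic_insert(rows: List[Tuple[str, str]],
                   partition_cols: Tuple[str, ...] = ("month", "day")
                   ) -> Dict[str, List[Tuple[str, ...]]]:
    partitions: Dict[str, List[Tuple[str, ...]]] = defaultdict(list)
    n = len(partition_cols)
    for day, ip in rows:
        # select ip, substr(day,1,7) as month, day
        selected = (ip, day[:7], day)
        # The trailing n select columns become the partition values, by position
        data_cols, part_vals = selected[:-n], selected[-n:]
        path = "/".join(f"{col}={val}" for col, val in zip(partition_cols, part_vals))
        partitions[path].append(data_cols)
    return dict(partitions)


result = dynamic_insert(source_rows)
for path in sorted(result):
    print(path, result[path])
```

Because the mapping is positional, swapping the last two select columns would put day strings into the month partition; the alias `as month` plays no role in the assignment.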

[Figure: diagram of dynamic and static partitioning]

Queries in Hive

Syntax

select [all | distinct] select_expr, select_expr, ...
from table_reference
[join table_other on expr]
[where where_condition]
[group by col_list [having condition]]
[cluster by col_list
  | [distribute by col_list] [sort by | order by col_list]
]
[limit number]

Bucketed queries

With mapreduce.job.reduces=-1, the actual number of reducers depends on the input data size:
Number of reduce tasks not specified. Estimated from input data size: 1

set mapreduce.job.reduces=2;
Number of reduce tasks not specified. Defaulting to jobconf value of: 2

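cluster by distributes rows by the given column and also sorts the rows inside each bucket by that same column.

distribute by (distribution) + sort by (sorting): the two columns may differ.

-- Distribute by student number, sort by age
-- Before bucketing, enable bucketing enforcement and set the number of buckets
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
select * from student distribute by(Sno) sort by(sage);

When the two columns are the same:

cluster by = distribute by (distribution) + sort by (sorting)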
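
The relationship between distribute by, sort by, and cluster by can be sketched in Python. This is an illustrative model only: `distribute_and_sort` and `cluster_by` are hypothetical helpers, Python's `hash` stands in for Hive's partitioning function, and the student rows are made up.

```python
from typing import Dict, List

Row = Dict[str, int]


def distribute_and_sort(rows: List[Row], num_reducers: int,
                        distribute_key: str, sort_key: str) -> List[List[Row]]:
    """distribute by: pick a reducer per row; sort by: sort inside each reducer."""
    buckets: List[List[Row]] = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[hash(row[distribute_key]) % num_reducers].append(row)
    return [sorted(bucket, key=lambda r: r[sort_key]) for bucket in buckets]


def cluster_by(rows: List[Row], num_reducers: int, key: str) -> List[List[Row]]:
    """cluster by is distribute by + sort by on the same column."""
    return distribute_and_sort(rows, num_reducers, key, key)


# Hypothetical student rows: student number and age
students = [
    {"sno": 1, "sage": 22}, {"sno": 2, "sage": 19}, {"sno": 3, "sage": 21},
    {"sno": 4, "sage": 18}, {"sno": 5, "sage": 20}, {"sno": 6, "sage": 23},
]

by_age = distribute_and_sort(students, 2, "sno", "sage")
ages_per_bucket = [[r["sage"] for r in bucket] for bucket in by_age]
snos_clustered = [[r["sno"] for r in bucket] for bucket in cluster_by(students, 2, "sno")]
```

Each bucket is sorted on its own, but concatenating the buckets does not give a globally sorted result. That is the difference from order by, which forces everything through a single reducer.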

order by (global sort)

select * from student order by(sage);
Number of reduce tasks determined at compile time: 1
A global sort means a single output file, that is, only one reduce task.

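Exporting table data

insert + directory (mind the overwrite keyword)

Saves the query result to a specified directory, either local or on HDFS.

Export to the local filesystem. Note: the target directory should preferably be new and empty, because overwrite replaces whatever it contains.

insert overwrite local directory '/root/123456' select * from student distribute by(Sno) sort by(sage);
This writes the result of the select query to the specified directory.

Export to HDFS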

insert overwrite directory '/aaa/test' select * from t_p;

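Hive automatic local mode

Comparing Hive cluster mode and local mode

hive sql---->mapreduce--->yarn-->HDFS (structured data)
Hive is slow mainly because MapReduce execution is slow
---------
hive sql---->mapreduce-->MapReduce run in local mode (simulated with local threads)--->hdfs

For this reason Hive provides an automatic local mode, which decides from a few conditions whether to switch automatically.

set hive.exec.mode.local.auto=true;
When all three conditions below hold, Hive switches to local mode; otherwise it stays in cluster mode.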

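# The total input size of the job must be below the configured value
The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)

# The number of input splits (map tasks) must be below the configured value
The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)

# The number of reduce tasks must be 0 or 1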
The total number of reduce tasks required is 1 or 0.
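
The three conditions combine as a simple conjunction. The Python sketch below restates them as a predicate; `should_run_locally` is a hypothetical function, the defaults are the ones quoted above, and the strict "less than" comparisons follow the wording of the conditions rather than Hive's source code.

```python
def should_run_locally(input_bytes: int, num_map_tasks: int, num_reduce_tasks: int,
                       max_input_bytes: int = 128 * 1024 * 1024,
                       max_tasks: int = 4) -> bool:
    """Return True only when all three local-mode conditions hold."""
    return (
        input_bytes < max_input_bytes      # total input size below the limit
        and num_map_tasks < max_tasks      # number of map tasks below the limit
        and num_reduce_tasks in (0, 1)     # at most one reduce task
    )


small_job = should_run_locally(10 * 1024 * 1024, 1, 1)
too_much_input = should_run_locally(512 * 1024 * 1024, 1, 1)
too_many_maps = should_run_locally(10 * 1024 * 1024, 8, 1)
too_many_reducers = should_run_locally(10 * 1024 * 1024, 1, 2)
```

Only `small_job` is True; failing any single condition keeps the job in cluster mode.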

3. Hive join

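Besides the inner, left, right, and full joins found in traditional databases, Hive also supports LEFT SEMI JOIN and CROSS JOIN, although both can be expressed with the other join types.

Hive supports equi-joins (a.id = b.id). Non-equi join conditions such as a.id > b.id are not supported in the ON clause before Hive 2.2.0, which includes the Hive 1.x version used in this article. Hive also supports joins across more than two tables.

Key points to keep in mind when writing join queries:

In each map/reduce stage of a join, the rows of every table except the last are buffered while the last table is streamed, so put the largest table last; otherwise buffering wastes a lot of memory.
The left, right, and full outer keywords control how rows without a match are handled, for example: select a.val, b.val from a left outer join b on (a.key=b.key)
Joins are evaluated before the WHERE clause.
Joins are not commutative: they are left-associative, so the order of the tables matters.

The different joins in Hive
Prepare the data. a.txt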

1,a
2,b
3,c
4,d
7,y
8,u

b.txt

2,bb
3,cc
7,yy
9,pp

Create the tables:

create table a(id int,name string)
row format delimited fields terminated by ',';

create table b(id int,name string)
row format delimited fields terminated by ',';

Load the data:

load data local inpath '/root/hivedata/a.txt' into table a;
load data local inpath '/root/hivedata/b.txt' into table b;

Tests:

inner join

select * from a inner join b on a.id=b.id;
Result: only the rows that match in both tables are shown
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+--+

left join

select * from a left join b on a.id=b.id;
Result: every row of the left table is shown; right-table columns without a match are NULL
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+--+

right join

select * from a right join b on a.id=b.id;
Result: every row of the right table is shown; left-table columns without a match are NULL
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+

full outer join

select * from a full outer join b on a.id=b.id;
Result: every row of both tables is shown; columns without a match on either side are NULL
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+--+
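
The four result tables above can be reproduced with a few lines of Python over the same a.txt and b.txt data. This is a naive nested-loop sketch to show the semantics, not how Hive executes joins; the function names are hypothetical and `None` stands in for NULL.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[str]]

a: List[Row] = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b: List[Row] = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]
NULL_ROW: Row = (None, None)


def inner_join(left: List[Row], right: List[Row]) -> List[tuple]:
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_join(left: List[Row], right: List[Row]) -> List[tuple]:
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + NULL_ROW)  # keep the left row, pad the right side
    return out


def right_join(left: List[Row], right: List[Row]) -> List[tuple]:
    out = []
    for r in right:
        matches = [l for l in left if l[0] == r[0]]
        if matches:
            out.extend(l + r for l in matches)
        else:
            out.append(NULL_ROW + r)  # keep the right row, pad the left side
    return out


def full_outer_join(left: List[Row], right: List[Row]) -> List[tuple]:
    left_keys = {l[0] for l in left}
    unmatched_right = [NULL_ROW + r for r in right if r[0] not in left_keys]
    return left_join(left, right) + unmatched_right


for row in full_outer_join(a, b):
    print(row)
```

The row counts match the tables above: 3 rows for the inner join, 6 for the left join, 4 for the right join, and 7 for the full outer join.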

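Special joins in Hive

select * from a left semi join b on a.id = b.id;
For this data, this is equivalent to
select a.* from a inner join b on a.id=b.id;
but the first form performs better than the second. The equivalence relies on b.id being unique: if b contained duplicate ids, the inner join would return the matching rows of a once per duplicate, whereas left semi join returns each row of a at most once.
Result: only the rows of table a that have a match in b are shown, with columns from a only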
+-------+---------+--+
| a.id  | a.name  |
+-------+---------+--+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+--+
This corresponds to the following query, which is very inefficient in Hive:
select a.id,a.name from a where a.id in (select b.id from b);

select a.id,a.name from a join b on (a.id = b.id);

select * from a inner join b on a.id=b.id;
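
Left semi join and an inner join that keeps only a's columns give the same rows for the sample data, but they diverge once the right table has duplicate keys. The Python sketch below shows this; the helper names and the extra row `(2, "bb2")` are hypothetical additions for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, str]

a: List[Row] = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b: List[Row] = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]


def left_semi_join(left: List[Row], right: List[Row]) -> List[Row]:
    # Only asks whether a match exists, so each left row appears at most once
    right_keys = {r[0] for r in right}
    return [l for l in left if l[0] in right_keys]


def inner_join_left_columns(left: List[Row], right: List[Row]) -> List[Row]:
    # select a.* from a inner join b: one output row per matching pair
    return [l for l in left for r in right if l[0] == r[0]]


same_on_sample = left_semi_join(a, b) == inner_join_left_columns(a, b)

b_with_duplicate = b + [(2, "bb2")]  # hypothetical duplicate key in b
semi_rows = left_semi_join(a, b_with_duplicate)
inner_rows = inner_join_left_columns(a, b_with_duplicate)
```

With the duplicate key, `semi_rows` still has three rows while `inner_rows` has four, because `(2, "b")` is emitted once for each matching row in b.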

cross join (use with caution)

Returns the Cartesian product of the two tables; no join key is required.
select a.*,b.* from a cross join b;

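4. Hive Configuration Parameters

4.1 Hive command line

Running the Hive CLI locally

Run $HIVE_HOME/bin/hive -H or --help to display the help options:

Notes:

1. -i: initialization HQL file.
2. -e: run the given HQL from the command line.
3. -f: run an HQL script file.
4. -v: echo the executed HQL statements to the console.
5. -p: connect to Hive Server on port number.
6. --hiveconf x=y: use this to set hive/hadoop configuration variables.

For example: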

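# Query table tab1 directly
$HIVE_HOME/bin/hive -e 'select * from tab1 a';
# Put the SQL statements in a script and run the query from the script
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
# Run an initialization HQL file
$HIVE_HOME/bin/hive -i /home/my/hive-init.sql
# Run a query with configuration overrides (the backslashes continue the command on the next line)
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a' \
  --hiveconf hive.exec.compress.output=true \
  --hiveconf mapred.reduce.tasks=32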

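4.2 Ways to configure Hive parameters

Hive configuration parameters
- Configuration file (globally effective)
- Command-line parameters (effective for the Hive instance being started)
- Parameter declaration (effective for the connected Hive session)

conf/hive-site.xml  global; applies in both local and remote mode
bin/hive --hiveconf key=value  session level; applies only to the instance started with it
set hive.exec.mode.local.auto=false;  session level; applies only to the connection that sets it

Across these three methods, the scope gets narrower and the priority gets higher.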
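
The precedence order can be pictured as layered lookups, where the narrowest scope is consulted first. The Python sketch below uses `collections.ChainMap` for this; it only models the lookup order, and the property values in the three dictionaries are hypothetical.

```python
from collections import ChainMap

# Hypothetical settings at each level
hive_site = {                        # conf/hive-site.xml (global)
    "hive.exec.mode.local.auto": "false",
    "mapreduce.job.reduces": "-1",
}
cli_hiveconf = {                     # bin/hive --hiveconf key=value
    "mapreduce.job.reduces": "32",
}
session_set = {                      # set key=value; inside the session
    "hive.exec.mode.local.auto": "true",
}

# ChainMap searches its maps left to right, so the first map has the highest priority
effective = ChainMap(session_set, cli_hiveconf, hive_site)

print(effective["hive.exec.mode.local.auto"])  # true (from the session)
print(effective["mapreduce.job.reduces"])      # 32 (from the command line)
```

A value set with `set` inside the session wins over `--hiveconf`, which in turn wins over hive-site.xml; a key that is only present in hive-site.xml is still found as a fallback.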

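5. Hive Functions

5.1 Built-in operators

Hive has four kinds of operators:

Relational operators
Arithmetic operators
Logical operators
Operators on complex types

For details, see the official Hive documentation or "hive常用运算符和函数.doc" (a reference document on common Hive operators and functions).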

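5.2 Built-in functions

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

A quick way to test the built-in functions:

Create a dual table:
create table dual(id string);

Load a file (containing a single line whose content is one space) into the dual table:

load data local inpath '/root/hivedata/dual.txt' into table dual;

Query:

select substr('angelababy',2,3) from dual;
Positions start at 1; a start position of 0 is treated as 1, and a negative start position counts back from the end of the string.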
+------+--+
| _c0  |
+------+--+
| nge  |
+------+--+
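
The indexing rules of substr can be mirrored in a few lines of Python. This is a sketch of the behaviour as I understand it, not a port of Hive's implementation: `hive_substr` is a hypothetical helper, and returning an empty string for out-of-range positions or non-positive lengths is an assumption.

```python
from typing import Optional


def hive_substr(s: str, pos: int, length: Optional[int] = None) -> str:
    """Approximate substr(s, pos[, length]) with 1-based positions."""
    n = len(s)
    # Assumption: out-of-range start or non-positive length gives an empty string
    if abs(pos) > n or (length is not None and length <= 0):
        return ""
    if pos > 0:
        start = pos - 1   # 1-based to 0-based
    elif pos < 0:
        start = n + pos   # negative positions count from the end
    else:
        start = 0         # 0 behaves like 1
    end = n if length is None else min(n, start + length)
    return s[start:end]


print(hive_substr("angelababy", 2, 3))  # nge
print(hive_substr("angelababy", 0, 3))  # ang
print(hive_substr("angelababy", -4))    # baby
```

The first call reproduces the query result above; the other two illustrate the special handling of 0 and of negative start positions.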

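For details, see the official Hive documentation or "hive常用运算符和函数.doc" (a reference document on common Hive operators and functions).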

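5.3 Hive user-defined functions and Transform

5.3.1 UDF development example

Developing a Hive UDF
- Extend the UDF class
- Overload the evaluate method
- Package the code and add the jar to Hive's classpath
```
add jar /xxx.jar
```
- Register the user-defined function
```
create temporary function <function_name> as '<fully.qualified.ClassName>';
```
A function registered this way is a temporary function, tied to the session of the current connection.

Create a new Java Maven project

Add the hive-exec-1.2.1.jar and hadoop-common-2.7.4.jar dependencies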

<dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.4</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

1. Write a Java class that extends UDF and overloads the evaluate method

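package com.itck.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * @author CoderCK
 * @Title: com.itck.hive.udf
 * @ProjectName example-udf
 * @Description: TODO
 * @create 2018/11/20  21:47
 **/
public class Lower extends UDF {
    // The logic of the Hive user-defined function lives in evaluate
    public String evaluate(String in){
        // NULL column values arrive as null, so guard against them
        if (in == null) {
            return null;
        }
        return in.toLowerCase();
    }
    // Overload: sum of two ints
    public int evaluate(int a,int b){
        return a + b;
    }
}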

2. Build a jar and upload it to the server

3. Add the jar to Hive's classpath

add jar /root/example-udf-1.0-SNAPSHOT.jar;

4. Create a temporary function bound to the compiled Java class

create temporary function tolowercase as 'com.itck.hive.udf.Lower';

5. Test

select tolowercase("ABC");
Result:
+------+--+
| _c0  |
+------+--+
| abc  |
+------+--+
--------------
select tolowercase(1,2);
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+

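5.3.2 Transform (for reference)

Hive's TRANSFORM keyword lets you call your own scripts from SQL.

It suits cases where Hive lacks a feature and you do not want to write a UDF.

Example 1: the SQL below uses weekday_mapper.py to process the data.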

add file weekday_mapper.py;
insert overwrite table u_data_new
select
transform (movieid , rate, timestring,uid)
using 'python weekday_mapper.py'
as (movieid, rating, weekday,userid)
from t_rating;

The content of weekday_mapper.py is as follows:

#!/bin/python
import sys
import datetime

# Read tab-separated rows from stdin and replace the Unix timestamp with the ISO weekday (1 = Monday)
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([movieid, rating, str(weekday), userid]))
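
The per-line logic of the mapper can be pulled into a function and checked without Hive. The sketch below is a variant for testing, not the script itself: `map_line` is a hypothetical name, and it converts the timestamp in UTC so the result does not depend on the machine's time zone, whereas the script above uses local time.

```python
from datetime import datetime, timezone


def map_line(line: str) -> str:
    """Turn 'movieid<TAB>rating<TAB>unixtime<TAB>userid' into the same row with a weekday."""
    movieid, rating, unixtime, userid = line.strip().split("\t")
    # Assumption: interpret the timestamp in UTC to keep the output deterministic
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    return "\t".join([movieid, rating, str(moment.isoweekday()), userid])


# 1970-01-01 was a Thursday (ISO weekday 4); three days later was a Sunday (7)
print(map_line("1\t5\t0\t42\n"))
print(map_line("2\t3\t259200\t7\n"))
```

Hive streams each row of the transform input to the script as one tab-separated line on stdin and reads the script's stdout back as rows, which is why the function takes and returns tab-separated strings.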

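Extension: special delimiters

The data format is:
1||zhangsan
2||lisi
Because row format delimited only accepts a single-character delimiter, a SerDe can be used here to handle the multi-character delimiter: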
create table t_shuang_1(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(.*)\\|\\|(.*)',
'output.format.string'='%1$s %2$s'
);
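
The regular expression in input.regex can be tried out on the sample rows in Python before creating the table. This is a sketch under assumptions: `parse_row` is a hypothetical helper, Python's `re` is used in place of Java's regex engine (the two agree for this simple pattern), and returning NULLs for a non-matching row reflects my understanding of RegexSerDe's behaviour. The doubled backslashes in the HiveQL string literal become single backslashes in the actual pattern.

```python
import re
from typing import Optional, Tuple

# '(.*)\\|\\|(.*)' in the HiveQL literal is the pattern (.*)\|\|(.*)
PATTERN = re.compile(r"(.*)\|\|(.*)")


def parse_row(line: str) -> Tuple[Optional[str], Optional[str]]:
    """Split one text row into (id, name); a non-matching row yields NULLs."""
    match = PATTERN.fullmatch(line)
    if match is None:
        return (None, None)
    return (match.group(1), match.group(2))


print(parse_row("1||zhangsan"))  # ('1', 'zhangsan')
print(parse_row("2||lisi"))      # ('2', 'lisi')
print(parse_row("3,wangwu"))     # (None, None)
```

Each capture group maps to one table column in order, so the number of groups must equal the number of columns. Because the first group is greedy, a row such as `a||b||c` splits at the last `||`.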