Apache Hive DML Statements and Built-in Functions

海星？海欣！

Article link: https://blog.csdn.net/Sun123234/article/details/129251106

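This article covers DML operations in Hive: loading data into Hive tables with LOAD and INSERT, and querying data. It explains the different ways to load data from the local file system and from HDFS, and shows how to insert data. It also lists commonly used Hive built-in functions, including string, date, math, and conditional functions.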

1. Hive SQL DML Syntax: Loading Data

1.1 Loading data with LOAD

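Once a table has been created in Hive, a matching directory is created on HDFS, and the directory is named after the table.
The parent path is controlled by the parameter hive.metastore.warehouse.dir, whose default value is /user/hive/warehouse; tables in a database other than default sit one level deeper, under a directory named <database name>.db.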

Method 1: upload the data file directly into the table's directory through the HDFS web UI.

Method 2: hadoop fs -put
On node1:

ll
vim 1.txt   # create 1.txt and enter the content below
cat 1.txt

1,allen,18
2,james,20
3,kobe,24

In Hive on node2, create the table:

use itheima; --switch to the database
create table t_1(id int,name string,age int) row format delimited fields terminated by ',';

Back on node1, upload the data into the table's directory:

hadoop fs -put 1.txt /user/hive/warehouse/itheima.db/t_1
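To check that the uploaded file is actually mapped to the table, a quick query from the Hive session is enough. This is only a sketch: it assumes the table is named t_1 (as the upload path suggests) and that the session is already using the database that holds it.

```sql
-- If the file landed in the table's directory and the delimiter matches,
-- this should return the three rows from 1.txt
select * from t_1;

-- The "Location" field in the output shows the HDFS directory backing the table,
-- which is where the file has to be placed
describe formatted t_1;
```

If the query returns NULL in every column, the usual cause is a mismatch between the file's delimiter and the one declared in the table definition.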

Method 3: LOAD
Methods 1 and 2 manipulate the underlying HDFS directly and are not recommended.
Hive's own LOAD statement is the recommended way.

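LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
Without LOCAL, the file is taken from HDFS and moved into the table's directory; with LOCAL, it is copied there from the local file system.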

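Where is "local" for LOCAL?
The local file system means the local Linux file system of the machine running the HiveServer2 service, not the local file system of the machine running the Hive client.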

In DataGrip, choose Hive as the SQL dialect and open a session:

show databases; --check that the connection works

use itheima;

--step1: create the tables
--table student_local demonstrates loading data from the local file system
create table student_local(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
--table student_HDFS demonstrates loading data from HDFS
create table student_HDFS(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';


--the beeline client is recommended because it prints the log of the load process
--step2: load the data
--load from local: the file is on the local file system of HS2 (node1); in essence a hadoop fs -put upload
LOAD DATA LOCAL INPATH '/root/hivedata/students.txt' INTO TABLE student_local;
--load from HDFS: the file is in the HDFS root directory; in essence a hadoop fs -mv move
--first upload the file to HDFS:  hadoop fs -put /root/hivedata/students.txt /
LOAD DATA INPATH '/students.txt' INTO TABLE student_HDFS;
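The syntax above also lists an optional OVERWRITE keyword that the demo does not exercise. A minimal sketch, reusing the same file and table as above:

```sql
-- INTO TABLE appends: running the same load twice leaves the rows in the table twice.
-- OVERWRITE first removes the data already in the table, then loads the file.
LOAD DATA LOCAL INPATH '/root/hivedata/students.txt' OVERWRITE INTO TABLE student_local;

-- Sanity checks after loading
select count(*) from student_local;
select * from student_HDFS limit 5;
```

Because a load from HDFS is a move, /students.txt is no longer in the HDFS root directory once the second LOAD in the demo has run.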

1.2 Inserting data with INSERT

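create table t_2(id int,name string);
insert into t_2 values(1,'zhangsan');
--the standard insert syntax works, but each statement runs a MapReduce job and is slow, so it is not recommended

The running job can be watched on the YARN web UI, on port 8088 of the ResourceManager host.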

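The most common use of INSERT is to write the result of a query into another table.

insert+select

insert+select inserts the rows returned by the query into the specified table.
The number of columns returned by the query, and their types, must match those of the target table.
insert into table target_table select columns from source_table;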
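A short sketch of insert+select using the student_local table loaded earlier. The target table student_from_insert and its column names are hypothetical, chosen only for illustration.

```sql
-- Hypothetical target table with two columns: int, string
create table student_from_insert(sno int, sname string);

-- The select list returns (int, string), matching the target table column for column
insert into table student_from_insert
select num, name from student_local;

select * from student_from_insert limit 5;
```

Columns are matched by position, not by name, so the order of the select list is what matters.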

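2. DML Syntax: Querying Data

Most of the keywords are the same as in standard SQL and have been covered before.

First load a data file: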

--create table t_usa_covid19
drop table if exists t_usa_covid19;
CREATE TABLE t_usa_covid19(
    count_date string,
    county string,
    state string,
    fips int,
    cases int,
    deaths int)
row format delimited fields terminated by ",";

--load the data into the directory of table t_usa_covid19
load data local inpath '/root/hivedata/us-covid19-counties.dat' into table t_usa_covid19;

--select a constant; the result does not depend on the table's columns
select 1 from t_usa_covid19;
--query the current database
select current_database(); --the from clause can be omitted

-- ALL and DISTINCT
--return all matching rows
select state from t_usa_covid19;
--equivalent to
select all state from t_usa_covid19;
--writing all or leaving it out gives the same result

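Note: aggregate functions cannot be used in a where condition.
Doing so fails with SemanticException: Not yet supported place for UDAF 'sum'
An aggregate function can only be evaluated once the result set has been determined.
The where clause is still part of determining the result set, so it cannot use aggregate functions.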

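is null: the value is null
between 1 and 100: greater than or equal to 1 and less than or equal to 100
Aggregate functions: many input rows are aggregated into one output row
where vs. having
where filters before grouping; having filters after grouping
Aggregate functions cannot be used after where, but can be used after having
When a table is unfamiliar, avoid a bare select * from it, because the data may be very large; append a limit to return only a few rows
limit 2,3: skip the first 2 rows and return 3 rows starting from the third row; the offset of the first row is 0
Logical execution order: from -> where -> group by -> having -> select -> order by -> limit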
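The notes above can be put together in one query against t_usa_covid19. This is a sketch under assumptions: it presumes count_date holds 'yyyy-MM-dd' strings and that the date 2021-01-28 is present in the data; the threshold 10000 is arbitrary.

```sql
select state, sum(deaths) as total_deaths
from t_usa_covid19
where count_date = '2021-01-28'   -- row filter, applied before grouping
group by state
having sum(deaths) > 10000        -- group filter, applied after aggregation
order by total_deaths desc        -- the alias from the select list is usable here
limit 3;                          -- keep only the first 3 rows of the sorted result
```

Moving the sum(deaths) condition into the where clause reproduces the SemanticException quoted earlier.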

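Following the three normal forms of database design, and everyday practice, we usually do not design one large table that holds every kind of data; different kinds of data are stored in different tables.
Queries therefore often need to combine several tables, which is done with join.

The most commonly used joins are inner join (equivalent to plain join) and left join.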
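The difference between the two joins is easiest to see side by side. The tables below, employee(id, name, dept_id) and department(dept_id, dept_name), are hypothetical and do not appear elsewhere in this article.

```sql
-- inner join: only employees whose dept_id has a match in department
select e.id, e.name, d.dept_name
from employee e
join department d on e.dept_id = d.dept_id;

-- left join: every employee is returned;
-- dept_name is NULL for employees without a matching department
select e.id, e.name, d.dept_name
from employee e
left join department d on e.dept_id = d.dept_id;
```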

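3. Common Hive Functions

show functions lists all available functions
describe function extended funcname shows how a function is used
Hive functions fall into two broad groups: built-in functions and user-defined functions (UDF)
Built-in functions can be divided into numeric, date, string, collection, and conditional functions, among others;
User-defined functions are divided into three kinds by the number of input and output rows: UDF, UDAF, and UDTF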

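The classification criterion for user-defined functions is the number of rows a function takes in and puts out:

UDF (User-Defined Function): ordinary function, one row in, one row out
UDAF (User-Defined Aggregation Function): aggregate function, many rows in, one row out
UDTF (User-Defined Table-Generating Function): table-generating function, one row in, many rows out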
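Hive's built-in functions follow the same three shapes, so they make convenient illustrations. A small sketch, reusing the t_usa_covid19 table from section 2:

```sql
-- Look up what is available and how a function is used
show functions;
describe function extended concat_ws;

-- One row in, one row out (UDF shape)
select length('hive');

-- Many rows in, one row out (UDAF shape)
select count(*) from t_usa_covid19;

-- One row in, many rows out (UDTF shape): one output row per array element
select explode(array(1,2,3));
```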

3.1 Common Hive built-in functions

3.1.1 String functions

String length function: length
String reverse function: reverse
String concatenation function: concat
Concatenation with a separator: concat_ws
concat_ws(separator, items to join)
concat_ws('.','www',array('itca','cn')) result: www.itca.cn
Substring functions: substr, substring
select substr('angelababy',-2); --output: by
select substr('angelababy',2,2); --output: ng
String split function: split
select split('apache hive',' '); --output: ["apache","hive"]
select split('apache hive',' ')[0]; --output: apache
select split('apache hive',' ')[1]; --output: hive
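The first three functions in the list have no example above; a minimal sketch with literal arguments, reusing the same sample string:

```sql
select length('angelababy');          -- 10
select reverse('angelababy');         -- ybabalegna
select concat('angela','baby');       -- angelababy

-- substr positions start at 1; a negative start counts from the end of the string
select substr('angelababy',1,6);      -- angela
```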

3.1.2 Date functions

--get the current date: current_date
select current_date();
--get the current UNIX timestamp: unix_timestamp
select unix_timestamp();
--convert a date to a UNIX timestamp: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
--convert a date in a given format to a UNIX timestamp: unix_timestamp
select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
--convert a UNIX timestamp to a date: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');

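--date difference function: datediff; dates must be in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2012-12-08','2012-05-09'); --returns the number of days between the two dates
--date addition function: date_add
select date_add('2012-02-28',10);
--date subtraction function: date_sub
select date_sub('2012-01-01',10);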
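For reference, the three date-arithmetic calls worked through by hand with well-formed 'yyyy-MM-dd' literals; the results follow from calendar arithmetic (2012 is a leap year):

```sql
select datediff('2012-12-08','2012-05-09');  -- 213
select date_add('2012-02-28',10);            -- 2012-03-09 (passes through Feb 29)
select date_sub('2012-01-01',10);            -- 2011-12-22
```

datediff subtracts the second argument from the first, so swapping the arguments gives -213.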