Hive Usage Summary: HiveQL

BJX-XVII

Category: Big Data. Tags: hive, HiveQL

Article link: https://blog.csdn.net/qq_34696236/article/details/81385770

I. Basic operations

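hive                          # start the Hive shell to run HiveQL (typed at the OS prompt)
show databases;               -- list all databases
show databases like '*x*';    -- list databases whose names contain x; uses * rather than SQL's % and _ wildcards
create database dbname;       -- create a database
use dbname;                   -- switch to a database
show tables;                  -- list all tables in the current database
drop database dbname;         -- drop a database
exit;                         -- quit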

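II. Table operations

1: Creating an internal (managed) table

When data is loaded into a managed table, Hive moves it into its warehouse directory (load data inpath moves files within HDFS, while load data local inpath copies them from the local filesystem). Dropping the table deletes its metadata together with its data. Because load is a move and drop is a delete, the data is gone for good once a managed table is dropped.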

create table if not exists example1(
        id int,
        name string,
        complex struct<a:int,b:string,c:array<int>>, -- struct type: fields may have different types and may be nested
        mapper map<int,string>
      );

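With no row format or stored as clause, the table is stored as delimited text, which is the default format.
The default storage format can be changed with hive.default.fileformat.

So running create table ...; is equivalent to running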

create table ...
row format delimited
    fields terminated by '\001'
    collection items terminated by '\002'
    map keys terminated by '\003'
    lines terminated by '\n'
stored as textfile;
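
Once example1 holds data, its nested columns can be read back with dot, index, and key syntax. The query below is a minimal sketch against the example1 definition above; it assumes the table has already been populated.

```sql
select id,
       complex.a,      -- struct field, addressed by name
       complex.c[0],   -- first element of the array nested in the struct
       mapper[1]       -- value stored under map key 1 (NULL if the key is absent)
from example1;
```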

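2: Creating an external table

Hive does not move the data into its own warehouse, and dropping the table removes only the metadata, not the data. An external table is declared with create external table.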

create external table if not exists example2(
        id int,
        name string,
        complex struct<a:int,b:string,c:array<int>>, -- struct type: fields may have different types and may be nested
        mapper map<int,string>)
         row format delimited fields terminated by ','
         location 'path';
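
The difference between the two table kinds shows up most clearly at drop time. The following is a rough sketch rather than a recipe: managed_demo, external_demo, and the /tmp/external_demo directory are hypothetical names chosen for illustration.

```sql
-- Hypothetical table names and location, for illustration only.
create table managed_demo(id int);
create external table external_demo(id int)
location '/tmp/external_demo';

-- The "Table Type" field in the output reads MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted managed_demo;
describe formatted external_demo;

drop table managed_demo;    -- removes the metadata and the files in the warehouse directory
drop table external_demo;   -- removes the metadata only; files under /tmp/external_demo remain
```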

3: Creating partitions

Define the partitions when creating the table:

create table example3(
    id int,
    name string)
partitioned by (first string,second string);

The partition must be specified explicitly when loading data:

load data local inpath 'path' into table tbname
partition (first='first partition',second='second partition');

After the table has been created, partitions can be added or dropped with alter table:

alter table example3 add partition(first='fir',second='sec');
alter table example3 drop partition(first='fir');
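
To check which partitions exist and to benefit from them in queries, something like the following can be used. It is a small sketch against the example3 table above; the filter values are just the ones from the alter table example.

```sql
-- One line per partition, in the form first=.../second=...
show partitions example3;

-- Partition columns behave like ordinary columns in queries;
-- filtering on them lets Hive read only the matching partition directories.
select id, name
from example3
where first = 'fir' and second = 'sec';
```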

4: Bucketing

create table example(id int,name string)
clustered by (id) sorted by (id asc) into 4 buckets;
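
A bucketed table is normally filled through an insert so that Hive can hash each row into a bucket file, and individual buckets can then be sampled. The sketch below assumes a hypothetical source table source_tb with the same two columns; the commented-out setting only applies to Hive versions before 2.0.

```sql
-- On Hive versions before 2.0, uncomment so that inserts honor the bucket definition:
-- set hive.enforce.bucketing = true;

-- Populate through INSERT so rows are hashed on id into the 4 bucket files.
-- source_tb is a hypothetical table with matching columns.
insert overwrite table example
select id, name from source_tb;

-- Read a single bucket, i.e. roughly a quarter of the table.
select * from example tablesample(bucket 1 out of 4 on id);
```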

5: Altering a table

alter table source rename to target;         -- rename the table
alter table target add columns(field type);  -- add a column

III. Importing data into Hive

1: Importing from a local file

load data local inpath 'LocalPath' into table tbname;
load data local inpath 'LocalPath' overwrite into table tbname;

2: Importing from HDFS

load data inpath 'HDFSPath' into table tbname;
load data inpath 'HDFSPath' overwrite into table tbname;

3: Importing with an insert statement

insert overwrite table target
select * from source;

4: Multi-table insert

from source
insert overwrite table target1
select field1,count(x)
group by field1
insert overwrite table target2
select field2,count(y)
group by field2
insert overwrite table target3
select field3,count(z)
group by field3;

5: Exporting data to files

insert overwrite local directory 'path' select * from tbname;
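
By default the exported files use Hive's \001 field separator, which is awkward for other tools, and anything already in the target directory is overwritten. On Hive 0.11 and later a row format can be given for the export, as in this sketch ('path' and tbname are the same placeholders as above):

```sql
-- Write comma-separated text instead of the default \001-delimited text.
insert overwrite local directory 'path'
row format delimited fields terminated by ','
select * from tbname;
```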

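IV. Queries in Hive

1: Sorting and aggregation

order by sorts the entire result globally, whereas sort by sorts the rows within each reducer.

distribute by controls how rows are distributed to reducers. When distribute by and sort by use the same column, the pair can be shortened to cluster by.

from tbname
select * 
cluster by fieldname;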
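
The relationship between the three clauses can be sketched as follows. tbname and fieldname are the placeholders used above; dept and salary are hypothetical column names introduced only for this illustration.

```sql
-- These two queries are equivalent.
select * from tbname distribute by fieldname sort by fieldname;
select * from tbname cluster by fieldname;

-- With different columns the long form is required: rows with the same dept
-- go to the same reducer and are sorted by salary within that reducer.
select * from tbname distribute by dept sort by salary desc;
```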

2: Joins

-- inner join
select table1.*,table2.*
from table1,table2
where table1.field=table2.field;

-- outer join (left shown; right outer join and full outer join use the same form)
select table1.*,table2.*
from table1 left outer join table2 on(table1.field=table2.field);

-- semi join (table2 may be referenced only in the on clause)
select * 
from table1 left semi join table2 on(table1.field=table2.field);
-- equivalent to
select * 
from table1 where table1.field in (select field from table2);

3: Views

create view vname
as select * 
from example1
where id>10 and name !='0';
-- the select statement in a view runs only when a statement that references the view is executed

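V. User-defined functions: UDF, UDAF, UDTF

0: Differences between UDF, UDAF, and UDTF

UDF: takes a single input row and produces a single output row.

UDAF: takes multiple input rows and produces a single output row.

UDTF: takes a single input row and produces multiple output rows.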
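
Hive's built-in functions fall into the same three shapes, which gives a quick way to see the difference. The queries below are a sketch that assumes the example1 table from section II has been populated.

```sql
select upper(name) from example1;          -- UDF shape: one value out per input row
select count(*) from example1;             -- UDAF shape: many input rows, one output row
select explode(complex.c) from example1;   -- UDTF shape: one output row per array element of each input row
```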

1: Built-in functions

show functions;           -- list Hive's functions
describe function fname;  -- show how a specific function is used

Complete list of Hive built-in functions: https://www.cnblogs.com/MOBIN/p/5618747.html

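2: Writing UDFs, UDAFs, and UDTFs

A UDF is written in Java: it must extend the UDF class and implement an evaluate() method (evaluate is not defined by an interface). UDFs also support Java's primitive types.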

UDF:https://blog.csdn.net/qq_34696236/article/details/81411264

UDAF:

UDTF:

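3: Using a UDF

-- register the function in the metastore under a name
create function myudf as 'com.hadoop.hive.MyUdf'
using jar 'path';
-- or create a temporary function, which is not persisted to the metastore
add jar path;
create temporary function myudf as 'com.hadoop.hive.MyUdf';
-- call it
select myudf() from tbname; -- function names are case-insensitive
-- drop the UDF
drop function myudf;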