Hive SQL

最新推荐文章于 2024-02-28 09:00:00 发布

大数据从零开始小熊嘛

最新推荐文章于 2024-02-28 09:00:00 发布

阅读量1.5k

点赞数

文章标签： hive hql

本文链接：https://blog.csdn.net/weixin_45922185/article/details/121263461

版权

DDL 表结构

数据库操作

创建数据库

create database if not exists myhive;
create database myhive2 location '/myhive2';//指定hdfs存储位置

查看数据库详细信息

Zdesc  database  myhive2;
hive (myhive)> desc database extended  myhive2;

删除数据库

drop  database  myhive2;

表创建操作

创建表

create table stu(id int,name string);
insert into stu values (1,"zhangsan");
nsert into stu values (1,"zhangsan"),(2,"lisi");

` 创建表并指定字段之间的分隔符

create  table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

//row format delimited fields terminated by '\t' 指定字段分隔符，默认分隔符为 '\001'
//stored as 指定存储格式
//location 指定存储位置

根据查询结果创建表

create table stu3 as select * from stu2;

根据已经存在的表结构创建表

create table stu4 like stu2;

删除表操作

drop table score5;

清空表操作

truncate table score6;

查询表的结构

只查询表内字段及属性

desc stu2;

详细查询

desc formatted  stu2;

查询创建表的语句

show create table stu2;

修改内容

添加列

alter table score5 add columns (mycol string, mysco string);

从本地文件系统向表中加载数据

//追加操作
load data local inpath '/export/servers/hivedatas/student.csv' into table student;
// 覆盖操作
load data local inpath '/export/servers/hivedatas/student.csv' overwrite  into table student;

从hdfs文件系统向表中加载数据

load data inpath '/hivedatas/techer.csv' into table techer;
//加载数据到指定分区
load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20201210);

 - 从hdfs加载，移动到hive目录下，而不是拷贝过去
 - 如果表是分区表，load 时不指定分区会报错
 - 如果加载相同文件名的文件，会被自动重命名

分区表操作

创建分区表

create table score(s_id string, s_score int) partitioned by (month string);

创建一个表带多个分区

create table score2 (s_id string, s_score int) partitioned by (year string,month string,day string);

查看分区

show  partitions  score;

添加一个分区

alter table score add partition(month='201805');

删除分区

 alter table score drop partition(month = '201806');

常用类型

类型	描述
BOOLEAN	true/false
INT	4个字节的带符号整数
DEICIMAL	任意精度的带符号小数
STRING	字符串
VARCHAR	变长字符串
TIMESTAMP	时间戳，毫秒值精度 122327493795

数据查询DQL

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[CLUSTER BY col_list 
  | [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list] 
] 
[LIMIT number]

注意：
1、order by 会对输入做全局排序，因此只有一个reducer，会导致当输入规模较大时，需要较长的计算时间。
2、sort by不是全局排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序。
3、distribute by(字段)根据指定的字段将数据分到不同的reducer，且分发算法是hash散列。
4、Cluster by(字段) 除了具有Distribute by的功能外，还会对该字段进行排序。
因此，如果分桶和sort字段是同一个时，此时，cluster by = distribute by + sort b

JOIN 连接

INNER JOIN ,LEFT JOIN, RIGHT JOIN, FULL JOIN

INNER JOIN 内连接：只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来
select * from techer t [inner] join course c on t.t_id = c.t_id; -- inner 可省略
 
LEFT OUTER JOIN 左外连接：左边所有数据会被返回，右边符合条件的被返回
select * from techer t left join course c on t.t_id = c.t_id; -- outer可省略
 
RIGHT OUTER JOIN 右外连接：右边所有数据会被返回，左边符合条件的被返回、
select * from techer t right join course c on t.t_id = c.t_id;
 
FULL OUTER JOIN 满外(全外)连接: 将会返回所有表中符合条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用NULL值替代。
SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id ;

注：
	1. hive2版本已经支持不等值连接，就是 join on条件后面可以使用大于小于符号了;并且也支持 join on 条件后跟or (早前版本 on 后只支持 = 和 and，不支持 > < 和 or)
	2.如hive执行引擎使用MapReduce，一个join就会启动一个job，一条sql语句中如有多个join，则会启动多个job

sort by 局部排序

//每个MapReduce内部进行排序，对全局结果集来说不是排序。
 
//设置reduce个数
set mapreduce.job.reduces=3;
 
//查看设置reduce个数
set mapreduce.job.reduces;
 
//查询成绩按照成绩降序排列
select * from score sort by s_score;
 
//将查询结果导入到文件中（按照成绩降序排列）
insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;

distribute by 分区排序


distribute by：类似MR中partition，进行分区，结合sort by使用
 
设置reduce的个数，将我们对应的s_id划分到对应的reduce当中去
set mapreduce.job.reduces=7;
 
通过distribute by  进行数据的分区
select * from score distribute by s_id sort by s_score;

Hive函数

聚合函数

count,max,min,sum,avg

hive支持 count(),max(),min(),sum(),avg() 等常用的聚合函数

 聚合操作时要注意null值
count(*) 包含null值，统计所有行数
count(id) 不包含null值
min 求最小值是不包含null，除非所有值都是null
avg 求平均值也是不包含null

关系运算

=、!= 或 <>、<、<=、>、>=、is null、is not null、like

数学运算

加(+)、减(-)、乘(*)、除(/)、取余(%)、位与(&)、位或(|)、位异或(^)、位取反(~)

逻辑运算

与(and)、或(or)、非(not)

数值运算

取整、随机、算术

取整函数: round （遵循四舍五入）
指定精度取整函数: round

hive> select round(3.1415926,4) from tableName;
3.1416

向下取整函数: floor

hive> select floor(3.641) from tableName;
3

向上取整函数: ceil

hive> select ceil(3.1415926) from tableName;
4

取随机数函数: rand

语法: rand(),rand(int seed)
返回值: double
说明: 返回一个0到1范围内的随机数。如果指定种子seed，则会等到一个稳定的随机数序列
hive> select rand() from tableName; -- 每次执行此语句得到的结果都不同
0.5577432776034763
 
hive> select rand(100) ;  -- 只要指定种子，每次执行此语句得到的结果一样的
0.7220096548596434

幂运算函数: pow

hive> select pow(2,3) ;
8.0

开平方函数: sqrt

hive> select sqrt(16) ;
4.0

条件函数

if, case when

条件判断函数：case when (两种写法，其一)

语法: case when a then b [when c then d]* [else e] end

条件判断函数：case when (两种写法，其二)

语法: case a when b then c [when d then e]* [else f] end

日期函数

获取当前UNIX时间戳函数: unix_timestamp

UNIX时间戳转日期函数: from_unixtime

日期转UNIX时间戳函数: unix_timestamp

指定格式日期转UNIX时间戳函数: unix_timestamp

日期时间转日期函数: to_date

日期转年函数: year，month，day，hour， minute，second，weekofyear

日期比较函数: datediff

期增加函数: date_add

日期减少函数: date_sub

字符串函数

字符串长度函数：length

字符串反转函数：reverse

字符串连接函数：concat

带分隔符字符串连接函数：concat_ws

语法: concat_ws(string SEP, string A, string B…)
返回值: string
说明：返回输入字符串连接后的结果，SEP表示各个字符串间的分隔符
hive> select concat_ws(',','abc','def','gh')from tableName;
abc,def,gh

字符串截取函数：substr,substring

语法: substr(string A, int start),substring(string A, int start)
hive> select substr('abcde',3) from tableName;
cde
语法: substr(string A, int start, int len),substring(string A, int start, int len)
hive> select substr('abcde',3,2) from tableName;
cd

字符串转大写函数：upper,ucase

字符串转小写函数：lower,lcase

去空格函数：trim

左边去空格函数：ltrim

右边去空格函数：rtrim

正则表达式替换函数：regexp_replace

语法: regexp_replace(string A, string B, string C)
hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb

正则表达式解析函数：regexp_extract

语法: regexp_extract(string subject, string pattern, int index)
返回值: string
说明：将字符串subject按照pattern正则表达式的规则拆分，返回index指定的字符。
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
strong>注意，在有些情况下要使用转义字符，下面的等号要用双竖线转义，这是java正则表达式的规则。
select data_field,
regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc 
from pt_nginx_loginlog_st 
where pt = '2021-03-28' limit 2;

重复字符串函数：repeat

空格字符串函数：space(n)

分割字符串函数: split

复合类型构建操作

Map类型构建: map

map (key1, value1, key2, value2, …)

map类型访问: M[key]

语法: M[key]
hive> Create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'],t['100'] from map_table2;
mary    tom

array类型构建: array

语法: array(val1, val2, …)
说明：根据输入的参数构建数组array类型

array类型访问: A[n]

语法: A[n]
hive> create table arr_table2 as select array("tom","mary","tim") as t
 from tableName;
hive> select t[0],t[1] from arr_table2;
tom     mary    tim

*Hive 数据处理

lateral view 结合 UDTF

(UDF，聚集函数UDAF ，表生成函数 UDTF)

UDF

 1. UDF函数可以直接应用于select语句，对查询结构做格式化处理后，再输出内容。
 2. 编写UDF函数的时候需要注意一下几点：
　　a）自定义UDF需要继承org.apache.hadoop.hive.ql.UDF。 
　　b）需要实现evaluate函数，evaluate函数支持重载。
　　例：写一个返回字符串长度的Demo

public class GetLength extends UDF{
    public int evaluate(String str) {
        try{
            return str.length();
        }catch(Exception e){
            return -1;
        }
    }
}//需先创建临时函数：

UDAF
多行进一行出，如sum()、min()，用在group  by时

 1. 必须继承org.apache.hadoop.hive.ql.exec.UDAF(函数类继承)
 org.apache.hadoop.hive.ql.exec.UDAFEvaluator(内部类Evaluator实现UDAFEvaluator接口)
 2. Evaluator需要实现 init、iterate、terminatePartial、merge、terminate这几个函数
 	 init():类似于构造函数，用于UDAF的初始化
	iterate():接收传入的参数，并进行内部的轮转，返回boolean
　terminatePartial():无参数，其为iterate函数轮转结束后，返回轮转数据，类似于hadoop的Combiner
　　merge():接收terminatePartial的返回结果，进行数据merge操作，其返回类型为boolean
　　terminate():返回最终的聚集函数结果
UDTF
	继承org.apache.hadoop.hive.ql.udf.generic.GenericUDTF,实现initialize, process, close三个方法。
	UDTF首先会调用initialize方法，此方法返回UDTF的返回行的信息（返回个数，类型）。
	初始化完成后，会调用process方法,真正的处理过程在process函数中，在process中，每一次forward()调用产生一行；如果产生多列可以将多个列的值放在一个数组中，然后将该数组传入到forward()函数。
	最后close()方法调用，对需要清理的方法进行清理。
	UDTF有两种使用方法，一种直接放到select后面，一种和lateral view一起使用。
 
 1. 直接select中使用

select explode_map(properties) as (col1,col2) from src;
//不可以添加其他字段使用
select a, explode_map(properties) as (col1,col2) from src
//不可以嵌套调用
select explode_map(explode_map(properties)) from src
//不可以和group by/cluster by/distribute by/sort by一起使用

2：和lateral view一起使用

select src.id, mytable.col1, mytable.col2 from src lateral view explode_map(properties) mytable as col1, col2;

UDTF会把一行拆分成一行或者多行，lateral view在把结果组合，产生一个支持别名表的虚拟表。其中explode可于将hive的array或者map结构拆分成多行

大数据从零开始小熊嘛

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
Hive SQL

DDL 表结构表的创建创建表create database if not exists myhive;create database myhive2 location '/myhive2';//指定hdfs存储位置查看数据库详细信息desc database myhive2;hive (myhive)> desc database extended myhive2;删除数据库drop database myhive2;创建表create table
复制链接

扫一扫