hive命令使用小抄

最新推荐文章于 2022-11-12 11:54:21 发布

疯狂的小Lan

最新推荐文章于 2022-11-12 11:54:21 发布

阅读量269

点赞数

分类专栏：常用命令文章标签： hive

本文链接：https://blog.csdn.net/qq_38298902/article/details/108414413

版权

常用命令专栏收录该内容

3 篇文章 0 订阅

订阅专栏

hive命令使用小抄

前言
一、常用命令
二、常见问题
持续更新

前言

经常用到hive的很多命令，以及常见的问题，记录下来：）

一、常用命令

1、建表
create table
可用参数：

   row format delimited
   fields terminated '\t'
   lines  terminated '\n'
   stored as file_format（textfile，orcfile ....）

create table if not exists tmp.watch_movie 
(
	username string,
	moviename string, 
	price int
)
partitioned by (dayno string)
row format delimited  fields terminated '\t' lines  terminated '\n' stored as textfile location "hdfs://tmp/watchmovie.txt";

hive支持的存储文件格式有：TextFile（不压缩）、SequenceFile、RCFile、Avro Files、ORC Files、Parquet

notice：

定时调度的任务最好建表的时候加上时间分区，更加清晰
建表结束后一定要加上分号 “;”
如果是需要在hdfs上分析文本数据时，可将文件存成textfile格式。需要生成压缩文件，占用小磁盘空间时，可以存储为orcfile格式
field间隔最好选用默认“\001”，选其他的符号容易和字符串的某些符号冲突

2、解析字段中的json
查单层值
get_json_object(detail,‘$.type‘)

查多层值
get_json_object(detail,‘$.type.comedy‘) 不一定能够取出成功，所以可以采用下面方法
get_json_object(get_json_object(detail,‘$.type‘),’$.comedy’) 两层调用，效果稳定

查多个字段
json_tuple(detail,‘type’,‘time’)

json中存在list
get_json_object(detail,‘$.type[0]‘)
即可取出list中的第一个值，其他位置的值同理

#查多层值
select get_json_object(detail,'$.type') as m
from tmp.watch_movie ;
#查多层值
select get_json_object(detail,'$.type.comedy') as p
from tmp.watch_movie ;
#查多个字段
select json_tuple(detail,'type','time') as p
from tmp.watch_movie ;
#json中存在list
select get_json_object(detail,'$.type[0]') as m
from tmp.watch_movie ;

3、字段拆解成行
将某个字段的内容进行拆解，并变换成多行内容
explode() 对array进行处理
lateral view 展示其他列的内容

notice：explode的别名是一定需要的（下面例子的别名“a”）

select tmp_movie_name, price
from tmp.watch_movie
lateral view explode(split(movie_name,','))a as tmp_movie_name

4、两个表join
join是两个表匹配的集合，交集（常用）
left(outer) join 以前面表为主表，个数和前表相同（常用）
right(outer) join 以后面表为主表，个数与后表相同
full(outer) join 两个表的并集

notice：

两个表大小差距很大时，join的操作会引起数据倾斜问题，导致运行速度非常慢，因此常把小表放在前面
join选用的字段需要明确是来自哪个表。例如：a.name，则是选取a表中的name字段

select a.moviename as moviename ,a.price as price 
from tmp.watch_movie1 a 
join tmp.watch_movie2 b 
on a.moviename=b.moviename

5、字段去重
hive行去重的几种方式，distinct, group by
distinct()
group by

group by的运算速度是比distinct快的

select distinct(moviename) as moviename
from tmp.watch_movie

select moviename, price
from tmp.watch_movie
group by moviename,price

6、正则操作
用于模糊搜索，和shell的正则表达比较类似
^ 表示开头，$ 表示结尾，. 表示任意字符，* 表示匹配模式里面的任意多个

正则提取
regexp_extract(string initial_string, string pattern, int index)
regexp_extract(name, ‘(\")(.*?)(\")’,0)
第一个参数是提取的字段的，第二个参数是匹配模式，最后个参数是展示范围
0 显示与之匹配的整个字符串
1 表示显示第一个括号中的内容
2 表示显示第二个括号里面的内容

匹配中文使用 [\u4e00-\u9fa5]

正则替换
regexp_replace(string initial_string, string pattern, string replacerment)

select regexp_extract(name, '(\\\")(.*?)(\\\")',0) 
from tmp.tmp.watch_movie

7、聚合运算
count()
count(*) 统计行数，是对所有行进行统计，包括null行
count(1) 统计行数，同样是对所有行计数
count(col) 对col列的内容进行聚合计数
count(distinct col) 对col去重计数

sum() 求和函数
max() 求最大函数
min() 求最小函数
collect_set() 聚合数据同时去除重复元素，生成的是array
collect_list() 聚合数据不去除重复元素，可以用sort_array 对其中的array进行升序降序

通常和group by联合使用

#统计同名人数
select count(*) as cnt
from tmp.watch_movie 
group by name

select username, collect_set(moviename) 
from tmp.watch_movie
group by username;

select username, collect_list(moviename) 
from tmp.watch_movie 
group by username;

8、匹配与模糊匹配
完全匹配常用的是like，模糊匹配常用rlike
like 为非正则匹配，需要与字符串完全一致
%为任意字符，_为限定字符
rlike 为正则匹配，语法与正则操作的一致

notice：当使用“|”时，需要注意最后的一个不能为空字符串“”，否则就是匹配任意字符了

select movie_name, price 
from tmp.watch_movie
where movie_name rlike '.*(angry|love).*';

9、判断
判断新的
if 判断if（条件，正，反）

case…when…then…else…end

select if(price>10, 0, 1) as price
from tmp.watch_movie;

select case when price>0 then 1
			whene prince=0 then 0
			else -1 end as price
from tmp.watch_movie;

10、上传文件
将hdfs上或者本地的文件上传到数据库中
load data [local] inpath ‘filepath’ [overwrite]
into table tablename [partition (partcol1=val1, partcol2=val2)…]

notice：load的操作会把源文件直接mv到数据库对应路径中，所以通常会把表建好，然后往里面填数据

load data watch_movie.txt inpath 
into table tmp.watch_movie

10、保留位数
round() 求四舍五入
decimal(10,2) decimal(precision, scale) ，前面是整数部分，后面是小数部分
整数部分如果没有定义，则默认长度为10，小数部分如果没有指定，则默认为0

select cast(price as decimal(10,2)) as price
from tmp.watch_movie

11、增加序号
row_number() 用于增加序号，可以限定行数范围和数值范围
row_number() over(partition by…order by…) as rank
over是开窗函数，经常和聚合函数联合使用

select moviename,price
from(
	select row_number over(partiton by moviename order by price) as rank, moviename, price
	from tmp.watch_movie
	)
where rank<10;

12、时间操作
unix_timestamp(date) 如果后面只有date参数，date的形式必须为“yyyy-MM-dd HH:mm:ss”。也可以带自己想要的形式 unix_timestamp(date,‘yyyyMMdd’

from_unixtime(unixtime,string format) 时间戳转换成日期

to_string(date)，year(date)，month(date)，day(date) date的默认形式必须为“yyyy-MM-dd HH:mm:ss”

datediff(date1, date2) 返回时间差，date的形式必须为“yyyy-MM-dd”

weekofyear(date) 给定的时间是本年的第几周

date_add(date,days) 返回date增加days后的日期

date_sub(date, days) 返回date减少days后的日期

select unix_timestamp(2020-08-10) 
from tmp.watch_movie;

select datediff(2020-08-11’, ‘2020-08-10’)
from tmp.watch_movie;

13、字符串操作

length(string) 计算长度，通常在过滤文本异常值时使用

reverse(string) 字符串反转

concat(string1,string2) 用于拼接字符串（常用）

concat_ws(’,’,col1,col2) 用于聚合拼接，通常和group by一起使用

substr(‘ABCD’,-1,2) 字符串截取，第一个参数是string，第二个参数起始位置，第三个参数字符串长度

upper(string) 字符串转大写

upcase(string) 字符串转大写

lower(string) 字符串转小写

lcase(string) 字符串转小写

trim(string) 去除字符串空格

ltrim(string) 去除字符串左边空格

rtrim(string) 去除字符串右边空格

space(n) 空格字符串，n是个数

repeat(‘abs’,n) 重复字符串，n是个数

lpad(‘abd’,10,‘td’) 左补足函数

rpad(‘abd’,10,‘td’) 右补足函数

find_in_set(‘ab’,‘ab,ef,gd’) 集合查找函数，返回第一次出现的位置，这里返回1

instr() 查找字符串中的子字符串substr的位置，查找失败则返回0

二、常见问题

子查询别名问题
如果子查询缺少别名，程序会报错，别名很重要，以下例子“a”就是别名

#a就是子查询的别名
select moviename,max(price) as max_price 
from (select moviename,price 
	  from tmp.watch_movie
	  where dayno='20200907') a
group by moviename

缺失值问题
int 和string类型数据存储时，null默认存储为 \N
string类型的数据如果为字符串“”，存储则是“”
往int类型字段插入空字符串“”，结果还是\N
判断int为空时，可以使用is null 来判断
判断string为空时，is_null查出来的是\N。而为判断空字符串时，需要用 col=‘’，注意区别
split问题
针对string进行处理，处理之后string的类型会变成array，特殊字符需要转义
比如字符串里面有“”引号时，要对引号进行分隔，需要对引号进行转义
split(‘这朵花"真的"很美丽,’\"’) 这里需要三个转义字符一个是转义引号"，一个是转义反斜杠\，如果是在shell中执行，需要加四个\，在shell中也需要转义。

select split('abcdef','c') 
from tmp.watch_movie;
#output:["ab","def"]
select split('abcdef','c')[0] 
from tmp.watch_movie;
#output:["ab"]

分号问题
在建表的命令结束后需要有分号结束
在增删查改命令后一定要有一个分号

持续更新

疯狂的小Lan

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
hive命令使用小抄

hive使用小抄记录一下hive的常用命令及注意点
复制链接

扫一扫

专栏目录

hive命令使用小抄

hive命令使用小抄

前言

一、常用命令

二、常见问题

持续更新

“相关推荐”对你有帮助么？