Some HiveSQL Notes

Latest recommended article published on 2024-02-26 16:51:49

月升11

Category: hivesql. Tags: hive, big data, data warehouse

Article link: https://blog.csdn.net/m0_68290271/article/details/127504560

Hive, fuzzy search for table names: show tables like '*name*';
View a table's structure: desc table_name;
View a table's partitions: show partitions table_name;

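Hive DQL query syntax

order by: global sort. It uses a single reducer, so it is slow on large data sets.
sort by: non-global sort. Rows are sorted before they are fed to each reducer. When mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is ordered, not that the overall result is ordered.
distribute by (column): sends rows to different reducers according to the given column, using a hash of its value.
cluster by (column): does what distribute by does and also sorts by the same column, so the distribution column and the sort column are identical.
cluster by = distribute by + sort by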
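
The four clauses above can be compared side by side. This is only a sketch: it assumes the score table has the columns s_id, c_id and s_score (only s_score appears in these notes), and the reducer count of 3 is an arbitrary choice for illustration.

```sql
-- Assumed schema (hypothetical): score(s_id string, c_id string, s_score int)

-- Global order: a single reducer sorts the whole result.
select * from score order by s_score desc;

-- Use several reducers so the difference becomes visible (3 is arbitrary).
set mapred.reduce.tasks=3;

-- Each reducer's output is sorted, but the combined result is not.
select * from score sort by s_score desc;

-- Rows with the same s_id land on the same reducer,
-- and each reducer then sorts its rows by s_score.
select * from score distribute by s_id sort by s_score desc;

-- Shorthand for "distribute by s_id sort by s_id"; ascending order only.
select * from score cluster by s_id;
```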

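where clause

select * from score where s_score < 60

Rows whose s_score is null are dropped, because comparing null with a value does not evaluate to true.
group by groups rows; having follows it to filter the groups.
Difference between where and having:
having filters after group by has formed the groups, so it may only reference grouping columns or aggregate functions.
where filters rows directly on table columns before grouping, so it cannot come after group by and cannot use aggregate functions.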
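
To make the where/having split concrete, here is a minimal sketch. It assumes the score table also has an s_id column, which these notes do not show, and the threshold of 2 is made up.

```sql
-- Assumed schema (hypothetical): score(s_id string, c_id string, s_score int)
select s_id,
       count(*)     as failed_courses,
       avg(s_score) as avg_failed_score
from score
where s_score < 60        -- row-level filter, applied before grouping; no aggregates here
group by s_id
having count(*) >= 2;     -- group-level filter, applied after grouping; aggregates allowed
```

The where line decides which rows reach the grouping step, while the having line decides which finished groups are returned.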

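join

inner join: keeps only the rows that have a match in both tables

select * from teacher inner join course on teacher.id=course.id

left join (left outer join): returns every row from the left table, plus the matching rows from the right table

select * from teacher t left join course c on t.id=c.id

right join (right outer join): returns every row from the right table, plus the matching rows from the left table

select * from teacher t right join course c on t.id=c.id

full join (full outer join): returns every row from both tables; where one side has no match for the join condition, its columns are filled with null

select * from teacher t full join course c on t.id=c.id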

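Notes:

Since Hive 2.2.0, non-equi joins are supported, so the on clause may use >, < and or; earlier versions only accept equality conditions.
Hive runs on MapReduce with one job per join, so a statement with several joins launches several jobs (joins that share the same join key can be merged into a single job).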
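
A non-equi join might look like the sketch below. The grade_band table and its columns are hypothetical and only serve to give the range condition something to match against.

```sql
-- Hypothetical lookup table: grade_band(low int, high int, grade string)
-- Requires a Hive version that accepts non-equality conditions in the on clause.
select s.*, g.grade
from score s
join grade_band g
  on s.s_score >= g.low and s.s_score < g.high;
```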

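order by: sorting

Global sort with a single reducer; it takes a long time when the data volume is large
asc ascending, desc descending

sort by: local sort

Sorts within each reducer; the overall result is not sorted

distribute by: partitioning

Like the partitioner in MapReduce: it decides which reducer each row goes to, and is used together with sort by
cluster by = distribute by + sort by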

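Hive functions

Count the values in a column: count()
Maximum: max()
Minimum: min()
Sum: sum()
Average: avg()
count(*) counts every row, including rows that contain null
count(id) does not count rows where id is null
min ignores null and returns null only when every value is null
avg ignores null

Population standard deviation: stddev_pop(col)
Percentile: percentile(bigint col, p)
Median: percentile(bigint col, 0.5)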
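
The null handling above is easier to see on a tiny inline data set. The values below are made up for illustration, and the query assumes a Hive version that supports common table expressions and select without from.

```sql
-- Four rows: one has a null id, one has a null v.
with t as (
  select 1 as id, 10 as v
  union all select 2, 20
  union all select cast(null as int), 30
  union all select 4, cast(null as int)
)
select count(*)           as all_rows,      -- 4: every row is counted
       count(id)          as non_null_ids,  -- 3: the null id is skipped
       min(v)             as min_v,         -- 10: null is ignored
       avg(v)             as avg_v,         -- 20.0: average of 10, 20, 30
       stddev_pop(v)      as sd_v,          -- population standard deviation of 10, 20, 30
       percentile(v, 0.5) as median_v       -- median of the non-null values
from t;
```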

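4. Relational operators
a like b: SQL pattern match; true if string a matches the simple pattern b, where _ matches one character and % matches any number of characters (this is not regular-expression syntax)
a rlike b: Java regular-expression match; true if any substring of a matches the Java regular expression b
a regexp b: same as rlike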
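
A short sketch of the difference between the SQL pattern used by like and the Java regular expression used by rlike and regexp; the literals are arbitrary.

```sql
select 'hivesql' like 'hive%'    as like_prefix,    -- true: % matches "sql"
       'hivesql' like 'hive_'    as like_one_char,  -- false: _ matches exactly one character
       'hivesql' rlike '^hive'   as rlike_anchor,   -- true: a substring match is enough
       'hivesql' rlike 'sql$'    as rlike_suffix,   -- true
       'hivesql' regexp '[0-9]+' as regexp_digits;  -- false: no digits in the string
```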

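5. Numeric functions
1. Round: round(double a)
2. Round to a given precision: round(double a, int d)
3. Round down: floor(double a)
4. Round up: ceil(double a)
5. Random number: rand(), rand(int seed)
6. Natural exponential: exp(double a)
7. Base-10 logarithm: log10(double a)
8. Base-2 logarithm: log2(double a)
9. Logarithm with a given base: log(double base, double a)
10. Power: pow(double a, double p)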
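
The numeric functions can be tried directly on literals. The inputs are arbitrary, and result comments are given only where the value does not depend on the Hive version's literal typing.

```sql
select round(3.14159)    as rounded,      -- nearest integer
       round(3.14159, 2) as rounded_2dp,  -- 3.14
       floor(3.7)        as floored,      -- 3
       ceil(3.2)         as ceiled,       -- 4
       rand()            as random_0_1,   -- different on every run
       rand(42)          as seeded,       -- fixed seed gives a repeatable sequence
       exp(1)            as e,            -- Euler's number
       log10(100)        as log_base10,
       log2(8)           as log_base2,
       log(3, 9)         as log_base3,    -- the base comes first, then the value
       pow(2, 10)        as power;        -- 1024.0
```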

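6. Conditional functions
if(condition, value_if_true, value_if_false)
case when ... then ... else ... end
coalesce(c1, c2, c3): returns the first non-null argument
nvl(c1, c2): returns c2 when c1 is null, otherwise c1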
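
A sketch of the four conditional forms on the score table. The grade boundaries are invented for the example.

```sql
select s_score,
       if(s_score < 60, 'fail', 'pass')  as result,
       case when s_score >= 90 then 'A'
            when s_score >= 60 then 'B'
            else 'C'
       end                               as grade,
       nvl(s_score, 0)                   as score_or_zero,      -- replaces null with 0
       coalesce(s_score, 0)              as same_with_coalesce  -- first non-null argument
from score;
```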

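7. Date functions
1. Current Unix timestamp: unix_timestamp()
2. Unix timestamp to date string: from_unixtime()
3. Date string to Unix timestamp: unix_timestamp(string date)
4. Date part of a datetime string: to_date(string timestamp)
5. Year: year
6. Month: month
7. Day: day
8. Hour: hour
9. Minute: minute
10. Second: second
11. Week of the year: weekofyear
12. Difference in days between two dates: datediff
13. Add days: date_add
14. Subtract days: date_sub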
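
The date functions can be checked against a fixed timestamp string. The timestamp below is an arbitrary sample value.

```sql
select to_date('2024-02-26 16:51:49')       as d,        -- 2024-02-26
       year('2024-02-26 16:51:49')          as y,        -- 2024
       month('2024-02-26 16:51:49')         as m,        -- 2
       day('2024-02-26 16:51:49')           as dd,       -- 26
       hour('2024-02-26 16:51:49')          as h,        -- 16
       minute('2024-02-26 16:51:49')        as mi,       -- 51
       second('2024-02-26 16:51:49')        as s,        -- 49
       weekofyear('2024-02-26')             as w,        -- 9
       datediff('2024-02-26', '2024-02-01') as diff,     -- 25
       date_add('2024-02-26', 7)            as plus_7,   -- 2024-03-04
       date_sub('2024-02-26', 7)            as minus_7,  -- 2024-02-19
       -- string -> epoch seconds -> string, formatted as a date only
       from_unixtime(unix_timestamp('2024-02-26 16:51:49'), 'yyyy-MM-dd') as round_trip;
```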

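8. String functions
1. Length: length()
2. Reverse: reverse()
3. Concatenate: concat()
4. Concatenate with a separator: concat_ws()
5. Substring: substr()
6. To upper case: upper()
7. To lower case: lower()
8. Trim spaces on both sides: trim()
9. Trim spaces on the left: ltrim()
10. Trim spaces on the right: rtrim()
11. Regular-expression replace: regexp_replace()
12. Regular-expression extract: regexp_extract()
13. URL parsing: parse_url()
14. JSON parsing: get_json_object()
15. String of n spaces: space()
16. Repeat: repeat()
17. ASCII code of the first character: ascii()
18. Left pad: lpad()
19. Right pad: rpad()
20. Split: split()
21. Find in a comma-separated set: find_in_set()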
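
A few of the string functions applied to literals, as a quick reference. All inputs are made up except the URL, which is this article's own link.

```sql
select length('hive')                                 as len,        -- 4
       reverse('hive')                                as rev,        -- evih
       concat('hive', 'sql')                          as joined,     -- hivesql
       concat_ws('-', 'a', 'b', 'c')                  as joined_ws,  -- a-b-c
       substr('hivesql', 5)                           as tail,       -- sql (positions start at 1)
       upper('hive')                                  as up,         -- HIVE
       trim('  hive  ')                               as trimmed,    -- hive
       regexp_replace('a1b22c', '[0-9]+', '')         as no_digits,  -- abc
       regexp_extract('id=42&x=7', 'id=([0-9]+)', 1)  as id_value,   -- 42 (capture group 1)
       parse_url('https://blog.csdn.net/m0_68290271/article/details/127504560', 'HOST') as host,  -- blog.csdn.net
       get_json_object('{"a":{"b":1}}', '$.a.b')      as json_b,     -- 1
       repeat('ab', 3)                                as repeated,   -- ababab
       ascii('A')                                     as code,       -- 65
       lpad('7', 3, '0')                              as left_pad,   -- 007
       rpad('7', 3, '0')                              as right_pad,  -- 700
       split('a,b,c', ',')                            as parts,      -- array of a, b, c
       find_in_set('b', 'a,b,c')                      as pos;        -- 2 (1-based; 0 if absent)
```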

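9. Window functions
1. Sum within a group: sum() over()
2. Ranking (sample output when the 3rd and 4th rows tie):
row_number(): 1234567
rank(): 1233567
dense_rank(): 1233456
3. Distribute an ordered data set evenly into a given number (num) of buckets: ntile(num)
4. Value from the nth row before the current row in the window: lag(col, n, default)
5. Value from the nth row after the current row: lead(col, n, default)
6. After ordering within the partition, up to the current row: first value first_value(col), last value last_value(col)
7. Rows with a value less than or equal to the current value / total rows in the partition: cume_dist()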
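
The window functions listed above can be combined in one query. This is a sketch only: it assumes the score table has s_id and c_id columns in addition to s_score, and it partitions by course as one plausible choice.

```sql
-- Assumed schema (hypothetical): score(s_id string, c_id string, s_score int)
select s_id, c_id, s_score,
       sum(s_score)  over (partition by c_id)                       as course_total,
       row_number()  over (partition by c_id order by s_score desc) as rn,
       rank()        over (partition by c_id order by s_score desc) as rk,
       dense_rank()  over (partition by c_id order by s_score desc) as drk,
       ntile(4)      over (partition by c_id order by s_score desc) as quartile,
       -- score of the previous / next row in the ordering, 0 when there is none
       lag(s_score, 1, 0)  over (partition by c_id order by s_score desc) as prev_score,
       lead(s_score, 1, 0) over (partition by c_id order by s_score desc) as next_score,
       first_value(s_score) over (partition by c_id order by s_score desc) as top_score,
       -- with the default frame this is the last value up to the current row
       last_value(s_score)  over (partition by c_id order by s_score desc) as last_so_far,
       -- widen the frame explicitly to get the last value of the whole partition
       last_value(s_score)  over (partition by c_id order by s_score desc
                                  rows between unbounded preceding
                                           and unbounded following) as lowest_score,
       cume_dist()   over (partition by c_id order by s_score)      as share_le_current
from score;
```

Because an ordered window defaults to a frame that ends at the current row, last_value without an explicit frame returns the current row's value (or that of a tied row), which is why the second last_value call spells out the frame.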