[Big Data Development] Hive: Basic Syntax, Common Clauses, and JOIN in Depth (day44)

Author: 这个妹妹我见过

Category: Hive. Tags: hive, mysql

Article link: https://blog.csdn.net/weixin_37090394/article/details/108233238

I. Basic Hive Syntax

Hive execution is driven by semicolons: each ; marks the end of one statement.

CREATE TABLE syntax

create [temporary] [external] table [if not exists] table_name(
  [column_name column_type [comment '...'], ...]
)
[comment '...']  -- columns can carry a comment, and so can the table
[partitioned by (column_name column_type [comment '...'], ...)]  -- partition columns
[clustered by (column_name, ...)  -- bucketing columns
  [sorted by (column_name [asc|desc], ...)]  -- sort order inside each bucket
  into num_buckets buckets]
[row format row_format]  -- how each row is delimited / serialized
[stored as file_format]
[location hdfs_path]  -- HDFS location (must be a directory)
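
As a minimal sketch of the template above, the following statement creates a partitioned, bucketed external table. The database demo_db, the table user_log, its columns, and the HDFS path are hypothetical names chosen only for illustration.

```sql
-- Hypothetical example: all names and the path below are assumptions.
create external table if not exists demo_db.user_log (
  id   int    comment 'user id',
  name string comment 'user name'
)
comment 'hypothetical demo table'
partitioned by (dt string comment 'date partition')
clustered by (id) sorted by (id asc) into 4 buckets
row format delimited fields terminated by ','
stored as textfile
location '/tmp/demo/user_log';
```

Because the table is external, dropping it removes only the metadata; the files under the location directory stay in HDFS.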

Managed (internal) tables and external tables
How to check whether a table is managed or external
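
One common way to check, shown here as a sketch against the t6 table used later in this section, is desc formatted. Its output contains a "Table Type" row.

```sql
-- Look for the "Table Type:" row in the output:
-- MANAGED_TABLE means a managed (internal) table, EXTERNAL_TABLE means an external table.
desc formatted sz2002.t6;
```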

Converting between managed and external tables
t6 is a managed table

alter table sz2002.t6 set tblproperties('EXTERNAL'='true');  -- runs without error, but t6 is not actually converted to an external table
alter table sz2002.t6 set tblproperties('EXTERNAL'='TRUE');  -- note: TRUE must be uppercase

alter table sz2002.t6 set tblproperties('EXTERNAL'='false'); -- false is case-insensitive

Summary:
Managed to external: TRUE must be uppercase
External to managed: false is case-insensitive

Viewing Hive parameters
Note: this works only in the Hive CLI on Linux, not in client tools

set;

View the current value of a single parameter

set hive.cli.print.current.db;
hive.cli.print.current.db=false

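A value set this way applies only to the current session.
When hive.cli.print.current.db is set to true, the prompt shows the database currently in use.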
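
A short sketch of the session-level workflow described above: set the parameter, then read it back. The change is lost when the session ends.

```sql
-- Turn on the database name in the CLI prompt for this session only.
set hive.cli.print.current.db=true;

-- Read the value back; the CLI echoes it as key=value.
set hive.cli.print.current.db;
```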

It can also be changed permanently in the configuration file (hive-site.xml):

<property> 
	<name>hive.cli.print.current.db</name> 
	<value>false</value> 
	<description>Whether to include the current database in the Hive prompt. </description> 
</property>

Enable local mode
hive (mydb1)> set hive.exec.mode.local.auto=true;
Hive execution modes
Running Hive statements from the client command line
Usage:

usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

hive -e set
Run the script sql.hql
hive -f sql.hql
This prints many success messages
-S: silent mode
Use hive -S -f sql.hql to suppress those messages

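Query syntax

select		-- output columns
from		-- input table (input directory)
join		-- input table (input directory)
on			-- join condition
where		-- row filter
group by	-- grouping columns
having		-- filter applied to the aggregated results after grouping
[cluster by/partition by/distribute by]		-- only needed when bucketing
order by/sort by		-- sort condition
limit		-- maximum number of rows returned
union/union all		-- union removes duplicates; union all keeps them (neither guarantees output order)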

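Statement execution order

Step 1: FROM <left_table> 
Step 2: ON <join_condition> 
Step 3: <join_type> JOIN <right_table> 
Step 4: WHERE <where_condition> 
Step 5: GROUP BY <group_by_list> 
Step 6: HAVING <having_condition> 
Step 7: SELECT 
Step 8: DISTINCT <select_list> 
Step 9: ORDER BY <order_by_condition> 
Step 10: LIMIT <limit_number> 
Some rules of standard SQL: 
-1. A column alias can be used only after the step that defines it in the execution order; it cannot be referenced earlier (MySQL is an exception). 
-2. In a grouped query, the SELECT clause may contain only grouping columns and aggregate functions, not other plain columns (MySQL is an exception).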

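Note: although SELECT comes after HAVING in this order, Hive lets HAVING reference an alias defined in SELECT.
However, this works only when the alias stands for a single level of computation; if the SELECT expression wraps a further layer of computation, the alias cannot be used in HAVING.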
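
The alias rule can be sketched as follows. The table demo_db.emp and its dept column are hypothetical; the second query is the portable form that does not depend on alias resolution at all.

```sql
-- Hypothetical table demo_db.emp(dept, ...); names are assumptions.

-- The alias cnt stands for a single-level aggregate, so HAVING can reference it.
select dept, count(*) as cnt
from demo_db.emp
group by dept
having cnt > 2;

-- Portable form: repeat the aggregate expression instead of using the alias.
select dept, count(*) as cnt
from demo_db.emp
group by dept
having count(*) > 2;
```

When an alias is rejected in HAVING, falling back to the second form is usually the simplest fix.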

II. Review of Common Clauses

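Subqueries: Hive's subquery support is limited and has many pitfalls
where
Usually followed by an expression, which may use non-aggregate functions; aggregate functions are not allowed
group by
Usually used together with aggregate functions
Every selected column must either appear in the group by list or be wrapped in an aggregate function
having
Filters on the aggregate results after grouping
Aliases can be used, but only when the aliased computation is a single level
The columns being filtered must appear in the select list
order by
Global sort
Guarantees that the whole result is ordered, across all reducers
With only one reducer, it behaves the same as sort by
order by and sort by are usually combined with asc or desc; the default is asc
Whenever order by is used, the number of reducers is always one
sort by
Local sort
Guarantees order only within each individual reducer
Setting the number of reducers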

set mapreduce.job.reduces=1;

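By default the number of reducers is derived from the data volume:
one reducer per 256 MB (hive.exec.reducers.bytes.per.reducer)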
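
The difference between order by and sort by shows up only with more than one reducer. A minimal sketch, assuming the sz2002.u1 table with an id column that appears in the JOIN section of this article:

```sql
-- Ask for 3 reducers for the queries in this session.
set mapreduce.job.reduces=3;

-- Global order: order by forces a single reducer regardless of the setting above.
select id from sz2002.u1 order by id desc;

-- Local order: each reducer sorts only the rows it received.
select id from sz2002.u1 sort by id desc;

-- distribute by routes rows with the same id to the same reducer,
-- then sort by orders the rows inside each reducer.
select id from sz2002.u1 distribute by id sort by id desc;
```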

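limit
Caps the number of rows taken from the result set
union and union all
union: merges several result sets and removes duplicates (the output order is not guaranteed unless order by is added)
union all: merges several result sets without removing duplicates or sorting
If the data sets are known to contain no duplicates, prefer union all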
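
A small sketch of the two forms, assuming the sz2002.u1 and sz2002.u2 tables (each with an id column) from the JOIN section. Note that older Hive releases accept only union all.

```sql
-- Keeps every row from both inputs, duplicates included; no deduplication step is needed.
select id from sz2002.u1
union all
select id from sz2002.u2;

-- Removes duplicate ids, which costs an extra deduplication (shuffle) step.
select id from sz2002.u1
union
select id from sz2002.u2;
```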

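III. JOIN

In the Hive version used here, every join supports only equi-join conditions (= combined with and); !=, >, <, and or are not supported in the ON clause (newer Hive releases relax this)

Join types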

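left join: the left table is the base. Every row of the left table is returned; right-table columns are filled in where the join matches and are null where it does not.

right join: the right table is the base. Every row of the right table is returned; left-table columns are filled in where the join matches and are null where it does not.

full join: every row of both tables is returned; wherever one side has no match, that side's columns are null.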
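
The three outer joins can be compared side by side with a sketch like the one below, which assumes the sz2002.u1 and sz2002.u2 tables (joined on id) used elsewhere in this section. The column aliases are chosen only for readability.

```sql
-- Every u1 row appears; right_id is null where u2 has no matching id.
select u1.id as left_id, u2.id as right_id
from sz2002.u1
left join sz2002.u2 on u1.id = u2.id;

-- Every u2 row appears; left_id is null where u1 has no matching id.
select u1.id as left_id, u2.id as right_id
from sz2002.u1
right join sz2002.u2 on u1.id = u2.id;

-- Every row of both tables appears; the unmatched side is null.
select u1.id as left_id, u2.id as right_id
from sz2002.u1
full outer join sz2002.u2 on u1.id = u2.id;
```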

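4. Cartesian product
A join with no ON condition and no WHERE condition produces a Cartesian product
5. Hive also has a special join type
left semi join: a semi join (also called a half-open or left-open join), usually an optimization of left join. It can return only columns of the left table and is mainly used to test whether a left-table row has a match in the right table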

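select *
from sz2002.u1
left semi join sz2002.u2
on u1.id=u2.id;

select *
from sz2002.u1
inner join sz2002.u2
on u1.id=u2.id;

select  *
from sz2002.u1
where exists (select 1 from sz2002.u2 where u1.id=u2.id);

The left semi join and the exists query return the same rows. The inner join matches the same u1 rows, but its select * also returns u2's columns, and it repeats a u1 row once per matching u2 row, so the results coincide only when u2.id is unique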

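Differences between inner join and outer join

 A partition-column condition in the ON clause has no filtering effect for an outer join, but it does take effect for an inner join
 There is inner join, but no full inner join
 There is full outer join, but no bare outer join
 All joins support only equi-join conditions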

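map-side join
The small table is cached in memory and the join lookup is done on the map side. This reduces the amount of data each lookup has to scan and avoids much of the shuffle and data-transfer time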

set hive.auto.convert.join=true;

In older versions, you can add the hint /*+mapjoin(alias_of_small_table)*/
How large can a table be and still count as a small table?
set hive.mapjoin.smalltable.filesize=25000000;
About 23.8 MB
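
A sketch of the explicit hint, assuming u2 is the small table (the article does not say which of the two is smaller). The hint goes immediately after select; in newer Hive versions it may be ignored in favor of automatic conversion unless configured otherwise.

```sql
-- Assumption: u2 is small enough to fit in memory on every mapper.
select /*+ mapjoin(u2) */ u1.*
from sz2002.u1
join sz2002.u2 on u1.id = u2.id;
```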

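Hive also provides the streamtable hint, written streamtable(table_alias). It names the table that is streamed through the join while the other tables are buffered, so it should normally point at the largest table, not the small one

select
/*+ streamtable(u1) */  -- u1 is assumed to be the larger table here
*
from sz2002.u1
inner join sz2002.u2
on u1.id=u2.id;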
