Hive: Querying Data with HQL

碣石观海

Article link: https://blog.csdn.net/weixin_39469127/article/details/89504300

------------These notes are compiled from 《Hadoop海量数据处理：技术详解与项目实战》 by 范东来

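1. The select...from statement

--Supports column and table aliases, nested subqueries, and limiting the number of rows
--(left and right are reserved words in recent Hive versions, so these table names are quoted with backticks)
> select l.name ln, r.course rc
> from (select id, name from `left`) l
> join (select id, course from `right`) r on l.id = r.id
> limit 100;

--Supports case when
> select id, name, sex,
> case
> when sex = 'M' then 'male'
> when sex = 'F' then 'female'
> else 'no data'
> end
> from student;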

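2. The where statement

--Predicate expressions
A=B, A<B ...
A<>B, A!=B
A is [not] null
**: A [not] like B : B is a SQL wildcard pattern (% matches any sequence of characters, _ matches exactly one character)
**: A rlike B : B is a regular expression (Java syntax)
**: A regexp B : same as rlike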
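
--The difference between like and rlike is easy to get wrong, so here is a minimal Python model of the two predicates. It is a sketch, not Hive code: the helper names sql_like and rlike and the sample names are made up for illustration, and Python's re module stands in for the Java regex engine that Hive uses, so only syntax common to both is shown.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Model of SQL LIKE: % matches any run of characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # LIKE has to match the whole value, hence fullmatch.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

def rlike(value: str, pattern: str) -> bool:
    """Model of RLIKE/REGEXP: true if the regex matches any substring of value."""
    return re.search(pattern, value) is not None

names = ["Tom", "Tim", "Tommy", "Atom"]  # hypothetical sample data

like_single = [n for n in names if sql_like(n, "T_m")]
like_prefix = [n for n in names if sql_like(n, "Tom%")]
rlike_anchored = [n for n in names if rlike(n, "^T.m$")]
rlike_substring = [n for n in names if rlike(n, "om")]

print(like_single, like_prefix, rlike_anchored, rlike_substring)
```

--Note that an rlike pattern without ^ and $ matches anywhere inside the value, while a like pattern always has to cover the whole value.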

3. The group by and having statements

> select classid, avg(age) from student
> group by classid
> having avg(age) > 18;

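4. The join statement

--inner join (can be written simply as join)
--left outer join (can be written as left join)
--right outer join (can be written as right join)
--full outer join

*--left semi join: returns the left-table rows for which the on condition finds a match in the right table, each at most once; only left-table columns can be selected. It plays the role of an IN/EXISTS subquery in SQL, which early Hive versions did not support (subqueries in where are available from Hive 0.13).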
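
--To make the contrast with an inner join concrete, here is a small Python model of both. It is a sketch under assumptions: the two in-memory tables and the function names are hypothetical, and the join condition is simply equality on id.

```python
# Hypothetical tables: (id, name) on the left, (id, course) on the right.
left_rows = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
right_rows = [(1, "math"), (1, "physics"), (3, "art")]

def inner_join(left, right):
    """One output row per matching (left, right) pair."""
    return [(lid, name, course)
            for lid, name in left
            for rid, course in right
            if lid == rid]

def left_semi_join(left, right):
    """Each left row at most once, and only left columns, if any right row matches."""
    right_ids = {rid for rid, _ in right}
    return [(lid, name) for lid, name in left if lid in right_ids]

inner = inner_join(left_rows, right_rows)
semi = left_semi_join(left_rows, right_rows)

print(inner)
print(semi)
```

--Ann matches two right rows: the inner join returns her twice, the semi join only once, and Bob (no match) is dropped by both.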

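*--map side join: named in contrast to the reduce-side join.
    --If one of the tables being joined is small, it can be loaded into memory during the map phase and the join performed right there, which improves performance.
> select /*+ mapjoin(t1) */ t1.id, t2.id from table1 t1 join table2 t2 on t1.id = t2.id;
--Let Hive optimize automatically (it switches to a map-side join when appropriate):
> set hive.auto.convert.join = true; (the join then runs with map tasks only)
> set hive.auto.convert.join = false; (the join then also needs reduce tasks)
--Size threshold below which a table counts as small:
> set hive.mapjoin.smalltable.filesize = 25000000; (default value, in bytes, about 25MB)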
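
--The idea behind a map-side join can be sketched in Python as a hash join: build a lookup table from the small table, then stream the large table past it. This is only a model of the technique, not Hive's implementation; the function name and the sample rows are hypothetical.

```python
def map_side_join(small_table, large_table):
    """Join two lists of (key, value) rows without a shuffle or reduce step."""
    # Build phase: the small table is loaded fully into an in-memory hash table.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    # Probe phase: each mapper streams its split of the large table
    # and emits joined rows directly.
    for key, value in large_table:
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)

table1 = [(1, "a"), (2, "b")]                      # the small table
table2 = [(1, "x"), (2, "y"), (2, "z"), (3, "w")]  # the large table

joined = list(map_side_join(table1, table2))
print(joined)
```

--Because the large table is never regrouped by key, no reduce phase is needed; the price is that the small table has to fit in each mapper's memory.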

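--Multi-table join: in general Hive launches one job per join operation.
    --t1 is joined with t2, the output of that job is then joined with t3, and so on in order.
    --When every on clause uses the same join key (as with t1.id below), Hive can merge the joins into a single job.
> select *
> from table1 t1
> join table2 t2 on t2.id = t1.id
> join table3 t3 on t3.id = t1.id;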

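5. The order by and sort by statements

--order by: global sort, carried out by a single Reducer
> select * from student order by id desc, age asc;

--sort by: local sort inside each Reducer's partition, using multiple Reducers
> select * from student sort by id desc, age asc;

<Note: when the number of Reducer tasks is 1, both give the same result>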
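
--The difference between a global and a per-Reducer sort can be modeled in a few lines of Python. This is a sketch under assumptions: the rows are plain integers, and rows are dealt to Reducers round-robin, which is only a stand-in for however Hive actually spreads them when no distribute by is given.

```python
rows = [5, 3, 8, 1, 9, 2]  # hypothetical values of the sort column

def order_by(values):
    """One Reducer sees everything, so the output is globally sorted."""
    return sorted(values)

def sort_by(values, num_reducers):
    """Each Reducer sorts only the rows it received."""
    # Assumption: round-robin assignment of rows to Reducers.
    partitions = [values[i::num_reducers] for i in range(num_reducers)]
    return [sorted(part) for part in partitions]

global_sorted = order_by(rows)
local_sorted = sort_by(rows, 2)
concatenated = [v for part in local_sorted for v in part]
single_reducer = sort_by(rows, 1)

print(global_sorted)
print(local_sorted)
```

--Each Reducer's output is sorted, but the concatenation of the outputs is not; with a single Reducer the two statements coincide.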

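6. distribute by and sort by

--distribute by is to Hive what the partitioner is to MapReduce
--distribute by names the partitioning column: rows with the same value of that column go to the same Reducer task, and sort by then orders the rows inside each Reducer. Sorting the partitions in parallel is faster than one global sort.
--<comparable to a secondary sort>
> select col1, col2 from test distribute by col1 sort by col1, col2;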
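
--As a rough Python model of the query above: rows are first routed to a Reducer by their col1 value, then each Reducer sorts what it received by (col1, col2). The sample rows, the function name, and the routing rule col1 % num_reducers (an integer key whose hash is the value itself) are assumptions made for illustration.

```python
def distribute_and_sort(rows, num_reducers):
    """Model of: distribute by col1 sort by col1, col2."""
    partitions = [[] for _ in range(num_reducers)]
    for col1, col2 in rows:
        # Assumption: integer key, hash(col1) == col1.
        partitions[col1 % num_reducers].append((col1, col2))
    # Each Reducer sorts its own partition by (col1, col2).
    return [sorted(part) for part in partitions]

rows = [(2, "b"), (1, "z"), (2, "a"), (3, "m"), (1, "c")]  # hypothetical data
result = distribute_and_sort(rows, 2)

for reducer_id, part in enumerate(result):
    print(reducer_id, part)
```

--All rows sharing a col1 value end up in the same partition, so within one Reducer's output they are adjacent and ordered by col2.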

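7. The cluster by statement

--cluster by c1 is shorthand for distribute by c1 sort by c1
--<Note 1: with cluster by, the partitioning columns and the sort columns are the same (and equal in number)>
--<Note 2: the sort is always ascending; descending order cannot be specified>
> select col1, col2 from test cluster by col1;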

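8. Bucketing and sampling

--Source data, table data:
id
0
1
2
3
4
6
7
10
11
55
--Bucket sampling:
> select * from data tablesample(bucket 3 out of 4 on id);
2
6
10
<i.e. the rows of the third bucket>
<Note 1: the three parameters of tablesample(bucket x out of y on z) mean:>
<    y: the number of buckets; x: the x-th bucket after bucketing (numbered from 1 to y); z: the column to bucket on>
<    Bucketing rule: take the hash of the z value modulo y; remainder r means bucket r+1 (in these examples the hash of an int id is the value itself)>
<    Example: the id value 4 gives 4 mod 4 = 0, so it goes to bucket 1>
<    Example: the id value 6 gives 6 mod 4 = 2, so it goes to bucket 3>
<Note 2: the query above means: split the rows of data into 4 buckets by the id column and show the third bucket>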
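
--The bucketing rule can be checked against the sample output with a short Python model. It assumes, as the examples in the notes do, that the hash of a non-negative int is the value itself; the function names bucket_of and tablesample are made up for this sketch.

```python
data = [0, 1, 2, 3, 4, 6, 7, 10, 11, 55]  # the id column of the data table

def bucket_of(value, num_buckets):
    """1-based bucket number, assuming hash(value) == value for non-negative ints."""
    return value % num_buckets + 1

def tablesample(values, x, y):
    """Model of tablesample(bucket x out of y on id)."""
    return [v for v in values if bucket_of(v, y) == x]

sample = tablesample(data, 3, 4)
buckets = {b: tablesample(data, b, 4) for b in range(1, 5)}

print(sample)
print(buckets)
```

--The third bucket holds exactly the ids whose remainder modulo 4 is 2, which are the rows returned by the query.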
--A random sample is also possible: use rand() in place of a column
> select * from data tablesample(bucket 3 out of 4 on rand());


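--The data table above is not a bucketed table. A table can also be declared as bucketed when it is created, which makes sampling more efficient:
--First enable bucketing enforcement (needed before Hive 2.0; later versions always enforce it):
> set hive.enforce.bucketing=true;
--Create the bucketed table buckettable: bucketed by column id into 4 buckets
> create table buckettable (id int) clustered by (id) into 4 buckets;
--Load the data:
> insert overwrite table buckettable select id from data;
<Log: Mapper:1, Reducer:4>
<    i.e. the same as 4 partitions, with 4 Reducer tasks and 4 output files>
--Query the table: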
> select * from buckettable;
0
4
1
2
6
10
3
7
11
55
--List the table's storage path in HDFS:
> dfs -ls /user/hive/warehouse/test.db/buckettable;
    /user/hive/warehouse/test.db/buckettable/000000_0
    /user/hive/warehouse/test.db/buckettable/000001_0
    /user/hive/warehouse/test.db/buckettable/000002_0
    /user/hive/warehouse/test.db/buckettable/000003_0
<As shown, the bucketed table produced 4 files, one per bucket>
--Look at the data in each bucket:
> dfs -cat /user/hive/warehouse/test.db/buckettable/000000_0;
0
4
> dfs -cat /user/hive/warehouse/test.db/buckettable/000001_0;
1
> dfs -cat /user/hive/warehouse/test.db/buckettable/000002_0;
2
6
10
> dfs -cat /user/hive/warehouse/test.db/buckettable/000003_0;
3
7
11
55
--View a bucket sample:
> select * from buckettable tablesample(bucket 3 out of 4 on id);
2
6
10
<i.e. the rows of the third bucket>

9. The union all statement

--Combines the results of several queries; the selected columns must match
> select id, name from student where classid = 1
> union all
> select id, name from student where classid = 2;
