Hive Study Notes, Continued (Part 2)

KYkankankan
Article link: https://blog.csdn.net/KYkankankan/article/details/81297190
--group by clause, usually used together with aggregate functions such as sum, count, avg...
hive> select year(ymd),avg(price_close) from stocks
		where exchange='nasdaq' and symbol='aapl'
		group by year(ymd);
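----a small variation, as a sketch: several aggregates can share one group by (same stocks table and columns as above)
hive> select year(ymd),count(*),min(price_close),max(price_close) from stocks
		where exchange='nasdaq' and symbol='aapl'
		group by year(ymd);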
---having clause
hive> select year(ymd),avg(price_close) from stocks 
		where exchange='nasdaq' and symbol='aapl'
		group by year(ymd) having avg(price_close)>50.0;--without a having clause, the same filter requires the nested query below
hive> select se.year,se.avg from
		(select year(ymd) as year,avg(price_close) as avg from stocks
		where exchange='nasdaq' and symbol='aapl'
		group by year(ymd)) as se
		where se.avg>50.0;
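--join statements: the Hive versions these notes cover support only equi-joins (no non-equality conditions in on) and do not allow or between predicates in the on clause; use table aliases
---inner join: a row is kept only when both joined tables contain data matching the join condition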
hive> select a.ymd,a.price_close,b.price_close 
		from stocks a join stocks b on a.ymd=b.ymd
		where a.symbol='aapl' and b.symbol='ibm';
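----sketch, assuming the same stocks table: a non-equality comparison between the two sides can be written in where instead of on, where it filters the rows produced by the equi-join
hive> select a.ymd,a.price_close,b.price_close 
		from stocks a join stocks b on a.ymd=b.ymd
		where a.symbol='aapl' and b.symbol='ibm' and a.price_close>b.price_close;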
---create a table
create table if not exists dividends(
	ymd string,
	dividend float
	)partitioned by (exchange string, symbol string)
	row format delimited fields terminated by ',';
---equi-join on the ymd and symbol columns
hive> select s.ymd,s.symbol,s.price_close,d.dividend
		from stocks  s join dividends d on s.ymd=d.ymd and s.symbol=d.symbol
		where s.symbol='aapl';
---joining more than two tables
hive> select a.ymd,a.price_close,b.price_close,c.price_close 
		from stocks a join stocks b on a.ymd=b.ymd	
					join stocks c on a.ymd=c.ymd
		where a.symbol='aapl' and b.symbol='ibm' and c.symbol='ge';
---join optimization
-----in a join query, table sizes should increase from left to right: put the large table last
hive> select s.ymd,s.symbol,s.price_close,d.dividend
		from dividends d join stocks s on s.ymd=d.ymd and s.symbol=d.symbol
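		where s.symbol='aapl';--the stocks table is larger than the dividends table
-----or use a hint, so the tables do not have to be reordered in the query
select /*+ streamtable(s) */ s.ymd,s.symbol,s.price_close,d.dividend 
from stocks s join dividends d on s.ymd=d.ymd and s.symbol=d.symbol 
where s.symbol='aapl';--Hive now streams stocks as the driving table even though it is not the last table in the query
---left outer join: returns every left-table row that satisfies where; columns from the right table that have no match are filled with null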
hive> select s.ymd,s.symbol,s.price_close,d.dividend
	from stocks s left outer join dividends d 
	on s.ymd=d.ymd and s.symbol=d.symbol
	where s.symbol='aapl';
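----sketch: testing the right table's join key for null keeps only the left-table rows that found no match, here the aapl trading days without a dividend record
hive> select s.ymd,s.symbol,s.price_close
	from stocks s left outer join dividends d 
	on s.ymd=d.ymd and s.symbol=d.symbol
	where s.symbol='aapl' and d.ymd is null;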
---right outer join: returns every right-table row that satisfies where; columns from the left table that have no match are filled with null
hive> select s.ymd,s.symbol,s.price_close,d.dividend
		from dividends d right outer join stocks s 
		on s.ymd=d.ymd and s.symbol=d.symbol
		where s.symbol='aapl';
---full outer join: returns all rows from both tables that satisfy the where clause; where either table has no matching value for a column, null is used instead
hive> select s.ymd,s.symbol,s.price_close,d.dividend
		from dividends d full outer join stocks s 
		on d.ymd=s.ymd and d.symbol=s.symbol
		where s.symbol='aapl';
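----sketch: where runs after the join, so the condition s.symbol='aapl' also discards the null-padded rows that exist only in dividends; filtering each side in a subquery before joining keeps them
hive> select s.ymd,s.symbol,s.price_close,d.ymd,d.dividend from
		(select * from dividends where symbol='aapl') d 
		full outer join
		(select * from stocks where symbol='aapl') s 
		on d.ymd=s.ymd and d.symbol=s.symbol;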
---left semi join: returns rows from the left table, provided the row satisfies the on condition against the right table
hive> select s.ymd,s.symbol,s.price_close 
	from stocks s left semi join dividends d
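	on s.ymd=d.ymd and s.symbol=d.symbol;--neither select nor where may reference columns of the right table
----note: Hive does not support right semi joins; a left semi join is generally more efficient than an inner join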
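----for comparison, a sketch of the same question as a correlated exists subquery; whether this form is accepted depends on the Hive version (subqueries in where arrived in releases later than the ones these notes are based on)
hive> select s.ymd,s.symbol,s.price_close 
	from stocks s
	where exists (select d.ymd from dividends d
		where s.ymd=d.ymd and s.symbol=d.symbol);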
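---cartesian product join: the size of the result set is the number of rows in the left table multiplied by the number of rows in the right table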
hive> select * from stocks join dividends;
----note: unlike other join types, cartesian products are not executed in parallel, and with the MapReduce framework they cannot be optimized in any way
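----sketch: with the classic hive.mapred.mode property set to strict, Hive rejects a cartesian product instead of running it (newer releases use separate hive.strict.checks.* settings, so treat the property name as version-dependent)
hive> set hive.mapred.mode=strict;
hive> select * from stocks join dividends;--rejected under strict mode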
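---map-side join: if all but one of the tables are small, the small tables can be held entirely in memory while the largest table passes through the mappers. This skips the reduce phase a regular join needs, and can sometimes reduce the number of map steps as well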
select /*+ mapjoin (d)*/s.ymd,s.symbol,s.price_close,d.dividend 
from stocks s join dividends d 
on s.ymd=d.ymd and s.symbol=d.symbol
where s.symbol='aapl';--Hive does not support this optimization for right outer joins and full outer joins
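----sketch: instead of the hint, Hive can be asked to convert joins to map-side joins automatically; hive.mapjoin.smalltable.filesize is the size threshold in bytes below which a table counts as small (25000000 is only an illustrative value)
hive> set hive.auto.convert.join=true;
hive> set hive.mapjoin.smalltable.filesize=25000000;
hive> select s.ymd,s.symbol,s.price_close,d.dividend 
	from stocks s join dividends d 
	on s.ymd=d.ymd and s.symbol=d.symbol
	where s.symbol='aapl';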
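---order by: performs a total ordering of the query result, meaning all data passes through a single reducer. For large data sets this can take a very long time. asc (ascending) is the default; desc is descending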
select s.ymd,s.symbol,s.price_close
from stocks s 
order by s.ymd asc,s.symbol desc;
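----sketch: adding limit keeps the globally sorted result small (100 is an arbitrary choice); in strict mode (hive.mapred.mode=strict) an order by without limit is rejected
select s.ymd,s.symbol,s.price_close
from stocks s 
order by s.ymd asc,s.symbol desc
limit 100;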
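---sort by: orders data only within each reducer, i.e. a local sort. Each reducer's output is guaranteed to be sorted, but the overall result is not globally sorted. This can make a later global sort more efficient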
select s.ymd,s.symbol,s.price_close
from stocks s 
sort by s.ymd asc ,s.symbol desc;
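----sketch: the difference from order by only shows up with more than one reducer; mapred.reduce.tasks is the classic property name (newer Hadoop versions call it mapreduce.job.reduces), and 3 is an arbitrary choice
hive> set mapred.reduce.tasks=3;
hive> select s.ymd,s.symbol,s.price_close
	from stocks s 
	sort by s.ymd asc,s.symbol desc;--each reducer emits its own sorted run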
----distribute by with sort by: distribute by controls how map output is divided among the reducers; it must be written before sort by
hive> select s.ymd,s.symbol,s.price_close
	from stocks s 
	distribute by s.symbol
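	sort by s.symbol asc, s.ymd asc;--to process rows with the same stock symbol together, distribute by sends all rows with the same symbol to the same reducer, and sort by then orders the data as required
---cluster by: when distribute by and sort by name exactly the same columns and the sort order is ascending, cluster by is equivalent to those two clauses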
hive> select s.ymd,s.symbol,s.price_close
	from stocks s
	cluster by s.symbol;
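----for comparison, the equivalent query spelled out with distribute by and sort by
hive> select s.ymd,s.symbol,s.price_close
	from stocks s
	distribute by s.symbol
	sort by s.symbol asc;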
---type conversion with cast(value as type): explicitly converts the given value to the given type
select name, salary from employees
where cast(salary as float)<10000.0;
----note: the recommended way to convert a floating-point number to an integer is round() or floor(), not the cast operator
---nested casts: convert binary to double
select (2.0*cast(cast(b as string) as double)) from src;
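----sketch, assuming (as the cast example on employees suggests) that salary is stored as a string: round() gives the nearest whole number, floor() rounds down
select name, round(cast(salary as float)), floor(cast(salary as float)) from employees;
----a salary string that is not a valid number casts to null instead of raising an error, so such rows can be found with
select name, salary from employees
where salary is not null and cast(salary as float) is null;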
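---sampling queries--assume the numbers table has a single column, number, holding the values 1-10
----bucket sampling
hive> select * from numbers tablesample(bucket 3 out of 10 on rand()) s;--returns a random sample that differs from run to run
----when bucketing on a named column instead of rand(), running the same statement repeatedly returns the same rows
hive> select * from numbers tablesample(bucket 3 out of 10 on number) s;--returns the same rows on every run
hive> select * from numbers tablesample(bucket 5 out of 10 on number) s;--result differs from the previous statement, but is the same on every run
--note: in the bucket clause the denominator is the number of buckets the data is hashed into, and the numerator is the number of the bucket that is selected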
hive> select * from numbers tablesample(bucket 1 out of 2 on number) s;
2
4
6
8
10
hive> select * from numbers tablesample(bucket 2 out of 2 on number) s;
1
3
5
7
9
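----sketch: if the table itself is bucketed on the sampled column, only the selected bucket needs to be read. numbers_bucketed is a hypothetical table name, and hive.enforce.bucketing applies to older Hive versions (Hive 2.x enforces bucketing without it)
hive> create table numbers_bucketed (number int) clustered by (number) into 3 buckets;
hive> set hive.enforce.bucketing=true;
hive> insert overwrite table numbers_bucketed select number from numbers;
hive> select * from numbers_bucketed tablesample(bucket 2 out of 3 on number) s;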
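---block sampling: an alternative to row-based sampling that samples a percentage of the data blocks under the input path
hive> select * from numbersflat tablesample(0.1 percent) s;
---union all: combines two or more tables. Every subquery in the union must produce the same columns, and each corresponding column must have the same type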
select log.ymd,log.level,log.message
from(
select l1.ymd,l1.level,l1.message,'log1' as source from log1 l1
union all
select l2.ymd,l2.level,l2.message,'log2' as source from log2 l2)log
sort by log.ymd asc;
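----sketch: the source column added in each subquery tells the rows apart after the union, for example to count rows per log table
select log.source,count(*)
from(
select l1.ymd,l1.level,l1.message,'log1' as source from log1 l1
union all
select l2.ymd,l2.level,l2.message,'log2' as source from log2 l2)log
group by log.source;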
---using views to reduce query complexity
from(
	select * from people join cart
	on (cart.people_id=people.id) where firstname='john') a 
	select a.lastname where a.id=3;
---the same query using a view
create view shorter_join as select * from people join cart on (cart.people_id=people.id) where firstname='john';
select lastname from shorter_join where id=3;
---views that restrict data access can protect information from being queried freely
hive> create table userinfo (
	firstname string,lastname string,ssn string,password string);
hive> create view safer_user_info as 
	select firstname,lastname from userinfo;
----restricting access through a where clause, so that only employees of a specific department can be queried
hive> create table employee (firstname string,lastname string,ssn string,password string,department string);
hive> create view techops_employee as 
	select firstname,lastname,ssn from employee where department='techops';
----views with dynamic partitions and the map type
create external table dynamictable(cols map<string,string>) 
row format delimited fields terminated by '\004'
collection items terminated by '\001'
map keys terminated by '\002'
stored as textfile;--map entries are separated by ^A, and keys are separated from values by ^B
create view shipments(time,part) as 
select cols['times'],cols['parts'] from dynamictable where cols['type']='response';
---drop a view
drop view if exists shipments;
---copy a view
create table shipments2 like shipments;
--note: a view cannot be the target of insert or load; views are read-only
To be continued...