Notes from reading Programming Hive

Author: 不想起的昵称

Article link: https://blog.csdn.net/weixin_40267121/article/details/119782314

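Hive CLI (version 1.1.0)
(1) The Hive CLI can run Hadoop file system commands directly. Drop the leading hdfs (or hadoop), for example: dfs -ls /;
(2) set hive.cli.print.header=true; prints the column headers with query results
(3) set hive.cli.print.current.db=true; shows the current database in the Hive CLI prompt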
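(4) By default a database can only be dropped after its tables have been dropped. Adding the cascade keyword lets Hive drop the tables in the database itself.
For example: drop database if exists test_dp cascade;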
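(5) Hive maintains two built-in table properties: last_modified_by (the user who last modified the table) and last_modified_time (the time of the last modification)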
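(6) Dynamic partition limits:
hive.exec.max.dynamic.partitions.pernode, default 100: the maximum number of dynamic partitions each mapper or reducer may create
hive.exec.max.dynamic.partitions, default 1000: the maximum number of partitions a single dynamic-partition statement may create
hive.exec.max.created.files, default 100000: the maximum number of files a job may create in total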
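As an illustration, here is a minimal sketch of a dynamic-partition insert that these limits apply to. The tables logs_staging and logs_by_day and their columns are hypothetical, and the two hive.exec.dynamic.partition* switches are standard Hive settings that the notes above do not mention.

```sql
-- Hypothetical tables:
--   logs_staging(msg STRING, dt STRING)
--   logs_by_day(msg STRING) PARTITIONED BY (dt STRING)

-- Enable dynamic partitioning and allow every partition column to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The three limits from the list above, shown at their default values.
SET hive.exec.max.dynamic.partitions.pernode=100;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.created.files=100000;

-- The dynamic partition column (dt) must come last in the SELECT list.
INSERT OVERWRITE TABLE logs_by_day PARTITION (dt)
SELECT msg, dt
FROM logs_staging;
```

If the statement would create more partitions or files than the limits allow, Hive fails the job, so raising the limits is the usual fix when the staging data spans many partition values.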
(7) floor() rounds down, ceil() rounds up
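A quick check of the rounding direction, including negative inputs where the difference is easiest to get wrong:

```sql
-- floor() rounds toward negative infinity, ceil() toward positive infinity.
SELECT floor(3.7), ceil(3.2), floor(-3.7), ceil(-3.2);
```

The four columns come back as 3, 4, -4 and -3.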
(8) json_tuple() can extract several fields in one call, for example: select a.* from test lateral view json_tuple('${hivevar:msg}','server','host') a as f1,f2;
get_json_object() can extract only one field per call
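The two functions can be compared side by side. This is a sketch that assumes a hypothetical table json_log(line STRING) holding one JSON object per row, such as {"server":"s1","host":"h1"}.

```sql
-- Hypothetical table: json_log(line STRING), one JSON object per row.

-- get_json_object: one JSONPath expression per call, so two calls for two fields.
SELECT get_json_object(t.line, '$.server') AS server,
       get_json_object(t.line, '$.host')   AS host
FROM json_log t;

-- json_tuple: one call returns several top-level keys.
SELECT a.server, a.host
FROM json_log t
LATERAL VIEW json_tuple(t.line, 'server', 'host') a AS server, host;
```

json_tuple only reads top-level keys, so get_json_object is still the one to use for nested paths.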
(9) To parse URLs, use parse_url_tuple() and parse_url()
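A short sketch of both URL functions. The URL literal and the table url_log(url STRING) are made up for illustration.

```sql
-- parse_url extracts one part of a URL per call.
SELECT parse_url('http://example.com/path/p.php?k1=v1&k2=v2#Ref1', 'HOST');
SELECT parse_url('http://example.com/path/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');

-- parse_url_tuple extracts several parts in one call.
-- Hypothetical table: url_log(url STRING).
SELECT u.host, u.path, u.k1
FROM url_log l
LATERAL VIEW parse_url_tuple(l.url, 'HOST', 'PATH', 'QUERY:k1') u AS host, path, k1;
```

The first statement returns example.com and the second returns v1.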
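(10) Local mode. With set hive.exec.mode.local.auto=true; Hive runs a job locally only when all of the following hold:
the job's total input size is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
the job's number of map tasks is smaller than hive.exec.mode.local.auto.tasks.max (default 4; Hive 0.9.0 and later replace this property with hive.exec.mode.local.auto.input.files.max)
the job's number of reduce tasks is 0 or 1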
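Put together, the local-mode settings could look like the sketch below. The table small_table is hypothetical, and the byte value is simply the 128MB default written out.

```sql
-- Let Hive decide per job whether it is small enough to run locally.
SET hive.exec.mode.local.auto=true;

-- Input size threshold: 128 MB = 134217728 bytes (the default).
SET hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Hypothetical small table; a job this small is a candidate for local mode.
SELECT count(*) FROM small_table;
```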
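(11) Map join: when a large table is joined with a small table, the small table is loaded into memory.
set hive.auto.convert.join=true; -- enable automatic conversion
set hive.mapjoin.smalltable.filesize=25000000; -- size threshold for the small table, in bytes
Right join and full join do not support this optimization.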
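A minimal sketch of a join that is a candidate for automatic map-join conversion. The tables orders (large) and dim_city (small) are hypothetical.

```sql
-- Hypothetical tables: orders (large) and dim_city (small, under the size threshold).
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

-- EXPLAIN shows the plan; with the conversion applied it should contain a
-- Map Join Operator instead of a reduce-side join.
EXPLAIN
SELECT o.order_id, c.city_name
FROM orders o
JOIN dim_city c ON o.city_id = c.city_id;
```

Checking the plan with EXPLAIN first is a cheap way to confirm the small table really falls under the threshold before running the full query.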
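(12) The difference between order by and sort by:
order by sorts globally, so a single reducer handles all the data. sort by sorts locally, within each reducer. With only one reducer the two give the same result. For a top-N query, sort by performs better than order by.
distribute by is similar to group by in how it routes rows: rows with the same key are sent to the same reducer.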
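The three clauses can be sketched against a hypothetical table sales(uid STRING, amount DOUBLE). The top-N form below is one common pattern, not necessarily the one the book uses.

```sql
-- Hypothetical table: sales(uid STRING, amount DOUBLE).

-- Global order: one reducer sorts every row.
SELECT uid, amount FROM sales ORDER BY amount DESC;

-- Per-reducer order: rows with the same uid go to the same reducer
-- and are sorted inside it.
SELECT uid, amount FROM sales DISTRIBUTE BY uid SORT BY uid, amount DESC;

-- Top 10: each reducer sorts its share and keeps few rows,
-- then the outer ORDER BY only has to order the survivors.
SELECT t.uid, t.amount
FROM (SELECT uid, amount FROM sales SORT BY amount DESC LIMIT 10) t
ORDER BY t.amount DESC
LIMIT 10;
```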
(13) Several union all branches can be replaced with insert into statements. union all distributes and copies the source data.
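A sketch of the rewrite, assuming hypothetical tables src_a and src_b whose id and name columns match the target table target_all.

```sql
-- Hypothetical tables: src_a(id, name), src_b(id, name), target_all(id, name).

-- Original form: one insert fed by a UNION ALL subquery.
INSERT OVERWRITE TABLE target_all
SELECT u.id, u.name
FROM (
  SELECT id, name FROM src_a
  UNION ALL
  SELECT id, name FROM src_b
) u;

-- Alternative: overwrite with the first source, then append the second.
INSERT OVERWRITE TABLE target_all SELECT id, name FROM src_a;
INSERT INTO TABLE target_all SELECT id, name FROM src_b;
```

Both forms leave the same rows in target_all, which is the whole point of the substitution.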
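(14) When chaining several left joins, order the tables from smallest to largest. Hive streams the last (largest) table and buffers the others in memory. The table to stream can also be named explicitly with a hint: /*+ STREAMTABLE(s) */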
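With the hint, the streamed table no longer has to be the last one in the join. A sketch with hypothetical tables big_s, mid_m and tiny_t:

```sql
-- Hypothetical tables: big_s (largest), mid_m, tiny_t, all joined on id.
-- The hint tells Hive to stream big_s and buffer the other two,
-- even though big_s is listed first.
SELECT /*+ STREAMTABLE(s) */ s.id, m.val, t.val
FROM big_s s
LEFT JOIN mid_m m ON s.id = m.id
LEFT JOIN tiny_t t ON s.id = t.id;
```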
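(15) When two different statements query the same table,
the following is inefficient, because history is scanned twice:
insert overwrite table test1 select * from history where user='xiaoming';

insert overwrite table test2 select * from history where user='xiaohong';

Optimized query, which scans history only once:
from history
insert overwrite table test1 select * where user='xiaoming'
insert overwrite table test2 select * where user='xiaohong';
Note: a multi-insert statement cannot insert into the same table more than once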
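(16) set hive.mapred.mode=strict; enables strict mode, which forbids three kinds of queries.
1. Queries on a partitioned table without a partition filter in the where clause; scanning all partitions is not allowed
2. order by queries without a limit clause
3. Cartesian product queries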
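The three restrictions can be sketched with one hypothetical partitioned table, logs_by_day(msg STRING) partitioned by dt. The rejected forms are left as comments and the accepted forms are shown as statements.

```sql
SET hive.mapred.mode=strict;

-- Hypothetical table: logs_by_day(msg STRING) PARTITIONED BY (dt STRING).

-- 1. Rejected in strict mode: SELECT msg FROM logs_by_day;
--    Accepted: the WHERE clause filters on the partition column.
SELECT msg FROM logs_by_day WHERE dt = '2021-08-18';

-- 2. Rejected in strict mode: the same query with ORDER BY msg and no LIMIT.
--    Accepted: ORDER BY together with LIMIT.
SELECT msg FROM logs_by_day WHERE dt = '2021-08-18' ORDER BY msg LIMIT 100;

-- 3. Rejected in strict mode: a join without an ON condition (Cartesian product).
--    Accepted: the join has an explicit ON condition.
SELECT a.msg, b.msg
FROM logs_by_day a
JOIN logs_by_day b ON a.msg = b.msg
WHERE a.dt = '2021-08-18' AND b.dt = '2021-08-18';
```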