Hive Basics

西岚之妖刀

Article link: https://blog.csdn.net/fshx649426214/article/details/83871240

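I. Architecture

II. Common Commands

Creating tables
Managed (internal) tables

Data for a managed table is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse).
Dropping a managed table deletes both its metadata and its stored data.
Changes made to a managed table are synchronized directly to the metadata.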

create table if not exists dim_crsv_dty
(
id string COMMENT 'dimension primary key',
crsv_dty_val string COMMENT 'dimension attribute',
comment string COMMENT 'comment',
src_sys string COMMENT 'source system'
)
comment 'customer service task dimension table'
partitioned by (dt string)
row format delimited fields terminated by '\t';

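External tables
The storage location of an external table's data is specified by the user.
Dropping an external table deletes only the metadata; the files on HDFS are not deleted.
After the structure or partitions of an external table are changed, the metadata must be repaired (MSCK REPAIR TABLE table_name).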

create external table t2(
id      int,
name    string,
hobby   array<string>,
add     map<String,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2';

-- Show detailed table information (table type, location, and so on)
desc formatted table_name;
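
As a rough sketch of the repair step mentioned above, assume a hypothetical partitioned external table named ext_logs whose partition directories are written to HDFS by some other process. Hive does not see those directories until the metastore is updated, either for all partitions at once or one at a time. The table name, columns, and paths below are illustrative assumptions, not part of the original post.

```sql
-- Hypothetical partitioned external table; names and paths are assumptions.
create external table if not exists ext_logs (
  id  int,
  msg string
)
partitioned by (dt string)
row format delimited fields terminated by ','
location '/user/ext_logs';

-- Suppose /user/ext_logs/dt=20180206/ was created directly on HDFS.
-- Scan the table location and register partitions missing from the metastore:
msck repair table ext_logs;

-- Alternatively, register a single partition explicitly:
alter table ext_logs add if not exists partition (dt='20180206')
location '/user/ext_logs/dt=20180206';

-- Check which partitions Hive now knows about:
show partitions ext_logs;
```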

Bucketing
Bucketing helps with joins on the bucketed column.

CREATE TABLE bucketed_users (id INT, name STRING) 
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;

Sampling data from buckets

hive> SELECT * FROM bucketed_users 
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 
0 Nat 
4 Ann

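Bucket numbers are counted from 1, so the query above retrieves all users from the first of the 4 buckets. For a large, evenly distributed dataset, this returns about one quarter of the table's rows. Other ratios can be used to sample several buckets (because sampling is not an exact operation, the ratio does not have to be an integer multiple of the number of buckets).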
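
The other sampling ratios mentioned above might look like the following sketch. It reuses the bucketed_users table; the source table users in the last statement is a hypothetical name, and the hive.enforce.bucketing setting is only relevant on Hive versions before 2.0.

```sql
-- Roughly half of the rows: "out of 2" splits the data into 2 samples.
select * from bucketed_users tablesample(bucket 1 out of 2 on id);

-- Sampling on rand() instead of the bucketing column scans the whole table
-- and picks rows at random, so it also works on tables that are not bucketed.
select * from bucketed_users tablesample(bucket 1 out of 4 on rand());

-- Populating the bucketed table. On Hive versions before 2.0, rows are only
-- distributed into the declared buckets when this setting is enabled.
set hive.enforce.bucketing = true;
insert overwrite table bucketed_users
select id, name from users;  -- "users" is a hypothetical source table
```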
Altering tables
Changing the table comment

alter table table_name set tblproperties ('comment' = 'customer management sales plan table');

Adding columns

alter table dws_flow_web_page_visit_d add columns (visit_times bigint);

Changing columns

alter table to8to_rawdata.sem_360_keyword change column old_name new_name int;
alter table dwd.dwd_flow_sdk_event_d change column openid openid string after page_click_value;

Loading data

load data inpath '/user/richard_chen/dim/dim_ctc.txt' 
into table dim_ctc partition(dt=20180206);
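
Two common variants of the statement above, as a hedged sketch: LOCAL reads from the client's local filesystem instead of HDFS, and OVERWRITE replaces whatever the target partition already holds. The local path /tmp/dim_ctc.txt is a hypothetical example.

```sql
-- Copy a file from the local filesystem and replace the partition's existing data.
-- The local path is an assumption made for illustration.
load data local inpath '/tmp/dim_ctc.txt'
overwrite into table dim_ctc partition (dt='20180206');

-- Quick check that the rows landed in the expected partition.
select * from dim_ctc where dt = '20180206' limit 10;
```

Without LOCAL, the source file on HDFS is moved into the table's directory rather than copied.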

Extracting substrings from fields

regexp_extract('to8to_pc', "(.*?)(_)(.*?)(&|$)", 1)
regexp_extract(regexp_extract(a.url, '(.+?)(?=\\?|\\%|$|#)', 0), '(.+/[^0-9]*)', 1)
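
The third argument of regexp_extract selects which capture group to return, with 0 meaning the whole match. A small sketch using the first pattern above:

```sql
select regexp_extract('to8to_pc', '(.*?)(_)(.*?)(&|$)', 0);  -- to8to_pc (whole match)
select regexp_extract('to8to_pc', '(.*?)(_)(.*?)(&|$)', 1);  -- to8to    (text before the first underscore)
select regexp_extract('to8to_pc', '(.*?)(_)(.*?)(&|$)', 3);  -- pc       (text after it, up to '&' or end of string)
```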

Basic functions
https://www.cnblogs.com/MOBIN/p/5618747.html
Date and time functions
Current date and time

select current_date;
select current_timestamp;
select unix_timestamp(); -- current UNIX timestamp in seconds

String to timestamp

select unix_timestamp('2017-09-15 14:23:00'); 
select unix_timestamp('2017-09-15 14:23:00','yyyy-MM-dd HH:mm:ss');

Timestamp to string

select from_unixtime(1505456567); 
select from_unixtime(1505456567,'yyyy-MM-dd HH:mm:ss'); 
select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss'); -- current system time

String to string
Method 1: from_unixtime + unix_timestamp
Convert 20171205 to 2017-12-05 (in the pattern, MM is the month; lowercase mm would mean minutes):

select from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd') from dual;

Convert 2017-12-05 to 20171205:

select from_unixtime(unix_timestamp('2017-12-05','yyyy-MM-dd'),'yyyyMMdd') from dual;

Method 2: substr + concat
Convert 20171205 to 2017-12-05:

select concat(substr('20171205',1,4),'-',substr('20171205',5,2),'-',substr('20171205',7,2)) from dual;

Convert 2017-12-05 to 20171205:

select concat(substr('2017-12-05',1,4),substr('2017-12-05',6,2),substr('2017-12-05',9,2)) from dual;
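
The same two conversions can also be sketched with regexp_replace, which avoids repeating the input string. This is an alternative to the methods above rather than something from the original post; the date_format variant assumes Hive 1.2 or later.

```sql
-- 2017-12-05 -> 20171205: strip the hyphens
select regexp_replace('2017-12-05', '-', '');                                 -- 20171205

-- 20171205 -> 2017-12-05: capture year, month, day and re-join them
select regexp_replace('20171205', '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3');    -- 2017-12-05

-- 2017-12-05 -> 20171205 using date_format (assumes Hive 1.2+)
select date_format('2017-12-05', 'yyyyMMdd');                                 -- 20171205
```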

String to date

select to_date('2017-09-15 11:12:00');

Date/time to string
Method 3: date_format
Convert current_timestamp to a string:

select date_format(current_timestamp, 'yyyy-MM-dd HH:mm:ss');

Date difference

select datediff('2017-09-15','2017-09-01') from dual;
--14

Adding and subtracting days

hive> select date_add('2017-09-15',1) from dual;    
2017-09-16
hive> select date_sub('2017-09-15',1) from dual;    
2017-09-14
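
A typical use of these functions is computing a relative date such as "yesterday" for a partition filter. The sketch below assumes a Hive version that supports current_date and date_format (1.2 or later); the results depend on the day the query runs, so no output is shown for the first two statements.

```sql
-- Yesterday as a date value
select date_sub(current_date, 1);

-- Yesterday formatted like a yyyyMMdd partition value
select date_format(date_sub(current_date, 1), 'yyyyMMdd');

-- date_add also accepts a negative number of days
select date_add('2017-09-15', -1);  -- 2017-09-14
```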

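III. UDFs

UDF (user-defined function)
UDAF (user-defined aggregate function)
UDTF (user-defined table-generating function)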
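
The three categories differ in how many rows go in and come out. As a hedged illustration, the built-in functions below behave like each category, using the t2 table defined earlier. The jar path, function name, and class name in the last two statements are hypothetical placeholders for a custom UDF.

```sql
-- UDF-like: one row in, one value out
select upper(name) from t2;

-- UDAF-like: many rows in, one value out per group
select name, count(*) as cnt from t2 group by name;

-- UDTF-like: one row in, many rows out (one row per array element)
select name, h from t2 lateral view explode(hobby) hb as h;

-- Registering a custom function for the current session.
-- The jar path and class name are assumptions for illustration only.
add jar /tmp/my-udfs.jar;
create temporary function my_lower as 'com.example.hive.udf.MyLower';
```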

IV. Performance Tuning

Data skew
Tuning parameters

set mapred.map.tasks = 20;
set mapred.reduce.tasks = 10;
set mapred.min.split.size = 16252928;          -- 15.5 MB
set mapred.max.split.size = 67108864;          -- 64 MB
set hive.merge.mapredfiles = true;             -- merge small files at the end of a map-reduce job
set hive.merge.smallfiles.avgsize = 67108864;  -- 64 MB
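
Beyond the task-count and split-size settings above, Hive has a few switches aimed specifically at skewed keys. The following is a sketch of settings that are commonly tried, not a recommendation from the original post; suitable values depend on the data and the Hive version.

```sql
set hive.map.aggr = true;             -- partial aggregation on the map side
set hive.groupby.skewindata = true;   -- two-stage group by: spread keys randomly first, then aggregate
set hive.optimize.skewjoin = true;    -- process heavily skewed join keys in a follow-up job
set hive.skewjoin.key = 100000;       -- row count per key above which a join key is treated as skewed
```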

Scenario examples

Cumulative calculation
Method 1:

select a.dt, dau,
  sum(dau) over(order by a.dt rows between UNBOUNDED PRECEDING AND CURRENT ROW) as mau1,
  sum(dau) over(partition by substr(dt, 1, 6) order by dt) as mau2
from 
(
select dt, count(distinct dvc_id) as dau
from tbl
group by dt
) a;
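
The window frame controls how far back the running total reaches. As a sketch of a variation on the query above, a frame of 6 PRECEDING sums the current row and the six rows before it. This counts rows rather than calendar days, so it only equals a 7-day total when tbl has no missing dates. The column alias dau_7row_sum is a name chosen here for illustration.

```sql
select a.dt, a.dau,
  -- current row plus the 6 preceding rows, ordered by date
  sum(a.dau) over (order by a.dt rows between 6 preceding and current row) as dau_7row_sum
from
(
select dt, count(distinct dvc_id) as dau
from tbl
group by dt
) a;
```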

Method 2:

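-- For each date, count the devices whose first active date is on or before it,
-- which gives the cumulative number of distinct devices.
select a.dt, max(a.dau) as dau, sum(case when a.dt >= b.min_dt then 1 else 0 end) as mau1
from 
(
select dt, count(distinct dvc_id) as dau
from tbl
group by dt
) a
cross join  -- no join condition: every date is paired with every device
(
select dvc_id, min(dt) as min_dt
from tbl
group by dvc_id
) b
group by a.dt;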
