HiveQL整理总结

最新推荐文章于 2022-10-27 12:55:55 发布

hyperminer

最新推荐文章于 2022-10-27 12:55:55 发布

阅读量2.2k

点赞数

分类专栏：大数据

本文链接：https://blog.csdn.net/zhangweijiqn/article/details/53200473

版权

大数据专栏收录该内容

13 篇文章 0 订阅

订阅专栏

下面是一些常用操作的语句，其他的可以去hive官网查看。

grant create to user aaa;//create不需要on database 或者table

grant drop on database test to user aaa;//drop需要指定database或者table

show grant user datajingdo_m on database dbtest_001;

show grant role ;

show grant role on database test;

show grant role on table tab;

revoke all on database dbtest_001 from user datajingdo_m;

//回收all的权限，前提是grant的是all的权限，all权限包括增删改查，不能使用all来回收create、alter、drop等DDL权限。

drop database db_test; //该语句只能用于空数据库（无表）

hive> DROP DATABASE IF EXISTS financials CASCADE; //有表时会递归删除表，再删除数据库

show current roles;查看当前角色

create table test4 like test3; – 除生命周期属性外，test3的其他属性（字段类型，分区类型等）均与test4完全一致

create table test5 as select * from test2;

– 这个操作会创建test5，但分区，生命周期信息不会被拷贝到目标表中。

Hive只支持批量删除和插入

insert into table aaa111 select count(*) from aaa111;

insert overwrite

insert partitions

truncate table tab_name;

show partitions tab;

bucket

select * from table where id is null :判断空使用is null

explain QUERY 显示查询的执行计划

select distinct name from tableAAA　优化：　select name from tableAAA group by name

已经每个month,每个sku的销售数量，求每个month最畅销的sku:

select skuId,month from sale_detail a left join (select max (sale_qtty) as sale_qtty_max,month from sale_detail group by month) b on b.sale_qtty_max=a. sale_qtty

分位数/中位数函数： percentile， percentile_approx

DELETE FROM tablename [WHERE expression]:

Deletes can only be performed on tables that support ACID. See Hive Transactions for details.(detelte只能执行在支持ACID的hive中)，

配置查看文档：

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

hive 简单查询不启用MapReduce Job 而启用Fetch task，在yarn applications看不到 applicationID：

如 select * from table limit 5; //这种查询不会启动MR任务，在yarn applications看不到applicationID

原因在于hive.fetch.task.conversion配置：

hive -e "set" | grep conversion

hive.fetch.task.conversion=more //默认配置为more，当select数据量小于threshold时，不进行MR，直接fetch

hive.fetch.task.conversion.threshold=1073741824 //1g大小，只有数据量超过1g，才启动 MR job

启用MapReduce Job是会消耗系统开销的。对于这个问题，从Hive0.10.0版本开始，对于简单的不需要聚合的类似SELECT <col> from <table> LIMIT n语句，不需要起MapReduce job，直接通过Fetch task获取数据，可以通过下面几种方法实现：

hive> set hive.fetch.task.conversion=more;

bin/hive --hiveconf hive.fetch.task.conversion=more

hive-site.xml里面加入以下配置：

        <property>
          <name>hive.fetch.task.conversion</name>
          <value>more</value>
          <description>
            Some select queries can be converted to single FETCH task 
            minimizing latency.Currently the query should be single 
            sourced not having any subquery and should not have
            any aggregations or distincts (which incurrs RS), 
            lateral views and joins.
            1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
            2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
          </description>
        </property>

applicationID

每个hive MR job提交后在yarn上会生成一个applicationID，以application_前缀开始，同时也会生成一个jobID，已job_开始。

hive的metastore和hiveserver2 由2台增加到4台

连接metastore的配置

thrift://a06-r12-i32-122.jd.local:9083,thrift://a06-r12-i32-119.jd.local:9083,thrift://a06-r12-i11-24.jd.local:9083,thrift://a06-r12-i11-25.jd.local:9083

直接连接hiveserver2 服务, 使用zk链接进行负载

<property>
<name>hive.metastore.uris</name>
<value>thrift://a06-r12-i32-122.jd.local:9083,thrift://a06-r12-i32-119.jd.local:9083,thrift://a06-r12-i11-24.jd.local:9083,thrift://a06-r12-i11-25.jd.local:9083</value>
</property>

regexp_replace使用

select user_log_acct, regexp_replace(male,'NaN',1-female),female from tmp.sim_user_gender2 where male='NaN'; //字段，值，替换的值

select regexp_replace(name,'智慧','zhihui') from tmp.zhuxian_rate_name_distinct limit 5;

group by使用

select user_log_acct,male,female,max(dt) as maxDT from sim_user_gender group by user_log_acct limit 3;

FAILED: SemanticException [Error 10025]: Line 1:21 Expression not in GROUP BY key 'male'

select的字段必须在group by里或者为字段的聚合

left join使用

create table tmp.sim_user_gender3 as

select user_log_acct,male,female from sim_user_gender a

left join

(select user_log_acct,max(dt) as maxDT from sim_user_gender group by user_log_acct) b

on (a.user_log_acct=b.user_log_acct and a.dt=b.maxDT);

报错：FAILED: SemanticException Column user_log_acct Found in more than One Tables/Subqueries 同一个字段不能出现在

ROW_NUMBER() OVER函数的基本用法(hiveQL语法)

Select *,row_number() OVER (PARTITION BY COL1 ORDER BY COL2 DESC)

hive> select user_log_acct,male,female,dt,row_number()over(partition by user_log_acct order by dt desc) from tmp.sim_user_gender limit 10;

*yx_100019279 0.5 0.5 2015-08-16 1

*yx_100019279 0.5 0.5 2015-08-15 2

*yx_100019279 0.5 0.5 2015-08-14 3

*yx_100019279 0.5 0.5 2015-08-13 4

*yx_100019279 0.5 0.5 2015-08-11 5

*yx_100019279 0.5 0.5 2015-08-10 6

*yx_100019279 0.5 0.5 2015-08-09 7

*yx_100019279 0.5 0.5 2015-08-08 8

*yx_100019279 0.5 0.5 2015-08-07 9

*yx_100019279 0.5 0.5 2015-08-06 10

表示根据user_log_acct分组，在分组内部根据dt降序排序，而此函数计算的值就表示每组内部排序后的顺序编号（组内连续的唯一的)

筛选出表temp_user_simularity中两个id对应的name

源表temp_user_simularity结构： (user1,user2,ratings)

源表dim_user2结构：(usrid,name)

目标表temp_sim_user_map结构：(user1Name,user2Name,user1ID,user2ID)

思路：user1Name字段要join一次，user2Name字段要join一次。

create table tmp.temp_sim_user_map as

select usr1.name as user1,usr2.name as user2,usr1.usrid as user1ID,usr2.usrid as user2ID

from tmp.temp_user_simularity a

left join tmp.dim_user2 usr1

on usr1.usrid=a.user1

left join tmp.dim_user2 usr2

on usr2.usrid=a.user2

hive 导出分区数据到txt中，再添加到其他hive集群：

(1) 导出：

./bin/hive -e "select * from testdb_001.employees" > data/ImportEmployees.txt

(2)创建表：

CREATE EXTERNAL TABLE `app.app_user_profile_1_da`(

`user_log_acct` string COMMENT 'user id',

`cate` string COMMENT 'cid3 and cid3 name',

`features` array<string> COMMENT 'feature tags')

PARTITIONED BY (

`dt` string,

`type` string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

TBLPROPERTIES (

'transient_lastDdlTime'='1453368848')

(3)添加分区：

alter table app_user_profile_1_da add partition (dt='2016-05-21',type='view-shop')

(4) load数据：

load data local inpath '/root/test/app_user_profile_1_da.txt' into table app_user_profile_1_da partition(dt='2016-05-24',type='view-shop');

不要忘记 distinct

select spu_name,count( distinct user_log_acct) from tmp.nonzhuxian_items group by spu_name

导出hive表到本地csv中（导出数据默认的分隔符是\t，如果string类型字段中包含\t或者逗号，要进行替换）：

hive -e "select regexp_replace(a,' \t+|,+','-'),b,countab,counta,confidenceab,supporta,lift from tmp.zhuxian_prefixt_ab" | awk -F '\t' '{print $1","$2","$3","$4","$5","$6","$7}'>>zhuxian_prefix_ab.csv

hive mapper数量:

主要的决定因素有： input的文件总个数，input的文件大小，集群设置的文件块大小(目前为128M, 可在hive中通过set dfs.block.size;命令查看到，该参数不能自定义修改)；

举例：

a) 假设input目录下有1个文件a,大小为780M,那么hadoop会将该文件a分隔成7个块（6个128m的块和1个12m的块），从而产生7个map数
b) 假设input目录下有3个文件a,b,c,大小分别为10m，20m，130m，那么hadoop会分隔成4个块（10m,20m,128m,2m）,从而产生4个map数
即，如果文件大于块大小(128m),那么会拆分，如果小于块大小，则把该文件当成一个块。

是不是保证每个map处理接近128m的文件块，就高枕无忧了？

答案也是不一定。比如有一个127m的文件，正常会用一个map去完成，但这个文件只有一个或者两个小字段，却有几千万的记录，

如果map处理的逻辑比较复杂，用一个map任务去做，肯定也比较耗时。

针对上面的问题3和4，我们需要采取两种方式来解决：即减少map数和增加map数；

Hive自己如何确定reduce数：

reduce个数的设定极大影响任务执行效率，不指定reduce个数的情况下，Hive会猜测确定一个reduce个数，基于以下两个设定：

hive.exec.reducers.bytes.per.reducer（每个reduce任务处理的数据量，默认为1000^3=1G）

hive.exec.reducers.max（每个任务最大的reduce数，默认为999）

计算reducer数的公式很简单N=min(参数2，总输入数据量/参数1)

即，如果reduce的输入（map的输出）总大小不超过1G,那么只会有一个reduce任务；

hive日期函数, year, month, day,week:

select distinct(dt) from gdm_m04_ord_det_sum where dt>'2016-01-01' and month(dt) in(3,5,7,9)

表app.userprofile是一个包含561168636条数据的大表，gdm_m04_ord_det_sum是包含1344346条数据的表，要将userprofile中包含后者中的id的数据取出来：

select a.* from app.userprofile a

join (select user_log_acct from gdm.gdm_m04_ord_det_sum where month(dt)=7 group by user_log_acct) b

on a.user_log_acct=b.user_log_acct and a.dt='2016-10-12'

实际上，下面要比上面快很多（先将子查询的数据写到表里，然后再join）：

val glasses_users = hiveContext.sql("select user_log_acct from gdm.gdm_m04_ord_det_sum where item_third_cate_name like '%太阳镜%' and month(dt)=7 group by user_log_acct")

glasses_users.write.saveAsTable("tmp2.glasses_users")

val glasses_users_portrait = hiveContext.sql("select a.* from app.app_ba_userprofile_prop_api_mkt_sub a join tmp2.glasses_users b on a.user_log_acct=b.user_log_acct and a.dt='2016-10-12'")

glasses_users_portrait.write.saveAsTable("tmp2.glasses_user_portrait")

    要用小表去join大表：

    select b.* from tmp2.glasses_users a join app.app_ba_userprofile_prop_api_mkt_sub b on a.user_log_acct=b.user_log_acct and b.dt='2016-10-12'

分段查询:

select sum(CASE when cvl_glob_fmargval<60 then 1 else 0 end),sum(CASE when cvl_glob_fmargval>=70 and cvl_glob_fmargval<80 then 1 else 0 end) from userProfiles

select count(if(cvl_glob_fmargval<60,1,null)),count(if(cvl_glob_fmargval>=70 and cvl_glob_fmargval<80,1,null)) from userProfiles

hyperminer

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
HiveQL整理总结

下面是一些常用操作的语句，其他的可以去hive官网查看。grant create to user aaa;//create不需要on database 或者tablegrant drop on database test to user aaa;//drop需要指定database或者tableshow grant user datajingdo_m
复制链接

扫一扫

专栏目录