Advanced Hive, Part 4

jhchengxuyuan

Article link: https://blog.csdn.net/jhchengxuyuan/article/details/101116439

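Hive field delimiters:

The default column delimiter in Hive is \001, also written as Ctrl+V Ctrl+A (^A), SOH, or \u0001 (the form usually used in Java output). Note that it is not a tab.
Commonly used delimiters:
tab
,
" "
|
\n
\001	^A (\u0001; note that it is neither \0001 nor \01)
\002	^B
\003	^C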
\003	^C

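Hive file storage formats:

File Formats and Compression (https://cwiki.apache.org/confluence/display/Hive/FileFormats): RCFile, Avro, ORC, Parquet; Compression, LZO
Note: tables in these formats cannot be populated by running LOAD DATA on a plain text file, because LOAD DATA only moves the file and does not convert it. Fill them with INSERT ... SELECT from another table instead.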

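Properties of each storage format:

Hive's default data file format is textfile.

textfile: plain text storage, uncompressed by default. It takes up more space and queries are slow. (Fine for small amounts of data.)

sequencefile:
A binary storage format available to Hive users, with built-in compression support. Data cannot be loaded into it with LOAD.

rcfile:
A hybrid row-columnar format. Hive tries to keep nearby rows and their column blocks stored together. It is also compressed, and query efficiency is fairly high.

orc:
An optimized version of rcfile.

parquet:
A typical columnar format with built-in compression; queries are fast (when reading by column).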

<property>
    <name>hive.default.fileformat</name>
    <value>TextFile</value>
    <description>
      Expects one of [textfile, sequencefile, rcfile, orc].
      Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
    </description>
</property>

textfile: can be compressed by combining it with the compression properties below.
Map output compression:
mapreduce.map.output.compress=false
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

Reduce output compression (final job output):
Available codecs: snappy, bzip2, gzip, DefaultCodec
mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.type=NONE/RECORD/BLOCK (default: RECORD)
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

Hive compression settings:
set hive.exec.compress.output=false;
set hive.exec.compress.intermediate=false;
set hive.intermediate.compression.codec=
set hive.intermediate.compression.type=


1:
textfile:
CREATE TABLE `u4`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
insert into table u4
select * from u2;

2:
sequencefile:
CREATE TABLE `u4`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as sequencefile;

3:
rcfile:
CREATE TABLE `u5`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as rcfile;


4:
orc:
CREATE TABLE `u6`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as orc;

5:
parquet:
CREATE TABLE `u7`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as PARQUET;
insert into table u7
select * from u2;

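Custom format:
Data:
Source data file seq_yd:
aGVsbG8gemhhbmdoYW8=
aGVsbG8gZmVpZmVpLGdvb2QgZ29vZCBzdHVkeSxkYXkgZGF5IHVw
The content of seq_yd is base64-encoded; decoded, the data is:
hello zhanghao
hello feifei,good good study,day day up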

create table cus(str STRING)  
stored as  
inputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'  
outputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat'; 

LOAD DATA LOCAL INPATH '/home/hivedata/cus' INTO TABLE cus;
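
To check what the base64 input format should hand back to a query, the two sample lines can be decoded outside Hive. This is a small Python sketch using only the standard library; it only mirrors the decoding step and is not how Base64TextInputFormat itself is implemented.

```python
import base64

# The two base64-encoded lines of the seq_yd sample file.
encoded_lines = [
    "aGVsbG8gemhhbmdoYW8=",
    "aGVsbG8gZmVpZmVpLGdvb2QgZ29vZCBzdHVkeSxkYXkgZGF5IHVw",
]

def decode_line(line):
    """Decode one base64 line into readable text."""
    return base64.b64decode(line).decode("utf-8")

decoded_lines = [decode_line(line) for line in encoded_lines]
for text in decoded_lines:
    print(text)
```

A select on the cus table should show the same decoded strings.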

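Hive views

Summary: a view is not a table. It is a virtual table that depends on underlying tables.

A Hive view can be thought of simply as a logical table.
The Hive version used here supports only logical views, not materialized views (materialized views were added in Hive 3.0).

Why views are useful in Hive:
1. Expose only part of the data (private data stays hidden).
2. Simplify complex queries.

Creating a view (CVAS, create view as select):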

create view if not exists tab_v1 
as 
select id from u2;

Inspecting a view:

show tables;

show create table tab_v1;

-----------------------------------------------------------------------------
show create table tab2;
OK
CREATE VIEW `tab2` AS select `lg`.`user_id` from `ali`.`lg`
Time taken: 0.14 seconds, Fetched: 1 row(s)
-----------------------------------------------------------------------------

desc tab2;

-----------------------------------------------------------------------------
OK
user_id             	string 
-----------------------------------------------------------------------------
Can a view be cloned? (Not supported in hive-1.2.1 for now.)
create view tab_v2 like tab_v1; -- not allowed

Renaming a view and similar operations work much as they do for tables:
alter view tab1 rename to tab2;


Because a view stores no data of its own, changes to the underlying table show up in the view the next time it is queried.

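Dropping a view:
drop view if exists tab_v2;    -- correct
drop table if exists tab_v1;   -- not supported for views

Notes:
1. Avoid dropping the table behind a view and then querying the view; the query will fail.
2. Data cannot be loaded into a view with insert into or load.
3. A view is read-only: its structure and table-related properties cannot be modified.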

Hive logs:

Hive system log:
Default directory: /tmp/${user.name}
For example, when Hive runs as root, the log is /tmp/root/hive.log; many other files also sit in the same directory.
The location is set explicitly in hive-log4j2.properties under Hive's conf directory:
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log
Hive query log:
<name>hive.querylog.location</name>
<value>${system:java.io.tmpdir}/${system:user.name}</value>
<description>Location of Hive run time structured log file</description>

Ways to run Hive:

1. CLI: the command line (hive/beeline). To connect with beeline, hiveserver2 must be started first:

hive --service hiveserver2 &
hiveserver2 &
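Common pitfall: a frequent error reports that user root is not allowed to impersonate another user. The fix is to change a permission setting in Hadoop's core-site.xml.
Add:
<property>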
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
The two properties above allow root to act as a proxy user from any host and on behalf of users in any group.
------------------------------------------------------
 <property>
     <name>hadoop.proxyuser.root.hosts</name>
     <value>192.168.80.10/16</value>
   </property>
   Written this way, the property only allows connections from this IP range. It is best to configure both properties.
-----------------------------------------------------


Open question: can beeline be set up to require a username and password and to enforce user permissions, and if so, how?

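2. Running through a JDBC connection from Java

More cumbersome and not used very often.

3. hive -f hql_file

Flexible; the file can contain several statements that run in one go.

4. hive -e query

hive -e 'select date_add(current_date, 1)'   # a built-in Hive function can be run through -e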

#!/bin/bash

u5_query="
select 
* 
from 
qf24.u5
"

hive -e "$u5_query"
# hive -e "$u6_query"   # define u6_query the same way before enabling this line

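Setting properties:

1. hive-site.xml (global: shared settings and anything that must be configured before startup, such as the metastore database and logging)
2. Command-line arguments: hive --hiveconf a=10 -e ''
3. The set command inside the CLI:
set ...
select ...;

Priority increases in that order: a later method overrides an earlier one.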
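
The override order can be pictured as three layered dictionaries. The following Python sketch uses made-up property values at each level; it only models the precedence described above and does not read any real Hive configuration.

```python
from collections import ChainMap

# Hypothetical values at each of the three levels.
hive_site = {"hive.exec.parallel": "false", "hive.exec.reducers.max": "1009"}  # hive-site.xml
hiveconf_args = {"hive.exec.parallel": "true"}      # hive --hiveconf ...
session_set = {"hive.exec.reducers.max": "200"}     # set ... inside the CLI

# ChainMap searches its maps from left to right, so the highest priority goes first.
effective = ChainMap(session_set, hiveconf_args, hive_site)

for key in sorted(effective):
    print(key, "=", effective[key])
```

Here the --hiveconf value wins over hive-site.xml, and the set value wins over both.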

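Hive JDBC:

1. Close the resources in the order ResultSet, PreparedStatement, Connection (rs, ps, conn); otherwise a SASL error is reported.
2. The connection username and password must be filled in. If none are configured, root/root can be used; otherwise a permission error is reported.
3. Start hiveserver2 before running.

Kylin: speeds up Hive queries (it precomputes query results and stores them in HBase).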

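Optimization

1. Consider the environment (hardware, servers, configuration)
2. The business logic (how the statistical metrics are implemented)
3. Code or configuration properties (the properties in hive-default.xml)

1. View the execution plan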
explain extended
select 
id id,
count(id) cnt
from u4 
group by id;

    > explain extended
    > select
    > id id,
    > count(id) cnt
    > from u4
    > group by id;



ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            u4
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_TABLE_OR_COL
               id
            id
         TOK_SELEXPR
            TOK_FUNCTION
               count
               TOK_TABLE_OR_COL
                  id
            cnt
      TOK_GROUPBY
         TOK_TABLE_OR_COL
            id


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u4
            Statistics: Num rows: 4 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Select Operator
              expressions: id (type: int)
              outputColumnNames: id
              Statistics: Num rows: 4 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(id)
                keys: id (type: int)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 4 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 4 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  tag: -1
                  value expressions: _col1 (type: bigint)
                  auto parallelism: false
      Path -> Alias:
        hdfs://hadoop01:9000/user/hive/warehouse/qf24.db/u4 [u4]
      Path -> Partition:
        hdfs://hadoop01:9000/user/hive/warehouse/qf24.db/u4
          Partition
            base file name: u4
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns id,name
              columns.comments
              columns.types int:string
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop01:9000/user/hive/warehouse/qf24.db/u4
              name qf24.u4
              numFiles 1
              numRows 4
              rawDataSize 16
              serialization.ddl struct u4 { i32 id, string name}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 28
              transient_lastDdlTime 1568602270
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns id,name
                columns.comments
                columns.types int:string
                field.delim ,
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop01:9000/user/hive/warehouse/qf24.db/u4
                name qf24.u4
                numFiles 1
                numRows 4
                rawDataSize 16
                serialization.ddl struct u4 { i32 id, string name}
                serialization.format ,
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 28
                transient_lastDdlTime 1568602270
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: qf24.u4
            name: qf24.u4
      Truncated Path -> Alias:
        /qf24.db/u4 [u4]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://hadoop01:9000/tmp/hive/root/5cc41b9b-d7a2-4416-8945-a17f6b462de7/hive_2019-09-16_15-26-40_893_5149303040930924164-1/-mr-10000/.hive-staging_hive_2019-09-16_15-26-40_893_5149303040930924164-1/-ext-10001
            NumFilesPerFileSink: 1
            Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            Stats Publishing Key Prefix: hdfs://hadoop01:9000/tmp/hive/root/5cc41b9b-d7a2-4416-8945-a17f6b462de7/hive_2019-09-16_15-26-40_893_5149303040930924164-1/-mr-10000/.hive-staging_hive_2019-09-16_15-26-40_893_5149303040930924164-1/-ext-10001/
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                properties:
                  columns _col0,_col1
                  columns.types int:bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
        
        
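The plan shows the dependencies between stages, the number of stages, and the execution order; rewriting the HQL statement can change that order.
As a rule, the fewer stages the better and the simpler the dependencies the better. One stage is one MapReduce job or part of one.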
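
For longer plans it can help to pull the stage order out of the text automatically. The Python sketch below parses only the two line forms that appear in the STAGE DEPENDENCIES section above; other forms that real plans may contain are not handled, and the function names are made up for this example.

```python
import re

# The STAGE DEPENDENCIES section copied from the plan above.
plan_text = """\
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
"""

def parse_dependencies(text):
    """Map each stage name to the list of stages it depends on."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        root = re.match(r"^(Stage-\d+) is a root stage", line)
        child = re.match(r"^(Stage-\d+) depends on stages: (.+)$", line)
        if root:
            deps[root.group(1)] = []
        elif child:
            deps[child.group(1)] = re.findall(r"Stage-\d+", child.group(2))
    return deps

def execution_order(deps):
    """Return the stages so that every stage comes after its dependencies."""
    order, done = [], set()

    def visit(stage):
        if stage in done:
            return
        for parent in deps.get(stage, []):
            visit(parent)
        done.add(stage)
        order.append(stage)

    for stage in sorted(deps):
        visit(stage)
    return order

deps = parse_dependencies(plan_text)
order = execution_order(deps)
print(order)
```

For the plan above this yields Stage-1 first and Stage-0 second, which matches the MapReduce stage followed by the fetch stage.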


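Data skew:
Data skew is what happens when unevenly distributed keys push most of the data toward a few tasks.
Common causes:
The data itself is skewed
join statements easily cause it
count(distinct col) very easily causes it
group by can also cause it

Symptom:
The job gets stuck at a certain progress point in one reduce task.


Solutions:
1. Find the key that causes the skew, then work around it in HQL (check the logs for the task that failed ---> find the join, group by, or count(distinct col) columns in that task ---> sample the number of rows per key ---> decide whether the key is skewed). Handle that key separately, then union all its result with the normal result.

2. Add a random number to the skewed key (the random part must not cause a second skew and must not change the business result).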

 select 
 t2.*
 from t_user2 t2
 join t_user2 t1
 on t2.id = t1.id
 ;
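
The effect of adding a random number to a hot key can be simulated without a cluster. This Python sketch uses made-up data, a CRC32 hash as a stand-in for the partitioner, and hypothetical names such as SALT_BUCKETS; it illustrates the idea of salting followed by a two-step count and is not Hive's own implementation.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 4
SALT_BUCKETS = 8   # hypothetical number of random suffixes for the hot key

def reducer_for(key):
    """Stand-in for a hash partitioner: stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def salt(key, rng):
    """Append a random suffix so one hot key becomes up to SALT_BUCKETS distinct keys."""
    return f"{key}_{rng.randrange(SALT_BUCKETS)}"

def unsalt(salted_key):
    """Strip the suffix again; needed for the second aggregation step."""
    return salted_key.rsplit("_", 1)[0]

rng = random.Random(0)
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]   # made-up skewed data

plain_load = Counter(reducer_for(k) for k in keys)
salted_keys = [salt(k, rng) if k == "hot" else k for k in keys]
salted_load = Counter(reducer_for(k) for k in salted_keys)

# Two-step count: partial counts per salted key, then merge by the original key.
partial = Counter(salted_keys)
final = Counter()
for salted_key, cnt in partial.items():
    original = unsalt(salted_key) if salted_key.startswith("hot_") else salted_key
    final[original] += cnt

print("max rows on one reducer without salt:", max(plain_load.values()))
print("max rows on one reducer with salt:   ", max(salted_load.values()))
```

The final counts are the same as without salting, which is the "must not change the business result" condition from item 2.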

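3. Set the skew-related properties
hive.map.aggr=true;
hive.groupby.skewindata=false;  (enabling it is recommended)
hive.optimize.skewjoin=false;
For the other skewjoin- and skew-related properties, see the join section below.

4. If none of the above works, go back to the business logic and restructure the query flow.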



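2. join
In Hive, always let the small table (result set) drive the large table (result set).
In the Hive version used here, the on condition can only contain equality comparisons combined with and.
Check whether Hive is configured to convert common joins into map-side joins (hive.auto.convert.join), and check the small-table file size threshold for map joins (hive.mapjoin.smalltable.filesize).
Watch out for Hive's skew join settings: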
hive.optimize.skewjoin=false
hive.skewjoin.key=100000
hive.skewjoin.mapjoin.map.tasks=10000

3. limit optimization:
hive.limit.row.max.size=100000
hive.limit.optimize.limit.file=10
hive.limit.optimize.enable=false  (enabling it is recommended when limit is used a lot)
hive.limit.optimize.fetch.max=50000

4. Local mode
hive.exec.mode.local.auto=false (enabling it is recommended)
hive.exec.mode.local.auto.inputbytes.max=134217728  (128M)
hive.exec.mode.local.auto.input.files.max=4
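
How the three local-mode settings interact can be sketched as a simple check. The Python function below is a rough model built on assumptions: the condition that a job needs at most one reducer comes from the Hive documentation as far as I recall it, and the exact boundary handling (less than versus less than or equal) is a guess.

```python
def can_run_locally(total_input_bytes, num_input_files, num_reducers,
                    auto=True, max_bytes=134217728, max_files=4):
    """Rough check of whether a job is small enough for automatic local mode."""
    if not auto:                      # hive.exec.mode.local.auto=false
        return False
    return (total_input_bytes <= max_bytes      # ...inputbytes.max
            and num_input_files <= max_files    # ...input.files.max
            and num_reducers <= 1)              # assumed: 0 or 1 reducers

# The u4 table from the plan above: totalSize 28, numFiles 1.
print(can_run_locally(28, 1, 1))
print(can_run_locally(200 * 1024 * 1024, 1, 1))
```

With the default thresholds, a tiny table such as u4 qualifies, while a 200 MB input does not.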

5. Parallel execution:
hive.exec.parallel=false   (enabling it is recommended)
hive.exec.parallel.thread.number=8

6. Strict mode
hive.mapred.mode=nonstrict

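7. Number of mappers and reducers:
More mappers and reducers is not always better, and neither is fewer. Pick what fits the job.

Merge small files before processing (set the input class to CombineTextInputFormat)
Merging small files through configuration: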
mapred.max.split.size=256000000   
mapred.min.split.size.per.node=1
mapred.min.split.size.per.rack=1
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
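
To see why combining small files reduces the number of map tasks, here is a simplified Python sketch. It packs files greedily into splits no larger than mapred.max.split.size and ignores node and rack locality, so it is only an approximation of what a combining input format does; the file sizes are made up.

```python
def combine_small_files(file_sizes, max_split_size=256000000):
    """Greedily pack files into splits of at most max_split_size bytes each.

    A single file larger than max_split_size ends up alone in its own split
    here; real input formats would cut it into several splits instead.
    """
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

small_files = [10_000_000] * 100          # hypothetical: 100 files of 10 MB each
splits = combine_small_files(small_files)
print(len(small_files), "files ->", len(splits), "map tasks")
```

Without combining, each of the 100 small files would get its own map task; with the setting above they fit into a handful of splits.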

Setting it manually:
set mapred.map.tasks=2;

Number of reducers (decided automatically or set manually):
mapred.reduce.tasks=-1
hive.exec.reducers.max=1009
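
When mapred.reduce.tasks is -1, Hive estimates the reducer count from the input size. The Python sketch below models that estimate under one assumption not stated in this article: a per-reducer size setting, hive.exec.reducers.bytes.per.reducer, taken here as 256000000 bytes.

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=256000000,   # assumed hive.exec.reducers.bytes.per.reducer
                      reducers_max=1009,             # hive.exec.reducers.max
                      mapred_reduce_tasks=-1):       # mapred.reduce.tasks
    """Estimate the number of reducers for a job."""
    if mapred_reduce_tasks > 0:
        return mapred_reduce_tasks          # a manual setting wins
    estimated = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(reducers_max, estimated))

print(estimate_reducers(10 * 256000000))
print(estimate_reducers(10**12, mapred_reduce_tasks=5))
```

The cap from hive.exec.reducers.max only matters for very large inputs; for small tables the estimate is a single reducer.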

8. Configure JVM reuse:
mapreduce.job.jvm.numtasks=1   ### current property name

mapred.job.reuse.jvm.num.tasks=1;   ### older name of the same property



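10. Indexes are one form of Hive optimization (although indexes do not work well):

11. Partitioning is itself a Hive optimization:

12. Number of jobs:
Usually one query produces one job. A single job can cover a subquery, a join, a group by, a limit, and similar operations.

1 job: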
select
t1.*
from t_user1 t1
left join t_user2 t2
on t1.id = t2.id
where t2.id is null
;

The following produces 3 jobs:
select
t1.*
from t_user1 t1
where id in (
select
t2.id
from t_user2 t2
limit 1
)
;

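13. analyze:
Official reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev

Analyze (analyzing a table, also called computing statistics) is a built-in Hive operation that collects metadata about a table. It can greatly improve query times on the table, because it gathers the row count, file count, and file size (in bytes) of the data that makes up the table and hands them to the query planner before execution.

Analyze syntax for an existing table: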
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]       -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];

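Example 1 (a specific partition):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS;
Collects the basic statistics (such as row count, file count, and size) of the bdp_day=20190701 partition and stores them in the Hive Metastore for query optimization.

Example 2 (all columns):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS FOR COLUMNS;
Collects column-level statistics for every column in the bdp_day=20190701 partition. This is a finer-grained analysis: for each column it records information such as the number of distinct values, the number of NULL values, the average column size, and, for numeric columns, the minimum and maximum values.

Example 3 (specific columns):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS FOR COLUMNS snum,dept;

Example 4:
ANALYZE TABLE dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS NOSCAN;
Collects statistics for the given partition without scanning the files, so only the number of files and the physical size in bytes are gathered.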

Checking the results after analysis.
Example 1:
DESCRIBE EXTENDED dw_employee_hive partition(bdp_day=20190701);

Output:
...parameters:{totalSize=10202043, numRows=33102, rawDataSize=430326, ...

Example 2:
desc formatted dw_employee_hive partition(bdp_day=20190701) name;

Result:
# col_name  data_type   min max num_nulls   distinct_count  avg_col_len max_col_len num_trues   num_falses  comment
name string 0 37199 4.0 4 from deserializer


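Notes:
When analyzing a partitioned table, you normally specify the partition. To analyze the whole table, use partition(bdp_day) without a value.
For query results after this optimization, see: https://www.cnblogs.com/lunatic-cto/p/10988342.html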

MySQL stored procedures (for background only; not covered in detail):

Requirement 1: insert rows into a table in a loop:
id, name ('' + i), age (random number)

CREATE TABLE IF NOT EXISTS USER(
id BIGINT(11) NOT NULL AUTO_INCREMENT,
NAME VARCHAR(45) DEFAULT NULL,
age INT(1) DEFAULT 1,
PRIMARY KEY(id)
)
ENGINE=INNODB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
Requirement 2:
User
Id	name	age


User-info
Uid	birthday sex .....


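Declaring variables in MySQL:
Keyword: declare
The general form for ordinary types is:
DECLARE variable_name type(length) [DEFAULT default_value];
For example: declare name varchar(45) default '';

Assignment in MySQL:
Keyword: set
For example: set i = 100;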

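The usual conditionals in MySQL are if...end if, if...else..., and if...elseif...else....
Their formats are as follows (conditions can be combined with and or or):
IF condition THEN
statements; -- end every statement with a semicolon
END IF;

IF condition THEN
statements; -- end every statement with a semicolon
ELSE
statements; -- end every statement with a semicolon
END IF;

IF condition THEN
statements; -- end every statement with a semicolon
ELSEIF condition THEN -- note: ELSEIF must be written as one word
statements; -- end every statement with a semicolon
ELSE
statements; -- end every statement with a semicolon
END IF;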





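MySQL stored procedures have three standard loop constructs: WHILE, REPEAT, and LOOP. There is also a non-standard one, GOTO, which is best avoided because it easily makes a program confusing. The most common are while and repeat.
Their formats are as follows:
WHILE condition DO
statements; -- end every statement with a semicolon
END WHILE;

REPEAT
statements; -- end every statement with a semicolon
UNTIL condition -- no semicolon here
END REPEAT;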

Copying data from one table into another:
INSERT into test.`USER`(`name`,age) 
SELECT
`name`,
age
FROM stu
;

Stored procedure:
BEGIN

DECLARE _id INT(11) DEFAULT 0;
DECLARE _nm VARCHAR(22) DEFAULT '';

# loop flag
DECLARE _done INT DEFAULT 0;

DECLARE stu_set CURSOR FOR
SELECT
s.id id,
s.`name` nm
FROM stu1 s
;


DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET _done = 1; # no-more-rows handler: marks the end of the loop

# iterate over the cursor
OPEN stu_set;
    /* loop body */
    REPEAT
        FETCH stu_set INTO _id, _nm;
        IF NOT _done THEN
            INSERT INTO test.`USER`(`NAME`, age) VALUES (_nm, _id);
        END IF;
    UNTIL _done END REPEAT; # leave the loop when _done = 1
CLOSE stu_set;


END

Stored procedures in Hive (HPL/SQL)

CREATE PROCEDURE set_message(IN name STRING, OUT result STRING)
BEGIN
 SET result = 'Hello, ' || name || '!';
END;
 
-- Now call the procedure and print the results
DECLARE str STRING;
CALL set_message('world', str);
PRINT str;
 
Result:
--
Hello, world!


Examples:
use ali;
create procedure select_u5()
begin
select * from ali.lg;
end;


create function hello(text string)
returns string
BEGIN
RETURN 'Hello,' || text || '!';
END;

create procedure select_u53()
begin
FOR item IN(
SELECT user_id,ds FROM ali.read limit 2
)
loop
        println item.user_id || '|' || item.ds || '|' || hello(item.ds);
end loop;
end;


create procedure pc()
begin
DECLARE tabname VARCHAR DEFAULT 'ali.pay';
DECLARE user_id INT;
DECLARE cur CURSOR FOR 'SELECT user_id FROM ' || tabname;
OPEN cur;
FETCH cur INTO user_id;
WHILE SQLCODE=0 THEN
  PRINT user_id;
  FETCH cur INTO user_id;
END WHILE;
CLOSE cur;
end;


Test calls:
include /usr/local/sc/fp.hql  -- load the procedures and functions defined in this file

call select_u5();  -- call a procedure defined in the included file

call select_u53();

call hello("text");

call pc();