Hive: Supplementary Notes

流浮影

Categories: hive, hadoop. Tags: hive, hadoop

Article link: https://blog.csdn.net/weixin_44273391/article/details/101107131

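Hive delimiters

Hive's default field (column) delimiter is \001 (Ctrl-A), not tab.
Commonly used delimiters:
tab
,
" "
|
\n
\001	^A (\u0001; note that it is neither \0001 nor \01)
\002	^B (default collection item delimiter)
\003	^C (default map key delimiter)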
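
To make the Ctrl-A delimiter concrete, here is a minimal sketch of splitting one row of a default-format text file into columns. The class and method names (`CtrlADelimiterSketch`, `splitRow`) and the sample row are hypothetical and exist only for this illustration; this is not how Hive itself parses rows.

```java
import java.util.Arrays;
import java.util.List;

class CtrlADelimiterSketch {
    // Hive's default field delimiter: Ctrl-A, code point 1 (hypothetical constant name).
    static final char FIELD_DELIM = '\u0001';

    /** Splits one text row into its column values. */
    static List<String> splitRow(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        return Arrays.asList(line.split(String.valueOf(FIELD_DELIM), -1));
    }

    public static void main(String[] args) {
        // Made-up sample row: id, name, and an empty third column.
        String row = "1" + FIELD_DELIM + "zhangsan" + FIELD_DELIM;
        System.out.println(splitRow(row).size()); // three columns; the last one is empty

        // In Java source, "\0001" is not Ctrl-A: it is the octal escape \000
        // followed by the digit 1, so the string has two characters.
        System.out.println("\0001".length());
    }
}
```

The second print statement shows, for Java string literals at least, why the note above warns against writing \0001.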

posexplode:

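Hive file storage formats:

Hive's default storage format for data files is textfile.
textfile: plain text files, uncompressed by default. They take up more space and are slow to query (acceptable for small data sets).
sequencefile: a binary key/value format from Hadoop that supports record-level and block-level compression. A plain text file cannot be loaded into such a table with LOAD DATA.
rcfile: hybrid row-columnar storage. Rows are split into row groups and each group is stored column by column, so nearby rows and columns end up close together. It supports compression and gives fairly good query performance.
orc: an optimized successor to rcfile.
parquet: a typical columnar format with built-in compression; queries that read only some of the columns are fast.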
<name>hive.default.fileformat</name>
    <value>TextFile</value>
    <description>
      Expects one of [textfile, sequencefile, rcfile, orc].
      Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
    </description> (hive-site.xml.tem)

create table if not exists text1(
uid int,
uname string
)
row format delimited fields terminated by ' '
;

load data local inpath '/hivedata/seq1' into table text1;

Creating a sequencefile table:

create table if not exists seq1(
uid int,
uname string
)
row format delimited fields terminated by ','
stored as sequencefile
;

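This approach does not work, because LOAD DATA only moves the file into the table directory and does not convert it to sequencefile:
load data local inpath '/hivedata/seq1' into table seq1;

Use INSERT ... SELECT instead: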
insert into table seq1
select uid,uname from text1;

select * from seq1;
OK
1       ajskdj
2       张三
3       爱上达拉斯
4       kalsdhlkas
5       aklshdklas
5       爱睡觉的
5       kalsdfk

hdfs dfs -cat /user/hive/warehouse/qf_test.db/seq1/000000_0;
19/09/16 19:26:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text
2,张三3,爱上达拉斯
                       4,kalsdhlkas
                                     5,aklshdklas5,爱睡觉的  5,+a+_df+[_[_--+@had----01 ~]#
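
The readable part of the dump above is the SequenceFile header: the magic bytes SEQ, a version byte, then the key and value class names, each preceded by a length byte. The following is a rough sketch of reading such a header. It assumes class names shorter than 128 bytes (so the length fits in one byte), and it builds its own sample bytes rather than reading from HDFS. `SeqHeaderSketch` and its methods are hypothetical names, not Hadoop APIs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SeqHeaderSketch {
    /** Reads a string stored as a one-byte length followed by that many bytes. */
    static String readShortString(ByteBuffer buf) {
        int len = buf.get() & 0xff; // sketch assumption: the name is shorter than 128 bytes
        byte[] bytes = new byte[len];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns {version, keyClass, valueClass} from the start of a SequenceFile. */
    static String[] parseHeader(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        if (buf.get() != 'S' || buf.get() != 'E' || buf.get() != 'Q') {
            throw new IllegalArgumentException("missing SEQ magic");
        }
        String version = String.valueOf(buf.get() & 0xff);
        String keyClass = readShortString(buf);
        String valueClass = readShortString(buf);
        return new String[] { version, keyClass, valueClass };
    }

    /** Builds sample header bytes with the same class names as the dump (version 6 assumed). */
    static byte[] sampleHeader() {
        String key = "org.apache.hadoop.io.BytesWritable";
        String value = "org.apache.hadoop.io.Text";
        ByteBuffer buf = ByteBuffer.allocate(4 + 1 + key.length() + 1 + value.length());
        buf.put((byte) 'S').put((byte) 'E').put((byte) 'Q').put((byte) 6);
        buf.put((byte) key.length()).put(key.getBytes(StandardCharsets.US_ASCII));
        buf.put((byte) value.length()).put(value.getBytes(StandardCharsets.US_ASCII));
        return buf.array();
    }

    public static void main(String[] args) {
        String[] header = parseHeader(sampleHeader());
        System.out.println("version=" + header[0]);
        System.out.println("key=" + header[1]);
        System.out.println("value=" + header[2]);
        // The key class name is 34 characters long, and byte 34 is the '"' character,
        // which is why a double quote shows up right after SEQ in the dump.
        System.out.println((char) header[1].length());
    }
}
```

The rows that follow the header are the table data serialized with the ',' delimiter, mixed with binary length and sync bytes, which is why the terminal output looks garbled.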

rcfile

create table if not exists rc1(
uid int,
uname string
)
row format delimited fields terminated by ' '
stored as rcfile
;

This approach does not work:
load data local inpath '/hivedata/seq1' into table rc1;

Use INSERT INTO instead:
insert into table rc1
select uid,uname from text1;

create table seq2(
movie string,
rate string,
times string,
uid string
)
row format delimited fields terminated by ','
stored as sequencefile
;

create table if not exists rc2(
movie string,
rate string,
times string,
uid string
)
row format delimited fields terminated by ','
stored as rcfile
;

from t_movieRate
insert into table seq2
select * 
insert into table rc2
select *
;

In terms of overall efficiency, DefaultCodec combined with rcfile works best.

Compression

Map output compression:
mapreduce.map.output.compress=false
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

Reduce (final job) output compression:
Available codecs: snappy, bzip2, gzip, DefaultCodec
mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.type=NONE/RECORD/BLOCK (default: RECORD)
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
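
DefaultCodec is Hadoop's zlib (DEFLATE) codec. To give a feel for what enabling it does to delimited text output, the sketch below compresses some made-up comma-separated rows with the JDK's own `java.util.zip` classes and decompresses them again. It does not use Hadoop's codec classes, and `DeflateSketch`, its methods, and the sample rows are hypothetical names and data chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateSketch {
    /** Compresses bytes with zlib/DEFLATE at the default level. */
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses compress(). */
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported zlib data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Made-up rows in the "id,name" layout used by the tables in this post. */
    static byte[] sampleRows() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append(i).append(",user_").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleRows();
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
    }
}
```

Repetitive delimited text like this shrinks considerably, which is the motivation for turning on output compression for intermediate and final job output.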

Hive compression settings:
set hive.exec.compress.output=false;
set hive.exec.compress.intermediate=false;
set hive.intermediate.compression.codec=
set hive.intermediate.compression.type=

1. textfile:
CREATE TABLE `u4`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;

set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
insert into table u4
select * from u2;

2. sequencefile:
CREATE TABLE `u4`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as sequencefile;

3. rcfile:
CREATE TABLE `u5`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as rcfile;


4. orc:
CREATE TABLE `u6`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as orc;

5. parquet:
CREATE TABLE `u7`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as PARQUET;
insert into table u7
select * from u2;

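None of the non-text formats can be populated by loading a plain text file with LOAD. LOAD works for the default textfile format, and it also works when the data file is already stored in the table's format, that is, when the file format matches the inputformat declared for the table (stored as inputformat).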

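Custom storage formats

Custom format example:
Data:
hello zhanghao
hello feifei,good good study,day day up
Contents of the seq_yd data file:
aGVsbG8gemhhbmdoYW8=
aGVsbG8gZmVpZmVpLGdvb2QgZ29vZCBzdHVkeSxkYXkgZGF5IHVw
The seq_yd file holds the base64-encoded form of the data; decoding it gives back the two lines shown under "Data".
Online base64 encoder/decoder: http://tool.oschina.net/encrypt?type=3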

create table cus(str STRING)  
stored as  
inputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'  
outputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat'; 

LOAD DATA LOCAL INPATH '/home/hivedata/cus' INTO TABLE cus;
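
The relationship between the two encoded lines and the plain data can be checked locally instead of with an online tool. This is a small sketch using the JDK's `java.util.Base64`; the class and method names are hypothetical, and it only illustrates the encoding, not the Hive contrib input/output format classes themselves.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64LineSketch {
    /** Decodes one base64 line back to text. */
    static String decodeLine(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Encodes one line of text as base64. */
    static String encodeLine(String plain) {
        return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The two encoded lines from the seq_yd file shown above.
        System.out.println(decodeLine("aGVsbG8gemhhbmdoYW8="));
        System.out.println(decodeLine("aGVsbG8gZmVpZmVpLGdvb2QgZ29vZCBzdHVkeSxkYXkgZGF5IHVw"));
    }
}
```

Running it prints the two plain-text lines listed under "Data", one per encoded line.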

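Hive views (logical views):

Creating a view (CVAS, create view as select):
A Hive view can be thought of as a logical table.
Hive 1.2.1, the version used here, supports only logical views, not materialized views.
What views are for:
1. Exposing only part of the data (private data stays hidden).
2. Simplifying complex queries.
Creating a view: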
create view if not exists tab_v1 
as 
select uid from tab1 where uid < 10;

Inspecting a view:

show tables;
show create table tab_v1;
desc tab_v1;
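Can a view be cloned? (Not supported in hive-1.2.1 for now):
create view tab_v2 like tab_v1;
It is unclear whether a view's structure can be modified; no working statement was found (editing the metastore directly does work).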

ALTER VIEW view_name SET TBLPROPERTIES table_properties  
table_properties:  
  : ("TBL_NAME" = "tbv1")  
;

create view if not exists v1 as select * from text1;

ALTER VIEW v1 SET TBLPROPERTIES("TBL_NAME" = "v11");  -- ??? still has problems

Dropping a view:
drop view if exists tab_v2;

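Notes:
1. Never drop the table a view is based on and then query the view.
2. Data cannot be loaded into a view with insert into or load.
3. Views are read-only; their structure and table-related properties cannot be modified.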

Hive logs:

Hive system log:
Default directory: /tmp/${user.name}
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log
Hive query log:
<name>hive.querylog.location</name>
<value>${system:java.io.tmpdir}/${system:user.name}</value>
<description>Location of Hive run time structured log file</description>
set hive.querylog.location
    > ;
hive.querylog.location=/tmp/root
hive>
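
The placeholders java.io.tmpdir and user.name in the log settings are ordinary JVM system properties. As a rough illustration of where the default log file ends up for the current user, the sketch below builds the same path from those properties. `HiveLogDirSketch` and its methods are hypothetical names; Hive resolves the path itself from its log4j configuration.

```java
class HiveLogDirSketch {
    /** Mirrors the default hive.log.dir: the JVM temp directory plus the user name. */
    static String defaultLogDir() {
        return System.getProperty("java.io.tmpdir") + "/" + System.getProperty("user.name");
    }

    /** Mirrors hive.log.dir combined with hive.log.file=hive.log. */
    static String defaultLogFile() {
        return defaultLogDir() + "/hive.log";
    }

    public static void main(String[] args) {
        // On a typical Linux host running as root this would be under /tmp/root,
        // matching the hive.querylog.location value printed above.
        System.out.println(defaultLogFile());
    }
}
```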

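Ways to run Hive:

1. CLI: the command line (hive or beeline). Connecting with beeline requires hiveserver2 to be running, started with either of:
hive --service hiveserver2 &
hiveserver2 &
With beeline you can configure whether a username and password are required, and set user permissions.
beeline connections support several authentication modes (see hive.server2.authentication in hive-site.xml); the default is NONE.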

 <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
 </property>
(hive-site-tem)
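2. Running through a Java JDBC connection
	Hive JDBC
	1. conn, ps and rs must be closed in the order rs, ps, conn; otherwise a SASL error is reported.
	2. The connection username and password must be supplied. If none are configured you can use root/root; otherwise a permission error is reported.
Kylin: speeds up Hive queries (it pre-executes queries and stores the results in HBase).
3. hive -f <hql file>
Notes:
	1. Each --hivevar or --hiveconf takes exactly one parameter.
	2. --hiveconf and --hivevar can be used together.
	3. Parameters defined with --hiveconf or --hivevar cannot be unset.
	hiveconf: readable and writable
	hivevar: user-defined temporary variables, readable and writable
	system: readable and writable
	env: read-only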

--hiveconf <property=value>   Use value for given property
--hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
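	hive -S: silent mode; shows only query results, not the execution progress.
	hive -i <filename>: initialization SQL file.
	
	-d,--define <key=value>       Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
4. hive -e "query statement"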
 hive -e "select * from qf_test.u2"

Logging initialized using configuration in jar:file:/usr/local/hive-1.2/lib/hive-common-1.2.1.jar!/hive-log4j.properties
OK
2       bb
3       cc
7       yy
9       pp
Time taken: 3.205 seconds, Fetched: 4 row(s)

Create the hql file:
CREATE TABLE if not exists `u6`(
  `id` int,
  `name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as orc;

insert into table ali_test.u6
select
id,
name
from ${hiveconf:stn}
limit ${hivevar:lim}
;


insert into table  ali_test.u1
select
id,
name
from ${hiveconf:stn}
limit ${hivevar:lim}
;


select * from ali_test.u2;

select * from ali_test.u1;
Run it:
hive -f a.hql --database qf_test --hiveconf stn=u2 --hivevar lim=2
Time taken: 94.653 seconds
OK
2       bb
3       cc
7       yy
9       pp
Time taken: 0.874 seconds, Fetched: 4 row(s)
OK
3       cc
2       bb
1       a
2       b
3       c
4       d
7       y
8       u
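
What the --hiveconf and --hivevar arguments do to the script text can be pictured as a plain textual substitution before the statement is run. The sketch below mimics that for the two placeholders used in the hql file. It is only a simulation written for this post, not Hive's own implementation; `HqlVariableSketch` and `substitute` are hypothetical names, and leaving undefined variables untouched is a choice made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HqlVariableSketch {
    // Matches ${hiveconf:name} and ${hivevar:name}.
    private static final Pattern VAR = Pattern.compile("\\$\\{(hiveconf|hivevar):([^}]+)\\}");

    /** Replaces variable references in hql text with the supplied values. */
    static String substitute(String hql, Map<String, String> hiveconf, Map<String, String> hivevar) {
        Matcher m = VAR.matcher(hql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Map<String, String> scope = m.group(1).equals("hiveconf") ? hiveconf : hivevar;
            String value = scope.get(m.group(2));
            // Sketch choice: an undefined variable is left in place unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(value != null ? value : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> hiveconf = new HashMap<>();
        hiveconf.put("stn", "u2"); // from --hiveconf stn=u2
        Map<String, String> hivevar = new HashMap<>();
        hivevar.put("lim", "2");   // from --hivevar lim=2

        String hql = "select id, name from ${hiveconf:stn} limit ${hivevar:lim}";
        System.out.println(substitute(hql, hiveconf, hivevar));
        // prints: select id, name from u2 limit 2
    }
}
```

With stn=u2 and lim=2, each insert in the script selects two rows from u2, which matches the two newly inserted rows (3 cc and 2 bb) at the top of the last result above.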

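Property configuration:

1. hive-site.xml (global)
2. Command-line parameters: hive --hiveconf a=10 -e '' (applies to that invocation only)
3. set in the CLI: set ...; select ...;

Notes:
1. Priority increases from top to bottom in the list above.
2. hive-site.xml settings are global and permanent; the other two are temporary and local.
3. hive-site.xml can hold any property, whereas the other two cannot set system-level properties,
such as the metastore database URL or the log configuration needed at startup.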
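
The precedence rule above amounts to a layered lookup: a session-level set wins over a --hiveconf argument, which wins over hive-site.xml. The sketch below models the three layers as plain maps. The class name, method name and the property name a are illustrative only, and this is not how Hive stores its configuration internally.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigPrecedenceSketch {
    /** Returns the effective value of key: session set > --hiveconf > hive-site.xml. */
    static String resolve(String key,
                          Map<String, String> site,
                          Map<String, String> commandLine,
                          Map<String, String> session) {
        if (session.containsKey(key)) {
            return session.get(key);
        }
        if (commandLine.containsKey(key)) {
            return commandLine.get(key);
        }
        return site.get(key); // null when the property is not set anywhere
    }

    public static void main(String[] args) {
        Map<String, String> site = new HashMap<>();
        site.put("a", "1");          // value in hive-site.xml
        Map<String, String> commandLine = new HashMap<>();
        commandLine.put("a", "10");  // hive --hiveconf a=10
        Map<String, String> session = new HashMap<>();

        System.out.println(resolve("a", site, commandLine, session)); // prints 10
        session.put("a", "20");      // set a=20; inside the CLI
        System.out.println(resolve("a", site, commandLine, session)); // prints 20
    }
}
```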

Hive JDBC

Hive utility class:
package hive_jdbc;

import java.sql.*;


/**
 * Connection utility class
 */
public class HiveJdbcUtil {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    private static String url = "jdbc:hive2://192.168.182.201:10000/ali_test"; // connects to the ali_test database
    private static String userName = "root"; // hiveserver2 username and password
    private static String password = "root";

    public static void main(String[] args) {
        System.out.println(getConn());
    }

    /**
     * Opens a JDBC connection to hiveserver2.
     *
     * @return the connection, or null if connecting failed
     */
    public static Connection getConn() {
        Connection conn = null;
        try {
            Class.forName(driverName); // load the driver
            // open the connection
            conn = DriverManager.getConnection(url, userName, password);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return conn;
    }

    /**
     * Closes the resources in the order rs, ps, conn.
     *
     * @param conn the connection, closed last
     * @param ps   the statement, closed second
     * @param rs   the result set, closed first
     */
    public static void closeConn(Connection conn, PreparedStatement ps, ResultSet rs) {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        if (ps != null) {
            try {
                ps.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }

        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
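
An alternative to closing the three resources by hand is try-with-resources, which closes them in the reverse of the order in which they were opened. Declaring conn, then ps, then rs therefore gives the rs, ps, conn close order that the notes on Hive JDBC call for. The sketch below demonstrates this with stand-in objects so that it runs without a Hive server; `CloseOrderSketch` and `Tracked` are hypothetical names, and real code would declare `Connection`, `PreparedStatement` and `ResultSet` in their place.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    /** Stand-in for a JDBC resource that records when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    /** Opens conn, ps, rs in that order and returns the order in which they were closed. */
    static List<String> closeOrder() {
        List<String> log = new ArrayList<>();
        try (Tracked conn = new Tracked("conn", log);
             Tracked ps = new Tracked("ps", log);
             Tracked rs = new Tracked("rs", log)) {
            // A real program would run the query and read the rows here.
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(closeOrder()); // prints [rs, ps, conn]
    }
}
```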

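Hive remote mode

This is much like mode 1, except that the metadata is kept on another server; it is what is usually called cluster mode.
There can be one Hive server and multiple Hive clients.

Hive can also be started as a server that provides service to external clients.

To start it (assuming the host is hadoop01):
In the foreground: bin/hiveserver2
In the background: nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &

Once it has started, you can connect from other nodes with beeline.
Option (1)
Run hive/bin/beeline and press Enter to reach the beeline prompt.
Then connect to hiveserver2:
beeline> !connect jdbc:hive2://hdp01:10000
(hdp01 is the host where hiveserver2 was started; the default port is 10000)
Option (2)
Or connect at startup:
bin/beeline -u jdbc:hive2://hdp01:10000 -n hadoop

After that you can run SQL queries as usual.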

[root@hadoop-01 data]# beeline 
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://hadoop-01:10000
Connecting to jdbc:hive2://hadoop-01:10000
Enter username for jdbc:hive2://hadoop-01:10000: root
Enter password for jdbc:hive2://hadoop-01:10000: ****
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop-01:10000> 
0: jdbc:hive2://hadoop-01:10000> 
0: jdbc:hive2://hadoop-01:10000> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
3 rows selected (2.947 seconds)
0: jdbc:hive2://hadoop-01:10000>

ANALYZE and the number of jobs:

A simple query usually produces one job. More generally, each subquery, join, group by, limit or similar operation may become a job of its own.

1 job:
select
t1.*
from t_user1 t1
left join t_user2 t2
on t1.id = t2.id
where t2.id is null
;

The following produces 3 jobs:
select
t1.*
from t_user1 t1
where id in (
select
t2.id
from t_user2 t2
limit 1
)
;

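13. analyze:
Official reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev

ANALYZE (analyzing a table, also called computing statistics) is a built-in Hive operation that collects metadata about a table. It can considerably improve query times on the table, because it gathers the row count, file count and file size (in bytes) of the table's data and makes them available to the query planner before execution.

ANALYZE syntax for an existing table: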
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  -- (Note: Fully support qualified table name since Hive 1.2.0, see HIVE-10007.)
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [CACHE METADATA]       -- (Note: Hive 2.1.0 and later.)
  [NOSCAN];

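Example 1 (a specific partition):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS;
Collects basic statistics (such as row count, file count and size) for the bdp_day=20190701 partition and stores them in the Hive Metastore for use in query optimization.

Example 2 (all columns):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS FOR COLUMNS;
Collects column-level statistics for every column in the bdp_day=20190701 partition. This is a fine-grained analysis statement. For each column the statistics include the number of distinct values, the number of NULL values, the average column size, the average or sum of all values in the column (for numeric types), and percentiles of the values.

Example 3 (specific columns):
ANALYZE table dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS FOR COLUMNS snum,dept;

Example 4:
ANALYZE TABLE dw_employee_hive partition(bdp_day=20190701) COMPUTE STATISTICS NOSCAN;
Collects statistics for the specified partition without scanning the data files.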

Checking the results of the analysis.
Example 1:
DESCRIBE EXTENDED dw_employee_hive partition(bdp_day=20190701);

Output:
...parameters:{totalSize=10202043, numRows=33102, rawDataSize=430326, ...

Example 2:
desc formatted dw_employee_hive partition(bdp_day=20190701) name;

Output:
# col_name  data_type  min  max  num_nulls  distinct_count  avg_col_len  max_col_len  num_trues  num_falses  comment
name        string               0          37199           4.0          4                                   from deserializer


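Notes:
When analyzing a partitioned table you normally specify the partition. To analyze the whole table, use partition(bdp_day) without a value.
For query results after optimization, see: https://www.cnblogs.com/lunatic-cto/p/10988342.html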
