24 Hive: Struct and Optimization

木生火18624

Category column: 大数据学习路程 (Big Data Learning Journey)

Article link: https://blog.csdn.net/penghao_1/article/details/104491776

struct: a struct can also hold complex types such as arrays

create table if not exists str1(
name string,
addr struct<province:string,city:string,street:string>
)
row format delimited 
fields terminated by '\t'
collection items terminated by ','
;

Load the data

Query the table to check that the data loaded correctly
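The load and check statements are not shown here. A minimal sketch, assuming a tab-separated file at the hypothetical path /hivedata/str1.txt whose second column holds province,city,street separated by commas:

```sql
-- The path is an assumption; point it at your own file.
-- Each line is expected to look like: <name><TAB><province>,<city>,<street>
load data local inpath '/hivedata/str1.txt' into table str1;

-- Quick check that the struct column was split into its three fields
select
name,
addr.province,
addr.city,
addr.street
from str1
limit 10
;
```

If the struct fields come back as NULL, the delimiters in the file probably do not match the ones declared in the table.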

Query:
Find the province (and street) of the users whose city is dongjing (Tokyo)

select 
name,
addr.province,
addr.street
from str1
where addr.city = 'dongjing'
;

uid   uname   manager   tax   addr
1   xdd   ll,lw,ly,lc,lh   wuxian:300,gongjijin:1200,shebao:240   北京,西城区,中南海
2   lkq   lw,ly,lc,lsj   wuxian:260,gongjijin:1000,shebao:360   江西,昌平区,回龙观街道

Query: users with more than 4 subordinates, a gongjijin (housing fund) above 1000, and province 北京

create table if not exists str2(
uid int,
uname string,
manager array<string>,
tax map<string,int>,
addr struct<province:string,city:string,street:string>
)
row format delimited 
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
;

load data local inpath '/hivedata/str2.txt' into table str2;

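Query: users with more than 4 subordinates, a gongjijin (housing fund) above 1000, and province 北京
Fields to return:
first subordinate, second subordinate
gongjijin
province, city, street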

select 
uid,
uname,
manager[0],
manager[1],
tax['gongjijin'],
addr.province,
addr.city,
addr.street
from str2
where size(manager) > 4 AND
tax['gongjijin'] > 1000 AND
addr.province = '北京'
;

Nested data types

Hive supports 8 levels of delimiters by default
\001 \002 ... \008
(\001 is \u0001)

uid   uname   manager   tax   addr
1   xdd   ll,lw,ly,lc,lh   wuxian:[300;360;460],gongjijin:[1200;1500;1800],shebao:[240;320;500]   北京,西城区,中南海
2   lkq   lw,ly,lc,lsj   wuxian:260,gongjijin:1000,shebao:360   江西,昌平区,回龙观街道

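### How files are read and parsed: ROW FORMAT [ROW_FORMAT]
Hive uses two classes to read and write data.
One class reads records from the file, one record at a time (a record may be a single line, or a complete element in an XML file).
The other class splits each record read above into individual fields (either by a simple delimiter, or with custom parsing for complex structures).

ROW FORMAT: specifies how a record is parsed into fields, that is, which SerDe is used. The class that reads the records (the InputFormat) is chosen by STORED AS.
DELIMITED: use the built-in delimited parsing. For text files the rows are read with the default org.apache.hadoop.mapred.TextInputFormat, with the newline as row separator (this is the shorter, older mapred package).
FIELDS TERMINATED BY: sets the field delimiter used by the SerDe that parses each row. The default SerDe is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe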

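----------serde

serialize: (serialization, writing data)
deserialize: (deserialization, reading data)

Common SerDes: CSV, TSV, JSON SerDe, Regex SerDe, and so on

csv: comma-separated values
tsv: tab-separated values
json: data in JSON format
regexp: data that must match a regular expression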

Create a table:

create table if not exists csv1(
id int,
name string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- the class that parses each row
stored as textfile
;

load data local inpath '/hivedata/stu1.txt' into table csv1;

create table if not exists csv2(
id int,
name string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",", 
"quoteChar"="'",
"escapeChar"="\\"
)
stored as textfile
;

JSON SerDe:

If you use a third-party jar, or one you wrote yourself, you must load it first with add jar (it has to be referenced before use)
add jar /root/json-serde-1.3-jar-with-dependencies.jar;
{"pid":1,"content":"this is class gp1923"}
{"pid":2,"content":"this is pid 2 content"}

create table if not exists js1(
pid int,
content string
)
row format serde 'org.openx.data.jsonserde.JsonSerDe' -- a third-party SerDe
;

load data local inpath '/hivedata/js.txt' into table js1;

{"uid":1,"uname":"xdd","manager":["ll","lw","ly","lc","lh"],"tax":{"wuxian":[300,360,460],"gongjijin":[1200,1500,1800],"shebao":[240,320,500]},"addr":{"province":"北京","city":"西城区","street":"中南海"}}
{"uid":2,"uname":"lkq","manager":["lw","ly","lc","lsj"],"tax":{"wuxian":[260,300,420],"gongjijin":[1000,1200,1500],"shebao":[360,420,600]},"addr":{"province":"江西","city":"昌平区","street":"回龙观街道"}}

create table if not exists ts2( -- parses JSON
uid int,
uname string,
manager array<string>, -- array
tax map<String,array<int>>,
addr struct<province:string,city:string,street:string>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
;

load data local inpath '/hivedata/js1.txt' overwrite into table ts2;
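Once the nested JSON is loaded, the complex columns can be read with index, key, and dot access. A small sketch, assuming ts2 was created and loaded as above (the column aliases are made up for illustration):

```sql
select
uid,
uname,
manager[0]          as first_manager,   -- array element by index
tax['gongjijin'][0] as gongjijin_first, -- map lookup, then array index
size(tax['wuxian']) as wuxian_count,    -- number of elements in the nested array
addr.city                               -- struct field by name
from ts2
where addr.province = '北京'
;
```

An index past the end of an array, or a key missing from the map, yields NULL rather than an error.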

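Regex SerDe:

By default Hive supports only single-byte field delimiters, so a multi-character delimiter in the data, such as || below, cannot be handled by a delimited table: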
01||gaoyuanyuan
02||cls

create table t_1( -- the plain delimited format cannot handle this
id int,
name string
)
row format delimited
fields terminated by '||'
;

create table if not exists t_reg( -- use a regular expression instead
id string,
name string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
"input.regex"="(.*)\\|\\|(.*)",
"output.format.string"="%1$s %2$s"
)
;

load data local inpath '/hivedata/t1.txt' overwrite into table t_reg;

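Solving the special-delimiter problem with a custom InputFormat

Principle: while reading each line, the InputFormat replaces the multi-byte delimiter with Hive's default delimiter \001, so that Hive's default SerDe can extract the fields with the default delimiter

Package the project, copy the jar into the lib directory of the Hive installation, and restart Hive
You also need to add jar so that the custom jar is available when queries run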

create table if not exists t_bi(
id int,
name string
) 
row format delimited 
fields terminated by '\001'
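stored as inputformat 'com.qfedu.bigdata.hiveUDF.inputformat.BiDelimiterInputFormat' -- the custom class
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' -- Hive's standard text output format; the syntax needs both
;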

load data

####################

Data storage formats in Hive:
Hive's default column delimiter is \001, not tab (\t)
Common delimiters:
tab
,
/
-
|
:
\n
\001 ^A
\002 ^B
\003
……
\008

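Data storage formats:

textfile: Hive's default file storage format. Plain text files, not compressed. Not very efficient
sequencefile: a binary storage format that Hive offers to users; it supports compression natively
rcfile: a hybrid row-columnar format provided by Hive. In this format Hive tries to store nearby rows and column blocks together. It is also compressed, and queries are relatively efficient.
orc: Optimized Row Columnar, an optimized successor to rcfile
parquet: a columnar format commonly used with Spark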

<name>hive.default.fileformat</name>
    <value>TextFile</value>
    <description>
      Expects one of [textfile, sequencefile, rcfile, orc].
      Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
    </description>

create table if not exists text1(
uid int,
uname string,
age int
)
row format delimited fields terminated by ','
stored as textfile
;

load data local inpath '/home/hivedata/csv' into table text1;

create table if not exists seq1(
uid int,
uname string,
age int
)
row format delimited fields terminated by ','
stored as sequencefile
;

### These special formats cannot be populated with load (load only moves the file; it does not convert it)
load data local inpath '/home/hivedata/csv' into table seq1;

Data can only be inserted by querying a staging table

insert into seq1
select * from text1;

rcfile:

create table if not exists rc1(
uid int,
uname string,
age int
)
row format delimited fields terminated by ','
stored as rcfile
;

### Cannot be populated with load

load data local inpath '/home/hivedata/csv' into table rc1;

insert into rc1
select * from text1;

Overall, DefaultCodec + rcfile gives relatively good efficiency
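The orc and parquet formats from the list above follow the same pattern as seq1 and rc1: create the table with the target format, then fill it from the text table. A sketch, assuming text1 already holds data (the table names orc1 and pq1 are made up here):

```sql
create table if not exists orc1(
uid int,
uname string,
age int
)
stored as orc
tblproperties ("orc.compress"="SNAPPY") -- optional; ORC uses ZLIB when this is not set
;

create table if not exists pq1(
uid int,
uname string,
age int
)
stored as parquet
;

-- As with sequencefile and rcfile, fill the tables through a query, not through load
insert into table orc1
select * from text1;

insert into table pq1
select * from text1;
```

No row format clause is needed, because both formats ship with their own SerDe.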

Custom storage format:
Source data (seq_yd):
aGVsbG8gZ3AxOTIz
Z29vZCBnb29kIHN0dWR5LGRheSBkYXkgdXA=

Decoded data:
hello gp1923
good good study,day day up

create table if not exists cus(
str string
)
stored as inputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'
outputformat 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextOutputFormat'
;

load data local inpath '/home/hivedata/cus' into table cus;

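######### Hive indexes
Indexes are a standard database technique. Hive has offered indexes since version 0.7, but their performance cannot compare with that of a traditional database. (Indexes were removed again in Hive 3.0.)

Advantages: faster queries, because a full table scan can be avoided.
Disadvantages: redundant storage, and slower data loading.

Characteristics of the index file: the indexed data is sorted and comparatively small.

Index keyword: index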

create index idx_1 on table movieration(movie)
as 'compact'
with deferred rebuild
;

Modify an index: (rebuild it)
alter index idx_1 on movieration rebuild;

Show indexes:
show index on movieration;

Create a composite index:

create index idx_2 on table movieration(uid,movie)
as 'bitmap'
with deferred rebuild;

Drop an index:
drop index idx_1 on movieration;

create index idx_2 on table ratingjson(json)
as 'compact'
with deferred rebuild
;

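###### Views:
A Hive view can be thought of simply as a logical table (it only defines a scope of data to work on)
Hive versions before 3.0 support only logical views, not materialized views.

What views are for in Hive:
1. Exposing only part of the data (private data stays hidden)
2. Simplifying complex statements

Create a view: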

create view if not exists v1
as
select * from text1;

Inspect a view:
show tables;
show create table v1;

Clone a view: (not supported)
create view v2 like v1;

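Modify a view: (older Hive versions have no statement for changing a view's structure; if you really must, edit the metastore directly. Hive 0.11 and later accept alter view ... as select.)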

Drop a view:
drop view if exists v1;

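Notes:
1. Never drop the table underlying a view and then query the view.
2. A view cannot be loaded with insert into or load.
3. A view is read-only; its structure and related properties cannot be modified.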
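The first use of views listed above, partial exposure of data, can be sketched as follows. The view name v_text1_public is made up, and treating age as the private column is only an assumption for the example:

```sql
create view if not exists v_text1_public
as
select
uid,
uname      -- age is deliberately left out of the view
from text1
where uid < 3
;

-- Consumers query the view and never see the hidden column
select * from v_text1_public;
```

Because the view stores only the query and no data, every select against it is run against text1.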

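----- Hive import and export
1. From a local file
2. From HDFS
3. insert into
4. Copying the data straight into the table directory
5. location
6. Cloning a table together with its data
7. Multi-table insert
8. CTAS: create table t1 as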
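A few of the import methods listed above, sketched in HiveQL. All paths and the new table names (text_ext, text1_copy, text1_ctas) are hypothetical, and text1 is assumed to exist as defined earlier:

```sql
-- 2. Load from HDFS: the file is moved into the table directory
load data inpath '/tmp/hivedata/csv' into table text1;

-- 5. location: an external table over a directory that already holds data
create external table if not exists text_ext(
uid int,
uname string,
age int
)
row format delimited fields terminated by ','
location '/user/hive/external/text_ext'
;

-- 6. Clone the structure, then copy the data across
create table if not exists text1_copy like text1;
insert into table text1_copy
select * from text1;

-- 8. CTAS: structure and data come from the query
create table text1_ctas as
select uid, uname
from text1
where age > 18
;
```

Method 4 is not HiveQL: the file is placed in the table directory with an HDFS command and becomes visible on the next query.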

#### Hive does not allow row-level data operations (insert, delete, update) on ordinary, non-transactional tables

Multi-table insert:

from text1 -- from comes first, so the source is read only once
insert overwrite table csv1
select *
insert overwrite table csv2
select * 
where uid < 3
;

with alias as (select * from tab where <condition>)
insert into table <target_table>
select * from alias
where <condition>
;

Exporting data from Hive:
1. From a Hive table to a local file
2. From Hive to HDFS
3. From a Hive table to another table

Keyword: insert overwrite directory

insert overwrite local directory '/home/hivedata/out/00'
row format delimited fields terminated by '\t'
select * from text1
;

Hive command-line options:
hive -help

hive -e 'select * from qf1705.text1 where uid < 3;' >> '/home/hivedata/out/00'

insert overwrite directory '/user/out/00'
select * from text1;

insert overwrite directory '/user/out/01'
row format delimited fields terminated by ',' -- the exported data is comma-delimited
select * from text1;

---------- Hive compression
Compression in the map stage (intermediate data):
set hive.exec.compress.intermediate=false/true
hive.intermediate.compression.codec=snappy|gzip|bzip2|lzo (the value is the codec class name, for example org.apache.hadoop.io.compress.SnappyCodec)
hive.intermediate.compression.type

Compression in the reduce stage (final output):
set hive.exec.compress.output=false/true
mapreduce.output.fileoutputformat.compress.codec
mapreduce.output.fileoutputformat.compress.type
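A concrete session might look like the sketch below. The codec choices are only examples, the output path /user/out/02 is hypothetical, and the Snappy native libraries are assumed to be installed on the cluster:

```sql
-- Compress the intermediate data passed between stages with Snappy
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;

-- Compress the final output of the job with Gzip
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

insert overwrite directory '/user/out/02'
select * from text1;
```

A fast codec such as Snappy is a common choice for intermediate data, while a codec with a better ratio suits final output that is kept for a long time.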

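#### How Hive properties are set
hive-default.xml (the built-in defaults)
1. hive-site.xml
2. Command-line parameters when launching hive
3. set statements in the client

Differences:
1. Priority increases down the list above: a set in the client overrides a command-line parameter, which overrides hive-site.xml, which overrides hive-default.xml.
2. Settings in hive-site.xml are global and permanent; the other methods are temporary and local to the session.
3. hive-site.xml can set every property; the latter two methods cannot set system-level properties.

System-level settings: metastore URL, logging configuration.

Hive has four kinds of properties (namespaces):
hiveconf: readable and writable
hivevar: user-defined temporary variables, readable and writable
system: readable and writable
env: readable, not writable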

--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B


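hive -e   run the query string given on the command line
hive -f   run the queries in a file
hive -S   silent mode
hive -i   run an initialization file first
hive --database   choose the database to use

hive> set -v;   show the current parameters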
hive -e 'set' | grep current

Three ways to run statements in Hive:
1. The interactive CLI
2. hive -e 'command'
3. hive -f <file>.hql (a plain SQL file)

hive --database gp1809 --hivevar ls=2 --hivevar ls1='id > 0' --hiveconf tn=t_userinfo --hiveconf cn1=id --hivevar cn2=name -e 'select ${hiveconf:cn1},${hivevar:cn2} from ${hiveconf:tn} where ${hivevar:ls1} limit ${hivevar:ls}'

Notes:
1. Each --hivevar or --hiveconf takes exactly one key=value pair
2. --hiveconf and --hivevar can be mixed in one command
3. Parameters defined with --hiveconf or --hivevar cannot be unset
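The same variables can also be defined inside a script with set. A sketch, assuming a table t_userinfo with columns id and name as in the command above; the file name vars.hql is made up:

```sql
-- vars.hql, run with: hive -f vars.hql
set hivevar:ls=2;       -- same effect as --hivevar ls=2
set hivevar:cn2=name;
set tn=t_userinfo;      -- a plain set writes into the hiveconf namespace

select id, ${hivevar:cn2}
from ${hiveconf:tn}
limit ${hivevar:ls}
;

set hivevar:ls;         -- prints the current value of the variable
set env:HOME;           -- env values can be read but not assigned
```

Substitution is purely textual and happens before the statement is parsed, which is why a variable can stand for a column name, a table name, or a whole condition.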

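---- Hive optimization ------------
1. Environment: server configuration, container configuration, environment setup
2. Software configuration parameters
3. Code-level optimization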

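1. explain and explain extended
explain extended select * from t_userinfo;

explain: explains the HQL statement only.
explain extended: explains the HQL statement and also prints the abstract syntax tree.

stage: roughly equivalent to a job. A stage can be a limit, a subquery, a group by, and so on.

By default Hive runs one stage at a time, but stages that do not depend on each other can run in parallel, for example the branches of a union all.
The more complex the task and the HQL code, the more stages there are and the longer the run takes.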

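2. join
In Hive, always let the small table (result set) drive the large table (result set).
Hive's on condition supports only equi-joins (non-equality conditions in on are accepted from Hive 2.2.0).
Check whether Hive is configured to convert common joins into map-side joins, and what the file-size threshold for the small table of a map join is.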
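The two settings referred to here are hive.auto.convert.join and hive.mapjoin.smalltable.filesize. A sketch with hypothetical tables small_dim (small) and big_fact (large):

```sql
set hive.auto.convert.join=true;               -- let Hive turn common joins into map joins
set hive.mapjoin.smalltable.filesize=25000000; -- small-table size threshold, in bytes

-- The mapjoin hint names the table to hold in memory. Recent Hive versions
-- ignore the hint unless hive.ignore.mapjoin.hint is set to false.
select /*+ mapjoin(d) */
f.uid,
d.uname
from big_fact f
join small_dim d
on f.uid = d.uid
;
```

With automatic conversion on, the hint is usually unnecessary: Hive compares the table size with the threshold and picks the map join by itself.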

3. limit optimization:
set hive.limit.row.max.size=100000
set hive.limit.optimize.limit.file=10
set hive.limit.optimize.enable=false (turning it on is recommended when limit queries are frequent)
set hive.limit.optimize.fetch.max=50000

4. Local mode:
set hive.exec.mode.local.auto=false/true (recommended: on)
set hive.exec.mode.local.auto.inputbytes.max=134217728
set hive.exec.mode.local.auto.input.files.max=4

5. Parallel execution:
set hive.exec.parallel=false/true (recommended: on)
set hive.exec.parallel.thread.number=8

6. Strict mode:
hive.mapred.mode=nonstrict/strict (recommended: strict)
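What strict mode rejects can be seen with the text1 table defined earlier. A sketch; the rejected statement is left commented out so that the script runs through:

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: order by without limit
-- select * from text1 order by age;

-- Accepted: the same query with a limit
select * from text1 order by age limit 10;

-- Also rejected in strict mode:
--   a query on a partitioned table with no partition filter in where
--   a join with no on condition (Cartesian product)
```

All three restrictions target queries that can accidentally scan or shuffle a very large amount of data.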

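7. Number of maps and reduces:
More is not always better, and neither is fewer.
Merge small files for processing (set the input class to CombineTextInputFormat), or merge small files into large ones at the file level before uploading them to HDFS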
mapred.max.split.size=256000000
mapred.min.split.size.per.node=1
mapred.min.split.size.per.rack=1
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

Setting the number of maps manually:
set mapred.map.tasks=2
Number of reduces: decided automatically or set manually
set mapred.reduce.tasks=-1

set hive.exec.reducers.max=1009
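When the reduce count is left to Hive, it is estimated from the input size and hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max. A sketch; the 256000000 value is only an example:

```sql
set mapred.reduce.tasks=-1;                          -- -1 lets Hive estimate the number
set hive.exec.reducers.bytes.per.reducer=256000000;  -- target input bytes per reducer
set hive.exec.reducers.max=1009;                     -- upper bound on the estimate

-- Roughly min(1009, ceil(input_bytes / 256000000)) reducers are used,
-- so about 10 GB of input leads to about 40 reducers.
select age, count(*)
from text1
group by age
;
```

Setting mapred.reduce.tasks to a positive number overrides the estimate entirely.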

8. JVM reuse:
mapreduce.job.jvm.numtasks=1 ### current property name

mapred.job.reuse.jvm.num.tasks=1 ### older name for the same setting

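9. Data skew: (important)
Skew is what happens when an uneven key distribution sends most of the data in one direction.
Common causes:
The data itself is skewed.
join statements
count(distinct col)
group by

set hive.map.aggr=true
set hive.groupby.skewindata=false/true (recommended: on)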
set hive.optimize.skewjoin=false/true
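For count(distinct col), a common rewrite spreads the work over many reducers by grouping first and counting afterwards. A sketch against text1, using uname as the example column:

```sql
set hive.map.aggr=true;           -- partial aggregation on the map side
set hive.groupby.skewindata=true; -- adds an extra job that spreads hot keys randomly

-- Instead of: select count(distinct uname) from text1;
select count(*) as distinct_unames
from (
  select uname
  from text1
  where uname is not null   -- count(distinct) ignores NULL, so drop it here too
  group by uname
) t
;
```

The inner group by can run on many reducers, while a plain count(distinct) funnels every row through a single one.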

10. Indexes

11. Views

12. Partitioning is itself a Hive optimization

13. Bucketing

14. fetch
select * from text1;
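Whether a simple query like this one runs as a fetch task, without launching a MapReduce job, is controlled by hive.fetch.task.conversion. A short sketch:

```sql
set hive.fetch.task.conversion=more;  -- accepted values: none, minimal, more

-- Served directly by a fetch task
select * from text1;

-- Under "more", simple projections, filters and limit also avoid a job
select uid, uname from text1 where uid < 3 limit 10;
```

Setting the value to none forces every query through a full job, which is sometimes useful for comparison.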

15. Predictive execution (precomputation)
kylin

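Differences between MySQL and Hive:
MySQL uses its own storage engines, while Hive stores data on HDFS (or HBase).
MySQL uses its own execution engine, while Hive executes queries with MapReduce.
MySQL runs in almost any environment, while Hive depends on Hadoop.
MySQL has low latency; Hive has high latency.
MySQL handles relatively small data volumes, while Hive handles large ones.
MySQL scales poorly, while Hive scales well.
MySQL is strict about the data storage format, while Hive is not.
MySQL allows row-level inserts, updates, and deletes, while Hive does not support row-level data operations.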