Some Hive concepts: databases, tables, partitions, and buckets:
Databases: equivalent to a schema in a relational database;
Tables: equivalent to tables in a relational database;
Partitions: similar to table partitions in a relational database; they improve query efficiency;
Buckets (or Clusters): data within a partition can be subdivided further by hashing a chosen key into buckets.
This resembles hash partitioning (or subpartitioning) and improves efficiency further.
Hive tables are either internal (MANAGED_TABLE) or external (EXTERNAL_TABLE); the default is internal.
The difference: dropping an internal table deletes its data on HDFS;
dropping an external table does not delete its data on HDFS.
When to use internal tables: intermediate tables, result tables, and any table that does not need data loaded from outside
(local files, HDFS); data typically arrives via INSERT from external sources or other tables.
When to use external tables: source tables, where external data is mapped into the table on a regular schedule.
Example scenario: website logs collected each day flow into HDFS as text files, one directory per day.
Create an external table in Hive as the source table, and map each day's raw logs on HDFS into a daily partition of that external table by adding partitions.
Heavy statistical analysis runs on top of the external (raw log) table; the intermediate and result tables it produces are stored as internal tables, populated via SELECT + INSERT.
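This scenario can be sketched in HiveQL (the table names logs_raw and logs_daily_count, and the paths, are hypothetical):

```sql
-- Hypothetical external source table over the raw daily log directories.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_raw (line STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/weblogs';

-- Map one day's HDFS directory into a daily partition of the source table.
ALTER TABLE logs_raw ADD IF NOT EXISTS PARTITION (dt = '2018-08-08')
LOCATION '/data/weblogs/2018-08-08';

-- Results go to an internal (managed) table via SELECT + INSERT.
CREATE TABLE IF NOT EXISTS logs_daily_count (cnt BIGINT)
PARTITIONED BY (dt STRING);

INSERT OVERWRITE TABLE logs_daily_count PARTITION (dt = '2018-08-08')
SELECT count(*) FROM logs_raw WHERE dt = '2018-08-08';
```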
Creating a simple internal table:
create table dd (name string , age string) location '/input/table_data';
Without the EXTERNAL keyword, an internal table is created, and a data directory for dd is created on HDFS.
If no LOCATION is given, the table is created under /user/hive/warehouse/ in the directory for its database.
load data inpath '/input/data' into table dd;  -- load data into the table
Creating a simple external table:
create external table aa (name string , age string);
This creates a table directory under the database's directory beneath /user/hive/warehouse/.
load data inpath '/input/edata' into table aa;
After dropping this external table, the data under its table directory beneath /user/hive/warehouse/ is not deleted.
However, the data under /input/edata/ was already gone after the LOAD above: the data's location changed!
In essence, loading data that is already on HDFS moves the data rather than copying it.
Summary: differences between internal and external tables
1. When data is loaded into an external table (with a LOCATION specified), the data is not moved into Hive's own warehouse directory;
in other words, the external table does not manage its data itself. An internal table does.
2. Dropping an internal table deletes both its metadata and its data;
dropping an external table deletes only the metadata, and the data is left intact.
3. Adding LOCATION when creating an internal or external table behaves the same way; only the table directory's location differs.
PARTITION works the same way too; the table directory simply gains partition subdirectories.
load data local inpath uploads data from the local file system to HDFS:
into the LOCATION directory if one was specified, otherwise into Hive's default configured warehouse directory.
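One way to check which kind a given table is: DESCRIBE FORMATTED reports the table type in its output. A minimal sketch, using the dd and aa tables above:

```sql
DESCRIBE FORMATTED dd;  -- output includes: Table Type: MANAGED_TABLE
DESCRIBE FORMATTED aa;  -- output includes: Table Type: EXTERNAL_TABLE
```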
Why use partitions:
1. Partitions let a Hive SELECT avoid scanning the whole table, which wastes time on unnecessary work.
(For example, store each day's logs in its own partition so queries can target a specific date.)
2. A table can have one or more partitions; each partition exists as its own subdirectory under the table's directory.
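For example (hypothetical logs table), a predicate on the partition column lets Hive prune directories instead of scanning the whole table:

```sql
-- Hypothetical table partitioned by day.
CREATE TABLE logs (line STRING) PARTITIONED BY (day STRING);

-- Only the directory .../day=2018-08-08 is read; other partitions are skipped.
SELECT count(*) FROM logs WHERE day = '2018-08-08';
```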
Bucketed tables
Hive can further organize each table or partition into buckets, a finer-grained division of the data.
Bucketing is done on a chosen column: Hive hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record belongs to.
Buckets are stored as files under the table or partition directory.
There are two reasons to organize a table (or partition) into buckets:
(1) More efficient query processing. Buckets impose extra structure on the table, which Hive can exploit for certain queries.
In particular, joining two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join:
only the buckets holding the same column values need to be joined against each other, which greatly reduces the amount of data the JOIN must process.
(2) More efficient sampling. With large datasets, it is very convenient during query development and tuning
to try a query on a small slice of the data first.
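A minimal sketch of both uses (table and column names are hypothetical; on older Hive versions hive.enforce.bucketing must be set so inserts actually produce one file per bucket):

```sql
-- Bucket users by id into 4 buckets.
CREATE TABLE user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE user_bucketed
SELECT id, name FROM users;

-- Sampling: read only bucket 1 of 4 instead of the whole table.
SELECT * FROM user_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```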
Creating a partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS data_table(
ROWKEY STRING,
STATION INT,
MONTH INT,
DAY INT,
HOUR INT,
MINUTE INT
)
PARTITIONED BY (YEAR INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
The STORED AS keyword specifies the table's file storage format on HDFS. The options are:
TEXTFILE: plain text, the default; SEQUENCEFILE: binary sequence files; RCFILE, ORC, PARQUET:
several columnar storage formats. The LOCATION keyword specifies where the table is stored on HDFS.
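Columnar formats such as ORC cannot be loaded directly from plain text files; they are usually populated from a text staging table. A sketch (table names are hypothetical):

```sql
CREATE TABLE events_orc (id INT, payload STRING)
STORED AS ORC;

-- Convert by reading the text table and writing in ORC format.
INSERT OVERWRITE TABLE events_orc
SELECT id, payload FROM events_text;
```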
HiveQL syntax is similar to MySQL's and is case-insensitive, including function names.
CREATE EXTERNAL TABLE test1 (
id INT,
ip STRING COMMENT 'visitor IP',
avg_view_depth DECIMAL(5,1),
bounce_rate DECIMAL(6,5)
) COMMENT 'table-level comment'
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS textfile
LOCATION 'hdfs://user/hive/warehouse/';
The EXTERNAL keyword marks the table as external; without it, an internal table is created.
COMMENT adds comments to the table and its columns.
PARTITIONED BY marks the table as partitioned; here the partition column is day, of type string.
ROW FORMAT DELIMITED specifies the table's delimiters and is usually combined with the following:
FIELDS TERMINATED BY ',' : the field delimiter within each row (here, a comma).
LINES TERMINATED BY '\n' : the line (row) delimiter.
COLLECTION ITEMS TERMINATED BY ',' : the delimiter between elements of a collection.
MAP KEYS TERMINATED BY ':' : the delimiter between keys and values in map-typed data.
Demo: Hive partitions combined with complex data types
Sample data (a.txt):
1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
2,lilei,book-code,nanjing:jiangning-taiwan:taibei
3,lihua,music-book,heilongjiang:haerbin
4,dalong,magic-book,shenzhen:nanshan-guangzhou:fanyu
In general:
it is best to use the tab character as the field delimiter when exporting data.
Defining table fields with '\t' avoids losing or corrupting data when the data itself contains characters such as ','.
Creating the partitioned table:
create table if not exists test2(
id int
,name string
,hobby array<string>
,add map<string,string>
)
partitioned by (pt_d string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Load the data (stored as TEXTFILE by default):
load data local inpath 'local/path/to/a.txt' overwrite into table test2 partition (pt_d = '201808');
View the data and partitions:
hive> select * from test2;
OK
1 xiaoming ["book","TV","code"] {"beijing":"chaoyang","shagnhai":"pudong"} 201808
2 lilei ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"} 201808
3 lihua ["music","book"] {"heilongjiang":"haerbin"} 201808
4 dalong ["magic","book"] {"shenzhen":"nanshan","guangzhou":"fanyu"} 201808
Time taken: 2.182 seconds, Fetched: 4 row(s)
View the partitions:
hive> show partitions test2;
OK
pt_d=201808
Time taken: 10.137 seconds, Fetched: 1 row(s)
Since no directory was specified when loading, the data went to the default HDFS directory:
hive> dfs -ls /user/hive/warehouse/ludb.db/;
Found 3 items
drwxrwxrwx - root supergroup 0 2018-08-07 16:19 /user/hive/warehouse/ludb.db/58_tongcheng
drwxrwxrwx - root supergroup 0 2018-08-01 21:27 /user/hive/warehouse/ludb.db/region
drwxrwxrwx - root supergroup 0 2018-08-08 11:31 /user/hive/warehouse/ludb.db/test2
Load the same data again into another partition:
load data local inpath 'local/path/to/a.txt' overwrite into table test2 partition (pt_d = '201807');
View the table's partitions again:
hive> show partitions test2;
OK
pt_d=201807
pt_d=201808
Time taken: 10.117 seconds, Fetched: 2 row(s)
View the table's data again:
hive> select * from test2;
OK
1 xiaoming ["book","TV","code"] {"beijing":"chaoyang","shagnhai":"pudong"} 201807
2 lilei ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"} 201807
3 lihua ["music","book"] {"heilongjiang":"haerbin"} 201807
4 dalong ["magic","book"] {"shenzhen":"nanshan","guangzhou":"fanyu"} 201807
1 xiaoming ["book","TV","code"] {"beijing":"chaoyang","shagnhai":"pudong"} 201808
2 lilei ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"} 201808
3 lihua ["music","book"] {"heilongjiang":"haerbin"} 201808
4 dalong ["magic","book"] {"shenzhen":"nanshan","guangzhou":"fanyu"} 201808
Time taken: 2.178 seconds, Fetched: 8 row(s)
The data is duplicated, because the same data was inserted into two different partitions.
View the table's partition directories on HDFS:
dfs -ls /user/hive/warehouse/ludb.db/test2;
Found 2 items
drwxrwxrwx - root supergroup 0 2018-08-08 11:38 /user/hive/warehouse/ludb.db/test2/pt_d=201807
drwxrwxrwx - root supergroup 0 2018-08-08 11:31 /user/hive/warehouse/ludb.db/test2/pt_d=201808
Querying a specific partition returns no duplicates:
hive> select * from test2 where pt_d='201808';
OK
1 xiaoming ["book","TV","code"] {"beijing":"chaoyang","shagnhai":"pudong"} 201808
2 lilei ["book","code"] {"nanjing":"jiangning","taiwan":"taibei"} 201808
3 lihua ["music","book"] {"heilongjiang":"haerbin"} 201808
4 dalong ["magic","book"] {"shenzhen":"nanshan","guangzhou":"fanyu"} 201808
Time taken: 0.812 seconds, Fetched: 4 row(s)
Adding and dropping table partitions
alter table test2 add partition (pt_d = '201701');
alter table test2 drop partition (pt_d = '201701');
Dropping a partition also deletes the corresponding partition files (for an internal table).
Note: for an external table, DROP PARTITION does not delete the files on HDFS, and the partitions can be synced back from HDFS with msck repair table table_name.
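For example, if partition directories still exist (or reappear) on HDFS but are missing from the metastore, they can be re-registered in one command:

```sql
-- Scans the table's HDFS directory and adds any partitions
-- that exist on disk but are missing from the metastore.
MSCK REPAIR TABLE test2;
```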
Creating an external table:
create external table testlu (id int)
partitioned by (age int)
row format delimited
fields terminated by '\t';
Adding multiple partitions in one statement:
alter table testlu add partition(age=3) partition(age=4);
Note: when loading data that is already in HDFS, do not use the LOCAL keyword. By default, tables are created in the default database;
to create a table in a specific database, switch to that database first, then create the table in it.
If a WHERE-clause query fails like this:
hive> select * from test2 where pt_d='201808';
FAILED: SemanticException MetaException(message:You have an error in your SQL syntax;
check the manual that corresponds to your MySQL server version for the right syntax
to use near 'OPTION SQL_SELECT_LIMIT=DEFAULT' at line 1
The cause is a MySQL driver JAR in Hive's lib directory whose version does not match; download the matching driver and copy it into the lib directory.
INSERT INTO TABLE test
PARTITION (month = '2018-08',day = '2018-08-08')
SELECT * FROM testlu;
This appends all of testlu's data to table test.
INSERT OVERWRITE TABLE test
PARTITION (month = '2018-08',day = '2018-08-08')
SELECT * FROM testlu;
This inserts all of testlu's data into test, overwriting the existing data, i.e. an overwrite insert.
When a query specifies month='2018-08' AND day='2018-08-08', MapReduce reads directly from that one directory.
If only month='2018-08' is specified, MapReduce takes every subdirectory under /month=2018-08/ as input:
with 30 subdirectories, all 30 are scanned.
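In terms of the test table above:

```sql
-- Reads only .../month=2018-08/day=2018-08-08/
SELECT * FROM test WHERE month = '2018-08' AND day = '2018-08-08';

-- Reads every day=* subdirectory under .../month=2018-08/
SELECT * FROM test WHERE month = '2018-08';
```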
Exporting data from a Hive table to the file system
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT * FROM test;
Using Hive's array, map, and struct data types
hive> create table student_test(id INT, info struct<name:STRING, age:INT>)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':';   (unspecified delimiters use system defaults)
OK
Time taken: 0.446 seconds
'FIELDS TERMINATED BY' : the delimiter between fields.
'COLLECTION ITEMS TERMINATED BY' : the delimiter between the items within a single field.
Loading data:
vi test5.txt
1,zhou:30
2,yan:30
3,chen:20
4,li:80
hive> LOAD DATA LOCAL INPATH '/user/work/data/test5.txt' INTO TABLE student_test;
Query:
hive> select info.age from student_test;
Total MapReduce jobs = 1
......
Total MapReduce CPU Time Spent: 490 msec
OK
30
30
20
80
Time taken: 21.677 seconds
Using array
Create the table:
hive> create table class_test(name string, student_id_list array<INT>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY ':';
OK
Time taken: 0.099 seconds
Loading data:
vi test6.txt
034,1:2:3:4
035,5:6
036,7:8:9:10
hive>LOAD DATA LOCAL INPATH '/home/data/test6.txt' INTO TABLE class_test ;
Copying data from file:/home/data/test6.txt
Copying file: file:/home/work/test6.txt
Loading data to table default.class_test
OK
Time taken: 0.198 seconds
Query the data:
hive> select student_id_list[3] from class_test;
Total MapReduce jobs = 1
......
Total MapReduce CPU Time Spent: 480 msec
OK
4
NULL
10
Time taken: 21.574 seconds
Using map
Create the table:
hive> create table employee(id string, perf map<string, int>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';
OK
Time taken: 0.144 seconds
'MAP KEYS TERMINATED BY' : the delimiter between a key and its value.
Loading data:
vi test7.txt
1 job:80,team:60,person:70
2 job:60,team:80
3 job:90,team:70,person:100
hive> LOAD DATA LOCAL INPATH '/home/work/data/test7.txt' INTO TABLE employee;
Query the data:
hive> select perf['person'] from employee;
Total MapReduce jobs = 1
......
Total MapReduce CPU Time Spent: 460 msec
OK
70
NULL
100
Time taken: 20.902 seconds
hive> select perf['person'] from employee where perf['person'] is not null;
Total MapReduce jobs = 1
.......
Total MapReduce CPU Time Spent: 610 msec
OK
70
100
Time taken: 21.989 seconds
Dynamic partitions:
With static partitioning as above, you must know the partition values in advance and write a separate load data statement per partition, which is tedious.
Dynamic partitioning solves this: rows are distributed to partitions based on the values the query produces.
The only real difference from static partitioning is that the partition directory is not specified explicitly; the system picks it. First, enable dynamic partitioning:
hive> set hive.exec.dynamic.partition=true;
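Note that in strict mode (the default in many Hive versions) at least one partition column must still be given a static value; to make every partition column dynamic, the mode must also be relaxed:

```sql
SET hive.exec.dynamic.partition = true;
-- Allow all partition columns to be dynamic (no static column required).
SET hive.exec.dynamic.partition.mode = nonstrict;
```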
Suppose a table par_tab already exists. Its first two columns are name and nation; the last two are the partition columns, sex and dt. The data:
hive> select * from par_tab;
OK
lily china man 2013-03-28
nancy china man 2013-03-28
hanmeimei america man 2013-03-28
jan china man 2013-03-29
mary america man 2013-03-29
lilei china man 2013-03-29
heyong china man 2013-03-29
yiku japan man 2013-03-29
emoji japan man 2013-03-29
Time taken: 1.141 seconds, Fetched: 9 row(s)
Now insert this table's contents directly into another table, par_dnm, with sex as a static partition and dt as a dynamic partition (no specific date given; the system decides):
hive> insert overwrite table par_dnm partition(sex='man',dt)
> select name, nation, dt from par_tab;
After the insert, the directory structure looks like this:
drwxr-xr-x - hadoop supergroup 0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man
drwxr-xr-x - hadoop supergroup 0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-28
-rwxr-xr-x 1 hadoop supergroup 41 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-28/000000_0
drwxr-xr-x - hadoop supergroup 0 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-29
-rwxr-xr-x 1 hadoop supergroup 71 2017-03-29 10:32 /user/hive/warehouse/par_dnm/sex=man/dt=2013-03-29/000000_0
Check the partition count:
hive> show partitions par_dnm;
OK
sex=man/dt=2013-03-28
sex=man/dt=2013-03-29
Time taken: 0.065 seconds, Fetched: 2 row(s)
Note: dynamic partitioning does not allow the parent partition to be dynamic while a child partition is static, since that would require creating the partition defined by the static child column under every dynamic parent partition.
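In other words, static partition columns must come first; compare:

```sql
-- Valid: static parent (sex), dynamic child (dt).
INSERT OVERWRITE TABLE par_dnm PARTITION (sex = 'man', dt)
SELECT name, nation, dt FROM par_tab;

-- Invalid and rejected by Hive: dynamic parent with a static child.
-- INSERT OVERWRITE TABLE par_dnm PARTITION (sex, dt = '2013-03-28') ...
```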
Exploding one column into multiple rows, and Hive's JSON parsing functions:
Exploding a column into rows is a further application of the complex data types above.
Create the table:
hive> create external table class_test3(id string, name string, B array<string>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY '-';
vi b.txt
1,xiaoming,book-TV-code
2,lilei,book-code
3,lihua,music-book
4,dalong,magic-book
load data local inpath '' into table class_test3;
hive> select * from class_test3;
OK
1 xiaoming ["book","TV","code"]
2 lilei ["book","code"]
3 lihua ["music","book"]
4 dalong ["magic","book"]
But what we actually want in the final table is this:
OK
1 xiaoming book
1 xiaoming TV
1 xiaoming code
2 lilei book
2 lilei code
3 lihua music
3 lihua book
4 dalong magic
4 dalong book
Create a new table, and insert the exploded data into it.
create table class_test4(id string, name string, book_name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Insert the exploded data:
insert into table class_test4
select id, name, books from class_test3 lateral view explode(B) class_test3 as books;
Query the exploded data:
hive> select * from class_test4;
OK
1 xiaoming book
1 xiaoming TV
1 xiaoming code
2 lilei book
2 lilei code
3 lihua music
3 lihua book
4 dalong magic
4 dalong book
Time taken: 0.108 seconds, Fetched: 9 row(s)
The explode function is used as: lateral view explode(<column to explode>)
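The general shape of the construct, using the class_test3 table above:

```sql
SELECT t.id, t.name, x.item
FROM class_test3 t
LATERAL VIEW explode(t.B) x AS item;
```

Here x is the lateral-view alias and item the column alias for the exploded values; both can be referenced in the SELECT list alongside the original columns.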
Going further:
Sample data:
1,xiaoming,book-TV-code,beijing-chaoyang-shagnhai-pudong
2,lilei,book-code,nanjing-jiangning-taiwan-taibei
3,lihua,music-book,heilongjiang-haerbin
4,dalong,magic-book,shenzhen-nanshan-guangzhou-fanyu
Create the table:
create external table class_test5(id string, name string, books_name array<string>, city array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-';
Create a new internal table:
create table class_test6(id string, name string, book_name string, city string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Insert the exploded data:
insert into table class_test6
select id, name, books, city from class_test5 lateral view explode(books_name) class_test5 as books
lateral view explode(city) class_test5 as city;
Query the data:
hive> select * from class_test6;
OK
1 xiaoming book beijing
1 xiaoming book chaoyang
1 xiaoming book shagnhai
1 xiaoming book pudong
1 xiaoming TV beijing
1 xiaoming TV chaoyang
1 xiaoming TV shagnhai
1 xiaoming TV pudong
1 xiaoming code beijing
1 xiaoming code chaoyang
1 xiaoming code shagnhai
1 xiaoming code pudong
2 lilei book nanjing
2 lilei book jiangning
2 lilei book taiwan
2 lilei book taibei
2 lilei code nanjing
2 lilei code jiangning
2 lilei code taiwan
2 lilei code taibei
3 lihua music heilongjiang
3 lihua music haerbin
3 lihua book heilongjiang
3 lihua book haerbin
4 dalong magic shenzhen
4 dalong magic nanshan
4 dalong magic guangzhou
4 dalong magic fanyu
4 dalong book shenzhen
4 dalong book nanshan
4 dalong book guangzhou
4 dalong book fanyu
Time taken: 0.106 seconds, Fetched: 32 row(s)
Parsing JSON-formatted data in Hive:
This is mainly used for processing unstructured web log data.
JSON data structure:
{"id": 1701439105,"ids": [2154137571,3889177061,1496915057,……,1663973284],"total_number": 493}
Create the table:
CREATE TABLE IF NOT EXISTS tmp_json_test (
json string
)
STORED AS textfile ;
Load the data:
load data local inpath '/opt/datas/weibotest.json' overwrite into table tmp_json_test;
select get_json_object(t.json,'$.id'), get_json_object(t.json,'$.total_number') from tmp_json_test t ;
select t2.* from tmp_json_test t1 lateral view json_tuple(t1.json, 'id', 'total_number') t2 as c1, c2;
Method one uses the function get_json_object; method two uses json_tuple.
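The two differ in capability and cost: get_json_object takes a JSONPath expression, so it can reach nested fields, while json_tuple only extracts top-level keys but parses each JSON string once for all requested keys, which is cheaper when pulling many fields from the same column. A sketch against the tmp_json_test table above:

```sql
-- get_json_object takes a JSONPath, so it could also reach
-- nested fields (e.g. '$.a.b'); json_tuple cannot.
SELECT get_json_object(t.json, '$.id') FROM tmp_json_test t;

-- json_tuple parses the JSON once for all requested top-level keys.
SELECT t2.c1, t2.c2
FROM tmp_json_test t1
LATERAL VIEW json_tuple(t1.json, 'id', 'total_number') t2 AS c1, c2;
```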
{
"message":"2015/12/08 09:14:4",
"client": "10.108.24.253",
"server": "passport.suning.com",
"request": "POST /ids/needVerifyCode HTTP/1.1",
"server": "passport.sing.co",
"version":"1",
"timestamp":"2015-12-08T01:14:43.273Z",
"type":"B2C","center":"JSZC",
"system":"WAF","clientip":"192.168.61.4",
"host":"wafprdweb03",
"path":"/usr/local/logs/waf.error.log",
"redis":"192.168.24.46"}
select b.* from tmp_json_test3 a lateral view json_tuple(a.json, 'message','client','server','request','server','timestamp',
'type','system','host','path','redis') b
Official example:
The src_json table is a single-column (json), single-row table:
+----+
json
+----+
{"store":
{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}
+----+
The fields of the json object can be extracted using these queries:
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit\[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL