Hive Partitioning and Bucketing

【Dynamic partitioning】illustrated with the stocks example (Programming Hive, §5.1):
   With the static method above, inserting a large source data set into a partitioned table requires one INSERT statement per partition, which is tedious.
   Worse, in earlier Hive versions all partitions had to be created by hand before any data could be inserted, which meant you had to know in advance exactly which partition values the source data contained.
   Dynamic partitioning solves both problems: Hive routes each row of the query result to the matching partition automatically, as the contrast sketch below shows.
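
     For contrast, here is a sketch of the static-partition approach being replaced, assuming the stocks and stocks_partitions tables defined in steps 2) and 3) below:

     -- one INSERT per partition, with the partition values hard-coded:
     hive> insert overwrite table stocks_partitions partition(year='2009', month='10')
           select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
                  volume, price_adj_close
           from stocks where substr(ymd,0,4) = '2009' and substr(ymd,6,2) = '10';
     -- ...and another INSERT like this for every other year/month combination.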


     1) Prepare the data file stocks.csv and upload it to HDFS: $>hadoop fs -put stocks222.csv /data/stocks

     2) Create the external base table stocks over that data:
		create external table if not exists stocks (
		    exchange1 string, symbol string, ymd string,
		    price_open float, price_high float, price_low float, price_close float,
		    volume int, price_adj_close float)
		row format delimited fields terminated by ','
		location '/data/stocks';
     
 3) Create the dynamically partitioned table stocks_partitions, partitioned by year and month:
     hive> create table stocks_partitions (
               exchange1 string, symbol string, ymd string,
               price_open float, price_high float, price_low float, price_close float,
               volume int, price_adj_close float)
           partitioned by (year string, month string)
           row format delimited fields terminated by ',';
      
	Using dynamic partitions is simple: the trailing columns of the SELECT supply the partition values, and Hive decides which partition each row belongs to. Here substr(ymd,0,4) extracts the year and substr(ymd,6,2) the month (Hive's substr is 1-based; an index of 0 also starts at the first character):
    hive> insert overwrite table stocks_partitions partition(year, month)
          select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
                 volume, price_adj_close,
                 substr(ymd,0,4) as year, substr(ymd,6,2) as month
          from stocks;
   
If you get the following error:
    FAILED: SemanticException [Error 10096]: Dynamic partition strict mode 
	requires at least one static partition column. To turn this off set hive.
	exec.dynamic.partition.mode=nonstrict
    Dynamic partitioning defaults to strict mode, which demands at least one static partition column; switch to nonstrict:
	hive> set hive.exec.dynamic.partition.mode=nonstrict;
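
	Alternatively, strict mode can be left on if at least one leading partition column is given a static value; a sketch, pinning year and leaving month dynamic:
	hive> insert overwrite table stocks_partitions partition(year='2009', month)
	      select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
	             volume, price_adj_close, substr(ymd,6,2) as month
	      from stocks where substr(ymd,0,4) = '2009';
	(On very old Hive versions you may also need hive.exec.dynamic.partition=true, though it defaults to true in current releases.)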


If you get the following error (log excerpt, truncated):
    ic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:933)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:709)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)

  Check the current dynamic-partition limits (see below), or raise them:
  hive> set hive.exec.max.dynamic.partitions=100000;
  hive> set hive.exec.max.dynamic.partitions.pernode=100000;
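
  Running set with only a property name prints its current value; on a default install the limits are 1000 in total and 100 per node, which matches the "Maximum was set to: 100" in the error above:
  hive> set hive.exec.max.dynamic.partitions;
  hive.exec.max.dynamic.partitions=1000
  hive> set hive.exec.max.dynamic.partitions.pernode;
  hive.exec.max.dynamic.partitions.pernode=100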


  If you get the following error (a small file like stocks.csv will not trigger it):
  Diagnostic Messages for this Task:
  Container [pid=11171,containerID=container_1563106228073_0002_01_000005] is running beyond virtual memory limits. Current usage: 190.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
  
  Fix:
  1) Stop YARN: stop-yarn.sh
  2) Edit mapred-site.xml:
    <property>
		<name>mapreduce.map.memory.mb</name>
		<value>2048</value>
		<description>The amount of memory to request from the scheduler for each map task.
		</description>
	</property>

	<property>
		<name>mapreduce.reduce.memory.mb</name>
		 <value>2048</value>
		 <description>The amount of memory to request from the scheduler for each reduce task.
		</description>
	</property>
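
	3) Restart YARN: start-yarn.sh

	Since the message above is specifically the virtual-memory check failing (2.2 GB of 2.1 GB virtual memory), an alternative sketch is to relax that check in yarn-site.xml instead of raising the task memory:

	<property>
		<name>yarn.nodemanager.vmem-check-enabled</name>
		<value>false</value>
		<description>Disable the virtual-memory check for containers.</description>
	</property>

	(Or raise yarn.nodemanager.vmem-pmem-ratio, whose default of 2.1 is where the 2.1 GB limit in the message comes from.)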


    -->MR output of the INSERT (note the month=7/ partitions: when the month in ymd is a single digit, substr(ymd,6,2) picks up the following slash):
 Time taken for load dynamic partitions : 1864
Loading partition {year=1991, month=11}
Loading partition {year=2009, month=11}
Loading partition {year=1994, month=7/}
Loading partition {year=1991, month=10}
Loading partition {year=1994, month=9/}
Loading partition {year=2009, month=12}
Loading partition {year=2009, month=10}
Loading partition {year=1994, month=8/}
Loading partition {year=1994, month=5/}
Loading partition {year=1994, month=6/}
Loading partition {year=2007, month=1/}
Loading partition {year=2007, month=2/}
 Time taken for adding to write entity : 15
Partition stocksdb.stocks_partitions{year=1991, month=10} stats: [numFiles=1, numRows=11, totalSize=565, rawDataSize=554]
Partition stocksdb.stocks_partitions{year=1991, month=11} stats: [numFiles=1, numRows=8, totalSize=404, rawDataSize=396]
Partition stocksdb.stocks_partitions{year=1994, month=5/} stats: [numFiles=1, numRows=6, totalSize=343, rawDataSize=337]
Partition stocksdb.stocks_partitions{year=1994, month=6/} stats: [numFiles=1, numRows=22, totalSize=1230, rawDataSize=1208]
Partition stocksdb.stocks_partitions{year=1994, month=7/} stats: [numFiles=1, numRows=19, totalSize=1065, rawDataSize=1046]
Partition stocksdb.stocks_partitions{year=1994, month=8/} stats: [numFiles=1, numRows=23, totalSize=1285, rawDataSize=1262]
Partition stocksdb.stocks_partitions{year=1994, month=9/} stats: [numFiles=1, numRows=3, totalSize=162, rawDataSize=159]
Partition stocksdb.stocks_partitions{year=2007, month=1/} stats: [numFiles=1, numRows=6, totalSize=350, rawDataSize=344]
Partition stocksdb.stocks_partitions{year=2007, month=2/} stats: [numFiles=1, numRows=16, totalSize=929, rawDataSize=913]
Partition stocksdb.stocks_partitions{year=2009, month=10} stats: [numFiles=1, numRows=15, totalSize=811, rawDataSize=796]
Partition stocksdb.stocks_partitions{year=2009, month=11} stats: [numFiles=1, numRows=13, totalSize=703, rawDataSize=690]
Partition stocksdb.stocks_partitions{year=2009, month=12} stats: [numFiles=1, numRows=7, totalSize=372, rawDataSize=365]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.31 sec   HDFS Read: 13729 HDFS Write: 9053 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
Time taken: 36.078 seconds
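
SHOW PARTITIONS confirms what the job created (output abbreviated):
hive> show partitions stocks_partitions;
year=1991/month=10
year=1991/month=11
...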

    -->Query the dynamically partitioned table:
hive (stocksdb)> select * from stocks_partitions;
OK
NASDAQ	ARKR	1991/10/31	3.62	3.62	3.5		3.62	2800	2.59	1991	10
NASDAQ	ARKR	1991/10/30	3.62	3.75	3.62	3.75	2500	2.68	1991	10
NASDAQ	ARKR	1991/10/29	3.5	 	3.62	3.5		3.5		6500	2.5		1991	10
NASDAQ	ARKR	1991/10/28	3.5		3.5		3.5		3.5		600		2.5		1991	10
NASDAQ	ARKR	1991/10/25	3.38	3.38	3.25	3.25	2500	2.32	1991	10
NASDAQ	ARKR	1991/10/24	3.62	3.75	3.25	3.25	9500	2.32	1991	10
NASDAQ	ARKR	1991/10/23	3.5		3.62	3.5		3.62	1000	2.59	1991	10
NASDAQ	ARKR	1991/10/22	3.5		3.5		3.5		3.5		2000	2.5		1991	10
NASDAQ	ARKR	1991/10/18	3.62	3.62	3.62	3.62	1000	2.59	1991	10
NASDAQ	ARKR	1991/10/17	3.62	3.62	3.62	3.62	1700	2.59	1991	10

-->1) Check HDFS: hadoop fs -lsr /user (see the illustrative listing below)
-->2) Or browse the NameNode web UI on port 50070
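
On HDFS, each partition is a directory named after its column=value pair; the layout looks roughly like this (illustrative, paths abbreviated):
hive> dfs -lsr /user/hive/warehouse/stocksdb.db/stocks_partitions;
.../stocks_partitions/year=1991/month=10/000000_0
.../stocks_partitions/year=1991/month=11/000000_0
.../stocks_partitions/year=2009/month=10/000000_0
...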

-->Query year=2009 through the partition column (partition pruning, not a full-table scan):
hive (stocksdb)> select * from stocks_partitions where year=2009;
OK
NASDAQ	ABXA	2009/10/22	3.23	3.32	3.04	3.3		222900	3.3		2009	10
NASDAQ	ABXA	2009/10/21	3.19	3.39	3.11	3.22	316100	3.22	2009	10
NASDAQ	ABXA	2009/10/20	3.33	3.33	3.12	3.18	143700	3.18	2009	10
NASDAQ	ABXA	2009/10/19	3.31	3.33	3.24	3.32	118100	3.32	2009	10
NASDAQ	ABXA	2009/10/16	3.33	3.37	3.27	3.29	196500	3.29	2009	10
NASDAQ	ABXA	2009/10/15	3.33	3.39	3.24	3.35	552300	3.35	2009	10
NASDAQ	ABXA	2009/10/14	3.36	3.42	3.2		3.33	313700	3.33	2009	10
NASDAQ	ABXA	2009/10/13	3.4		3.4		3.21	3.32	222800	3.32	2009	10
NASDAQ	ABXA	2009/10/12	3.3		3.46	3.25	3.36	242200	3.36	2009	10
NASDAQ	ABXA	2009/10/9	3.22	3.31	3.22	3.31	127400	3.31	2009	10
NASDAQ	ABXA	2009/10/8	3.19	3.27	3.05	3.2		550200	3.2		2009	10
NASDAQ	ABXA	2009/10/7	3.17	3.25	3.11	3.18	177900	3.18	2009	10
NASDAQ	ABXA	2009/10/6	3.15	3.34	3.11	3.18	390600	3.18	2009	10
NASDAQ	ABXA	2009/10/5	3.07	3.24	3.06	3.13	238600	3.13	2009	10
NASDAQ	ABXA	2009/10/2	3.13	3.19	2.85	3.09	531600	3.09	2009	10
NASDAQ	ABXA	2009/11/30	2.36	2.36	2.11	2.25	446100	2.25	2009	11
NASDAQ	ABXA	2009/11/27	2.35	2.42	2.3		2.35	135200	2.35	2009	11
NASDAQ	ABXA	2009/11/25	2.48	2.49	2.4		2.45	77500	2.45	2009	11
NASDAQ	ABXA	2009/11/24	2.35	2.45	2.35	2.43	147700	2.43	2009	11
NASDAQ	ABXA	2009/11/23	2.4		2.45	2.31	2.38	222900	2.38	2009	11
NASDAQ	ABXA	2009/11/20	2.38	2.6		2.2		2.37	278300	2.37	2009	11
NASDAQ	ABXA	2009/11/19	2.45	2.65	2.35	2.38	405900	2.38	2009	11
NASDAQ	ABXA	2009/11/18	2.45	2.62	2.4		2.53	251900	2.53	2009	11
NASDAQ	ABXA	2009/11/17	2.55	2.55	2.37	2.48	318700	2.48	2009	11
NASDAQ	ABXA	2009/11/16	2.6		2.71	2.5		2.56	363400	2.56	2009	11
NASDAQ	ABXA	2009/11/13	2.99	2.99	2.34	2.6		653900	2.6		2009	11
NASDAQ	ABXA	2009/11/12	3.09	3.15	2.83	2.99	659000	2.99	2009	11
NASDAQ	ABXA	2009/11/11	2.95	3.1		2.85	3.09	256700	3.09	2009	11
NASDAQ	ABXA	2009/12/9	2.55	2.77	2.5		2.67	158500	2.67	2009	12
NASDAQ	ABXA	2009/12/8	2.71	2.74	2.52	2.55	131700	2.55	2009	12
NASDAQ	ABXA	2009/12/7	2.65	2.76	2.65	2.71	174200	2.71	2009	12
NASDAQ	ABXA	2009/12/4	2.63	2.66	2.53	2.65	230900	2.65	2009	12
NASDAQ	ABXA	2009/12/3	2.55	2.62	2.51	2.6		360900	2.6		2009	12
NASDAQ	ABXA	2009/12/2	2.41	2.59	2.4		2.53	287700	2.53	2009	12
NASDAQ	ABXA	2009/12/1	2.35	2.44	2.27	2.4		302000	2.4		2009	12
Time taken: 1.307 seconds, Fetched: 35 row(s)

-->Query year=2009, month=10 through the partition columns:
hive (stocksdb)> select * from stocks_partitions where year=2009 and month=10;
OK
NASDAQ	ABXA	2009/10/22	3.23	3.32	3.04	3.3	222900	3.3	2009	10
NASDAQ	ABXA	2009/10/21	3.19	3.39	3.11	3.22	316100	3.22	2009	10
NASDAQ	ABXA	2009/10/20	3.33	3.33	3.12	3.18	143700	3.18	2009	10
NASDAQ	ABXA	2009/10/19	3.31	3.33	3.24	3.32	118100	3.32	2009	10
NASDAQ	ABXA	2009/10/16	3.33	3.37	3.27	3.29	196500	3.29	2009	10
NASDAQ	ABXA	2009/10/15	3.33	3.39	3.24	3.35	552300	3.35	2009	10
NASDAQ	ABXA	2009/10/14	3.36	3.42	3.2		3.33	313700	3.33	2009	10
NASDAQ	ABXA	2009/10/13	3.4		3.4		3.21	3.32	222800	3.32	2009	10
NASDAQ	ABXA	2009/10/12	3.3		3.46	3.25	3.36	242200	3.36	2009	10
NASDAQ	ABXA	2009/10/9	3.22	3.31	3.22	3.31	127400	3.31	2009	10
NASDAQ	ABXA	2009/10/8	3.19	3.27	3.05	3.2		550200	3.2		2009	10
NASDAQ	ABXA	2009/10/7	3.17	3.25	3.11	3.18	177900	3.18	2009	10
NASDAQ	ABXA	2009/10/6	3.15	3.34	3.11	3.18	390600	3.18	2009	10
NASDAQ	ABXA	2009/10/5	3.07	3.24	3.06	3.13	238600	3.13	2009	10
NASDAQ	ABXA	2009/10/2	3.13	3.19	2.85	3.09	531600	3.09	2009	10
Time taken: 0.259 seconds, Fetched: 15 row(s)
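
To confirm the WHERE clause is answered by partition pruning rather than a scan, inspect the plan with EXPLAIN; since year and month are string partition columns, quoting the literals avoids an implicit cast:
hive> explain select * from stocks_partitions where year = '2009' and month = '10';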

Static partition (SP) columns: in DML/DDL involving multiple partition columns, the columns whose values are known at COMPILE TIME (given by the user). Hive requires static partition columns to precede dynamic ones, so in this sketch ds is static (the date is just an example value) and hr is dynamic:
INSERT OVERWRITE TABLE T PARTITION (ds = '2010-03-03', hr)
SELECT key, value, hr FROM srcpart WHERE ds = '2010-03-03';

Dynamic partition (DP) columns: columns whose values are only known at EXECUTION TIME.
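
For comparison, the fully dynamic form names every partition column without a value, and the SELECT supplies them all, in order, at the end:
INSERT OVERWRITE TABLE T PARTITION (ds, hr)
SELECT key, value, ds, hr FROM srcpart WHERE ds is not null;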

Hive Bucketing (Hadoop: The Definitive Guide)

What bucketing means: the data is physically split into separate files. The number of buckets in a Hive table equals the number of reduce tasks of the job that writes it.

Hadoop partitioning: key.hashCode % reduceTasks
Hive bucketing:      column.hashCode % buckets

Hive hashes the column value and takes the remainder after dividing by the number of buckets, so any given bucket ends up with a pseudo-random subset of the users:
create table bucketed_users(id int,name string)
clustered by (id) into 4 buckets;

Base table data, used to populate the bucketed table via INSERT ... SELECT:
22,zhagnsan
23,lisi
24,wangwu
25,zhaoliu
26,haha
17,lili
22,yiyi

create table student(id int,name string) row format delimited fields terminated by ',';

Load the data:
load data local inpath '/home/hyxy/Desktop/t1.txt' overwrite into table student;

Create the bucketed table:

create table bucket_user(
id int,name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
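
Since Hive's hash of an int column is the value itself, each sample row's bucket is simply id % 4, which predicts the file assignments (and the 10/19/28/8-byte file sizes) seen below:
id % 4 = 0 -> 000000_0 : 24,wangwu
id % 4 = 1 -> 000001_0 : 25,zhaoliu  17,lili
id % 4 = 2 -> 000002_0 : 22,zhagnsan  26,haha  22,yiyi
id % 4 = 3 -> 000003_0 : 23,lisi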

$>dfs -lsr /user;
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:52 /user/hive/warehouse/stocksdb.db
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:52 /user/hive/warehouse/stocksdb.db/bucket_user

Load the base table data into the bucketed table:
insert overwrite table bucket_user select * from student;


If bucketing enforcement is not enabled, the MapReduce job for the INSERT shows the following (note: no reduce tasks):
Query ID = hyxy_20190715103623_b121ffdc-e57e-47a8-9e34-32ab520480a0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1563155465027_0003, Tracking URL = http://master:8088/proxy/application_1563155465027_0003/
Kill Command = /home/hyxy/soft/hadoop/bin/hadoop job -kill job_1563155465027_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

Checking HDFS shows no bucketing effect:
drwxrwxrwx - hyxy supergroup 0 2019-07-15 10:36 /user/hive/warehouse/bucket_user

1. Enable bucketing enforcement:
hive> set hive.enforce.bucketing=true;
(In Hive 2.x this property was removed; bucketing is always enforced there.)

Query ID = hyxy_20190530152218_f0b98faf-8c55-40f9-b057-c8d230267db2
Total jobs = 1
Launching Job 1 out of 1
【Number of reduce tasks determined at compile time: 4】
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:

HDFS now shows the bucket files:
drwxrwxrwx - hyxy supergroup 0 2019-07-15 10:39 /user/hive/warehouse/bucket_user
-rwxrwxrwx 3 hyxy supergroup 10 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000000_0
-rwxrwxrwx 3 hyxy supergroup 19 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000001_0
-rwxrwxrwx 3 hyxy supergroup 28 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000002_0
-rwxrwxrwx 3 hyxy supergroup 8 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000003_0


2. Insert a single row: insert into bucket_user values(1,'wangwu');
$>dfs -lsr /user;
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000000_0


-rwxr-xr-x 2 hyxy supergroup 9 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000001_0


-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000002_0
-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000003_0

hive> dfs -cat /user/hive/warehouse/stocksdb.db/bucket_user/000001_0;
1,wangwu

Load the base table data into the bucketed table again:
insert overwrite table bucket_user select * from student;

drwxr-xr-x - hyxy supergroup 0 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 10 /user/hive/warehouse/stocksdb.db/bucket_user/000000_0
-rwxr-xr-x 2 hyxy supergroup 19 /user/hive/warehouse/stocksdb.db/bucket_user/000001_0
-rwxr-xr-x 2 hyxy supergroup 28 /user/hive/warehouse/stocksdb.db/bucket_user/000002_0
-rwxr-xr-x 2 hyxy supergroup 8 /user/hive/warehouse/stocksdb.db/bucket_user/000003_0

Reading the data back from one bucket:
[hyxy@master Desktop]$ hadoop fs -cat /user/hive/warehouse/stocksdb.db/bucket_user/000001_0
17,lili
25,zhaoliu

hive (stocksdb)> select * from bucket_user tablesample(bucket 2 out of 4 on id);
OK
17 lili
25 zhaoliu
Time taken: 0.683 seconds, Fetched: 2 row(s)
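
TABLESAMPLE(BUCKET x OUT OF y ON col) numbers buckets from 1, so bucket 2 is file 000001_0, i.e. the rows with id % 4 == 1 (17 and 25), matching the hadoop fs -cat output above. y need not equal the table's bucket count; for example, sampling half of this 4-bucket table reads buckets 1 and 3:
hive (stocksdb)> select * from bucket_user tablesample(bucket 1 out of 2 on id);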

Checking bucketing behavior when data is loaded from a file instead:
/home/hyxy/Desktop/qq.txt
22,zhagnsan
23,lisi
24,wangwu
25,zhaoliu
26,haha
17,lili
22,yiyi

load data local inpath '/home/hyxy/Desktop/qq.txt' overwrite into table stocksdb.bucket_user;

$>Check HDFS: no bucketing effect; LOAD DATA just copies the file, so buckets are only produced by INSERT ... SELECT:
drwxr-xr-x - hyxy supergroup 0 2019-05-30 15:19 /user/hive/warehouse/stocksdb.db
drwxr-xr-x - hyxy supergroup 0 2019-05-30 15:42 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 44 2019-05-30 15:42 /user/hive/warehouse/stocksdb.db/bucket_user/qq.txt

To summarize partitioning and bucketing: a partition is a directory, splitting the data into clearly named directories that are easy to browse and manage;
bucketing refines this one step further, splitting the data within a partition by hashed column value into a fixed number of files.
