【Dynamic Partitions】Illustrated with a stock case study (Programming Hive, §5.1):
Next up is dynamic partitioning. Inserting into a partitioned table the way shown above requires one INSERT statement per partition, which is tedious when the source data is large.
Worse, in earlier Hive versions every partition had to be created by hand before data could be inserted, which meant you first had to know exactly which partition values occur in the source data.
Dynamic partitioning solves both problems: Hive automatically routes each row returned by the query into the matching partition.
1) Prepare the data file stocks.csv and upload it to HDFS: $>hadoop fs -put stocks.csv /data/stocks
2) Dynamic partitioning can be implemented as follows. Base table stocks:
create external table if not exists stocks (
  exchange1 string, symbol string, ymd string,
  price_open float, price_high float, price_low float, price_close float,
  volume int, price_adj_close float)
row format delimited fields terminated by ','
location '/data/stocks';
3) Dynamically partitioned table stocks_partitions, partitioned by year and month:
hive> create table stocks_partitions (
  exchange1 string, symbol string, ymd string,
  price_open float, price_high float, price_low float, price_close float,
  volume int, price_adj_close float)
partitioned by (year string, month string)
row format delimited fields terminated by ',';
Using dynamic partitions is simple: Hive decides which partition each row belongs to from the values the query produces. The dynamic partition columns must come last in the SELECT list, in the same order as they are declared in the PARTITION clause:
hive> insert overwrite table stocks_partitions partition(year, month)
select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
       volume, price_adj_close,
       substr(ymd,0,4) as year, substr(ymd,6,2) as month
from stocks;
If the following error is reported:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Dynamic partitioning defaults to strict mode (which requires at least one static partition column); switch it to nonstrict:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
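Depending on the Hive version, dynamic partitioning itself may also need to be switched on first (recent releases default it to true):
hive> set hive.exec.dynamic.partition=true;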
If the following error is reported:
Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:933)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:709)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
Check how many partitions the job would create, or raise the dynamic-partition limits:
hive> set hive.exec.max.dynamic.partitions=100000;
hive> set hive.exec.max.dynamic.partitions.pernode=100000;
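These set commands only last for the session. To make the limits permanent, a minimal sketch for hive-site.xml (same values as above):
<property>
  <name>hive.exec.max.dynamic.partitions</name>
  <value>100000</value>
</property>
<property>
  <name>hive.exec.max.dynamic.partitions.pernode</name>
  <value>100000</value>
</property>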
If the following error is reported (the small stocks.csv file will not trigger it):
Diagnostic Messages for this Task:
Container [pid=11171,containerID=container_1563106228073_0002_01_000005] is running beyond virtual memory limits. Current usage: 190.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Fix: by default YARN caps virtual memory at yarn.nodemanager.vmem-pmem-ratio (2.1) times the requested physical memory, which is where the "2.1 GB virtual" limit above comes from; requesting more physical memory per task raises that cap.
1) Stop YARN: stop-yarn.sh
2) Edit mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
  <description>The amount of memory to request from the scheduler for each map task.</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
  <description>The amount of memory to request from the scheduler for each reduce task.</description>
</property>
3) Restart YARN: start-yarn.sh
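An alternative (or complementary) sketch is to relax YARN's virtual-memory check in yarn-site.xml; note this disables the vmem limit entirely rather than raising the request:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
Raising yarn.nodemanager.vmem-pmem-ratio above its 2.1 default is another option.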
-->MapReduce output generated by the INSERT:
Time taken for load dynamic partitions : 1864
Loading partition {year=1991, month=11}
Loading partition {year=2009, month=11}
Loading partition {year=1994, month=7/}
Loading partition {year=1991, month=10}
Loading partition {year=1994, month=9/}
Loading partition {year=2009, month=12}
Loading partition {year=2009, month=10}
Loading partition {year=1994, month=8/}
Loading partition {year=1994, month=5/}
Loading partition {year=1994, month=6/}
Loading partition {year=2007, month=1/}
Loading partition {year=2007, month=2/}
Time taken for adding to write entity : 15
Partition stocksdb.stocks_partitions{year=1991, month=10} stats: [numFiles=1, numRows=11, totalSize=565, rawDataSize=554]
Partition stocksdb.stocks_partitions{year=1991, month=11} stats: [numFiles=1, numRows=8, totalSize=404, rawDataSize=396]
Partition stocksdb.stocks_partitions{year=1994, month=5/} stats: [numFiles=1, numRows=6, totalSize=343, rawDataSize=337]
Partition stocksdb.stocks_partitions{year=1994, month=6/} stats: [numFiles=1, numRows=22, totalSize=1230, rawDataSize=1208]
Partition stocksdb.stocks_partitions{year=1994, month=7/} stats: [numFiles=1, numRows=19, totalSize=1065, rawDataSize=1046]
Partition stocksdb.stocks_partitions{year=1994, month=8/} stats: [numFiles=1, numRows=23, totalSize=1285, rawDataSize=1262]
Partition stocksdb.stocks_partitions{year=1994, month=9/} stats: [numFiles=1, numRows=3, totalSize=162, rawDataSize=159]
Partition stocksdb.stocks_partitions{year=2007, month=1/} stats: [numFiles=1, numRows=6, totalSize=350, rawDataSize=344]
Partition stocksdb.stocks_partitions{year=2007, month=2/} stats: [numFiles=1, numRows=16, totalSize=929, rawDataSize=913]
Partition stocksdb.stocks_partitions{year=2009, month=10} stats: [numFiles=1, numRows=15, totalSize=811, rawDataSize=796]
Partition stocksdb.stocks_partitions{year=2009, month=11} stats: [numFiles=1, numRows=13, totalSize=703, rawDataSize=690]
Partition stocksdb.stocks_partitions{year=2009, month=12} stats: [numFiles=1, numRows=7, totalSize=372, rawDataSize=365]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.31 sec HDFS Read: 13729 HDFS Write: 9053 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
Time taken: 36.078 seconds
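Note the odd partitions such as {year=1994, month=7/} in the log above: for single-digit months the date string looks like 1994/7/15, so substr(ymd,6,2) captures the trailing slash. A more robust sketch, assuming ymd always uses '/'-separated dates, splits the string and zero-pads the month instead:
hive> insert overwrite table stocks_partitions partition(year, month)
select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
       volume, price_adj_close,
       split(ymd,'/')[0] as year, lpad(split(ymd,'/')[1], 2, '0') as month
from stocks;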
-->Query the dynamically partitioned table:
hive (stocksdb)> select * from stocks_partitions;
OK
NASDAQ ARKR 1991/10/31 3.62 3.62 3.5 3.62 2800 2.59 1991 10
NASDAQ ARKR 1991/10/30 3.62 3.75 3.62 3.75 2500 2.68 1991 10
NASDAQ ARKR 1991/10/29 3.5 3.62 3.5 3.5 6500 2.5 1991 10
NASDAQ ARKR 1991/10/28 3.5 3.5 3.5 3.5 600 2.5 1991 10
NASDAQ ARKR 1991/10/25 3.38 3.38 3.25 3.25 2500 2.32 1991 10
NASDAQ ARKR 1991/10/24 3.62 3.75 3.25 3.25 9500 2.32 1991 10
NASDAQ ARKR 1991/10/23 3.5 3.62 3.5 3.62 1000 2.59 1991 10
NASDAQ ARKR 1991/10/22 3.5 3.5 3.5 3.5 2000 2.5 1991 10
NASDAQ ARKR 1991/10/18 3.62 3.62 3.62 3.62 1000 2.59 1991 10
NASDAQ ARKR 1991/10/17 3.62 3.62 3.62 3.62 1700 2.59 1991 10
-->1) Inspect the partition directories on HDFS: hadoop fs -lsr /user
-->2) Or browse them through the NameNode web UI on port 50070
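-->3) Or list the generated partitions from Hive itself:
hive> show partitions stocks_partitions;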
-->Query the partitioned table for year=2009 (partition pruning: only the year=2009 directories are read, not the whole table):
hive (stocksdb)> select * from stocks_partitions where year=2009;
OK
NASDAQ ABXA 2009/10/22 3.23 3.32 3.04 3.3 222900 3.3 2009 10
NASDAQ ABXA 2009/10/21 3.19 3.39 3.11 3.22 316100 3.22 2009 10
NASDAQ ABXA 2009/10/20 3.33 3.33 3.12 3.18 143700 3.18 2009 10
NASDAQ ABXA 2009/10/19 3.31 3.33 3.24 3.32 118100 3.32 2009 10
NASDAQ ABXA 2009/10/16 3.33 3.37 3.27 3.29 196500 3.29 2009 10
NASDAQ ABXA 2009/10/15 3.33 3.39 3.24 3.35 552300 3.35 2009 10
NASDAQ ABXA 2009/10/14 3.36 3.42 3.2 3.33 313700 3.33 2009 10
NASDAQ ABXA 2009/10/13 3.4 3.4 3.21 3.32 222800 3.32 2009 10
NASDAQ ABXA 2009/10/12 3.3 3.46 3.25 3.36 242200 3.36 2009 10
NASDAQ ABXA 2009/10/9 3.22 3.31 3.22 3.31 127400 3.31 2009 10
NASDAQ ABXA 2009/10/8 3.19 3.27 3.05 3.2 550200 3.2 2009 10
NASDAQ ABXA 2009/10/7 3.17 3.25 3.11 3.18 177900 3.18 2009 10
NASDAQ ABXA 2009/10/6 3.15 3.34 3.11 3.18 390600 3.18 2009 10
NASDAQ ABXA 2009/10/5 3.07 3.24 3.06 3.13 238600 3.13 2009 10
NASDAQ ABXA 2009/10/2 3.13 3.19 2.85 3.09 531600 3.09 2009 10
NASDAQ ABXA 2009/11/30 2.36 2.36 2.11 2.25 446100 2.25 2009 11
NASDAQ ABXA 2009/11/27 2.35 2.42 2.3 2.35 135200 2.35 2009 11
NASDAQ ABXA 2009/11/25 2.48 2.49 2.4 2.45 77500 2.45 2009 11
NASDAQ ABXA 2009/11/24 2.35 2.45 2.35 2.43 147700 2.43 2009 11
NASDAQ ABXA 2009/11/23 2.4 2.45 2.31 2.38 222900 2.38 2009 11
NASDAQ ABXA 2009/11/20 2.38 2.6 2.2 2.37 278300 2.37 2009 11
NASDAQ ABXA 2009/11/19 2.45 2.65 2.35 2.38 405900 2.38 2009 11
NASDAQ ABXA 2009/11/18 2.45 2.62 2.4 2.53 251900 2.53 2009 11
NASDAQ ABXA 2009/11/17 2.55 2.55 2.37 2.48 318700 2.48 2009 11
NASDAQ ABXA 2009/11/16 2.6 2.71 2.5 2.56 363400 2.56 2009 11
NASDAQ ABXA 2009/11/13 2.99 2.99 2.34 2.6 653900 2.6 2009 11
NASDAQ ABXA 2009/11/12 3.09 3.15 2.83 2.99 659000 2.99 2009 11
NASDAQ ABXA 2009/11/11 2.95 3.1 2.85 3.09 256700 3.09 2009 11
NASDAQ ABXA 2009/12/9 2.55 2.77 2.5 2.67 158500 2.67 2009 12
NASDAQ ABXA 2009/12/8 2.71 2.74 2.52 2.55 131700 2.55 2009 12
NASDAQ ABXA 2009/12/7 2.65 2.76 2.65 2.71 174200 2.71 2009 12
NASDAQ ABXA 2009/12/4 2.63 2.66 2.53 2.65 230900 2.65 2009 12
NASDAQ ABXA 2009/12/3 2.55 2.62 2.51 2.6 360900 2.6 2009 12
NASDAQ ABXA 2009/12/2 2.41 2.59 2.4 2.53 287700 2.53 2009 12
NASDAQ ABXA 2009/12/1 2.35 2.44 2.27 2.4 302000 2.4 2009 12
Time taken: 1.307 seconds, Fetched: 35 row(s)
-->Query the partitioned table for year=2009 and month=10:
hive (stocksdb)> select * from stocks_partitions where year=2009 and month=10;
OK
NASDAQ ABXA 2009/10/22 3.23 3.32 3.04 3.3 222900 3.3 2009 10
NASDAQ ABXA 2009/10/21 3.19 3.39 3.11 3.22 316100 3.22 2009 10
NASDAQ ABXA 2009/10/20 3.33 3.33 3.12 3.18 143700 3.18 2009 10
NASDAQ ABXA 2009/10/19 3.31 3.33 3.24 3.32 118100 3.32 2009 10
NASDAQ ABXA 2009/10/16 3.33 3.37 3.27 3.29 196500 3.29 2009 10
NASDAQ ABXA 2009/10/15 3.33 3.39 3.24 3.35 552300 3.35 2009 10
NASDAQ ABXA 2009/10/14 3.36 3.42 3.2 3.33 313700 3.33 2009 10
NASDAQ ABXA 2009/10/13 3.4 3.4 3.21 3.32 222800 3.32 2009 10
NASDAQ ABXA 2009/10/12 3.3 3.46 3.25 3.36 242200 3.36 2009 10
NASDAQ ABXA 2009/10/9 3.22 3.31 3.22 3.31 127400 3.31 2009 10
NASDAQ ABXA 2009/10/8 3.19 3.27 3.05 3.2 550200 3.2 2009 10
NASDAQ ABXA 2009/10/7 3.17 3.25 3.11 3.18 177900 3.18 2009 10
NASDAQ ABXA 2009/10/6 3.15 3.34 3.11 3.18 390600 3.18 2009 10
NASDAQ ABXA 2009/10/5 3.07 3.24 3.06 3.13 238600 3.13 2009 10
NASDAQ ABXA 2009/10/2 3.13 3.19 2.85 3.09 531600 3.09 2009 10
Time taken: 0.259 seconds, Fetched: 15 row(s)
Static Partition (SP) columns: in DML/DDL involving multiple partition columns, the columns whose values are known at COMPILE TIME (supplied by the user).
INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11)
SELECT key, value, ds /*, hr*/ FROM srcpart WHERE ds is not null and hr = 11;
(Here hr = 11 is the SP column: its value is given by the user, so it is not selected. ds is a DP column and therefore does appear in the SELECT list.)
Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.
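A mixed SP/DP sketch against the stocks tables above (year static, month dynamic; the static column precedes the dynamic one and is not selected, and this form also satisfies strict mode):
hive> insert overwrite table stocks_partitions partition(year = '2009', month)
-- sketch: static year, dynamic month
select exchange1, symbol, ymd, price_open, price_high, price_low, price_close,
       volume, price_adj_close, substr(ymd,6,2) as month
from stocks where ymd like '2009%';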
Hive Bucketing
What bucketing means: it physically splits a table's data into separate files. The number of buckets in Hive matches the number of reduce tasks that write the table:
Hadoop partitioner: key.hashCode % reduceTasks
Hive bucketing:     column.hashCode % buckets
Hive hashes the bucketing column's value and takes the remainder modulo the bucket count, so every bucket receives a pseudo-random subset of the rows (e.g. a random set of users):
create table bucketed_users(id int,name string)
clustered by (id) into 4 buckets;
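A quick worked example with 4 buckets (assuming Hive's hash of an int column is the value itself, which holds for int): id 17 → 17 % 4 = 1 → file 000001_0; id 25 → 25 % 4 = 1 → also 000001_0; id 22 → 22 % 4 = 2 → 000002_0. This matches the bucket-file contents shown later in this section.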
Base-table data, used to populate the bucketed table via INSERT ... SELECT:
22,zhagnsan
23,lisi
24,wangwu
25,zhaoliu
26,haha
17,lili
22,yiyi
create table student(id int,name string) row format delimited fields terminated by ',';
Load the data:
load data local inpath '/home/hyxy/Desktop/t1.txt' overwrite into table student;
Create the bucketed table:
create table bucket_user(
id int,name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
$>dfs -lsr /user;
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:52 /user/hive/warehouse/stocksdb.db
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:52 /user/hive/warehouse/stocksdb.db/bucket_user
Insert the base-table data into the bucketed table:
insert overwrite table bucket_user select * from student;
If bucket enforcement is not enabled, the INSERT runs as a map-only job (no reducers), as the following output shows:
Query ID = hyxy_20190715103623_b121ffdc-e57e-47a8-9e34-32ab520480a0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1563155465027_0003, Tracking URL = http://master:8088/proxy/application_1563155465027_0003/
Kill Command = /home/hyxy/soft/hadoop/bin/hadoop job -kill job_1563155465027_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
HDFS shows no bucketing effect (no per-bucket files):
drwxrwxrwx - hyxy supergroup 0 2019-07-15 10:36 /user/hive/warehouse/bucket_user
1. Enable bucket enforcement (Hive then derives the reducer count from the table's bucket definition, as the highlighted line below shows):
hive> set hive.enforce.bucketing=true;
Query ID = hyxy_20190530152218_f0b98faf-8c55-40f9-b057-c8d230267db2
Total jobs = 1
Launching Job 1 out of 1
【Number of reduce tasks determined at compile time: 4】
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
HDFS now shows the four bucket files:
drwxrwxrwx - hyxy supergroup 0 2019-07-15 10:39 /user/hive/warehouse/bucket_user
-rwxrwxrwx 3 hyxy supergroup 10 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000000_0
-rwxrwxrwx 3 hyxy supergroup 19 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000001_0
-rwxrwxrwx 3 hyxy supergroup 28 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000002_0
-rwxrwxrwx 3 hyxy supergroup 8 2019-07-15 10:39 /user/hive/warehouse/bucket_user/000003_0
2. Insert a single row: insert into bucket_user values(1,'wangwu');
$>dfs -lsr /user;
drwxr-xr-x - hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000000_0
-rwxr-xr-x 2 hyxy supergroup 9 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000001_0
-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000002_0
-rwxr-xr-x 2 hyxy supergroup 0 2019-05-30 14:38 /user/hive/warehouse/stocksdb.db/bucket_user/000003_0
hive>dfs -cat /user/hive/warehouse/stocksdb.db/bucket_user/000001_0 ;
1,wangwu
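(id 1 hashes to 1 % 4 = 1, so the row lands in bucket file 000001_0, as shown.)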
Insert the base-table data into the bucketed table again:
insert overwrite table bucket_user select * from student;
drwxr-xr-x - hyxy supergroup 0 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 10 /user/hive/warehouse/stocksdb.db/bucket_user/000000_0
-rwxr-xr-x 2 hyxy supergroup 19 /user/hive/warehouse/stocksdb.db/bucket_user/000001_0
-rwxr-xr-x 2 hyxy supergroup 28 /user/hive/warehouse/stocksdb.db/bucket_user/000002_0
-rwxr-xr-x 2 hyxy supergroup 8 /user/hive/warehouse/stocksdb.db/bucket_user/000003_0
Reading the data back from a bucket:
[hyxy@master Desktop]$ hadoop fs -cat /user/hive/warehouse/stocksdb.db/bucket_user/000001_0
17,lili
25,zhaoliu
hive (stocksdb)> select * from bucket_user tablesample(bucket 2 out of 4 on id);
OK
17 lili
25 zhaoliu
Time taken: 0.683 seconds, Fetched: 2 row(s)
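TABLESAMPLE(BUCKET x OUT OF y ON col) reads bucket x and then every y-th bucket after it. Since bucket_user has 4 buckets, a sketch that samples roughly half the rows:
hive> select * from bucket_user tablesample(bucket 1 out of 2 on id);
-- reads buckets 1 and 3, i.e. files 000000_0 and 000002_0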
Checking the bucket structure when data is loaded from a file instead:
/home/hyxy/Desktop/qq.txt
22,zhagnsan
23,lisi
24,wangwu
25,zhaoliu
26,haha
17,lili
22,yiyi
load data local inpath '/home/hyxy/Desktop/qq.txt' overwrite into table stocksdb.bucket_user;
$> Check HDFS (no effect: LOAD DATA simply copies the file into the table directory, so bucket files can only be produced by INSERT ... SELECT):
drwxr-xr-x - hyxy supergroup 0 2019-05-30 15:19 /user/hive/warehouse/stocksdb.db
drwxr-xr-x - hyxy supergroup 0 2019-05-30 15:42 /user/hive/warehouse/stocksdb.db/bucket_user
-rwxr-xr-x 2 hyxy supergroup 44 2019-05-30 15:42 /user/hive/warehouse/stocksdb.db/bucket_user/qq.txt
To summarize partitions versus buckets: a partition is a directory; partitioning splits the data into a clear directory hierarchy that is easy to browse, manage, and prune at query time.
Bucketing goes one step further within a table or partition, hashing rows into a fixed number of files.
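The two combine naturally. A sketch of a hypothetical table that is both partitioned and bucketed (partitioned by year, with rows hashed into 4 bucket files per partition):
create table stocks_part_buck (      -- hypothetical example table
  symbol string, ymd string, price_close float)
partitioned by (year string)
clustered by (symbol) into 4 buckets
row format delimited fields terminated by ',';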