The following article explains the use of partitions very well:
http://www.aahyhaa.com/archives/316
Other reference articles:
http://p-x1984.iteye.com/blog/1156408
http://www.cnblogs.com/tangtianfly/archive/2012/03/13/2393449.html
http://www.2cto.com/kf/201210/160777.html
http://blog.csdn.net/acmilanvanbasten/article/details/17252673
One point is worth adding here, because it was a misconception of mine while learning:
For a table with partition columns, imported data can only be loaded into an explicitly specified partition. I used to assume that during import, Hive would automatically split the rows into partitions based on the column values. What is the difference?
For example, my table is partitioned by city, and I have weather data for several cities, roughly like this:
2014-05-23|07:33:58 China shenzhen rain -28 -21 199
2014-05-23|07:33:58 China hangzhou fine -26 -19 200
2014-05-23|07:33:58 China hangzhou fine 6 14 200
Then I load the data into the table:
load data inpath '/tmp/wetherdata4.txt' into table weatherpartion partition(city='hangzhou');
My expectation: Hive would create two partition directories based on the city column, one named hangzhou and one named shenzhen, putting the shenzhen row into the shenzhen partition and the two hangzhou rows into hangzhou.
In reality, only the hangzhou partition was created, and all three rows were loaded into it. Clearly, the data and the partitioning do not match.
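A quick sanity check confirms this (the group-by query is my own sketch, not part of the original experiment): the partition value comes from the directory, so every row reports city='hangzhou', including the shenzhen record:
select city, count(*) from weatherpartion group by city;
-- expected: a single group, city='hangzhou', with a count of 3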
Thinking more carefully about the use cases for partitioning, my understanding is:
1. When data files are generated, a separate file is produced per value of some field. Log files are the typical case: one file per day, with all of a day's entries in the same file.
2. These separate files are then accumulated into a single table for large-scale analysis.
With that in mind, it is easier to understand why, at load time, one file can only go into one explicitly specified partition (possibly a partition identified by several keys, e.g. city='hangzhou', country='china'), as sketched below.
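As a sketch (the file paths here are hypothetical), the per-city workflow would load each city's file into its own partition with a separate command:
load data inpath '/tmp/weather_hangzhou.txt' into table weatherpartion partition(country='china', city='hangzhou');
load data inpath '/tmp/weather_shenzhen.txt' into table weatherpartion partition(country='china', city='shenzhen');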
Looking again at the CREATE TABLE statement below, the partition columns (city, for example) are not among the columns declared in the table body. In other words, a partition column is not really a data column; it is best thought of as an auxiliary column. To make it usable in HQL, Hive adds the partition columns to the table's column list when the table is created.
For example, suppose every city reports its weather data to some agency, and the agency partitions the table by city. Each city's file can omit the city column entirely, since within one city the value is identical on every row, so writing it adds nothing. At import time the agency simply specifies partition(city='XXX'), and each city's data lands in the corresponding directory. When the weather of a particular city is queried, Hive only reads the raw files under that city's directory, never touching the files of other cities, which is where the efficiency gain comes from.
create table weatherpartion
(date string, weath string,
minTemperat int, maxTemperat int,
pmvalue int) partitioned by (country string, city string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;
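Consistent with the point above, describing the table lists the partition columns alongside the data columns:
describe weatherpartion;
-- shows date, weath, minTemperat, maxTemperat, pmvalue, and then country and city as partition columns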
Hive's partition design is actually true to its core principle: never modify the data files. If Hive partitioned automatically on a column value, it would inevitably have to split one large file into many small ones, which would break that principle.
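One way to see this for yourself (the path is assumed from the default warehouse location; the exact directory layout depends on the partition keys): load data inpath moves the file into the partition directory as-is, keeping its original name instead of rewriting or splitting it:
dfs -ls /user/hive/warehouse/weatherpartion/city=hangzhou;
-- the loaded wetherdata4.txt should appear here unchanged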
To verify the performance benefit, we run an experiment: the same data, once in a table partitioned by pcity (the partition column used in this experiment) and once unpartitioned, comparing the execution speed of the same HQL.
Partition layout:
hive> dfs -ls /user/hive/warehouse/weatherpartion;
Found 6 items
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:33 /user/hive/warehouse/weatherpartion/pcity=beijin
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=guangzhou
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=hangzhou
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=nanjing
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shanghai
drwxr-xr-x - hadoop supergroup 0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shenzhen
Under the same conditions, the same HQL statement is executed several times and the elapsed times are averaged:
1. The unpartitioned weather table: a three-table join with ordering
select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weather wh on (cy.name=wh.city)
join pminfo pm on (pm.pmvalue=wh.pmvalue)
where wh.city='hangzhou' and wh.weath='fine' and wh.minTemperat in
( -18,25,43) order by maxTemperat DESC limit 20;
Executed 5 times; timings as follows:
Job 0: Map: 5 Reduce: 2 Cumulative CPU: 41.52 sec HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.14 sec HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.81 sec HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 470 msec
Time taken: 72.781 seconds, Fetched: 20 row(s)
Job 0: Map: 5 Reduce: 2 Cumulative CPU: 45.71 sec HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.25 sec HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.82 sec HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 780 msec
Time taken: 66.584 seconds, Fetched: 20 row(s)
Job 0: Map: 5 Reduce: 2 Cumulative CPU: 43.55 sec HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.29 sec HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.82 sec HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 660 msec
Time taken: 62.12 seconds, Fetched: 20 row(s)
Job 0: Map: 5 Reduce: 2 Cumulative CPU: 41.09 sec HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.12 sec HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.8 sec HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 10 msec
Time taken: 58.33 seconds, Fetched: 20 row(s)
Job 0: Map: 5 Reduce: 2 Cumulative CPU: 42.68 sec HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.0 sec HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.82 sec HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 49 seconds 500 msec
Time taken: 62.355 seconds, Fetched: 20 row(s)
2. The weather table partitioned by pcity: the same three-table join with ordering
select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weatherpartion wh on (cy.name=wh.city)
join pminfo pm on (pm.pmvalue=wh.pmvalue)
where wh.pcity='hangzhou' and wh.weath='fine' and wh.minTemperat in
( -18,25,43) order by maxTemperat DESC limit 20;
Executed 5 times; timings as follows (three runs shown):
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 10.68 sec HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.35 sec HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.82 sec HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 850 msec
Time taken: 48.127 seconds, Fetched: 20 row(s)
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 10.4 sec HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.31 sec HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.84 sec HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 550 msec
Time taken: 47.386 seconds, Fetched: 20 row(s)
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 10.8 sec HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2 Reduce: 1 Cumulative CPU: 5.38 sec HDFS Read: 7797835 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1 Reduce: 1 Cumulative CPU: 1.85 sec HDFS Read: 7998278 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 30 msec
Time taken: 47.853 seconds, Fetched: 20 row(s)
3. Conclusion
CPU time drops dramatically, but total elapsed time does not improve nearly as much, because scheduling, communication, shuffling, and input splitting still account for part of the time.
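Where the saving comes from can also be seen statically (a sketch, not part of the original runs): EXPLAIN on the partitioned table shows the scan restricted to the pcity=hangzhou directory, whereas the unpartitioned table must scan everything:
explain select count(*) from weatherpartion where pcity='hangzhou';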
4. Dynamic partitions
Static partitioning requires the target partition to be specified at load time; with data spread across many partitions you would have to run N load commands by hand, which is tedious, so Hive provides dynamic partitioning.
That is, Hive can route rows into the matching partitions based on the declared partition columns. Dynamic partitioning has a strict mode and a non-strict mode.
Dynamic partitioning is disabled by default and has to be switched on manually; once enabled, it runs in strict mode by default, which requires at least one partition key to remain static.
Enable dynamic partitioning with: set hive.exec.dynamic.partition=true;
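For reference, the companion setting switches to non-strict mode, which is only needed when no partition key is left static:
set hive.exec.dynamic.partition.mode=nonstrict;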
The example below illustrates this.
Requirement: partition the weather data two levels deep, by city and by weather condition (fine, rain). The example runs in strict mode, so city is the static partition and weath is the dynamic one, together forming the two-level partition.
Step 1: create the target table, declaring the partition columns:
create table weather_sub
(date string, pmvalue int) partitioned by (city string, weath string) -- both partition columns must be declared here
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;
Then run the dynamic-partition insert:
insert overwrite table weather_sub
partition (city='hangzhou',weath)
select w.date,w.pmvalue,w.weath -- note: only 3 columns are selected here, although the target table effectively has 4; the missing one is exactly weather_sub.city
from weather w
where w.city='hangzhou';
-- w.date and w.pmvalue map to the target's date and pmvalue, city takes the static value 'hangzhou' from the partition clause, and w.weath fills the dynamic weath column
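In non-strict mode, both partition columns could be dynamic (a sketch; it requires the nonstrict setting shown earlier, and the dynamic columns must come last in the select list, in partition-clause order):
insert overwrite table weather_sub
partition (city, weath)
select w.date, w.pmvalue, w.city, w.weath
from weather w;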
Verify the result:
hive> select * from weather_sub where city='hangzhou' and weath='fine' limit 5;
OK
2014-05-23|07:33:58 200 hangzhou fine
2014-05-23|07:33:58 200 hangzhou fine
2014-05-23|07:33:58 200 hangzhou fine
2014-05-23|07:33:58 200 hangzhou fine
2014-05-23|07:33:58 200 hangzhou fine
Time taken: 0.101 seconds, Fetched: 5 row(s)
Looking up rows by partition key is fast, and no MapReduce job is launched.
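By contrast, filtering on a non-partition column (a sketch) cannot be pruned and typically falls back to a full scan with a MapReduce job:
select * from weather_sub where pmvalue > 150 limit 5;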
Check the directory layout in HDFS:
hive> dfs -ls /user/hive/warehouse/weather_sub/city=hangzhou;
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=cloudy
drwxr-xr-x - hadoop supergroup 0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=fine
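The same information is available from the metastore directly:
show partitions weather_sub;
-- e.g. city=hangzhou/weath=cloudy and city=hangzhou/weath=fine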
Finally, the command to drop the table:
drop table weather_sub;
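A single partition can also be dropped without removing the whole table:
alter table weather_sub drop partition (city='hangzhou', weath='fine');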