02 - Hive: Creating One Table from Another, Partitioned Tables, and Bucketing

Note: if you are a beginner, it will help to read my previous post before this one.
Creating Hive tables: http://blog.csdn.net/qq_29622761/article/details/51564680

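Here is what this post covers:

  1. Creating one table from another
  2. Comparing how Hive reads different file formats
  3. Hive partitioned tables
  4. Hive bucketing

Let's get started!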
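1. Creating one table from another
Syntax: create table test3 like test2;
What I will run: create table testtext_c like testtext; (this copies only the table definition, that is, the same columns and storage format; no data is copied)
First, I load some data into the testtext table: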

[root@hadoop1 host]# cat testtext
wer 46
wer 89
weree   78
rr  89
hive> load data local inpath '/usr/host/testtext' into table testtext;
Copying data from file:/usr/host/testtext
Copying file: file:/usr/host/testtext
Loading data to table default.testtext
OK
Time taken: 0.294 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.186 seconds
hive> 

2. Next, create testtext_c (the like approach)

hive> create table testtext_c like testtext;
OK
Time taken: 0.181 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.204 seconds
hive> select * from testtext_c;
OK
Time taken: 0.158 seconds
hive> 

See, testtext_c really has no data in it. I wasn't kidding!
3. Hang on, there is another way (as)

hive> create table testtext_cc as select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 20:49:59,404 null map = 0%,  reduce = 0%
2016-06-01 20:50:20,644 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
2016-06-01 20:50:21,735 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
MapReduce Total cumulative CPU time: 1 seconds 300 msec
Ended Job = job_1464828076391_0004
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 1011778050, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_20-49-43_516_5205177189363939745/-ext-10001
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/testtext_cc
Table default.testtext_cc stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 48.014 seconds

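MapReduce ran again. Why? create table testtext_c like testtext; never touched MapReduce, so why does this statement? The answer is the select keyword. In the Hive version used here, only a plain select * from ... is answered by reading the files directly; any other select (such as this one, which names specific columns) is compiled into a MapReduce job. Under the hood Hive executes queries as MapReduce jobs; see my previous post if you don't believe me.
Let's check whether the data is there: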

hive> select * from testtext_cc;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.116 seconds
hive> 

There it is!
So: create table testtext_cc as select name,addr from testtext; (this approach runs a MapReduce job, and it copies the data as well)
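
To see the two approaches side by side, here is a small sketch. The table names testtext_like_demo and testtext_ctas_demo are made up for this illustration; testtext and its name and addr columns are the ones used above.

```sql
-- LIKE: copies only the definition of testtext (columns, storage format).
-- No MapReduce job runs and the new table starts out empty.
create table testtext_like_demo like testtext;

-- AS SELECT (CTAS): the new table's columns come from the select list,
-- and the selected rows are written into it by a job.
create table testtext_ctas_demo as
select name, addr from testtext where name = 'wer';

-- Compare the row counts of the two new tables.
select count(*) from testtext_like_demo;
select count(*) from testtext_ctas_demo;
```

A where clause like the one above is a common reason to pick the as form: it lets you copy just a subset of the rows.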

4. Next, let's compare how different file formats are read
Hive supports the textfile, sequencefile and rcfile formats, as well as custom file formats.

hive> create table test_text(name string,val string) stored as textfile;
OK
Time taken: 0.098 seconds
hive> desc formatted test_text;
OK
# col_name              data_type               comment             

name                    string                  None                
val                     string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 21:11:15 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/test_text    
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464840675          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.2 seconds
hive> 

Look at the Storage Information section:
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

The input format class is TextInputFormat; the output format class is HiveIgnoreKeyTextOutputFormat.

hive> create table test_seq(name string,val string) stored as sequencefile;
OK
Time taken: 0.097 seconds
hive> desc formatted test_seq;
hive> create table test_rc(name string,val string) stored as rcfile;
OK
Time taken: 0.126 seconds
hive> desc formatted test_rc;

I won't cover custom formats here. We'll come back to them once xielaoshi has leveled up a bit.
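
One thing worth knowing when you try these formats: in the Hive version used here, load data only moves the file into the table's directory and does not convert it, so a plain text file cannot simply be loaded into a sequencefile or rcfile table. A minimal sketch of the usual workaround, assuming the testtext table from earlier (columns name and addr) as the source:

```sql
-- Rewrite the rows of the text table into the other two formats.
-- Hive converts the data while running the insert job.
insert overwrite table test_seq select name, addr from testtext;
insert overwrite table test_rc  select name, addr from testtext;

-- The queries look the same no matter which format is underneath.
select * from test_seq;
select * from test_rc;
```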

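5. Why partition? A Hive select normally scans the entire table, which wastes a lot of time on unnecessary work.
A partitioned table is one whose partition columns are declared with partitioned by when the table is created; a query that filters on those columns only has to read the matching partitions.
Partition syntax:
create table tablename(name string) partitioned by (key type, ...)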

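6. Let's create a partitioned table and play with it:
In the previous post we created three tables: testtable, testtext and xielaoshi. First, run show tables to see which tables exist: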

hive> show tables;
OK
testtable
testtext
xielaoshi
Time taken: 0.264 seconds

If you want to drop a table, do it like this:

hive> drop table testtable;

Create the partitioned table:

hive> create table xielaoshi2(
    > name string,
    > salary float,
    > meinv array<string>,
    > haoche map<string,float>,
    > haoza struct<street:string,city:string,state:string,zip:int>
    > )
    > partitioned by (dt string,type string)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.353 seconds
hive>

Handy tip: you can type the statement in a text editor first and then paste it into the Hive command line, which goes much more smoothly!

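7. Huh? Not sure what this syntax means? The parts that probably confuse you are collection items terminated by ',' and map keys terminated by ':'. Think about it: the values inside a collection have to be separated from one another, and so do the keys and values of a map. Here a comma separates the items of an array, the fields of a struct and the entries of a map, and a colon separates each map key from its value.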
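
To make the delimiters concrete, here is a sketch of what one input line for xielaoshi2 could look like and how the nested values are read back. The sample values are invented for illustration, and <TAB> stands for a real tab character.

```sql
-- One line of input, as it would appear in the data file:
--   xie<TAB>1000.5<TAB>amy,bella<TAB>bmw:30.5,audi:25.0<TAB>1 Main St,Springfield,IL,62701
--
-- '\t' splits the five columns,
-- ','  splits array elements, map entries and struct fields,
-- ':'  splits each map key from its value.

select name,
       meinv[0],        -- first array element
       haoche['bmw'],   -- map value looked up by key
       haoza.city       -- struct field
from xielaoshi2;
```
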
Let's look at the table description!

hive> desc formatted xielaoshi2;
OK
# col_name              data_type               comment             

name                    string                  None                
salary                  float                   None                
meinv                   array<string>           None                
haoche                  map<string,float>       None                
haoza                   struct<street:string,city:string,state:string,zip:int>  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None                
type                    string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 20:09:05 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/xielaoshi2   
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464836945          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    colelction.delim        ,                   
    field.delim             \t                  
    line.delim              \n                  
    mapkey.delim            :                   
    serialization.format    \t                  
Time taken: 0.194 seconds
hive> 

Notice the extra Partition Information section? The table is partitioned by two columns, dt and type.
8. Adding partitions

hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test');
OK
Time taken: 0.188 seconds
hive> 

Not enough, right? Let's add some more partitions:

hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test1');
OK
Time taken: 3.986 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test2');
OK
Time taken: 0.327 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
Time taken: 0.273 seconds
hive> 

What's that? Still not enough? Add even more? You got it!

hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test');
OK
Time taken: 0.224 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test1');
OK
Time taken: 0.275 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test2');
OK
Time taken: 0.323 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
dt=20160519/type=test
dt=20160519/type=test1
dt=20160519/type=test2
Time taken: 0.308 seconds
hive> 

See that? Under each dt partition there are type sub-partitions.
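
Each partition is just a directory under the table's location, so the dt/type nesting can be checked directly. A sketch, assuming the warehouse path reported by desc formatted above; the local file /usr/host/xielaoshi_data is a hypothetical name, not one from this post.

```sql
-- List the sub-partition directories of one dt partition
-- (dfs runs an HDFS command from inside the Hive CLI).
dfs -ls /user/hive/warehouse/xielaoshi2/dt=20160518;

-- Load a file into one specific partition (hypothetical file name).
load data local inpath '/usr/host/xielaoshi_data'
into table xielaoshi2 partition (dt='20160518', type='test');

-- Filtering on the partition columns lets Hive read only that directory
-- instead of scanning the whole table.
select name, salary from xielaoshi2
where dt = '20160518' and type = 'test';
```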

9. Dropping partitions

hive> alter table xielaoshi2 drop if exists partition(dt='20160519',type='test2');
Dropping the partition dt=20160519/type=test2
OK
Time taken: 0.541 seconds
hive> 

Drop all sub-partitions under one partition:

hive> alter table xielaoshi2 drop if exists partition(dt='20160519');
Dropping the partition dt=20160519/type=test
Dropping the partition dt=20160519/type=test1
OK
Time taken: 4.24 seconds
hive> 
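
A quick way to confirm the result is to list the partitions again. Going by the statements run so far, only the three dt=20160518 partitions should be left.

```sql
-- List every remaining partition of the table.
show partitions xielaoshi2;

-- A partition spec can also narrow the listing to one dt value.
show partitions xielaoshi2 partition (dt='20160518');
```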

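10. Bucketing
Bucketing: within each table or partition, Hive can organize the data further into buckets, so a bucket is a finer-grained slice of the data.
How are rows assigned to buckets?
Hive buckets on one specific column.
Hive hashes the column value and takes the remainder after dividing by the number of buckets; that remainder decides which bucket the row is stored in.
Benefits: more efficient query processing, and more efficient sampling (this is the real point!!!)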
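
The hash-and-remainder rule can be tried out on the testtext table. This is only a sketch: it assumes the bucket number is the non-negative hash of the column modulo the bucket count, computed with Hive's built-in hash() function. The exact hash may differ between Hive versions, so treat the numbers as illustrative.

```sql
-- For each row, show which of 4 buckets the name column would map to.
-- "& 2147483647" clears the sign bit so the remainder is never negative.
select name,
       hash(name)                    as hash_value,
       (hash(name) & 2147483647) % 4 as bucket_no
from testtext;
```

Rows with the same column value always get the same bucket number, which is what makes bucket-based sampling and joins possible.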
OK, let's create a bucketed table:

hive> create table bucketed_user(
    > id string,
    > name string
    > )
    > clustered by(id) sorted by(name) into 4 buckets
    > row format delimited fields terminated by '\t' lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.283 seconds
hive> 

View the table description:

hive> desc formatted bucketed_user;
OK
# col_name              data_type               comment             

id                      string                  None                
name                    string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 20:31:39 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/bucketed_user    
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464838299          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            4                        
Bucket Columns:         [id]                     
Sort Columns:           [Order(col:name, order:1)]   
Storage Desc Params:         
    field.delim             \t                  
    line.delim              \n                  
    serialization.format    \t                  
Time taken: 0.363 seconds
hive> 

See Num Buckets: 4; the table is divided into 4 buckets.

hive> select * from bucketed_user;
OK
Time taken: 0.533 seconds
hive> 

Nothing there? Of course not, we haven't inserted any data yet! Let's insert some and see, copying the data from the testtext table into bucketed_user:

hive> insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 21:17:07,755 null map = 0%,  reduce = 0%
2016-06-01 21:17:22,171 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:23,308 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:24,401 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1464828076391_0005
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 180668474, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_21-16-49_815_8186991974761152344/-ext-10000
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 37.79 seconds

hive> select * from bucketed_user;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.273 seconds
hive> 

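Two jobs were launched.
And yet the data was not bucketed! The log shows 0 reducers and the stats line reports num_files: 1, so everything landed in a single file. Why?
Bucketing has to be enforced first with this statement: hive> set hive.enforce.bucketing=true;
Then run the insert again: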

hive> insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 4
2016-06-01 21:24:40,053 null map = 0%,  reduce = 0%
2016-06-01 21:24:54,729 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:55,909 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:57,256 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:58,531 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:59,631 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:00,930 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:02,208 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:03,485 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:04,781 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:05,983 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:07,272 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:08,697 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:09,782 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:11,017 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:12,292 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:13,606 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:14,870 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:17,433 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:18,929 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:20,801 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:22,429 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:24,508 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:26,192 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:27,256 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:31,612 null map = 100%,  reduce = 51%, Cumulative CPU 1.21 sec
2016-06-01 21:25:33,544 null map = 100%,  reduce = 51%, Cumulative CPU 2.94 sec
2016-06-01 21:25:35,433 null map = 100%,  reduce = 94%, Cumulative CPU 4.92 sec
2016-06-01 21:25:39,269 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:40,312 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:41,730 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:42,927 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:44,187 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
MapReduce Total cumulative CPU time: 6 seconds 230 msec
Ended Job = job_1464828076391_0006
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 96.782 seconds
hive> 

Look at the line Hadoop job information for null: number of mappers: 1; number of reducers: 4. Because the table has 4 buckets, 4 reducers were used.
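
Since each reducer writes one bucket file, the table directory should now hold four files, matching num_files: 4 in the stats line above. A quick check, assuming the table location shown by desc formatted:

```sql
-- List the bucket files of the table from inside the Hive CLI.
dfs -ls /user/hive/warehouse/bucketed_user;
```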

Let's look at the data:

hive> select * from bucketed_user;
OK
rr  89
weree   78
wer 89
wer 46
Time taken: 1.112 seconds
hive> select * from testtext where name = 'wer';
OK
wer 46
wer 89
Time taken: 31.796 seconds
hive> 
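
Sampling is where the buckets pay off. A minimal sketch using tablesample on the bucketed table: because the table is clustered by id into 4 buckets, asking for bucket 1 out of 4 on id lets Hive read a single bucket file rather than the whole table.

```sql
-- Read only the first of the 4 buckets.
select * from bucketed_user tablesample(bucket 1 out of 4 on id);

-- Sample roughly half of the data: 2 of the 4 buckets are read.
select * from bucketed_user tablesample(bucket 1 out of 2 on id);
```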

O(∩_∩)O That's all for today, time for a break. If you have read this post and want to learn more or get in touch with me, follow my WeChat official account, named 五十年后.
Thanks a lot!
