(7-9) Creating the Four Kinds of Hive Tables


--------------------------------------------- Hive table types --------------------------------------

----------------------------- 1. Managed Tables -------------------------------------
1.1 All table data is stored under the warehouse directory;
1.2 When the table is dropped, both the metadata and the data are deleted;
1.3 Table creation and data loading can be done in a single statement (a sketch follows below);


Also called an internal table or a controlled table.
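
//(Sketch for 1.3) A managed table can be created and populated in one statement with CREATE TABLE ... AS SELECT. This is only an illustrative sketch, assuming a source table such as inner_table already exists; the name inner_table_copy is made up:
create table inner_table_copy as
select key from inner_table;   -- the table is created and filled with the query result in one step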

//1.1 Prepare the data to upload
[root@i-love-you ~]# more inner_table.dat
hadoop
hdfs
mapreduce
hive
hbase
[root@i-love-you ~]#


//1.2 Create the internal table
hive> create table inner_table(key string);
OK
Time taken: 0.29 seconds

//1.3 Load the data
hive> load data local inpath '/root/inner_table.dat' into table inner_table;
Loading data to table mydb.inner_table
Table mydb.inner_table stats: [numFiles=1, totalSize=33]
OK
Time taken: 4.917 seconds

//1.4 Query all rows in the table
hive> select * from inner_table;
OK
hadoop
hdfs
mapreduce
hive
hbase
Time taken: 1.086 seconds, Fetched: 5 row(s)

//1.5 Count the rows in the table; this launches a MapReduce job.
hive> select count(*) from inner_table;
Query ID = root_20150428214949_d311da69-73d7-4278-8c1e-667a461aa287
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1430222047693_0001, Tracking URL = http://i-love-you:8088/proxy/application_1430222047693_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1430222047693_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-04-28 21:51:05,875 Stage-1 map = 0%,  reduce = 0%
2015-04-28 21:51:59,258 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.94 sec
2015-04-28 21:52:42,454 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.94 sec
MapReduce Total cumulative CPU time: 6 seconds 940 msec
Ended Job = job_1430222047693_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.94 sec   HDFS Read: 267 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 940 msec
OK
5
Time taken: 189.992 seconds, Fetched: 1 row(s)
hive>

//1.6 Drop the table (both the metadata and the warehouse data are removed)
hive> drop table inner_table;
Moved: 'hdfs://192.168.1.10:9000/user/hive/warehouse/mydb.db/inner_table' to trash at: hdfs://192.168.1.10:9000/user/root/.Trash/Current
OK
Time taken: 21.272 seconds
hive> show tables;
OK
t1
t2
Time taken: 0.136 seconds, Fetched: 2 row(s)
hive>


----------------------------- 2. External Tables -------------------------------------
An external table is one created with the EXTERNAL keyword.
2.1 Dropping the table removes only the metadata in the metastore; the table data in HDFS is not deleted;
2.2 Creating the table and "loading" its data happen together: the data is not moved into the warehouse, the table merely establishes a link to the external data, and dropping the table removes only that link;
2.3 It can point at data already in HDFS, and partitions can be created (see the LOCATION sketch below);
2.4 External tables are organized in the metastore the same way as internal tables;
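
//(Sketch for 2.3) An external table can point straight at data that already sits in HDFS by adding a LOCATION clause; nothing is moved into the warehouse. The path below is a made-up example:
create external table ext_existing(name string, age int)
row format delimited fields terminated by "\t"
location '/data/existing_dir';   -- existing HDFS directory; hypothetical path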

//Data to be uploaded
[root@i-love-you ~]# more external_table.dat
hadoop  23
hbase   24
[root@i-love-you ~]#

//Create the external table
hive> create external table external_table(name string,age int)row format delimited fields terminated by "\t";
OK
Time taken: 0.754 seconds

//View the table structure
hive> desc extended external_table;
OK
name                    string
age                     int

Detailed Table Information      Table(tableName:external_table, dbName:mydb, owner:root, createTime:1430230920, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null)], location:hdfs://192.168.1.10:9000/user/hive/warehouse/mydb.db/external_table, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=        , field.delim=
Time taken: 0.208 seconds, Fetched: 4 row(s)
hive>

//Load the data
hive> load data local inpath "/root/external_table.dat" into table external_table;
Loading data to table mydb.external_table
Table mydb.external_table stats: [numFiles=1, totalSize=19]
OK
Time taken: 0.837 seconds
hive>

//Query the data
hive> select * from external_table;
OK
hadoop  23
hbase   24
Time taken: 0.177 seconds, Fetched: 2 row(s)

//Drop the table
hive> drop table external_table;
OK
Time taken: 20.949 seconds
hive> 

//The table has been dropped; recreate it and the data can be queried right away, without loading it again
hive> show tables;
OK
t1
t2
Time taken: 0.068 seconds, Fetched: 2 row(s)
hive> create external table external_table(name string,age int)row format delimited fields terminated by "\t";
OK
Time taken: 0.206 seconds
hive> select * from external_table;
OK
hadoop  23
hbase   24
Time taken: 0.154 seconds, Fetched: 2 row(s)
hive>







----------------------------- 3. Partitioned Tables -------------------------------------

3.1 A partition can be thought of as a category: data of different categories is placed in different directories;
3.2 The categorization criterion is the partition column; there can be one or several;
3.3 Partitioned tables exist to speed up queries. Filter on the partition column whenever possible; without it, every partition is scanned (see the sketch below);
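
//(Sketch for 3.3) Each partition value becomes its own subdirectory under the table directory (for example .../t1/class=job1), so a query that filters on the partition column reads only that directory. Using the t1 table created below:
select * from t1 where class = "job1";   -- scans only the class=job1 directory
select * from t1;                        -- no partition filter: every partition directory is scanned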


//Prepare the data to load
[root@baozi ~]# vim aa
[root@baozi ~]# more aa
hadoop  23
hive    24
hbase   25
hdfs    26
mapreduce       27
spark   28

[root@baozi ~]#


//Create the table:
hive> create table t1(name string,age int) partitioned by (class string) row format delimited fields terminated by "\t";
OK
Time taken: 1.002 seconds

//Load the data into a partition
hive> load data local inpath "/root/aa" into table t1 partition(class="job1");
Loading data to table default.t1 partition (class=job1)
Partition default.t1{class=job1} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
OK
Time taken: 3.684 seconds
hive>


//Query the data of a specific partition
hive> select * from t1 where class="job1";
OK
hadoop  23      job1
hive    24      job1
hbase   25      job1
hdfs    26      job1
mapreduce       27      job1
spark   28      job1
        NULL    job1
Time taken: 1.747 seconds, Fetched: 7 row(s)
hive>


//Add a partition
hive> alter table t1 add partition(class="job2");
OK
Time taken: 0.247 seconds
hive> show partitions t1;
OK
class=job1
class=job2
Time taken: 0.323 seconds, Fetched: 2 row(s)
hive>



//Drop a partition
hive> alter table t1 drop partition(class="job2");
Dropped the partition class=job2
OK
Time taken: 1.907 seconds
hive> show partitions t1;
OK
class=job1
Time taken: 0.143 seconds, Fetched: 1 row(s)
hive>





//Create a table with multiple partition columns:

hive> create table t2(name string,age int) partitioned by (class string,city string)row format delimited fields terminated by "\t";
OK
Time taken: 0.117 seconds

//View the table structure
hive> desc t2;
OK
name                    string
age                     int
class                   string
city                    string

# Partition Information
# col_name              data_type               comment

class                   string
city                    string
Time taken: 0.186 seconds, Fetched: 10 row(s)


//Load the data
hive> load data local inpath "/root/aa" into table t2 partition(class="job1",city="bj");
Loading data to table default.t2 partition (class=job1, city=bj)
Partition default.t2{class=job1, city=bj} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
OK
Time taken: 0.73 seconds

//Query the data
hive> select * from t2;
OK
hadoop  23      job1    bj
hive    24      job1    bj
hbase   25      job1    bj
hdfs    26      job1    bj
mapreduce       27      job1    bj
spark   28      job1    bj
        NULL    job1    bj
Time taken: 0.264 seconds, Fetched: 7 row(s)
hive>
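
//(Sketch) With two partition columns, every partition is identified by both values; both can be added by hand or used as filters. The class="job2"/city="sh" values below are made up:
alter table t2 add partition(class="job2", city="sh");      -- creates directory class=job2/city=sh
select * from t2 where class = "job1" and city = "bj";      -- prunes to a single partition directory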






----------------------------- 4. Bucketed Tables -------------------------------------

A bucketed table hashes the data and stores the rows in different files.

4.1 When data is loaded into a bucketed table, the bucketing column is hashed and the hash is taken modulo the number of buckets to decide which file each row goes to (see the sketch below).
4.2 Physically, each bucket is one file in the table (or partition) directory.
4.3 The number of buckets (output files) a job produces equals the number of reduce tasks.
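
//(Sketch for 4.1) The bucket a row lands in is roughly hash(bucketing column) modulo the number of buckets. Hive's built-in hash() and pmod() functions can illustrate the computation; the exact values may differ from the internal bucketing hash depending on the Hive version:
select id, pmod(hash(id), 4) as bucket_no from bucket_table;   -- which of the 4 buckets each id maps to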


//Create the table
hive> create table bucket_table (id string,num int) clustered by (id) into 4 buckets;
OK
Time taken: 0.134 seconds

//Enable bucketing
hive> set hive.enforce.bucketing=true;
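//(Note) On Hive versions of this era, hive.enforce.bucketing=true makes Hive pick a reducer count equal to the bucket count automatically; without it you would have to set it yourself, roughly as follows (version-dependent):
set mapreduce.job.reduces=4;   -- manual alternative; matches the 4 buckets declared above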

//Insert data into the bucketed table; this launches a MapReduce job.
hive> insert into table bucket_table select name,age from t2;
Query ID = root_20150429174242_56181777-f50c-472e-b9d5-d1ce6870d949
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1430296708269_0001, Tracking URL = http://baozi:8088/proxy/application_1430296708269_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1430296708269_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2015-04-29 17:43:39,453 Stage-1 map = 0%,  reduce = 0%
2015-04-29 17:44:39,732 Stage-1 map = 0%,  reduce = 0%
2015-04-29 17:45:39,891 Stage-1 map = 0%,  reduce = 0%
2015-04-29 17:46:07,371 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.16 sec
2015-04-29 17:47:08,351 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.75 sec
2015-04-29 17:48:30,177 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.75 sec
2015-04-29 17:49:30,097 Stage-1 map = 100%,  reduce = 58%, Cumulative CPU 4.95 sec
2015-04-29 17:49:44,561 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 10.85 sec
2015-04-29 17:49:49,900 Stage-1 map = 100%,  reduce = 88%, Cumulative CPU 15.02 sec
2015-04-29 17:50:16,669 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 16.2 sec
2015-04-29 17:50:19,006 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 16.67 sec
2015-04-29 17:51:20,269 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 21.87 sec
MapReduce Total cumulative CPU time: 21 seconds 870 msec
Ended Job = job_1430296708269_0001
Loading data to table default.bucket_table
Table default.bucket_table stats: [numFiles=4, numRows=7, totalSize=61, rawDataSize=54]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 21.87 sec   HDFS Read: 282 HDFS Write: 335 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 870 msec
OK
Time taken: 575.149 seconds
hive>


[root@baozi ~]# hdfs dfs -ls /user/hive/warehouse/bucket_table
Found 4 items
-rw-r--r--   1 root supergroup         12 2015-04-29 17:50 /user/hive/warehouse/bucket_table/000000_0
-rw-r--r--   1 root supergroup          0 2015-04-29 17:49 /user/hive/warehouse/bucket_table/000001_0
-rw-r--r--   1 root supergroup         13 2015-04-29 17:50 /user/hive/warehouse/bucket_table/000002_0
-rw-r--r--   1 root supergroup         36 2015-04-29 17:50 /user/hive/warehouse/bucket_table/000003_0
[root@baozi ~]#


//View the contents of the bucket files on HDFS:
[root@baozi ~]# hdfs dfs -text /user/hive/warehouse/bucket_table/000000_0
\N
hive24
[root@baozi ~]# hdfs dfs -text /user/hive/warehouse/bucket_table/000002_0
mapreduce27
[root@baozi ~]# hdfs dfs -text /user/hive/warehouse/bucket_table/000003_0
spark28
hdfs26
hbase25
hadoop23
[root@baozi ~]# hdfs dfs -text /user/hive/warehouse/bucket_table/000001_0
[root@baozi ~]#


//View the bucketed table's data
hive> select * from bucket_table;
OK
        NULL
hive    24
mapreduce       27
spark   28
hdfs    26
hbase   25
hadoop  23
Time taken: 1.136 seconds, Fetched: 7 row(s)
hive>


//View the data of a specific bucket number
hive> select * from bucket_table tablesample(bucket 3 out of 4 on id);
OK
mapreduce       27
Time taken: 0.241 seconds, Fetched: 1 row(s)
hive>
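
//(Sketch) TABLESAMPLE(BUCKET x OUT OF y ON id) conceptually splits the table into y buckets and returns bucket x; when y divides the table's bucket count, Hive can read just the matching bucket files. For example, sampling roughly half of this 4-bucket table (output not shown):
select * from bucket_table tablesample(bucket 1 out of 2 on id);   -- should read physical buckets 1 and 3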


//Use LIMIT to view the first three rows of the bucketed table
hive> select * from bucket_table limit 3;
OK
        NULL
hive    24
mapreduce       27
Time taken: 0.072 seconds, Fetched: 3 row(s)
hive>








