Differences between internal (managed) tables and external tables
- When a DROP TABLE command is executed, a managed (internal) table loses both its metadata and its stored data, whereas for an external table only the metadata in the metastore is removed and the stored data is kept. This prevents the situation where several departments share one table and a single department's delete wipes out the data files the others depend on, so it makes sharing data safer.
- A managed table is stored under the Hive warehouse directory by default, while an external table usually points at its own directory via LOCATION, which makes it easy to keep the data outside the warehouse (see the sketch after this list).
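For example, an external table whose data lives outside the warehouse directory can be declared roughly like this (a minimal sketch; the table name logs_ext and the HDFS path /data/logs are made up for illustration):

create external table hive1.logs_ext(id int, msg string)
row format delimited fields terminated by '\t'
location '/data/logs';

Dropping hive1.logs_ext would only remove its metadata; the files under /data/logs would stay where they are.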
Hive partitioned tables
- Why partitioned tables?
If a whole year's or a whole month's worth of log files is stored in a single table, the data volume gets very large, and querying that table for the logs of one particular day becomes very slow. In that case the table can be partitioned, splitting it by time into smaller pieces. After the split, querying the log files for a given point in time is much faster, because no full-table scan is needed.
- Partitioned tables in Hive
A partition in Hive divides a table's data coarsely by the value of a "partition column". A table in Hive corresponds to one directory, and for each partition-column value a subdirectory is created under that table directory, named after the partition column and its value (for example province=henan). Partition columns are declared much like ordinary columns, but they are independent of the columns stored in the data files. Partitioning speeds up queries because the whole table no longer has to be scanned (a sketch follows this list).
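As a concrete illustration of the idea above, a log table partitioned by day could be declared as follows (a minimal sketch; the table name logs and the partition column dt are assumptions made for this example, not part of the walkthrough below). A query that filters on dt only reads the matching subdirectory instead of scanning the whole table:

create table hive1.logs(ts bigint, line string)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- only the dt='2018-06-30' subdirectory is read, no full-table scan
select * from hive1.logs where dt = '2018-06-30';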
Create a managed table:
hive (hive1)> create table hive1.test1(id int,name string,age int) row format delimited fields terminated by '\t' lines terminated by '\n';
hive (hive1)> desc formatted test1; //show the formatted table information
OK
col_name data_type comment
# col_name data_type comment
id int
name string
age int
# Detailed Table Information
Database: hive1
Owner: hadoop
CreateTime: Sat Jun 30 00:29:22 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/hive1.db/test1
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 1
numRows 0
rawDataSize 0
totalSize 99
transient_lastDdlTime 1530289773
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
Time taken: 0.139 seconds, Fetched: 33 row(s)
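Besides desc formatted, the complete DDL of a table can be printed with show create table, which is handy for reproducing the table elsewhere (shown here only as a sketch, without its output):

show create table hive1.test1;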
Prepare the data (the columns in employees.txt are separated by tabs, matching the '\t' field delimiter):
employees.txt
1201 Gopal 45000
1202 Manisha 45000
1203 Masthanvali 40000
1204 Krian 40000
1205 Kranthi 30000
Load data into the table:
hive (hive1)> load data local inpath '/home/hadoop/employees.txt' into table hive1.test1;
Loading data to table hive1.test1
OK
Time taken: 0.725 seconds
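load data local inpath copies the file from the local filesystem into the table's directory. Two common variants, sketched below with a made-up HDFS path /tmp/employees.txt: without local the source file must already be on HDFS and is moved (not copied) into the table, and overwrite replaces whatever the table already contains instead of appending.

-- move a file that already lives on HDFS into the table (the source file is moved, not copied)
load data inpath '/tmp/employees.txt' into table hive1.test1;

-- replace the table's current contents instead of appending
load data local inpath '/home/hadoop/employees.txt' overwrite into table hive1.test1;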
Query the table:
hive (hive1)> select * from test1;
OK
test1.id test1.name test1.age
1201 Gopal 45000
1202 Manisha 45000
1203 Masthanvali 40000
1204 Krian 40000
1205 Kranthi 30000
Inspect and drop the table (after the drop, the second listing below comes back empty, because dropping a managed table also deletes its data directory on HDFS)
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:29 /user/hive/warehouse/hive1.db/test1
-rwxrwxrwx 3 hadoop supergroup 99 2018-06-30 00:29 /user/hive/warehouse/hive1.db/test1/employees.txt
hive (hive1)> drop table test1;
OK
Time taken: 1.31 seconds
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db;
lsr: DEPRECATED: Please use 'ls -R' instead.
External tables
---------------------------------
Hive only manages the metadata of an external table; when the external table is dropped, the data is not deleted.
Create an external table:
hive (hive1)> create external table hive1.test2 like hive1.test1; //create an external table; 'like' copies only the table structure, without data | 'as' copies the structure together with the data (sketched below)
OK
Time taken: 0.163 seconds
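The 'as' variant mentioned in the comment above (CTAS) copies the rows as well as the structure, but it always produces a managed table and derives its schema from the query. A minimal sketch, using the made-up name test1_copy:

-- new managed table whose schema and rows both come from the query
create table hive1.test1_copy as select * from hive1.test1;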
Inspect the table's properties:
hive (hive1)> desc formatted test2; //show the formatted table information
OK
col_name data_type comment
# col_name data_type comment
id int
name string
age int
# Detailed Table Information
Database: hive1
Owner: hadoop
CreateTime: Sat Jun 30 00:35:50 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/hive1.db/test2
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1530290150
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
Time taken: 0.101 seconds, Fetched: 35 row(s)
Load data into the table:
hive (hive1)> load data local inpath '/home/hadoop/employees.txt' into table hive1.test2;
Loading data to table hive1.test2
OK
Time taken: 0.725 seconds
hive (hive1)> select * from test2;
OK
test2.id test2.name test2.age
1201 Gopal 45000
1202 Manisha 45000
1203 Masthanvali 40000
1204 Krian 40000
1205 Kranthi 30000
Time taken: 0.142 seconds, Fetched: 5 row(s)
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:35 /user/hive/warehouse/hive1.db/test1
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:36 /user/hive/warehouse/hive1.db/test2
-rwxrwxrwx 3 hadoop supergroup 99 2018-06-30 00:36 /user/hive/warehouse/hive1.db/test2/employees.txt
hive (hive1)> drop table test2;
OK
Time taken: 0.161 seconds
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:35 /user/hive/warehouse/hive1.db/test1
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:36 /user/hive/warehouse/hive1.db/test2
-rwxrwxrwx 3 hadoop supergroup 99 2018-06-30 00:36 /user/hive/warehouse/hive1.db/test2/employees.txt
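The listing shows that test2's data file survived the drop, which is exactly what external tables are for. As an aside, an existing table can also be flipped between managed and external by changing the EXTERNAL table property; a minimal sketch with a hypothetical table name some_table:

-- turn a managed table into an external one ('FALSE' converts it back)
alter table hive1.some_table set tblproperties('EXTERNAL'='TRUE');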
Partitioned tables
----------------------------------------
Create a partitioned table
hive (hive1)> create table hive1.test3(id int,name string,age int ) partitioned by (province string,city string) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 0.098 seconds
Prepare the data:
employees2.txt
1206 Actualopal 45000
1207 SLF4JManisha 45000
1208 NMasthanv 40000
1206 Actualopal 45000
1207 SLF4JManisha 45000
1208 NMasthanv 40000
Load the data
hive (hive1)> load data local inpath '/home/hadoop/employees2.txt' into table hive1.test3 partition (province='henan',city ='zhengzhou');
Loading data to table hive1.test3 partition (province=henan, city=zhengzhou)
OK
Time taken: 1.115 seconds
View the directory layout with a shell command (the province=henan/city=shangqiu partition in the listing was populated by an earlier load of employees.txt done in the same way):
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db/test3;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:46 /user/hive/warehouse/hive1.db/test3/province=henan/city=shangqiu
-rwxrwxrwx 3 hadoop supergroup 99 2018-06-30 00:46 /user/hive/warehouse/hive1.db/test3/province=henan/city=shangqiu/employees.txt
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan/city=zhengzhou
-rwxrwxrwx 3 hadoop supergroup 71 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan/city=zhengzhou/employees2.txt
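load data ... partition(...) always writes into one fixed (static) partition. When the partition values are carried in the data itself, rows can instead be routed into partitions dynamically with insert ... select; a minimal sketch, assuming a hypothetical staging table hive1.staging_emp that has province and city columns:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- the trailing select columns (province, city) decide which partition each row lands in
insert into table hive1.test3 partition (province, city)
select id, name, age, province, city from hive1.staging_emp;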
[Query the partitioned table]
hive (hive1)> select * from hive1.test3 where province='henan' and city='shangqiu';
OK
test3.id test3.name test3.age test3.province test3.city
1201 Gopal 45000 henan shangqiu
1202 Manisha 45000 henan shangqiu
1203 Masthanvali 40000 henan shangqiu
1204 Krian 40000 henan shangqiu
1205 Kranthi 30000 henan shangqiu
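Because the where clause filters on the partition columns, Hive prunes partitions and only reads the province=henan/city=shangqiu directory. On Hive releases of this vintage, strict mode can be used to force such a filter (a sketch; the exact setting differs across versions):

-- with strict mode on, queries against a partitioned table must filter on a partition column
set hive.mapred.mode=strict;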
[List the partitions of a partitioned table]
hive (hive1)> show partitions hive1.test3;
OK
partition
province=henan/city=shangqiu
province=henan/city=zhengzhou
Time taken: 0.122 seconds, Fetched: 2 row(s)
hive (hive1)> show partitions hive1.test3 partition(province='henan'); //list only the partitions that match the given partial spec
OK
partition
province=henan/city=shangqiu
province=henan/city=zhengzhou
Time taken: 0.149 seconds, Fetched: 2 row(s)
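To inspect the detailed metadata of one specific partition (its own location, file count and so on), desc formatted also accepts a full partition spec; a sketch:

desc formatted hive1.test3 partition (province='henan', city='zhengzhou');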
[Manually add a partition]
hive (hive1)> alter table hive1.test3 add partition (province='hebei',city='baodong');
OK
Time taken: 0.201 seconds
hive (hive1)> dfs -lsr /user/hive/warehouse/hive1.db/test3;
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:55 /user/hive/warehouse/hive1.db/test3/province=hebei
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:55 /user/hive/warehouse/hive1.db/test3/province=hebei/city=baodong
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:46 /user/hive/warehouse/hive1.db/test3/province=henan/city=shangqiu
-rwxrwxrwx 3 hadoop supergroup 99 2018-06-30 00:46 /user/hive/warehouse/hive1.db/test3/province=henan/city=shangqiu/employees.txt
drwxrwxrwx - hadoop supergroup 0 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan/city=zhengzhou
-rwxrwxrwx 3 hadoop supergroup 71 2018-06-30 00:50 /user/hive/warehouse/hive1.db/test3/province=henan/city=zhengzhou/employees2.txt
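A partition added this way starts out empty (the baodong directory above contains no files yet). Partitions can also be registered against an existing directory, or removed again; a minimal sketch, where the city value handan and the path /data/test3/hebei/handan are made up:

-- register a partition whose data already sits in a custom directory
alter table hive1.test3 add partition (province='hebei', city='handan')
location '/data/test3/hebei/handan';

-- remove a partition (for a managed table like test3 this also deletes that partition's data)
alter table hive1.test3 drop partition (province='hebei', city='baodong');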