1 Partitioned Tables
A partitioned table corresponds to a separate directory on the HDFS file system, and that directory holds all of the partition's data files. A Hive partition is simply a subdirectory: a large dataset is split into smaller datasets according to business needs.
- At query time, an expression in the WHERE clause selects only the partitions the query needs, which greatly improves query efficiency.
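On HDFS, each partition value becomes its own key=value subdirectory under the table directory. A rough sketch of the layout for a table partitioned by month (hypothetical table and file names, default warehouse location):

/user/hive/warehouse/some_table/month=201901/data_1.txt
/user/hive/warehouse/some_table/month=201902/data_2.txt

A query that filters on month only needs to scan the matching subdirectories.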
1.1 Creating a Partitioned Table
create external table if not exists default.emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
partitioned by (month string)
row format delimited fields terminated by '\t';
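Note that the partition column month is not stored inside the data files; it exists only as a directory name, but it can be queried like a regular column. Describing the table lists it under the partition information:

hive (default)> desc emp_partition;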
1.2 Loading Data into a Partitioned Table
load data local inpath '/home/hadoop/emp.txt' into table default.emp_partition
partition (month='201901');
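The load places emp.txt under a month=201901 subdirectory of the table. This can be verified from the Hive CLI (assuming the default warehouse location):

hive (default)> dfs -ls /user/hive/warehouse/emp_partition/month=201901;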
1.3 Querying a Partitioned Table
select * from emp_partition where month='201901';
1.3.1 Partition Query Example
select count(distinct ename) from emp_partition where month='201810'
union
select count(distinct ename) from emp_partition where month='201811'
union
select count(distinct ename) from emp_partition where month='201812';
A query like this can be saved to a SQL file and executed from the shell:
bin/hive -f xx.sql
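An equivalent sketch that covers the same partitions in a single statement (the IN filter on the partition column should still prune down to the three month directories):

select month, count(distinct ename) from emp_partition
where month in ('201810', '201811', '201812')
group by month;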
1.4 Multi-Level Partitions
create external table if not exists default.emp_partition2(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
partitioned by (month string,day string)
row format delimited fields terminated by '\t';
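Loading data into a multi-level partitioned table must specify a value for every partition key (a sketch, reusing the emp.txt file from 1.2):

load data local inpath '/home/hadoop/emp.txt' into table default.emp_partition2
partition (month='201901', day='1');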
select * from emp_partition2 where month='201901' and day='1';
2 Notes on Partitioned Tables
2.1 Non-Partitioned Table
- Create a non-partitioned table
create table if not exists default.dept_nopart(
deptno int,
dname string,
loc string
) row format delimited fields terminated by '\t';
- Upload data directly into the table directory
hive (default)> dfs -put /home/hadoop/dept.txt /user/hive/warehouse/dept_nopart;
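Because dept_nopart is not partitioned, the uploaded file is queryable right away; no extra metadata step is needed:

hive (default)> select * from dept_nopart;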
2.2 Partitioned Table
- Create a partitioned table
create table if not exists default.dept_part(
deptno int,
dname string,
loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
- Create the partition directory manually
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_part/day=20190117;
- Upload data into the partition directory
hive (default)> dfs -put /home/hadoop/dept.txt /user/hive/warehouse/dept_part/day=20190117;
- Querying the table returns no data (see the query below)
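For example:

hive (default)> select * from dept_part where day='20190117';

No rows come back: the directory exists on HDFS, but the metastore does not yet know about the day=20190117 partition.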
2.3 Checking Hive Partition Metadata in MySQL
- Log in to MySQL
mysql -uroot -proot
mysql> use hivemetastore;
mysql> show tables;
+---------------------------+
| Tables_in_hivemetastore |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| IDXS |
| INDEX_PARAMS |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| VERSION |
+---------------------------+
36 rows in set (0.00 sec)
- The partition we just created by hand is not recorded in any of these tables (the PARTITIONS table has no entry for it)
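A direct look at the partition metadata (a sketch, assuming the hivemetastore database shown above):

mysql> select * from PARTITIONS;

The manually created day=20190117 partition does not appear in the result.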
2.4 Repairing the Partitioned Table
2.4.1 Method 1 (common)
hive (default)> msck repair table dept_part;
OK
Partitions not in metastore: dept_part:day=20190117
Repair: Added partition to metastore dept_part:day=20190117
Time taken: 0.103 seconds, Fetched: 2 row(s)
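After the repair, the partition is registered in the metastore and the earlier query returns data:

hive (default)> select * from dept_part where day='20190117';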
2.4.2 Method 2 (less common)
- Create another partition directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_part/day=20190118;
- Upload data
hive (default)> dfs -put /home/hadoop/dept.txt /user/hive/warehouse/dept_part/day=20190118;
- Register the partition in the metastore
alter table dept_part add partition(day='20190118');
2.5 Viewing Partitions
hive (default)> show partitions dept_part;
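With the two partitions added above, the output should look roughly like:

day=20190117
day=20190118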