当需要进行数据迁移的时候,可能需要在两个不同集群之间移动数据。Hive支持使用EXPORT和IMPORT语句在两个不同环境的HDFS集群中移动数据。
EXPORT
EXPORT语句可以从一个表或者分区中导出数据及其元数据,导出的元数据文件会存储在名为_metadata的文件中,而数据会存储在名为data的子目录中。
语法如下:
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ]
示例:导出表
> EXPORT TABLE employee_internal TO '/tmp/output5';
> !sh hadoop fs -ls -R /tmp/output5
-rw-r--r-- 3 hadoop hadoop 1854 2019-10-01 17:16 /tmp/output5/_metadata
drwxr-xr-x - hadoop hadoop 0 2019-10-01 17:16 /tmp/output5/data
-rw-r--r-- 3 hadoop hadoop 228 2019-10-01 17:16 /tmp/output5/data/000000_0
示例:导出分区
> EXPORT TABLE employee_partitioned PARTITION(year=2012,month=11) TO '/tmp/output6';
> !sh hadoop fs -ls -R /tmp/output6
-rw-r--r-- 3 hadoop hadoop 3195 2019-10-01 17:41 /tmp/output6/_metadata
drwxr-xr-x - hadoop hadoop 0 2019-10-01 17:41 /tmp/output6/year=2012
drwxr-xr-x - hadoop hadoop 0 2019-10-01 17:41 /tmp/output6/year=2012/month=11
-rw-r--r-- 3 hadoop hadoop 47 2019-10-01 17:41 /tmp/output6/year=2012/month=11/000000_0
IMPORT
当数据导出完成后,我们可以手动将导出的文件拷贝到其他集群或者使用 hadoop distcp <srcurl> <desturl> 命令。然后就可以使用IMPORT语句导入数据。如果目标表或分区不存在,IMPORT语句会创建它,表的所有属性和参数会使用由EXPORT导出的表的属性/参数。如果目标存在,则会检查其是否具有合适的schema, Input/OutputFormat等。如果目标表存在且不是分区表,则必须是空表;如果目标表存在且是分区表,则表中不能存在要导入的分区。
语法如下:
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']
示例:导入表
--切换到一个其他的数据库
> use test;
> IMPORT TABLE employee FROM '/tmp/output5';
> select * from employee;
+----------------+-------------------------+-------------------------------+------------------------+----------------------------------------+
| employee.name | employee.work_place | employee.gender_age | employee.skills_score | employee.depart_title |
+----------------+-------------------------+-------------------------------+------------------------+----------------------------------------+
| Michael | ["Montreal","Toronto"] | {"gender":"Male","age":30} | {"DB":80} | {"Product":["Developer^DLead"]} |
| Will | ["Montreal"] | {"gender":"Male","age":35} | {"Perl":85} | {"Product":["Lead"],"Test":["Lead"]} |
| Shelley | ["New York"] | {"gender":"Female","age":27} | {"Python":80} | {"Test":["Lead"],"COE":["Architect"]} |
| Lucy | ["Vancouver"] | {"gender":"Female","age":57} | {"Sales":89,"HR":94} | {"Sales":["Lead"]} |
+----------------+-------------------------+-------------------------------+------------------------+----------------------------------------+
示例:导入分区
> IMPORT TABLE employee_partitioned1 PARTITION(year=2012,month=11) FROM '/tmp/output6'
> show partitions employee_partitioned1
+---------------------+
| partition |
+---------------------+
| year=2012/month=11 |
+---------------------+
示例:以外部表形式导入
> IMPORT EXTERNAL TABLE empolyee_imported_external
FROM '/tmp/output5'
> show create table empolyee_imported_external
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
...
| LOCATION |
| 'hdfs://ns001/tmp/output5/data' |
...
+----------------------------------------------------+
示例:导入时指定路径(Hive3.1.1中测试,指定LOCATION报错)
> IMPORT TABLE empolyee_imported_locations
FROM '/tmp/output5'
LOCATION '/tmp/location_test';
Error: Error while compiling statement: FAILED: SemanticException [Error 10324]: Import Semantic Analyzer Error (state=42000,code=10324)
注意:对于EXPORT,可以在表名之前使用数据库名;对于IMPORT,数据库名是无用的,会被忽略。
参考
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
- 书籍 Apache Hive Essentials Second Edition (by Dayong Du) Chapter 5