Recently my company needed to migrate our Hadoop cluster and the Hive data it hosts. Completing the migration of all the data while keeping the existing business running stably is not an easy process. I have summarized the workflow below as a reference for other administrators of Hadoop clusters and Hive data warehouses.
1. First of all, make sure the new Hadoop cluster and Hive are installed and configured properly. To avoid running into unnecessary trouble, the new cluster uses exactly the same versions as the old one, Hadoop 2.6.3 and Hive 1.2.1. A cross-version migration would require paying much closer attention to the details. A quick version check on both sides is shown below.
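To confirm that both clusters really run the same versions, the standard version commands can be run on a node of each cluster (nothing here is specific to this migration):

# print the Hadoop build version of this node
hadoop version
# print the Hive version (supported by Hive 1.x)
hive --version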
2. Copy the HDFS data to the new cluster. The distcp command makes it convenient to migrate and synchronize data between two clusters. Its options are listed below for reference:
[hadoop@master1 hive]$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and append new
data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are
saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with
hftps://
-overwrite Choose to overwrite target files unconditionally,
even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If -p is
specified with no <arg>, then preserves
replication, block size, user, group, permission,
checksum type and timestamps. raw.* xattrs are
preserved when both the source and destination
paths are in the /.reserved/raw hierarchy (HDFS
only). raw.* xattr preservation is independent of
the -p flag. Refer to the DistCp documentation for
more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n
bytes
-skipcrccheck Whether to skip CRC checks between source and
target paths.
-strategy <arg> Copy strategy to use. Default is dividing work
based on file sizes
-tmp <arg> Intermediate work path to be used for atomic
commit
-update Update target, copying only missing files or
directories
The actual command is as follows, where nn1 and nn2 are the namenode hosts of the source and target clusters respectively (for a Hadoop 1.x to 2.x migration, hdfs:// may throw errors and the data can only be transferred over hftp; search online for the specifics):
hadoop distcp hdfs://nn1/user/hive/ hdfs://nn2:8020/user
Since distcp is itself executed as a MapReduce job, it consumes a large share of cluster resources. If the data volume is large, it is best to schedule the migration for a time window when few other MR jobs are running. The throttling options listed above also help, as sketched below.
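If you cannot avoid copying while the cluster is busy, the -m and -bandwidth options from the usage above let you cap the job's footprint. A sketch (the paths, map count, and bandwidth cap are illustrative, not recommendations):

# at most 20 concurrent maps, 10 MB/s per map; -update skips files already
# present on the target, -pbugp preserves block size, user, group, permission
hadoop distcp -update -pbugp -m 20 -bandwidth 10 \
    hdfs://nn1:8020/user/hive/warehouse hdfs://nn2:8020/user/hive/warehouse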
3. As you know, Hive's data lives in HDFS, while the metadata (metastore) is usually kept in MySQL. If the new cluster's metastore database is also MySQL, you can simply use mysqldump to copy the metadata across. Of course, the whole warehouse must also be distcp'ed into the corresponding directory on the target cluster; everything is stored under /somepath/hive/warehouse/dbname.db/tablename.
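A minimal mysqldump round trip, assuming the metastore database is named hive and the MySQL hosts are oldmeta and newmeta (all three names are placeholders for your own setup):

# dump the metastore schema and data from the old MySQL server
mysqldump -h oldmeta -u hive -p hive > hive_metastore.sql
# load it into the MySQL server backing the new metastore
# (the target database must already exist)
mysql -h newmeta -u hive -p hive < hive_metastore.sql

One caveat worth checking: the location values stored in the metastore (the DBS and SDS tables) embed the old namenode URI, so if the new namenode has a different hostname they need to be updated as well.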
In my tests, once both the data and the metadata had been synced over, all the data could be queried straight away from the hive CLI on the new cluster, partitioned tables included. A quick cross-check is sketched below.
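A simple way to verify the migration is to run the same counts on both clusters and compare the output (dbname and tablename are placeholders):

# run on both the old and the new cluster, then diff the results
hive -e "use dbname; show partitions tablename; select count(*) from tablename;"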
Of course, if your metastore deployment changes or the metadata is corrupted, you may have no choice but to migrate the Hive data via CREATE TABLE statements plus dynamic partitioning; for a concrete plan, see
https://odinliu.com/2016/02/02/%E6%9C%80%E8%BF%91%E6%90%9EHadoop%E9%9B%86%E7%BE%A4%E8%BF%81%E7%A7%BB%E8%B8%A9%E7%9A%84%E5%9D%91%E6%9D%82%E8%AE%B0/
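For reference, that approach boils down to recreating each table's DDL on the new cluster and reloading all of its partitions in a single dynamic-partition insert. A sketch with placeholder names (dbname, tablename, the staging table tablename_src, and the partition column dt are all hypothetical):

# on the old cluster: export the table DDL
hive -e "show create table dbname.tablename;" > tablename.ddl
# on the new cluster: recreate the table from that DDL, point a staging table
# (tablename_src) at the copied files, then reload every partition at once;
# the partition column must be the last column of the SELECT
hive -e "
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dbname.tablename partition (dt)
select * from dbname.tablename_src;
"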