Recently my company needed to migrate our Hadoop cluster and the Hive data it hosts. Completing the migration of all the data while keeping the existing business running stably is not an easy process. I have summarized the workflow below as a reference for other administrators of Hadoop clusters and Hive data warehouses.
1. First of all, make sure the new Hadoop cluster and Hive are installed and configured properly. To avoid running into unnecessary trouble, the new cluster uses exactly the same versions as the old one, Hadoop 2.6.3 and Hive 1.2.1. A cross-version migration would require paying much closer attention to the details. A quick version check on both sides is shown below.
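To confirm that both clusters really run the same versions, the standard version commands can be run on a node of each cluster (nothing here is specific to this migration):

# print the Hadoop build version of this node
hadoop version
# print the Hive version (supported by Hive 1.x)
hive --version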
2. Copy the HDFS data to the new cluster. The distcp command makes it convenient to migrate and synchronize data between two clusters. Its options are listed below for reference:
[hadoop@master1 hive]$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and append new
data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are
saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with
hftps://
-overwrite Choose to overwrite target files unconditionally,
even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If -p is
specified with no <arg>, then preserves
replication, block size, user, group, permission,
checksum type and timestamps. raw.* xattrs are
preserved when both the source and destination
paths are in the /.reserved/raw hierarchy (HDFS
only). raw.* xattr preservation is independent of
the -p flag. Refer to the DistCp documentation for
more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n
bytes
-skipcrccheck Whether to skip CRC checks between source and
target paths.
-strategy <arg> Copy strategy to use. Default is dividing work
based on file sizes
-tmp <arg> Intermediate work path to be used for atomic
commit
-update Update target, copying only missing files or
directories
The actual command is as follows, where nn1 and nn2 are the namenode hosts of the source and target clusters respectively (for a Hadoop 1.x to 2.x migration, hdfs:// may throw errors and the data can only be transferred over hftp; search online for the specifics):
hadoop distcp hdfs://nn1/user/hive/ hdfs://nn2:8020/user
Since distcp is itself executed as a MapReduce job, it consumes a large share of cluster resources. If the data volume is large, it is best to schedule the migration for a time window when few other MR jobs are running. The throttling options listed above also help, as sketched below.
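If you cannot avoid copying while the cluster is busy, the -m and -bandwidth options from the usage above let you cap the job's footprint. A sketch (the paths, map count, and bandwidth cap are illustrative, not recommendations):

# at most 20 concurrent maps, 10 MB/s per map; -update skips files already
# present on the target, -pbugp preserves block size, user, group, permission
hadoop distcp -update -pbugp -m 20 -bandwidth 10 \
    hdfs://nn1:8020/user/hive/warehouse hdfs://nn2:8020/user/hive/warehouse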
3. As you know, Hive's data lives in HDFS, while the metadata (metastore) is usually kept in MySQL. If the new cluster's metastore database is also MySQL, you can simply use mysqldump to copy the metadata across. Of course, the whole warehouse must also be distcp'ed into the corresponding directory on the target cluster; everything is stored under /somepath/hive/warehouse/dbname.db/tablename.
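A minimal mysqldump round trip, assuming the metastore database is named hive and the MySQL hosts are oldmeta and newmeta (all three names are placeholders for your own setup):

# dump the metastore schema and data from the old MySQL server
mysqldump -h oldmeta -u hive -p hive > hive_metastore.sql
# load it into the MySQL server backing the new metastore
# (the target database must already exist)
mysql -h newmeta -u hive -p hive < hive_metastore.sql

One caveat worth checking: the location values stored in the metastore (the DBS and SDS tables) embed the old namenode URI, so if the new namenode has a different hostname they need to be updated as well.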
In my tests, once both the data and the metadata had been synced over, all the data could be queried straight away from the hive CLI on the new cluster, partitioned tables included. A quick cross-check is sketched below.
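A simple way to verify the migration is to run the same counts on both clusters and compare the output (dbname and tablename are placeholders):

# run on both the old and the new cluster, then diff the results
hive -e "use dbname; show partitions tablename; select count(*) from tablename;"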
Of course, if your metastore deployment changes or the metadata is corrupted, you may have no choice but to migrate the Hive data via CREATE TABLE statements plus dynamic partitioning; for a concrete plan, see
https://odinliu.com/2016/02/02/%E6%9C%80%E8%BF%91%E6%90%9EHadoop%E9%9B%86%E7%BE%A4%E8%BF%81%E7%A7%BB%E8%B8%A9%E7%9A%84%E5%9D%91%E6%9D%82%E8%AE%B0/
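For reference, that approach boils down to recreating each table's DDL on the new cluster and reloading all of its partitions in a single dynamic-partition insert. A sketch with placeholder names (dbname, tablename, the staging table tablename_src, and the partition column dt are all hypothetical):

# on the old cluster: export the table DDL
hive -e "show create table dbname.tablename;" > tablename.ddl
# on the new cluster: recreate the table from that DDL, point a staging table
# (tablename_src) at the copied files, then reload every partition at once;
# the partition column must be the last column of the SELECT
hive -e "
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dbname.tablename partition (dt)
select * from dbname.tablename_src;
"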