Introduction to clickhouse-copier, a data migration tool

MYSQL轻松学

Article link: https://blog.csdn.net/liang_0609/article/details/86707834

When working with clickhouse, you may need to migrate data between clusters. There are several ways to do this:

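1. DETACH/FREEZE the partition, copy the files with scp, then ATTACH them on the destination
alter table db.table DETACH PARTITION [partition]; -- take the partition offline
alter table db.table FREEZE PARTITION [partition]; -- back up the partition
alter table db.table ATTACH PARTITION [partition]; -- bring the partition online
2. The remote table function
insert into ... select * from remote('ip',db.table,'user','password')
3. The clickhouse-copier tool
This tool ships with the standard clickhouse server distribution. It can work in fully parallel mode and distributes the data in the most efficient way.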
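
As an illustration of the second approach, the sketch below builds the INSERT ... SELECT statement around remote() and shows how it could be passed to clickhouse-client. The function name build_remote_insert is hypothetical, the addresses and credentials are the sample values used elsewhere in this article, and the sketch assumes the destination table already exists with a compatible schema. It does no quoting, so a password containing a single quote would need escaping.

```shell
#!/bin/sh
# Build the INSERT ... SELECT statement that pulls one table through remote().
# Arguments: source host:port, database.table, user, password.
# (build_remote_insert is a hypothetical helper, not part of ClickHouse.)
build_remote_insert() {
    src_addr=$1
    table=$2
    user=$3
    password=$4
    printf "INSERT INTO %s SELECT * FROM remote('%s', %s, '%s', '%s')\n" \
        "$table" "$src_addr" "$table" "$user" "$password"
}

# Example invocation against a destination server (not executed here):
#   clickhouse-client --host 10.10.1.3 \
#       --query "$(build_remote_insert 10.10.1.1:9000 default.test1 user password)"
```

Because the whole table travels through a single query, this fits the small-table case described in the comparison that follows.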

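Pros and cons of the three approaches:

Approach	Pros	Cons
DETACH/FREEZE	Suitable for small tables	Source and destination clusters need the same number of partitions; the procedure is tedious
remote	Suitable for small tables; easy to use	Slow for large tables
clickhouse-copier	Runs in parallel; can change the table name and primary key; can change the partitioning	Configuration is tedious; requires ZooKeeper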

This article focuses on how to use clickhouse-copier.

clickhouse-copier is a command-line tool that is installed together with the clickhouse software.

> clickhouse-copier --help

usage: clickhouse-copier --config-file <config-file> --task-path <task-path>
Copies tables from one cluster to another
--daemon	★ run as a daemon
--umask=mask	umask for the daemon process
--pidfile=path	path to the pid file
-C<file>, --config-file=<file>	★ configuration file (ZooKeeper connection and related settings)
-L<file>, --log-file=<file>	log file
-E<file>, --errorlog-file=<file>	error log file
-P<file>, --pid-file=<file>	pid file
--task-path=task-path	★ path of the task in ZooKeeper
--safe-mode	★ disables ALTER DROP PARTITION
--copy-fault-probability=copy-fault-probability	makes copying fail with the given probability; used to test recovery of partition state
--log-level=log-level	log level, e.g. debug
--base-dir=base-dir	★ defaults to the current directory; a clickhouse-copier_<date>_<pid> directory is created under it
--help	show help

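The options marked ★ are the important ones. In most cases you only need to specify --daemon, --config-file (written as --config in the launch command in step 4) and --task-path; the defaults are fine for the rest.

clickhouse-copier relies on ZooKeeper. To reduce network traffic, run clickhouse-copier on a server that holds the source data.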

Step 1. Prepare a schema.xml configuration file

It describes the shards of the source and destination clusters and the tables to be synchronized.

<yandex>

<remote_servers>

<source_cluster>

<shard>
<weight>1</weight>
<replica>
<host>10.10.1.1</host>
<user>user</user>
<password>password</password>
<port>9000</port>
</replica>
</shard>

<shard>
<weight>1</weight>
<replica>
<host>10.10.1.2</host>
<user>user</user>
<password>password</password>
<port>9000</port>
</replica>
</shard>
</source_cluster>

<destination_cluster>
<shard>
<weight>1</weight>
<replica>
<host>10.10.1.3</host>
<user>user</user>
<password>password</password>
<port>9000</port>
</replica>
</shard>
<shard>
<weight>1</weight>
<replica>
<host>10.10.1.4</host>
<user>user</user>
<password>password</password>
<port>9000</port>
</replica>
</shard>
</destination_cluster>
</remote_servers>


<max_workers>2</max_workers>

<settings_pull>
    <readonly>1</readonly>
</settings_pull>

<settings_push>
    <readonly>0</readonly>
</settings_push>

<settings>
    <connect_timeout>3</connect_timeout>
    
    <insert_distributed_sync>1</insert_distributed_sync>
</settings>

<tables>
    <test1>
        
        <cluster_pull>source_cluster</cluster_pull>
        <database_pull>default</database_pull>
        <table_pull>test1</table_pull>
        
        <cluster_push>destination_cluster</cluster_push>
        <database_push>default</database_push>
        <table_push>test1</table_push>
        <engine>
            ENGINE=ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/default/test1', '{replica}')
            PARTITION BY toMonday(EventDate)
            ORDER BY (ID, EventDate)
        </engine>
        <sharding_key>rand()</sharding_key>
        
    </test1>
    
</tables>
</yandex>

For the schema.xml format, see: https://clickhouse.yandex/docs/en/operations/utils/clickhouse-copier/

Step 2. Once schema.xml is complete, upload it to a specific path in ZooKeeper (/<task-path>/description)

Multiple tasks can be created.

Run the following commands on any one of the ZooKeeper nodes:

> ./zkCli.sh create /clickhouse/copytasks ""
> ./zkCli.sh create /clickhouse/copytasks/task1 ""
> ./zkCli.sh create /clickhouse/copytasks/task1/description "`cat schema.xml`"
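
Each create call adds a single znode, so every parent of the task path has to exist first (the commands above rely on /clickhouse already being present). A minimal sketch of a helper that creates the whole chain and then reads the description back is shown below. The function names znode_ancestors and upload_copier_task and the ZK_CLI variable are hypothetical, and the sketch assumes zkCli.sh is run from its own directory against the default local server.

```shell
#!/bin/sh
# Location of the ZooKeeper CLI; an assumption, adjust to your installation.
ZK_CLI=${ZK_CLI:-./zkCli.sh}

# Print every ancestor of a znode path, shortest first, ending with the path itself.
znode_ancestors() {
    znode=$1
    prefix=""
    old_ifs=$IFS
    IFS=/
    # With IFS set to "/", the unquoted expansion splits into path components.
    for part in $znode; do
        [ -n "$part" ] || continue
        prefix="$prefix/$part"
        printf '%s\n' "$prefix"
    done
    IFS=$old_ifs
}

# Create the task path node by node, upload the schema, then read it back.
# Arguments: task path in ZooKeeper, local schema file.
upload_copier_task() {
    task_path=$1
    schema_file=$2
    for node in $(znode_ancestors "$task_path"); do
        # create reports an error for nodes that already exist; that is harmless here.
        "$ZK_CLI" create "$node" "" || true
    done
    "$ZK_CLI" create "$task_path/description" "$(cat "$schema_file")"
    "$ZK_CLI" get "$task_path/description"
}

# Example (not executed here):
#   upload_copier_task /clickhouse/copytasks/task1 schema.xml
```

Reading the node back with get is a cheap way to confirm that the XML arrived intact before the copier is started.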

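Step 3. Prepare the zookeeper.xml configuration file

The file holds the ZooKeeper node entries and the logger settings:

<yandex>
<zookeeper>
<node>
...
</node>
</zookeeper>
<logger>
<level>trace</level>
<stderr>./log/stderr.log</stderr>
<stdout>./log/stdout.log</stdout>
</logger>
</yandex>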
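
For reference, a complete file of this shape could be generated as in the sketch below. The helper name write_zookeeper_config is hypothetical, and the host name and port passed to it are placeholders for your own ZooKeeper node; 2181 is only the usual ZooKeeper client port, used here as a default. The logger block mirrors the settings shown above.

```shell
#!/bin/sh
# Write a minimal clickhouse-copier configuration file.
# Arguments: output path, ZooKeeper host, ZooKeeper client port (default 2181).
# (write_zookeeper_config is a hypothetical helper; host and port are placeholders.)
write_zookeeper_config() {
    out_file=$1
    zk_host=$2
    zk_port=${3:-2181}
    cat > "$out_file" <<EOF
<yandex>
    <zookeeper>
        <node index="1">
            <host>$zk_host</host>
            <port>$zk_port</port>
        </node>
    </zookeeper>
    <logger>
        <level>trace</level>
        <stderr>./log/stderr.log</stderr>
        <stdout>./log/stdout.log</stdout>
    </logger>
</yandex>
EOF
}

# Example (not executed here):
#   write_zookeeper_config zookeeper.xml zk1.example.internal 2181
```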

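Step 4. Start clickhouse-copier on a ClickHouse server

Either a source or a destination server will work; to reduce network traffic, prefer a server that holds the source data.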

> clickhouse-copier --config zookeeper.xml --task-path /clickhouse/copytasks/task1 --daemon

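After the tool starts, the job takes some time to finish, depending on the size of the tables being copied. If --base-dir is not specified, a directory named clickhouse-copier_<time>_<pid> is created in the current directory. It contains two log files, which record copy errors and details.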
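
To follow a running job, it helps to locate the newest of these directories. The sketch below assumes the directory names follow the clickhouse-copier_<time>_<pid> pattern with a fixed-width timestamp, so that sorting by name matches sorting by start time. The helper name latest_copier_dir is hypothetical, and the log files are matched by a *.log glob rather than by exact name.

```shell
#!/bin/sh
# Print the most recent clickhouse-copier working directory under a base directory.
# Argument: base directory (defaults to the current directory).
# (latest_copier_dir is a hypothetical helper.)
latest_copier_dir() {
    base_dir=${1:-.}
    latest=""
    # Glob results are sorted by name, so the last match has the latest timestamp.
    for dir in "$base_dir"/clickhouse-copier_*/; do
        [ -d "$dir" ] || continue
        latest=${dir%/}
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example (not executed here): follow the logs of the latest run.
#   tail -f "$(latest_copier_dir .)"/*.log
```

The function returns a non-zero status when no matching directory exists, which makes it easy to detect that the copier has not been started from that location.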