Syncing HDFS Data to ClickHouse with DataX

1. Scenario

Suppose HDFS holds a file data.csv with 10 rows of 8 columns each, as shown below:

1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30","2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
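If you need to stage this file in HDFS yourself, here is a minimal sketch, assuming a local data.csv and the /userdata directory that the job below reads from:

# Create the target directory and upload the sample file into HDFS
hdfs dfs -mkdir -p /userdata
hdfs dfs -put data.csv /userdata/data.csv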

To load this CSV file from HDFS into ClickHouse, first create a table user_table in ClickHouse (in the default database, default):

CREATE TABLE default.user_table(id UInt16, alert_level UInt16, caller String, duration String, end_time String, found_time String, start_time String, victim_phone_number String) ENGINE = TinyLog();
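To confirm the table was created, you can describe it from clickhouse-client; the host and credentials here are the ones assumed throughout this article, so adjust them to your setup:

# Show the column definitions of the freshly created table
clickhouse-client --host 172.17.0.5 --user default --password 123456 \
  --query "DESCRIBE TABLE default.user_table"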

DataX can now load data.csv from HDFS into this table, using the hdfsreader plugin on the read side and the clickhousewriter plugin on the write side.

2. Installing DataX

For installation, follow Alibaba's official guide: https://github.com/alibaba/DataX/blob/master/userGuid.md. Building from source is recommended, because the latest source already includes the clickhousewriter plugin.
After installation you will find a bin directory, a job directory, and a plugin directory: bin holds datax.py, the script used to run jobs; job is where the job definitions we write go, each configured as a JSON file; plugin contains the reader and writer plugins for the various data sources.
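A minimal build-from-source sketch following that guide (it assumes JDK 1.8+ and Maven are installed):

# Clone the source and build the full DataX distribution, skipping tests
git clone https://github.com/alibaba/DataX.git
cd DataX
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
# The runnable package is produced under target/datax/datax
cd target/datax/datax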

3. Writing the hdfs2clickhouse.json job file (loads data.csv from HDFS into the ClickHouse table user_table)

The contents of hdfs2clickhouse.json are shown below; put the file in the job directory:

{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/userdata/data.csv",
                        "defaultFS": "hdfs://172.17.0.3:8020",
                        "column": ["*"],
                        "fileType": "csv",
                        "fieldDelimiter": ",",
                        "encoding": "UTF-8"
                    }
                },
                "writer": {
                    "name": "clickhousewriter",
                    "parameter": {
                        "username": "default",
                        "password": "123456",
                        "column": ["id","alert_level","caller","duration","end_time","found_time","start_time","victim_phone_number"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:clickhouse://172.17.0.5:8123/default",
                                "table": ["user_table"]
                            }
                        ],
                        "preSql": [],
                        "postSql": [],
                        "batchSize": 65536,
                        "batchByteSize": 134217728,
                        "dryRun": false,
                        "writeMode": "insert"
                    }
                }
            }
        ]
    }
}
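If you are ever unsure which parameters a reader/writer pair accepts, datax.py can print a skeleton config for any combination of installed plugins:

# Print a job template for the hdfsreader -> clickhousewriter pair
python bin/datax.py -r hdfsreader -w clickhousewriter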

The following explains some of the settings in the reader and writer blocks. First the reader:

 "reader": {
 					//对应plugin目录reader目录下的hdfsreader目录
                    "name": "hdfsreader",
                    "parameter": {
                    	//hdfs目录中的文件路径
                        "path": "/userdata/fzdata2.csv",
                        //hdfs namenode路径,可用hdfs getconf -confKey fs.default.name命令查看
                        "defaultFS": "hdfs://172.17.0.3:8020",
                        //每行的所有列
                        "column": ["*"],
                        //文件格式
                        "fileType": "csv",
                        //分隔符,不写默认用,分隔
                        "fieldDelimiter":",",
                        "encoding": "UTF-8"
                    }
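Before running the job, it is worth confirming that path and defaultFS are correct, for example:

# Check the NameNode address and make sure the source file is readable
hdfs getconf -confKey fs.defaultFS
hdfs dfs -ls /userdata/data.csv
hdfs dfs -cat /userdata/data.csv | head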

Next, some of the settings in the writer block:

	"writer":{
					//对应plugin目录writer目录下的clickhousewriter目录
					"name": "clickhousewriter",
					"parameter": {
						//clickhouse数据库用户名
						"username": "default",
						//clickhouse数据库用户密码
						"password": "123456",
						//使用的clickhouse中user_table表中的字段
						"column": ["id","alert_level","caller","duration","end_time","found_time","start_time","victim_phone_number"],
						"connection": [
							{
								"jdbcUrl": "jdbc:clickhouse://172.17.0.5:8123/default",
								"table": ["user_table"]
							}
						],
						"preSql": [],
						"postSql": [],
						"batchSize": 65536,
						"batchByteSize": 134217728,
						"dryRun": false,
						"writeMode": "insert"
					}
				}

            }
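The JDBC URL points at ClickHouse's HTTP interface (port 8123), so you can sanity-check the connection details with a plain HTTP query, using the host and credentials assumed above:

# Expect "1" back if the host, port, and credentials are correct
curl 'http://172.17.0.5:8123/' --user 'default:123456' --data-binary 'SELECT 1'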

4. Running the job (loads data.csv from HDFS into the ClickHouse table user_table)

From the DataX installation directory, run:

python bin/datax.py job/hdfs2clickhouse.json

A successful run ends with read/write statistics, as in the screenshots below.

[Screenshots: DataX console output from a successful run]
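To verify the load, count the rows in the target table; with the sample file above it should return 10:

# Count the loaded rows and peek at a few of them
clickhouse-client --host 172.17.0.5 --user default --password 123456 \
  --query "SELECT count() FROM default.user_table"
clickhouse-client --host 172.17.0.5 --user default --password 123456 \
  --query "SELECT * FROM default.user_table LIMIT 3"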

5. References

Alibaba's open-source DataX repository: https://github.com/alibaba/DataX
