1. Scenario
Suppose there is a data.csv file in HDFS with 8 columns per row and 10 rows in total, as shown below:
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
1257,4,"2921-75370692","10:02:30", "2022-02-17 18:30:08","2022-01-10 11:13:48","2022-01-06 13:37:50","14704902236"
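If data.csv is not yet in HDFS, it can be uploaded with the standard HDFS shell. A minimal sketch, assuming a local copy of the file and the /userdata target directory used in the job configuration below:
# create the target directory and upload the local file
hdfs dfs -mkdir -p /userdata
hdfs dfs -put data.csv /userdata/data.csv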
Now we want to load this csv file from HDFS into a table in ClickHouse. First, create a table named user_table in ClickHouse (using ClickHouse's default database, default), with the following command:
CREATE TABLE default.user_table
(
    id UInt16,
    alert_level UInt16,
    caller String,
    duration String,
    end_time String,
    found_time String,
    start_time String,
    victim_phone_number String
) ENGINE = TinyLog();
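This DDL can be executed from the command line with clickhouse-client. A minimal sketch, assuming the client is installed and the server address and credentials match the job configuration shown later (172.17.0.5, user default, password 123456):
clickhouse-client --host 172.17.0.5 --user default --password 123456 \
    --query "CREATE TABLE default.user_table (id UInt16, alert_level UInt16, caller String, duration String, end_time String, found_time String, start_time String, victim_phone_number String) ENGINE = TinyLog()"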
Now DataX can be used to load the data.csv file in HDFS into the ClickHouse table. This relies on two plugins: hdfsreader and clickhousewriter.
2. Installing DataX
For installation, follow Alibaba's official guide below. Building from source is recommended, since the latest source includes the clickhousewriter plugin: https://github.com/alibaba/DataX/blob/master/userGuid.md
After installation you will see a bin directory, a job directory, and a plugin directory. The bin directory holds the datax.py script used to run jobs; the job directory holds the jobs we write, each configured as a JSON file; and the plugin directory contains the reader and writer plugins for the various data sources.
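Before continuing, it is worth verifying that the two plugins this article relies on are present in the build. A quick check, assuming you are in the DataX installation directory:
# both directories should exist after a source build
ls plugin/reader/hdfsreader
ls plugin/writer/clickhousewriter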
3. Writing hdfs2clickhouse.json (loads data.csv from HDFS into the user_table table in ClickHouse)
The contents of hdfs2clickhouse.json are shown below; place this file in the job directory:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": "/userdata/data.csv",
                        "defaultFS": "hdfs://172.17.0.3:8020",
                        "column": ["*"],
                        "fileType": "csv",
                        "fieldDelimiter": ",",
                        "encoding": "UTF-8"
                    }
                },
                "writer": {
                    "name": "clickhousewriter",
                    "parameter": {
                        "username": "default",
                        "password": "123456",
                        "column": ["id","alert_level","caller","duration","end_time","found_time","start_time","victim_phone_number"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:clickhouse://172.17.0.5:8123/default",
                                "table": ["user_table"]
                            }
                        ],
                        "preSql": [],
                        "postSql": [],
                        "batchSize": 65536,
                        "batchByteSize": 134217728,
                        "dryRun": false,
                        "writeMode": "insert"
                    }
                }
            }
        ]
    }
}
Some of the configuration properties in the "reader" and "writer" sections are explained below.
First, the "reader" section:
"reader": {
//对应plugin目录reader目录下的hdfsreader目录
"name": "hdfsreader",
"parameter": {
//hdfs目录中的文件路径
"path": "/userdata/fzdata2.csv",
//hdfs namenode路径,可用hdfs getconf -confKey fs.default.name命令查看
"defaultFS": "hdfs://172.17.0.3:8020",
//每行的所有列
"column": ["*"],
//文件格式
"fileType": "csv",
//分隔符,不写默认用,分隔
"fieldDelimiter":",",
"encoding": "UTF-8"
}
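Both reader settings can be checked against the cluster before running the job, using standard HDFS shell commands:
# print the NameNode address to use for "defaultFS"
hdfs getconf -confKey fs.defaultFS
# confirm the file exists and inspect its first rows
hdfs dfs -ls /userdata/data.csv
hdfs dfs -cat /userdata/data.csv | head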
Next, some of the properties in the "writer" section:
"writer":{
//对应plugin目录writer目录下的clickhousewriter目录
"name": "clickhousewriter",
"parameter": {
//clickhouse数据库用户名
"username": "default",
//clickhouse数据库用户密码
"password": "123456",
//使用的clickhouse中user_table表中的字段
"column": ["id","alert_level","caller","duration","end_time","found_time","start_time","victim_phone_number"],
"connection": [
{
"jdbcUrl": "jdbc:clickhouse://172.17.0.5:8123/default",
"table": ["user_table"]
}
],
"preSql": [],
"postSql": [],
"batchSize": 65536,
"batchByteSize": 134217728,
"dryRun": false,
"writeMode": "insert"
}
}
}
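Before launching the job, you can also confirm that the ClickHouse HTTP port from jdbcUrl is reachable. A quick sketch using curl; ClickHouse's HTTP interface answers a plain GET with "Ok." when the server is up:
# should print "Ok."
curl http://172.17.0.5:8123/
# run a trivial query with the credentials from the writer config
curl -u default:123456 "http://172.17.0.5:8123/" --data-binary "SELECT 1"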
4. Running the job (loads data.csv from HDFS into the user_table table in ClickHouse)
Run the following command (assuming you are in the datax installation directory). When the job finishes successfully, DataX prints a summary of the transfer, including the number of records read and any failures.
python bin/datax.py job/hdfs2clickhouse.json
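Afterwards, the import can be verified by counting the rows in the target table. A minimal check with clickhouse-client (the count should be 10 for the sample file above):
clickhouse-client --host 172.17.0.5 --user default --password 123456 \
    --query "SELECT count() FROM default.user_table"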
5. References
Alibaba's open-source DataX repository
https://github.com/alibaba/DataX