首先,在MySQL中准备数据,以lineitem表为例。然后通过DataX将MySQL库中的lineitem表中的数据,迁移到RapidsDB的表lineitem中。
1、在mysql创建lineitem表
CREATE TABLE `lineitem` (
`l_orderkey` bigint(11) NOT NULL,
`l_partkey` int(11) NOT NULL,
`l_suppkey` int(11) NOT NULL,
`l_linenumber` int(11) NOT NULL,
`l_quantity` decimal(15,2) NOT NULL,
`l_extendedprice` decimal(15,2) NOT NULL,
`l_discount` decimal(15,2) NOT NULL,
`l_tax` decimal(15,2) NOT NULL,
`l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_shipdate` date NOT NULL,
`l_commitdate` date NOT NULL,
`l_receiptdate` date NOT NULL,
`l_shipinstruct` char(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_shipmode` char(10) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL);
2、导入数据
mysql> LOAD DATA LOCAL INFILE '/opt/lineitem.csv' into table tpch.lineitem FIELDS TERMINATED BY '|' lines TERMINATED by '|\n' ;
Query OK, 119994608 rows affected (27 min 17.10 sec)
Records: 119994608 Deleted: 0 Skipped: 0 Warnings: 0
mysql> select count(1) from lineitem;
+-----------+
| count(1) |
+-----------+
| 119994608 |
+-----------+
1 row in set (2 min 14.93 sec)
3、编辑datax配置文件
[root@mysql5 bin]# cat stream2stream_writer.json
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123456",
"column": [
"*",
],
"connection": [
{
"table": [
"lineitem"
],
"jdbcUrl": [
"jdbc:mysql://192.168.1.67:3306/tpch"
]
}
]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "replace",
"username": "test1",
"password":"123456",
"column": ["*"],
"session": [],
"preSql": [],
"connection": [
{
"jdbcUrl": "jdbc:mysql://192.168.1.152:13306/tpch",
"table": [
"lineitem"
]
}
]
}
}
}
]
}
}
4、在rpdsql中创建lineitem表(参考1步骤)
5、执行DataX脚本文件
[root@mysql5 bin]# ./datax.py stream2stream_writer.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2022-04-28 12:45:36.649 [main] WARN ConfigParser - 插件[mysqlreader,mysqlwriter]加载失败,1s后重试... Exception:Code:[Common-00], Describe:[您提供的配置文件存在错 误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/opt/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
2022-04-28 12:45:37.657 [main] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[ /opt/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.common.util.Configuration.from(Configuration.java:95)
at com.alibaba.datax.core.util.ConfigParser.parseOnePluginConfig(ConfigParser.java:153)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:125)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[ /opt/datax/plugin/reader/._drdsreader/plugin.json]不存在. 请检查您的配置文件.
查看/opt/datax/plugin/reader/目录下有._*文件,删除即可。
删除步骤如下:
[root@mysql5 reader]# find /opt/datax/plugin/reader/ -type f -name "._*er" | xargs rm -rf
[root@mysql5 reader]# find /opt/datax/plugin/writer/ -type f -name "._*er" | xargs rm -rf
6、继续执行脚本
[root@mysql5 bin]# ./datax.py ./stream2stream_writer.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2022-04-28 12:50:13.418 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemI mpl
2022-04-28 12:50:13.430 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.171-b11
jvmInfo: Linux amd64 3.10.0-1160.15.2.el7.x86_64
cpu num: 8
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2022-04-28 12:50:13.454 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"column":[
"*"
],
"connection":[
{
"jdbcUrl":[
"jdbc:mysql://192.168.30.67:3306/tpch"
],
"table":[
"lineitem1"
]
}
],
"password":"******",
"username":"root"
}
},
"writer":{
"name":"mysqlwriter",
"parameter":{
"column":[
"*"
],
"connection":[
{
"jdbcUrl":"jdbc:mysql://192.168.30.152:13306/tpch",
"table":[
"lineitem"
]
}
],
"password":"******",
"preSql":[],
"session":[],
"username":"test1",
"writeMode":"replace"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"channel":3
}
}
}
2022-04-28 12:50:13.475 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: n ull
2022-04-28 12:50:13.478 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2022-04-28 12:50:13.478 [main] INFO JobContainer - DataX jobContainer starts job.
2022-04-28 12:50:13.480 [main] INFO JobContainer - Set jobId = 0
2022-04-28 12:50:14.270 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://192.168.30 .67:3306/tpch?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatem ents=true.
2022-04-28 12:50:14.271 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因 为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并 作出修改.
2022-04-28 12:50:14.860 [job-0] INFO OriginalConfPretreatmentUtil - table:[lineitem] all columns:[
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linesta tus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
].
2022-04-28 12:50:14.860 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置信息存在风险. 因为 您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置 并作出修改.
2022-04-28 12:50:14.862 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
replace INTO %s (l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_ret urnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment) VALUES(?,?,?,? ,?,?,?,?,?,?,?,?,?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://192.168.30.152:13306/tpch?yearIsDateType=false&zeroDateTimeBehavior=conver tToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2022-04-28 12:50:14.863 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2022-04-28 12:50:14.863 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2022-04-28 12:50:14.864 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2022-04-28 12:50:14.864 [job-0] INFO JobContainer - jobContainer starts to do split ...
2022-04-28 12:50:14.865 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.
2022-04-28 12:50:14.870 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2022-04-28 12:50:14.871 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2022-04-28 12:50:14.895 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2022-04-28 12:50:14.901 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2022-04-28 12:50:14.903 [job-0] INFO JobContainer - Running by standalone Mode.
2022-04-28 12:50:14.914 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] ta sks.
2022-04-28 12:50:14.919 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2022-04-28 12:50:14.920 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2022-04-28 12:50:14.929 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is sta rted
2022-04-28 12:50:14.934 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select * f rom lineitem1
] jdbcUrl:[jdbc:mysql://192.168.30.67:3306/tpch?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyIn t1isBit=false&rewriteBatchedStatements=true].
2022-04-28 12:50:14.960 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select * f rom lineitem1
] jdbcUrl:[jdbc:mysql://192.168.30.67:3306/tpch?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyIn t1isBit=false&rewriteBatchedStatements=true].
2022-04-28 12:50:15.231 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[303 ]ms
2022-04-28 12:50:15.232 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2022-04-28 12:50:24.928 [job-0] INFO StandAloneJobContainerCommunicator - Total 4 records, 424 bytes | Speed 42B/s, 0 records/s | Error 0 records, 0 bytes | All T ask WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2022-04-28 12:50:24.929 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2022-04-28 12:50:24.930 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2022-04-28 12:50:24.930 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work.
2022-04-28 12:50:24.930 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2022-04-28 12:50:24.931 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2022-04-28 12:50:24.932 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2022-04-28 12:50:24.933 [job-0] INFO JobContainer - PerfTrace not enable!
2022-04-28 12:50:24.933 [job-0] INFO StandAloneJobContainerCommunicator - Total 4 records, 424 bytes | Speed 42B/s, 0 records/s | Error 0 records, 0 bytes | All T ask WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2022-04-28 12:50:24.934 [job-0] INFO JobContainer -
任务启动时刻 : 2022-04-28 12:50:13
任务结束时刻 : 2022-04-28 12:50:24
任务总计耗时 : 11s
任务平均流量 : 42B/s
记录写入速度 : 0rec/s
读出记录总数 : 4
读写失败总数 : 0
7、查询rpdsql下的lineitem表查看已经有数据同步过来。