When exporting Hive data to Doris with DataX, a single non-conforming row will often fail the whole job, and the error URL Doris returns sometimes doesn't help pinpoint that row either. If we can at least identify which Hive file caused the failure, the search space shrinks considerably.
Check the DataX error log:
```
2023-12-28 16:21:09.556 [0-0-113-writer] INFO DorisWriterEmitter - StreamLoad response: {"Status":"Fail","BeginTxnTimeMs":0,"Message":"[INTERNAL_ERROR]too many filtered rows","NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,"Label":"dwd_c_order_goods_detail_v2_dt_4b3d8caa-c191-4f44-98a3-72f758ad5396_6","LoadBytes":42505201,"StreamLoadPutTimeMs":4,"NumberTotalRows":19387,"WriteDataTimeMs":4495,"TxnId":137186993,"LoadTimeMs":4502,"ErrorURL":"http://10.0.xx.xx:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_6541d8d41f2b4b09-1584eac064efdf9d_6541d8d41f2b4b09_1584eac064efdf9d","TwoPhaseCommit":"false","ReadDataTimeMs":2689,"NumberLoadedRows":19386,"NumberFilteredRows":1}, cost(ms): 4855
2023-12-28 16:21:09.560 [0-0-113-writer] ERROR WriterRunner - Writer Runner Received Exceptions:
com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-05], Description:[往您配置的写入表中写入数据时失败.]. - Failed to flush data to doris.
{"Status":"Fail","BeginTxnTimeMs":0,"Message":"[INTERNAL_ERROR]too many filtered rows","NumberUnselectedRows":0,"CommitAndPublishTimeMs":0,"Label":"dwd_c_order_goods_detail_v2_dt_4b3d8caa-c191-4f44-98a3-72f758ad5396_6","LoadBytes":42505201,"StreamLoadPutTimeMs":4,"NumberTotalRows":19387,"WriteDataTimeMs":4495,"TxnId":137186993,"LoadTimeMs":4502,"ErrorURL":"http://10.0.xx.xx:8040/api/_load_error_log?file=__shard_0/error_log_insert_stmt_6541d8d41f2b4b09-1584eac064efdf9d_6541d8d41f2b4b09_1584eac064efdf9d","TwoPhaseCommit":"false","ReadDataTimeMs":2689,"NumberLoadedRows":19386,"NumberFilteredRows":1}
	at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:30) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.plugin.writer.doriswriter.DorisWriterEmitter.doStreamLoad(DorisWriterEmitter.java:117) ~[doriswriter-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.plugin.writer.doriswriter.DorisWriter$Task.flush(DorisWriter.java:108) ~[doriswriter-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.plugin.writer.doriswriter.DorisWriter$Task.startWrite(DorisWriter.java:92) ~[doriswriter-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
	at java.lang.Thread.run(Thread.java:750) [na:1.8.0_352]
Exception in thread "taskGroup-0" com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-05], Description:[往您配置的写入表中写入数据时失败.]. - Failed to flush data to doris.
```
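The StreamLoad response embedded in the log line is a JSON document, so fields like `NumberFilteredRows` and `ErrorURL` can be pulled out programmatically instead of read by eye. A minimal sketch (the log line below is abridged from the output above; the real response has more fields):

```python
import json
import re

# An abridged StreamLoad response line from the DataX log above.
log_line = (
    '2023-12-28 16:21:09.556 [0-0-113-writer] INFO DorisWriterEmitter - '
    'StreamLoad response: {"Status":"Fail",'
    '"Message":"[INTERNAL_ERROR]too many filtered rows",'
    '"NumberTotalRows":19387,"NumberFilteredRows":1,'
    '"ErrorURL":"http://10.0.xx.xx:8040/api/_load_error_log?file=__shard_0/..."}'
    ', cost(ms): 4855'
)

# The JSON body is the {...} span inside the line; these responses contain
# no nested braces, so a greedy match captures exactly the JSON object.
body = re.search(r"\{.*\}", log_line).group(0)
resp = json.loads(body)
print(resp["NumberFilteredRows"])  # → 1
print(resp["ErrorURL"])
```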
The log shows that `[0-0-113-writer]` hit the error while writing. Scroll back up in the DataX log to find out which file task 0-0-113 was reading, i.e. search for `0-0-113-reader`. The matching lines are:
```
2023-12-28 16:20:28.808 [0-0-113-reader] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2023-12-28 16:20:28.808 [0-0-113-reader] INFO Reader$Task - read start
2023-12-28 16:20:28.808 [0-0-113-reader] INFO Reader$Task - reading file : [ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/statdate=20231227/000360_0]
2023-12-28 16:20:28.808 [0-0-113-reader] INFO HdfsReader$Job - Start Read orcfile [ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/statdate=20231227/000360_0].
```
So the offending file is `ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/statdate=20231227/000360_0`.
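When the logs are long, this writer-to-reader lookup can be scripted. A small helper, assuming the DataX log format shown above:

```python
import re
from typing import Optional

def find_reader_file(log_text: str, task_id: str) -> Optional[str]:
    """Return the file path a DataX reader task opened, given the full
    log text and a task id such as '0-0-113'."""
    pattern = re.compile(
        r"\[" + re.escape(task_id) + r"-reader\].*reading file : \[(.+?)\]"
    )
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m:
            return m.group(1)
    return None

# Example on the reader log line shown above:
log = ('2023-12-28 16:20:28.808 [0-0-113-reader] INFO Reader$Task - '
       'reading file : [ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/'
       'statdate=20231227/000360_0]')
print(find_reader_file(log, "0-0-113"))
```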
If it is a text file, you can inspect it directly; if it is an ORC file, Hive can decode it for you.
Hive ships with a command for dumping the metadata of an ORC file on HDFS:
```shell
hive --orcfiledump <location-of-orc-file>
```
Redirect the output to a file for easier reading:
```shell
hive --orcfiledump ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/statdate=20231227/000360_0 > orc.txt
```
Looking at the dump, it breaks down into roughly four parts:
```
Structure for ofs://xxx.com/usr/hive/warehouse/db_middle.db/xxx/statdate=20231227/000360_0
......
Stripe Statistics:
......
File Statistics:
......
Stripes:
......
```
What each part contains:
- Structure for xx file
  - Records the row count, compression kind, compression block size, and the table schema.
  - In the schema part, all columns of the table are wrapped into one big struct.
- Stripe Statistics
  - An ORC file stores its data as a sequence of stripes; this section lists per-stripe statistics: `count`, `min`, `max`, and `sum` for each column. For the outermost struct, only `count` is reported.
- File Statistics
  - The same statistics as the stripe section, but computed over the whole file.
- Stripes
  - The index of each stripe, plus each column's index and encoding.
Here I only care about the min/max values in File Statistics, which are what help me locate the bad data. For example:
```
File Statistics:
  Column 0: count: 968776 hasNull: false
  Column 1: count: 968776 hasNull: false min: app001 max: app002 sum: 14531656
```
Column 1 is the one that helps here: the bad row lies somewhere between `app001` and `app002`, which immediately narrows the search.
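When a partition spans hundreds of files, this check can be automated: run the dump over each file, parse the File Statistics section, and keep only the files whose [min, max] range could contain the suspect value. A sketch of the parsing step, assuming the output format shown in the sample above:

```python
import re

def parse_file_statistics(dump_text):
    """Parse the 'File Statistics:' section of `hive --orcfiledump` output
    into {column_index: {"count": ..., "min": ..., "max": ...}}."""
    stats = {}
    in_section = False
    for line in dump_text.splitlines():
        if line.strip() == "File Statistics:":
            in_section = True
            continue
        if in_section:
            m = re.match(r"\s*Column (\d+): count: (\d+)", line)
            if not m:
                break  # left the File Statistics section
            col = {"count": int(m.group(2))}
            mm = re.search(r"min: (\S+) max: (\S+)", line)
            if mm:
                col["min"], col["max"] = mm.group(1), mm.group(2)
            stats[int(m.group(1))] = col
    return stats

# Example on the sample dump output above:
dump = """File Statistics:
  Column 0: count: 968776 hasNull: false
  Column 1: count: 968776 hasNull: false min: app001 max: app002 sum: 14531656
Stripes:"""
print(parse_file_statistics(dump)[1]["min"])  # → app001
```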
Data engineering work inevitably involves moving large volumes of data, sometimes spread across hundreds of ORC files. When a job fails, the approach above narrows the problem down to one specific file, and knowing which slice of data went wrong is a big help.