StreamLoad batch writes to Doris

We have a requirement to insert vehicle-passage records into Doris in real time. The application runs on Java 11 + Spring Boot 2.7; the Doris version is 2.1.

Creating the table

CREATE TABLE IF NOT EXISTS db_ga1400.ods_ga1400_violation_vehicle (
    pass_time datetime not null,
    motor_vehicle_id VARCHAR(144),
    ...... -- remaining columns omitted
)
duplicate key(`pass_time`)
partition by range(`pass_time`) ()
DISTRIBUTED BY HASH(`pass_time`)
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "MONTH",
    "dynamic_partition.start" = "-12",
    "dynamic_partition.end" = "2",
    "dynamic_partition.prefix" = "vio",
    "dynamic_partition.buckets" = "1"
);

Version 1: JDBC batch insert

Doris speaks the MySQL protocol, so we use the MySQL JDBC driver, with the mappers generated by MyBatis. Since the data volume is fairly large, we insert in batches and commit manually:

try (SqlSession sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH, false)) {
    GA1400ViolationMapper mapper = sqlSession.getMapper(GA1400ViolationMapper.class);
    for (int i = 0; i < violationList.size(); i++) {
        // Note: in BATCH executor mode this return value is a placeholder,
        // not the real affected-row count
        affectedRows += mapper.insertSelective(violationList.get(i));
        if (i != 0 && i % 1000 == 0) {
            sqlSession.commit();
        }
    }
    sqlSession.commit();
    ......
} catch (Exception ex) {
    ......
}

The receiver accumulates a batch of records from a queue and inserts them in bulk, but this was very slow.
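The "accumulate a batch from the queue" step can be sketched with a BlockingQueue (a minimal stand-in; the class and method names are illustrative, not from the original receiver):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchDrainSketch {
    // Drain up to batchSize records: block briefly for the first one,
    // then grab whatever else is already queued.
    static <T> List<T> takeBatch(BlockingQueue<T> queue, int batchSize) throws InterruptedException {
        List<T> batch = new ArrayList<>(batchSize);
        T first = queue.poll(100, TimeUnit.MILLISECONDS); // wait briefly for the first record
        if (first == null) {
            return batch; // nothing arrived in time
        }
        batch.add(first);
        queue.drainTo(batch, batchSize - 1); // non-blocking: take what is available
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 250; i++) {
            queue.add("record-" + i);
        }
        System.out.println(takeBatch(queue, 100).size()); // prints 100
    }
}
```

Draining a bounded batch like this keeps each insert request a predictable size instead of flushing one record at a time.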

Some searching suggested appending rewriteBatchedStatements=true to the JDBC URL to enable batched writes to Doris. (rewriteBatchedStatements=true makes the MySQL driver rewrite a statement batch into a single multi-row INSERT instead of sending the statements one by one.)
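For reference, enabling the flag on a Spring Boot datasource URL looks roughly like this (host and database name are placeholders; 9030 is the default FE MySQL-protocol port):

```properties
# application.properties -- host/db are placeholders
spring.datasource.url=jdbc:mysql://doris-fe-host:9030/db_ga1400?rewriteBatchedStatements=true&useUnicode=true&characterEncoding=utf8
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
```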

After enabling it, throughput looked fine at first but degraded quickly; the result was still unsatisfactory.

[ INFO ] [2024-10-14 14:29:02] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 725ms
[ INFO ] [2024-10-14 14:29:03] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 457ms
[ INFO ] [2024-10-14 14:29:03] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 481ms
[ INFO ] [2024-10-14 14:29:04] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 550ms
[ INFO ] [2024-10-14 14:29:04] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 449ms
[ INFO ] [2024-10-14 14:29:05] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 457ms
[ INFO ] [2024-10-14 14:29:05] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 407ms
[ INFO ] [2024-10-14 14:29:12] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 6765ms
[ INFO ] [2024-10-14 14:29:25] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 13111ms
[ INFO ] [2024-10-14 14:29:41] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 15972ms
[ INFO ] [2024-10-14 14:29:49] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 8707ms
[ INFO ] [2024-10-14 14:29:58] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 8815ms
[ INFO ] [2024-10-14 14:30:18] c.c.p.s.GA1400ViolationService [45] - Batch insert succeeded, rows: 100, elapsed: 19679ms

Others have hit the same problem, and the official reply is: don't use JDBC, use Stream Load. (https://github.com/apache/doris/discussions/19336)

Version 2: Stream Load batch import

With Stream Load, you submit the import job over HTTP to an FE node; the FE redirects the request to a BE node in round-robin fashion for load balancing. You can also send the request directly to a specific BE node. For each Stream Load, Doris picks one node as the Coordinator, which receives the data and distributes it to the other nodes, so the HTTP client must handle the 307 redirect correctly.

The main Stream Load flow: the client PUTs the data to an FE, follows the 307 redirect to the coordinator BE, the coordinator distributes the data to the other BEs, and the transaction is committed and published.

The Stream Load client is set up as follows:

package com.cy.platform.service;

import cn.hutool.core.util.IdUtil;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.annotation.JSONField;
import kotlin.text.Charsets;
import org.apache.commons.codec.binary.Base64;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpHeaders;
import org.springframework.stereotype.Service;

import javax.annotation.PostConstruct;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Calendar;

/**
 * @className: DorisStreamLoadClient
 * @description: 
 * @author: tangxzh
 * @date: 2024/10/14 15:29
 **/
@Service
public class DorisStreamLoadClient {
    @Value("${doris.ip}")
    private String dorisIp;
    @Value("${doris.port:8030}")
    private String dorisPort;
    @Value("${doris.stream-load.user}")
    private String user;
    @Value("${doris.stream-load.password}")
    private String password;
    @Value("${doris.stream-load.db}")
    private String dbName;
    // Basic auth header
    private String basicAuthHeader;

    @PostConstruct
    public void init() {
        this.basicAuthHeader = constructAuthHeader(user, password);
    }
    
    // Build the HTTP client builder with a redirect strategy that also follows 307
    private final static HttpClientBuilder httpClientBuilder = HttpClients
        .custom()
        .setRedirectStrategy(new DefaultRedirectStrategy() {
            @Override
            protected boolean isRedirectable(String method) {
                // If the connection target is FE, you need to deal with 307 redirect。
                return true;
            }
        });
    
    // Stream Load accepts several formats (csv, json, ...); here we write json
    public StreamLoadResult loadJson(String table, String jsonData) throws IOException {
        String loadUrl = String.format("http://%s:%s/api/%s/%s/_stream_load", dorisIp, dorisPort, dbName, table);
        // use the builder configured above so the 307 redirect from the FE is followed
        try (CloseableHttpClient client = httpClientBuilder.build()) {
            HttpPut put = new HttpPut(loadUrl);
            put.setHeader(HttpHeaders.EXPECT, "100-continue");
            put.setHeader(HttpHeaders.AUTHORIZATION, basicAuthHeader);

            // Stream Load properties can be set via HTTP headers.
            String label = String.format("%s_%s", table, IdUtil.getSnowflake().nextIdStr());
            put.setHeader("label", label);
            put.setHeader("format", "json");
            put.setHeader("Content-Type", ContentType.APPLICATION_JSON.toString());
            put.setHeader("strip_outer_array", "true");
            // fuzzy_parse: every row in the array must have exactly the same field order.
            // Doris parses the field order of the first row only and accesses later rows
            // by index, which can speed up the import by 3-5x.
            put.setHeader("fuzzy_parse", "true");

            StringEntity entity = new StringEntity(jsonData, Charsets.UTF_8);
            put.setEntity(entity);
            
            try (CloseableHttpResponse response = client.execute(put)) {
                if (response.getEntity() == null) {
                    return new StreamLoadResult("Fail");
                } else {
                    try {
                        return JSON.parseObject(EntityUtils.toString(response.getEntity()), StreamLoadResult.class);
                    } catch (Exception e) {
                        return new StreamLoadResult("Fail");
                    }
                }
            }
        }
    }

    private String constructAuthHeader(String user, String password) {
        final String tobeEncode = user + ":" + password;
        byte[] encoded = Base64.encodeBase64(tobeEncode.getBytes(StandardCharsets.UTF_8));
        return "Basic " + new String(encoded);
    }

    // Result of a Stream Load call; on failure, ErrorURL points at the row-level details
    public static class StreamLoadResult {
        @JSONField(name = "Txnid")
        private long txnid;
        @JSONField(name = "Label")
        private String label;
        @JSONField(name = "Comment")
        private String comment;
        @JSONField(name = "TwoPhaseCommit")
        private boolean twoPhaseCommit;
        @JSONField(name = "Status")
        private String status;
        @JSONField(name = "Message")
        private String message;
        @JSONField(name = "NumberTotalRows")
        private long numberTotalRows;
        @JSONField(name = "NumberLoadedRows")
        private long numberLoadedRows;
        @JSONField(name = "NumberFilteredRows")
        private long numberFilteredRows;
        @JSONField(name = "NumberUnselectedRows")
        private long numberUnselectedRows;
        @JSONField(name = "LoadBytes")
        private long loadBytes;
        @JSONField(name = "LoadTimeMs")
        private long loadTimeMs;
        @JSONField(name = "BeginTxnTimeMs")
        private long beginTxnTimeMs;
        @JSONField(name = "StreamLoadPutTimeMs")
        private long streamLoadPutTimeMs;
        @JSONField(name = "ReadDataTimeMs")
        private long readDataTimeMs;
        @JSONField(name = "WriteDataTimeMs")
        private long writeDataTimeMs;
        @JSONField(name = "CommitAndPublishTimeMs")
        private long commitAndPublishTimeMs;
        @JSONField(name = "ErrorURL")
        private String errorURL;

        public StreamLoadResult() {}
        public StreamLoadResult(String status) {
            this.status = status;
        }
        ...... getters/setters omitted
    }
}

Take a batch of records from the queue and send it:

long t1 = System.currentTimeMillis();
try {
    String jsonData = JSON.toJSONString(violationList,  new SerializerFeature[]{SerializerFeature.WriteMapNullValue, SerializerFeature.SortField} );
    DorisStreamLoadClient.StreamLoadResult loadResponse = dorisStreamLoadClient.loadJson("ods_ga1400_violation_vehicle", jsonData);
    long t2 = System.currentTimeMillis();

    if ("Success".equalsIgnoreCase(loadResponse.getStatus())) {
        logger.info("Uploaded violation data to Doris: {} rows in {} ms", violationList.size(), t2 - t1);
    } else {
        logger.error("Failed to upload violation data to Doris: {} rows in {} ms, response: {}\r\ndata: {}", violationList.size(), t2 - t1, loadResponse, jsonData);
    }
} catch (Exception ex) {
    logger.error("Exception while uploading violation data to Doris: {}, {} rows in {} ms", ex.getMessage(), violationList.size(), System.currentTimeMillis() - t1);
}

Each request sends 100+ rows and takes around 300 ms; the latency stays stable:

[ INFO ] [2024-10-16 16:54:37] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 203 rows in 291 ms
[ INFO ] [2024-10-16 16:54:37] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 98 rows in 283 ms
[ INFO ] [2024-10-16 16:54:37] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 133 rows in 191 ms
[ INFO ] [2024-10-16 16:54:37] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 100 rows in 345 ms
[ INFO ] [2024-10-16 16:54:38] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 136 rows in 354 ms
[ INFO ] [2024-10-16 16:54:38] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 148 rows in 241 ms
[ INFO ] [2024-10-16 16:54:38] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 127 rows in 374 ms
[ INFO ] [2024-10-16 16:54:39] c.c.p.service.MotorVehicleReceiver [70] - Uploaded violation data to Doris: 127 rows in 283 ms

Problems encountered

Problem 1: Stream Load returns "too many filtered rows"

Check the error details via StreamLoadResult.ErrorURL.

Error 1: the imported JSON columns do not match.

Fix: include null values when serializing the objects to JSON.

JSON.toJSONString(deviceList);

becomes

JSON.toJSONString(deviceList, new SerializerFeature[]{SerializerFeature.WriteMapNullValue, SerializerFeature.SortField});

SerializerFeature.WriteMapNullValue: also serialize fields whose value is null (default false). SerializerFeature.SortField: emit fields sorted by name (default false). This matters for Stream Load because with fuzzy_parse=true (set in the request header) every row must have exactly the same field order: Doris parses the field order of the first row only and accesses later rows by index, which can speed up the import by 3-5x.
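With those two features on, the PUT body is a plain JSON array whose rows all share the same, name-sorted field order, with nulls written out explicitly (the plate_no field here is illustrative, not from the original schema):

```json
[
  {"motor_vehicle_id": "A123", "pass_time": "2024-10-14 14:29:02", "plate_no": null},
  {"motor_vehicle_id": "B456", "pass_time": "2024-10-14 14:30:11", "plate_no": null}
]
```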

Error 2: "no partition for this tuple"

The table is dynamically partitioned by month on the record time. The failing records were five months older than the current time (historical data), and inspecting the table's partitions showed only the current and future partitions, with no historical ones.

Fix: set dynamic_partition.create_history_partition to true, either at table creation or by altering the table properties.

ALTER TABLE db_ga1400.ods_ga1400_violation_vehicle SET
(
     "dynamic_partition.create_history_partition" = "true"
);

Check that the partitions are now created correctly:

show partitions from db_ga1400.ods_ga1400_violation_vehicle

Problem 2: the imported row counts didn't add up. The cause was duplicate labels: the label was originally generated from a date plus a prefix, so two batches in the same period collided. Just make sure the label is unique, e.g. prefix + snowflake ID.
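A minimal stdlib sketch of the fix (the client code above uses hutool's snowflake ID; the UUID here is just a stand-in to show the idea):

```java
import java.util.UUID;

public class LabelSketch {
    // Build a Stream Load label: table prefix + a globally unique suffix.
    // Doris rejects a load whose label was already used in the same database,
    // so a date-based label collides as soon as two batches share a timestamp.
    static String uniqueLabel(String table) {
        return table + "_" + UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        String l1 = uniqueLabel("ods_ga1400_violation_vehicle");
        String l2 = uniqueLabel("ods_ga1400_violation_vehicle");
        System.out.println(l1.equals(l2)); // prints false: two batches never share a label
    }
}
```

The label doubles as Doris's deduplication key, so uniqueness per batch is exactly what guarantees no silently skipped loads.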
