Implementing Kafka Data Consumption into Doris

This article focuses on consuming data from Kafka and syncing it into Doris. Some other OLAP analytical databases handle this natively: ClickHouse, for example, provides a Kafka engine table that consumes Kafka data and then syncs it into the target physical table through a materialized view. Doris has no such Kafka engine table, so how do we sync Kafka data into it?

   

The rest of this article walks through two approaches to syncing Kafka data into Doris:

1. Using Routine Load, Doris's built-in data import mechanism

  • The Kafka messages are plain strings with an ordinary delimiter, e.g. '|'

 Create the table that will receive the data

CREATE TABLE IF NOT EXISTS sea.user
(
    siteid INT DEFAULT '10',
    citycode SMALLINT,
    username VARCHAR(32) DEFAULT '',
    pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY HASH(siteid) BUCKETS 10
PROPERTIES("replication_num" = "1");

Routine Load statement that connects to Kafka

 CREATE ROUTINE LOAD sea.test ON 
 user COLUMNS TERMINATED BY "|",
 COLUMNS(siteid,citycode,username,pv)
 PROPERTIES(
 "desired_concurrent_number"="1",
 "max_batch_interval"="20",
 "max_batch_rows"="300000",
 "max_batch_size"="209715200")
 FROM KAFKA(
 "kafka_broker_list"="192.168.18.129:9092",
 "kafka_topic"="doris",
 "property.group.id"="gid",
 "property.clinet.id"="cid",
 "property.kafka_default_offsets"="OFFSET_BEGINNING");


 

Note: sea is the database name and must be prefixed to the job label test; the table name user must NOT carry a database prefix, otherwise the statement is not recognized and errors out.
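After the job is created, you can confirm it has reached the RUNNING state before checking the data, for example:

SHOW ROUTINE LOAD FOR sea.test;   -- the State column should show RUNNING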

  • The Kafka messages are JSON data

    Create the table that will receive the data

    create table dev_ods.ods_user_log(
        `distinct_id`  string not null COMMENT 'member id',
        `time`         bigint not null COMMENT 'timestamp (milliseconds)',
        `event_at`     datetime COMMENT 'event time, down to the second',
        `_track_id`    string COMMENT 'tracking id',
        `login_id`     string COMMENT 'login id',
        `lib`          string COMMENT 'lib',
        `anonymous_id` string COMMENT 'anonymous id',
        `_flush_time`  bigint COMMENT 'flush time',
        `type`         string COMMENT 'type',
        `event`        string COMMENT 'event type',
        `properties`   string COMMENT 'event properties',
        `identities`   string COMMENT 'identity info',
        `dt`           date COMMENT 'event date'
    )
    primary key (distinct_id, `time`)
    distributed by hash(distinct_id);
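    Note that the primary key / distributed by hash syntax above is StarRocks-flavored DDL. On Doris the same table is typically modeled with a UNIQUE KEY; the following is only a rough sketch of an equivalent (the VARCHAR length, bucket count and replication number are assumptions, and VARCHAR is used for distinct_id because Doris key columns cannot use the STRING type):

    CREATE TABLE IF NOT EXISTS dev_ods.ods_user_log
    (
        `distinct_id`  VARCHAR(64) NOT NULL COMMENT 'member id',
        `time`         BIGINT      NOT NULL COMMENT 'timestamp (milliseconds)',
        `event_at`     DATETIME  COMMENT 'event time, down to the second',
        `_track_id`    STRING    COMMENT 'tracking id',
        `login_id`     STRING    COMMENT 'login id',
        `lib`          STRING    COMMENT 'lib',
        `anonymous_id` STRING    COMMENT 'anonymous id',
        `_flush_time`  BIGINT    COMMENT 'flush time',
        `type`         STRING    COMMENT 'type',
        `event`        STRING    COMMENT 'event type',
        `properties`   STRING    COMMENT 'event properties',
        `identities`   STRING    COMMENT 'identity info',
        `dt`           DATE      COMMENT 'event date'
    )
    UNIQUE KEY(`distinct_id`, `time`)
    DISTRIBUTED BY HASH(`distinct_id`) BUCKETS 10
    PROPERTIES("replication_num" = "1");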

    Routine Load statement that connects to Kafka and parses the JSON data

CREATE ROUTINE LOAD dev_ods.user_log ON ods_user_log
COLUMNS(distinct_id,time,_track_id,login_id,lib,anonymous_id,_flush_time,type,event,properties,identities,dt = from_unixtime(time/1000, '%Y%m%d'),event_at=from_unixtime(time/1000, 'yyyy-MM-dd HH:mm:ss'))
PROPERTIES
(
    "desired_concurrent_number"="3",
    "max_batch_interval" = "20",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "strict_mode" = "false",
    "format" = "json"
 )FROM KAFKA
(
    "kafka_broker_list"= "10.150.20.12:9092",
    "kafka_topic" = "bigDataSensorAnalyse",
    "property.group.id"="test_group_2",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING",
    "property.enable.auto.commit"="false"
 );

 JSON structure:

Here the value of the properties field is itself a JSON object; dt and event_at are not present in the Kafka message but are derived during the load and written to the table.
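Purely for illustration (every field value below is hypothetical), a single message in the topic looks roughly like this; dt and event_at are absent and are computed from time by the COLUMNS clause above:

{
    "distinct_id": "u_10001",
    "time": 1655090400123,
    "_track_id": "t_0001",
    "login_id": "u_10001",
    "lib": "js",
    "anonymous_id": "anon_0001",
    "_flush_time": 1655090401000,
    "type": "track",
    "event": "page_view",
    "properties": {"os": "iOS", "screen_width": 414},
    "identities": "login_id:u_10001"
}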

Notes: 1) If the JSON data starts with an array and each object in the array is one record, strip_outer_array must be set to true to flatten the array.

   2) If the JSON data starts with an array and each object in the array is one record, then when setting jsonpaths the ROOT node is effectively each object inside the array.

   Two JSON data formats are supported:
  1){"category":"a9jadhx","author":"test","price":895}
  2)[
            {"category":"a9jadhx","author":"test","price":895},
            {"category":"axdfa1","author":"EvelynWaugh","price":1299}
     ]


 

 These are the two JSON formats that Doris currently supports parsing.

When the JSON has the following array structure:

{  "RECORDS": [    {      "category": "11",      "title": "SayingsoftheCentury",      "price": 895,      "timestamp": 1589191587    },    {      "category": "22",      "author": "2avc",      "price": 895,      "timestamp": 1589191487    },    {      "category": "33",      "author": "3avc",      "title": "SayingsoftheCentury",      "timestamp": 1589191387    }  ]}

The corresponding parsing SQL statement is:

Specify the JSON root node with json_root:
    CREATE ROUTINE LOAD example_db.test1 ON example_tbl
    COLUMNS(category, author, price, timestamp, dt=from_unixtime(timestamp, '%Y%m%d'))
    PROPERTIES
    (
        "desired_concurrent_number"="3",
        "max_batch_interval" = "20",
        "max_batch_rows" = "300000",
        "max_batch_size" = "209715200",
        "strict_mode" = "false",
        "format" = "json",
        "jsonpaths" = "[\"$.category\",\"$.author\",\"$.price\",\"$.timestamp\"]",
        "strip_outer_array" = "true",
        "json_root" = "$.RECORDS"
    )
    FROM KAFKA
    (
        "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
        "kafka_topic" = "my_topic",
        "kafka_partitions" = "0,1,2",
        "kafka_offsets" = "0,0,0"
    );


 

Checking Routine Load status

Show the status of all Routine Load jobs under the example_db database:
use example_db;
SHOW ALL ROUTINE LOAD;

Show the status of a specific Routine Load job:
SHOW ALL ROUTINE LOAD FOR datasource_name.kafka_load;

Common Routine Load commands:
a) Pause a routine load:
PAUSE ROUTINE LOAD FOR datasource_name.kafka_load;
b) Resume a routine load:
RESUME ROUTINE LOAD FOR datasource_name.kafka_load;
c) Stop a routine load (a stopped job cannot be resumed):
STOP ROUTINE LOAD FOR datasource_name.kafka_load;
d) Show [all] routine load jobs:
SHOW [ALL] ROUTINE LOAD FOR datasource_name.kafka_load;
e) Show routine load tasks:
SHOW ROUTINE LOAD TASK WHERE JobName = "kafka_load";

Query the loaded data:
SELECT * FROM datasource_name.table_name LIMIT 10;

Parameter notes

1) OFFSET_BEGINNING: subscribe from the earliest position that still has data.
2) OFFSET_END: subscribe from the end of the topic, i.e. only new data.

Note: the examples above connect to a Kafka cluster without authentication; for other setups (e.g. SASL or SSL), see the official documentation linked at the bottom of this article.
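As one example, for a cluster that requires SASL/PLAIN authentication, the extra Kafka client settings can be passed through with the property. prefix. The following is only a sketch based on the first example; the job name, mechanism and credentials are placeholders:

CREATE ROUTINE LOAD sea.test_sasl ON user
COLUMNS TERMINATED BY "|",
COLUMNS(siteid, citycode, username, pv)
PROPERTIES(
    "desired_concurrent_number" = "1",
    "max_batch_interval" = "20",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200"
)
FROM KAFKA(
    "kafka_broker_list" = "192.168.18.129:9092",
    "kafka_topic" = "doris",
    -- assumed SASL/PLAIN settings; username/password are placeholders
    "property.security.protocol" = "SASL_PLAINTEXT",
    "property.sasl.mechanism" = "PLAIN",
    "property.sasl.username" = "kafka_user",
    "property.sasl.password" = "kafka_password",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);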

2. Using Flink SQL to consume Kafka and write into Doris

create table flink_test_1 ( 
    id BIGINT,
    day_time VARCHAR,
    amount BIGINT,
    proctime AS PROCTIME ()
)
with ( 
    'connector' = 'kafka',
    'topic' = 'flink_test',
    'properties.bootstrap.servers' = '10.150.60.5:9092', 
    'properties.group.id' = 'flink_gp_test1',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json',
    'json.fail-on-missing-field' = 'false',
    'json.ignore-parse-errors' = 'true' 
);
CREATE TABLE sync_test_1( 
        day_time string,
        total_gmv bigint,
    PRIMARY KEY (day_time) NOT ENFORCED
    ) WITH (  
        'connector' = 'starrocks', 
        'jdbc-url'='jdbc:mysql://10.150.60.2:9030', 
        'load-url'='10.150.60.2:8040;10.150.60.11:8040;10.150.60.17:8040', 
        'database-name' = 'test', 
        'table-name' = 'sync_test_1', 
        'username' = 'root', 
        'password' = 'bigdata1234', 
        'sink.buffer-flush.max-rows' = '1000000', 
        'sink.buffer-flush.max-bytes' = '300000000', 
        'sink.buffer-flush.interval-ms' = '5000',
        'sink.max-retries' = '3'
    
);

INSERT INTO sync_test_1 
SELECT day_time, SUM(amount) AS total_gmv FROM flink_test_1 GROUP BY day_time;
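Note that the sink table above is declared with the StarRocks Flink connector ('connector' = 'starrocks'). When writing to Doris itself, the flink-doris-connector provides an analogous sink table; the following is only a rough sketch, where the FE address, credentials and label prefix are placeholders and the connector jar is assumed to be on the Flink classpath:

CREATE TABLE sync_test_1_doris (
    day_time STRING,
    total_gmv BIGINT,
    PRIMARY KEY (day_time) NOT ENFORCED
) WITH (
    'connector' = 'doris',
    'fenodes' = '10.150.60.2:8030',            -- FE HTTP address (placeholder)
    'table.identifier' = 'test.sync_test_1',   -- target database.table in Doris
    'username' = 'root',
    'password' = 'bigdata1234',
    'sink.label-prefix' = 'sync_test_1'        -- unique prefix for stream load labels
);

INSERT INTO sync_test_1_doris
SELECT day_time, SUM(amount) AS total_gmv FROM flink_test_1 GROUP BY day_time;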


    The Flink SQL approach above is provided for reference; more Flink SQL topics will be covered step by step in later articles.

Doris official documentation link

References

 Importing Kafka data into Doris
