Real-Time Sync from PostgreSQL to Kudu with StreamSets

Latest recommended article published on 2024-05-28 14:33:56

无极_之道

Categories: kudu, streamsets, postgresql. Tags: postgresql, etl

Article link: https://blog.csdn.net/weixin_40817778/article/details/125426508

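StreamSets offers two ways to sync PostgreSQL: JDBC queries and CDC. Real-time sync needs both: a JDBC full load for the initial sync, then CDC to pick up ongoing changes.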

Start with a full-table sync, for which the JDBC approach works well:

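This is more convenient than syncing MySQL, because you can list multiple schemas and multiple tables and sync them all at once.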

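This fires each time a sync pass completes, so the pipeline does not fail with an error when no data is coming in. Syncing resumes with the next transaction.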

This setting must be configured; otherwise the _int and json formats cause errors.

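Check this option. The type conversion mainly turns time formats into String, because every field that records a time in Kudu is stored as a string. After that, simply start the pipeline.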
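As a rough illustration of that time-to-string conversion, here is a minimal plain-Python sketch that runs outside StreamSets. The function name `stringify_times` and the format string are assumptions made for this example; in the pipeline itself the conversion is done through the stage's type-conversion settings.

```python
import datetime

# Assumed output format; use whatever the Kudu string columns are expected to hold.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def stringify_times(row):
    """Return a copy of row with date/time values rendered as strings."""
    out = {}
    for name, value in row.items():
        if isinstance(value, datetime.datetime):
            # Check datetime first: it is a subclass of date.
            out[name] = value.strftime(TIME_FORMAT)
        elif isinstance(value, (datetime.date, datetime.time)):
            out[name] = value.isoformat()
        else:
            out[name] = value
    return out

row = {"id": 1, "created_at": datetime.datetime(2022, 6, 23, 10, 30, 0)}
print(stringify_times(row))
```

Non-time values pass through unchanged, so the same function can be applied to every row regardless of its table.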

JDBC Multitable Consumer also has performance settings; see the official documentation for details.

While the full sync is running, you can configure the PostgreSQL CDC Client pipeline:

Unrecognized formats are passed into the stream as strings.

Jython Evaluator configuration (the key ETL step):

import time

# wal2json change kind -> StreamSets CRUD operation code
# (1 = INSERT, 2 = DELETE, 3 = UPDATE)
OPERATION_TYPES = {'insert': '1', 'delete': '2', 'update': '3'}

for record in records:
    try:
        # One CDC record holds one transaction; emit one record per change.
        for i, change in enumerate(record.value['change']):
            # The change index keeps record IDs distinct within a transaction.
            newRecord = sdcFunctions.createRecord(
                record.sourceId + str(time.time()) + '-' + str(i))
            newRecord.value = {}
            newRecord.attributes['xid'] = str(record.value['xid'])
            newRecord.attributes['nextlsn'] = record.value['nextlsn']
            newRecord.attributes['timestamp'] = record.value['timestamp']
            newRecord.attributes['kind'] = change['kind']
            newRecord.attributes['schema'] = change['schema']
            newRecord.attributes['jdbc.tables'] = change['table']

            if change['kind'] in OPERATION_TYPES:
                newRecord.attributes['sdc.operation.type'] = OPERATION_TYPES[change['kind']]

            # Use the column values when present; otherwise (e.g. deletes)
            # fall back to the old key values.
            if 'columnnames' in change:
                columns = change['columnnames']
                values = change['columnvalues']
            else:
                columns = change['oldkeys']['keynames']
                values = change['oldkeys']['keyvalues']

            for j in range(len(columns)):
                newRecord.value[columns[j]] = values[j]

            output.write(newRecord)
            # Optional: uncomment to also keep the original record in the batch.
            # output.write(record)
    except Exception as e:
        # Send the record to the error stream
        error.write(record, str(e))
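
To check the flattening logic without a running pipeline, the same idea can be sketched in plain Python. The sample message below imitates the wal2json-style structure that the script reads (`change`, `columnnames`, `oldkeys`, and so on); the function name `flatten_changes` and the `users` table are invented for illustration.

```python
# Same mapping the script uses: 1 = INSERT, 2 = DELETE, 3 = UPDATE.
OPERATION_TYPES = {"insert": "1", "delete": "2", "update": "3"}

def flatten_changes(message):
    """Turn one transaction message into a list of (attributes, row) pairs."""
    results = []
    for change in message["change"]:
        attributes = {
            "kind": change["kind"],
            "schema": change["schema"],
            "jdbc.tables": change["table"],
        }
        if change["kind"] in OPERATION_TYPES:
            attributes["sdc.operation.type"] = OPERATION_TYPES[change["kind"]]
        # Prefer the column values; fall back to the old keys (deletes).
        if "columnnames" in change:
            names, values = change["columnnames"], change["columnvalues"]
        else:
            keys = change["oldkeys"]
            names, values = keys["keynames"], keys["keyvalues"]
        results.append((attributes, dict(zip(names, values))))
    return results

# Hypothetical sample: one insert and one delete in the same transaction.
sample = {
    "xid": 1234,
    "change": [
        {"kind": "insert", "schema": "public", "table": "users",
         "columnnames": ["id", "name"], "columntypes": ["integer", "text"],
         "columnvalues": [1, "alice"]},
        {"kind": "delete", "schema": "public", "table": "users",
         "oldkeys": {"keynames": ["id"], "keytypes": ["integer"],
                     "keyvalues": [1]}},
    ],
}

for attributes, row in flatten_changes(sample):
    print(attributes["sdc.operation.type"], attributes["jdbc.tables"], row)
```

Each change becomes its own flat row, which is what the Kudu destination needs: one record per affected row, with the operation type and target table carried in the header attributes.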

Stream Selector configuration: routing records to separate streams
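
The conditions used for routing are not shown here, so the following is only a plain-Python sketch of what such a split might amount to. It assumes records are routed by the `sdc.operation.type` attribute set in the Jython script, and the stream names `deletes`, `upserts`, and `default` are hypothetical. In the Stream Selector itself, each condition would be an expression-language test on the record, something along the lines of `${record:attribute('sdc.operation.type') == '2'}`.

```python
def route(attributes):
    """Pick an output stream name from a record's header attributes."""
    operation = attributes.get("sdc.operation.type")
    if operation == "2":
        return "deletes"   # hypothetical stream for DELETE records
    if operation in ("1", "3"):
        return "upserts"   # hypothetical stream for INSERT and UPDATE records
    return "default"       # catch-all for anything that matches no condition

print(route({"sdc.operation.type": "2", "jdbc.tables": "users"}))
```

Routing on `jdbc.tables` instead would follow the same pattern, with one condition per target table.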