CDC Tools for PostgreSQL: Debezium for PostgreSQL to Kafka

In this article, we discuss the necessity of segregating the data models for reads and writes, and of using event sourcing to capture detailed data changes. These two aspects are critical for data analysis in the big data world. We compare several candidate solutions and conclude that the CDC strategy is a perfect match for the CQRS pattern.

Context and Problem

To support business decision-making, we demand fresh and accurate data that’s available where and when we need it, often in real-time.

But,

when business analysts run analyses directly against them, the production databases become overloaded;

some process details (the transaction stream) that are valuable for analysis may have been overwritten;

OLTP data models may not be friendly to analytical purposes.

We hope to come up with an efficient solution that captures the detailed transaction stream and ingests the data into Hadoop for analysis.


State vs. Stream

CQRS and Event Sourcing Pattern

CQRS-based systems use separate read and write data models, each tailored to relevant tasks and often located in physically separate stores.

Event-sourcing: Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data.
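The append-only idea can be sketched in a few lines of Python (a toy account model, not tied to any particular store):

```python
# Minimal sketch of event sourcing: instead of overwriting a balance
# column, every action is appended to an immutable event log, and the
# current state is derived by replaying that log.
events = []  # the append-only store


def record(event_type, amount):
    events.append({"type": event_type, "amount": amount})


def current_balance():
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "deposit" else -e["amount"]
    return balance


record("deposit", 100)
record("withdraw", 30)
record("deposit", 50)

print(current_balance())  # 120
# The full transaction stream survives for analysis; nothing is overwritten.
print(len(events))        # 3
```

A state-only store would keep just the number 120; the event log also keeps the withdrawal that a downstream analyst might care about.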

(figure: CQRS)

Decouple: one team of developers can focus on the complex domain model that is part of the write model, and another team can focus on the read model and the user interfaces.

Ingest Solutions - dual writes

Dual Write

brings complexity into the business system

is less fault-tolerant when the backend message queue is blocked or under maintenance

suffers from race conditions and consistency problems
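The race condition above can be made concrete with a toy interleaving (in-memory lists stand in for the database and the queue; this is illustrative, not any real client API):

```python
# Minimal sketch of the dual-write race: each client writes to the
# database and then publishes to the queue. With an unlucky interleaving
# the two stores end up disagreeing about the order of writes.
db_log, queue_log = [], []


def db_write(client, value):
    db_log.append((client, value))


def queue_publish(client, value):
    queue_log.append((client, value))


# Client 1 writes to the DB first...
db_write("c1", "x=1")
# ...but client 2's DB write and queue publish sneak in before
# client 1 reaches the queue:
db_write("c2", "x=2")
queue_publish("c2", "x=2")
queue_publish("c1", "x=1")

# The database says the last write was x=2; the queue says x=1.
print(db_log[-1])     # ('c2', 'x=2')
print(queue_log[-1])  # ('c1', 'x=1')
```

Without a transaction spanning both systems, no amount of application-level care fully closes this window.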

Business log

raises data sensitivity concerns

brings complexity into the business system

(figure: Dual Write)

Ingest Solutions - database operations

Snapshot

data in the database is constantly changing, so the snapshot is already out-of-date by the time it’s loaded

even if you take a snapshot once a day, you still have one-day-old data in the downstream system

on a large database those snapshots and bulk loads can become very expensive

Data offload

brings operational complexity

is unable to meet low-latency requirements

can’t handle delete operations

Ingest Solutions - capture data change

process only “diff” of changes

write all your data to only one primary DB;

extract two things from that database:

a consistent snapshot and

a real-time stream of changes

Benefits:

decouples from the business system

gets a latency of less than a second

the stream preserves the ordering of writes, so there are fewer race conditions

the pull strategy is robust to data corruption (the log can be replayed)

supports as many different data consumers as required
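The snapshot-plus-stream model can be sketched as follows (the event shape here is illustrative, not Debezium's actual wire format):

```python
# Sketch of the CDC ingestion model: start from a consistent snapshot,
# then apply the ordered stream of row-level change events ("diffs").
snapshot = {1: {"name": "alice", "city": "NYC"}}

change_stream = [
    {"op": "u", "key": 1, "row": {"name": "alice", "city": "SF"}},  # update
    {"op": "c", "key": 2, "row": {"name": "bob", "city": "LA"}},    # create
    {"op": "d", "key": 1, "row": None},                             # delete
]


def apply_changes(state, stream):
    for ev in stream:
        if ev["op"] == "d":
            state.pop(ev["key"], None)   # deletes are captured too
        else:                            # create or update
            state[ev["key"]] = ev["row"]
    return state


replica = apply_changes(dict(snapshot), change_stream)
print(replica)  # {2: {'name': 'bob', 'city': 'LA'}}
```

Because the stream is ordered, any consumer that starts from the same snapshot and replays the same events converges on the same state.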

(figure: CDC)

Ingest Solutions - wrapup

Considering the data application alongside the business application, we will focus on the 'capture changes to data' component.


Open Source for Postgres to Kafka

**Sqoop**

can only take full snapshots of a database, and cannot capture an ongoing stream of changes. Also, the transactional consistency of its snapshots is not well supported. (Apache)

pg_kafka

is a Kafka producer client in a Postgres function, so we could potentially produce to Kafka from a trigger. (MIT license)

bottledwater-pg

is a change data capture (CDC) tool specifically from PostgreSQL into Kafka (Apache License 2.0, from Confluent Inc.)

debezium-pg

is a change data capture tool for a variety of databases (Apache License 2.0, from Red Hat)


Debezium for Postgres is comparatively better.

Debezium for Postgres Architecture

debezium/postgres-decoderbufs

manually build the output plugin

change the PG configuration to preload the plugin library, then restart the PG service

debezium/debezium

compile and package the dependent jar files

Kafka Connect

deploy a distributed Kafka Connect service

start a Debezium connector in Kafka Connect

HBase Connect

development work: implement an HBase connector for PG CDC events

start an HBase connector in Kafka Connect

Spark Streaming

development work: implement data processing functions atop Spark Streaming
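To make the "start a Debezium connector" step concrete, here is an illustrative registration payload. The connector class and `database.*` property names follow Debezium's documented PostgreSQL connector configuration; the connector name, host, and credentials are placeholders:

```python
import json

# Illustrative Debezium PostgreSQL connector registration payload.
# Host, credentials, and the connector name are placeholder assumptions.
connector = {
    "name": "inventory-pg-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "decoderbufs",       # matches the preloaded output plugin
        "database.hostname": "pg-host",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
    },
}

# POSTing this JSON to the Kafka Connect REST API (e.g. with
# `curl -X POST http://connect-host:8083/connectors`) starts the connector.
print(json.dumps(connector, indent=2))
```

Kafka Connect distributes the connector's tasks across the workers of the cluster, so no extra deployment machinery is needed on the Debezium side.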


Considerations

Reliability

For example

be aware of data source exceptions or source relocation, and automatically/manually restart data capture tasks or redirect the data source;

monitor data quality and latency;

Scalability

be aware of data source load pressure, and automatically/manually scale out data capture tasks;

Maintainability

GUI for system monitoring, data quality check, latency statistics etc.;

GUI for configuring data capture task scale-out

Other CDC solutions

Databus (LinkedIn): no native support for PG

Wormhole (Facebook): not open source

**Sherpa (Yahoo!)**: not open source

BottledWater (Confluent): Postgres only

Maxwell: MySQL only

Debezium (Red Hat): good

Mongoriver: MongoDB only

GoldenGate (Oracle): for Oracle and MySQL; free but not open source

Canal & Otter (Alibaba): replication for the MySQL world
