Using Debezium, a CDC Framework

Introduction

Debezium is a set of distributed services that capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes in each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Debezium is built on top of Apache Kafka and provides Kafka Connect-compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, from where your applications consume them. This makes it possible for your applications to consume all of the events easily, correctly, and completely. Even if an application stops unexpectedly, it will not miss anything: when the application restarts, it resumes consuming the events exactly where it left off.

Kafka Connect

Kafka Connect is a tool for scalably and reliably streaming data between Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all of your application servers into Kafka topics, making the data available for low-latency stream processing. Export jobs can deliver data from Kafka topics into secondary storage and query systems, or into batch systems for offline analysis. Kafka Connect features include:

  • A common framework for Kafka connectors, providing a unified integration API
  • Support for both distributed mode and standalone mode
  • A REST interface for viewing and managing Kafka connectors
  • Automatic offset management, so developers do not have to worry about this error-prone part of connector development
  • Distributed and scalable
  • Streaming/batch integration

Kafka Connect has two core concepts: Source and Sink. A Source is responsible for importing data into Kafka, and a Sink is responsible for exporting data from Kafka; both are referred to as Connectors.

Several important Kafka Connect concepts are connectors, tasks, workers, and converters.

  • Connectors - the high-level abstraction that coordinates data streaming by managing tasks

  • Tasks - the implementation of how data is actually written into and read out of Kafka

  • Workers - the processes that run connectors and tasks

  • Converters - the components that translate data between Kafka Connect and the systems it sends data to or receives data from

    1. Connectors: In Kafka Connect, a connector decides where data should be copied from and where it should be written to. A connector instance is a logical job that is responsible for copying data between Kafka and another system, and a connector plugin is a JAR file that implements the interfaces defined by Kafka Connect in order to perform that specific job.

    2. Tasks: Tasks are the main actors in the Kafka Connect data model. Every connector coordinates a set of tasks to do its work; a connector can split a single job into many tasks and then distribute those tasks to the workers (in distributed mode). Tasks do not store their own state; instead, state is kept in dedicated Kafka topics (config.storage.topic and status.storage.topic). Distributed mode also has a concept called task rebalancing: when a connector is first submitted to the cluster, all of the workers perform a rebalance so that every worker runs roughly the same amount of work instead of all of the load being concentrated in a single worker process, and a rebalance is also performed when a worker process dies.

    3. Workers: Connectors and tasks are logical units of work and must be scheduled to run inside a process. In Kafka Connect these processes are called workers, and there are two kinds: standalone and distributed. Standalone workers are not covered here; see the official documentation for details. Personally I find the distributed worker great, because it provides scalability and automatic fault tolerance: you can start many worker processes with the same group.id, and the available workers automatically coordinate among themselves to run the connectors and tasks. If you add a new worker or a worker dies, the other workers detect this and redistribute the connectors and tasks. (A minimal distributed-worker configuration is sketched after this list.)

    4. Converters: A converter turns bytes into Kafka Connect's internal data format and turns data in Kafka Connect's internal format back into bytes. Converters are decoupled from connectors, so they can be reused by any connector. For example, with the Avro converter, the JDBC connector can write Avro-formatted data to Kafka, and the HDFS connector can in turn read that Avro-formatted data back out of Kafka.
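As a reference for the worker-level settings mentioned above (group.id, the storage topics, and the converters), here is a minimal sketch of a connect-distributed.properties file. The topic names and converter choice mirror the container example later in this article and are illustrative, not mandatory.

# connect-distributed.properties (minimal sketch, illustrative values)
bootstrap.servers=kafka:9092
group.id=1
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=my_connect_configs
offset.storage.topic=my_connect_offsets
status.storage.topic=my_connect_statuses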

Ways to start Kafka Connect

  • standalone

    bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]

  • distributed (cluster mode)

    bin/connect-distributed.sh config/connect-distributed.properties

  • containerized (see the startup example below)

Startup example

Start ZooKeeper

docker run -it --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper:1.7

Start Kafka

$ docker run -it --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka:1.7

Start a database to be monitored (MySQL)

$ docker run -it --rm --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:1.7

Start Kafka Connect

$ docker run -it --rm --name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my_connect_configs -e OFFSET_STORAGE_TOPIC=my_connect_offsets -e STATUS_STORAGE_TOPIC=my_connect_statuses --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql debezium/connect:1.7
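Once the Connect container is up, you can verify that the Kafka Connect REST API is reachable; the root endpoint returns the Connect worker's version information.

$ curl -H "Accept:application/json" localhost:8083/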

Register a connector

Register the connector using the REST API provided by Kafka Connect, with the parameters documented in the Debezium connectors section below.

  • Using curl:

    $ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "inventory-connector", "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1", "database.hostname": "mysql", "database.port": "3306", "database.user": "debezium", "database.password": "dbz", "database.server.id": "184054", "database.server.name": "dbserver1", "database.include.list": "inventory", "database.history.kafka.bootstrap.servers": "kafka:9092", "database.history.kafka.topic": "dbhistory.inventory" } }'
    

    Windows users may need to escape the double quotes. For example:

    $ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ \"name\": \"inventory-connector\", \"config\": { \"connector.class\": \"io.debezium.connector.mysql.MySqlConnector\", \"tasks.max\": \"1\", \"database.hostname\": \"mysql\", \"database.port\": \"3306\", \"database.user\": \"debezium\", \"database.password\": \"dbz\", \"database.server.id\": \"184054\", \"database.server.name\": \"dbserver1\", \"database.include.list\": \"inventory\", \"database.history.kafka.bootstrap.servers\": \"kafka:9092\", \"database.history.kafka.topic\": \"dbhistory.inventory\" } }'
    
  • Example

    {
        "name": "inventory-connector", 
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector", 
            "database.hostname": "192.168.99.100", 
            "database.port": "3306", 
            "database.user": "debezium-user", 
            "database.password": "debezium-user-pw", 
            "database.server.id": "184054", 
            "database.server.name": "fullfillment", 
            "database.include.list": "inventory", 
            "database.history.kafka.bootstrap.servers": "kafka:9092", 
            "database.history.kafka.topic": "dbhistory.fullfillment", 
            "include.schema.changes": "true" 
        }
    }
    

Verify the connector

curl -H "Accept:application/json" localhost:8083/connectors/

Debezium connectors

Debezium connector for MySQL

Introduction

MySQL has a binary log (binlog) that records all operations in the order in which they are committed to the database. This includes changes to table schemas as well as changes to the data in tables. MySQL uses the binlog for replication and recovery.

The Debezium MySQL connector reads the binlog, produces change events for row-level INSERT, UPDATE, and DELETE operations, and emits those change events to Kafka topics. Client applications read the Kafka topics.

Because MySQL is typically set up to purge binlogs after a specified period of time, the MySQL connector performs an initial consistent snapshot of each of your databases. The MySQL connector then reads the binlog from the point at which the snapshot was made.
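The debezium/example-mysql image used in the startup example already has the binlog enabled in row format. When pointing the connector at your own MySQL server, it is worth sanity-checking the relevant server variables yourself; the expected values below follow the Debezium MySQL documentation.

-- run against the MySQL server that Debezium will capture from
SHOW VARIABLES LIKE 'log_bin';          -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';    -- expected: ROW
SHOW VARIABLES LIKE 'binlog_row_image'; -- expected: FULL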

Official documentation: connector for MySQL

Registration parameters
Basic parameters
Property | Default | Description
nameNo defaultUnique name for the connector. Attempting to register again with the same name fails. This property is required by all Kafka Connect connectors.
connector.classNo defaultThe name of the Java class for the connector. Always specify io.debezium.connector.mysql.MySqlConnector for the MySQL connector.
tasks.max1The maximum number of tasks that should be created for this connector. The MySQL connector always uses a single task and therefore does not use this value, so the default is always acceptable.
database.hostnameNo defaultIP address or host name of the MySQL database server.
database.port3306Integer port number of the MySQL database server.
database.userNo defaultName of the MySQL user to use when connecting to the MySQL database server.
database.passwordNo defaultPassword to use when connecting to the MySQL database server.
database.server.nameNo defaultLogical name that identifies and provides a namespace for the particular MySQL database server/cluster in which Debezium is capturing changes. The logical name should be unique across all other connectors, since it is used as a prefix for all Kafka topic names that receive events emitted by this connector. Only alphanumeric characters, hyphens, dots and underscores must be used in the database server logical name.
database.server.idrandomA numeric ID of this database client, which must be unique across all currently-running database processes in the MySQL cluster. This connector joins the MySQL database cluster as another server (with this unique ID) so it can read the binlog. By default, a random number between 5400 and 6400 is generated, though the recommendation is to explicitly set a value.
database.include.listempty stringAn optional, comma-separated list of regular expressions that match the names of the databases for which to capture changes. The connector does not capture changes in any database whose name is not in database.include.list. By default, the connector captures changes in all databases. Do not also set the database.exclude.list connector confiuration property.
database.exclude.listempty stringAn optional, comma-separated list of regular expressions that match the names of databases for which you do not want to capture changes. The connector captures changes in any database whose name is not in the database.exclude.list. Do not also set the database.include.list connector configuration property.
table.include.listempty stringAn optional, comma-separated list of regular expressions that match fully-qualified table identifiers of tables whose changes you want to capture. The connector does not capture changes in any table not included in table.include.list. Each identifier is of the form databaseName.tableName. By default, the connector captures changes in every non-system table in each database whose changes are being captured. Do not also specify the table.exclude.list connector configuration property.
table.exclude.listempty stringAn optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you do not want to capture. The connector captures changes in any table not included in table.exclude.list. Each identifier is of the form databaseName.tableName. Do not also specify the table.include.list connector configuration property.
column.exclude.listempty stringAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns to exclude from change event record values. Fully-qualified names for columns are of the form databaseName.tableName.columnName.
column.include.listempty stringAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns to include in change event record values. Fully-qualified names for columns are of the form databaseName.tableName.columnName.
column.truncate.to._length_.charsn/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns whose values should be truncated in the change event record values if the field values are longer than the specified number of characters. You can configure multiple properties with different lengths in a single configuration. The length must be a positive integer. Fully-qualified names for columns are of the form databaseName.tableName.columnName.
column.mask.with._length_.charsn/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns whose values should be replaced in the change event message values with a field value consisting of the specified number of asterisk (*) characters. You can configure multiple properties with different lengths in a single configuration. Each length must be a positive integer or zero. Fully-qualified names for columns are of the form databaseName.tableName.columnName.
column.mask.hash.*hashAlgorithm*.with.salt.*salt*n/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form *<databaseName>*.*<tableName>*.*<columnName>*. In the resulting change event record, the values for the specified columns are replaced with pseudonyms. A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the MessageDigest section of the Java Cryptography Architecture Standard Algorithm Name Documentation. In the following example, CzQMA0cB5K is a randomly selected salt. column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerNameIf necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts. Depending on the hashAlgorithm used, the salt selected, and the actual data set, the resulting data set might not be completely masked.
column.propagate.source.typen/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change event records. These schema parameters:__Debezium.source.column.type``__Debezium.source.column.length``__Debezium.source.column.scaleare used to propagate the original type name and length for variable-width types, respectively. This is useful to properly size corresponding columns in sink databases. Fully-qualified names for columns are of one of these forms:databaseName.tableName.columnName**databaseName.schemaName.tableName.columnName
datatype.propagate.source.typen/aAn optional, comma-separated list of regular expressions that match the database-specific data type name of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change event records. These schema parameters:__debezium.source.column.type``__debezium.source.column.length``__debezium.source.column.scaleare used to propagate the original type name and length for variable-width types, respectively. This is useful to properly size corresponding columns in sink databases. Fully-qualified data type names are of one of these forms:databaseName.tableName.typeName**databaseName.schemaName.tableName.typeNameSee how MySQL connectors map data types for the list of MySQL-specific data type names.
time.precision.modeadaptive_time_microsecondsTime, date, and timestamps can be represented with different kinds of precision, including: adaptive_time_microseconds (the default) captures the date, datetime and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column’s type, with the exception of TIME type fields, which are always captured as microseconds. adaptive (deprecated) captures the time and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column’s type. connect always represents time and timestamp values using Kafka Connect’s built-in representations for Time, Date, and Timestamp, which use millisecond precision regardless of the database columns’ precision.
decimal.handling.modepreciseSpecifies how the connector should handle values for DECIMAL and NUMERIC columns: precise (the default) represents them precisely using java.math.BigDecimal values represented in change events in a binary form. double represents them using double values, which may result in a loss of precision but is easier to use. string encodes values as formatted strings, which is easy to consume but semantic information about the real type is lost.
bigint.unsigned.handling.modelongSpecifies how BIGINT UNSIGNED columns should be represented in change events. Possible settings are: long represents values by using Java’s long, which might not offer the precision but which is easy to use in consumers. long is usually the preferred setting. precise uses java.math.BigDecimal to represent values, which are encoded in the change events by using a binary representation and Kafka Connect’s org.apache.kafka.connect.data.Decimal type. Use this setting when working with values larger than 2^63, because these values cannot be conveyed by using long.
include.schema.changestrueBoolean value that specifies whether the connector should publish changes in the database schema to a Kafka topic with the same name as the database server ID. Each schema change is recorded by using a key that contains the database name and whose value includes the DDL statement(s). This is independent of how the connector internally records database history.
include.queryfalseBoolean value that specifies whether the connector should include the original SQL query that generated the change event. If you set this option to true then you must also configure MySQL with the binlog_rows_query_log_events option set to ON. When include.query is true, the query is not present for events that the snapshot process generates. Setting include.query to true might expose tables or fields that are explicitly excluded or masked by including the original SQL statement in the change event. For this reason, the default setting is false.
event.deserialization.failure.handling.modefailSpecifies how the connector should react to exceptions during deserialization of binlog events. fail propagates the exception, which indicates the problematic event and its binlog offset, and causes the connector to stop. warn logs the problematic event and its binlog offset and then skips the event. ignore passes over the problematic event and does not log anything.
inconsistent.schema.handling.modefailSpecifies how the connector should react to binlog events that relate to tables that are not present in internal schema representation. That is, the internal representation is not consistent with the database. fail throws an exception that indicates the problematic event and its binlog offset, and causes the connector to stop. warn logs the problematic event and its binlog offset and skips the event. skip passes over the problematic event and does not log anything.
max.queue.size8192Positive integer value that specifies the maximum size of the blocking queue into which change events read from the database log are placed before they are written to Kafka. This queue can provide backpressure to the binlog reader when, for example, writes to Kafka are slow or if Kafka is not available. Events that appear in the queue are not included in the offsets periodically recorded by this connector. Defaults to 8192, and should always be larger than the maximum batch size specified by the max.batch.size property.
max.batch.size2048Positive integer value that specifies the maximum size of each batch of events that should be processed during each iteration of this connector. Defaults to 2048.
max.queue.size.in.bytes0Long value for the maximum size in bytes of the blocking queue. The feature is disabled by default, it will be active if it’s set with a positive long value.
poll.interval.ms1000Positive integer value that specifies the number of milliseconds the connector should wait for new change events to appear before it starts processing a batch of events. Defaults to 1000 milliseconds, or 1 second.
connect.timeout.ms30000A positive integer value that specifies the maximum time in milliseconds this connector should wait after trying to connect to the MySQL database server before timing out. Defaults to 30 seconds.
gtid.source.includesNo defaultA comma-separated list of regular expressions that match source UUIDs in the GTID set used to find the binlog position in the MySQL server. Only the GTID ranges that have sources that match one of these include patterns are used. Do not also specify a setting for gtid.source.excludes.
gtid.source.excludesNo defaultA comma-separated list of regular expressions that match source UUIDs in the GTID set used to find the binlog position in the MySQL server. Only the GTID ranges that have sources that do not match any of these exclude patterns are used. Do not also specify a value for gtid.source.includes.
gtid.new.channel.position deprecated and scheduled for removalearliestWhen set to latest, when the connector sees a new GTID channel, it starts consuming from the last executed transaction in that GTID channel. If set to earliest (default), the connector starts reading that channel from the first available (not purged) GTID position. earliest is useful when you have an active-passive MySQL setup where Debezium is connected to the primary server. In this case, during failover, the replica with the new UUID (and GTID channel) starts receiving writes before Debezium is connected. These writes would be lost when using latest.
tombstones.on.deletetrueControls whether a delete event is followed by a tombstone event. true - a delete operation is represented by a delete event and a subsequent tombstone event. false - only a delete event is emitted. After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic.
message.key.columnsn/aA list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables.By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns. To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format: *<fully-qualified_tableName>*:_<keyColumn>_,*<keyColumn>* To base a table key on multiple column names, insert commas between the column names.Each fully-qualified table name is a regular expression in the following format: *<databaseName>*.*<tableName>* The property can include entries for multiple tables. Use a semicolon to separate table entries in the list. The following example sets the message key for the tables inventory.customers and purchase.orders: inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4 For the table inventory.customer, the columns pk1 and pk2 are specified as the message key. For the purchaseorders tables in any database, the columns pk3 and pk4 server as the message key.There is no limit to the number of columns that you use to create custom message keys. However, it’s best to use the minimum number that are required to specify a unique key.
binary.handling.modebytesSpecifies how binary columns, for example, blob, binary, varbinary, should be represented in change events. Possible settings: bytes represents binary data as a byte array. base64 represents binary data as a base64-encoded String. hex represents binary data as a hex-encoded (base16) String.
Advanced parameters
Property | Default | Description
connect.keep.alivetrueA Boolean value that specifies whether a separate thread should be used to ensure that the connection to the MySQL server/cluster is kept alive.
table.ignore.builtintrueA Boolean value that specifies whether built-in system tables should be ignored. This applies regardless of the table include and exclude lists. By default, system tables are excluded from having their changes captured, and no events are generated when changes are made to any system tables.
database.ssl.modedisabledSpecifies whether to use an encrypted connection. Possible settings are: disabled specifies the use of an unencrypted connection. preferred establishes an encrypted connection if the server supports secure connections. If the server does not support secure connections, falls back to an unencrypted connection. required establishes an encrypted connection or fails if one cannot be made for any reason. verify_ca behaves like required but additionally it verifies the server TLS certificate against the configured Certificate Authority (CA) certificates and fails if the server TLS certificate does not match any valid CA certificates. verify_identity behaves like verify_ca but additionally verifies that the server certificate matches the host of the remote connection.
binlog.buffer.size0The size of a look-ahead buffer used by the binlog reader. The default setting of 0 disables buffering. Under specific conditions, it is possible that the MySQL binlog contains uncommitted data finished by a ROLLBACK statement. Typical examples are using savepoints or mixing temporary and regular table changes in a single transaction. When a beginning of a transaction is detected then Debezium tries to roll forward the binlog position and find either COMMIT or ROLLBACK so it can determine whether to stream the changes from the transaction. The size of the binlog buffer defines the maximum number of changes in the transaction that Debezium can buffer while searching for transaction boundaries. If the size of the transaction is larger than the buffer then Debezium must rewind and re-read the events that have not fit into the buffer while streaming. NOTE: This feature is incubating. Feedback is encouraged. It is expected that this feature is not completely polished.
snapshot.modeinitialSpecifies the criteria for running a snapshot when the connector starts. Possible settings are: initial - the connector runs a snapshot only when no offsets have been recorded for the logical server name. initial_only - the connector runs a snapshot only when no offsets have been recorded for the logical server name and then stops; i.e. it will not read change events from the binlog. when_needed - the connector runs a snapshot upon startup whenever it deems it necessary. That is, when no offsets are available, or when a previously recorded offset specifies a binlog location or GTID that is not available in the server. never - the connector never uses snapshots. Upon first startup with a logical server name, the connector reads from the beginning of the binlog. Configure this behavior with care. It is valid only when the binlog is guaranteed to contain the entire history of the database. schema_only - the connector runs a snapshot of the schemas and not the data. This setting is useful when you do not need the topics to contain a consistent snapshot of the data but need them to have only the changes since the connector was started. schema_only_recovery - this is a recovery setting for a connector that has already been capturing changes. When you restart the connector, this setting enables recovery of a corrupted or lost database history topic. You might set it periodically to “clean up” a database history topic that has been growing unexpectedly. Database history topics require infinite retention.
snapshot.locking.modeminimalControls whether and how long the connector holds the global MySQL read lock, which prevents any updates to the database, while the connector is performing a snapshot. Possible settings are: minimal - the connector holds the global read lock for only the initial portion of the snapshot during which the connector reads the database schemas and other metadata. The remaining work in a snapshot involves selecting all rows from each table. The connector can do this in a consistent fashion by using a REPEATABLE READ transaction. This is the case even when the global read lock is no longer held and other MySQL clients are updating the database. minimal_percona - the connector holds the global backup lock for only the initial portion of the snapshot during which the connector reads the database schemas and other metadata. The remaining work in a snapshot involves selecting all rows from each table. The connector can do this in a consistent fashion by using a REPEATABLE READ transaction. This is the case even when the global backup lock is no longer held and other MySQL clients are updating the database. This mode does not flush tables to disk, is not blocked by long-running reads, and is available only in Percona Server. extended - blocks all writes for the duration of the snapshot. Use this setting if there are clients that are submitting operations that MySQL excludes from REPEATABLE READ semantics. none - prevents the connector from acquiring any table locks during the snapshot. While this setting is allowed with all snapshot modes, it is safe to use if and only if no schema changes are happening while the snapshot is running. For tables defined with MyISAM engine, the tables would still be locked despite this property being set as MyISAM acquires a table lock. This behavior is unlike InnoDB engine, which acquires row level locks.
snapshot.include.collection.listAll tables specified in table.include.listAn optional, comma-separated list of regular expressions that match names of schemas specified in table.include.list for which you want to take the snapshot.
snapshot.select.statement.overridesNo defaultSpecifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.The property contains a comma-separated list of fully-qualified table names in the form *<databaseName>.<tableName>*. For example, "snapshot.select.statement.overrides": "inventory.products,customers.orders" For each table in the list, add a further configuration property that specifies the SELECT statement for the connector to run on the table when it takes a snapshot. The specified SELECT statement determines the subset of table rows to include in the snapshot. Use the following format to specify the name of this SELECT statement property: snapshot.select.statement.overrides.*<databaseName>*.*<tableName>*. For example, snapshot.select.statement.overrides.customers.orders. Example:From a customers.orders table that includes the soft-delete column, delete_flag, add the following properties if you want a snapshot to include only those records that are not soft-deleted:"snapshot.select.statement.overrides": "customer.orders", "snapshot.select.statement.overrides.customer.orders": "SELECT * FROM [customers].[orders] WHERE delete_flag = 0 ORDER BY id DESC"In the resulting snapshot, the connector includes only the records for which delete_flag = 0.
min.row.count.to.stream.results1000During a snapshot, the connector queries each table for which the connector is configured to capture changes. The connector uses each query result to produce a read event that contains data for all rows in that table. This property determines whether the MySQL connector puts results for a table into memory, which is fast but requires large amounts of memory, or streams the results, which can be slower but work for very large tables. The setting of this property specifies the minimum number of rows a table must contain before the connector streams results. To skip all table size checks and always stream all results during a snapshot, set this property to 0.
heartbeat.interval.ms0Controls how frequently the connector sends heartbeat messages to a Kafka topic. The default behavior is that the connector does not send heartbeat messages. Heartbeat messages are useful for monitoring whether the connector is receiving change events from the database. Heartbeat messages might help decrease the number of change events that need to be re-sent when a connector restarts. To send heartbeat messages, set this property to a positive integer, which indicates the number of milliseconds between heartbeat messages.
heartbeat.topics.prefix__debezium-heartbeatControls the name of the topic to which the connector sends heartbeat messages. The topic name has this pattern: heartbeat.topics.prefix.server.name For example, if the database server name is fulfillment, the default topic name is __debezium-heartbeat.fulfillment.
database.initial.statementsNo defaultA semicolon separated list of SQL statements to be executed when a JDBC connection, not the connection that is reading the transaction log, to the database is established. To specify a semicolon as a character in a SQL statement and not as a delimiter, use two semicolons, (;;). The connector might establish JDBC connections at its own discretion, so this property is ony for configuring session parameters. It is not for executing DML statements.
snapshot.delay.msNo defaultAn interval in milliseconds that the connector should wait before performing a snapshot when the connector starts. If you are starting multiple connectors in a cluster, this property is useful for avoiding snapshot interruptions, which might cause re-balancing of connectors.
snapshot.fetch.sizeNo defaultDuring a snapshot, the connector reads table content in batches of rows. This property specifies the maximum number of rows in a batch.
snapshot.lock.timeout.ms10000Positive integer that specifies the maximum amount of time (in milliseconds) to wait to obtain table locks when performing a snapshot. If the connector cannot acquire table locks in this time interval, the snapshot fails. See how MySQL connectors perform database snapshots.
enable.time.adjustertrueBoolean value that indicates whether the connector converts a 2-digit year specification to 4 digits. Set to false when conversion is fully delegated to the database. MySQL allows users to insert year values with either 2-digits or 4-digits. For 2-digit values, the value gets mapped to a year in the range 1970 - 2069. The default behavior is that the connector does the conversion.
source.struct.versionv2Schema version for the source block in Debezium events. Debezium 0.10 introduced a few breaking changes to the structure of the source block in order to unify the exposed structure across all the connectors. By setting this option to v1, the structure used in earlier versions can be produced. However, this setting is not recommended and is planned for removal in a future Debezium version.
sanitize.field.namestrue if connector configuration sets the key.converter or value.converter property to the Avro converter. false if not.Indicates whether field names are sanitized to adhere to Avro naming requirements.
skipped.operationsNo defaultComma-separated list of operation types to skip during streaming. The following values are possible: c for inserts/create, u for updates, d for deletes. By default, no operations are skipped.
signal.data.collectionNo defaultFully-qualified name of the data collection that is used to send signals to the connector. The name format is database-name.table-name.
read.onlyfalseSwitch to alternative incremental snapshot watermarks implementation to avoid writes to signal data collection
provide.transaction.metadatafalseDetermines whether the connector generates events with transaction boundaries and enriches change event envelopes with transaction metadata. Specify true if you want the connector to do this. See Transaction metadata for details.
Registration example
{
    "name": "inventory-connector", 
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector", 
        "database.hostname": "192.168.99.100", 
        "database.port": "3306", 
        "database.user": "debezium-user", 
        "database.password": "debezium-user-pw", 
        "database.server.id": "184054", 
        "database.server.name": "fullfillment", 
        "database.include.list": "inventory", 
        "database.history.kafka.bootstrap.servers": "kafka:9092", 
        "database.history.kafka.topic": "dbhistory.fullfillment", 
        "include.schema.changes": "true" 
    }
}
Event schema

You can set converter.schemas.enable=false so that events are emitted without the embedded schema information.
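In practice this is set on the worker's (or connector's) key and value converters. A minimal sketch, assuming the default JsonConverter is in use:

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false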

INSERT
{
  "schema": { ... },
  "payload": {
    "op": "c",
    "ts_ms": 1465491411815,
    "before": null,
    "after": {
      "id": 1004,
      "first_name": "Anne",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "source": {
      "version": "1.7.1.Final",
      "connector": "mysql",
      "name": "mysql-server-1",
      "ts_ms": 0,
      "snapshot": false,
      "db": "inventory",
      "table": "customers",
      "server_id": 0,
      "gtid": null,
      "file": "mysql-bin.000003",
      "pos": 154,
      "row": 0,
      "thread": 7,
      "query": "INSERT INTO customers (first_name, last_name, email) VALUES ('Anne', 'Kretchmar', 'annek@noanswer.org')"
    }
  }
}
UPDATE
{
  "schema": { ... },
  "payload": {
    "before": { 
      "id": 1004,
      "first_name": "Anne",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "after": { 
      "id": 1004,
      "first_name": "Anne Marie",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "source": { 
      "version": "1.7.1.Final",
      "name": "mysql-server-1",
      "connector": "mysql",
      "name": "mysql-server-1",
      "ts_ms": 1465581029100,
      "snapshot": false,
      "db": "inventory",
      "table": "customers",
      "server_id": 223344,
      "gtid": null,
      "file": "mysql-bin.000003",
      "pos": 484,
      "row": 0,
      "thread": 7,
      "query": "UPDATE customers SET first_name='Anne Marie' WHERE id=1004"
    },
    "op": "u", 
    "ts_ms": 1465581029523 
  }
}
DELETE
{
  "schema": { ... },
  "payload": {
    "before": { 
      "id": 1004,
      "first_name": "Anne Marie",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "after": null, 
    "source": { 
      "version": "1.7.1.Final",
      "connector": "mysql",
      "name": "mysql-server-1",
      "ts_ms": 1465581902300,
      "snapshot": false,
      "db": "inventory",
      "table": "customers",
      "server_id": 223344,
      "gtid": null,
      "file": "mysql-bin.000003",
      "pos": 805,
      "row": 0,
      "thread": 7,
      "query": "DELETE FROM customers WHERE id=1004"
    },
    "op": "d", 
    "ts_ms": 1465581902461 
  }
}
SNAPSHOT
{
   "before":null,
   "after": {
      "pk":"1",
      "value":"New data"
   },
   "source": {
      "snapshot":"incremental"
   },
   "op":"r",
   "ts_ms":"1620393591654",
   "transaction":null
}
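To show how a client application might consume these events, here is a minimal Java sketch that reads the customers topic and dispatches on the op field. The topic name dbserver1.inventory.customers follows the database.server.name and table used in the examples above; using a plain Kafka consumer with Jackson is an illustrative choice, not the only way to consume Debezium events.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CustomerChangeConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "customer-change-reader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // topic name = <database.server.name>.<database>.<table>
            consumer.subscribe(Collections.singletonList("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        continue; // tombstone event emitted after a delete
                    }
                    JsonNode payload = mapper.readTree(record.value()).path("payload");
                    String op = payload.path("op").asText();
                    switch (op) {
                        case "c": // insert
                        case "r": // snapshot read
                            System.out.println("upsert: " + payload.path("after"));
                            break;
                        case "u": // update
                            System.out.println("update: " + payload.path("after"));
                            break;
                        case "d": // delete
                            System.out.println("delete: " + payload.path("before"));
                            break;
                        default:
                            System.out.println("other op " + op + ": " + payload);
                    }
                }
            }
        }
    }
}

If schemas.enable has been set to false as described above, the record value has no payload wrapper and the fields shown in the JSON examples appear at the top level of the value instead.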

Debezium connector for MongoDB

Introduction

MongoDB's replication mechanism provides redundancy and high availability and is the preferred way to run MongoDB in production. The MongoDB connector captures changes in a replica set or a sharded cluster. A MongoDB replica set consists of a set of servers that all hold copies of the same data, and replication ensures that all changes that clients make to documents on the replica set's primary server are correctly applied to the other servers in the replica set, called secondaries. MongoDB replication works by having the primary record the changes in its oplog (operations log), and then each of the secondaries reads the primary's oplog and applies all of the operations, in order, to its own copy of the documents. When a new server is added to a replica set, that server first performs a snapshot of all of the databases and collections on the primary, and then reads the primary's oplog to apply all of the changes that might have been made since it began the snapshot. This new server becomes a secondary (and able to handle queries) when it catches up to the tail of the primary's oplog.

The MongoDB connector uses this same replication mechanism, although it does not actually become a member of the replica set. Just like MongoDB secondaries, however, the connector always reads the oplog of the replica set's primary. And when the connector sees a replica set for the first time, it looks at the oplog to get the last recorded transaction and then performs a snapshot of the primary's databases and collections. When all of the data has been copied, the connector starts streaming changes from the position it read earlier from the oplog. Operations in the MongoDB oplog are idempotent, so no matter how many times they are applied, they result in the same end state.

As the MongoDB connector processes changes, it periodically records the position in the oplog where the event occurred. When the connector stops, it records the last oplog position that it processed, so that upon restart it simply begins streaming from that position. In other words, the connector can be stopped, upgraded, or maintained, and restarted some time later, and it will pick up exactly where it left off without losing a single event. Of course, MongoDB's oplogs are usually capped at a maximum size, which means the connector should not be stopped for too long, or else some of the operations in the oplog might be purged before the connector has a chance to read them. In this case, upon restart the connector will detect the missing oplog operations, perform a snapshot, and then proceed to stream changes.

The MongoDB connector is also quite tolerant of changes in replica set membership and leadership, additions or removals of shards within a sharded cluster, and network problems that might cause communication failures. The connector always uses the replica set's primary node to stream changes, so when the replica set undergoes an election and a different node becomes primary, the connector immediately stops streaming changes, connects to the new primary, and starts streaming changes using the new primary node. Likewise, if the connector experiences any problems communicating with the replica set primary, it tries to reconnect (using exponential backoff so as not to overwhelm the network or the replica set) and continues streaming changes from where it last left off.

How the MongoDB connector works

An overview of the MongoDB topologies that the connector supports is useful for planning your application.

When you configure and deploy a MongoDB connector, it starts by connecting to the MongoDB servers at the seed addresses and determining the details about each of the available replica sets. Since each replica set has its own independent oplog, the connector tries to use a separate task for each replica set. The connector can limit the maximum number of tasks it uses, and if not enough tasks are available, the connector assigns multiple replica sets to each task, although the task still uses a separate thread for each replica set. When running the connector against a sharded cluster, use a value of tasks.max that is greater than the number of replica sets. This allows the connector to create one task for each replica set, and lets Kafka Connect coordinate, distribute, and manage the tasks across all of the available worker processes. (A configuration fragment illustrating this is sketched below.)
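For example, for a sharded cluster whose shards are backed by three replica sets, the registration configuration could raise tasks.max above that number. This fragment is only a sketch; the host and logical name reuse the illustrative values from the registration example below.

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "mongodb.hosts": "rs0/192.168.99.100:27017",
    "mongodb.name": "fullfillment",
    "tasks.max": "4"
  }
}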

Registration parameters
Basic parameters
Property | Default | Description
nameUnique name for the connector. Attempting to register again with the same name will fail. (This property is required by all Kafka Connect connectors.)
connector.classThe name of the Java class for the connector. Always use a value of io.debezium.connector.mongodb.MongoDbConnector for the MongoDB connector.
mongodb.hostsThe comma-separated list of hostname and port pairs (in the form ‘host’ or ‘host:port’) of the MongoDB servers in the replica set. The list can contain a single hostname and port pair. If mongodb.members.auto.discover is set to false, then the host and port pair should be prefixed with the replica set name (e.g., rs0/localhost:27017).
mongodb.nameA unique name that identifies the connector and/or MongoDB replica set or sharded cluster that this connector monitors. Each server should be monitored by at most one Debezium connector, since this server name prefixes all persisted Kafka topics emanating from the MongoDB replica set or cluster. Only alphanumeric characters, hyphens, dots and underscores must be used.
mongodb.userName of the database user to be used when connecting to MongoDB. This is required only when MongoDB is configured to use authentication.
mongodb.passwordPassword to be used when connecting to MongoDB. This is required only when MongoDB is configured to use authentication.
mongodb.authsourceadminDatabase (authentication source) containing MongoDB credentials. This is required only when MongoDB is configured to use authentication with another authentication database than admin.
mongodb.ssl.enabledfalseConnector will use SSL to connect to MongoDB instances.
mongodb.ssl.invalid.hostname.allowedfalseWhen SSL is enabled this setting controls whether strict hostname checking is disabled during connection phase. If true the connection will not prevent man-in-the-middle attacks.
database.include.listempty stringAn optional comma-separated list of regular expressions that match database names to be monitored; any database name not included in database.include.list is excluded from monitoring. By default all databases are monitored. Must not be used with database.exclude.list.
database.exclude.listempty stringAn optional comma-separated list of regular expressions that match database names to be excluded from monitoring; any database name not included in database.exclude.list is monitored. Must not be used with database.include.list.
collection.include.listempty stringAn optional comma-separated list of regular expressions that match fully-qualified namespaces for MongoDB collections to be monitored; any collection not included in collection.include.list is excluded from monitoring. Each identifier is of the form databaseName.collectionName. By default the connector will monitor all collections except those in the local and admin databases. Must not be used with collection.exclude.list.
collection.exclude.listempty stringAn optional comma-separated list of regular expressions that match fully-qualified namespaces for MongoDB collections to be excluded from monitoring; any collection not included in collection.exclude.list is monitored. Each identifier is of the form databaseName.collectionName. Must not be used with collection.include.list.
snapshot.modeinitialSpecifies the criteria for running a snapshot upon startup of the connector. The default is initial, and specifies the connector reads a snapshot when either no offset is found or if the oplog no longer contains the previous offset. The never option specifies that the connector should never use snapshots, instead the connector should proceed to tail the log.
snapshot.include.collection.listAll collections specified in collection.include.listAn optional, comma-separated list of regular expressions that match names of schemas specified in collection.include.list for which you want to take the snapshot.
field.exclude.listempty stringAn optional comma-separated list of the fully-qualified names of fields that should be excluded from change event message values. Fully-qualified names for fields are of the form databaseName.collectionName.fieldName.nestedFieldName, where databaseName and collectionName may contain the wildcard (*) which matches any characters.
field.renamesempty stringAn optional comma-separated list of the fully-qualified replacements of fields that should be used to rename fields in change event message values. Fully-qualified replacements for fields are of the form databaseName.collectionName.fieldName.nestedFieldName:newNestedFieldName, where databaseName and collectionName may contain the wildcard (*) which matches any characters, the colon character (😃 is used to determine rename mapping of field. The next field replacement is applied to the result of the previous field replacement in the list, so keep this in mind when renaming multiple fields that are in the same path.
tasks.max1The maximum number of tasks that should be created for this connector. The MongoDB connector will attempt to use a separate task for each replica set, so the default is acceptable when using the connector with a single MongoDB replica set. When using the connector with a MongoDB sharded cluster, we recommend specifying a value that is equal to or more than the number of shards in the cluster, so that the work for each replica set can be distributed by Kafka Connect.
snapshot.max.threads1Positive integer value that specifies the maximum number of threads used to perform an intial sync of the collections in a replica set. Defaults to 1.
tombstones.on.deletetrueControls whether a delete event is followed by a tombstone event. true - a delete operation is represented by a delete event and a subsequent tombstone event. false - only a delete event is emitted. After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic.
snapshot.delay.msAn interval in milliseconds that the connector should wait before taking a snapshot after starting up; Can be used to avoid snapshot interruptions when starting multiple connectors in a cluster, which may cause re-balancing of connectors.
snapshot.fetch.size0Specifies the maximum number of documents that should be read in one go from each collection while taking a snapshot. The connector will read the collection contents in multiple batches of this size. Defaults to 0, which indicates that the server chooses an appropriate fetch size.
Advanced parameters
Property | Default | Description
max.queue.size8192Positive integer value that specifies the maximum size of the blocking queue into which change events read from the database log are placed before they are written to Kafka. This queue can provide backpressure to the oplog reader when, for example, writes to Kafka are slower or if Kafka is not available. Events that appear in the queue are not included in the offsets periodically recorded by this connector. Defaults to 8192, and should always be larger than the maximum batch size specified in the max.batch.size property.
max.batch.size2048Positive integer value that specifies the maximum size of each batch of events that should be processed during each iteration of this connector. Defaults to 2048.
max.queue.size.in.bytes0Long value for the maximum size in bytes of the blocking queue. The feature is disabled by default, it will be active if it’s set with a positive long value.
poll.interval.ms1000Positive integer value that specifies the number of milliseconds the connector should wait during each iteration for new change events to appear. Defaults to 1000 milliseconds, or 1 second.
connect.backoff.initial.delay.ms1000Positive integer value that specifies the initial delay when trying to reconnect to a primary after the first failed connection attempt or when no primary is available. Defaults to 1 second (1000 ms).
connect.backoff.max.delay.ms1000Positive integer value that specifies the maximum delay when trying to reconnect to a primary after repeated failed connection attempts or when no primary is available. Defaults to 120 seconds (120,000 ms).
connect.max.attempts16Positive integer value that specifies the maximum number of failed connection attempts to a replica set primary before an exception occurs and task is aborted. Defaults to 16, which with the defaults for connect.backoff.initial.delay.ms and connect.backoff.max.delay.ms results in just over 20 minutes of attempts before failing.
mongodb.members.auto.discovertrueBoolean value that specifies whether the addresses in ‘mongodb.hosts’ are seeds that should be used to discover all members of the cluster or replica set (true), or whether the address(es) in mongodb.hosts should be used as is (false). The default is true and should be used in all cases except where MongoDB is fronted by a proxy.
source.struct.versionv2Schema version for the source block in CDC events. Debezium 0.10 introduced a few breaking changes to the structure of the source block in order to unify the exposed structure across all the connectors. By setting this option to v1 the structure used in earlier versions can be produced. Note that this setting is not recommended and is planned for removal in a future Debezium version.
heartbeat.interval.ms0Controls how frequently heartbeat messages are sent. This property contains an interval in milliseconds that defines how frequently the connector sends messages into a heartbeat topic. This can be used to monitor whether the connector is still receiving change events from the database. You also should leverage heartbeat messages in cases where only records in non-captured collections are changed for a longer period of time. In such situation the connector would proceed to read the oplog from the database but never emit any change messages into Kafka, which in turn means that no offset updates are committed to Kafka. This will cause the oplog files to be rotated out but connector will not notice it so on restart some events are no longer available which leads to the need of re-execution of the initial snapshot.Set this parameter to 0 to not send heartbeat messages at all. Disabled by default.
heartbeat.topics.prefix__debezium-heartbeatControls the naming of the topic to which heartbeat messages are sent. The topic is named according to the pattern <heartbeat.topics.prefix>.<server.name>.
sanitize.field.namestrue when connector configuration explicitly specifies the key.converter or value.converter parameters to use Avro, otherwise defaults to false.Whether field names are sanitized to adhere to Avro naming requirements. See Avro naming for more details.
skipped.operationscomma-separated list of operation types that will be skipped during streaming. The operations include: c for inserts/create, u for updates, and d for deletes. By default, no operations are skipped.
snapshot.collection.filter.overridesControls which collection items are included in snapshot. This property affects snapshots only. Specify a comma-separated list of collection names in the form databaseName.collectionName.For each collection that you specify, also specify another configuration property: snapshot.collection.filter.overrides.*databaseName*.*collectionName*. For example, the name of the other configuration property might be: snapshot.collection.filter.overrides.customers.orders. Set this property to a valid filter expression that retrieves only the items that you want in the snapshot. When the connector performs a snapshot, it retrieves only the items that matches the filter expression.
provide.transaction.metadatafalseWhen set to true Debezium generates events with transaction boundaries and enriches data events envelope with transaction metadata.See Transaction Metadata for additional details.
retriable.restart.connector.wait.ms10000 (10 seconds)The number of milliseconds to wait before restarting a connector after a retriable error occurs.
mongodb.poll.interval.ms30000The interval in which the connector polls for new, removed, or changed replica sets.
mongodb.connect.timeout.ms10000 (10 seconds)The number of milliseconds the driver will wait before a new connection attempt is aborted.
mongodb.socket.timeout.ms0The number of milliseconds before a send/receive on the socket can take before a timeout occurs. A value of 0 disables this behavior.
mongodb.server.selection.timeout.ms30000 (30 seconds)The number of milliseconds the driver will wait to select a server before it times out and throws an error.
cursor.max.await.time.ms0Specifies the maximum number of milliseconds the oplog cursor will wait for the server to produce a result before causing an execution timeout exception. A value of 0 indicates using the server/driver default wait timeout.
Registration example
{
  "name": "inventory-connector", 
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector", 
    "mongodb.hosts": "rs0/192.168.99.100:27017", 
    "mongodb.name": "fullfillment", 
    "collection.include.list": "inventory[.]*", 
  }
}
Event schema
INSERT
{
  "schema": { ... },
  "payload": {
    "after": "{\"_id\" : {\"$numberLong\" : \"1004\"},\"first_name\" : \"Anne\",\"last_name\" : \"Kretchmar\",\"email\" : \"annek@noanswer.org\"}",
    "patch": null,
    "source": {
      "version": "1.7.1.Final",
      "connector": "mongodb",
      "name": "fulfillment",
      "ts_ms": 1558965508000,
      "snapshot": false,
      "db": "inventory",
      "rs": "rs0",
      "collection": "customers",
      "ord": 31,
      "h": 1546547425148721999
    },
    "op": "c",
    "ts_ms": 1558965515240
  }
}
UPDATE
{
    "schema": { ... },
    "payload": {
      "op": "u", 
      "ts_ms": 1465491461815, 
      "patch": "{\"$set\":{\"first_name\":\"Anne Marie\"}}", 
      "filter": "{\"_id\" : {\"$numberLong\" : \"1004\"}}", 
      "source": { 
        "version": "1.7.1.Final",
        "connector": "mongodb",
        "name": "fulfillment",
        "ts_ms": 1558965508000,
        "snapshot": true,
        "db": "inventory",
        "rs": "rs0",
        "collection": "customers",
        "ord": 6,
        "h": 1546547425148721999
      }
    }
  }

DELETE
{
    "schema": { ... },
    "payload": {
      "op": "d", 
      "ts_ms": 1465495462115, 
      "filter": "{\"_id\" : {\"$numberLong\" : \"1004\"}}", 
      "source": { 
        "version": "1.7.1.Final",
        "connector": "mongodb",
        "name": "fulfillment",
        "ts_ms": 1558965508000,
        "snapshot": true,
        "db": "inventory",
        "rs": "rs0",
        "collection": "customers",
        "ord": 6,
        "h": 1546547425148721999
      }
    }
  }
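Note that in MongoDB change events the after, patch, and filter fields are JSON strings rather than nested documents, so consumers parse them themselves. A minimal sketch using Jackson, with the patch value taken from the UPDATE example above:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MongoPatchExample {
    public static void main(String[] args) throws Exception {
        // the "patch" field of the UPDATE event above arrives as a string
        String patchField = "{\"$set\":{\"first_name\":\"Anne Marie\"}}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode patch = mapper.readTree(patchField);
        System.out.println(patch.path("$set").path("first_name").asText()); // Anne Marie
    }
}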

Debezium connector for PostgreSQL

Introduction

Debezium's PostgreSQL connector captures row-level changes in the schemas of a PostgreSQL database. PostgreSQL versions 9.6, 10, 11, 12, and 13 are supported.

The first time it connects to a PostgreSQL server or cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content and that were committed to the PostgreSQL database. The connector generates data change event records and streams them to Kafka topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from those topics.

PostgreSQL's logical decoding feature was introduced in version 9.4. It is a mechanism that allows the extraction of changes committed to the transaction log and the processing of these changes in a user-friendly manner with the help of an output plug-in. The output plug-in enables clients to consume the changes.

The PostgreSQL connector contains two main parts that work together in order to read and process database changes:

  • A logical decoding output plug-in. You might need to install the output plug-in that you choose to use. You must configure a replication slot that uses your chosen output plug-in before running the PostgreSQL server (the relevant server settings are sketched after this list). The plug-in can be one of the following:
    • decoderbufs is based on Protobuf and maintained by the Debezium community.
    • wal2json is based on JSON and maintained by the wal2json community.
    • pgoutput is the standard logical decoding output plug-in in PostgreSQL 10+. It is maintained by the PostgreSQL community and used by PostgreSQL itself for logical replication. This plug-in is always present, so no additional libraries need to be installed. The Debezium connector interprets the raw replication event stream directly into change events.
  • Java code (the actual Kafka Connect connector) that reads the changes produced by your chosen logical decoding output plug-in. It uses PostgreSQL's streaming replication protocol, by means of the PostgreSQL JDBC driver.
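As referenced in the first item of the list above, logical decoding must be enabled on the server before a replication slot can be used. A minimal sketch of the relevant postgresql.conf settings follows; the numbers are illustrative, and with pgoutput on PostgreSQL 10+ no additional plug-in library needs to be installed.

# postgresql.conf (minimal sketch for logical decoding)
wal_level = logical           # required so the WAL carries logical change data
max_wal_senders = 4           # at least one WAL sender for the connector
max_replication_slots = 4     # at least one replication slot for the connector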

The connector produces a change event for every row-level insert, update, and delete operation that it captures, and sends change event records for each table to a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event that they receive from those topics.

PostgreSQL normally purges write-ahead log (WAL) segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the PostgreSQL connector first connects to a particular PostgreSQL database, it starts by performing a consistent snapshot of each of the database schemas. After the connector completes the snapshot, it continues streaming changes from the exact point at which the snapshot was made. In this way, the connector starts with a consistent view of all of the data and does not omit any changes that were made while the snapshot was being taken.

The connector is tolerant of failures. As the connector reads changes and produces events, it records the WAL position for each event. If the connector stops for any reason (including communication failures, network problems, or crashes), then upon restart the connector continues reading the WAL where it last left off. This also applies to snapshots: if the connector stops during a snapshot, it begins a new snapshot when it restarts.

How the connector works

To optimally configure and run a Debezium PostgreSQL connector, it is helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata.

Security

To stream changes from a PostgreSQL database with a Debezium connector, the connector must operate with specific privileges in the database. Although one way to grant the necessary privileges is to give the user superuser privileges, doing so could expose your PostgreSQL data to unauthorized access. Rather than granting excessive privileges to the Debezium user, it is best to create a dedicated Debezium replication user and grant it specific privileges, as sketched below.

For more information about configuring privileges for the Debezium PostgreSQL user, see Setting up permissions. For more information about PostgreSQL logical replication security, see the PostgreSQL documentation.
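A minimal sketch of creating such a dedicated user follows; the role name, password, database, and schema are placeholders, and the exact grants depend on which tables you capture and on whether the connector has to create the pgoutput publication itself.

CREATE ROLE debezium_user REPLICATION LOGIN PASSWORD 'dbz';
GRANT CONNECT ON DATABASE inventory TO debezium_user;
GRANT USAGE ON SCHEMA public TO debezium_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;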

Snapshots

Most PostgreSQL servers are configured not to retain the complete history of the database in the WAL segments. This means that the PostgreSQL connector would be unable to see the entire history of the database by reading only the WAL. Consequently, the first time the connector starts, it performs an initial consistent snapshot of the database. The default behavior for performing a snapshot consists of the following steps. You can change this behavior by setting the snapshot.mode connector configuration property to a value other than initial.

  1. Start a transaction with a SERIALIZABLE, READ ONLY, DEFERRABLE isolation level to ensure that subsequent reads in this transaction are made against a single consistent version of the data. Any changes to the data made by subsequent INSERT, UPDATE, and DELETE operations from other clients are not visible to this transaction.
  2. Read the current position in the server's transaction log.
  3. Scan the database tables and schemas, generate a READ event for each row, and write that event to the appropriate table-specific Kafka topic.
  4. Commit the transaction.
  5. Record the successful completion of the snapshot in the connector offsets.

If the connector fails, it begins a new snapshot when it is restarted. After the connector completes its initial snapshot, the PostgreSQL connector continues streaming from the position recorded in step 2, which ensures that the connector does not miss any updates. If the connector stops again for any reason, then upon restart it continues streaming changes from where it previously left off.

The snapshot.mode property supports the following settings:
Setting | Description
always | The connector always performs a snapshot when it starts. After the snapshot completes, the connector continues streaming changes in order. This mode is useful when it is known that some WAL segments have been deleted and are no longer available, or when a new primary has been promoted after a cluster failure. The always snapshot mode ensures that the connector does not miss any changes that were made after the new primary was promoted but before the connector was restarted against that new primary.
never | The connector never performs snapshots. When configured this way, its behavior at startup is as follows: if there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position; if no LSN has been stored, the connector starts streaming changes from the point in time when the PostgreSQL logical replication slot was created on the server. The never snapshot mode is useful only when you know that all data of interest is still reflected in the WAL.
initial_only | The connector performs a database snapshot and stops before streaming any change event records. If the connector had started but did not complete a snapshot before stopping, it restarts the snapshot process and stops once the snapshot completes.
exported | Deprecated; all modes are lockless.
custom | The custom snapshot mode lets you inject your own implementation of the io.debezium.connector.postgresql.spi.Snapshotter interface. Set the snapshot.custom.class configuration property to a class on the classpath of your Kafka Connect cluster, or include it in the JAR if you are using the EmbeddedEngine. For more details, see the custom snapshotter SPI documentation.
Registration parameters
Basic parameters
Property | Default | Description
nameNo defaultUnique name for the connector. Attempting to register again with the same name will fail. This property is required by all Kafka Connect connectors.
connector.classNo defaultThe name of the Java class for the connector. Always use a value of io.debezium.connector.postgresql.PostgresConnector for the PostgreSQL connector.
tasks.max1The maximum number of tasks that should be created for this connector. The PostgreSQL connector always uses a single task and therefore does not use this value, so the default is always acceptable.
plugin.namedecoderbufsThe name of the PostgreSQL logical decoding plug-in installed on the PostgreSQL server.Supported values are decoderbufs, wal2json, wal2json_rds, wal2json_streaming, wal2json_rds_streaming and pgoutput.If you are using a wal2json plug-in and transactions are very large, the JSON batch event that contains all transaction changes might not fit into the hard-coded memory buffer, which has a size of 1 GB. In such cases, switch to a streaming plug-in, by setting the plugin-name property to wal2json_streaming or wal2json_rds_streaming. With a streaming plug-in, PostgreSQL sends the connector a separate message for each change in a transaction.
slot.namedebeziumThe name of the PostgreSQL logical decoding slot that was created for streaming changes from a particular plug-in for a particular database/schema. The server uses this slot to stream events to the Debezium connector that you are configuring.Slot names must conform to PostgreSQL replication slot naming rules, which state: “Each replication slot has a name, which can contain lower-case letters, numbers, and the underscore character.”
slot.drop.on.stopfalseWhether or not to delete the logical replication slot when the connector stops in a graceful, expected way. The default behavior is that the replication slot remains configured for the connector when the connector stops. When the connector restarts, having the same replication slot enables the connector to start processing where it left off.Set to true in only testing or development environments. Dropping the slot allows the database to discard WAL segments. When the connector restarts it performs a new snapshot or it can continue from a persistent offset in the Kafka Connect offsets topic.
publication.namedbz_publicationThe name of the PostgreSQL publication created for streaming changes when using pgoutput.This publication is created at start-up if it does not already exist and it includes all tables. Debezium then applies its own include/exclude list filtering, if configured, to limit the publication to change events for the specific tables of interest. The connector user must have superuser permissions to create this publication, so it is usually preferable to create the publication before starting the connector for the first time.If the publication already exists, either for all tables or configured with a subset of tables, Debezium uses the publication as it is defined.
database.hostnameNo defaultIP address or hostname of the PostgreSQL database server.
database.port5432Integer port number of the PostgreSQL database server.
database.userNo defaultName of the PostgreSQL database user for connecting to the PostgreSQL database server.
database.passwordNo defaultPassword to use when connecting to the PostgreSQL database server.
database.dbnameNo defaultThe name of the PostgreSQL database from which to stream the changes.
database.server.nameNo defaultLogical name that identifies and provides a namespace for the particular PostgreSQL database server or cluster in which Debezium is capturing changes. Only alphanumeric characters, hyphens, dots and underscores must be used in the database server logical name. The logical name should be unique across all other connectors, since it is used as a topic name prefix for all Kafka topics that receive records from this connector.
schema.include.listNo defaultAn optional, comma-separated list of regular expressions that match names of schemas for which you want to capture changes. Any schema name not included in schema.include.list is excluded from having its changes captured. By default, all non-system schemas have their changes captured. Do not also set the schema.exclude.list property.
schema.exclude.listNo defaultAn optional, comma-separated list of regular expressions that match names of schemas for which you do not want to capture changes. Any schema whose name is not included in schema.exclude.list has its changes captured, with the exception of system schemas. Do not also set the schema.include.list property.
table.include.listNo defaultAn optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you want to capture. Any table not included in table.include.list does not have its changes captured. Each identifier is of the form schemaName.tableName. By default, the connector captures changes in every non-system table in each schema whose changes are being captured. Do not also set the table.exclude.list property.
table.exclude.listNo defaultAn optional, comma-separated list of regular expressions that match fully-qualified table identifiers for tables whose changes you do not want to capture. Any table not included in table.exclude.list has its changes captured. Each identifier is of the form schemaName.tableName. Do not also set the table.include.list property.
column.include.listNo defaultAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be included in change event record values. Fully-qualified names for columns are of the form schemaName.tableName.columnName. Do not also set the column.exclude.list property.
column.exclude.listNo defaultAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event record values. Fully-qualified names for columns are of the form schemaName.tableName.columnName. Do not also set the column.include.list property.
time.precision.modeadaptiveTime, date, and timestamps can be represented with different kinds of precision: adaptive captures the time and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column’s type. adaptive_time_microseconds captures the date, datetime and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column’s type. An exception is TIME type fields, which are always captured as microseconds. connect always represents time and timestamp values by using Kafka Connect’s built-in representations for Time, Date, and Timestamp, which use millisecond precision regardless of the database columns’ precision. See temporal values.
decimal.handling.modepreciseSpecifies how the connector should handle values for DECIMAL and NUMERIC columns: precise represents values by using java.math.BigDecimal to represent values in binary form in change events. double represents values by using double values, which might result in a loss of precision but which is easier to use. string encodes values as formatted strings, which are easy to consume but semantic information about the real type is lost. See Decimal types.
hstore.handling.modemapSpecifies how the connector should handle values for hstore columns: map represents values by using MAP. json represents values by using json string. This setting encodes values as formatted strings such as {"key" : "val"}. See PostgreSQL HSTORE type.
interval.handling.modenumericSpecifies how the connector should handle values for interval columns: numeric represents intervals using approximate number of microseconds. string represents intervals exactly by using the string pattern representation P<years>Y<months>M<days>DT<hours>H<minutes>M<seconds>S. For example: P1Y2M3DT4H5M6.78S. See PostgreSQL basic types.
database.sslmodedisableWhether to use an encrypted connection to the PostgreSQL server. Options include: disable uses an unencrypted connection. require uses a secure (encrypted) connection, and fails if one cannot be established. verify-ca behaves like require but also verifies the server TLS certificate against the configured Certificate Authority (CA) certificates, or fails if no valid matching CA certificates are found. verify-full behaves like verify-ca but also verifies that the server certificate matches the host to which the connector is trying to connect. See the PostgreSQL documentation for more information.
database.sslcertNo defaultThe path to the file that contains the SSL certificate for the client. See the PostgreSQL documentation for more information.
database.sslkeyNo defaultThe path to the file that contains the SSL private key of the client. See the PostgreSQL documentation for more information.
database.sslpasswordNo defaultThe password to access the client private key from the file specified by database.sslkey. See the PostgreSQL documentation for more information.
database.sslrootcertNo defaultThe path to the file that contains the root certificate(s) against which the server is validated. See the PostgreSQL documentation for more information.
database.tcpKeepAlivetrueEnable TCP keep-alive probe to verify that the database connection is still alive. See the PostgreSQL documentation for more information.
tombstones.on.deletetrueControls whether a delete event is followed by a tombstone event. true - a delete operation is represented by a delete event and a subsequent tombstone event. false - only a delete event is emitted. After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic.
column.truncate.to._length_.charsn/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form schemaName.tableName.columnName. In change event records, values in these columns are truncated if they are longer than the number of characters specified by length in the property name. You can specify multiple properties with different lengths in a single configuration. Length must be a positive integer, for example, column.truncate.to.20.chars.
column.mask.with._length_.charsn/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form schemaName.tableName.columnName. In change event values, the values in the specified table columns are replaced with length number of asterisk (*) characters. You can specify multiple properties with different lengths in a single configuration. Length must be a positive integer or zero. When you specify zero, the connector replaces a value with an empty string.
column.mask.hash.*hashAlgorithm*.with.salt.*salt*n/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form ... In the resulting change event record, the values for the specified columns are replaced with pseudonyms. A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the MessageDigest section of the Java Cryptography Architecture Standard Algorithm Name Documentation. In the following example, CzQMA0cB5K is a randomly selected salt. column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerNameIf necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts. Depending on the hashAlgorithm used, the salt selected, and the actual data set, the resulting data set might not be completely masked.
column.propagate.source.typen/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of columns. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName. For each specified column, the connector adds the column’s original type and original length as parameters to the corresponding field schemas in the emitted change records. The following added schema parameters propagate the original type name and also the original length for variable-width types: __debezium.source.column.type + __debezium.source.column.length + __debezium.source.column.scale This property is useful for properly sizing corresponding columns in sink databases.
datatype.propagate.source.typen/aAn optional, comma-separated list of regular expressions that match the database-specific data type name for some columns. Fully-qualified data type names are of the form databaseName.tableName.typeName, or databaseName.schemaName.tableName.typeName. For these data types, the connector adds parameters to the corresponding field schemas in emitted change records. The added parameters specify the original type and length of the column: __debezium.source.column.type + __debezium.source.column.length + __debezium.source.column.scale These parameters propagate a column’s original type name and length, for variable-width types, respectively. This property is useful for properly sizing corresponding columns in sink databases. See the list of PostgreSQL-specific data type names.
message.key.columnsempty stringA list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables.By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns. To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format: *<fully-qualified_tableName>*:_<keyColumn>_,*<keyColumn>* To base a table key on multiple column names, insert commas between the column names.Each fully-qualified table name is a regular expression in the following format: *<schemaName>*.*<tableName>* The property can include entries for multiple tables. Use a semicolon to separate table entries in the list. The following example sets the message key for the tables inventory.customers and purchase.orders: inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4 For the table inventory.customer, the columns pk1 and pk2 are specified as the message key. For the purchaseorders tables in any schema, the columns pk3 and pk4 server as the message key.There is no limit to the number of columns that you use to create custom message keys. However, it’s best to use the minimum number that are required to specify a unique key.
publication.autocreate.modeall_tablesApplies only when streaming changes by using the pgoutput plug-in. The setting determines how creation of a publication should work. Possible settings are: all_tables - If a publication exists, the connector uses it. If a publication does not exist, the connector creates a publication for all tables in the database for which the connector is capturing changes. This requires that the database user that has permission to perform replications also has permission to create a publication. This is granted with CREATE PUBLICATION <publication_name> FOR ALL TABLES;. disabled - The connector does not attempt to create a publication. A database administrator or the user configured to perform replications must have created the publication before running the connector. If the connector cannot find the publication, the connector throws an exception and stops. filtered - If a publication exists, the connector uses it. If no publication exists, the connector creates a new publication for tables that match the current filter configuration as specified by the database.exclude.list, schema.include.list, schema.exclude.list, and table.include.list connector configuration properties. For example: CREATE PUBLICATION <publication_name> FOR TABLE <tbl1, tbl2, tbl3>.
binary.handling.modebytesSpecifies how binary (bytea) columns should be represented in change events: bytes represents binary data as byte array. base64 represents binary data as base64-encoded strings. hex represents binary data as hex-encoded (base16) strings.
truncate.handling.modeskipSpecifies whether TRUNCATE events should be propagated (only available when using the pgoutput plug-in with Postgres 11 or later): skip causes those events to be omitted (the default). include causes those events to be included. For information about the structure of truncate events and about their ordering semantics, see truncate events.
高级参数
PropertyDefaultDescription
snapshot.modeinitialSpecifies the criteria for performing a snapshot when the connector starts: initial - The connector performs a snapshot only when no offsets have been recorded for the logical server name. always - The connector performs a snapshot each time the connector starts. never - The connector never performs snapshots. When a connector is configured this way, its behavior when it starts is as follows. If there is a previously stored LSN in the Kafka offsets topic, the connector continues streaming changes from that position. If no LSN has been stored, the connector starts streaming changes from the point in time when the PostgreSQL logical replication slot was created on the server. The never snapshot mode is useful only when you know all data of interest is still reflected in the WAL. initial_only - The connector performs an initial snapshot and then stops, without processing any subsequent changes. exported - deprecated custom - The connector performs a snapshot according to the setting for the snapshot.custom.class property, which is a custom implementation of the io.debezium.connector.postgresql.spi.Snapshotter interface. Thereference table for snapshot mode settings has more details.
snapshot.custom.classNo defaultA full Java class name that is an implementation of the io.debezium.connector.postgresql.spi.Snapshotter interface. Required when the snapshot.mode property is set to custom. See custom snapshotter SPI.
snapshot.include.collection.listAll tables specified in table.include.listAn optional, comma-separated list of regular expressions that match names of schemas specified in table.include.list for which you want to take the snapshot when the snapshot.mode is not never
snapshot.lock.timeout.ms10000Positive integer value that specifies the maximum amount of time (in milliseconds) to wait to obtain table locks when performing a snapshot. If the connector cannot acquire table locks in this time interval, the snapshot fails. How the connector performs snapshots provides details.
snapshot.select.statement.overridesNo defaultSpecifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.The property contains a comma-separated list of fully-qualified table names in the form *<schemaName>.<tableName>*. For example, "snapshot.select.statement.overrides": "inventory.products,customers.orders" For each table in the list, add a further configuration property that specifies the SELECT statement for the connector to run on the table when it takes a snapshot. The specified SELECT statement determines the subset of table rows to include in the snapshot. Use the following format to specify the name of this SELECT statement property: snapshot.select.statement.overrides.*<schemaName>*.*<tableName>*. For example, snapshot.select.statement.overrides.customers.orders. Example:From a customers.orders table that includes the soft-delete column, delete_flag, add the following properties if you want a snapshot to include only those records that are not soft-deleted:"snapshot.select.statement.overrides": "customer.orders", "snapshot.select.statement.overrides.customer.orders": "SELECT * FROM [customers].[orders] WHERE delete_flag = 0 ORDER BY id DESC"In the resulting snapshot, the connector includes only the records for which delete_flag = 0.
event.processing.failure.handling.modefailSpecifies how the connector should react to exceptions during processing of events: fail propagates the exception, indicates the offset of the problematic event, and causes the connector to stop. warn logs the offset of the problematic event, skips that event, and continues processing. skip skips the problematic event and continues processing.
max.queue.size20240Positive integer value for the maximum size of the blocking queue. The connector places change events received from streaming replication in the blocking queue before writing them to Kafka. This queue can provide backpressure when, for example, writing records to Kafka is slower than it should be, or when Kafka is not available.
max.batch.size10240Positive integer value that specifies the maximum size of each batch of events that the connector processes.
max.queue.size.in.bytes0Long value for the maximum size in bytes of the blocking queue. The feature is disabled by default; it becomes active when the property is set to a positive long value.
poll.interval.ms1000Positive integer value that specifies the number of milliseconds the connector should wait for new change events to appear before it starts processing a batch of events. Defaults to 1000 milliseconds, or 1 second.
include.unknown.datatypesfalseSpecifies connector behavior when the connector encounters a field whose data type is unknown. The default behavior is that the connector omits the field from the change event and logs a warning. Set this property to true if you want the change event to contain an opaque binary representation of the field. This lets consumers decode the field. You can control the exact representation by setting the binary handling mode property.Consumers risk backward compatibility issues when include.unknown.datatypes is set to true. Not only may the database-specific binary representation change between releases, but if the data type is eventually supported by Debezium, the data type will be sent downstream in a logical type, which would require adjustments by consumers. In general, when encountering unsupported data types, create a feature request so that support can be added.
database.initial.statementsNo defaultA semicolon separated list of SQL statements that the connector executes when it establishes a JDBC connection to the database. To use a semicolon as a character and not as a delimiter, specify two consecutive semicolons, ;;. The connector may establish JDBC connections at its own discretion. Consequently, this property is useful for configuration of session parameters only, and not for executing DML statements. The connector does not execute these statements when it creates a connection for reading the transaction log.
status.update.interval.ms10000Frequency for sending replication connection status updates to the server, given in milliseconds. The property also controls how frequently the database status is checked to detect a dead connection in case the database was shut down.
heartbeat.interval.ms0Controls how frequently the connector sends heartbeat messages to a Kafka topic. The default behavior is that the connector does not send heartbeat messages. Heartbeat messages are useful for monitoring whether the connector is receiving change events from the database. Heartbeat messages might help decrease the number of change events that need to be re-sent when a connector restarts. To send heartbeat messages, set this property to a positive integer, which indicates the number of milliseconds between heartbeat messages. Heartbeat messages are needed when there are many updates in a database that is being tracked but only a tiny number of updates are related to the table(s) and schema(s) for which the connector is capturing changes. In this situation, the connector reads from the database transaction log as usual but rarely emits change records to Kafka. This means that no offset updates are committed to Kafka and the connector does not have an opportunity to send the latest retrieved LSN to the database. The database retains WAL files that contain events that have already been processed by the connector. Sending heartbeat messages enables the connector to send the latest retrieved LSN to the database, which allows the database to reclaim disk space being used by no longer needed WAL files.
heartbeat.topics.prefix__debezium-heartbeatControls the name of the topic to which the connector sends heartbeat messages. The topic name has this pattern: <heartbeat.topics.prefix>.<server.name> For example, if the database server name is fulfillment, the default topic name is __debezium-heartbeat.fulfillment.
heartbeat.action.queryNo defaultSpecifies a query that the connector executes on the source database when the connector sends a heartbeat message. This is useful for resolving the situation described in WAL disk space consumption, where capturing changes from a low-traffic database on the same host as a high-traffic database prevents Debezium from processing WAL records and thus acknowledging WAL positions with the database. To address this situation, create a heartbeat table in the low-traffic database, and set this property to a statement that inserts records into that table, for example: INSERT INTO test_heartbeat_table (text) VALUES ('test_heartbeat') This allows the connector to receive changes from the low-traffic database and acknowledge their LSNs, which prevents unbounded WAL growth on the database host.
schema.refresh.modecolumns_diffSpecify the conditions that trigger a refresh of the in-memory schema for a table. columns_diff is the safest mode. It ensures that the in-memory schema stays in sync with the database table’s schema at all times. columns_diff_exclude_unchanged_toast instructs the connector to refresh the in-memory schema cache if there is a discrepancy with the schema derived from the incoming message, unless unchanged TOASTable data fully accounts for the discrepancy. This setting can significantly improve connector performance if there are frequently-updated tables that have TOASTed data that are rarely part of updates. However, it is possible for the in-memory schema to become outdated if TOASTable columns are dropped from the table.
snapshot.delay.msNo defaultAn interval in milliseconds that the connector should wait before performing a snapshot when the connector starts. If you are starting multiple connectors in a cluster, this property is useful for avoiding snapshot interruptions, which might cause re-balancing of connectors.
snapshot.fetch.size10240During a snapshot, the connector reads table content in batches of rows. This property specifies the maximum number of rows in a batch.
slot.stream.paramsNo defaultSemicolon separated list of parameters to pass to the configured logical decoding plug-in. For example, add-tables=public.table,public.table2;include-lsn=true.If you are using the wal2json plug-in, this property is useful for enabling server-side table filtering. Allowed values depend on the configured plug-in.
sanitize.field.namestrue if connector configuration sets the key.converter or value.converter property to the Avro converter.false if not.Indicates whether field names are sanitized to adhere to Avro naming requirements.
slot.max.retries6If connecting to a replication slot fails, this is the maximum number of consecutive attempts to connect.
slot.retry.delay.ms10000 (10 seconds)The number of milliseconds to wait between retry attempts when the connector fails to connect to a replication slot.
toasted.value.placeholder__debezium_unavailable_valueSpecifies the constant that the connector provides to indicate that the original value is a toasted value that is not provided by the database. If the setting of toasted.value.placeholder starts with the hex: prefix it is expected that the rest of the string represents hexadecimally encoded octets. See toasted values for additional details. This option is deprecated, please use unavailable.value.placeholder instead.
unavailable.value.placeholder__debezium_unavailable_valueSpecifies the constant that the connector provides to indicate that the original value is a toasted value that is not provided by the database. If the setting of unavailable.value.placeholder starts with the hex: prefix it is expected that the rest of the string represents hexadecimally encoded octets. See toasted values for additional details.
provide.transaction.metadatafalseDetermines whether the connector generates events with transaction boundaries and enriches change event envelopes with transaction metadata. Specify true if you want the connector to do this. See Transaction metadata for details.
retriable.restart.connector.wait.ms10000 (10 seconds)The number of milliseconds to wait before restarting a connector after a retriable error occurs.
skipped.operationsNo defaultA comma-separated list of operation types that will be skipped during streaming. The operations include: c for inserts/create, u for updates, and d for deletes. By default, no operations are skipped.
signal.data.collectionNo defaultFully-qualified name of the data collection that is used to send signals to the connector. The name format is schema-name.table-name.
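结合上表中的 heartbeat.interval.ms 与 heartbeat.action.query,可以缓解低流量库因 WAL 位置无法及时确认而持续增长的问题。下面是一个仅作示意的配置片段,心跳表名沿用上表示例中的 test_heartbeat_table,连接参数为假设值:

# 先在被捕获的数据库中创建心跳表(表结构仅为示意)
psql -h 192.168.99.100 -U postgres -d postgres -c "CREATE TABLE IF NOT EXISTS test_heartbeat_table (text varchar(255))"
# 再在连接器配置中开启心跳并指定心跳语句
curl -i -X PUT -H "Content-Type:application/json" localhost:8083/connectors/fulfillment-connector/config -d @- <<'JSON'
{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "192.168.99.100",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "postgres",
  "database.dbname": "postgres",
  "database.server.name": "fulfillment",
  "table.include.list": "public.inventory",
  "heartbeat.interval.ms": "10000",
  "heartbeat.action.query": "INSERT INTO test_heartbeat_table (text) VALUES ('test_heartbeat')"
}
JSON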
注册示例
{
  "name": "fulfillment-connector",  
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector", 
    "database.hostname": "192.168.99.100", 
    "database.port": "5432", 
    "database.user": "postgres", 
    "database.password": "postgres", 
    "database.dbname" : "postgres", 
    "database.server.name": "fulfillment", 
    "table.include.list": "public.inventory" 

  }
}
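注册完成后,可以通过 Kafka Connect 的 REST 接口确认连接器是否正常运行(连接器名称沿用上面示例中的 fulfillment-connector):

# 查看已注册的连接器列表
curl -s localhost:8083/connectors
# 查看指定连接器及其 task 的运行状态
curl -s localhost:8083/connectors/fulfillment-connector/status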
事件Schema
INSERT
{
  "schema": { ... },
  "payload": {
    "before": null,
    "after": {
      "id": 1,
      "first_name": "Anne",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "source": {
      "version": "1.7.1.Final",
      "connector": "postgresql",
      "name": "PostgreSQL_server",
      "ts_ms": 1559033904863,
      "snapshot": true,
      "db": "postgres",
      "sequence": "[\"24023119\",\"24023128\"]",
      "schema": "public",
      "table": "customers",
      "txId": 555,
      "lsn": 24023128,
      "xmin": null
    },
    "op": "c",
    "ts_ms": 1559033904863
  }
}
UPDATE
{
  "schema": { ... },
  "payload": {
      "before": { 
          "id": 1
      },
      "after": { 
          "id": 1,
          "first_name": "Anne Marie",
          "last_name": "Kretchmar",
          "email": "annek@noanswer.org"
      },
      "source": { 
          "version": "1.7.1.Final",
          "connector": "postgresql",
          "name": "PostgreSQL_server",
          "ts_ms": 1559033904863,
          "snapshot": false,
          "db": "postgres",
          "schema": "public",
          "table": "customers",
          "txId": 556,
          "lsn": 24023128,
          "xmin": null
      },
      "op": "u", 
      "ts_ms": 1465584025523  
  }
}
DELETE
{
    "schema": { ... },
    "payload": {
        "before": { 
            "id": 1
        },
        "after": null, 
        "source": { 
            "version": "1.7.1.Final",
            "connector": "postgresql",
            "name": "PostgreSQL_server",
            "ts_ms": 1559033904863,
            "snapshot": false,
            "db": "postgres",
            "schema": "public",
            "table": "customers",
            "txId": 556,
            "lsn": 46523128,
            "xmin": null
        },
        "op": "d", 
        "ts_ms": 1465581902461 
    }
}
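上述 INSERT / UPDATE / DELETE 事件可以直接从对应的 Kafka 主题中消费查看。主题名的格式为 <database.server.name>.<schemaName>.<tableName>;以前文注册示例为例,对应主题为 fulfillment.public.inventory。下面借助 debezium/kafka 镜像自带的 watch-topic 脚本进行观察,仅作示意:

docker run -it --rm --name watcher --link zookeeper:zookeeper --link kafka:kafka debezium/kafka:1.7 watch-topic -a -k fulfillment.public.inventory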

Debezium Connector for Oracle

简介

Debezium 的 Oracle 连接器捕获并记录 Oracle 服务器上数据库中发生的行级更改,包括在连接器运行时添加的表。您可以将连接器配置为针对架构和表的特定子集发出更改事件,或者忽略、屏蔽或截断特定列中的值。

Debezium 通过使用原生的 LogMiner 数据库包或 XStream API 从 Oracle 摄取更改事件。虽然该连接器可能适用于多种 Oracle 版本(version)和发行版(edition),但目前只有 Oracle EE 12 和 19 经过测试。

Oracle 连接器的工作原理

要优化配置和运行 Debezium Oracle 连接器,了解连接器如何执行快照、流式传输更改事件、确定 Kafka 主题名称以及使用元数据会很有帮助。

快照

通常,Oracle 服务器上的重做日志配置为不保留数据库的完整历史记录。因此,Debezium Oracle 连接器无法从日志中检索数据库的整个历史记录。为了使连接器能够为数据库的当前状态建立基线,连接器第一次启动时会执行数据库的初始一致性快照。

您可以通过设置 snapshot.mode 连接器配置属性的值来自定义连接器创建快照的方式。默认情况下,连接器的快照模式设置为 initial。

用于创建初始快照的默认连接器工作流程

当快照模式设置为 initial(默认值)时,连接器完成以下任务来创建快照:

  1. 确定要捕获的表
  2. 对每个受监控的表获取 ROW SHARE MODE 锁,以防止在创建快照期间发生结构更改。Debezium 持有锁的时间很短。
  3. 从服务器的重做日志中读取当前系统更改号 (SCN) 位置。
  4. 捕获所有相关表的结构。
  5. 释放在步骤 2 中获得的锁。
  6. 以步骤 3 中读取的 SCN 位置作为有效时间点(即 SELECT * FROM … AS OF SCN 123),扫描所有相关的数据库表和模式,为每一行生成一个 READ 事件,然后将事件记录写入特定于表的 Kafka 主题。
  7. 在连接器偏移中记录快照的成功完成。

快照进程开始后,如果由于connector失败、rebalancing等原因导致进程中断,则在connector重启后进程会重新启动。连接器完成初始快照后,它会从它在步骤 3 中读取的位置继续流式传输,以便不会错过任何更新。如果连接器因任何原因再次停止,则在重新启动后,它会从之前停止的位置恢复流式更改。
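仅为示意:步骤 6 中基于 SCN 的一致性读取,大致相当于如下 Flashback 查询(连接串、表名与 SCN 均为假设值):

# 通过 sqlplus 按指定 SCN 读取表在该时间点的一致性视图
sqlplus debezium/dbz@//localhost:1521/ORCLPDB1 <<'SQL'
SELECT * FROM inventory.customers AS OF SCN 123;
SQL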

快照模式说明
  • initial:连接器按照用于创建初始快照的默认工作流执行数据库快照。快照完成后,连接器开始为后续数据库更改流式传输事件记录。
  • schema_only:连接器捕获所有相关表的结构,执行默认快照工作流中描述的所有步骤,但不会创建 READ 事件来表示连接器启动时的数据集(即跳过步骤 6)。
注册参数表
基础参数
PropertyDefaultDescription
nameNo defaultUnique name for the connector. Attempting to register again with the same name will fail. (This property is required by all Kafka Connect connectors.)
connector.classNo defaultThe name of the Java class for the connector. Always use a value of io.debezium.connector.oracle.OracleConnector for the Oracle connector.
tasks.max1The maximum number of tasks that should be created for this connector. The Oracle connector always uses a single task and therefore does not use this value, so the default is always acceptable.
database.hostnameNo defaultIP address or hostname of the Oracle database server.
database.portNo defaultInteger port number of the Oracle database server.
database.userNo defaultName of the Oracle user account that the connector uses to connect to the Oracle database server.
database.passwordNo defaultPassword to use when connecting to the Oracle database server.
database.dbnameNo defaultName of the database to connect to. Must be the CDB name when working with the CDB + PDB model.
database.urlNo defaultSpecifies the raw database JDBC URL. Use this property to provide flexibility in defining that database connection. Valid values include raw TNS names and RAC connection strings.
database.pdb.nameNo defaultName of the Oracle pluggable database to connect to. Use this property with container database (CDB) installations only.
database.server.nameNo defaultLogical name that identifies and provides a namespace for the Oracle database server from which the connector captures changes. The value that you set is used as a prefix for all Kafka topic names that the connector emits. Specify a logical name that is unique among all connectors in your Debezium environment. The following characters are valid: alphanumeric characters, hyphens, dots, and underscores.
database.connection.adapterlogminerThe adapter implementation that the connector uses when it streams database changes. You can set the following values:logminer(default)The connector uses the native Oracle LogMiner API.xstreamThe connector uses the Oracle XStreams API.
snapshot.modeinitialSpecifies the mode that the connector uses to take snapshots of a captured table. You can set the following values:initialThe snapshot includes the structure and data of captured tables. Specify this value to populate topics with a complete representation of the data from the captured tables.schema_onlyThe snapshot includes only the structure of captured tables. Specify this value if you want the connector to capture data only for changes that occur after the snapshot.After the snapshot is complete, the connector continues to read change events from the database’s redo logs.
snapshot.locking.modesharedControls whether and for how long the connector holds a table lock. Table locks prevent certain types of changes table operations from occurring while the connector performs a snapshot. You can set the following values:sharedEnables concurrent access to the table, but prevents any session from acquiring an exclusive table lock. The connector acquires a ROW SHARE level lock while it captures table schema.nonePrevents the connector from acquiring any table locks during the snapshot. Use this setting only if no schema changes might occur during the creation of the snapshot.
snapshot.include.collection.listAll tables specified in table.include.listAn optional, comma-separated list of regular expressions that match names of fully-qualified table names (*<schemaName>*.*<tableName>*) included in table.include.list for which you want to take the snapshot.
snapshot.select.statement.overridesNo defaultSpecifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.The property contains a comma-separated list of fully-qualified table names in the form *<schemaName>.<tableName>*. For example, "snapshot.select.statement.overrides": "inventory.products,customers.orders" For each table in the list, add a further configuration property that specifies the SELECT statement for the connector to run on the table when it takes a snapshot. The specified SELECT statement determines the subset of table rows to include in the snapshot. Use the following format to specify the name of this SELECT statement property: snapshot.select.statement.overrides.*<schemaName>*.*<tableName>* For example, snapshot.select.statement.overrides.customers.orders Example:From a customers.orders table that includes the soft-delete column, delete_flag, add the following properties if you want a snapshot to include only those records that are not soft-deleted:"snapshot.select.statement.overrides": "customer.orders", "snapshot.select.statement.overrides.customer.orders": "SELECT * FROM [customers].[orders] WHERE delete_flag = 0 ORDER BY id DESC"In the resulting snapshot, the connector includes only the records for which delete_flag = 0.
schema.include.listNo defaultAn optional, comma-separated list of regular expressions that match names of schemas for which you want to capture changes. Any schema name not included in schema.include.list is excluded from having its changes captured. By default, all non-system schemas have their changes captured. Do not also set the schema.exclude.list property. In environments that use the LogMiner implementation, you must use POSIX regular expressions only.
schema.exclude.listNo defaultAn optional, comma-separated list of regular expressions that match names of schemas for which you do not want to capture changes. Any schema whose name is not included in schema.exclude.list has its changes captured, with the exception of system schemas. Do not also set the schema.include.list property. In environments that use the LogMiner implementation, you must use POSIX regular expressions only.
table.include.listNo defaultAn optional comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored. Tables that are not included in the include list are excluded from monitoring. Each table identifier uses the following format: *<schema_name>.<table_name>* By default, the connector monitors every non-system table in each monitored database. Do not use this property in combination with table.exclude.list. If you use the LogMiner implementation, use only POSIX regular expressions with this property.
table.exclude.listNo defaultAn optional comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be excluded from monitoring. The connector captures change events from any table that is not specified in the exclude list. Specify the identifier for each table using the following format: *<schemaName>.<tableName>*.Do not use this property in combination with table.include.list. If you use the LogMiner implementation, use only POSIX regular expressions with this property.
column.include.listNo defaultAn optional comma-separated list of regular expressions that match the fully-qualified names of columns that you want to include in the change event message values. Fully-qualified names for columns use the following format: *<schema_name>.<table_name>.<column_name>*. The primary key column is always included in an event’s key, even if you do not use this property to explicitly include its value. If you include this property in the configuration, do not also set the column.exclude.list property.
column.exclude.listNo defaultAn optional comma-separated list of regular expressions that match the fully-qualified names of columns that you want to exclude from change event message values. Fully-qualified column names use the following format: *<schema_name>.<table_name>.<column_name>* The primary key column is always included in an event’s key, even if you use this property to explicitly exclude its value. If you include this property in the configuration, do not set the column.include.list property.
column.mask.hash.**.with.salt.**No defaultAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form *<schemaName>*.*<tableName>*.*<columnName>*. In the resulting change event record, the values for the specified columns are replaced with pseudonyms. A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the MessageDigest section of the Java Cryptography Architecture Standard Algorithm Name Documentation. In the following example, CzQMA0cB5K is a randomly selected salt. column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerNameIf necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts. Depending on the hashAlgorithm used, the salt selected, and the actual data set, the resulting data set might not be completely masked.
decimal.handling.modepreciseSpecifies how the connector should handle floating point values for NUMBER, DECIMAL and NUMERIC columns. You can set one of the following options:precise (default)Represents values precisely by using java.math.BigDecimal values represented in change events in a binary form.doubleRepresents values by using double values. Using double values is easier, but can result in a loss of precision.stringEncodes values as formatted strings. Using the string option is easier to consume, but results in a loss of semantic information about the real type. For more information, see Decimal types.
event.processing.failure.handling.modefailSpecifies how the connector should react to exceptions during processing of events. You can set one of the following options:failPropagates the exception (indicating the offset of the problematic event), causing the connector to stop.warnCauses the problematic event to be skipped. The offset of the problematic event is then logged.skipCauses the problematic event to be skipped.
max.queue.size8192A positive integer value that specifies the maximum size of the blocking queue. Change events read from the database log are placed in the blocking queue before they are written to Kafka. This queue can provide backpressure to the binlog reader when, for example, writes to Kafka are slow, or if Kafka is not available. Events that appear in the queue are not included in the offsets that the connector records periodically. Always specify a value that is larger than the maximum batch size that specified for the max.batch.size property.
max.batch.size2048A positive integer value that specifies the maximum size of each batch of events to process during each iteration of this connector.
max.queue.size.in.bytes0 (disabled)Long value for the maximum size in bytes of the blocking queue. To activate the feature, set the value to a positive long data type.
poll.interval.ms1000 (1 second)Positive integer value that specifies the number of milliseconds the connector should wait during each iteration for new change events to appear.
tombstones.on.deletetrueControls whether a delete event is followed by a tombstone event. The following values are possible:trueFor each delete operation, the connector emits a delete event and a subsequent tombstone event.falseFor each delete operation, the connector emits only a delete event.After a source record is deleted, a tombstone event (the default behavior) enables Kafka to completely delete all events that share the key of the deleted row in topics that have log compaction enabled.
message.key.columnsNo defaultA list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables.By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns. To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format: *<fullyQualifiedTableName>*:*<keyColumn>*,*<keyColumn>* To base a table key on multiple column names, insert commas between the column names. Each fully-qualified table name is a regular expression in the following format: *<schemaName>*.*<tableName>* The property can include entries for multiple tables. Use a semicolon to separate table entries in the list. The following example sets the message key for the tables inventory.customers and purchase.orders: inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4 For the table inventory.customer, the columns pk1 and pk2 are specified as the message key. For the purchaseorders tables in any schema, the columns pk3 and pk4 server as the message key. There is no limit to the number of columns that you use to create custom message keys. However, it’s best to use the minimum number that are required to specify a unique key.
column.truncate.to.*length*.charsNo defaultAn optional comma-separated list of regular expressions that match the fully-qualified names of character-based columns to be truncated in change event messages if their length exceeds the specified number of characters. Length is specified as a positive integer. A configuration can include multiple properties that specify different lengths. Specify the fully-qualified name for columns by using the following format: *<schemaName>*.*<tableName>*.*<columnName>*.
column.mask.with.*length*.charsNo defaultAn optional comma-separated list of regular expressions for masking column names in change event messages by replacing characters with asterisks (*). Specify the number of characters to replace in the name of the property, for example, column.mask.with.8.chars. Specify length as a positive integer or zero. Then add regular expressions to the list for each character-based column name where you want to apply a mask. Use the following format to specify fully-qualified column names: *<schemaName>*.*<tableName>*.*<columnName>*. The connector configuration can include multiple properties that specify different lengths.
column.propagate.source.typeNo defaultAn optional comma-separated list of regular expressions that match the fully-qualified names of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change messages. The schema parameters __debezium.source.column.type, __debezium.source.column.length, and __debezium.source.column.scale are used to propagate the original type name and length (for variable-width types), respectively. Useful to properly size corresponding columns in sink databases. Fully-qualified names for columns are of the form *<tableName>*.*<columnName>*, or *<schemaName>*.*<tableName>*.*<columnName>*.
datatype.propagate.source.typeNo defaultAn optional comma-separated list of regular expressions that match the database-specific data type name of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change messages. The schema parameters __debezium.source.column.type, __debezium.source.column.length and __debezium.source.column.scale are used to propagate the original type name and length (for variable-width types), respectively. Useful to properly size corresponding columns in sink databases. Fully-qualified data type names are of the form *<tableName>*.*<typeName>*, or *<schemaName>*.*<tableName>*.*<typeName>*. See the list of Oracle-specific data type names.
heartbeat.interval.ms0Specifies, in milliseconds, how frequently the connector sends messages to a heartbeat topic. Use this property to determine whether the connector continues to receive change events from the source database. It can also be useful to set the property in situations where no change events occur in captured tables for an extended period. In such a case, although the connector continues to read the redo log, it emits no change event messages, so the offset in the Kafka topic remains unchanged. Because the connector does not flush the latest system change number (SCN) that it read from the database, the database might retain the redo log files for longer than necessary. If the connector restarts, the extended retention period could result in the connector redundantly sending some change events. The default value of 0 prevents the connector from sending any heartbeat messages.
heartbeat.topics.prefix__debezium-heartbeatSpecifies the string that prefixes the name of the topic to which the connector sends heartbeat messages. The topic is named according to the pattern *<heartbeat.topics.prefix>.<serverName>*.
snapshot.delay.msNo defaultSpecifies an interval in milliseconds that the connector waits after it starts before it takes a snapshot. Use this property to prevent snapshot interruptions when you start multiple connectors in a cluster, which might cause re-balancing of connectors.
snapshot.fetch.size2000Specifies the maximum number of rows that should be read in one go from each table while taking a snapshot. The connector reads table contents in multiple batches of the specified size.
sanitize.field.namestrue when the connector configuration explicitly specifies the key.converter or value.converter parameters to use Avro, otherwise defaults to false.Specifies whether field names are normalized to comply with Avro naming requirements. For more information, see Avro naming.
provide.transaction.metadatafalseSet the property to true if you want Debezium to generate events with transaction boundaries and to enrich data event envelopes with transaction metadata. See Transaction Metadata for additional details.
log.mining.strategyredo_log_catalogSpecifies the mining strategy that controls how Oracle LogMiner builds and uses a given data dictionary for resolving table and column ids to names. redo_log_catalog:: Writes the data dictionary to the online redo logs causing more archive logs to be generated over time. This also enables tracking DDL changes against captured tables, so if the schema changes frequently this is the ideal choice. online_catalog:: Uses the database’s current data dictionary to resolve object ids and does not write any extra information to the online redo logs. This allows LogMiner to mine substantially faster but at the expense that DDL changes cannot be tracked. If the captured table(s) schema changes infrequently or never, this is the ideal choice.
log.mining.buffer.typememoryThe buffer type controls how the connector manages buffering transaction data. memory - Uses the JVM process’ heap to buffer all transaction data. Choose this option if you don’t expect the connector to process a high number of long-running or large transactions. When this option is active, the buffer state is not persisted across restarts. Following a restart, recreate the buffer from the SCN value of the current offset. infinispan - This option uses an embedded Infinispan cache to buffer transaction data and persist it to disk. Choose this option if you expect the connector to process long-running or large transactions. When this option is active, the buffer state is persisted across restarts. After a restart, there is no need to recreate the buffer. Since the buffer state is persisted across restarts, the buffer does not need to be recreated at restart. The infinispan option requires that you specify a cache file directory. Use the log.mining.buffer.location property to define the location for storing cache files.
log.mining.buffer.locationNo defaultSpecifies the location of the directory that the connector uses to read and write the buffer cache. Specify a directory that every node in the Kafka cluster can access and that is unique for the connector deployment.
log.mining.buffer.drop.on.stopfalseSpecifies whether the buffer state is deleted after the connector stops in a graceful, expected way. This setting only impacts buffer implementations that persist state across restarts, such as infinispan. The default behavior is that the buffer state is always retained between restarts. Set to true only in testing or development environments.
log.mining.batch.size.min1000The minimum SCN interval size that this connector attempts to read from redo/archive logs. Active batch size is also increased/decreased by this amount for tuning connector throughput when needed.
log.mining.batch.size.max100000The maximum SCN interval size that this connector uses when reading from redo/archive logs.
log.mining.batch.size.default20000The starting SCN interval size that the connector uses for reading data from redo/archive logs.
log.mining.sleep.time.min.ms0The minimum amount of time that the connector sleeps after reading data from redo/archive logs and before starting reading data again. Value is in milliseconds.
log.mining.sleep.time.max.ms3000The maximum amount of time that the connector sleeps after reading data from redo/archive logs and before starting reading data again. Value is in milliseconds.
log.mining.sleep.time.default.ms1000The starting amount of time that the connector sleeps after reading data from redo/archive logs and before starting reading data again. Value is in milliseconds.
log.mining.sleep.time.increment.ms200The maximum amount of time up or down that the connector uses to tune the optimal sleep time when reading data from logminer. Value is in milliseconds.
log.mining.view.fetch.size10000The number of content records that the connector fetches from the LogMiner content view.
log.mining.archive.log.hours0The number of hours in the past from SYSDATE to mine archive logs. When the default setting (0) is used, the connector mines all archive logs.
log.mining.archive.log.only.modefalseControls whether or not the connector mines changes from just archive logs or a combination of the online redo logs and archive logs (the default). Redo logs use a circular buffer that can be archived at any point. In environments where online redo logs are archived frequently, this can lead to LogMiner session failures. In contrast to redo logs, archive logs are guaranteed to be reliable. Set this option to true to force the connector to mine archive logs only. After you set the connector to mine only the archive logs, the latency between an operation being committed and the connector emitting an associated change event might increase. The degree of latency depends on how frequently the database is configured to archive online redo logs.
log.mining.archive.log.only.scn.poll.interval.ms10000The number of milliseconds the connector will sleep in between polling to determine if the starting system change number is in the archive logs. If log.mining.archive.log.only.mode is not enabled, this setting is not used.
log.mining.transaction.retention.hours0Positive integer value that specifies the number of hours to retain long running transactions between redo log switches. When set to 0, transactions are retained until a commit or rollback is detected.The LogMiner adapter maintains an in-memory buffer of all running transactions. Because all of the DML operations that are part of a transaction are buffered until a commit or rollback is detected, long-running transactions should be avoided in order to not overflow that buffer. Any transaction that exceeds this configured value is discarded entirely, and the connector does not emit any messages for the operations that were part of the transaction. While this option allows the behavior to be configured on a case-by-case basis, we have plans to enhance this behavior in a future release by means of adding a scalable transaction buffer, (see DBZ-3123).
log.mining.archive.destination.nameNo defaultSpecifies the configured Oracle archive destination to use when mining archive logs with LogMiner. The default behavior automatically selects the first valid, local configured destination. However, you can use a specific destination by providing its name, for example, LOG_ARCHIVE_DEST_5.
log.mining.username.exclude.listNo defaultList of database users to exclude from the LogMiner query. It can be useful to set this property if you want the capturing process to always exclude the changes that specific users make.
log.mining.scn.gap.detection.gap.size.min1000000Specifies a value that the connector compares to the difference between the current and previous SCN values to determine whether an SCN gap exists. If the difference between the SCN values is greater than the specified value, and the time difference is smaller than log.mining.scn.gap.detection.time.interval.max.ms then an SCN gap is detected, and the connector uses a mining window larger than the configured maximum batch.
log.mining.scn.gap.detection.time.interval.max.ms20000Specifies a value, in milliseconds, that the connector compares to the difference between the current and previous SCN timestamps to determine whether an SCN gap exists. If the difference between the timestamps is less than the specified value, and the SCN delta is greater than log.mining.scn.gap.detection.gap.size.min, then an SCN gap is detected and the connector uses a mining window larger than the configured maximum batch.
lob.enabledfalseControls whether or not large object (CLOB or BLOB) column values are emitted in change events. By default, change events have large object columns, but the columns contain no values. There is a certain amount of overhead in processing and managing large object column types and payloads. To capture large object values and serialize them in change events, set this option to true.
unavailable.value.placeholder__debezium_unavailable_valueSpecifies the constant that the connector provides to indicate that the original value is unchanged and not provided by the database.
rac.nodesNo defaultA comma-separated list of Oracle Real Application Clusters (RAC) node host names or addresses. This field is required to enable Oracle RAC support. Specify the list of RAC nodes by using one of the following methods:Specify a value for database.port, and use the specified port value for each address in the rac.nodes list. For example:database.port=1521 rac.nodes=192.168.1.100,192.168.1.101Specify a value for database.port, and override the default port for one or more entries in the list. The list can include entries that use the default database.port value, and entries that define their own unique port values. For example:database.port=1521 rac.nodes=192.168.1.100,192.168.1.101:1522If you supply a raw JDBC URL for the database by using the database.url property, instead of defining a value for database.port, each RAC node entry must explicitly specify a port value.
skipped.operationsNo defaultA comma-separated list of the operation types that you want the connector to skip during streaming. You can configure the connector to skip the following types of operations:c (insert/create)u (update)d (delete)By default, no operations are skipped.
高级参数
PropertyDefaultDescription
database.history.kafka.topicThe full name of the Kafka topic where the connector stores the database schema history.
database.history.kafka.bootstrap.serversA list of host/port pairs that the connector uses for establishing an initial connection to the Kafka cluster. This connection is used for retrieving the database schema history previously stored by the connector, and for writing each DDL statement read from the source database. Each pair should point to the same Kafka cluster used by the Kafka Connect process.
database.history.kafka.recovery.poll.interval.ms100An integer value that specifies the maximum number of milliseconds the connector should wait during startup/recovery while polling for persisted data. The default is 100ms.
database.history.kafka.recovery.attempts4The maximum number of times that the connector should try to read persisted history data before the connector recovery fails with an error. The maximum amount of time to wait after receiving no data is recovery.attempts x recovery.poll.interval.ms.
database.history.skip.unparseable.ddlfalseA Boolean value that specifies whether the connector should ignore malformed or unknown database statements or stop processing so a human can fix the issue. The safe default is false. Skipping should be used only with care as it can lead to data loss or mangling when the binlog is being processed.
database.history.store.only.monitored.tables.ddl Deprecated and scheduled for removal in a future release; use database.history.store.only.captured.tables.ddl instead.falseA Boolean value that specifies whether the connector should record all DDL statements. false (the default) records all DDL statements; true records only those DDL statements that are relevant to tables whose changes are being captured by Debezium. Set to true with care because missing data might become necessary if you change which tables have their changes captured. The safe default is false.
database.history.store.only.captured.tables.ddlfalseA Boolean value that specifies whether the connector should record all DDL statements. false (the default) records all DDL statements; true records only those DDL statements that are relevant to tables whose changes are being captured by Debezium. Set to true with care because missing data might become necessary if you change which tables have their changes captured. The safe default is false.
注册示例
{
    "name": "inventory-connector",
    "config": {
        "connector.class" : "io.debezium.connector.oracle.OracleConnector",
        "tasks.max" : "1",
        "database.server.name" : "server1",
        "database.user" : "c##dbzuser",
        "database.password" : "dbz",
        "database.url": "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=OFF)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=<oracle ip 1>)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=<oracle ip 2>)(PORT=1521)))(CONNECT_DATA=SERVICE_NAME=)(SERVER=DEDICATED)))",
        "database.dbname" : "ORCLCDB",
        "database.pdb.name" : "ORCLPDB1",
        "database.history.kafka.bootstrap.servers" : "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory"
    }
}
事件schema
INSERT
{
  "schema": { ...
  },
  "payload": {
    "before": null,
    "after": {
      "ID": 1004,
      "FIRST_NAME": "Anne",
      "LAST_NAME": "Kretchmar",
      "EMAIL": "annek@noanswer.org"
    },
    "source": {
      "version": "1.7.1.Final",
      "name": "server1",
      "ts_ms": 1520085154000,
      "txId": "6.28.807",
      "scn": "2122185",
      "commit_scn": "2122185",
      "snapshot": false
    },
    "op": "c",
    "ts_ms": 1532592105975
  }
}
UPDATE
{
    "schema": { ... },
    "payload": {
        "before": {
            "ID": 1004,
            "FIRST_NAME": "Anne",
            "LAST_NAME": "Kretchmar",
            "EMAIL": "annek@noanswer.org"
        },
        "after": {
            "ID": 1004,
            "FIRST_NAME": "Anne",
            "LAST_NAME": "Kretchmar",
            "EMAIL": "anne@example.com"
        },
        "source": {
            "version": "1.7.1.Final",
            "name": "server1",
            "ts_ms": 1520085811000,
            "txId": "6.9.809",
            "scn": "2125544",
            "commit_scn": "2125544",
            "snapshot": false
        },
        "op": "u",
        "ts_ms": 1532592713485
    }
}
DELETE
{
    "schema": { ... },
    "payload": {
        "before": {
            "ID": 1004,
            "FIRST_NAME": "Anne",
            "LAST_NAME": "Kretchmar",
            "EMAIL": "anne@example.com"
        },
        "after": null,
        "source": {
            "version": "1.7.1.Final",
            "name": "server1",
            "ts_ms": 1520085153000,
            "txId": "6.28.807",
            "scn": "2122184",
            "commit_scn": "2122184",
            "snapshot": false
        },
        "op": "d",
        "ts_ms": 1532592105960
    }
}

Debezium connector for SQL Server

简介

Debezium SQL Server 连接器捕获 SQL Server 数据库架构中发生的行级更改。

Debezium SQL Server 连接器第一次连接到 SQL Server 数据库或集群时,会先对数据库架构做一次一致性快照。初始快照完成后,连接器会持续捕获已提交到启用了 CDC 的 SQL Server 数据库中的 INSERT、UPDATE 和 DELETE 操作所产生的行级更改。连接器为每个数据更改操作生成事件,并将它们流式传输到 Kafka 主题;每张表的所有事件都写入该表专用的 Kafka 主题。应用程序和服务随后可以消费这些主题中的数据更改事件记录。

Debezium SQL Server 连接器基于SQL Server 2016 Service Pack 1 (SP1) 和更高版本的标准版或企业版中 提供的更改数据捕获功能。SQL Server 捕获进程监视指定的数据库和表,并将更改存储到专门创建的具有存储过程外观的更改表中。

要让 Debezium SQL Server 连接器捕获数据库操作的更改事件记录,您必须首先在 SQL Server 数据库上启用更改数据捕获(CDC):必须在数据库以及每个要捕获的表上都启用 CDC。在源数据库上设置好 CDC 后,连接器就可以捕获数据库中发生的行级 INSERT、UPDATE 和 DELETE 操作。连接器将每个源表的事件记录写入一个专用于该表的 Kafka 主题,即每个被捕获的表对应一个主题。客户端应用程序读取其关注的数据库表所对应的 Kafka 主题,并对从这些主题中消费到的行级事件做出响应。

连接器第一次连接到 SQL Server 数据库或集群时,它会为所有配置为捕获更改的表获取架构的一致快照,并将此状态流式传输到 Kafka。快照完成后,连接器会持续捕获后续发生的行级更改。通过首先建立所有数据的一致视图,连接器可以继续读取,而不会丢失在快照发生时所做的任何更改。

Debezium SQL Server 连接器可以容忍失败。当连接器读取更改并生成事件时,它会定期在数据库日志中记录事件的位置(LSN / 日志序列号)。如果连接器因任何原因(包括通信故障、网络问题或崩溃)而停止,则在重新启动后,连接器将继续从它读取的最后一点开始读取 SQL Server CDC表。

SQL Server 连接器的工作原理

要优化配置和运行 Debezium SQL Server 连接器,了解连接器如何执行快照、流更改事件、确定 Kafka 主题名称以及使用元数据会很有帮助。

快照

SQL Server CDC 并非旨在存储数据库更改的完整历史记录。为了让 Debezium SQL Server 连接器为数据库的当前状态建立基线,它使用了一个称为快照的过程。

您可以配置连接器创建快照的方式。默认情况下,连接器的快照模式设置为 initial。在 initial 快照模式下,连接器第一次启动时会执行数据库的初始一致性快照。此初始快照会捕获所有与连接器配置的 include 和 exclude 属性(例如 table.include.list、column.include.list、table.exclude.list 等)所定义条件匹配的表的结构和数据。

当连接器创建快照时,它完成以下任务:

  1. 确定要捕获的表。
  2. 获取对启用了 CDC 的 SQL Server 表的锁定,以防止在创建快照期间发生结构更改。锁的级别由snapshot.isolation.mode配置选项决定。
  3. 读取服务器事务日志中的最大日志序列号 (LSN) 位置。
  4. 捕获所有相关表的结构。
  5. 如有必要,释放在步骤 2 中获得的锁。在大多数情况下,锁只持有很短的时间。
  6. 根据步骤3中读取的LSN位置扫描要捕获的SQL Server源表和模式,为表中的每一行生成一个 READ 事件,并将事件写入该表的Kafka主题。
  7. 在连接器偏移中记录快照的成功完成。

生成的初始快照捕获为 CDC 启用的表中每一行的当前状态。从此基线状态开始,连接器会在发生后续更改时捕获这些更改。

注册参数表
基础参数
PropertyDefaultDescription
nameNo defaultUnique name for the connector. Attempting to register again with the same name will fail. (This property is required by all Kafka Connect connectors.)
connector.classNo defaultThe name of the Java class for the connector. Always use a value of io.debezium.connector.sqlserver.SqlServerConnector for the SQL Server connector.
tasks.max1The maximum number of tasks that should be created for this connector. The SQL Server connector always uses a single task and therefore does not use this value, so the default is always acceptable.
database.hostnameNo defaultIP address or hostname of the SQL Server database server.
database.port1433Integer port number of the SQL Server database server.
database.userNo defaultUsername to use when connecting to the SQL Server database server.
database.passwordNo defaultPassword to use when connecting to the SQL Server database server.
database.dbnameNo defaultThe name of the SQL Server database from which to stream the changes. Must not be used with database.names.
database.namesNo defaultThe comma-separated list of the SQL Server database names from which to stream the changes. Currently, only one database name is supported. Must not be used with database.dbname.This option is experimental and must not be used in production. Using it will make the behavior of the connector incompatible with the default configuration with no upgrade or downgrade path:The connector will use different keys for its committed offset messages.The SQL statements used in snapshot.select.statement.overrides will have to use the database name as part of the fully-qualified table name.
database.server.nameNo defaultLogical name that identifies and provides a namespace for the SQL Server database server that you want Debezium to capture. The logical name should be unique across all other connectors, since it is used as a prefix for all Kafka topic names emanating from this connector. Only alphanumeric characters, hyphens, dots and underscores must be used.
table.include.listNo defaultAn optional comma-separated list of regular expressions that match fully-qualified table identifiers for tables that you want Debezium to capture; any table that is not included in table.include.list is excluded from capture. Each identifier is of the form schemaName.tableName. By default, the connector captures all non-system tables for the designated schemas. Must not be used with table.exclude.list.
table.exclude.listNo defaultAn optional comma-separated list of regular expressions that match fully-qualified table identifiers for the tables that you want to exclude from being captured; Debezium captures all tables that are not included in table.exclude.list. Each identifier is of the form schemaName.tableName. Must not be used with table.include.list.
column.include.listempty stringAn optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be included in the change event message values. Fully-qualified names for columns are of the form schemaName.tableName.columnName. Note that primary key columns are always included in the event’s key, even if not included in the value. Do not also set the column.exclude.list property.
column.exclude.listempty stringAn optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form schemaName.tableName.columnName. Note that primary key columns are always included in the event’s key, also if excluded from the value. Do not also set the column.include.list property.
column.mask.hash.*hashAlgorithm*.with.salt.*salt*n/aAn optional, comma-separated list of regular expressions that match the fully-qualified names of character-based columns. Fully-qualified names for columns are of the form *<schemaName>*.*<tableName>*._<columnName>. In the resulting change event record, the values for the specified columns are replaced with pseudonyms. A pseudonym consists of the hashed value that results from applying the specified hashAlgorithm and salt. Based on the hash function that is used, referential integrity is maintained, while column values are replaced with pseudonyms. Supported hash functions are described in the MessageDigest section of the Java Cryptography Architecture Standard Algorithm Name Documentation. In the following example, CzQMA0cB5K is a randomly selected salt. column.mask.hash.SHA-256.with.salt.CzQMA0cB5K = inventory.orders.customerName, inventory.shipment.customerNameIf necessary, the pseudonym is automatically shortened to the length of the column. The connector configuration can include multiple properties that specify different hash algorithms and salts. Depending on the hashAlgorithm used, the salt selected, and the actual data set, the resulting data set might not be completely masked.
time.precision.modeadaptiveTime, date, and timestamps can be represented with different kinds of precision, including: adaptive (the default) captures the time and timestamp values exactly as in the database using either millisecond, microsecond, or nanosecond precision values based on the database column’s type; or connect always represents time and timestamp values using Kafka Connect’s built-in representations for Time, Date, and Timestamp, which uses millisecond precision regardless of the database columns’ precision. See temporal values.
decimal.handling.modepreciseSpecifies how the connector should handle values for DECIMAL and NUMERIC columns: precise (the default) represents them precisely using java.math.BigDecimal values represented in change events in a binary form. double represents them using double values, which may result in a loss of precision but is easier to use. string encodes values as formatted strings, which is easy to consume but semantic information about the real type is lost.
include.schema.changestrueBoolean value that specifies whether the connector should publish changes in the database schema to a Kafka topic with the same name as the database server ID. Each schema change is recorded with a key that contains the database name and a value that is a JSON structure that describes the schema update. This is independent of how the connector internally records database history. The default is true.
tombstones.on.deletetrueControls whether a delete event is followed by a tombstone event. true - a delete operation is represented by a delete event and a subsequent tombstone event. false - only a delete event is emitted. After a source record is deleted, emitting a tombstone event (the default behavior) allows Kafka to completely delete all events that pertain to the key of the deleted row in case log compaction is enabled for the topic.
column.truncate.to._length_.charsn/aAn optional comma-separated list of regular expressions that match the fully-qualified names of character-based columns whose values should be truncated in the change event message values if the field values are longer than the specified number of characters. Multiple properties with different lengths can be used in a single configuration, although in each the length must be a positive integer. Fully-qualified names for columns are of the form schemaName.tableName.columnName.
column.mask.with._length_.charsn/aAn optional comma-separated list of regular expressions that match the fully-qualified names of character-based columns whose values should be replaced in the change event message values with a field value consisting of the specified number of asterisk (*) characters. Multiple properties with different lengths can be used in a single configuration, although in each the length must be a positive integer or zero. Fully-qualified names for columns are of the form schemaName.tableName.columnName.
column.propagate.source.typen/aAn optional comma-separated list of regular expressions that match the fully-qualified names of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change messages. The schema parameters __debezium.source.column.type, __debezium.source.column.length and __debezium.source.column.scale is used to propagate the original type name and length (for variable-width types), respectively. Useful to properly size corresponding columns in sink databases. Fully-qualified names for columns are of the form schemaName.tableName.columnName.
datatype.propagate.source.type+n/aAn optional comma-separated list of regular expressions that match the database-specific data type name of columns whose original type and length should be added as a parameter to the corresponding field schemas in the emitted change messages. The schema parameters __debezium.source.column.type, __debezium.source.column.length and __debezium.source.column.scale will be used to propagate the original type name and length (for variable-width types), respectively. Useful to properly size corresponding columns in sink databases. Fully-qualified data type names are of the form schemaName.tableName.typeName. See SQL Server data types for the list of SQL Server-specific data type names.
message.key.columnsn/aA list of expressions that specify the columns that the connector uses to form custom message keys for change event records that it publishes to the Kafka topics for specified tables. By default, Debezium uses the primary key column of a table as the message key for records that it emits. In place of the default, or to specify a key for tables that lack a primary key, you can configure custom message keys based on one or more columns. To establish a custom message key for a table, list the table, followed by the columns to use as the message key. Each list entry takes the following format: <fully-qualified_tableName>:<keyColumn>,<keyColumn> To base a table key on multiple column names, insert commas between the column names. Each fully-qualified table name is a regular expression in the following format: <schemaName>.<tableName> The property can include entries for multiple tables. Use a semicolon to separate table entries in the list. The following example sets the message key for the tables inventory.customers and purchase.orders: inventory.customers:pk1,pk2;(.*).purchaseorders:pk3,pk4 For the table inventory.customers, the columns pk1 and pk2 are specified as the message key. For the purchaseorders tables in any schema, the columns pk3 and pk4 serve as the message key. There is no limit to the number of columns that you use to create custom message keys. However, it’s best to use the minimum number that are required to specify a unique key.
binary.handling.modebytesSpecifies how binary (binary, varbinary) columns should be represented in change events, including: bytes represents binary data as byte array (default), base64 represents binary data as base64-encoded String, hex represents binary data as hex-encoded (base16) String
高级参数
PropertyDefaultDescription
snapshot.modeinitialA mode for taking an initial snapshot of the structure and optionally data of captured tables. Once the snapshot is complete, the connector will continue reading change events from the database’s redo logs. The following values are supported:initial: Takes a snapshot of structure and data of captured tables; useful if topics should be populated with a complete representation of the data from the captured tables. initial_only: Takes a snapshot of structure and data like initial but instead does not transition into streaming changes once the snapshot has completed. schema_only: Takes a snapshot of the structure of captured tables only; useful if only changes happening from now onwards should be propagated to topics.
snapshot.include.collection.listAll tables specified in table.include.listAn optional, comma-separated list of regular expressions that match names of fully-qualified table names (<db-name>.<schema-name>.<name>) included in table.include.list for which you want to take the snapshot.
snapshot.isolation.moderepeatable_readMode to control which transaction isolation level is used and how long the connector locks tables that are designated for capture. The following values are supported: read_uncommitted, read_committed, repeatable_read, snapshot, exclusive (exclusive mode uses repeatable read isolation level, however, it takes the exclusive lock on all tables to be read). The snapshot, read_committed and read_uncommitted modes do not prevent other transactions from updating table rows during initial snapshot. The exclusive and repeatable_read modes do prevent concurrent updates. Mode choice also affects data consistency. Only exclusive and snapshot modes guarantee full consistency, that is, initial snapshot and streaming logs constitute a linear history. In case of repeatable_read and read_committed modes, it might happen that, for instance, a record added appears twice - once in initial snapshot and once in streaming phase. Nonetheless, that consistency level should do for data mirroring. For read_uncommitted there are no data consistency guarantees at all (some data might be lost or corrupted).
event.processing.failure.handling.modefailSpecifies how the connector should react to exceptions during processing of events. fail will propagate the exception (indicating the offset of the problematic event), causing the connector to stop. warn will cause the problematic event to be skipped and the offset of the problematic event to be logged. skip will cause the problematic event to be skipped.
poll.interval.ms1000Positive integer value that specifies the number of milliseconds the connector should wait during each iteration for new change events to appear. Defaults to 1000 milliseconds, or 1 second.
max.queue.size8192Positive integer value that specifies the maximum size of the blocking queue into which change events read from the database log are placed before they are written to Kafka. This queue can provide backpressure to the CDC table reader when, for example, writes to Kafka are slower or if Kafka is not available. Events that appear in the queue are not included in the offsets periodically recorded by this connector. Defaults to 8192, and should always be larger than the maximum batch size specified in the max.batch.size property.
max.batch.size2048Positive integer value that specifies the maximum size of each batch of events that should be processed during each iteration of this connector. Defaults to 2048.
heartbeat.interval.ms0Controls how frequently heartbeat messages are sent. This property contains an interval in milliseconds that defines how frequently the connector sends messages to a heartbeat topic. The property can be used to confirm whether the connector is still receiving change events from the database. You also should leverage heartbeat messages in cases where only records in non-captured tables are changed for a longer period of time. In such situation the connector would proceed to read the log from the database but never emit any change messages into Kafka, which in turn means that no offset updates are committed to Kafka. This may result in more change events to be re-sent after a connector restart. Set this parameter to 0 to not send heartbeat messages at all. Disabled by default.
heartbeat.topics.prefix__debezium-heartbeatControls the naming of the topic to which heartbeat messages are sent. The topic is named according to the pattern <heartbeat.topics.prefix>.<server.name>.
snapshot.delay.msNo defaultAn interval in milli-seconds that the connector should wait before taking a snapshot after starting up; Can be used to avoid snapshot interruptions when starting multiple connectors in a cluster, which may cause re-balancing of connectors.
snapshot.fetch.size2000Specifies the maximum number of rows that should be read in one go from each table while taking a snapshot. The connector will read the table contents in multiple batches of this size. Defaults to 2000.
query.fetch.sizeNo defaultSpecifies the number of rows that will be fetched for each database round-trip of a given query. Defaults to the JDBC driver’s default fetch size.
snapshot.lock.timeout.ms10000An integer value that specifies the maximum amount of time (in milliseconds) to wait to obtain table locks when performing a snapshot. If table locks cannot be acquired in this time interval, the snapshot will fail (also see snapshots). When set to 0 the connector will fail immediately when it cannot obtain the lock. Value -1 indicates infinite waiting.
snapshot.select.statement.overridesNo defaultSpecifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.The property contains a comma-separated list of fully-qualified table names in the form *<schemaName>.<tableName>*. For example, "snapshot.select.statement.overrides": "inventory.products,customers.orders" For each table in the list, add a further configuration property that specifies the SELECT statement for the connector to run on the table when it takes a snapshot. The specified SELECT statement determines the subset of table rows to include in the snapshot. Use the following format to specify the name of this SELECT statement property: snapshot.select.statement.overrides.*<schemaName>*.*<tableName>*. For example, snapshot.select.statement.overrides.customers.orders. Example:From a customers.orders table that includes the soft-delete column, delete_flag, add the following properties if you want a snapshot to include only those records that are not soft-deleted:"snapshot.select.statement.overrides": "customer.orders", "snapshot.select.statement.overrides.customer.orders": "SELECT * FROM [customers].[orders] WHERE delete_flag = 0 ORDER BY id DESC"In the resulting snapshot, the connector includes only the records for which delete_flag = 0.
source.struct.versionv2Schema version for the source block in CDC events; Debezium 0.10 introduced a few breaking changes to the structure of the source block in order to unify the exposed structure across all the connectors. By setting this option to v1 the structure used in earlier versions can be produced. Note that this setting is not recommended and is planned for removal in a future Debezium version.
sanitize.field.namestrue when connector configuration explicitly specifies the key.converter or value.converter parameters to use Avro, otherwise defaults to false.Whether field names are sanitized to adhere to Avro naming requirements. See Avro naming for more details.
provide.transaction.metadatafalseWhen set to true Debezium generates events with transaction boundaries and enriches data events envelope with transaction metadata.See Transaction Metadata for additional details.
retriable.restart.connector.wait.ms10000 (10 seconds)The number of milli-seconds to wait before restarting a connector after a retriable error occurs.
skipped.operationsNo defaultcomma-separated list of operation types that will be skipped during streaming. The operations include: c for inserts/create, u for updates, and d for deletes. By default, no operations are skipped.
signal.data.collectionNo defaultFully-qualified name of the data collection that is used to send signals to the connector. The name format is database_name.schema-name.table-name.
max.iteration.transactions0Specifies the maximum number of transactions per iteration to be used to reduce the memory footprint when streaming changes from multiple tables in a database. When set to 0 (the default), the connector uses the current maximum LSN as the range to fetch changes from. When set to a value greater than zero, the connector uses the n-th LSN specified by this setting as the range to fetch changes from.
注册示例
{
    "name": "inventory-connector", 
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector", 
        "database.hostname": "192.168.99.100", 
        "database.port": "1433", 
        "database.user": "sa", 
        "database.password": "Password!", 
        "database.dbname": "testDB", 
        "database.server.name": "fullfillment", 
        "table.include.list": "dbo.customers", 
        "database.history.kafka.bootstrap.servers": "kafka:9092", 
        "database.history.kafka.topic": "dbhistory.fullfillment" 
    }
}
事件Schema
INSERT
{
  "schema": { ...},
  "payload": {
    "before": null,
    "after": {
      "id": 1005,
      "first_name": "john",
      "last_name": "doe",
      "email": "john.doe@example.org"
    },
    "source": {
      "version": "1.7.1.Final",
      "connector": "sqlserver",
      "name": "server1",
      "ts_ms": 1559729468470,
      "snapshot": false,
      "db": "testDB",
      "schema": "dbo",
      "table": "customers",
      "change_lsn": "00000027:00000758:0003",
      "commit_lsn": "00000027:00000758:0005",
      "event_serial_no": "1"
    },
    "op": "c",
    "ts_ms": 1559729471739
  }
}
UPDATE
{
  "schema": { ... },
  "payload": {
    "before": { 
      "id": 1005,
      "first_name": "john",
      "last_name": "doe",
      "email": "john.doe@example.org"
    },
    "after": { 
      "id": 1005,
      "first_name": "john",
      "last_name": "doe",
      "email": "noreply@example.org"
    },
    "source": { 
      "version": "1.7.1.Final",
      "connector": "sqlserver",
      "name": "server1",
      "ts_ms": 1559729995937,
      "snapshot": false,
      "db": "testDB",
      "schema": "dbo",
      "table": "customers",
      "change_lsn": "00000027:00000ac0:0002",
      "commit_lsn": "00000027:00000ac0:0007",
      "event_serial_no": "2"
    },
    "op": "u", 
    "ts_ms": 1559729998706  
  }
}
DELETE
{
  "schema": { ... },
  "payload": {
    "before": { <>
      "id": 1005,
      "first_name": "john",
      "last_name": "doe",
      "email": "noreply@example.org"
    },
    "after": null, 
    "source": { 
      "version": "1.7.1.Final",
      "connector": "sqlserver",
      "name": "server1",
      "ts_ms": 1559730445243,
      "snapshot": false,
      "db": "testDB",
      "schema": "dbo",
      "table": "customers",
      "change_lsn": "00000027:00000db0:0005",
      "commit_lsn": "00000027:00000db0:0007",
      "event_serial_no": "1"
    },
    "op": "d", 
    "ts_ms": 1559730450205 
  }
}

Debezium connector for Db2

参考官网

Debezium Connector for Cassandra

参考官网

Debezium connector for Vitess

参考官网

Debezium 内置引擎

Debezium 连接器通常通过将它们部署到 Kafka Connect 服务并配置一个或多个连接器来监控上游数据库并为它们在上游数据库中看到的所有更改生成数据更改事件来操作。这些数据更改事件被写入 Kafka,在那里它们可以被许多不同的应用程序独立使用。Kafka Connect 提供出色的容错性和可扩展性,因为它作为分布式服务运行,并确保所有已注册和配置的连接器始终运行。例如,即使集群中的一个 Kafka Connect 端点出现故障,其余的 Kafka Connect 端点也会重新启动之前在现在终止的端点上运行的所有连接器,从而最大限度地减少停机时间并消除管理活动。

并非每个应用程序都需要这种级别的容错和可靠性,它们也可能不想依赖外部的 Kafka 代理和 Kafka Connect 服务集群。相反,某些应用程序更愿意将 Debezium 连接器直接嵌入到应用程序内部。它们仍然想要相同的数据更改事件,但更希望连接器把事件直接发送给应用程序,而不是先保存在 Kafka 中。

debezium-api模块定义了一个小型 API,允许应用程序使用 Debezium 引擎轻松配置和运行 Debezium 连接器。

内置引擎文档

依赖关系

要使用 Debezium Engine 模块,请将 debezium-api 模块添加到应用程序的依赖项中。这个 API 在 debezium-embedded 模块中有一个开箱即用的实现,也应一并添加到依赖项中。对于 Maven,需要在应用程序的 POM 中添加以下内容:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-api</artifactId>
    <version>${version.debezium}</version>
</dependency>
<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-embedded</artifactId>
    <version>${version.debezium}</version>
</dependency>

其中,${version.debezium} 要么是您正在使用的 Debezium 版本号,要么是一个值为该版本字符串的 Maven 属性。

同样,为您的应用程序将使用的每个 Debezium 连接器添加依赖项。例如,可以将以下内容添加到您的应用程序的 Maven POM 文件中,以便您的应用程序可以使用 MySQL 连接器:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-mysql</artifactId>
    <version>${version.debezium}</version>
</dependency>

或者对于 MongoDB 连接器:

<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-mongodb</artifactId>
    <version>${version.debezium}</version>
</dependency>

本文档的其余部分描述了在您的应用程序中嵌入 MySQL 连接器。除了特定于连接器的配置、主题和事件之外,其他连接器的使用方式类似。

代码使用

您的应用程序需要为要运行的每个连接器实例设置一个嵌入式引擎。io.debezium.engine.DebeziumEngine<R> 类是对任何 Debezium 连接器的一个易用封装,并完全管理连接器的生命周期。您可以使用其构建器 API 创建 DebeziumEngine 实例,并提供以下内容:

  • 您希望接收消息的格式,例如 JSON、Avro 或 Kafka Connect 的 SourceRecord(请参阅下文的输出消息格式)
  • 定义引擎和连接器环境的配置属性(可能从属性文件加载)
  • 将为连接器产生的每个数据更改事件调用的方法

这是配置和运行嵌入式MySQL 连接器的代码示例:

// Define the configuration for the Debezium Engine with MySQL connector...
final Properties props = new Properties();
props.setProperty("name", "engine");
props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
props.setProperty("offset.flush.interval.ms", "60000");
/* begin connector properties */
props.setProperty("database.hostname", "localhost");
props.setProperty("database.port", "3306");
props.setProperty("database.user", "mysqluser");
props.setProperty("database.password", "mysqlpw");
props.setProperty("database.server.id", "85744");
props.setProperty("database.server.name", "my-app-connector");
props.setProperty("database.history",
      "io.debezium.relational.history.FileDatabaseHistory");
props.setProperty("database.history.file.filename",
      "/path/to/storage/dbhistory.dat");

// Create the engine with this configuration ...
try (DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
        .using(props)
        .notifying(record -> {
            System.out.println(record);
        }).build()
    ) {
    // Run the engine asynchronously ...
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.execute(engine);

    // Do something else or wait for a signal or an event
}
// Engine is stopped when the main code is finished

让我们更详细地研究这段代码,从我们在这里重复的前几行开始:

// Define the configuration for the Debezium Engine with MySQL connector...
final Properties props = new Properties();
props.setProperty("name", "engine");
props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
props.setProperty("offset.flush.interval.ms", 60000);

这会创建一个标准的 Properties 对象,用来设置引擎所需的几个字段,无论使用的是哪个连接器。第一个是引擎的名称,它会出现在连接器生成的源记录及其内部状态中,因此请在应用程序中使用一个有意义的名称。connector.class 字段定义了扩展 Kafka Connect org.apache.kafka.connect.source.SourceConnector 抽象类的类名;在这个例子中,我们指定 Debezium 的 MySqlConnector 类。

当 Kafka Connect 连接器运行时,它从源读取信息并定期记录“偏移量”,偏移量定义了它已经处理了多少信息。如果连接器重新启动,它将使用最后记录的偏移量来确定应该从源信息的哪个位置恢复读取。由于连接器并不知道也不关心偏移量如何存储,因此由引擎提供存储和恢复这些偏移量的方法。我们接下来配置的几个字段指定引擎应使用 FileOffsetBackingStore 类,把偏移量存储到本地文件系统上的 /tmp/offsets.dat 文件中(该文件可以任意命名并存放在任何位置)。此外,虽然连接器会为它产生的每条源记录记录偏移量,但引擎只会定期把偏移量刷新到后备存储(在我们的例子中是每分钟一次)。这些字段都可以根据您的应用程序的需要进行定制。

接下来的几行定义了特定于连接器的字段(记录在每个连接器的文档中),在我们的示例中是MySqlConnector连接器:

    /* begin connector properties */
    props.setProperty("database.hostname", "localhost")
    props.setProperty("database.port", "3306")
    props.setProperty("database.user", "mysqluser")
    props.setProperty("database.password", "mysqlpw")
    props.setProperty("database.server.id", "85744")
    props.setProperty("database.server.name", "my-app-connector")
    props.setProperty("database.history",
          "io.debezium.relational.history.FileDatabaseHistory")
    props.setProperty("database.history.file.filename",
          "/path/to/storage/dbhistory.dat")

在这里,我们设置运行 MySQL 数据库服务器的主机名和端口号,并定义将用于连接到 MySQL 数据库的用户名和密码。请注意,对于 MySQL,用户名和密码应对应于已被授予以下 MySQL 权限的 MySQL 数据库用户:

  • SELECT
  • RELOAD
  • SHOW DATABASES
  • REPLICATION SLAVE
  • REPLICATION CLIENT

读取数据库的一致性快照时需要前三个权限。最后两个权限允许读取服务器的 binlog,binlog 通常用于 MySQL 复制。

该配置还包括一个数字形式的 server.id 标识符。由于 MySQL 的 binlog 是 MySQL 复制机制的一部分,为了读取 binlog,MySqlConnector 实例必须加入 MySQL 服务器组,这意味着该服务器 ID 在组成 MySQL 服务器组的所有进程中必须唯一,并且是 1 到 2^32 − 1 之间的任意整数。在我们的代码中,我们将它设置为一个相当大、但有些随机的值,仅供我们的应用程序使用。

该配置还为 MySQL 服务器指定了一个逻辑名称。连接器会在它生成的每条源记录的主题字段中包含此逻辑名称,使您的应用程序能够识别这些记录的来源。我们的示例使用 my-app-connector 作为服务器逻辑名称,当然,您可以将其命名为对您的应用程序有意义的任何名称。

MySqlConnector 类运行时会读取 MySQL 服务器的二进制日志(binlog),其中包括对该服务器所托管数据库的所有数据更改和架构更改。由于数据的每次更改都是按照更改发生时所属表的架构来组织的,连接器需要跟踪所有架构更改,才能正确解码更改事件。连接器会记录架构信息,这样当它重启并从上次记录的偏移量继续读取时,就能确切知道该偏移量处的数据库架构是什么样子。连接器如何记录数据库架构历史由配置的最后两个字段定义:连接器应使用 FileDatabaseHistory 类,把架构历史更改存储到本地文件系统上的 /path/to/storage/dbhistory.dat 文件中(同样,此文件可以任意命名并存放在任何位置)。

这样配置就构建完成,可以用来创建引擎了。(顺便说一句,除了以编程方式逐条设置属性,我们也可以使用 Configuration.read(…) 等方法从属性文件中读取配置。)
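如果采用从属性文件加载的方式,下面是一个只用标准 java.util.Properties 的最小示意(文件名 connector.properties 为假设,文件内容即前文列出的各个键值对):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ConnectorConfigLoader {

    // 从属性文件加载引擎与连接器配置;connector.properties 中包含
    // name、connector.class、offset.storage 等键值对
    public static Properties load(String path) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load("connector.properties");
        System.out.println("loaded " + props.size() + " properties");
    }
}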

现在我们有了一个配置,我们可以创建我们的引擎。这里再次是相关的代码行:

// Create the engine with this configuration ...
try (DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
        .using(props)
        .notifying(record -> {
            System.out.println(record);
        })
        .build()) {
}

所有更改事件都会传递给给定的处理程序方法,该方法必须符合 java.util.function.Consumer<R> 函数式接口的签名,其中 <R> 必须与调用 create() 时指定的格式类型一致。请注意,应用程序的处理函数不应抛出任何异常;如果抛出,引擎会记录该异常并继续处理下一条源记录,但您的应用程序将没有机会处理导致异常的那条源记录,这意味着您的应用程序可能会与数据库变得不一致。
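由于处理函数不应把异常抛回给引擎,一种常见做法是在消费者内部自行捕获并处理异常。下面只是一个示意,具体的错误处理策略需由应用自行决定:

// props 为前文构建的配置
DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
        .using(props)
        .notifying(record -> {
            try {
                // 业务处理逻辑,例如写入下游系统
                System.out.println(record);
            }
            catch (Exception e) {
                // 在消费者内部记录并处理失败的记录,避免异常传播回引擎
                System.err.println("处理记录失败: " + e.getMessage());
            }
        })
        .build();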

此时,我们已经有了一个配置完成、随时可以运行的 DebeziumEngine 对象,但它还不会执行任何操作。DebeziumEngine 被设计为由 Executor 或 ExecutorService 异步执行:

// Run the engine asynchronously ...
ExecutorService executor = Executors.newSingleThreadExecutor();
executor.execute(engine);

// Do something else or wait for a signal or an event

您的应用程序可以通过调用其close()方法来安全优雅地停止引擎:

// At some later time ...
engine.close();

由于引擎实现了 Closeable 接口,当 try-with-resources 块退出时,close() 也会被自动调用。

引擎的连接器将停止从源系统读取信息,把所有剩余的更改事件转发给您的处理函数,并将最新的偏移量刷新到偏移存储。只有在这一切都完成之后,引擎的 run() 方法才会返回。如果您的应用程序需要在退出之前等待引擎完全停止,可以使用 ExecutorService 的 shutdown 和 awaitTermination 方法来实现:

try {
    executor.shutdown();
    while (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
        logger.info("Waiting another 5 seconds for the embedded engine to shut down");
    }
}
catch ( InterruptedException e ) {
    Thread.currentThread().interrupt();
}

或者,您可以在创建 DebeziumEngine 时注册一个 CompletionCallback,作为引擎终止时收到通知的回调。
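下面是一个注册 CompletionCallback 的最小示意(回调中的日志输出仅为示例):

// 引擎终止时被调用:success 表示是否正常结束,error 在异常终止时非空
DebeziumEngine.CompletionCallback callback = (success, message, error) -> {
    if (!success) {
        System.err.println("引擎异常终止: " + message);
    }
};

DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
        .using(props)
        .using(callback)
        .notifying(record -> System.out.println(record))
        .build();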

回想一下,当 JVM 关闭时,它只等待守护线程。因此,如果您的应用程序退出,请确保等待引擎完成或在守护线程上运行引擎。

您的应用程序应当始终正确地停止引擎,以确保正常且完整地关闭,并保证每条源记录恰好发送给应用程序一次。例如,不要依赖直接关闭 ExecutorService 来停止引擎,因为那会中断正在运行的线程。虽然 DebeziumEngine 在其线程被中断时确实会终止,但可能无法干净地终止;当应用程序重新启动时,它可能会再次看到一些在关闭之前已经处理过的源记录。

输出消息格式

DebeziumEngine#create()可以接受多个不同的参数,这些参数会影响消费者接收消息的格式。允许的值为:

  • Connect.class - 输出值是包装 Kafka Connect 的更改事件 SourceRecord
  • Json.class- 输出值是一对编码为JSON字符串的键和值
  • Avro.class - 输出值是一对键值对,编码为 Avro 序列化记录
  • CloudEvents.class- 输出值是一对键值对,编码为Cloud Events消息

在内部,引擎使用相应的 Kafka Connect 转换器(converter)实现,并把转换工作委托给它。可以通过引擎属性对转换器进行参数化,以修改其行为。

JSON输出格式的一个例子是

final Properties props = new Properties();
...
props.setProperty("converter.schemas.enable", "false"); // don't include schema in message
...
final DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
    .using(props)
    .notifying((records, committer) -> {

        for (ChangeEvent<String, String> r : records) {
            System.out.println("Key = '" + r.key() + "' value = '" + r.value() + "'");
            committer.markProcessed(r);
        }
...

其中ChangeEvent数据类型是键/值对。

消息转换

在将消息传递给处理程序之前,可以先让它们经过一条由 Kafka Connect 简单消息转换(SMT)组成的管道。每个 SMT 可以原样放行消息、修改消息或将其过滤掉。转换链通过 transforms 属性配置,该属性是要应用的各个转换的逻辑名称的逗号分隔列表。随后,transforms.<logical_name>.type 属性定义每个转换的实现类名,transforms.<logical_name>.* 则是传递给该转换的配置选项。

配置的一个例子是

final Properties props = new Properties();
...
props.setProperty("transforms", "filter, router");                                               // (1)
props.setProperty("transforms.router.type", "org.apache.kafka.connect.transforms.RegexRouter");  // (2)
props.setProperty("transforms.router.regex", "(.*)");                                            // (3)
props.setProperty("transforms.router.replacement", "trf$1");                                     // (3)
props.setProperty("transforms.filter.type", "io.debezium.embedded.ExampleFilterTransform");      // (4)
  1. 定义了两个转换 - filter 和 router
  2. router 转换的实现类是 org.apache.kafka.connect.transforms.RegexRouter
  3. router 转换有两个配置选项 - regex 和 replacement
  4. filter 转换的实现类是 io.debezium.embedded.ExampleFilterTransform(一个自定义过滤转换的示意见本列表之后)
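io.debezium.embedded.ExampleFilterTransform 是 Debezium 附带的示例转换。下面给出一个假设的自定义过滤转换的最小示意(类名 MyFilterTransform 为虚构,过滤规则仅作演示),它实现 Kafka Connect 的 Transformation 接口,apply 返回 null 即表示丢弃该条消息:

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class MyFilterTransform implements Transformation<SourceRecord> {

    @Override
    public SourceRecord apply(SourceRecord record) {
        // 丢弃没有值的消息(例如删除事件之后的 tombstone),其余消息原样放行
        return record.value() == null ? null : record;
    }

    @Override
    public ConfigDef config() {
        // 本示例不暴露任何配置项
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // 如需读取 transforms.<logical_name>.* 形式的配置项,可在此处理
    }

    @Override
    public void close() {
    }
}

把 transforms.<logical_name>.type 指向该类的全限定名后,嵌入式引擎就会在把消息交给处理程序之前先经过这个过滤器。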

高级记录消费

对于某些用例,例如尝试批量写入记录或写入异步 API 时,上面的函数式接口可能不太方便。在这些情况下,使用 io.debezium.engine.DebeziumEngine.ChangeConsumer<R> 接口可能会更容易。

该接口只有一个方法,签名如下:

/**
  * Handles a batch of records, calling the {@link RecordCommitter#markProcessed(Object)}
  * for each record and {@link RecordCommitter#markBatchFinished()} when this batch is finished.
  * @param records the records to be processed
  * @param committer the committer that indicates to the system that we are finished
  */
 void handleBatch(List<R> records, RecordCommitter<R> committer) throws InterruptedException;

正如 Javadoc 中所提到的,需要对每条记录调用 RecordCommitter 的 markProcessed 方法,并在整个批次处理完成后调用 markBatchFinished。RecordCommitter 接口是线程安全的,因而可以灵活地处理记录。

您可以选择覆盖已处理记录的偏移量。具体做法是:先调用 RecordCommitter#buildOffsets() 构建一个新的 Offsets 对象,用 Offsets#set(String key, Object value) 更新偏移量,再调用 RecordCommitter#markProcessed(SourceRecord record, Offsets sourceOffsets) 并传入更新后的 Offsets。
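下面是一个 handleBatch 实现的最小示意,演示逐条 markProcessed、批次结束后 markBatchFinished 的提交流程(下游写入逻辑为假设,这里仅打印到控制台;如需覆盖偏移量,可在 markProcessed 时传入按上述方式构建的 Offsets):

class BatchingChangeConsumer implements DebeziumEngine.ChangeConsumer<ChangeEvent<String, String>> {
  @Override
  public void handleBatch(List<ChangeEvent<String, String>> records,
      DebeziumEngine.RecordCommitter<ChangeEvent<String, String>> committer) throws InterruptedException {
    for (ChangeEvent<String, String> record : records) {
      // 先把记录写入下游(例如批量写库或调用异步 API),这里仅打印
      System.out.println(record.value());
      // 逐条标记该记录已处理
      committer.markProcessed(record);
    }
    // 整个批次处理完毕后通知引擎,此后对应的偏移量才可以被提交
    committer.markBatchFinished();
  }
}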

要使用 ChangeConsumer API,您必须把该接口的一个实现传递给 notifying 方法,如下所示:

class MyChangeConsumer implements DebeziumEngine.ChangeConsumer<RecordChangeEvent<SourceRecord>> {
  public void handleBatch(List<RecordChangeEvent<SourceRecord>> records, RecordCommitter<RecordChangeEvent<SourceRecord>> committer) throws InterruptedException {
    ...
  }
}
// Create the engine with this configuration ...
DebeziumEngine<RecordChangeEvent<SourceRecord>> engine = DebeziumEngine.create(ChangeEventFormat.of(Connect.class))
        .using(props)
        .notifying(new MyChangeConsumer())
        .build();

如果使用 JSON 格式(等效格式也适用于其他格式),则代码如下所示:

class JsonChangeConsumer implements DebeziumEngine.ChangeConsumer<ChangeEvent<String, String>> {
  public void handleBatch(List<ChangeEvent<String, String>> records,
    RecordCommitter<ChangeEvent<String, String>> committer) throws InterruptedException {
    ...
  }
}
// Create the engine with this configuration ...
DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
        .using(props)
        .notifying(new JsonChangeConsumer())
        .build();

引擎属性

除非有默认值,否则以下配置属性都是必需的(为保持排版简洁,Java 类的包名以 <…> 代替)。

属性默认值描述
name连接器实例的唯一名称。
connector.class连接器的 Java 类的名称,例如 MySQL 连接器的 <…>.MySqlConnector。
offset.storage<…>.FileOffsetBackingStore负责连接器偏移持久性的 Java 类的名称。它必须实现<…>.OffsetBackingStore接口。
offset.storage.file.filename""要存储偏移量的文件路径。当 offset.storage 被设置为 <…>.FileOffsetBackingStore 时需要。
offset.storage.topic""要存储偏移量的 Kafka 主题的名称。当 offset.storage 被设置为 <…>.KafkaOffsetBackingStore 时需要。
offset.storage.partitions""创建偏移存储主题时使用的分区数。当 offset.storage 被设置为 <…>.KafkaOffsetBackingStore 时需要。
offset.storage.replication.factor""创建偏移存储主题时使用的复制因子。当 offset.storage 被设置为 <…>.KafkaOffsetBackingStore 时需要。
offset.commit.policy<…>.PeriodicCommitOffsetPolicy提交策略的 Java 类的名称。它定义了何时必须根据处理的事件数量和自上次提交以来经过的时间触发偏移提交。这个类必须实现接口<…>.OffsetCommitPolicy。默认值是基于时间间隔的定期提交策略。
offset.flush.interval.ms60000尝试提交偏移量的时间间隔。默认值为 1 分钟。
offset.flush.timeout.ms5000在取消进程和恢复要在未来尝试提交的偏移数据之前,等待记录刷新和分区偏移数据提交到偏移存储的最大毫秒数。默认值为 5 秒。
internal.key.converter<…>.JsonConverter用于序列化和反序列化偏移量键(key)数据的 Converter 类。默认为 JSON 转换器。
internal.value.converter<…>.JsonConverter用于序列化和反序列化偏移量值(value)数据的 Converter 类。默认为 JSON 转换器。
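作为参考,下面的示意片段展示了如何通过上述属性把偏移量改存到 Kafka 主题(主题名、分区数等取值均为假设;视所用 Kafka 集群而定,可能还需要额外的 Kafka 连接属性):

final Properties props = new Properties();
props.setProperty("name", "engine");
props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
// 改用 Kafka 主题而不是本地文件来持久化偏移量
props.setProperty("offset.storage", "org.apache.kafka.connect.storage.KafkaOffsetBackingStore");
props.setProperty("offset.storage.topic", "my-connector-offsets");
props.setProperty("offset.storage.partitions", "1");
props.setProperty("offset.storage.replication.factor", "1");
// 每分钟尝试提交一次偏移量
props.setProperty("offset.flush.interval.ms", "60000");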

数据库历史属性

一些连接器还需要配置数据库历史的附加属性集:

  • MySQL
  • SQL Server
  • Oracle
  • Db2

如果没有正确配置数据库历史,连接器将拒绝启动。默认配置期望 Kafka 集群可用。对于其他部署,可以使用基于文件的数据库历史存储实现。

属性默认值描述
database.history<…>.KafkaDatabaseHistory负责持久化数据库历史的 Java 类的名称。 它必须实现<…>.DatabaseHistory接口。
database.history.file.filename""存储数据库历史记录的文件的路径。当 database.history 被设置为 <…>.FileDatabaseHistory 时需要。
database.history.kafka.topic""存储数据库历史的 Kafka 主题。当 database.history 被设置为 <…>.KafkaDatabaseHistory 时需要。
database.history.kafka.bootstrap.servers""要连接的 Kafka 集群服务器的初始列表。该集群提供用于存储数据库历史的主题。当 database.history 被设置为 <…>.KafkaDatabaseHistory 时需要。
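类似地,下面的示意片段把数据库架构历史改存到 Kafka(主题名与地址均为示例取值):

final Properties props = new Properties();
// 使用 Kafka 而不是本地文件来持久化数据库架构历史
props.setProperty("database.history", "io.debezium.relational.history.KafkaDatabaseHistory");
props.setProperty("database.history.kafka.topic", "dbhistory.my-app-connector");
props.setProperty("database.history.kafka.bootstrap.servers", "kafka:9092");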

处理故障

当引擎执行时,它的连接器会主动记录每个源记录中的源偏移量,并且引擎会定期将这些偏移量刷新到持久存储中。当应用程序和引擎正常关闭或崩溃时,当它们重新启动时,引擎及其连接器将从上次记录的偏移量开始重新读取源信息。

那么,当您的应用程序在嵌入式引擎运行时出现故障时会发生什么?最终效果是应用程序可能会在重新启动后收到一些在崩溃前已经处理过的源记录。多少取决于引擎将偏移量刷新到其存储(通过offset.flush.interval.ms属性)的频率以及特定连接器在一批中返回的源记录数。最好的情况是每次都刷新偏移量(例如,offset.flush.interval.ms设置为 0),但即使如此,嵌入式引擎仍然只会在从连接器接收到每批源记录后才刷新偏移量。

例如,MySQL 连接器使用max.batch.size来指定批处理中可以出现的最大源记录数。即使offset.flush.interval.ms设置为 0,当应用程序在崩溃后重新启动时,它可能会看到最多n 个重复项,其中n是批次的大小。如果该offset.flush.interval.ms属性设置得更高,则应用程序可能会看到最多n * m重复项,其中n是批次的最大大小,而m是在单个偏移刷新间隔期间可能累积的批次数。(显然,可以将嵌入式连接器配置为不使用批处理并始终刷新偏移量,从而导致应用程序永远不会收到任何重复的源记录。然而,这会显着增加开销并降低连接器的吞吐量。)

总而言之,使用嵌入式连接器时,应用程序在正常运行期间(包括正常关闭后的重启)会恰好接收每条源记录一次,但必须能够容忍在崩溃或非正常关闭后重启时收到重复事件。如果应用程序需要更严格的恰好一次(exactly-once)语义,则应使用完整的 Debezium 平台,它可以(即使在崩溃和重启之后)提供恰好一次的保证。
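例如,如果希望把崩溃后可能收到的重复事件数量压到最低(以牺牲吞吐量为代价),可以参考如下示意配置(取值仅作说明,在前文构建的 props 基础上调整):

// 每批记录处理完后立即刷新偏移量,使崩溃后最多重复一个批次的记录
props.setProperty("offset.flush.interval.ms", "0");
// 减小连接器的批次大小,进一步降低可能重复的记录数上限 n
props.setProperty("max.batch.size", "64");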
