NiFi Dataflow in Practice: MySQL CDC to Hive

邢为栋

Last modified: 2023-07-28 15:41:46

Category: Bigdata. Tags: mysql hdfs hive nifi

First published: 2023-07-13 17:02:34

Article link: https://blog.csdn.net/xwd127429/article/details/131706702

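NiFi dataflow in practice: capture MySQL CDC data in real time and write it to Hive in real time.
NiFi version: 1.22.0
For concepts such as FlowFile, Processor, Controller Service, and Record, see the official NiFi documentation:
Apache NiFi Documentation.
The official documentation explains the concepts and usage in detail. After one patient read-through, you can start building NiFi dataflows quickly.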

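Overview

NiFi's PutHDFS processor can write data to HDFS in real time, but there is no processor that writes data to Hive in real time.
This article presents one approach that lets NiFi write data to Hive in real time. It is already used in production.
The main idea is to have NiFi update the Hive table's data and metadata independently (the Hive table must be created in advance):

Use PutHDFS to update the Hive table data
Use ExecuteStreamCommand to run an external custom script that updates the Hive table metadata

The result is no different from writing to the Hive table the normal way, and all features of Hive table operations remain available.
The complete dataflow diagram is as follows: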

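In the dataflow above, the queue between each pair of processors is labeled with the connected Relationship, so the connections are not described in detail here.
Walkthrough of the dataflow:

Use CaptureChangeMySQL to capture MySQL binlog data in real time. The data is JSON, and the processor is configured to emit one flowfile per record.
Use JSLTTransformJSON to convert the binlog JSON into a custom JSON structure that is easier to process downstream.
Use EvaluateJsonPath to create flowfile attributes from the flowfile content. Here it extracts the timestamp of the binlog record into an attribute named binlog.record.timestamp, so that downstream steps can derive the Hive date partition for the flowfile.
Use UpdateAttribute to add the hive.database, hive.table, and hive.partition.dt attributes and to delete the timestamp attribute extracted in the previous step (this reduces the resources held by the flowfile and is optional).
Use MergeContent to merge small flowfiles into large ones, which avoids a large number of small files in HDFS.
Use ConvertRecord to convert the flowfile content from JSON to Avro, matching the storage format of the Hive table (Parquet or another format also works).
Use PutHDFS to write the flowfile into the data directory of the Hive table.
Use DetectDuplicate on the hive.partition.dt attribute to detect new date partition values.
Use ExecuteStreamCommand to handle each new partition value: an external custom script runs the Hive statement that adds the table partition, so partitions are added in real time.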

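Processors used and a short description of each:

Processor	Description
CaptureChangeMySQL	Captures MySQL binlog data in real time and outputs JSON
JSLTTransformJSON	Transforms the JSON structure in the flowfile content
EvaluateJsonPath	Creates flowfile attributes or content from the JSON in the flowfile content
UpdateAttribute	Updates or deletes flowfile attributes
MergeContent	Merges multiple flowfiles into one flowfile
ConvertRecord	Converts the data format of the flowfile content, for example from JSON to Avro
PutHDFS	Writes flowfiles to an HDFS directory
DetectDuplicate	Detects whether a flowfile attribute value has been seen before
ExecuteStreamCommand	Provides a flexible way to integrate external commands or scripts into a NiFi dataflow, and can pass the content of the incoming flowfile to the command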

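Controller Services used and a short description of each:

Controller Service	Description
JsonTreeReader	Parses JSON data into individual Record objects
AvroRecordSetWriter	Writes a set of Record objects in binary Avro format
KerberosKeytabUserService	Performs Kerberos authentication with a Kerberos principal and keytab
RedisConnectionPoolService	Redis connection pool service
RedisDistributedMapCacheClientService	A DistributedMapCacheClient backed by Redis, used to provide the cache service

The configuration of each stage of the dataflow is described in detail below.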

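Process Group Configuration

NiFi supports Parameter Contexts, a convenient way to define common parameters. A Parameter Context is assigned to a Process Group, and the processors in that group can then reference its parameters.
One Parameter Context can be assigned to multiple Process Groups, but a Process Group can have only one Parameter Context.
The Parameter Context in this case is named MysqlToHive. Its parameters are referenced by a processor described below (ExecuteStreamCommand).

The MysqlToHive Parameter Context shown above is configured as follows. The parameters referenced later in this article are hiveserver2-host, hiveserver2-port, principal, and keytab.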

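Processor Configuration

For each processor below, any tab that is not mentioned uses its default values.

CaptureChangeMySQL

Scheduling Tab

The defaults are generally fine.
Pay attention to Run Schedule: keep the default of 0 sec. The processor then runs continuously, which is what real-time capture requires.
If it is set to an interval greater than 0, the processor is started and stopped periodically, which causes all sorts of problems.
With an interval greater than 0, I ran into the following after a stop followed by a start:

The processor emitted no data and reported no error. (At the time the JDBC driver version was older than the MySQL version. After switching versions I did not see this again.)
An error reporting a duplicate MySQL binlog Server ID. My guess is that periodic execution led to several processes or threads running at the same time.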

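Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
MySQL Driver Class Name	Must be compatible with the target MySQL version.	com.mysql.cj.jdbc.Driver
MySQL Driver Location(s)	Absolute path of the JDBC jar file. Tip: when a processor lets you configure resource file paths, configure them on the processor. This is flexible and reduces the impact on the NiFi service itself.	/data/nifi/extra_libs/mysql-connector-j-8.0.33.jar
Event Processing Strategy	I use Max Events Per FlowFile together with Events Per FlowFile = 1, so that each flowfile carries one CDC record, which simplifies downstream processing.	Max Events Per FlowFile
Events Per FlowFile	I set this to 1, used together with Event Processing Strategy.	1
Server ID	The default is 65535. MySQL binlog replication requires every slave to have a unique Server ID. Set it according to your environment and avoid duplicates.	65530
Database/Schema Name Pattern	Restricts capture to events from the matching databases. Regular expressions are supported. By default, events from all databases are captured. Example: test1|test2 captures events only from the databases test1 and test2.
Table Name Pattern	Restricts capture to events from the matching tables. Regular expressions are supported. By default, events from all tables are captured.
Use Binlog GTID	I set this to true. It requires GTID to be enabled for the MySQL binlog and is more reliable.	true

References: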
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-cdc-mysql-nar/1.22.0/org.apache.nifi.cdc.mysql.processors.CaptureChangeMySQL/index.html

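JSLTTransformJSON

In my experience, JSLTTransformJSON is simpler and easier to use than JoltTransformJSON.

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
JSLT Transformation	Requires learning the JSLT syntax. Flowfile attribute values can be used in the transformation logic.	See the example below

Example:

// JSLT Transformation configuration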
{
  "cdc_sequence_id": ${cdc.sequence.id},
  "columns": to-json({for (.columns)
    .name : .value
    if (is-string(.name))
  }),
  * : .
}

// Input JSON, assuming cdc.sequence.id=1
{
  "type" : "insert",
  "timestamp" : 1689237765000,
  "binlog_gtidset" : "4a60a881-bad6-11ec-a01d-00163e19ca3b:1-358723163",
  "database" : "test",
  "table_name" : "test_table",
  "table_id" : 130,
  "columns" : [ {
    "id" : 1,
    "name" : "id",
    "column_type" : -5,
    "value" : 1234
  }, {
    "id" : 2,
    "name" : "user_id",
    "column_type" : 4,
    "value" : 21
  } ]
}

// Output JSON; cdc_sequence_id is taken from the flowfile attribute
{
  "cdc_sequence_id": 1,
  "columns" : "{\"id\":1234,\"user_id\":21}",
  "type" : "insert",
  "timestamp" : 1689237765000,
  "binlog_gtidset" : "4a60a881-bad6-11ec-a01d-00163e19ca3b:1-358723163",
  "database" : "test",
  "table_name" : "test_table",
  "table_id" : 130
}

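Explanation of the example configuration:

"cdc_sequence_id": ${cdc.sequence.id},

In the output JSON, the key is cdc_sequence_id and the value is the value of the flowfile attribute cdc.sequence.id.

"columns": to-json({for (.columns)
  .name : .value
  if (is-string(.name))
}),

In the output JSON, the key is columns and the value is the transformed columns value of the input JSON.
In the input JSON structure, columns is an array. The transformation logic is:

Iterate over the elements of columns, using the value of the name field as the key and the value of the value field as the value
The value of the name field must be a string
Use to-json to convert the result into a JSON string

* : .

All other keys and values of the input JSON are written to the output JSON unchanged.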
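
For readers who do not know JSLT, the same reshaping can be sketched in Python. This is only an illustration of the logic explained above, not something NiFi runs. The function name transform_cdc_event is made up for this sketch, and the sequence id is passed as a plain argument instead of being read from a flowfile attribute.

```python
import json


def transform_cdc_event(event: dict, cdc_sequence_id: int) -> dict:
    """Mimic the JSLT template: flatten `columns` and copy every other key."""
    # Keep only entries whose name is a string, mapping name -> value.
    flat = {
        col["name"]: col.get("value")
        for col in event.get("columns", [])
        if isinstance(col.get("name"), str)
    }
    out = {
        "cdc_sequence_id": cdc_sequence_id,
        # Serialize the flattened map to a compact JSON string, like to-json.
        "columns": json.dumps(flat, separators=(",", ":")),
    }
    # Equivalent of `* : .`: copy the keys that were not set explicitly above.
    for key, value in event.items():
        if key not in out:
            out[key] = value
    return out


event = {
    "type": "insert",
    "timestamp": 1689237765000,
    "database": "test",
    "table_name": "test_table",
    "columns": [
        {"id": 1, "name": "id", "column_type": -5, "value": 1234},
        {"id": 2, "name": "user_id", "column_type": 4, "value": 21},
    ],
}
print(json.dumps(transform_cdc_event(event, 1), indent=2))
```

Storing columns as a JSON string keeps the Hive table schema fixed no matter which MySQL table the event came from.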

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-jslt-nar/1.22.0/org.apache.nifi.processors.jslt.JSLTTransformJSON/index.html
https://github.com/schibsted/jslt

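EvaluateJsonPath

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Destination	I only use the flowfile-attribute option. For modifying the flowfile content (flowfile-content), the JSLTTransformJSON processor above is probably more convenient.	flowfile-attribute
binlog.record.timestamp	Custom property. Extracted from the flowfile content.	$.timestamp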
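
The effect of this configuration can be pictured with a small Python sketch: read the top-level timestamp field from the JSON content and store it as a string attribute. The function extract_attributes and the plain dict used for attributes are stand-ins invented for this illustration. They are not NiFi APIs.

```python
import json


def extract_attributes(content: str, attributes: dict) -> dict:
    """Return a copy of the attributes with binlog.record.timestamp added."""
    event = json.loads(content)
    updated = dict(attributes)
    # Flowfile attributes are strings, so the number is stored as text.
    updated["binlog.record.timestamp"] = str(event["timestamp"])
    return updated


content = '{"type": "insert", "timestamp": 1689237765000}'
print(extract_attributes(content, {"cdc.sequence.id": "1"}))
```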

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.EvaluateJsonPath/index.html

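UpdateAttribute

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Delete Attributes Expression	Deletes attributes. A regular expression can be used.	binlog.record.*
hive.database	Custom property. Specifies the Hive database.	(set as needed)
hive.table	Custom property. Specifies the Hive table.	(set as needed)
hive.partition.dt	Custom property. Specifies the date partition value.	${binlog.record.timestamp:format("yyyy-MM-dd")}

To add attributes, open the processor's configuration dialog and click the plus icon in the upper right corner. NiFi Expression Language can be used in the values.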
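
As a rough illustration of what the hive.partition.dt expression computes, the sketch below converts an epoch-millisecond timestamp into a yyyy-MM-dd string in Python. The UTC+8 offset is an assumption made for this example. As far as I know, format() uses the time zone of the NiFi host unless one is passed explicitly, so a record close to midnight can land in a different partition depending on that setting.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: the NiFi host runs in UTC+8.
ASSUMED_TZ = timezone(timedelta(hours=8))


def partition_dt(timestamp_ms: int, tz: timezone = ASSUMED_TZ) -> str:
    """Convert a binlog timestamp in milliseconds to a yyyy-MM-dd partition value."""
    seconds = timestamp_ms // 1000
    return datetime.fromtimestamp(seconds, tz=tz).strftime("%Y-%m-%d")


# The timestamp from the sample binlog record shown earlier in this article.
print(partition_dt(1689237765000))
```
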
References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.22.0/org.apache.nifi.processors.attributes.UpdateAttribute/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.22.0/org.apache.nifi.processors.attributes.UpdateAttribute/additionalDetails.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

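MergeContent

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Merge Strategy	I use Bin-Packing Algorithm. With this option flowfiles are grouped into bins by a bin-packing algorithm: flowfiles with the same attribute value go into the same bin and are merged into one flowfile.	Bin-Packing Algorithm
Correlation Attribute Name	The attribute that controls grouping. Flowfiles with the same value for this attribute go into the same bin.	hive.partition.dt
Minimum Number of Entries	The minimum number of flowfiles in a bin.	1
Maximum Number of Entries	The maximum number of flowfiles in a bin. Reaching it triggers a merge.	1000000
Minimum Group Size	The minimum bin size. The HDFS block size is 128 MB, and the downstream ConvertRecord makes the flowfile smaller when it converts the format, so this value is set on the high side to bring the final flowfile close to the HDFS block size.	128 MB
Maximum Group Size	The maximum bin size. Reaching it triggers a merge. It is set on the high side for the same reason as Minimum Group Size.	138 MB
Max Bin Age	The maximum lifetime of a bin. A merge is triggered when it expires.	10 minutes
Maximum number of Bins	The maximum number of bins held in memory at the same time. When this number is exceeded, the oldest bin is merged.	5

How merging is triggered (conclusions from observing the processor in practice):
If any one of Maximum Number of Entries, Maximum Group Size, Max Bin Age, or Maximum number of Bins reaches its threshold, a bin is merged.
If none of these conditions is met, a merge is triggered once the number of flowfiles exceeds Minimum Number of Entries and the bin size exceeds Minimum Group Size.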
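
The observed trigger rules can be written down as a small decision function. This is a sketch of the behavior described above, not MergeContent's actual implementation. The names MergeConfig and merge_reason are invented, the defaults are the values from the table, and the Maximum number of Bins rule is left out because it depends on all bins at once. Treating the minimum thresholds as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class MergeConfig:
    min_entries: int = 1
    max_entries: int = 1_000_000
    min_size: int = 128 * MB
    max_size: int = 138 * MB
    max_age_seconds: int = 600  # Max Bin Age = 10 minutes


def merge_reason(entries: int, size_bytes: int, age_seconds: float,
                 cfg: MergeConfig = MergeConfig()) -> Optional[str]:
    """Return why a single bin would be merged, or None if it keeps waiting."""
    if entries >= cfg.max_entries:
        return "max entries"
    if size_bytes >= cfg.max_size:
        return "max group size"
    if age_seconds >= cfg.max_age_seconds:
        return "max bin age"
    # No hard limit reached: both minimums must be satisfied.
    if entries >= cfg.min_entries and size_bytes >= cfg.min_size:
        return "minimum thresholds met"
    return None


print(merge_reason(entries=5000, size_bytes=130 * MB, age_seconds=42))
print(merge_reason(entries=5000, size_bytes=10 * MB, age_seconds=42))
```

With the values from this case, a busy partition is merged once it has accumulated about 128 MB, while a quiet one is flushed by Max Bin Age after 10 minutes.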

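The official documentation recommends keeping Maximum Number of Entries at or below 1000. If you need more than that, add a second MergeContent. For example, if one MergeContent merges 1000 flowfiles, two in series can merge 1000000 flowfiles.
For convenience I use only one MergeContent here, and I have not observed any problems so far.
The configuration shown above mainly relies on file size to control merging. In practice you can use two MergeContent processors, configured sensibly within the official recommendation, to achieve the same effect.

References: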
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.MergeContent/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.MergeContent/additionalDetails.html

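ConvertRecord

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Record Reader	The record reader, which converts the input data format into Record objects.	JsonTreeReader
Record Writer	The record writer, which converts Record objects into the output data format.	AvroRecordSetWriter

References: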
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.ConvertRecord/index.html

PutHDFS

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Hadoop Configuration Resources	The Hadoop configuration files. core-site.xml and hdfs-site.xml are required. Example: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml	/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Kerberos User Service	I use this property for Kerberos authentication, with a KerberosKeytabUserService that is covered in the Controller Services section below. Other properties also support Kerberos authentication, and you can explore them yourself.	KerberosKeytabUserService-dc
Additional Classpath Resources	Adds extra classpath resources. Both directories and files are supported. Example: /opt/cloudera/parcels/GPLEXTRAS/jars	/opt/cloudera/parcels/GPLEXTRAS/jars
Directory	The HDFS directory the flowfile is written to. Missing directories are created automatically.	/user/hive/warehouse/${hive.database}.db/${hive.table}/dt=${hive.partition.dt}

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.22.0/org.apache.nifi.processors.hadoop.PutHDFS/index.html

DetectDuplicate

Properties Tab

Property pitfalls:

Property	Pitfall notes	Value used in this case
Cache Entry Identifier	A flowfile attribute or an attribute expression. Note that normally all dataflows share the same cache service, so the Cache Entry Identifier of each dataflow must be unique. Only then is the data cached by each dataflow unique. Otherwise the dataflows interfere with each other and partitions are not added automatically for some Hive tables.	${hive.database}:${hive.table}:${hive.partition.dt}
Distributed Cache Service	The distributed cache service. RedisDistributedMapCacheClientService is used here.	RedisDistributedMapCacheClientService
Cache The Entry Identifier	The default is true. true means this processor caches the entry identifier. false means it does not, and another processor must take care of caching it.	true
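
The idea behind this step is "first time seen means new partition". A minimal in-memory sketch of that logic follows. It stands in for the Redis-backed cache service and is not how NiFi implements DetectDuplicate. TtlCache and cache_key are hypothetical names, and the 25-hour TTL is the value configured on the cache service later in this article.

```python
import time


class TtlCache:
    """Remember keys for a limited time, like a cache entry with a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def is_duplicate(self, key: str) -> bool:
        """Return True if the key is still cached; otherwise cache it and return False."""
        now = self._clock()
        expires = self._expires_at.get(key)
        if expires is not None and expires > now:
            return True
        self._expires_at[key] = now + self._ttl
        return False


def cache_key(database: str, table: str, dt: str) -> str:
    # Include database and table so that dataflows sharing one cache do not collide.
    return f"{database}:{table}:{dt}"


cache = TtlCache(ttl_seconds=25 * 3600)
key = cache_key("test", "test_mysql_binlog", "2023-07-13")
print(cache.is_duplicate(key))  # first sighting: route to the add-partition step
print(cache.is_duplicate(key))  # seen before: nothing to do
```

If the key were only the date, two tables receiving data for the same day would share one cache entry, and only the first of them would get its partition added.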

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.DetectDuplicate/index.html

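ExecuteStreamCommand

Properties Tab

Key properties:

Property	Pitfall notes	Value used in this case
Working Directory	The working directory for the command. The default is the NiFi root directory. I have not found a specific use for it so far. Setting it to a custom directory makes things easier to manage.	/data/nifi/processor_space/execue_stream_command
Command Path	The command to execute.	/opt/pyenv/venv/edw/bin/python
Command Arguments Strategy	The command argument strategy. The default is Command Arguments Property. See the reference link below for details. I use the default together with the Command Arguments property.	Command Arguments Property
Command Arguments	The command arguments, separated by a delimiter. The delimiter is configured in the Argument Delimiter property. Attribute expressions and parameter expressions can be used.	-u;/data/nifi/scripts/hive_add_partition.py;--database;${hive.database};--table;${hive.table};--partition;dt='${hive.partition.dt}';--hiveserver2-host;#{hiveserver2-host};--hiveserver2-port;#{hiveserver2-port};--principal;#{principal};--keytab;#{keytab}
Argument Delimiter	The delimiter for the command arguments. The default is ';'.	;
Ignore STDIN	Whether to ignore standard input, which in practice controls whether the content of the incoming flowfile is passed to the command. The default is false: standard input is not ignored and the content is passed. When set to true, it is ignored and the content is not passed. I use true because this case does not need the content of the incoming flowfile.	true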
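
To make the delimiter handling concrete, here is a small sketch of how a rendered Command Arguments value turns into an argument list and how argparse then reads it. It assumes the expressions have already been replaced with concrete values. split_arguments and build_parser are helper names invented for this example, and the parser only mirrors a few options of the script shown later.

```python
import argparse


def split_arguments(command_arguments: str, delimiter: str = ";") -> list:
    """Split an already-rendered Command Arguments string into an argv list."""
    return command_arguments.split(delimiter)


def build_parser() -> argparse.ArgumentParser:
    """A reduced stand-in for the options of hive_add_partition.py."""
    parser = argparse.ArgumentParser(prog="hive_add_partition")
    parser.add_argument("--database")
    parser.add_argument("--table")
    parser.add_argument("--partition")
    parser.add_argument("--hiveserver2-port", default=10000, type=int)
    return parser


# Hypothetical rendered value: expressions are assumed to be substituted already.
rendered = (
    "-u;/data/nifi/scripts/hive_add_partition.py;"
    "--database;test;--table;test_mysql_binlog;--partition;dt='2023-07-19'"
)
argv = split_arguments(rendered)
# argv[0] is the interpreter flag and argv[1] the script path.
args = build_parser().parse_args(argv[2:])
print(args.database, args.table, args.partition)
```

As far as I understand, the arguments are handed to the process directly rather than through a shell, so the single quotes around the date reach the script unchanged. That is why the shell form of the same command has to escape them.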

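This runs an external custom Python script that wraps the logic for adding a partition to a Hive table.
With the configuration above, it is equivalent to running the following command: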

# When run inside the processor, the attributes are replaced with their actual values
# (the --hiveserver2-host, --hiveserver2-port, --principal and --keytab arguments are omitted here)
/opt/pyenv/venv/edw/bin/python \
  -u /data/nifi/scripts/hive_add_partition.py \
  --database ${hive.database} \
  --table ${hive.table} \
  --partition dt=\'${hive.partition.dt}\'

# Concrete example
/opt/pyenv/venv/edw/bin/python \
  -u /data/nifi/scripts/hive_add_partition.py \
  --database test \
  --table test_mysql_binlog \
  --partition dt=\'2023-07-19\'

The Python script is as follows:

import argparse

from krbcontext.context import krbContext
from pyhive import hive


def main(args):
    # Obtain a Kerberos ticket from the keytab, then connect to HiveServer2.
    with krbContext(using_keytab=True, principal=args.principal, keytab_file=args.keytab):
        hive_conn = hive.connect(
            host=args.hiveserver2_host,
            port=args.hiveserver2_port,
            auth='KERBEROS',
            kerberos_service_name='hive'
        )
        cursor = hive_conn.cursor()
        try:
            # Add the table partition. IF NOT EXISTS can be added for fault tolerance.
            hql = (
                f"ALTER TABLE {args.database}.{args.table} ADD PARTITION ({args.partition})"
            )
            cursor.execute(hql)
        finally:
            # Close the cursor and the connection even if the statement fails.
            cursor.close()
            hive_conn.close()


if __name__ == '__main__':
    # Command-line arguments
    parser = argparse.ArgumentParser(prog='hive_add_partition', description='add partition to hive table.')
    parser.add_argument('--database', help='database of table need to add partition')
    parser.add_argument('--table', help='table need to add partition')
    parser.add_argument('--partition', help='partition string')
    parser.add_argument('--hiveserver2-host', default='', help='hiveserver2 host')
    parser.add_argument('--hiveserver2-port', default=10000, type=int, help='hiveserver2 port')
    parser.add_argument('--principal', default='dc', help='kerberos principal')
    parser.add_argument('--keytab', default='/data/nifi/kerberos/dc.keytab', help='kerberos keytab')
    args = parser.parse_args()

    main(args)
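
The comment in the script mentions IF NOT EXISTS. One way this could look is sketched below: a helper that builds the statement with IF NOT EXISTS and rejects unexpected characters, since the values are interpolated straight into the HQL text. build_add_partition_hql and the two regular expressions are assumptions of this sketch and are not part of the original script.

```python
import re

# Plain identifiers only: letters, digits and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# One or more key='value' pairs separated by commas, e.g. dt='2023-07-19'.
_PARTITION_SPEC = re.compile(r"\w+='[^']*'(\s*,\s*\w+='[^']*')*")


def build_add_partition_hql(database: str, table: str, partition: str) -> str:
    """Build an idempotent ADD PARTITION statement after basic validation."""
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if not _PARTITION_SPEC.fullmatch(partition):
        raise ValueError(f"unexpected partition spec: {partition!r}")
    return f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS PARTITION ({partition})"


print(build_add_partition_hql("test", "test_mysql_binlog", "dt='2023-07-19'"))
```

With IF NOT EXISTS, running the statement again for a partition that already exists is harmless, which helps when a cache entry has expired or the cache was cleared.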

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.22.0/org.apache.nifi.processors.standard.ExecuteStreamCommand/index.html

Controller Services Configuration

JsonTreeReader

The default configuration is used as is.
References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.22.0/org.apache.nifi.json.JsonTreeReader/index.html

AvroRecordSetWriter

The default configuration is used as is.
References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.22.0/org.apache.nifi.avro.AvroRecordSetWriter/index.html

RedisConnectionPoolService

Properties Tab

Configuration used in this case:

Property	Value
Redis Mode	Standalone
Connection String	localhost:6379
Password	(the value of requirepass in redis.conf)

References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-redis-nar/1.22.0/org.apache.nifi.redis.service.RedisConnectionPoolService/index.html

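RedisDistributedMapCacheClientService

Properties Tab

Configuration used in this case:

Property	Value
Redis Connection Pool	Select the Redis connection pool configured above
TTL	25 hours. In this case the cached values are the date partitions of Hive tables. Binlog timestamps only move forward, so caching each date partition for 25 hours handles late-arriving data and avoids adding a partition that has already been added.

References: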
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-redis-nar/1.22.0/org.apache.nifi.redis.service.RedisDistributedMapCacheClientService/index.html

KerberosKeytabUserService

Properties Tab

Configure the Kerberos Principal and the corresponding Keytab.
References:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kerberos-user-service-nar/1.22.0/org.apache.nifi.kerberos.KerberosKeytabUserService/index.html

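Another Way to Maintain Hive Metadata

Main idea:

Create the Hive table. With an external Hive table, the table can also be created after the data has been written.
Use PutHDFS to write the data into the table's storage directory on HDFS. If the Hive table is partitioned, the directory configured in the PutHDFS Directory property must go down to the finest-grained partition.
If the Hive table is partitioned, use an alter … add partition statement to register the HDFS directory in the Hive table metadata so that Hive can read the data. If the table is not partitioned, the data can be read directly.

This approach moves the Hive metadata update into a task downstream of NiFi, for example by running the Hive add-partition statement before reading the data that NiFi wrote.
Example:
Create the Hive table: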

CREATE TABLE test.test_mysql_binlog (
  `binlog_gtidset` STRING,
  `cdc_sequence_id` BIGINT,
  `database` STRING,
  `table_name` STRING,
  `type` STRING,  
  `timestamp` BIGINT,
  `columns` STRING
) 
PARTITIONED BY (`dt` STRING) 
STORED AS AVRO;

PutHDFS Directory configuration:

/user/hive/warehouse/${hive.database}.db/${hive.table}/dt=${hive.partition.dt}

This writes each flowfile dynamically into the matching partition. Assuming hive.partition.dt is 2023-07-13, the PutHDFS Directory resolves to:

/user/hive/warehouse/test.db/test_mysql_binlog/dt=2023-07-13

If the Hive table has two partition columns, dt and country, the PutHDFS Directory is configured as follows:

/user/hive/warehouse/${hive.database}.db/${hive.table}/dt=${hive.partition.dt}/country=${hive.partition.country}

Assuming hive.partition.dt is 2023-07-13 and hive.partition.country is china, the PutHDFS Directory resolves to:

/user/hive/warehouse/test.db/test_mysql_binlog/dt=2023-07-13/country=china

This matches the Hive partition directory format.
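
The directory layout above follows a simple rule, sketched here in Python for illustration. partition_directory is a hypothetical helper, and the warehouse root is the path used in this article. Partitions are passed as an ordered list because their order has to match the PARTITIONED BY order of the table.

```python
def partition_directory(database, table, partitions, warehouse="/user/hive/warehouse"):
    """Build the HDFS directory for one Hive partition.

    `partitions` is an ordered list of (column, value) pairs.
    """
    parts = "/".join(f"{column}={value}" for column, value in partitions)
    return f"{warehouse}/{database}.db/{table}/{parts}"


print(partition_directory("test", "test_mysql_binlog", [("dt", "2023-07-13")]))
print(partition_directory(
    "test", "test_mysql_binlog", [("dt", "2023-07-13"), ("country", "china")]
))
```
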
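Once the data has been written to the Hive partition directory on HDFS, the Hive table metadata does not yet contain this partition. It can be added with the following statements: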

ALTER TABLE test.test_mysql_binlog
ADD partition (dt='2023-07-13');

-- If the partition columns are dt and country, the spec must go down to the country level
ALTER TABLE test.test_mysql_binlog
ADD partition (dt='2023-07-13', country='china');

-- For an external table, specify LOCATION
ALTER TABLE test.test_mysql_binlog
ADD partition (dt='2023-07-13')
LOCATION 'hdfs://nameservice1/user/hive/warehouse/test.db/test_mysql_binlog/dt=2023-07-13';