Hudi on Flink: Syncing to Hive in Practice
Code structure
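The two helpers below build Flink SQL DDL strings from an ImportConfig. The original config class is not shown in this section; the following is a hypothetical sketch containing just the fields the helpers actually reference:

case class ImportConfig(
  // SQL Server CDC source
  sourceUrl: String,       // hostname of the SQL Server instance
  sourcePort: String,
  sourceName: String,      // username
  sourcePw: String,        // password
  sourceDatabase: String,
  sourceSchema: String,
  sourceTableName: String,
  sourceFields: String,    // column definitions, e.g. "id INT, name STRING, PRIMARY KEY (id) NOT ENFORCED"
  // Hudi sink
  sinkDatabase: String,
  sinkTableName: String,
  sinkFields: String,      // column definitions for the Hudi table
  isPartition: Boolean     // whether to add PARTITIONED BY (`partition`)
)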
def getHudiTableDDL(importConfig: ImportConfig): String = {
  s"""
     | CREATE TABLE ${importConfig.sinkDatabase}.${importConfig.sinkTableName} ( ${importConfig.sinkFields} )
     | ${if (importConfig.isPartition) "PARTITIONED BY (`partition`)" else ""}
     | WITH (
     |   'connector' = 'hudi',
     |   'path' = '${ClusterConstant.HudiMeta.DIR}/${importConfig.sinkDatabase}/${importConfig.sinkTableName}',
     |   'table.type' = 'COPY_ON_WRITE',
     |   'read.streaming.enabled' = 'true',  -- enables streaming read
     |   'read.streaming.check-interval' = '10',  -- interval (seconds) for polling new source commits, default 60
     |   'hive_sync.enable' = 'true',  -- required, enables Hive sync
     |   'hive_sync.table' = '${importConfig.sinkTableName}',  -- required, name of the table to create in Hive
     |   'hive_sync.db' = '${importConfig.sinkDatabase}',  -- required, name of the Hive database to sync into
     |   'hive_sync.mode' = 'hms',  -- required, set hive sync mode to hms (default is jdbc)
     |   'hive_sync.metastore.uris' = '${ClusterConstant.HiveMeta.META_URL}'  -- required, Hive metastore URI
     | )
     |""".stripMargin
}
def getSqlserverTableDDL(importConfig: ImportConfig): String = {
  s"""
     | CREATE TABLE IF NOT EXISTS ${importConfig.sourceDatabase}.${importConfig.sourceTableName} (
     |   ${importConfig.sourceFields}
     | ) WITH (
     |   'connector' = 'sqlserver-cdc',
     |   'hostname' = '${importConfig.sourceUrl}',
     |   'port' = '${importConfig.sourcePort}',
     |   'username' = '${importConfig.sourceName}',
     |   'password' = '${importConfig.sourcePw}',
     |   'database-name' = '${importConfig.sourceDatabase}',
     |   'schema-name' = '${importConfig.sourceSchema}',
     |   'table-name' = '${importConfig.sourceTableName}',
     |   'server-time-zone' = 'Asia/Shanghai'
     | )
     |""".stripMargin
}
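To show how the two DDL strings fit together, here is a minimal, hypothetical driver sketch (the runSync name and the 60s checkpoint interval are assumptions, not from the original): it creates a StreamTableEnvironment, registers both tables, and streams changes from the CDC source into the Hudi sink. Checkpointing must be enabled, since the Flink Hudi writer only commits data on checkpoint completion.

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

// Hypothetical driver wiring the two DDLs into one sync job.
def runSync(importConfig: ImportConfig): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.enableCheckpointing(60 * 1000) // Hudi commits data on checkpoint completion
  val tEnv = StreamTableEnvironment.create(env)

  // The databases must exist in the default catalog before CREATE TABLE db.table works.
  tEnv.executeSql(s"CREATE DATABASE IF NOT EXISTS ${importConfig.sourceDatabase}")
  tEnv.executeSql(s"CREATE DATABASE IF NOT EXISTS ${importConfig.sinkDatabase}")

  tEnv.executeSql(getSqlserverTableDDL(importConfig)) // register the SQL Server CDC source
  tEnv.executeSql(getHudiTableDDL(importConfig))      // register the Hudi sink (with Hive sync)

  // Stream changes from the CDC source into the Hudi table.
  tEnv.executeSql(
    s"""INSERT INTO ${importConfig.sinkDatabase}.${importConfig.sinkTableName}
       |SELECT * FROM ${importConfig.sourceDatabase}.${importConfig.sourceTableName}
       |""".stripMargin)
}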
Issues when running on the cluster
Running on the cluster fails with errors about a missing HiveConf or other Hive classes. This is because the Hudi 0.10.1 Flink bundle does not ship the Hive dependencies. You need to add the Hive dependencies in the Hudi source, enable the shade profile, and rebuild:

mvn install -DskipTests -Drat.skip=true -Pflink-bundle-shade-hive2
<profile>
<id>flink-bundle-shade-hive2</id>
<properties>
<hive.version>2.3.1</hive.version>
<flink.bundle.hive.scope>compile</flink.bundle.hive.scope>
</properties>
<dependencies>
<dependency>
<groupId>${hive.groupid}</groupId>
<artifactId>hive-service-rpc</artifactId>
<version>${hive.version}</version>
<scope>${flink.bundle.hive.scope}</scope>
</dependency>
<dependency>
<groupId>${hive.groupid}</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.version}</version>
<scope>${flink.bundle.hive.scope}</scope>
</dependency>
</dependencies>
</profile>
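After rebuilding, a step not shown above is still needed: put the freshly built hudi-flink-bundle jar on the Flink classpath (typically by replacing the old bundle jar under Flink's lib/ directory) and restart the cluster or session so the shaded Hive classes are picked up.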