Flink SQL Client: reading Avro data from Kafka (how to fix the Avro ArrayIndexOutOfBoundsException)

Parsing Avro data from Kafka with Flink SQL

I needed to use the Flink SQL Client to parse Avro data for work, but there are very few examples online. I eventually saw someone track down a similar problem with a debugger, took the same approach here, and found the cause.
The Flink version I am using is flink-1.12.2-bin-scala_2.12.

The Avro schema

{
  "type" : "record",
  "name" : "KafkaAvroMessage",
  "namespace" : "xxx",
  "fields" : [ {
    "name" : "transactionId",
    "type" : "string"
  }, {
    "name" : "opType",
    "type" : "string"
  }, {
    "name" : "schemaName",
    "type" : "string"
  }, {
    "name" : "tableName",
    "type" : "string"
  }, {
    "name" : "columnInfos",
    "type" : {
      "type" : "map",
      "values" : {
        "type" : "record",
        "name" : "ColumnInfo",
        "fields" : [ {
          "name" : "oldValue",
          "type" : [ "null", "string" ],
          "default" : null
        }, {
          "name" : "newValue",
          "type" : [ "null", "string" ],
          "default" : null
        }, {
          "name" : "name",
          "type" : "string"
        }, {
          "name" : "isKeyColumn",
          "type" : "boolean"
        }, {
          "name" : "type",
          "type" : "string"
        } ]
      }
    }
  }, {
    "name" : "timeStamp",
    "type" : "string"
  }, {
    "name" : "numberOfColumns",
    "type" : "int"
  }, {
    "name" : "processedTimeStamp",
    "type" : "long"
  }, {
    "name" : "schemaVersion",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "rba",
    "type" : "long"
  }, {
    "name" : "seqNo",
    "type" : "long"
  }, {
    "name" : "domainName",
    "type" : "string",
    "default" : ""
  } ]
}


The error when writing the Flink SQL

With the Avro schema in hand, I wrote the SQL following the Flink SQL documentation (I won't paste my initial, incorrect SQL here; the points to watch out for are listed below). Running it produced the following error:

Exception in thread "main" java.lang.RuntimeException: Failed to fetch next result
	at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:109)
	at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
	at org.apache.flink.table.planner.sinks.SelectTableSinkBase$RowIteratorWrapper.hasNext(SelectTableSinkBase.java:117)
	at org.apache.flink.table.api.internal.TableResultImpl$CloseableRowIteratorWrapper.hasNext(TableResultImpl.java:350)
	at org.apache.flink.table.utils.PrintUtils.printAsTableauForm(PrintUtils.java:149)
	at org.apache.flink.table.api.internal.TableResultImpl.print(TableResultImpl.java:154)
	at com.stubhub.wyane.flink.avro.avroTest4.main(avroTest4.java:50)
Caused by: java.io.IOException: Failed to fetch job execution result
	at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.getAccumulatorResults(CollectResultFetcher.java:169)
	at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:118)
	at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
	... 6 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.getAccumulatorResults(CollectResultFetcher.java:167)
	... 8 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
	at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
	at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$2(MiniClusterJobClient.java:117)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at org.apache.flink.runtime.minicluster.MiniClusterJobClient.getJobExecutionResult(MiniClusterJobClient.java:114)
	at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.getAccumulatorResults(CollectResultFetcher.java:166)
	... 8 more
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:118)
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:80)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:233)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
	at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:669)
	at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
	at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:447)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
	at akka.actor.Actor.aroundReceive(Actor.scala:517)
	at akka.actor.Actor.aroundReceive$(Actor.scala:515)
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
	at akka.actor.ActorCell.invoke(ActorCell.scala:561)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Failed to deserialize Avro record.
	at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:101)
	at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:44)
	at org.apache.flink.api.common.serialization.DeserializationSchema.deserialize(DeserializationSchema.java:82)
	at org.apache.flink.streaming.connectors.kafka.table.DynamicKafkaDeserializationSchema.deserialize(DynamicKafkaDeserializationSchema.java:113)
	at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.partitionConsumerRecordsHandler(KafkaFetcher.java:179)
	at org.apache.flink.streaming.connectors.kafka.internals.KafkaFetcher.runFetchLoop(KafkaFetcher.java:142)
	at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:826)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:66)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:263)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 20
	at org.apache.flink.avro.shaded.org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:460)
	at org.apache.flink.avro.shaded.org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:283)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
	at org.apache.flink.avro.shaded.org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.flink.formats.avro.AvroDeserializationSchema.deserialize(AvroDeserializationSchema.java:139)
	at org.apache.flink.formats.avro.AvroRowDataDeserializationSchema.deserialize(AvroRowDataDeserializationSchema.java:98)
	... 9 more

Process finished with exit code 1


The stack trace plainly reports an index-out-of-bounds error in the Avro decoder, but the real cause is that the SQL was written incorrectly: the table definition did not match the nullability of the Avro schema.

The correct Flink SQL

CREATE TABLE xxxx (
  `transactionId` STRING NOT NULL,
  `opType` STRING NOT NULL,
  `schemaName` STRING NOT NULL,
  `tableName` STRING NOT NULL,
  `columnInfos` MAP<STRING NOT NULL, ROW<oldValue STRING, newValue STRING, name STRING NOT NULL, isKeyColumn BOOLEAN NOT NULL, type STRING NOT NULL> NOT NULL> NOT NULL,
  `timeStamp` STRING NOT NULL,
  `numberOfColumns` INT NOT NULL,
  `processedTimeStamp` BIGINT NOT NULL,
  `schemaVersion` STRING,
  `rba` BIGINT NOT NULL,
  `seqNo` BIGINT NOT NULL,
  `domainName` STRING NOT NULL
) WITH (
  'connector' = 'kafka',
  'topic' = 'xxxx',
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'xxxx',
  'properties.group.id' = 'xxxx',
  'properties.security.protocol' = 'SASL_SSL',
  'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="xxxx" password="xxxx";',
  'properties.sasl.mechanism' = 'PLAIN',
  'format' = 'avro'
)
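
Once the table is created in the SQL Client, a quick sanity check is to run a simple query against it. This is only a minimal sketch using the placeholder table name from the DDL above: if the column nullability matches the Avro schema, rows are printed instead of the "Failed to deserialize Avro record" error.

-- sanity check: select a few top-level columns
SELECT transactionId, opType, schemaName, tableName, numberOfColumns FROM xxxx;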

Points to note

1. If a field's type in the schema is not a union with null, e.g. "type" : "string", the column must be declared NOT NULL, for example the transactionId field.

2. If the type is a union such as "type" : [ "null", "string" ], the NOT NULL can be omitted and the column is nullable, for example the oldValue and newValue fields.

3. A field declared like domainName (a plain "string" type with a default value) also needs NOT NULL.

4. The map type itself is not wrapped in a [ "null", ... ] union, so the MAP column needs NOT NULL.

5. The record (ROW) type inside the map is likewise not wrapped in a [ "null", ... ] union, so the ROW type also needs NOT NULL.

How to decide between MAP, ARRAY, and ROW in Flink SQL

In the schema, the columnInfos field has "type" : "map", so it is clearly a MAP. The map's values, however, are an Avro record type, and a record corresponds to the ROW type in Flink SQL. The remaining type mappings are listed in the table below.

Flink SQL type                                     Avro type   Avro logical type
CHAR / VARCHAR / STRING                            string
BOOLEAN                                            boolean
BINARY / VARBINARY                                 bytes
DECIMAL                                            fixed       decimal
TINYINT                                            int
SMALLINT                                           int
INT                                                int
BIGINT                                             long
FLOAT                                              float
DOUBLE                                             double
DATE                                               int         date
TIME                                               int         time-millis
TIMESTAMP                                          long        timestamp-millis
ARRAY                                              array
MAP (key must be string/char/varchar)              map
MULTISET (element must be string/char/varchar)     map
ROW                                                record

A ROW field is defined much like a case class in Scala, for example:

ROW<myField INT, myOtherField BOOLEAN>
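
For this schema, that means every value in the columnInfos map is a ROW with five fields. As a rough sketch ('someColumn' is a hypothetical map key; substitute a real column name from your data), an individual entry can be pulled out of the map with bracket syntax:

-- 'someColumn' is a hypothetical key, used here only for illustration
SELECT
  transactionId,
  columnInfos['someColumn'] AS someColumnInfo
FROM xxxx;

The fields of the returned ROW should then be reachable with dot notation (for example .newValue), the same way nested row fields are accessed elsewhere in Flink SQL.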

Update: a known bug

A field like bbb in the schema below currently cannot be parsed by this kind of SQL. I have confirmed with the Flink community that it is a bug, expected to be fixed in the 1.15 release; see the screenshot of the reply email below.

{
  "type" : "record",
  "name" : "KafkaAvroMessage",
  "namespace" : "xxx.xxx.",
  "fields" : [ {
    "name" : "aaa",
    "type" : "string"
  }, {
    "name" : "bbb",
    "type" : [ "null", "string" ]
  }, {
    "name" : "ccc",
    "type" : "string",
    "default" : null
  } ]
}

[Screenshot: reply email from the Flink community confirming the bug]
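
For reference, following the mapping rules above, the table definition for that schema would look roughly like the sketch below (the table name and connector options are placeholders); it is the nullable union field bbb that cannot be extracted on my version:

CREATE TABLE bug_example (
  `aaa` STRING NOT NULL,
  `bbb` STRING,            -- the ["null","string"] union field that currently fails to parse
  `ccc` STRING NOT NULL
) WITH (
  'connector' = 'kafka',
  'topic' = 'xxxx',
  'properties.bootstrap.servers' = 'xxxx',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'avro'
);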
