Flink in Practice: Connecting Flink 1.11 to Hive 2.3.6

1. Setup Steps

Download and install Flink, then apply the following configuration:

1. Add the following to sql-client-defaults.yaml:

catalogs:
   - name: myhive
     type: hive
     hive-conf-dir: /usr/local/hive/conf
     hive-version: 2.3.6

2. Modify Hive's hive-site.xml:

<configuration>
<property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
</property>
<property>
        <name>hive.metastore.local</name>
        <value>false</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop2:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
</property>
<property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
</property>
<property>
        <name>hive.metastore.schema.verification</name>
        <value>true</value>
</property>
<property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop2:9083</value>
</property>
</configuration>

3. Start the Hive metastore so that it listens on port 9083:

Foreground: hive --service metastore
Background: nohup hive --service metastore &

4. Start the Flink SQL client to verify the setup: sql-client.sh embedded
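If the catalog is configured correctly, the Hive databases and tables should be visible from the SQL client. A minimal sanity check (the database name testflink is simply the one used later in this article; substitute your own):

SHOW CATALOGS;        -- should list default_catalog and myhive
USE CATALOG myhive;
SHOW DATABASES;
USE testflink;
SHOW TABLES;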

2. Reading Kafka Data with Flink on Windows and Writing It to Hive

The complete code is as follows:

import java.time.Duration

import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.{EnvironmentSettings, SqlDialect}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.api.config.ExecutionCheckpointingOptions
import org.apache.flink.table.catalog.hive.HiveCatalog

// Wrapper object and imports added so the snippet compiles standalone; the object name is illustrative.
object KafkaToHiveDemo {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Enable checkpointing and set the interval between checkpoints in milliseconds (checkpointing is disabled by default)
    env.enableCheckpointing(5000L)
    // Maximum number of checkpoints that may be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    // Checkpointing semantics; EXACTLY_ONCE is the default
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Checkpoint timeout; the default is 10 minutes
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    // Minimum pause between two checkpoints (in milliseconds), so that slow checkpoints (e.g. caused by large state)
    // cannot pile up and keep the job constantly checkpointing, which would hurt overall performance
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)

    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner()
      .inStreamingMode()
      .build()
    val tableEnv = StreamTableEnvironment.create(env, settings)

    tableEnv.getConfig.getConfiguration.set(ExecutionCheckpointingOptions.CHECKPOINTING_MODE, CheckpointingMode.EXACTLY_ONCE)
    tableEnv.getConfig.getConfiguration.set(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL, Duration.ofSeconds(60))
   // tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)

    // Build the Hive catalog
    // catalog name
    val name = "myhive"
    // default database
    val defaultDatabase = "testflink"
    // directory containing hive-site.xml
    val hiveConfDir = "E:\\workspace\\FlinkDemo11\\FlinkSQL\\src\\main\\resources"
    // Hive version
    val version = "2.3.6"
    val hiveCatalog = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
    tableEnv.registerCatalog(name, hiveCatalog)
    tableEnv.useCatalog(name)
    tableEnv.getConfig.setSqlDialect(SqlDialect.DEFAULT)
    tableEnv.executeSql("show tables")
    tableEnv.executeSql(
      """
        |CREATE TABLE input (
        |    userId String,
        |    userName STRING,
        |    r_t TIMESTAMP(3),
        |    WATERMARK FOR r_t AS r_t - INTERVAL '5' SECOND
        |) WITH (
        |    'connector' = 'kafka',  -- using kafka connector
        |    'topic' = 'flinksource',  -- kafka topic
        |    'scan.startup.mode' = 'latest-offset',  -- start reading from the latest offset
        |    'properties.group.id' = 'group1',
        |    'properties.bootstrap.servers' = 'hadoop1:9092,hadoop2:9092,hadoop3:9092',  -- kafka broker address
        |    'format' = 'json'  -- the data format is json
        |)
        |
      """.stripMargin)


    tableEnv.executeSql(
      """
        |CREATE TABLE output_print2(
        | userId STRING,
        | userName STRING,
        | dt String,
        | hr String,
        | mm String
        |) WITH (
        |    'connector' = 'print'
        |)
      """.stripMargin)
    val insertSql1 = "insert into output_print2 SELECT userId, userName, " +
      "DATE_FORMAT(r_t, 'yyyy-MM-dd') as dt," +
      "DATE_FORMAT(r_t, 'HH') as hr," +
      "DATE_FORMAT(r_t, 'mm') as mm " +
      "FROM input"
    tableEnv.executeSql(insertSql1)


    tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)
    tableEnv.executeSql(
      """
        |CREATE TABLE output(
        | userId STRING,
        | userName STRING
        |) PARTITIONED BY (dt STRING, hr STRING) TBLPROPERTIES (
        | 'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
        | 'sink.partition-commit.trigger'='partition-time', 
        | 'sink.partition-commit.delay'='0s', 
        | 'sink.partition-commit.policy.kind'='metastore,success-file', 
        | 'partition.time-extractor.kind'='custom',
        | 'partition.time-extractor.class'='MyPartTimeExtractor'
        |)
      """.stripMargin)

    val insertSql = "insert into output SELECT userId, userName, " +
      "DATE_FORMAT(r_t, 'yyyy-MM-dd') as dt," +
      "DATE_FORMAT(r_t, 'HH') as hr " +
      "FROM input"
    tableEnv.executeSql(insertSql)
  }
}
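For reference, with 'format' = 'json' and the schema above, the job expects messages on the flinksource topic shaped roughly like {"userId": "u001", "userName": "alice", "r_t": "2020-07-01 13:05:00"} (the values here are made up). With the JSON format's default timestamp handling, TIMESTAMP(3) fields are parsed from the yyyy-MM-dd HH:mm:ss[.SSS] form.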

A few points to note:

1. new HiveCatalog(name, defaultDatabase, hiveConfDir, version)

name: the catalog name, i.e. the catalog name configured in sql-client-defaults.yaml.

defaultDatabase: the Hive database to use.

hiveConfDir: the directory containing hive-site.xml.

version: the Hive version.

2. When working with the Kafka table (standard Flink DDL), switch to the default dialect:

tableEnv.getConfig.setSqlDialect(SqlDialect.DEFAULT)

and when creating the Hive table, switch to the Hive dialect:

tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)

Otherwise the DDL will fail with syntax errors.

3. Explanation of the Hive table properties:

  • partition.time-extractor.timestamp-pattern: how to build an hour-level partition timestamp from the partition values
  • sink.partition-commit.trigger: with 'partition-time', the commit time is determined by the extracted partition time together with the watermark
  • sink.partition-commit.delay: the commit delay; with an hour-level delay, a partition is committed once watermark > partition time + 1 hour (the example above uses '0s')
  • sink.partition-commit.policy.kind: the commit policy; 'metastore,success-file' first adds the partition to the metastore and then writes a _SUCCESS file
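Because the table also sets 'partition.time-extractor.kind' = 'custom', the class named in 'partition.time-extractor.class' must be on the job classpath (under exactly the configured name, here MyPartTimeExtractor with no package) and implement Flink's partition-time extractor interface, which in Flink 1.11 is org.apache.flink.table.filesystem.PartitionTimeExtractor. A minimal sketch for the dt/hr layout used above:

import java.time.{LocalDate, LocalDateTime}
import java.util.{List => JList}

import org.apache.flink.table.filesystem.PartitionTimeExtractor

// Illustrative extractor: builds the partition timestamp from the dt and hr partition values,
// e.g. dt=2020-07-01, hr=13 -> 2020-07-01T13:00.
class MyPartTimeExtractor extends PartitionTimeExtractor {
  override def extract(partitionKeys: JList[String], partitionValues: JList[String]): LocalDateTime = {
    val dt = partitionValues.get(partitionKeys.indexOf("dt")) // e.g. "2020-07-01"
    val hr = partitionValues.get(partitionKeys.indexOf("hr")) // e.g. "13"
    LocalDate.parse(dt).atTime(hr.toInt, 0)
  }
}

The LocalDateTime returned here is what the sink compares against the watermark when deciding whether a partition can be committed.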

4. If data is being written but nothing can be queried from Hive, checkpointing may not be enabled; partitions are only committed when a checkpoint completes.
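To confirm on the Hive side whether partitions have actually been committed, you can query the table from the Hive CLI or Beeline (the partition values below are illustrative):

SHOW PARTITIONS output;
SELECT COUNT(*) FROM output WHERE dt = '2020-07-01' AND hr = '13';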

3. Problems Encountered in Practice

1. When you add the hive-exec 2.3.6 dependency to a Maven project, you may run into the following problem:

Could not find artifact org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde

Add the following mirrors to Maven's settings.xml:

<mirror>
    <id>aliyunmaven</id>
    <mirrorOf>*</mirrorOf>
    <name>spring-plugin</name>
    <url>https://maven.aliyun.com/repository/spring-plugin</url>
</mirror>

<mirror>
    <id>repo2</id>
    <name>Mirror from Maven Repo2</name>
    <url>https://repo.spring.io/plugins-release/</url>
    <mirrorOf>central</mirrorOf>
</mirror>

2. Error: java: cannot access org.apache.hadoop.mapred.JobConf (class file for org.apache.hadoop.mapred.JobConf not found)

Add the following dependencies (define ${hadoop.version} in your POM's <properties> so that it matches your cluster's Hadoop version):

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-mapreduce-client-core</artifactId>
	<version>${hadoop.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-common</artifactId>
	<version>${hadoop.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-mapreduce-client-common</artifactId>
	<version>${hadoop.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
	<version>${hadoop.version}</version>
</dependency>

3. Make sure you have set a valid value for hive.metastore.uris

Add the following to hive-site.xml:

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoop2:9083</value>
</property>

4. Exception in thread "main" org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create Hive Metastore client

Start the Hive metastore so that port 9083 is open:

Foreground: hive --service metastore
Background: nohup hive --service metastore &

5. java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration, with the following stack trace:

Exception in thread "main" org.apache.flink.table.client.SqlClientException: Unexpected exception. This is a bug. Please consider filing an issue.
	at org.apache.flink.table.client.SqlClient.main(SqlClient.java:213)
Caused by: org.apache.flink.table.client.gateway.SqlExecutionException: Could not create execution context.
	at org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:870)
	at org.apache.flink.table.client.gateway.local.LocalExecutor.openSession(LocalExecutor.java:227)
	at org.apache.flink.table.client.SqlClient.start(SqlClient.java:108)
	at org.apache.flink.table.client.SqlClient.main(SqlClient.java:201)
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
	at org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory.createCatalog(HiveCatalogFactory.java:84)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.createCatalog(ExecutionContext.java:378)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.lambda$null$5(ExecutionContext.java:626)
	at java.util.HashMap.forEach(HashMap.java:1289)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.lambda$initializeCatalogs$6(ExecutionContext.java:625)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.wrapClassLoader(ExecutionContext.java:264)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.initializeCatalogs(ExecutionContext.java:624)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.initializeTableEnvironment(ExecutionContext.java:523)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:183)
	at org.apache.flink.table.client.gateway.local.ExecutionContext.<init>(ExecutionContext.java:136)
	at org.apache.flink.table.client.gateway.local.ExecutionContext$Builder.build(ExecutionContext.java:859)
	... 3 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 14 more

Solution:

Starting with Flink 1.11, the flink-shaded-hadoop-2-uber jar is no longer officially maintained. You can still use the shaded jar built for earlier versions, but the community recommends exporting the Hadoop classpath instead:

export HADOOP_CLASSPATH=`hadoop classpath`

If the problem occurs when submitting to a Flink cluster, download flink-shaded-hadoop-2-uber-2.7.5-10.0.jar from the Maven repository, put it in Flink's lib directory, and restart the cluster.

6. Table options do not contain an option key 'connector' for discovering a connector.

This error can occur when executing an INSERT statement. Either the Hive-related options are not configured properly in the .yaml file, or the table was created as a plain Flink SQL table rather than as a Hive table; in the latter case, create the table in Hive (using the Hive dialect).

7. Caused by: org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot instantiate user function, with the following stack trace:

Caused by: org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot instantiate user function.
	at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperatorFactory(StreamConfig.java:275)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.createChainedOperator(OperatorChain.java:471)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOutputCollector(OperatorChain.java:393)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.createChainedOperator(OperatorChain.java:459)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.createOutputCollector(OperatorChain.java:393)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:155)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:459)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.StreamCorruptedException: unexpected block data
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1581)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2158)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2158)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2158)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2158)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:576)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:562)
	at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:550)
	at org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:511)
	at org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperatorFactory(StreamConfig.java:260)
	... 10 more

Solution:

Add the following to flink-conf.yaml and restart the cluster:

classloader.resolve-order: parent-first
