Pitfalls when writing Kafka data to Hive with the Flink Table API (Flink 1.14)

I have recently been working with Flink and tried one of the newer versions (1.14.3). The job ran for a long time and files did show up in HDFS, but querying the table in Hive returned nothing. I tried many of the suggested fixes; some of them boil down to refreshing the Hive table manually, for example:

Refresh option 1: add the partition by hand
alter table t_kafkamsg2hivetable add partition(dt='2022-03-04',hr=11);

Refresh option 2: repair (re-sync) the table
msck repair table t_kafkamsg2hivetable;
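
If you would rather issue this workaround from the Flink job instead of the Hive CLI, the same statement can go through the table environment once the HiveCatalog from the full example below is registered. A minimal sketch (the table name and partition values are just the ones from my test):

        // assumes tableEnv already has the HiveCatalog registered and selected (see SinkHiveTest below)
        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);    // HiveQL syntax for the ALTER TABLE
        // register a partition whose files already exist on HDFS but are missing from the metastore
        tableEnv.executeSql("alter table t_kafkamsg2hivetable add partition(dt='2022-03-04',hr=11)");
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT); // back to the default Flink dialect

Either way this is only a band-aid; the real fix is further down.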

After reading a lot of blog posts I finally found the real answer in the official documentation. Code first:

The project was generated from the official Flink walkthrough archetype; see:

Fraud Detection with the DataStream API | Apache Flink

$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-walkthrough-datastream-java \
    -DarchetypeVersion=1.14.3 \
    -DgroupId=frauddetection \
    -DartifactId=frauddetection \
    -Dversion=0.1 \
    -Dpackage=spendreport \
    -DinteractiveMode=false

Contents of the pom file (a bit of everything; pull in only the dependencies you need):

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
		 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>frauddetection</groupId>
	<artifactId>frauddetection</artifactId>
	<version>0.1</version>
	<packaging>jar</packaging>

	<name>Flink Walkthrough DataStream Java</name>
	<url>https://flink.apache.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
		<flink.version>1.14.3</flink.version>
		<target.java.version>1.8</target.java.version>
		<scala.binary.version>2.11</scala.binary.version>
		<maven.compiler.source>${target.java.version}</maven.compiler.source>
		<maven.compiler.target>${target.java.version}</maven.compiler.target>
		<log4j.version>2.17.1</log4j.version>
		<hive.version>2.1.1</hive.version>
		<hadoop.version>2.6.0</hadoop.version>
	</properties>

<!--	<repositories>-->
<!--		<repository>-->
<!--			<id>apache.snapshots</id>-->
<!--			<name>Apache Development Snapshot Repository</name>-->
<!--			<url>https://repository.apache.org/content/repositories/snapshots/</url>-->
<!--			<releases>-->
<!--				<enabled>false</enabled>-->
<!--			</releases>-->
<!--			<snapshots>-->
<!--				<enabled>true</enabled>-->
<!--			</snapshots>-->
<!--		</repository>-->
<!--	</repositories>-->

	<dependencies>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-walkthrough-common_${scala.binary.version}</artifactId>
			<version>${flink.version}</version>
		</dependency>

		<!-- This dependency is provided, because it should not be packaged into the JAR file. -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
			<version>${flink.version}</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-clients_${scala.binary.version}</artifactId>
			<version>${flink.version}</version>
			<scope>provided</scope>
		</dependency>

		<!-- Add connector dependencies here. They must be in the default scope (compile). -->

		<!-- Example:

		<dependency>
		    <groupId>org.apache.flink</groupId>
		    <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
		    <version>${flink.version}</version>
		</dependency>
		-->

		<!-- Add logging framework, to produce console output when running in the IDE. -->
		<!-- These dependencies are excluded from the application JAR by default. -->
		<dependency>
			<groupId>org.apache.logging.log4j</groupId>
			<artifactId>log4j-slf4j-impl</artifactId>
			<version>${log4j.version}</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.logging.log4j</groupId>
			<artifactId>log4j-api</artifactId>
			<version>${log4j.version}</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.logging.log4j</groupId>
			<artifactId>log4j-core</artifactId>
			<version>${log4j.version}</version>
			<scope>runtime</scope>
		</dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>compile</scope>
        </dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-test-utils_2.11</artifactId>
			<version>1.14.3</version>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-runtime</artifactId>
			<version>1.14.3</version>
<!--			<scope>test</scope>-->
		</dependency>


<!--		<dependency>-->
<!--			<groupId>org.apache.flink</groupId>-->
<!--			<artifactId>flink-streaming-java_2.11</artifactId>-->
<!--			<version>1.14.3</version>-->
<!--			<scope>test</scope>-->
<!--			<classifier>tests</classifier>-->
<!--		</dependency>-->
		<dependency>
			<groupId>org.projectlombok</groupId>
			<artifactId>lombok</artifactId>
			<version>1.18.20</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-kafka_2.11</artifactId>
			<version>1.14.3</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-jdbc_2.11</artifactId>
			<version>1.14.3</version>
		</dependency>
		<dependency>
			<groupId>mysql</groupId>
			<artifactId>mysql-connector-java</artifactId>
			<version>8.0.16</version>
		</dependency>
		<dependency>
			<groupId>cn.hutool</groupId>
			<artifactId>hutool-all</artifactId>
			<version>5.5.8</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-api-java-bridge_2.11</artifactId>
			<version>1.14.3</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-planner_2.11</artifactId>
			<version>1.14.3</version>
			<scope>provided</scope>
		</dependency>
<!--		<dependency>-->
<!--			<groupId>org.apache.flink</groupId>-->
<!--			<artifactId>flink-streaming-scala_2.11</artifactId>-->
<!--			<version>1.14.3</version>-->
<!--			<scope>provided</scope>-->
<!--		</dependency>-->

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-common</artifactId>
			<version>1.14.3</version>
			<scope>provided</scope>
		</dependency>
<!--		<dependency>-->
<!--			<groupId>org.apache.flink</groupId>-->
<!--			<artifactId>flink-avro</artifactId>-->
<!--			<version>1.14.3</version>-->
<!--		</dependency>-->

		<!--        hive-->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
			<version>${flink.version}</version>
<!--			<scope>provided</scope>-->
		</dependency>
<!--		Flink JSON format dependency-->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-json</artifactId>
			<version>${flink.version}</version>
		</dependency>
<!--Flink Hive-related dependencies-->
		<dependency>
			<groupId>org.apache.hive</groupId>
			<artifactId>hive-exec</artifactId>
			<version>${hive.version}</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-hadoop-compatibility_2.11</artifactId>
			<version>${flink.version}</version>
		</dependency>

		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
			<version>${hadoop.version}</version>
		</dependency>

		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-client</artifactId>
			<version>${hadoop.version}</version>
		</dependency>




		<!--		<dependency>-->
<!--			<groupId>org.apache.flink</groupId>-->
<!--			<artifactId>flink-java</artifactId>-->
<!--			<version>${flink.version}</version>-->
<!--		</dependency>-->
<!--		<dependency>-->
<!--			<groupId>org.apache.flink</groupId>-->
<!--			<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>-->
<!--			<version>${flink.version}</version>-->
<!--		</dependency>-->

	</dependencies>


	<build>
		<plugins>

			<!-- Java Compiler -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.1</version>
				<configuration>
					<source>${target.java.version}</source>
					<target>${target.java.version}</target>
				</configuration>
			</plugin>

			<!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. -->
			<!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>3.0.0</version>
				<executions>
					<!-- Run shade goal on package phase -->
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<artifactSet>
								<excludes>
									<exclude>org.apache.flink:flink-shaded-force-shading</exclude>
									<exclude>com.google.code.findbugs:jsr305</exclude>
									<exclude>org.slf4j:*</exclude>
									<exclude>org.apache.logging.log4j:*</exclude>
								</excludes>
							</artifactSet>
							<filters>
								<filter>
									<!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
									<artifact>*:*</artifact>
									<excludes>
										<exclude>META-INF/*.SF</exclude>
										<exclude>META-INF/*.DSA</exclude>
										<exclude>META-INF/*.RSA</exclude>
									</excludes>
								</filter>
							</filters>
							<transformers>
								<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<!--									<mainClass>spendreport.FraudDetectionJob</mainClass>-->
									<mainClass>spendreport.hive.SinkHiveTest</mainClass>
<!--									<mainClass>spendreport.kafka.test.KafkaProductMain</mainClass>-->
								</transformer>
							</transformers>
						</configuration>
					</execution>
				</executions>
			</plugin>
		</plugins>



		<pluginManagement>
			<plugins>

				<!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. -->
				<plugin>
					<groupId>org.eclipse.m2e</groupId>
					<artifactId>lifecycle-mapping</artifactId>
					<version>1.0.0</version>
					<configuration>
						<lifecycleMappingMetadata>
							<pluginExecutions>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-shade-plugin</artifactId>
										<versionRange>[3.0.0,)</versionRange>
										<goals>
											<goal>shade</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-compiler-plugin</artifactId>
										<versionRange>[3.1,)</versionRange>
										<goals>
											<goal>testCompile</goal>
											<goal>compile</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
							</pluginExecutions>
						</lifecycleMappingMetadata>
					</configuration>
				</plugin>
			</plugins>
		</pluginManagement>
	</build>

</project>

1. Kafka producer code

package spendreport.kafka.test;

import cn.hutool.json.JSONObject;
import cn.hutool.json.JSONUtil;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Arrays;
import java.util.List;
import java.util.Random;

//Source with parallelism 1 that emits one JSON record per interval
public class MyJsonParalleSource implements SourceFunction<String> {

    //private long count = 1L;
    private volatile boolean isRunning = true; //volatile: cancel() is called from another thread
    List<String> ipList = Arrays.asList("192.168.25.3","192.168.25.4","192.168.25.5","192.168.25.6","192.168.25.7","192.168.25.8");
    /**
     * Main method of the source.
     * It starts the source; in most cases the run method contains a loop so that
     * records are produced continuously.
     *
     * @param ctx source context used to emit records
     * @throws Exception
     */
    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while(isRunning){
            // build a fake "ip event" record
            int i = new Random().nextInt(ipList.size()); // pick any of the 6 IPs (the original nextInt(5) never chose the last one)
            JSONObject jsonObject = JSONUtil.createObj()
                    .set("ip", ipList.get(i))
                    .set("msg", ipList.get(i) + " is calling in, please answer")
                    .set("ts", System.currentTimeMillis());
            String book = jsonObject.toString();
            ctx.collect(book);
            System.out.println("-----------------"+book);
            // emit one record every 5 seconds
            Thread.sleep(5000);
        }
    }
    //called when the source is cancelled
    @Override
    public void cancel() {
        isRunning = false;
    }
}
The producer job that wires the source above into Kafka:

package spendreport.kafka.test;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

/**
 * Writes the demo JSON records produced by MyJsonParalleSource into the test_flink Kafka topic.
 *
 * @author liujiapeng.wb
 * @date 2022/3/2 15:19
 */
public class KafkaProductMain {
    public static void main(String[] args) throws Exception {
        // kafkaproduct-0.1
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

//        DataStreamSource<String> text = env.addSource(new MyNoParalleSource()).setParallelism(1);
        DataStreamSource<String> text = env.addSource(new MyJsonParalleSource()).setParallelism(1);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "xxx:9092");
        //new FlinkKafkaProducer("topn",new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),properties,FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
        FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>("test_flink", new SimpleStringSchema(), properties);
/*
        //write the record's event timestamp into Kafka
        producer.setWriteTimestampToKafka(true);
*/
        text.addSink(producer);
        env.execute();
    }
}
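
Before involving Hive at all, it is worth checking that records are really landing in the topic. A minimal read-back job using the same connector (a sketch; the broker address is the same placeholder as above, the consumer group id is made up, and FlinkKafkaConsumer is the legacy consumer API that still ships with flink-connector-kafka 1.14):

package spendreport.kafka.test;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaConsoleCheck {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "xxx:9092"); // same placeholder broker as the producer
        props.setProperty("group.id", "kafka-console-check"); // made-up consumer group

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("test_flink", new SimpleStringSchema(), props);
        consumer.setStartFromEarliest(); // read the whole topic from the beginning

        env.addSource(consumer).print(); // dump every record to stdout
        env.execute("kafka-console-check");
    }
}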

2. Loading the data into Hive with the Flink Table API and Flink SQL

package spendreport.hive;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

import java.time.Duration;

public class SinkHiveTest {
    public static void main(String[] args) {

        System.setProperty("HADOOP_USER_NAME", "hive");
        EnvironmentSettings settings = EnvironmentSettings.inStreamingMode();

        // Option 1: build a StreamTableEnvironment from a StreamExecutionEnvironment ---------- start ----------
         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        String jarFile = "D:\\ljpPro\\frauddetection-0.1.jar";
//        StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment("192.168.37.103", 8081, jarFile);
//        when creating the StreamTableEnvironment from a StreamExecutionEnvironment, checkpointing must be enabled on it (the streaming Hive/file sink only commits partitions when a checkpoint completes)
        env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
//        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);
        Configuration configuration = tableEnv.getConfig().getConfiguration();
        configuration.setString("table.exec.hive.fallback-mapred-reader", "true");
        //if some partitions of the Kafka topic are idle, the watermark generator will not advance; set 'table.exec.source.idle-timeout' in the table config to avoid this
        configuration.setString("table.exec.source.idle-timeout", "10s");
        // Option 1 ------------------ end ----------------------

        // Option 2: build a plain TableEnvironment ---------- start ----------
//        TableEnvironment tableEnv = TableEnvironment.create(settings);
//        Configuration configuration = tableEnv.getConfig().getConfiguration();
//        configuration.setString("table.exec.hive.fallback-mapred-reader", "true");
//        configuration.set(ExecutionCheckpointingOptions.CHECKPOINTING_MODE, CheckpointingMode.EXACTLY_ONCE);
//        configuration.set(ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL, Duration.ofSeconds(2));
        // Option 2 ------------------ end ----------------------


        // 1. Create the HiveCatalog
        String name = "myhive";
        String defaultDatabase = "mydatabase";
        String hiveConfDir = "/etc/hive/conf/";

        HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
        // 2. Register the HiveCatalog
        tableEnv.registerCatalog(name, hive);
        // 3. Make the HiveCatalog 'myhive' the current catalog for this session
        tableEnv.useCatalog(name);
        tableEnv.useDatabase(defaultDatabase);
        // 4. Switch to the Hive dialect
        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // 5. Create the Hive table the stream will be written to

        tableEnv.executeSql("drop table if exists t_kafkaMsg2hiveTable");

        tableEnv.executeSql("CREATE TABLE IF NOT EXISTS t_kafkaMsg2hiveTable ("

                + "ip STRING,"

                + "msg STRING"

                + ")"

                + " PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES ("

                + " 'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," // hive 分区提取器提取时间戳的格式

                + " 'sink.partition-commit.trigger'='partition-time'," // 分区触发提交的类型可以指定 "process-time" 和 "partition-time" 处理时间和分区时间

                + " 'sink.partition-commit.delay'='5s'," // 提交延迟
//                + " 'table.exec.source.idle-timeout'='10s'," //  如果 topic 中的某些分区闲置,watermark 生成器将不会向前推进。 你可以在表配置中设置 'table.exec.source.idle-timeout' 选项来避免上述问题
                + " 'sink.partition-commit.watermark-time-zone'='Asia/Shanghai', " //-- Assume user configured time zone is 'Asia/Shanghai'
                + " 'sink.partition-commit.policy.kind'='metastore,success-file'" // 提交类型

                + ")");


        // Switch back to the default (Flink SQL) dialect
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);


        // 6. Create the Kafka source table; it is only metadata, nothing is read from Kafka until a query references the table

        tableEnv.executeSql("drop table if exists t_KafkaMsgSourceTable");

        tableEnv.executeSql("CREATE TABLE IF NOT EXISTS t_KafkaMsgSourceTable ("
                + "ip STRING"
                + ",msg STRING"
                + ",ts BIGINT" //13位原始时间戳
//                + ",ts3 AS TO_TIMESTAMP(FROM_UNIXTIME(ts/ 1000, 'yyyy-MM-dd HH:mm:ss'))" //flink的TIMESTAMP(3)格式
                + ",ts3 AS TO_TIMESTAMP_LTZ(ts, 3)" //flink的TIMESTAMP(3)格式
                + ",WATERMARK FOR ts3 AS ts3 - INTERVAL '5' SECOND" //水印最迟5s
                + ")"
                + " WITH ("
                + " 'connector' = 'kafka',"
                + " 'topic' = 'test_flink',"
                + " 'properties.bootstrap.servers' = 'xxx:9092',"
                + " 'properties.group.id' = 'kafkaflinkhivedemo',"
                + " 'scan.startup.mode' = 'earliest-offset'," //earliest-offset
                + " 'format' = 'json',"
                + " 'json.ignore-parse-errors' = 'true'"
                + ")");


        // 7. Pipe the Kafka source table into the Hive table
        tableEnv.executeSql("INSERT INTO t_kafkaMsg2hiveTable "
                + "SELECT ip,msg,DATE_FORMAT(ts3, 'yyyy-MM-dd'), DATE_FORMAT(ts3, 'HH') FROM t_KafkaMsgSourceTable").print();

//        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
//        tableEnv.executeSql("alter table t_kafkamsg2hivetable add partition(dt='2022-03-04',hr=11)");
//        tableEnv.executeSql("SELECT * FROM t_kafkamsg2hivetable WHERE dt='2022-03-04' and hr='11'").print();
//        tableEnv.executeSql("select * from t_KafkaMsgSourceTable").print();

    }
}
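
A quick way to confirm that the partitions really reached the Hive metastore (rather than only existing as files on HDFS) is to ask the catalog object directly; listPartitions is part of Flink's Catalog interface. A minimal sketch, reusing the hive and defaultDatabase variables from the job above (Hive stores the table name lower-cased):

        // import org.apache.flink.table.catalog.ObjectPath;
        try {
            hive.listPartitions(new ObjectPath(defaultDatabase, "t_kafkamsg2hivetable"))
                .forEach(spec -> System.out.println("committed partition: " + spec.getPartitionSpec()));
        } catch (Exception e) {
            e.printStackTrace(); // table does not exist yet or is not partitioned
        }

If this list stays empty while files keep appearing under the table's HDFS location, partition commit is what is broken.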

3. Resource files

The following files have to be on the classpath; copy them from your Hadoop configuration directory into the project's resources:

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml

There also needs to be a hive-site.xml under /etc/hive/conf/ (the hiveConfDir passed to the HiveCatalog), and when you finally run the job from IntelliJ IDEA this also has to be ticked in the run configuration.
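
If copying the four Hadoop XML files into resources feels clunky, HiveCatalog also offers a constructor that additionally takes a Hadoop conf directory and a Hive version, so the job can read the configs from wherever the cluster keeps them. A sketch, assuming the Hive version matches the hive-exec dependency in the pom (2.1.1) and that these paths exist on the machine running the job:

        // point the catalog at the existing config directories instead of bundling the XML files
        HiveCatalog hive = new HiveCatalog(
                "myhive",            // catalog name
                "mydatabase",        // default database
                "/etc/hive/conf/",   // directory containing hive-site.xml
                "/etc/hadoop/conf/", // assumed directory containing core-site.xml, hdfs-site.xml, ...
                "2.1.1");            // Hive version, matching hive-exec in the pom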

In theory that is all it takes. Coming back to the problem from the beginning, where the table could never be queried: my understanding is that the Kafka topic's partitions and Flink's parallel source subtasks did not line up, so some subtasks stayed idle, the watermark never advanced, and Flink kept writing data without ever refreshing the Hive metadata. Adding the following configuration fixed it:

configuration.setString("table.exec.source.idle-timeout", "10s");

The number of seconds is up to you; I found the option in the Flink documentation.
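
For completeness: if event-time alignment of the partitions does not matter to you, another way to sidestep the watermark problem entirely is to commit partitions by processing time instead of partition time. A sketch of the changed TBLPROPERTIES in the Hive DDL above (everything before TBLPROPERTIES stays the same, and the time-extractor/watermark options are no longer needed):

                + " PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES ("
                + " 'sink.partition-commit.trigger'='process-time'," // commit by wall-clock time, no watermark involved
                + " 'sink.partition-commit.delay'='0s',"             // commit once the next checkpoint completes
                + " 'sink.partition-commit.policy.kind'='metastore,success-file'"
                + ")");

The trade-off is that a partition may be committed while late events for it are still arriving, whereas 'partition-time' waits for the watermark to pass the end of the partition.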

Of course, standing on the shoulders of those who came before, I drew on quite a few blog posts, including:

【flink】【kafka】【hive】flink消费kafka数据存到hive - xiaostudy - 博客园

FlinkSQL Kafka to Hive_L, there!-CSDN博客
