Consuming and Producing Kafka Data with Flink
My graduation project recently called for a small feature: analyzing logs, transforming and merging them, and writing the result back to Kafka. This post records the approach and the code that implements it.

---
First, let's nail down the business flow. We need to:
- consume two data streams from Kafka, where the records are JSON strings;
- process and analyze the data (counting, filtering, transforming) and normalize both streams into one common format;
- union the two streams into one;
- sink the merged stream back to Kafka with a custom key and value.

OK, with the requirements clear, let's get to the code; adapt it to your own business as needed.

---
The pom.xml:
```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.huyue.flink</groupId>
  <artifactId>FlinkConsumerAOI</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>FlinkConsumerAOI</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.56</version>
    </dependency>
    <dependency>
      <groupId>org.redisson</groupId>
      <artifactId>redisson</artifactId>
      <version>3.11.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>4.0.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
      <version>4.0.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas -->
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml-schemas</artifactId>
      <version>4.0.1</version>
    </dependency>
    <dependency>
      <groupId>c3p0</groupId>
      <artifactId>c3p0</artifactId>
      <version>0.9.0.4</version>
    </dependency>
    <dependency>
      <groupId>com.zaxxer</groupId>
      <artifactId>HikariCP</artifactId>
      <version>3.1.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.sourceforge.javacsv/javacsv -->
    <dependency>
      <groupId>net.sourceforge.javacsv</groupId>
      <artifactId>javacsv</artifactId>
      <version>2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.2.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_2.12</artifactId>
      <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java_2.12</artifactId>
      <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-java</artifactId>
      <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka_2.12</artifactId>
      <version>1.11.1</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.8.0-beta0</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.8.0-beta0</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
```
To make Flink's processing logs easy to see, we bring in the log4j jars above and put a `log4j.properties` file under the project's src directory:

```properties
# log4j.properties
# Root logger option
log4j.rootLogger=info, stdout
# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
```

---
Now the main method. It covers the Kafka configuration, defining the two data streams, invoking the analysis operators, and finally unioning the streams and sinking them to Kafka:
```java
/**
 * @Author: Hu.Yue
 * @Title: main
 * @Description: consume two topics, analyze, union, and sink back to Kafka
 */
public static void main(String[] args) throws Exception {
    // source topic for stream 1
    String topic = "inTopic1";
    // source topic for stream 2
    String topic2 = "inTopic2";
    // merged output topic
    String outTopic = "outTopic";

    // create the execution environment
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // checkpoint every 10 seconds
    environment.enableCheckpointing(10000);
    environment.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

    Properties props = new Properties();
    // Kafka cluster addresses
    props.setProperty("bootstrap.servers", "10.144.3.155:9092,10.144.5.233:9092,10.144.4.54:9092");
    // authentication (skip these three entries if your cluster has none);
    // the login module must match sasl.mechanism, so ScramLoginModule pairs with SCRAM-SHA-256
    props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required username='xxx' password='xxx';");
    props.put("security.protocol", "SASL_PLAINTEXT");
    props.put("sasl.mechanism", "SCRAM-SHA-256");
    // consumer group
    props.setProperty("group.id", "java_group1");
    props.setProperty("enable.auto.commit", "true");
    props.setProperty("auto.offset.reset", "earliest");
    // consumer deserialization
    props.setProperty("key.deserializer", StringDeserializer.class.getName());
    props.setProperty("value.deserializer", StringDeserializer.class.getName());
    // producer serialization
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // producer tuning
    props.put("acks", "all");
    props.put("retries", 3);
    props.put("batch.size", 65536);
    props.put("linger.ms", 1);
    props.put("buffer.memory", 33554432);
    props.put("max.request.size", 10485760);

    // two String-typed consumers
    FlinkKafkaConsumer<String> consumerBoard = new FlinkKafkaConsumer<String>(topic, new SimpleStringSchema(), props);
    FlinkKafkaConsumer<String> consumerFovComp = new FlinkKafkaConsumer<String>(topic2, new SimpleStringSchema(), props);
    // start consuming from the earliest offset
    consumerBoard.setStartFromEarliest();
    consumerFovComp.setStartFromEarliest();

    // define the two streams; the conversion methods are introduced below
    DataStream<String> dataStream_board = ConversionBoard(environment, consumerBoard);
    DataStream<String> dataStream_fovComp = ConversionFovComp(environment, consumerFovComp);

    // union the streams, then use a RichSinkFunction to customize the produced key and value
    DataStream<String> unionStream = dataStream_board.union(dataStream_fovComp);
    unionStream.addSink(new RichSinkFunction<String>() {
        private static final long serialVersionUID = 1L;
        KafkaProducer<String, String> producer = null;

        @Override
        public void open(Configuration parameters) {
            producer = new KafkaProducer<String, String>(props);
        }

        @Override
        public void invoke(String value) throws Exception {
            if (null != value) {
                // spread records across partitions with a random key
                String key = String.format("%d", (int) (Math.random() * 6));
                ProducerRecord<String, String> producerRecord =
                        new ProducerRecord<String, String>(outTopic, key, value);
                producer.send(producerRecord);
                producer.flush();
            }
        }

        @Override
        public void close() {
            if (producer != null) {
                producer.close();
            }
        }
    });

    // print the merged stream to the console
    unionStream.print();
    // submit the job
    environment.execute("union DataStream");
}
```
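A side note on the sink: wrapping a raw KafkaProducer in a RichSinkFunction works, but those writes sit outside Flink's checkpointing. If you want the connector to participate in checkpoints, the bundled FlinkKafkaProducer with a KafkaSerializationSchema can produce the same custom key/value pairs. A minimal sketch, reusing `outTopic`, `props`, and `unionStream` from main and assuming the usual imports:

```java
// Sketch: the same custom key/value sink via the bundled connector (Flink 1.11).
FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
        outTopic,
        new KafkaSerializationSchema<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                // same random key as the RichSinkFunction above
                String key = String.format("%d", (int) (Math.random() * 6));
                return new ProducerRecord<>(outTopic,
                        key.getBytes(StandardCharsets.UTF_8),
                        element.getBytes(StandardCharsets.UTF_8));
            }
        },
        props,
        FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
unionStream.addSink(kafkaSink);
```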

---
The two streams have different formats, so the operators need to transform them, by whatever logical rules fit the data, into the same shape before the union:
```java
/**
 * @Author: Hu.Yue
 * @Title: ConversionBoard
 * @Description: board-pass
 * @param environment the stream environment
 * @param consumer the Kafka consumer for this topic
 * @return DataStream<String>
 */
public static DataStream<String> ConversionBoard(StreamExecutionEnvironment environment,
        FlinkKafkaConsumer<String> consumer) throws Exception {
    DataStream<String> dataStream = environment.addSource(consumer)
        // wrap each record so the stream carries a groupable key and a count field
        .flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
            private static final long serialVersionUID = 1L;

            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                ObjectMapper oMapper = new ObjectMapper();
                JsonNode node = oMapper.readTree(value);
                // concatenate four fields as the key, and append a count of 1 to each record
                out.collect(new Tuple3<String, String, Integer>(
                        node.get("xxx").asText() + node.get("yyy").asText()
                                + node.get("zzz").asText() + node.get("bbb").asText(),
                        value, 1));
            }
        })
        // group by the concatenated key
        .keyBy(0)
        // a session window waits for the whole group to arrive; 3 seconds is a test value,
        // production can use a longer gap to wait for more records
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
        // aggregate: count records sharing the same key and merge them into one output record
        .reduce(new ReduceFunction<Tuple3<String, String, Integer>>() {
            private static final long serialVersionUID = 1L;

            public Tuple3<String, String, Integer> reduce(Tuple3<String, String, Integer> value1,
                    Tuple3<String, String, Integer> value2) throws Exception {
                return new Tuple3<String, String, Integer>(value2.f0, value2.f1, value2.f2 + value1.f2);
            }
        })
        // drop records that do not match our expectations
        .filter(new FilterFunction<Tuple3<String, String, Integer>>() {
            private static final long serialVersionUID = 1L;

            public boolean filter(Tuple3<String, String, Integer> value) throws Exception {
                ObjectMapper oMapper = new ObjectMapper();
                JsonNode node = oMapper.readTree(value.f1);
                String bBoardsn = node.get("bBoardsn").asText();
                return (node.get("boardSN").asText().length() == 10)
                        && (!node.get("idStation").asText().startsWith("L"))
                        && ("\"P \"".equals(node.get("status").asText().trim())
                                || "\"Y \"".equals(node.get("status").asText().trim()))
                        && (node.get("modifierDate").asText() != null)
                        && (bBoardsn == null || "".equals(bBoardsn)
                                || !bBoardsn.equals(node.get("boardSN").asText()));
            }
        })
        // convert each surviving record into the target structure
        .flatMap(new FlatMapFunction<Tuple3<String, String, Integer>, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void flatMap(Tuple3<String, String, Integer> value, Collector<String> out) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                JsonNode node = mapper.readTree(value.f1);
                String mcbSno = node.get("boardSN").asText().length() > 32
                        ? node.get("boardSN").asText().toUpperCase().substring(0, 32)
                        : node.get("boardSN").asText();
                String fdate = node.get("fdate").asText();
                String fixNo = node.get("idStation").asText().toUpperCase().substring(0, 4);
                String cModel = node.get("cModel").asText();
                String badgeNo = node.get("modifier").asText().length() > 7
                        ? node.get("modifier").asText().substring(0, 7)
                        : node.get("modifier").asText();
                String cycleTime = node.get("cycleTime").asText();
                Integer unionQty = 1;
                String wc = "";
                String pdLine = fixNo.substring(0, 3);
                String pcbSide = node.get("cModel").asText().toUpperCase().trim()
                        .substring(node.get("cModel").asText().length() - 1);
                // pick up the per-key count produced by the reduce step
                unionQty = value.f2;
                switch (pcbSide) {
                    case "1": wc = "05"; break;
                    case "2": wc = "0B"; break;
                    case "3": wc = "0C"; break;
                    case "4": wc = "0D"; break;
                    case "5": wc = "PE"; break;
                    default: break;
                }
                if (wc != null && !"".equals(wc)) {
                    // map into the FIS data structure
                    FisResult fis = new FisResult(
                            mcbSno + "_" + wc + "_" + fdate,
                            mcbSno, wc, 1, pdLine, "", "", badgeNo, unionQty,
                            "57", fdate, fixNo, cycleTime, cModel, "f3");
                    // serialize to a JSON string and emit
                    String result = JSON.toJSONString(fis);
                    out.collect(result);
                }
            }
        });
    // remember to return the stream
    return dataStream;
}
```

---
The analysis method for the second stream is analogous to the one above, only with different parsing logic, so I won't walk through it here for the sake of length. The important parts are annotated; take what you need, and feel free to discuss in the comments.
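For orientation only, a bare skeleton of what that second method could look like; this is my placeholder, not the post's original ConversionFovComp, and the real parsing rules depend entirely on the second topic's schema:

```java
// Skeleton only: apply this stream's own rules, then emit JSON strings in the
// same FisResult format as ConversionBoard so the union type-checks.
public static DataStream<String> ConversionFovComp(StreamExecutionEnvironment environment,
        FlinkKafkaConsumer<String> consumer) throws Exception {
    return environment.addSource(consumer)
        .flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                // parse the JSON, apply this topic's filtering/counting logic here,
                // then collect the normalized JSON string
                out.collect(value);
            }
        });
}
```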