Flink: consume Kafka data, aggregate, filter, union the streams, and sink back to Kafka

Consuming from and producing to Kafka with Flink

My graduation project recently required a small feature: analyze log data, transform and merge it, and publish the result back to Kafka. This post summarizes the idea and the code.

  1. First, let's be clear about the business flow. We need to:

    1. Consume data from Kafka as two data streams; the records are JSON strings;
    2. Process and analyze the data (counting, filtering, conversion) and normalize both streams into a single format;
    3. Union the two streams;
    4. Sink the result to Kafka with a custom key and value.

    OK, with the requirements clear, let's move on to the code. Feel free to adapt it to your own business.

  2. pom.xml contents:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>com.huyue.flink</groupId>
      <artifactId>FlinkConsumerAOI</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <packaging>jar</packaging>
    
      <name>FlinkConsumerAOI</name>
      <url>http://maven.apache.org</url>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>
    
      <dependencies>
      		<dependency>
    			<groupId>com.alibaba</groupId>
    			<artifactId>fastjson</artifactId>
    			<version>1.2.56</version>
    		</dependency>
      
    		<dependency>
    			<groupId>org.redisson</groupId>
    			<artifactId>redisson</artifactId>
    			<version>3.11.6</version>
    		</dependency>
    
    		<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
    		<dependency>
    			<groupId>org.apache.poi</groupId>
    			<artifactId>poi</artifactId>
    			<version>4.0.1</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.apache.poi</groupId>
    			<artifactId>poi-ooxml</artifactId>
    			<version>4.0.1</version>
    		</dependency>
    		<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas -->
    		<dependency>
    			<groupId>org.apache.poi</groupId>
    			<artifactId>poi-ooxml-schemas</artifactId>
    			<version>4.0.1</version>
    		</dependency>
    
    		<dependency>
    			<groupId>c3p0</groupId>
    			<artifactId>c3p0</artifactId>
    			<version>0.9.0.4</version>
    		</dependency>
    
    		<dependency>
    			<groupId>com.zaxxer</groupId>
    			<artifactId>HikariCP</artifactId>
    			<version>3.1.0</version>
    		</dependency>
    
    		<!-- https://mvnrepository.com/artifact/net.sourceforge.javacsv/javacsv -->
    		<dependency>
    			<groupId>net.sourceforge.javacsv</groupId>
    			<artifactId>javacsv</artifactId>
    			<version>2.0</version>
    		</dependency>
    		<!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
    		<dependency>
    			<groupId>org.postgresql</groupId>
    			<artifactId>postgresql</artifactId>
    			<version>42.2.5</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.apache.kafka</groupId>
    			<artifactId>kafka-clients</artifactId>
    			<version>2.2.0</version>
    		</dependency>
    		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    		<dependency>
    			<groupId>org.apache.flink</groupId>
    			<artifactId>flink-clients_2.12</artifactId>
    			<version>1.11.1</version>
    		</dependency>
    
    		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    		<dependency>
    			<groupId>org.apache.flink</groupId>
    			<artifactId>flink-streaming-java_2.12</artifactId>
    			<version>1.11.1</version>
    		</dependency>
    
    
    		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    		<dependency>
    			<groupId>org.apache.flink</groupId>
    			<artifactId>flink-java</artifactId>
    			<version>1.11.1</version>
    		</dependency>
    
    		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
    		<dependency>
    			<groupId>org.apache.flink</groupId>
    			<artifactId>flink-connector-kafka_2.12</artifactId>
    			<version>1.11.1</version>
    		</dependency>
    
    
    		<!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
    		<dependency>
    			<groupId>commons-logging</groupId>
    			<artifactId>commons-logging</artifactId>
    			<version>1.1.1</version>
    		</dependency>
    
    		<dependency>
    			<groupId>commons-cli</groupId>
    			<artifactId>commons-cli</artifactId>
    			<version>1.4</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.slf4j</groupId>
    			<artifactId>slf4j-api</artifactId>
    			<version>1.8.0-beta0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.slf4j</groupId>
    			<artifactId>slf4j-log4j12</artifactId>
    			<version>1.8.0-beta0</version>
    		</dependency>
    
    		<dependency>
    			<groupId>log4j</groupId>
    			<artifactId>log4j</artifactId>
    			<version>1.2.17</version>
    		</dependency>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>3.8.1</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>
    
    

    To make it easy to follow Flink's processing logs, we include the log4j dependency and place a log4j.properties file under the project's src directory:

    # log4j.properties
    # Root logger option
    log4j.rootLogger=info, stdout
    
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
    
  3. The main method: Kafka configuration, stream definitions, calling the analysis operators, and unioning the streams before sinking to Kafka;

    /**
    	* @Author: Hu.Yue
    	* @Title: main
    	* @Description: Kafka configuration, stream definitions, analysis operators, union and sink to Kafka
    	* @param args command line arguments
    	* @throws Exception
    	*/
    	public static void main(String[] args) throws Exception {
            //first input stream
    		String topic = "inTopic1";
            //second input stream
    		String topic2 = "inTopic2";
            //output topic for the unioned stream
    		String outTopic = "outTopic";
    		
    		//create the execution environment
    	    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    	    
    	    //enable checkpointing (every 10 seconds)
    	    environment.enableCheckpointing(10000);
    	    environment.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
    	    
    	    Properties props = new Properties();
            //Kafka broker addresses
    		props.setProperty("bootstrap.servers", "10.144.3.155:9092,10.144.5.233:9092,10.144.4.54:9092");
            //authentication (skip these three lines if your cluster does not require it)
    		props.put("sasl.jaas.config", "org.apache.kafka.common.security.scram.ScramLoginModule required username='xxx' password='xxx';");
    		props.put("security.protocol", "SASL_PLAINTEXT");
    		props.put("sasl.mechanism", "PLAIN");
            //consumer group
    		props.setProperty("group.id", "java_group1");
    		props.setProperty("auto.offset.reset", "earliest");
            //consumer deserializers
    		props.setProperty("key.deserializer", StringDeserializer.class.getName());
    		props.setProperty("value.deserializer", StringDeserializer.class.getName());
    		//producer serializers
    		props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    		props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    		//producer tuning
    		props.put("acks", "all");
    		props.put("retries", 3);
    		props.put("batch.size", 65536);
    		props.put("linger.ms", 1);
    		props.put("buffer.memory", 33554432);
    		props.put("max.request.size", 10485760);
    		
            //two consumers of type String, one per input topic
    		FlinkKafkaConsumer<String> consumerBoard = new FlinkKafkaConsumer<String>(topic, new SimpleStringSchema(), props);
    		FlinkKafkaConsumer<String> consumerFovComp = new FlinkKafkaConsumer<String>(topic2, new SimpleStringSchema(), props);
    		
            //start reading from the earliest offset
    		consumerBoard.setStartFromEarliest();
    		consumerFovComp.setStartFromEarliest();
    		
           	//build the two data streams; the methods are described below
    		DataStream<String> dataStream_board = ConversionBoard(environment,consumerBoard);
    		DataStream<String> dataStream_dovComp = ConversionFovComp(environment,consumerFovComp);
    		
            //union the two streams, then use a RichSinkFunction to customize the Kafka key and value
    		DataStream<String> unionStream = dataStream_board.union(dataStream_dovComp);
    		unionStream.addSink(new RichSinkFunction<String>() {
    					private static final long serialVersionUID = 1L;
    					KafkaProducer<String, String> producer = null;
    					
    					@Override
    					public void open(Configuration parameters) {
    						producer = new KafkaProducer<String, String>(props);
    					}
    					
    					@Override
    					public void invoke(String value) throws Exception {
    						if(null != value) {
    							String key = String.format("%d", (int) (Math.random() * 6));
    							ProducerRecord<String,String> producerRecord = new ProducerRecord<String,String>(outTopic, key, value); 
    							
    							producer.send(producerRecord);
    							producer.flush();
    						}
    					}
    					
    					@Override
    					public void close() {
    						//release the Kafka producer when the sink is closed
    						if (producer != null) {
    							producer.close();
    						}
    					}
    				});
            //print the unioned stream to the console
    		unionStream.print();
            //submit the job
    		environment.execute("union DataStream");
    		
    	}
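
    As a side note, the same key/value customization can also be done with the connector's own FlinkKafkaProducer and a KafkaSerializationSchema (available in flink-connector-kafka 1.11), which hooks into Flink's checkpointing instead of managing a raw KafkaProducer by hand. A minimal sketch, assuming the same props and outTopic as above and keeping the illustrative random key:

    		//imports: org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer,
    		//         org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema,
    		//         org.apache.kafka.clients.producer.ProducerRecord, java.nio.charset.StandardCharsets
    		KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
    			private static final long serialVersionUID = 1L;
    
    			@Override
    			public ProducerRecord<byte[], byte[]> serialize(String value, Long timestamp) {
    				//same illustrative key strategy as in the RichSinkFunction above
    				String key = String.format("%d", (int) (Math.random() * 6));
    				return new ProducerRecord<byte[], byte[]>(outTopic,
    						key.getBytes(StandardCharsets.UTF_8),
    						value.getBytes(StandardCharsets.UTF_8));
    			}
    		};
    		//AT_LEAST_ONCE flushes the producer on every Flink checkpoint
    		unionStream.addSink(new FlinkKafkaProducer<String>(
    				outTopic, schema, props, FlinkKafkaProducer.Semantic.AT_LEAST_ONCE));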
    
  4. The two data streams have different formats, so the operators need to transform them into a common format according to the business rules before the union. Here is the method for the first stream:

    /**  
    	* @Author: Hu.Yue
    	* @Title: ConversionBoard  
    	* @Description: board-pass stream: wrap, aggregate per key, filter, and convert the records
    	* @param environment the stream execution environment
    	* @param consumer the Kafka consumer for this stream
    	* @return DataStream<String> 
    	* @throws Exception 
    	*/ 
    	public static DataStream<String> ConversionBoard(StreamExecutionEnvironment environment, FlinkKafkaConsumer<String> consumer) throws Exception{
    		DataStream<String> dataStream = environment.addSource(consumer)
                //wrap each record so the stream carries a groupable key and a count field for aggregation
    				.flatMap(new FlatMapFunction<String, Tuple3<String,String,Integer>>() {
    					
    					private static final long serialVersionUID = 1L;
    
    					public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
    					
    						ObjectMapper oMapper = new ObjectMapper();
    						JsonNode node = oMapper.readTree(value);
    						//concatenate 4 of the fields as the key and attach a count of 1 to every record
    						out.collect(new Tuple3<String, String, Integer>(
    								node.get("xxx").asText()+ node.get("yyy").asText() + node.get("zzz").asText() + node.get("bbb").asText(),
    								value,
    								1));
    						}
    				})
                	//group by the key we just concatenated
    				.keyBy(0)
                	//use a session window to wait until all records of a group have arrived; the 3-second gap is for testing, in production a longer gap may be appropriate
    				.window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
                	//reduce records with the same key into a single record carrying the accumulated count
    				.reduce(new ReduceFunction<Tuple3<String,String,Integer>>() {
    					
    					private static final long serialVersionUID = 1L;
    
    					public Tuple3<String, String, Integer> reduce(Tuple3<String, String, Integer> value1,
    							Tuple3<String, String, Integer> value2) throws Exception {
    						return new Tuple3<String, String, Integer>(value2.f0,value2.f1,value2.f2+value1.f2);
    					}
    				})
                	//filter out records that do not meet our expectations
    				.filter(new FilterFunction<Tuple3<String,String,Integer>>() {
    
    					private static final long serialVersionUID = 1L;
    
    					public boolean filter(Tuple3<String, String, Integer> value) throws Exception {
    						ObjectMapper oMapper = new ObjectMapper();
    						JsonNode node = oMapper.readTree(value.f1);
    						if((node.get("boardSN").asText().length() == 10) &&
    								(!node.get("idStation").asText().startsWith("L")) && 
    								("\"P \"".equals(node.get("status").asText().trim()) || "\"Y \"".equals(node.get("status").asText().trim())) &&
    								(node.get("modifierDate").asText() != null) && 
    								(node.get("bBoardsn").asText() == null || "".equals(node.get("bBoardsn").asText()) || node.get("bBoardsn").asText() != node.get("boardSN").asText())) {
    							return true;
    						}	
    						return false;
    					}
    				})
                	//convert the record into the target structure
    				.flatMap(new FlatMapFunction<Tuple3<String,String,Integer>, String>() {
    
    					private static final long serialVersionUID = 1L;
    
    					@Override
    					public void flatMap(Tuple3<String, String, Integer> value, Collector<String> out) throws Exception {
    						ObjectMapper mapper = new ObjectMapper();
    						JsonNode node = mapper.readTree(value.f1);
    						
    						String mcbSno = node.get("boardSN").asText().length() > 32 ? node.get("boardSN").asText().toUpperCase().substring(0,32) : node.get("boardSN").asText();
    						String fdate = node.get("fdate").asText();
    						String fixNo = node.get("idStation").asText().toUpperCase().substring(0,4);
    						String cModel = node.get("cModel").asText();
    						String badgeNo = node.get("modifier").asText().length() > 7 ? node.get("modifier").asText().substring(0,7) : node.get("modifier").asText();
    						String cycleTime = node.get("cycleTime").asText();
    						Integer unionQty = 1;
    						String wc= "";
    						String pdLine = fixNo.substring(0,3);
    						//take the last character of the trimmed model name
    						String cModelTrimmed = node.get("cModel").asText().toUpperCase().trim();
    						String pcbSide = cModelTrimmed.substring(cModelTrimmed.length() - 1);
                            //take the count aggregated per key in the window above (value.f1 carries the original record)
    						unionQty = value.f2;
    						switch (pcbSide) {
    						case "1":
    							wc = "05";
    							break;
    						case "2":
    							wc = "0B";
    							break;
    						case "3":
    							wc = "0C";
    							break;
    						case "4":
    							wc = "0D";
    							break;
    						case "5":
    							wc = "PE";
    							break;
    						default:
    							break;
    						}
    
    						if(wc != null && !"".equals(wc)) {
                                //convert into the FIS data structure
    							FisResult fis = new FisResult(
    									mcbSno + "_" + wc + "_" +fdate,
    									mcbSno,
    									wc,
    									1,
    									pdLine,
    									"",
    									"",
    									badgeNo,
    									unionQty,
    									"57",
    									fdate,
    									fixNo,
    									cycleTime,
    									cModel,
    									"f3");
    							//serialize to a JSON string and emit it
    							String result = JSON.toJSONString(fis);
    							out.collect(result);
    						}
    					}
    				});
            //remember to return the stream
    		return dataStream;
    	}
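
    The FisResult class used above is not shown in the post; fastjson simply serializes its fields with JSON.toJSONString. Below is a minimal sketch of what it could look like. Only the constructor's argument order is known from the call above, so the field names here are placeholders and should be replaced with the real ones:

    	//placeholder POJO sketch: field names are guesses, the argument order matches the call above
    	public class FisResult {
    		public String key;        //mcbSno + "_" + wc + "_" + fdate
    		public String mcbSno;
    		public String wc;
    		public int qty;           //the literal 1 in the call above
    		public String pdLine;
    		public String attr1;      //empty string in the call above
    		public String attr2;      //empty string in the call above
    		public String badgeNo;
    		public int unionQty;
    		public String code;       //"57" in the call above
    		public String fdate;
    		public String fixNo;
    		public String cycleTime;
    		public String cModel;
    		public String source;     //"f3" in the call above
    
    		public FisResult(String key, String mcbSno, String wc, int qty, String pdLine,
    				String attr1, String attr2, String badgeNo, int unionQty, String code,
    				String fdate, String fixNo, String cycleTime, String cModel, String source) {
    			this.key = key;
    			this.mcbSno = mcbSno;
    			this.wc = wc;
    			this.qty = qty;
    			this.pdLine = pdLine;
    			this.attr1 = attr1;
    			this.attr2 = attr2;
    			this.badgeNo = badgeNo;
    			this.unionQty = unionQty;
    			this.code = code;
    			this.fdate = fdate;
    			this.fixNo = fixNo;
    			this.cycleTime = cycleTime;
    			this.cModel = cModel;
    			this.source = source;
    		}
    	}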
    
  5. The analysis method for the second data stream is almost identical to the one above, only the parsing logic differs, so to keep this post short I won't walk through it. The important parts are already commented; take what you need, and feel free to discuss in the comments. A rough skeleton is sketched below.
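
    For reference, here is a rough skeleton of the second method; the concrete parsing and filter rules for inTopic2 depend on that stream's format and are only hinted at in the comments:

    	public static DataStream<String> ConversionFovComp(StreamExecutionEnvironment environment, FlinkKafkaConsumer<String> consumer) throws Exception{
    		return environment.addSource(consumer)
    				//same pattern as ConversionBoard: wrap, key, window, reduce, filter, convert
    				.flatMap(new FlatMapFunction<String, String>() {
    
    					private static final long serialVersionUID = 1L;
    
    					@Override
    					public void flatMap(String value, Collector<String> out) throws Exception {
    						//parse the inTopic2 JSON here, apply this stream's own rules,
    						//and emit a FisResult JSON string so both streams share one schema
    						out.collect(value);
    					}
    				});
    	}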
