Section 27: Hands-On Requirements Analysis (Data Cleaning / Real-Time ETL)

Previous: Section 26 — The Flink Kafka Connector in Detail


Application Scenarios

Data cleaning (real-time ETL)
Data reporting


Requirements Analysis

Clean and split the log data produced by the algorithm:
1. The algorithm's logs are nested JSON and need to be split and flattened.
2. The country field in each record must be mapped to its region.
3. The different log types are then stored separately.
A sample record and its cleaned output are shown below.
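For example, using the record format produced by the test generator later in this section, one raw record and the corresponding cleaned output look like this (the AREA_US value comes from the country-to-region mapping stored in Redis; field order may vary):

// raw record from the algorithm (Kafka topic allData)
{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}

// cleaned, flattened records (Kafka topic allDataclean) — one per element of the data array
{"dt":"2018-01-01 10:11:11","type":"s1","score":0.3,"level":"A","area":"AREA_US"}
{"dt":"2018-01-01 10:11:11","type":"s2","score":0.2,"level":"B","area":"AREA_US"}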

Architecture

[Architecture diagram: Kafka topic allData → Flink DataClean job (enriched with the country-to-region mapping read from Redis) → Kafka topic allDataclean]


Hands-On Implementation (Java)

1. Create the Maven parent project

The parent project is created.


2. Create the child module DataClean

The child module DataClean is created.
Add the dependencies to the pom file of the parent project FlinkProj.
For easier maintenance, the versions are declared once in the parent pom (inside dependencyManagement); the child modules inherit them, so their pom files do not need to repeat the version numbers.

   <dependencyManagement>

        <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.6.1</version>
            <!-- provided scope: available at compile time but not bundled into the packaged jar (the Flink runtime provides it) -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.6.1</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.6.1</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.6.1</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
            <version>1.6.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
            <version>1.6.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.3</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>

            <!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
            <dependency>
                <groupId>redis.clients</groupId>
                <artifactId>jedis</artifactId>
                <version>2.9.0</version>
            </dependency>

   <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
            <dependency>
                <groupId>com.alibaba</groupId>
                <artifactId>fastjson</artifactId>
                <version>1.2.45</version>
            </dependency>

        </dependencies>

    </dependencyManagement>

  <!-- Packaging configuration -->
    <build>
        <plugins>
            <!-- Java compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <scalaCompatVersion>2.11</scalaCompatVersion>
                    <scalaVersion>2.11.12</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Assembly plugin: builds a fat jar containing all dependencies -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Optionally set the jar's main class here -->
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
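
For reference, a minimal sketch of the parent pom these fragments live in; the coordinates are assumptions (only FlinkProj and DataClean are named in this article), so adjust the groupId and version to your own project. The key points are packaging=pom and listing the child module:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <!-- groupId/version are assumptions; use your own coordinates -->
    <groupId>xuwei.tech</groupId>
    <artifactId>FlinkProj</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- a parent/aggregator project must use pom packaging -->
    <packaging>pom</packaging>

    <modules>
        <module>DataClean</module>
    </modules>

    <!-- the <dependencyManagement> and <build> sections shown above go here -->
</project>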


Next, declare the dependencies in the pom file of the child module DataClean.
No version numbers are needed here; they are inherited from the parent pom:

         <dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>

    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>

    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>

    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>

    </dependency>

    <dependency>
        <groupId>org.apache.bahir</groupId>
        <artifactId>flink-connector-redis_2.11</artifactId>

    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
    </dependency>

    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </dependency>

        <!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
        </dependency>
        
 <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
        </dependency>

    </dependencies>
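
A sketch of the top of the child module's pom, under the same assumed coordinates; the <parent> block is what makes the version inheritance from dependencyManagement work:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <!-- must match the parent coordinates (assumed above) -->
    <parent>
        <groupId>xuwei.tech</groupId>
        <artifactId>FlinkProj</artifactId>
        <version>1.0-SNAPSHOT</version>
    </parent>

    <artifactId>DataClean</artifactId>

    <!-- the <dependencies> section shown above goes here -->
</project>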

3. Write the code


DataClean.java

package xuwei.tech;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
import org.apache.flink.util.Collector;
import xuwei.tech.source.MyRedisSource;

import java.util.HashMap;
import java.util.Properties;

/**
 * Commands used to create the Kafka topics:
 * bin/kafka-topics.sh --create --topic allData -zookeeper flink102:2181 --partitions 5 --replication-factor 1
 * bin/kafka-topics.sh --create --topic allDataclean -zookeeper flink102:2181 --partitions 5 --replication-factor 1
 *
 * Data cleaning job: reads the raw records from Kafka, flattens the nested data array,
 * replaces the country code with its region (looked up from Redis), and writes the
 * cleaned records back to Kafka.
 */
public class DataClean {
    public static void main(String[] args)throws Exception {
        // create the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint configuration
        env.enableCheckpointing(60000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // configure the Kafka source
        String topic="allData";
         Properties prop= new Properties();
         prop.setProperty("bootstarp.servers","flink102:9092");
         prop.setProperty("group.id","conl");

         FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<String>(topic,new SimpleStringSchema(),prop);

         // read the raw records from Kafka, e.g.:
         // {"dt":"2018-01-01 11:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}
         DataStreamSource<String> data = env.addSource(myConsumer);

         // latest country-code-to-region mapping, loaded from Redis
         DataStreamSource<HashMap<String, String>> mapData = env.addSource(new MyRedisSource());

         DataStream<String> redisData = data.connect(mapData).flatMap(new CoFlatMapFunction<String, HashMap<String, String>, String>() {
            // holds the country-to-region mapping
            private HashMap<String, String> allMap = new HashMap<String, String>();


            // flatMap1 processes the records coming from Kafka
            @Override
            public void flatMap1(String value, Collector<String> out) throws Exception {
                JSONObject jsonObject = JSONObject.parseObject(value);
                String dt = jsonObject.getString("dt");
                String countryCode = jsonObject.getString("countryCode");
                // look up the region for this country code
                String area = allMap.get(countryCode);
                JSONArray jsonArray = jsonObject.getJSONArray("data");
                for (int i = 0; i < jsonArray.size(); i++) {
                    JSONObject jsonObject1 = jsonArray.getJSONObject(i);
                    jsonObject1.put("area", area);
                    jsonObject1.put("dt", dt);
                    out.collect(jsonObject1.toJSONString());
                }
            }

            // flatMap2 receives the mapping HashMap emitted by the Redis source
            @Override
            public void flatMap2(HashMap<String, String> value, Collector<String> out) throws Exception {
                this.allMap = value;
            }
        });
         String outTopic="allDataclean";
         Properties outprop = new Properties();
          outprop.setProperty("bootstrap.servers", "flink102:9092");
         FlinkKafkaProducer011<String> myProducer = new FlinkKafkaProducer011<>(outTopic, new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), outprop, FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
        redisData.addSink(myProducer);
        env.execute("DataClean");

    }
}

MyRedisSource.java

package xuwei.tech.source;

import org.apache.flink.streaming.api.functions.source.SourceFunction;


import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

import java.util.HashMap;
import java.util.Map;

/**
 * Initialize the mapping data in Redis:
 *
 * hset areas AREA_US US
 * hset areas AREA_CT TW,HK
 * hset areas AREA_AR PK,KW,SA
 * hset areas AREA_IN IN
 *
 * Redis stores the relationship between regions and their country codes;
 * this source inverts it into a Java HashMap of country code -> region.
 */
public class MyRedisSource implements SourceFunction<HashMap<String,String>> {
    private Logger logger = LoggerFactory.getLogger(MyRedisSource.class);
    // refresh interval: reload the mapping from Redis every 60 seconds
    private final long SLEEP_MILLIS = 60000;
    private boolean isRunning=true;
    private Jedis jedis=null;

    @Override
    public void run(SourceContext<HashMap<String, String>> ctx) throws Exception {

        this.jedis = new Jedis("flink102", 6379);

        // holds the country code -> region mapping (e.g. US -> AREA_US)
        HashMap<String, String> keyValueMap = new HashMap<>();
        while (isRunning) {
            try {
                keyValueMap.clear();
                // read the region hash from Redis: field = region, value = comma-separated country codes
                Map<String, String> areas = jedis.hgetAll("areas");
                for (Map.Entry<String, String> entry : areas.entrySet()) {
                    String key = entry.getKey();
                    String value = entry.getValue();
                    String[] splits = value.split(",");
                    for (String split : splits) {
                        // invert the mapping: country code -> region
                        keyValueMap.put(split, key);
                    }
                }
                if (keyValueMap.size() > 0) {
                    ctx.collect(keyValueMap);
                } else {
                    logger.warn("No mapping data found in Redis!");
                }
                Thread.sleep(SLEEP_MILLIS);
            } catch (JedisConnectionException e) {
                // connection problem: log it and reconnect
                logger.error("Redis connection error, reconnecting", e);
                jedis = new Jedis("flink102", 6379);
            } catch (Exception e) {
                logger.error("Error in the Redis source", e);
            }
        }
    }

    @Override
    public void cancel() {
        isRunning=false;
        if(jedis!=null){
         jedis.close();
        }
    }
}


4. Start the Redis and Kafka services on the VM

//Kafka (and ZooKeeper) are running
[root@flink102 ~]# jps
11984 Kafka
12336 Jps
6628 QuorumPeerMain
//Redis is running
[root@flink102 ~]# redis-cli
127.0.0.1:6379> 
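
(If the services are not up yet, they can be started roughly as follows; the install and config paths are assumptions, so adjust them to your environment.)

# from the Kafka installation directory (kafka-2.11 in this walkthrough)
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties

# start the Redis server (config file path is an assumption)
redis-server /etc/redis.conf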


5. Create the Kafka topics

Tip: previously used creation commands can be recalled from shell history with: history | grep topic

//create the Kafka topic named allData
[root@flink102 kafka-2.11]# bin/kafka-topics.sh --create --topic allData -zookeeper flink102:2181 --partitions 5 --replication-factor 1
Created topic "allData".
[root@flink102 kafka-2.11]# 

//create the Kafka topic named allDataclean
[root@flink102 kafka-2.11]# bin/kafka-topics.sh --create --topic allDataclean -zookeeper flink102:2181 --partitions 5 --replication-factor 1
Created topic "allDataclean".
[root@flink102 kafka-2.11]# 


6. Initialize the mapping data in Redis

//initialize the region/country mapping in Redis
127.0.0.1:6379> hset areas AREA_US US
(integer) 1
127.0.0.1:6379> hset areas AREA_CT TW,HK
(integer) 1
127.0.0.1:6379> hset areas AREA_AR PK,KW,SA
(integer) 1
127.0.0.1:6379> hset areas AREA_IN IN
(integer) 1
127.0.0.1:6379> 

//verify the data
127.0.0.1:6379> hgetall areas
1) "AREA_US"
2) "US"
3) "AREA_CT"
4) "TW,HK"
5) "AREA_AR"
6) "PK,KW,SA"
7) "AREA_IN"
8) "IN"
127.0.0.1:6379> 
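
MyRedisSource (shown earlier) inverts this region -> countries hash into a country-code -> region HashMap, so the map it emits into the Flink job is effectively (HashMap entry order may vary):

{US=AREA_US, TW=AREA_CT, HK=AREA_CT, PK=AREA_AR, KW=AREA_AR, SA=AREA_AR, IN=AREA_IN}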



7. Create the producer utility class


kafkaProducer.java

package xuwei.tech.util;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;

/**
 * Created by xuwei.tech on 2018/11/6.
 */
public class kafkaProducer {

    public static void main(String[] args) throws Exception{
        Properties prop = new Properties();
        // Kafka broker address
        prop.put("bootstrap.servers", "flink102:9092");
        // key/value serializers
        prop.put("key.serializer", StringSerializer.class.getName());
        prop.put("value.serializer", StringSerializer.class.getName());
        // target topic name
        String topic = "allData";

        // create the Kafka producer
        KafkaProducer<String, String> producer = new KafkaProducer<String,String>(prop);

        //{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}

        // produce one message every two seconds
        while(true){
            String message = "{\"dt\":\""+getCurrentTime()+"\",\"countryCode\":\""+getCountryCode()+"\",\"data\":[{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"},{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"}]}";
            System.out.println(message);
            producer.send(new ProducerRecord<String, String>(topic,message));
            Thread.sleep(2000);
        }
        // close the producer (unreachable; the loop above never exits)
        //producer.close();
    }

    public static String getCurrentTime(){
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return sdf.format(new Date());
    }

    public static String getCountryCode(){
        String[] types = {"US","TW","HK","PK","KW","SA","IN"};
        Random random = new Random();
        int i = random.nextInt(types.length);
        return types[i];
    }


    public static String getRandomType(){
        String[] types = {"s1","s2","s3","s4","s5"};
        Random random = new Random();
        int i = random.nextInt(types.length);
        return types[i];
    }

    public static double getRandomScore(){
        double[] types = {0.3,0.2,0.1,0.5,0.8};
        Random random = new Random();
        int i = random.nextInt(types.length);
        return types[i];
    }

    public static String getRandomLevel(){
        String[] types = {"A","A+","B","C","D"};
        Random random = new Random();
        int i = random.nextInt(types.length);
        return types[i];
    }


}

Run the producer utility class kafkaProducer directly. The console prints each generated message, which confirms that data is being produced into the allData topic.


8. Monitor the allDataclean topic on the VM

[root@flink102 kafka-2.11]# bin/kafka-console-consumer.sh --bootstrap-server flink102:9092 --topic allDataclean

Create the log4j configuration file in the resources directory.

log4j.properties

log4j.rootLogger=info,stdout

log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n


Start the DataClean program. The console consumer on the allDataclean topic should now show the flattened, region-enriched records.
