Part 1: Section 26 – Flink Kafka Connector in Detail
Use case analysis
Data cleaning (real-time ETL)
Data reporting
Requirements analysis
Clean and split the log data produced by the algorithm:
1. The log data produced by the algorithm is nested JSON and needs to be split and flattened.
2. Convert the country field in the log into its region.
3. Finally, store the different types of log data separately (see the example below).
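For illustration, a raw log record looks like the following (the field values here are just examples, in the format generated later by the kafkaProducer utility):

{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}

It is split into one flattened record per element of the data array, with the region looked up from the country code (US -> AREA_US) and the dt field copied down:

{"type":"s1","score":0.3,"level":"A","area":"AREA_US","dt":"2018-01-01 10:11:11"}
{"type":"s2","score":0.2,"level":"B","area":"AREA_US","dt":"2018-01-01 10:11:11"}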
Architecture diagram
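In short: the producer utility writes raw nested JSON logs into the Kafka topic allData; the Flink DataClean job consumes allData, connects it with the country-to-region mapping periodically loaded from Redis, flattens each record, and writes the cleaned records to the Kafka topic allDataclean.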
Hands-on code (Java)
1. Create the Maven parent project
Parent project created.
2. Create the child module DataClean
Child module DataClean created.
Add the dependencies to the pom file of the parent project FlinkProj:
For easier management, the versions are declared once in the parent pom (inside dependencyManagement); child modules inherit them automatically, so their pom files do not need to specify version numbers.
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.6.1</version>
<!-- provided means this dependency is only used at compile time, not at runtime or when packaging -->
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.6.1</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.3</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.45</version>
</dependency>
</dependencies>
</dependencyManagement>
<!-- Packaging configuration -->
<build>
<plugins>
<!-- Compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Scala compiler plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
<configuration>
<scalaCompatVersion>2.11</scalaCompatVersion>
<scalaVersion>2.11.12</scalaVersion>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<id>compile-scala</id>
<phase>compile</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile-scala</id>
<phase>test-compile</phase>
<goals>
<goal>add-source</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Jar packaging plugin (bundles all dependencies) -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<!-- The jar's main class can be set here (optional) -->
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
Add the dependencies to the pom file of the child module DataClean:
No version numbers are needed; they are inherited from the parent pom.
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/redis.clients/jedis -->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
</dependency>
<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
</dependency>
</dependencies>
3. Write the code
DataClean.java
package xuwei.tech;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
import org.apache.flink.util.Collector;
import xuwei.tech.source.MyRedisSource;
import java.util.HashMap;
import java.util.Properties;
/**
 * Commands to create the Kafka topics:
 * bin/kafka-topics.sh --create --topic allData --zookeeper flink102:2181 --partitions 5 --replication-factor 1
 * bin/kafka-topics.sh --create --topic allDataclean --zookeeper flink102:2181 --partitions 5 --replication-factor 1
 *
 * Data cleaning requirement
 *
 * Job assembly code
 */
public class DataClean {
public static void main(String[] args) throws Exception {
//create the Flink execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//checkpoint configuration
env.enableCheckpointing(60000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(10000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
//configure the Kafka source
String topic="allData";
Properties prop= new Properties();
prop.setProperty("bootstrap.servers","flink102:9092");
prop.setProperty("group.id","conl");
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<String>(topic,new SimpleStringSchema(),prop);
//consume the raw data from Kafka
//{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}
DataStreamSource<String> data = env.addSource(myConsumer);
//the latest country-code-to-region mapping
DataStreamSource<HashMap<String, String>> mapData = env.addSource(new MyRedisSource());
DataStream<String> redisData = data.connect(mapData).flatMap(new CoFlatMapFunction<String, HashMap<String, String>, String>() {
//holds the country-to-region mapping
private HashMap<String, String> allMap = new HashMap<String, String>();
//flatMap1 processes the data coming from Kafka
@Override
public void flatMap1(String value, Collector<String> out) throws Exception {
JSONObject jsonObject = JSONObject.parseObject(value);
String dt = jsonObject.getString("dt");
String countryCode = jsonObject.getString("countryCode");
//look up the region for this country code
String area = allMap.get(countryCode);
JSONArray jsonArray = jsonObject.getJSONArray("data");
for (int i = 0; i < jsonArray.size(); i++) {
JSONObject jsonObject1 = jsonArray.getJSONObject(i);
jsonObject1.put("area", area);
jsonObject1.put("dt", dt);
out.collect(jsonObject1.toJSONString());
}
}
//flatMap2 processes the map data emitted by the Redis source
@Override
public void flatMap2(HashMap<String, String> value, Collector<String> out) throws Exception {
this.allMap = value;
}
});
String outTopic="allDataclean";
Properties outprop = new Properties();
outprop.setProperty("bootstrap.servers", "flink102:9092");
FlinkKafkaProducer011<String> myProducer = new FlinkKafkaProducer011<>(outTopic, new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), outprop, FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
redisData.addSink(myProducer);
env.execute("DataClean");
}
}
MyRedisSource.java
package xuwei.tech.source;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;
import java.util.HashMap;
import java.util.Map;
/**
 * Initialize the data in Redis:
 *
 * hset areas AREA_US US
 * hset areas AREA_CT TW,HK
 * hset areas AREA_AR PK,KW,SA
 * hset areas AREA_IN IN
 *
 * The region-to-country relationship is stored in Redis
 *
 * and needs to be assembled into a Java HashMap keyed by country code.
 */
public class MyRedisSource implements SourceFunction<HashMap<String,String>> {
private Logger logger = LoggerFactory.getLogger(MyRedisSource.class);
private final long SLEEP_MILLIS=60000;
private boolean isRunning=true;
private Jedis jedis=null;
@Override
public void run(SourceContext<HashMap<String, String>> ctx) throws Exception {
this.jedis = new Jedis("flink102", 6379);
//holds the mapping from every country code to its region
HashMap<String, String> keyValueMap = new HashMap<>();
while (isRunning){
try {
keyValueMap.clear();
//fetch the mapping from Redis
Map<String, String> areas = jedis.hgetAll("areas");
for (Map.Entry<String, String> entry : areas.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
String[] splits = value.split(",");
for (String split : splits) {
keyValueMap.put(split, key);
}
}
if (keyValueMap.size() > 0) {
ctx.collect(keyValueMap);
} else {
logger.warn("No data was fetched from Redis!");
}
Thread.sleep(SLEEP_MILLIS);
} catch (JedisConnectionException e) { //handle connection failures and reconnect
logger.error("Redis connection error, re-establishing the connection", e);
jedis = new Jedis("flink102", 6379);
} catch (Exception e) {
logger.error("Error in the Redis source", e);
}
}
}
@Override
public void cancel() {
isRunning=false;
if(jedis!=null){
jedis.close();
}
}
}
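A note on the design: MyRedisSource implements SourceFunction rather than ParallelSourceFunction, so it runs with parallelism 1; it re-reads the whole areas hash every 60 seconds and emits a fresh snapshot of the mapping, which flatMap2 in DataClean simply swaps in.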
4. Start the Redis and Kafka services on the VM
//Kafka is running
[root@flink102 ~]# jps
11984 Kafka
12336 Jps
6628 QuorumPeerMain
//Redis is running
[root@flink102 ~]# redis-cli
127.0.0.1:6379>
5. Create the Kafka topics
Tip: you can recall the creation commands from shell history: history | grep topic
//create the Kafka topic named allData
[root@flink102 kafka-2.11]# bin/kafka-topics.sh --create --topic allData -zookeeper flink102:2181 --partitions 5 --replication-factor 1
Created topic "allData".
[root@flink102 kafka-2.11]#
//create the Kafka topic named allDataclean
[root@flink102 kafka-2.11]# bin/kafka-topics.sh --create --topic allDataclean -zookeeper flink102:2181 --partitions 5 --replication-factor 1
Created topic "allDataclean".
[root@flink102 kafka-2.11]#
6. Initialize the data in Redis
//initialize the country-to-region mapping in Redis
127.0.0.1:6379> hset areas AREA_US US
(integer) 1
127.0.0.1:6379> hset areas AREA_CT TW,HK
(integer) 1
127.0.0.1:6379> hset areas AREA_AR PK,KW,SA
(integer) 1
127.0.0.1:6379> hset areas AREA_IN IN
(integer) 1
127.0.0.1:6379>
//check the data
127.0.0.1:6379> hgetall areas
1) "AREA_US"
2) "US"
3) "AREA_CT"
4) "TW,HK"
5) "AREA_AR"
6) "PK,KW,SA"
7) "AREA_IN"
8) "IN"
127.0.0.1:6379>
7. Create the producer utility class
kafkaProducer.java
package xuwei.tech.util;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;
/**
* Created by xuwei.tech on 2018/11/6.
*/
public class kafkaProducer {
public static void main(String[] args) throws Exception{
Properties prop = new Properties();
//Kafka broker address
prop.put("bootstrap.servers", "flink102:9092");
//key and value serializers
prop.put("key.serializer", StringSerializer.class.getName());
prop.put("value.serializer", StringSerializer.class.getName());
//topic name
String topic = "allData";
//create the producer
KafkaProducer<String, String> producer = new KafkaProducer<String,String>(prop);
//{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}
//produce messages
while(true){
String message = "{\"dt\":\""+getCurrentTime()+"\",\"countryCode\":\""+getCountryCode()+"\",\"data\":[{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"},{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"}]}";
System.out.println(message);
producer.send(new ProducerRecord<String, String>(topic,message));
Thread.sleep(2000);
}
//close the producer (unreachable because of the infinite loop above)
//producer.close();
}
public static String getCurrentTime(){
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
return sdf.format(new Date());
}
public static String getCountryCode(){
String[] types = {"US","TW","HK","PK","KW","SA","IN"};
Random random = new Random();
int i = random.nextInt(types.length);
return types[i];
}
public static String getRandomType(){
String[] types = {"s1","s2","s3","s4","s5"};
Random random = new Random();
int i = random.nextInt(types.length);
return types[i];
}
public static double getRandomScore(){
double[] types = {0.3,0.2,0.1,0.5,0.8};
Random random = new Random();
int i = random.nextInt(types.length);
return types[i];
}
public static String getRandomLevel(){
String[] types = {"A","A+","B","C","D"};
Random random = new Random();
int i = random.nextInt(types.length);
return types[i];
}
}
Run the producer utility class kafkaProducer directly; the generated messages are printed to the console.
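A printed message looks roughly like this (an illustrative line only; the values are random, so yours will differ):

{"dt":"2019-01-01 12:00:00","countryCode":"TW","data":[{"type":"s3","score":0.5,"level":"B"},{"type":"s1","score":0.2,"level":"A+"}]}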
The console output above shows that the data is being generated.
8. Monitor the allDataclean topic on the VM
[root@flink102 kafka-2.11]# bin/kafka-console-consumer.sh --bootstrap-server flink102:9092 --topic allDataclean
Create the logger configuration file in the resources directory
log4j.properties
log4j.rootLogger=info,stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
Start the program
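To verify the whole pipeline (assuming the topics, Redis data, and services above are in place): start the DataClean job first, by running DataClean.main from the IDE or by submitting the assembled jar, then run the kafkaProducer utility; the flattened records should start appearing in the console consumer attached to allDataclean.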