Since the company has decided to adopt the Flink distributed-computing framework as the preferred technology for upcoming products, I recently spent some time learning Flink. This article uses Kafka as the data source.
- Installing Flink: download the Flink distribution from the official site; the version I used is 'Apache Flink 1.8.1 for Scala 2.11'. Unpack it locally to /usr/local/flink-1.8.1.
- Starting Flink: first `cd /usr/local/flink-1.8.1/conf` and open flink-conf.yaml; find the 'jobmanager.rpc.address: ' entry and set it to 'jobmanager.rpc.address: localhost'. Then `cd /usr/local/flink-1.8.1/bin` and run start-cluster.sh. Flink is now running locally, and the Flink UI can be opened in a browser at http://localhost:8081
- Using the Flink API: create a new Java Maven project and add the Flink and Kafka dependencies to the pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.damon</groupId>
    <artifactId>flink</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>flink</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.4.1</flink.version>
        <deploy.dir>./target/flink/</deploy.dir>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>1.5.10.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-core</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.9_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.5</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.26</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>

    <build>
        <finalName>flinkpackage</finalName>
        <sourceDirectory>src/main/java</sourceDirectory>
        <resources>
            <!-- control copying of resource files -->
            <resource>
                <directory>src/main/resources</directory>
                <targetPath>${project.build.directory}</targetPath>
            </resource>
        </resources>
        <plugins>
            <!-- set the source file encoding -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <!-- <defaultLibBundleDir>lib</defaultLibBundleDir> -->
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.damon.flink.App</mainClass>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
In the main class, add the code that sets up the Flink data source and consumes the Kafka messages:
package com.damon.flink;

import com.damon.flink.model.Student;
import com.damon.flink.sink.StudentSink;
import com.google.gson.Gson;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;

public class App {
    private static Logger log = LoggerFactory.getLogger(App.class);
    private static Gson gson = new Gson();

    @SuppressWarnings({ "serial", "deprecation" })
    public static void main(String[] args) throws Exception {
        String topic = "test.topic";
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "test-consumer-group");

        FlinkKafkaConsumer09<String> consumer09 =
                new FlinkKafkaConsumer09<String>(topic, new SimpleStringSchema(), properties);
        DataStream<String> kafkaStream = env.addSource(consumer09);

        // deserialize each JSON message into a Student, then partition by gender
        DataStream<Student> studentStream = kafkaStream
                .map(student -> gson.fromJson(student, Student.class))
                .keyBy("gender");
        studentStream.addSink(new StudentSink());

        env.execute("Flink Streaming Java API Skeleton");
        log.debug("Flink Started ...");
    }
}
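Here `keyBy("gender")` routes every record with the same gender value to the same parallel task downstream. Logically this behaves like grouping by that field; a minimal JDK-only sketch of the idea (not the Flink API — class and method names here are illustrative):

```java
import java.util.*;
import java.util.stream.*;

public class KeyBySketch {
    // Conceptually, keyBy("gender") sends records with equal keys to the
    // same downstream task, similar to grouping records by that field.
    // Each input element is {name, gender}.
    public static Map<String, List<String>> groupByGender(List<String[]> students) {
        return students.stream()
                .collect(Collectors.groupingBy(
                        s -> s[1],  // the key, like keyBy("gender")
                        Collectors.mapping(s -> s[0], Collectors.toList())));
    }
}
```

In the real job the records stay a stream and are never materialized into a map; the sketch only shows which records end up together.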
The Student model class:
package com.damon.flink.model;

public class Student {
    private String id;
    private String name;
    private String gender;
    private int age;
    private int score;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getGender() { return gender; }
    public void setGender(String gender) { this.gender = gender; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }

    public int getScore() { return score; }
    public void setScore(int score) { this.score = score; }
}
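Gson maps JSON keys onto fields of the same name, so the messages sent to test.topic should be JSON objects whose keys match the Student fields above. A JDK-only sketch of building such a payload (the helper name is illustrative, not part of the project):

```java
public class PayloadSketch {
    // Builds a JSON string Gson can map onto Student.
    // The keys must match Student's field names exactly.
    public static String studentJson(String id, String name, String gender, int age, int score) {
        return String.format(
                "{\"id\":\"%s\",\"name\":\"%s\",\"gender\":\"%s\",\"age\":%d,\"score\":%d}",
                id, name, gender, age, score);
    }
}
```

Any Kafka producer (for example the console producer shipped with Kafka) can send strings of this shape to the topic.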
And the custom StudentSink that handles the consumed Kafka messages:
package com.damon.flink.sink;

import com.damon.flink.model.Student;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StudentSink extends RichSinkFunction<Student> {
    private static Logger log = LoggerFactory.getLogger(StudentSink.class);

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void close() throws Exception {
        super.close();
    }

    @Override
    public void invoke(Student value, Context context) throws Exception {
        log.info("Student : " + value.getName() + ", Score : " + value.getScore());
    }
}
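The runtime calls `invoke` once for each record that reaches the sink, with `open`/`close` bracketing the task's lifetime. A plain-Java sketch of that per-record contract (not the Flink API — the interface and names below are illustrative):

```java
import java.util.*;

public class SinkSketch {
    // Minimal stand-in for RichSinkFunction: the runtime calls invoke once per record.
    interface Sink<T> {
        void invoke(T value);
    }

    // Drives the sink for each element, the way the runtime would,
    // and returns the lines the sink "logged".
    public static List<String> run(List<String> names) {
        List<String> logged = new ArrayList<>();
        Sink<String> sink = name -> logged.add("Student : " + name); // like StudentSink.invoke
        for (String n : names) {
            sink.invoke(n); // one call per record
        }
        return logged;
    }
}
```

Because `invoke` runs per record on each parallel sink instance, anything expensive (connections, clients) belongs in `open` rather than in `invoke`.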
Start the project and send some messages to Kafka; the consumed messages show up in the console:
16:32:24.386 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:29.387 [flink-akka.actor.default-dispatcher-5] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:34.388 [flink-akka.actor.default-dispatcher-7] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:39.386 [flink-akka.actor.default-dispatcher-7] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:43.967 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:43.043, Score : 67
16:32:44.385 [flink-akka.actor.default-dispatcher-7] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:44.587 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:44.044, Score : 88
16:32:45.413 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:45.045, Score : 93
16:32:45.925 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:45.045, Score : 62
16:32:46.339 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:46.046, Score : 51
16:32:47.059 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:46.046, Score : 61
16:32:47.370 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:47.047, Score : 86
16:32:47.986 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:47.047, Score : 58
16:32:48.392 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:48.048, Score : 66
16:32:48.806 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:48.048, Score : 58
16:32:49.320 [Sink: Unnamed (2/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:49.049, Score : 50
16:32:49.388 [flink-akka.actor.default-dispatcher-7] DEBUG org.apache.flink.runtime.taskmanager.TaskManager - Sending heartbeat to JobManager
16:32:49.735 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:49.049, Score : 62
16:32:50.150 [Sink: Unnamed (7/12)] INFO com.damon.flink.sink.StudentSink - Student : DamonTest-2019-07-28 16:32:50.050, Score : 50
- Submitting the job to Flink: package the demo project above into a jar and submit it to the local Flink cluster. You can upload it through the Flink UI, or submit it with a script; here we use the flink script in the bin directory:
192:bin damon$ ./flink run -c com.damon.flink.App /Users/damon/Project/flink/flink/target/flink-0.0.1-SNAPSHOT.jar
Now open the Flink UI again and the job appears under 'Running Jobs':
Keep sending messages to Kafka: the UI shows the number of bytes and records processed, and the 'Task Managers' tab shows the detailed processing logs:
With that, our local Flink + Kafka demo project is complete. Next, I plan to look into deploying Flink on YARN.