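Apache Flink is an open-source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can run both batch and stream processing programs. The runtime also natively supports iterative algorithms (definition from Baidu Baike). Put simply, Flink is an open-source, distributed, high-performance stream processing framework for processing massive amounts of data in real time. Before going through Flink's source code in detail, let's start with a simple example.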
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.zhu.self.flinkcount</groupId>
    <artifactId>flinkcount</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>flinkcount</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.zhu.self.flinkcount.flinkcount.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
The com.zhu.self.flinkcount.flinkcount.WordCount class
package com.zhu.self.flinkcount.flinkcount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Port of the socket to read from
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            System.err.println("No port argument specified; using default 9000");
            port = 9000;
        }
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Connect to the socket and read the input, one record per line
        DataStreamSource<String> text = env.socketTextStream("localhost", port, "\n");
        // Compute the word counts
        DataStream<WordWithCount> windowCount = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                        String[] splits = value.split(" ");
                        for (String word : splits) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                }) // Flatten: turn every word of a line into a <word, count> record
                .keyBy("word") // Group records that have the same word
                .timeWindow(Time.seconds(2), Time.seconds(1)) // Window size of 2 s, sliding every 1 s
                .sum("count");
        // Print the results to the console
        windowCount.print().setParallelism(1); // Use a parallelism of 1
        // Note: Flink evaluates lazily, so nothing above runs until execute() is called
        env.execute("streaming word count");
    }

    /**
     * Holds a word and the number of times it occurred.
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" + "word='" + word + '\'' + ", count=" + count + '}';
        }
    }
}
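To make the data flow of the job above concrete, here is a minimal plain-Java sketch (no Flink dependency) of what flatMap, keyBy("word") and sum("count") compute for the lines that land in one window. The class WordCountSketch and the method countWords are hypothetical names made up for illustration; Flink itself does this incrementally and in parallel, per key and per window.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {

    // Hypothetical helper, not part of Flink: counts the words of all lines
    // that are assumed to fall into a single window.
    static Map<String, Long> countWords(Iterable<String> linesInWindow) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : linesInWindow) {
            // Same split as in the flatMap function: one record per word
            for (String word : line.split(" ")) {
                // Plays the role of keyBy("word") followed by sum("count")
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hello=2, flink=1, world=1}
        System.out.println(countWords(Arrays.asList("hello flink", "hello world")));
    }
}
```

The real job differs in one important way: the set of lines per window is not fixed in advance but is decided by the sliding time window.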
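I use Eclipse as my IDE. To build the jar, run the Maven goals clean package.
1. Once the build finishes, open a console window and run nc -l 9000
2. Follow the log file in Flink's log directory (tail -f /Users/zhuhuiming/flink/flink-1.9.1/log/flink-zhuhuiming-taskexecutor-0-zhuhuimingdeMacBook-Pro.local.out)
3. In another console window, run flink run flinkcount-0.0.1-SNAPSHOT-jar-with-dependencies.jar --port 9000
I have set the following environment variables for Flink: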
FLINK_HOME=/Users/zhuhuiming/flink/flink-1.9.1
PATH=$FLINK_HOME/bin:$PATH:.
4. Then, in the window opened in step 1, type:
早上好 今天 天气下雨 阴冷潮湿 注意保暖
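The window from step 2 then prints the word-count results, as shown in the figure below.
This leaves a question: why does every word appear twice? How does a sliding window work? We will answer both in detail later.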
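As a preview of that discussion, the following plain-Java sketch models how a sliding window assigner can map one record to several windows. It is a simplified model under stated assumptions (non-negative timestamps in milliseconds, no window offset), not Flink's actual implementation; SlidingWindowSketch and windowStarts are hypothetical names. With a window size of 2 s and a slide of 1 s, as in timeWindow(Time.seconds(2), Time.seconds(1)), every timestamp falls into size / slide = 2 windows, so each word is counted and emitted by two windows.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    // Hypothetical helper: returns the start timestamps (ms) of all windows
    // [start, start + sizeMs) that contain timestampMs.
    // Assumes timestampMs >= 0 and a window offset of zero.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp
        long lastStart = timestampMs - (timestampMs % slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A record at t = 3500 ms with size 2000 ms and slide 1000 ms belongs to
        // the windows [3000, 5000) and [2000, 4000).
        // prints [3000, 2000]
        System.out.println(windowStarts(3500L, 2000L, 1000L));
    }
}
```

If the slide equals the window size, the same function returns exactly one start per record, which corresponds to a tumbling window in which each word would be reported once.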