1 Introduction to Flink
- Apache Flink is an open-source, distributed, high-performance, highly available, and accurate stream processing framework.
- It is implemented mainly in Java.
- It supports both real-time stream (Stream) processing and batch (Batch) processing; batch data is just a bounded special case of streaming data.
- Flink natively supports iterative computation, memory management, and program optimization.
The figure above summarizes these features of Flink.
For background on batch versus stream processing, see my earlier post: https://blog.csdn.net/GoSaint/article/details/100085835
2 Installing Flink
- Download the release archive
- Extract it under /usr/local
- Start Flink:
```shell
caozg@caozg-PC:~/Desktop$ cd /usr/local/flink-1.9.0/bin/
caozg@caozg-PC:/usr/local/flink-1.9.0/bin$ ls
config.sh                flink-daemon.sh       mesos-taskmanager.sh       start-cluster.bat         stop-zookeeper-quorum.sh
find-flink-home.sh       historyserver.sh      pyflink-gateway-server.sh  start-cluster.sh          taskmanager.sh
flink                    jobmanager.sh         pyflink-shell.sh           start-scala-shell.sh      yarn-session.sh
flink.bat                mesos-appmaster-job.sh  sql-client.sh            start-zookeeper-quorum.sh  zookeeper.sh
flink-console.sh         mesos-appmaster.sh    standalone-job.sh          stop-cluster.sh
caozg@caozg-PC:/usr/local/flink-1.9.0/bin$ ./start-cluster.sh
```
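If the cluster came up cleanly, you can verify it before submitting any job. The checks below are a sketch assuming a local standalone setup with Flink's default REST port (8081); the process names shown by `jps` are those used by the 1.9 standalone scripts:

```shell
# The JobManager and TaskManager should appear as JVM processes
jps
# e.g. StandaloneSessionClusterEntrypoint and TaskManagerRunner

# The JobManager web UI listens on port 8081 by default
curl -s http://localhost:8081 | head -n 3
```

If the `curl` returns HTML, the web UI is up and you can also watch running jobs at http://localhost:8081 in a browser.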
3 A WordCount Program
- pom.xml dependencies:
```xml
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.9.0</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-nop</artifactId>
        <version>1.7.2</version>
    </dependency>
</dependencies>
```
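With these dependencies in place, one way to compile and launch the job from the project root is via the Maven exec plugin. This invocation is an assumption, not part of the original post (it presumes `exec-maven-plugin` is configured and the class sits in the default package):

```shell
# Compile, then run the WordCount main class locally,
# passing the socket port as a program argument
mvn -q compile exec:java -Dexec.mainClass=WordCount -Dexec.args="--port 9998"
```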
- Java code:
```java
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WordCount {

    public static void main(String[] args) {
        // Socket port; falls back to 9998 when no --port argument is given
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            port = 9998;
        }

        // Obtain the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connect to the socket and read the input lines
        DataStreamSource<String> text = env.socketTextStream("localhost", port, "\n");

        // Process the data
        DataStream<WordWithCount> windowCount = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                        String[] splits = value.split("\\s");
                        for (String word : splits) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })                                              // flatten each line into <word, count> records
                .keyBy("word")                                  // group records with the same word
                .timeWindow(Time.seconds(2), Time.seconds(1))   // window size 2s, slide interval 1s
                .sum("count");

        // Print the results to the console, using a single parallel task
        windowCount.print().setParallelism(1);

        // Note: Flink builds the job lazily, so nothing above runs
        // until execute() is called
        try {
            JobExecutionResult result = env.execute("streaming word count");
            System.out.println(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Holds a word together with its occurrence count.
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}
```
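The flatMap step above splits each line on whitespace with `value.split("\\s")`. That tokenization can be sketched as plain Java without any Flink dependency; note that `Tokenize` is a hypothetical helper for illustration, and the empty-token guard is an addition not present in the original flatMap (`"\\s"` matches a single whitespace character, so consecutive spaces would otherwise yield empty words):

```java
import java.util.ArrayList;
import java.util.List;

public class Tokenize {
    // Mirrors the flatMap logic: split a line on whitespace characters
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.split("\\s")) {
            if (!w.isEmpty()) {   // skip empty tokens from repeated spaces
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world"));  // [hello, world]
    }
}
```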
- On the server, run `nc -l 9998` and type some words:

```shell
[root@caozg]# nc -l 9998
hello world
hello Flink
```
The console output is as follows:
```
WordWithCount{word='world', count=1}
WordWithCount{word='hello', count=1}
WordWithCount{word='world', count=1}
WordWithCount{word='hello', count=1}
WordWithCount{word='hello', count=1}
WordWithCount{word='Flink', count=1}
WordWithCount{word='hello', count=1}
WordWithCount{word='Flink', count=1}
```
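Notice that every word is printed twice. With a 2-second window sliding every 1 second, each record falls into size/slide = 2 overlapping windows, so each count is emitted once per window. The window-start arithmetic can be sketched in plain Java; `windowStarts` is a hypothetical helper (a simplified version of what a sliding-window assigner computes, assuming non-negative timestamps and no offset), not a Flink API:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindows {
    // Returns the start times of all sliding windows [start, start + size)
    // that contain the given timestamp (all values in milliseconds).
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);  // most recent window start
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // size=2000ms, slide=1000ms: a record at t=1500ms belongs to the
        // windows starting at 1000 and 0, i.e. it is counted twice.
        System.out.println(windowStarts(1500L, 2000L, 1000L));  // [1000, 0]
    }
}
```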