1. What is Flink
Flink is a stateful, distributed computation engine built on bounded and unbounded data streams. It can process streaming data as well as batch data (batch is a special case of streaming).
1.1 有界无界流
- Unbounded streams:无界流,即流数据,定义了开始,没有定义结束
- Bounded streams:界流,即批数据,定义了开始以及结束。
将连续的批处理放大即是流处理
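The "batch is a special case of streaming" idea can be sketched in plain Scala, without any Flink involved (`sumSoFar` is an illustrative name, not Flink API): the same incremental computation works on a bounded collection and on an unbounded iterator.

```scala
// The same incremental computation applied to bounded and unbounded inputs.
// A bounded input (a List) terminates on its own; an unbounded one
// (Iterator.from) keeps producing results for as long as we consume them.
def sumSoFar(nums: Iterator[Int]): Iterator[Int] =
  nums.scanLeft(0)(_ + _).drop(1) // running sum: one output per input

val bounded   = sumSoFar(List(1, 2, 3).iterator).toList   // finite: List(1, 3, 6)
val unbounded = sumSoFar(Iterator.from(1)).take(3).toList // never ends; we cut it off
```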
1.2 Deployment modes
Flink can be deployed on Hadoop YARN, Apache Mesos, Kubernetes, as a standalone cluster, in the cloud, or locally, much like Spark. As with Spark, YARN mode needs no HA setup of its own; the client merely submits the program. Standalone mode, by contrast, must be deployed with HA. The same code runs unchanged in any of these environments; for development and learning, local mode is sufficient.
2. Flink's programming model
2.1 The API layers, from lowest to highest:
Stateful Stream Processing (low-level) => DataStream / DataSet API (core) => Table API => SQL
- DataStream API: processes unbounded data
- DataSet API: processes bounded data
2.2 Notions of time
- Event time: when the data was produced
- Ingestion time: when the data entered Flink
- Processing time: when the data is processed by an operator
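Why event time matters: records can arrive out of order, so arrival order (processing/ingestion time) and event-time order may disagree. A plain-Scala sketch (`Reading` is an illustrative type, not part of Flink):

```scala
// Each record carries the time it was produced (its event time).
// The order of the Seq stands in for processing/ingestion-time order.
case class Reading(value: Double, eventTimeMillis: Long)

val arrivals = Seq(           // arrival order
  Reading(10.0, 3000L),       // produced last, but arrived first
  Reading(20.0, 1000L),
  Reading(30.0, 2000L)
)

val processingOrder = arrivals.map(_.value)                            // 10, 20, 30
val eventTimeOrder  = arrivals.sortBy(_.eventTimeMillis).map(_.value)  // 20, 30, 10
```

An event-time window would group the readings by `eventTimeMillis`, regardless of when they happened to arrive.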
2.3 Programming flow of common compute engines
MapReduce: input ==> map (reduce) ==> output
Spark: input ==> transformations ==> actions ==> output
Storm: input ==> Spout ==> Bolt ==> output
Flink: source ==> transformations ==> sink
3. Flink vs. Spark Streaming vs. Structured Streaming vs. Storm (core / Trident): a quick comparison
Storm core does per-record real-time stream processing, while Storm Trident is micro-batch. Storm is now largely considered legacy.
- Real-time semantics: Flink and Storm process records one at a time; Spark Streaming is micro-batch. Structured Streaming is micro-batch by default, with an experimental continuous mode.
- Latency: Flink, Storm, and Structured Streaming (in continuous mode) achieve low latency; Spark Streaming's latency is higher.
- Throughput: Storm core has the lowest throughput of the group.
4. WordCount example
Compared with Java, Scala makes the code more elegant and concise. The program follows five well-defined steps: environment, source, transformations, sink, execute.
Note: loading data and applying transformations in Flink are lazy; nothing actually runs until execute() is called.
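The laziness noted above can be mimicked with a plain Scala Iterator: building the pipeline runs nothing, and only forcing it (the analogue of execute()) evaluates the transformations. This is a sketch of the concept, not Flink's actual mechanism.

```scala
// Count how many times the map function actually runs.
var evaluated = 0
val pipeline = (1 to 5).iterator.map { x => evaluated += 1; x * 2 }

// Declaring the transformation ran nothing -- like Flink before execute().
val beforeForcing = evaluated   // 0

val result = pipeline.toList    // forcing plays the role of execute()
val afterForcing = evaluated    // 5
```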
4.1 Dependencies (pom.xml)
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.wsk.flink</groupId>
  <artifactId>flink-train</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.11.8</scala.version>
    <flink.version>1.8.1</flink.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>

  <repositories>
    <repository>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <!-- Flink 1.8 requires Java 8; Scala 2.11 no longer targets JVM 1.5 -->
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
4.2 WordCount code example
package com.wsk.flink.streaming

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

/**
 * Windowed word count over a socket stream.
 */
object SocketWindowWCApp {

  var hostname: String = _
  var port: Int = 9099

  def main(args: Array[String]): Unit = {
    try {
      val params = ParameterTool.fromArgs(args)
      hostname = if (params.has("hostname")) params.get("hostname") else "10.199.140.143"
      port = params.getInt("port")
    } catch {
      case e: Exception =>
        System.err.println("No port specified. Please run 'SocketWindowWordCount " +
          "--hostname <hostname> --port <port>', where hostname (localhost by default) and port " +
          "is the address of the text server")
        System.err.println("To start a simple text server, run 'netcat -l <port>' " +
          "and type the input text into the command line")
        // return
    }

    // step 1: obtain the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // step 2: source -- obtain a DataStream from the socket
    val lines = env.socketTextStream(hostname, port)

    // step 3: transformations
    val windowCounts = lines
      .flatMap { w => w.split(",") }
      .map { w => WordCount(w, 1) }
      .keyBy("word")
      .timeWindow(Time.seconds(4), Time.seconds(2))
      .sum("count")

    // step 4: sink
    windowCounts.print().setParallelism(1)

    // step 5: execute -- triggers the lazy pipeline defined above
    env.execute("Socket Window WordCount")
  }

  case class WordCount(word: String, count: Long)
}
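To see what a 4-second window sliding every 2 seconds computes, here is a plain-Scala simulation of the window assignment. Flink aligns sliding windows to the epoch; `windowStarts` mirrors that logic but is an illustrative helper under that assumption, not Flink API.

```scala
// Assign each timestamped word to every sliding window that contains it,
// then count words per window -- roughly what sum("count") emits per window.
case class Event(word: String, tsMillis: Long)

def windowStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val lastStart = ts - (ts % slide) // latest window start at or before ts
  Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toSeq
}

val events = Seq(Event("a", 1000L), Event("b", 1500L), Event("a", 3000L), Event("a", 5000L))

val counts: Map[(Long, String), Int] = events
  .flatMap(e => windowStarts(e.tsMillis, 4000L, 2000L).map(start => (start, e.word)))
  .groupBy(identity)
  .map { case (key, hits) => key -> hits.size }
// e.g. the window starting at 0 ms covers [0, 4000): "a" twice, "b" once
```

Because the windows overlap, the same event contributes to two windows, which is why counts for "a" appear in the windows starting at 0 ms, 2000 ms, and 4000 ms.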