Development Tools
- The official recommendation is IntelliJ IDEA, because it integrates Scala and Maven support out of the box, which makes it more convenient to use.
- Flink programs can be developed in either Java or Scala. Personally, I recommend Scala, because the implementations are more concise; writing functional-style code in Java is rather awkward.
- It is advisable to use a domestic (China) Maven mirror repository:
(1) Downloads from foreign repositories can be slow, so you can use Alibaba Cloud's Maven repository instead.
(2) Note: if the domestic mirror reports that a dependency cannot be found, remember to switch back to the original repository.
(3) Mirror configuration: open the settings.xml file under conf in your Maven installation and add the following inside the mirrors tag:
<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>
Flink dependencies (Scala and Java): pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ljj</groupId>
    <artifactId>FlinkExample</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.binary.version>2.11</scala.binary.version>
        <flink.version>1.7.0</flink.version>
    </properties>

    <dependencies>
        <!-- log start -->
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.25</version>
        </dependency>
        <!-- log end -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <!-- "provided" here would mean the dependency is used only at compile time, not at run time or when packaging -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- Java compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <scalaCompatVersion>2.11</scalaCompatVersion>
                    <scalaVersion>2.11.8</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- jar packaging plugin (the resulting jar bundles all dependencies) -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- optionally set the jar's entry class -->
                            <mainClass>com.ljj.SocketWindowWordCountJava</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Flink Program Development Steps
- Obtain an execution environment
- Load/create the initial data
- Specify the transformation operators that process the data
- Specify where to put the computed results
- Call execute() to trigger program execution
Note: Flink programs are lazily evaluated; the program only actually runs when execute() is finally called.
The benefit of lazy evaluation: you can develop a complex program, and Flink turns the whole program into a single Plan and executes that Plan as one unit! A minimal skeleton of the five steps is sketched below.
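A minimal Scala sketch of the five steps (the host, port, and job name here are placeholders for illustration, not taken from the original example):

import org.apache.flink.streaming.api.scala._

object FiveStepSkeleton {
  def main(args: Array[String]): Unit = {
    // 1. Obtain an execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Load/create the initial data (hypothetical socket source)
    val lines = env.socketTextStream("localhost", 9999)
    // 3. Specify transformation operators (nothing runs yet; Flink only builds the plan)
    val upper = lines.map(_.toUpperCase)
    // 4. Specify where to put the results
    upper.print()
    // 5. Only this call turns the accumulated plan into a running job
    env.execute("five-step skeleton")
  }
}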
Flink Stream Processing
- Requirements analysis:
Produce words in real time by hand through a socket; have Flink receive the data in real time, aggregate the data within a given time window (e.g., 2 seconds), and print the result computed for each window.
- Code development:
Add the corresponding Java or Scala dependencies.
- Execution:
1: Run nc -l 9000 on hadoop102
2: Start the code from IDEA on the local machine
Java code:
SocketWindowWordCountJava class
package com.ljj;

import com.ljj.domain.WordCount;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Sliding window computation.
 *
 * Word data is produced through a socket to simulate input,
 * and Flink aggregates the data.
 *
 * Requirement: every 1 second, aggregate the data of the last 2 seconds.
 */
public class SocketWindowWordCountJava {
    private final static Logger logger = LoggerFactory.getLogger(SocketWindowWordCountJava.class);

    public static void main(String[] args) throws Exception {
        // Get the port to use
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            logger.error("No port set. use default port 9000--java");
            port = 9000;
        }
        // Get the Flink execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String hostname = "192.168.2.102";
        String delimiter = "\n";
        // Connect to the socket to read the input data
        DataStreamSource<String> text = env.socketTextStream(hostname, port, delimiter);
        // After flatMap, the data looks like:
        // a 1
        // a 1
        // c 1
        DataStream<WordCount> wordCounts = text.flatMap(new FlatMapFunction<String, WordCount>() {
            public void flatMap(String s, Collector<WordCount> collector) throws Exception {
                String[] splits = s.split("\\s");
                for (String word : splits) {
                    collector.collect(new WordCount(word, 1));
                }
            }
        }).keyBy("word")
          .timeWindow(Time.seconds(2), Time.seconds(1)) // window size 2 seconds, slide interval 1 second
          .sum("count");
        // Print the result to the console and set the parallelism
        wordCounts.print().setParallelism(1);
        // This line is required; without it the program never runs
        env.execute("Socket window count");
    }
}
WordCount class (a POJO: keyBy("word") and sum("count") above reference its public fields by name)
package com.ljj.domain;

public class WordCount {
    public String word;
    public long count;

    public WordCount() {}

    public WordCount(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}
Scala code:
- Create a scala directory under the project's src/main
- Mark the scala directory as a sources root (if it is a plain directory, the code in it will not be compiled)
- When creating a new class, if the "Scala Class" option is not available, you need to add the Scala SDK/dependency to the project
SocketWindowWordCountScala class
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time

/**
 * Sliding window computation.
 * Word data is produced through a socket to simulate input,
 * and Flink aggregates the data.
 * Requirement: every 1 second, aggregate the data of the last 2 seconds.
 */
object SocketWindowWordCountScala {
  def main(args: Array[String]): Unit = {
    // Get the port to use
    val port: Int = try {
      ParameterTool.fromArgs(args).getInt("port")
    } catch {
      case e: Exception =>
        System.err.println("No port set. use default port 9000--scala")
        9000
    }
    // The implicit conversions must be imported
    import org.apache.flink.api.scala._
    // Get the Flink execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Connect to the socket to read the input data
    val text = env.socketTextStream("192.168.2.102", port, '\n')
    // Parse the data, key by word, apply the window, and sum the counts
    val wordCounts = text.flatMap(line => line.split("\\s")) // flatten: split each line into individual words
      .map(w => WordWithCount(w, 1))                 // turn each word into the form (word, 1)
      .keyBy("word")                                 // group by word
      .timeWindow(Time.seconds(2), Time.seconds(1))  // window size 2 seconds, slide interval 1 second
      .sum("count")
    // Print to the console
    wordCounts.print().setParallelism(1)
    // Run the job
    env.execute("Socket window count")
  }

  case class WordWithCount(word: String, count: Long)
}
Note: the implicit-conversion import line is required; without it the flatMap call below it fails to compile.
For details, see:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/types_serialization.html#type-information-in-the-scala-api
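As a minimal illustration (the object name here is hypothetical, and the exact compiler message may vary by version), removing the import below typically makes the flatMap call fail with an error along the lines of "could not find implicit value for evidence parameter of type TypeInformation[...]":

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object ImplicitsDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9000)

    // Without this import the flatMap below does not compile, because the
    // Scala API needs implicit TypeInformation for the result type:
    import org.apache.flink.api.scala._

    text.flatMap(line => line.split("\\s")).print()
    env.execute("implicits demo")
  }
}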
Flink Batch Processing
- Requirements analysis:
Read a file, count the total number of occurrences of each word in the file, and print the computed results.
- Code development:
Add the corresponding Java or Scala dependencies.
- Execution:
Start the code from IDEA on the local machine.
Java code:
BatchWordCountJava class
package com.ljj;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCountJava {
    public static void main(String[] args) throws Exception {
        String inputPath = "D:\\data\\file";
        String outPath = "D:\\data\\result.txt";
        // Get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Read the contents of the file
        DataSource<String> text = env.readTextFile(inputPath);
        DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).groupBy(0).sum(1);
        counts.writeAsCsv(outPath, "\n", " ").setParallelism(1);
        env.execute("batch word count");
    }

    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }
}
Scala code:
BatchWordCountScala class
package com.ljj

import org.apache.flink.api.scala.ExecutionEnvironment

object BatchWordCountScala {
  def main(args: Array[String]): Unit = {
    val inputPath = "D:\\data\\file"
    val outPut = "D:\\data\\result2.txt"
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(inputPath)
    // Import the implicit conversions
    import org.apache.flink.api.scala._
    val counts = text.flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)
      .sum(1)
    counts.writeAsCsv(outPut, "\n", " ").setParallelism(1)
    env.execute("batch word count")
  }
}
Differences Between Flink Streaming and Batch
- Stream processing (Streaming):
StreamExecutionEnvironment
DataStream
- Batch processing (Batch):
ExecutionEnvironment
DataSet
A short sketch contrasting the two entry points follows.
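A minimal Scala sketch of the two entry points (the object name and file path are placeholders; no sink or execute() call, so it only illustrates the types):

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object EnvComparison {
  def main(args: Array[String]): Unit = {
    // Stream processing: StreamExecutionEnvironment yields DataStream values
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = streamEnv.socketTextStream("localhost", 9000)

    // Batch processing: ExecutionEnvironment yields DataSet values
    val batchEnv = ExecutionEnvironment.getExecutionEnvironment
    val batch: DataSet[String] = batchEnv.readTextFile("D:\\data\\file")
  }
}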
Flink Local Mode Installation
- Goal: run the code above on a cluster
- Environment dependencies:
A Linux machine
JDK 1.8 or above (with the JAVA_HOME environment variable configured)
- Download address:
https://archive.apache.org/dist/flink/flink-1.7.0/flink-1.7.0-bin-hadoop27-scala_2.11.tgz
- Quick install and startup in local mode:
Unpack: tar -zxvf flink-1.7.0-bin-hadoop27-scala_2.11.tgz
cd flink-1.7.0
Start: ./bin/start-cluster.sh
Stop: ./bin/stop-cluster.sh
- Access the web UI:
http://hadoop102:8081
Running the Program on the Cluster
- Build
- Add the build configuration to the pom file and specify the fully qualified entry class when packaging (or specify it dynamically at run time via the -c option of bin/flink run)
- In pom.xml, comment out <scope>provided</scope>
- mvn clean package
- Execute
- Start the local Flink cluster on hadoop102
- On hadoop102, run nc -l 9000
- On hadoop102, run ./bin/flink run FlinkExample-xxxxxx.jar --port 9000
- On hadoop102, run tail -f log/flink-*-taskexecutor-*.out to watch the log output
- Stop the job
- From the web UI
- Or from the command line: bin/flink cancel <job-id> (running job ids are shown in the web UI and by bin/flink list)