Flink Study Notes
What is Flink
Flink is a framework and distributed processing engine for stateful computations over data streams; it works on both unbounded and bounded streams.
Unbounded streams: they have a start but no end; data keeps arriving endlessly, so you cannot wait for all of it before processing.
Bounded streams: they have a start and an end; you can wait until all data is available before processing, which is essentially batch processing.
A Flink application can process data as it arrives, or store the data first and process it later.
Layered APIs
Going down the list below, the abstraction level gets higher but the expressiveness gets lower:
- Stateful Event-Driven Applications: the lowest layer, built on ProcessFunction; you implement the methods it provides (see the sketch after this list)
- Stream & Batch Data Processing: the DataStream/DataSet APIs (the layer used most in day-to-day work)
- High-level Analytics API, e.g. SQL / Table API (not yet fully mature at the time these notes were written)
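A minimal sketch of the lowest layer (the class name and logic here are illustrative assumptions, not from the notes): a ProcessFunction in which you implement processElement yourself.
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

class UpperCaseProcessFunction extends ProcessFunction[String, String] {
  override def processElement(value: String,
                              ctx: ProcessFunction[String, String]#Context,
                              out: Collector[String]): Unit = {
    out.collect(value.toUpperCase) // emit zero, one or many records per input element
  }
}
// usage on a DataStream[String]: stream.process(new UpperCaseProcessFunction())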
Where Flink runs
Flink can run on resource managers such as Hadoop YARN, Apache Mesos and Kubernetes, or on a standalone Flink cluster; Flink recognizes these resource managers automatically.
Comparison with other stream processing frameworks
- Spark: Structured Streaming / Spark Streaming, batch-first; stream processing is treated as a special case of batch (mini-batches)
- Flink: streaming-first; batch processing is treated as a special case of streaming
- Storm: streaming, one Tuple at a time
Use Cases
- Event-driven applications
- Data analytics applications
- Data pipeline applications
How to learn Flink efficiently
- The official website
- The source code: attach the sources via Maven, or study the official examples, e.g. the flink-examples module on GitHub
Developing a batch application with Flink
Requirement: word count
Given a file, count how many times each word appears in it.
Separator: \t
We print the result to the console (in production you would sink it to an external target).
Implementation: Flink + Java; requires Maven 3.0.4 (or higher).
OOTB: out of the box
First way to create the project
1. Run the following command to generate the project:
mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=1.7.0 -DarchetypeCatalog=local
2. Import the generated project into IDEA.
Development workflow (the boilerplate steps)
- set up the batch execution environment
- read data
- transform operations: the core of the development, where the business logic lives
- env.execute
Breaking the task down
- Read the data, e.g. a line:
hello welcome
- Split each line by the given separator:
hello
welcome
- Attach a count of 1 to each word:
(hello,1)
(welcome,1)
- Merge by key: groupBy
Code example:
package com.wj;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
/**
 * A Flink batch application developed with the Java API.
 */
public class BatchWCJavaApp {
public static void main(String[] args) throws Exception {
String input = "C:\\Users\\Administrator\\Desktop\\hello.txt";
//step1: obtain the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//step2: read data
DataSource<String> text = env.readTextFile(input);
//step3: transform
text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
String[] tokens = value.toLowerCase().split("\t");
for (String token : tokens) {
if (token.length()>0){
collector.collect(new Tuple2<String,Integer>(token,1));
}
}
}
}).groupBy(0).sum(1).print(); //field 0 is the key, field 1 is the value
}
}
Flink + Scala
Prerequisites: Maven 3.0.4 (or higher) and Java 8.x
mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=1.7.0 -DarchetypeCatalog=local
Scala code example:
package com.wj
import org.apache.flink.api.scala.ExecutionEnvironment
/**
 * A Flink batch application developed with Scala.
 */
object BatchWCScalaApp {
def main(args: Array[String]): Unit = {
val input = "C:\\Users\\Administrator\\Desktop\\hello.txt"
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile(input)
//bring in the implicit conversions
import org.apache.flink.api.scala._
text.flatMap(_.toLowerCase.split("\t"))
.filter(_.nonEmpty)
.map((_,1))
.groupBy(0)
.sum(1)
.print()
}
}
Flink Java vs Scala
- Operators: map, filter
- Conciseness
Stream processing with Java
Start a local service listening on the port: nc -lk 9999
Java streaming code example:
package com.wj;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * A Flink real-time (streaming) application developed with the Java API.
 * The data for the word count comes from a socket.
 */
public class StreamingWCJavaApp {
public static void main(String[] args) throws Exception {
//step1: obtain the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//step2: read data
DataStreamSource<String> text = env.socketTextStream("localhost", 9999);
//step3:transform
text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
String[] tokens = value.toLowerCase().split(",");
for (String token : tokens) {
if (token.length()>0){
collector.collect(new Tuple2<String,Integer>(token,1));
}
}
}
}).keyBy(0).timeWindow(Time.seconds(5)).sum(1).print().setParallelism(1);
//in streaming programs you must call execute()
env.execute("StreamingWCJavaApp");
}
}
A refactored version of the code above, reading the port from the program arguments:
package com.wj;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * A Flink real-time (streaming) application developed with the Java API.
 * The data for the word count comes from a socket.
 */
public class StreamingWCJava02App {
public static void main(String[] args) throws Exception {
//read the program arguments
int port = 0;
try {
ParameterTool tool = ParameterTool.fromArgs(args);
port = tool.getInt("port");
} catch (Exception e) {
System.err.println("Port not set, falling back to the default port 9999");
port = 9999;
}
//step1: obtain the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//step2: read data
DataStreamSource<String> text = env.socketTextStream("localhost", port);
//step3:transform
text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
String[] tokens = value.toLowerCase().split(",");
for (String token : tokens) {
if (token.length()>0){
collector.collect(new Tuple2<String,Integer>(token,1));
}
}
}
}).keyBy(0).timeWindow(Time.seconds(5)).sum(1).print().setParallelism(1);
//in streaming programs you must call execute()
env.execute("StreamingWCJavaApp");
}
}
Run configuration: pass the port via the program arguments, e.g. --port 9999.
Stream processing with Scala
Code example:
package com.wj
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
/**
 * A Flink real-time (streaming) application developed with Scala.
 */
object SteamingWCScalaApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//bring in the implicit conversions
import org.apache.flink.api.scala._
val text = env.socketTextStream("localhost", 9999)
text
.flatMap(_.split(","))
.map((_,1))
.keyBy(0)
.timeWindow(Time.seconds(5))
.sum(1)
.print()
.setParallelism(1)
env.execute("SteamingWCScalaApp")
}
}
Programming model and core concepts
Flink core APIs
The general flow of big data processing:
MapReduce: input -> map(reduce) -> output
Storm: input -> Spout/Bolt -> output
Spark: input -> transformation/action -> output
Flink: input -> transformation/sink -> output
DataSet (batch) and DataStream (streaming) are immutable collections: elements cannot be added or removed. Each starts from a DataSource, and transformation operators such as map and filter produce new DataSets / DataStreams.
Flink programming model
- Obtain the execution environment (the context)
- Load / create the initial data
- Specify transformations on this data
- Specify where to put the results of your computations (the sink)
- Trigger the program execution
Lazy execution in Flink
None of your operations run until you call execute(), regardless of whether the program runs locally or on a cluster (see the minimal sketch below).
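A minimal sketch of the five steps above with the Scala streaming API (host, port and the transformation are placeholders for illustration):
import org.apache.flink.streaming.api.scala._

object SkeletonApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment // 1. obtain the execution environment
    val text = env.socketTextStream("localhost", 9999)           // 2. load the initial data
    val upper = text.map(_.toUpperCase)                          // 3. specify transformations on the data
    upper.print()                                                // 4. specify where to write the results (sink)
    env.execute("SkeletonApp")                                   // 5. trigger execution; nothing runs before this call
  }
}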
Specifying keys: field selection
The Java version is shown below; Scala is similar, except that in Scala tuple fields are addressed as _1 for the first element, whereas the Java API uses positional index 0 for the first field. A Scala counterpart is sketched after the Java example.
package com.wj.course;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * A Flink real-time (streaming) application developed with the Java API.
 * The data for the word count comes from a socket.
 */
public class StreamingWCJavaApp {
public static void main(String[] args) throws Exception {
//step1: obtain the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//step2: read data
DataStreamSource<String> text = env.socketTextStream("localhost", 9999);
//step3:transform
text.flatMap(new FlatMapFunction<String, WC>() {
@Override
public void flatMap(String value, Collector<WC> collector) throws Exception {
String[] tokens = value.toLowerCase().split(",");
for (String token : tokens) {
if (token.length()>0){
collector.collect(new WC(token.trim(),1));
}
}
}
})//.keyBy("word")
.keyBy(new KeySelector<WC, String>() {
@Override
public String getKey(WC wc) throws Exception {
return wc.word;
}
})
.timeWindow(Time.seconds(5))
.sum("count")
.print()
.setParallelism(1);
//in streaming programs you must call execute()
env.execute("StreamingWCJavaApp");
}
public static class WC{
private String word;
private int count;
public WC(){}
public WC(String word,int count){
this.word = word;
this.count = count;
}
@Override
public String toString() {
return "WC{" +
"word='" + word + '\'' +
", count=" + count +
'}';
}
public String getWord() {
return word;
}
public void setWord(String word) {
this.word = word;
}
public int getCount() {
return count;
}
public void setCount(int count) {
this.count = count;
}
}
}
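For comparison, a hedged Scala sketch of the same key selection: with a case class you can key by a field selector function (or address tuple fields positionally, e.g. _1); the class and names below are illustrative only.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWCScalaKeyApp {
  case class WC(word: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split(",").filter(_.nonEmpty).map(w => WC(w.trim, 1)))
      .keyBy(_.word)                 // key by a field selector instead of a string name
      .timeWindow(Time.seconds(5))
      .sum("count")
      .print()
      .setParallelism(1)
    env.execute("StreamingWCScalaKeyApp")
  }
}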
DataSet API development
Brief overview:
Source: where the data comes from
reading files
local collections
Source => Flink(transformations) ==> Sink
Sink: where the results go
(distributed) files
or to standard output
DataSource
File-based
Collection-based
Code example:
Scala version
package com.wj.flink.datasource
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.Configuration
object DataSetDataSourceApp {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
//fromCollection(env)
// textFile(env)
//csvFile(env)
// readRecursiveFiles(env)
readCompressionFiles(env)
}
def readCompressionFiles(env:ExecutionEnvironment): Unit ={
val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\compression"
env.readTextFile(filePath).print()
}
//recursively read files from nested directories
def readRecursiveFiles(env:ExecutionEnvironment): Unit ={
val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\nested"
env.readTextFile(filePath).print()
println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
val parameters = new Configuration()
parameters.setBoolean("recursive.file.enumeration",true)
env.readTextFile(filePath).withParameters(parameters).print()
}
//read a CSV file
def csvFile(env: ExecutionEnvironment): Unit ={
import org.apache.flink.api.scala._
val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\people.csv"
// env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print();
// env.readCsvFile[(String,Int)](filePath,ignoreFirstLine = true,includedFields = Array(0,1)).print();
// case class MyCaseClass(name:String,age:Int)
// env.readCsvFile[MyCaseClass](filePath,ignoreFirstLine = true,includedFields = Array(0,1)).print();
env.readCsvFile[Person](filePath,ignoreFirstLine = true,pojoFields = Array("name","age","work")).print();
}
def textFile(env:ExecutionEnvironment): Unit ={
//read a single file
// val filePath = "C:\\Users\\Administrator\\Desktop\\hello.txt"
// env.readTextFile(filePath).print()
//read a whole directory
val filePath = "C:\\Users\\Administrator\\Desktop\\inputs"
env.readTextFile(filePath).print()
}
def fromCollection(env:ExecutionEnvironment): Unit ={
import org.apache.flink.api.scala._
val data = 1 to 10
env.fromCollection(data).print()
}
}
Java version
package com.wj.flink.datasource;
import org.apache.flink.api.java.ExecutionEnvironment;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
public class JavaDataSetDataSourceApp {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// formCollection(env);
textFile(env);
}
public static void textFile(ExecutionEnvironment env) throws Exception {
String filePath = "C:\\Users\\Administrator\\Desktop\\hello.txt";
env.readTextFile(filePath).print();
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~");
filePath = "C:\\Users\\Administrator\\Desktop\\inputs";
env.readTextFile(filePath).print();
}
public static void formCollection(ExecutionEnvironment env )throws Exception {
env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10)).print();
}
}
Transformation
DataSetTransformatiionApp.scala
package com.wj.flink.datasource
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.collection.mutable.ListBuffer
object DataSetTransformatiionApp {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
// mapFunction(env)
// filterFunction(env)
// mapPartitionFunction(env)
// firstFunction(env)
// flatMapFunction(env)
// distinctFunction(env)
// joinFunction(env)
// outerJoinFunction(env)
crossFunction(env)
}
//Cartesian product
def crossFunction(env:ExecutionEnvironment): Unit ={
val info1 = List("曼联","曼城")
val info2 = List(3,1,0)
val data1 = env.fromCollection(info1)
val data2 = env.fromCollection(info2)
data1.cross(data2).print()
}
//outer joins
def outerJoinFunction(env:ExecutionEnvironment): Unit ={
val info1 = ListBuffer[(Int,String)]() //(id, name)
info1.append((1,"PK哥"))
info1.append((2,"J哥"))
info1.append((3,"小队长"))
info1.append((4,"猪头胡"))
val info2 = ListBuffer[(Int,String)]() //(id, city)
info2.append((1,"北京"))
info2.append((2,"上海"))
info2.append((3,"成都"))
info2.append((5,"杭州"))
val data1 = env.fromCollection(info1)
val data2 = env.fromCollection(info2)
//note: where() refers to a field of the left dataset, equalTo() to a field of the right one
//a left outer join keeps every record from the left side
// data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
// if (second == null){
// (first._1,first._2,"-")
// }else{
// (first._1,first._2,second._2)
// }
// }).print()
//right outer join
// data1.rightOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
// if (first == null){
// (second._1,"-",second._2)
// }else{
// (first._1,first._2,second._2)
// }
// }).print()
//full outer join
data1.fullOuterJoin(data2).where(0).equalTo(0).apply((first,second)=>{
if (first == null){
(second._1,"-",second._2)
}else if(second==null){
(first._1,first._2,"-")
}else{
(first._1,first._2,second._2)
}
}).print()
}
//inner join
def joinFunction(env:ExecutionEnvironment): Unit ={
val info1 = ListBuffer[(Int,String)]() //(id, name)
info1.append((1,"PK哥"))
info1.append((2,"J哥"))
info1.append((3,"小队长"))
info1.append((4,"猪头胡"))
val info2 = ListBuffer[(Int,String)]() //(id, city)
info2.append((1,"北京"))
info2.append((2,"上海"))
info2.append((3,"成都"))
info2.append((5,"杭州"))
val data1 = env.fromCollection(info1)
val data2 = env.fromCollection(info2)
//note: where() refers to a field of the left dataset, equalTo() to a field of the right one
data1.join(data2).where(0).equalTo(0).apply((first,second)=>{
(first._1,first._2,second._2)
}).print()
}
//deduplicate
def distinctFunction(env:ExecutionEnvironment): Unit ={
val info = ListBuffer[String]()
info.append("hadoop,spark")
info.append("hadoop,flink")
info.append("flink,flink")
val data = env.fromCollection(info)
data.flatMap(_.split(",")).distinct().print()
}
//one input element can produce multiple output elements
def flatMapFunction(env:ExecutionEnvironment): Unit ={
val info = ListBuffer[String]()
info.append("hadoop,spark")
info.append("hadoop,flink")
info.append("flink,flink")
val data = env.fromCollection(info)
// data.print()
// data.map(_.split(",")).print()
// data.flatMap(_.split(",")).print()
data.flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1).print()
}
def firstFunction(env:ExecutionEnvironment): Unit ={
val info = ListBuffer[(Int,String)]()
info.append((1,"Hadoop"))
info.append((1,"Spark"))
info.append((1,"Flink"))
info.append((2,"Java"))
info.append((2,"Spring Boot"))
info.append((3,"Linux"))
info.append((4,"Vue"))
val data = env.fromCollection(info)
// data.first(3).print()
// data.groupBy(0).first(2).print()//after grouping, take the first n elements of each group
data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
}
//a DataSource of 100 elements; store the results in a database
def mapPartitionFunction(env:ExecutionEnvironment): Unit ={
val students = new ListBuffer[String]
for(i<- 1 to 100){
students.append("students: "+i)
}
val data = env.fromCollection(students).setParallelism(4)
// data.map(x=>{
// //to store each element in the database, we first need to get a connection
// val connection = DBUtils.getConnection()
// println(connection+".......")
//
// //TODO ... save the data to the DB
// DBUtils.returnConnection(connection)
// }).print()
//note: with mapPartition we do not open a connection per element, so system resources are not exhausted; the number of connections depends on the parallelism set by setParallelism(4)
data.mapPartition(x=>{
val connection = DBUtils.getConnection()
println(connection+".......")
//TODO ...保存数据到DB
DBUtils.returnConnection(connection)
x
}).print()
}
def filterFunction(env: ExecutionEnvironment): Unit ={
val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
data.map(_+1).filter(_>5).print()
}
def mapFunction(env:ExecutionEnvironment): Unit ={
val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
// data.map((x:Int)=>x+1).print()
// data.map(x=>x+1).print()
data.map(_+1).print()
}
}
JavaDataSetTransformationApp.java
package com.wj.flink.datasource;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class JavaDataSetTransformationApp {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// mapFunction(env);
// filterFunction(env);
// mapPartitionFunction(env);
firstFunction(env);
}
public static void firstFunction(ExecutionEnvironment env) throws Exception {
List<Tuple2<Integer,String>> info = new ArrayList<Tuple2<Integer, String>>();
info.add(new Tuple2(1,"Hadoop"));
info.add(new Tuple2(1,"Spark"));
info.add(new Tuple2(1,"Flink"));
info.add(new Tuple2(2,"Java"));
info.add(new Tuple2(2,"Spring Boot"));
info.add(new Tuple2(3,"Linux"));
info.add(new Tuple2(4,"Vue"));
DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
data.first(3).print();
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~");
data.groupBy(0).first(2).print();
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~");
data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
}
public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
DataSource<Integer> data = env.fromCollection(list);
// data.map(new MapFunction<Integer, Integer>() {
// public Integer map(Integer integer) throws Exception {
// String connection = DBUtils.getConnection();
// System.out.println("connection:"+connection);
// DBUtils.returnConnection(connection);
// return integer;
// }
// }).print();
data.mapPartition(new MapPartitionFunction<Integer, Integer>() {
public void mapPartition(Iterable<Integer> iterable, Collector<Integer> collector) throws Exception {
String connection = DBUtils.getConnection();
System.out.println("connection:"+connection);
DBUtils.returnConnection(connection);
}
}).print();
}
public static void filterFunction(ExecutionEnvironment env) throws Exception {
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
DataSource<Integer> data = env.fromCollection(list);
data.map(new MapFunction<Integer, Integer>() {
public Integer map(Integer integer) throws Exception {
return integer+1;
}
}).filter(new FilterFunction<Integer>() {
public boolean filter(Integer integer) throws Exception {
return integer>5;
}
}).print();
}
public static void mapFunction(ExecutionEnvironment env) throws Exception {
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
DataSource<Integer> data = env.fromCollection(list);
data.map(new MapFunction<Integer, Integer>() {
public Integer map(Integer integer) throws Exception {
return integer+1;
}
}).print();
}
}
DBUtils.scala
package com.wj.flink.datasource
import scala.util.Random
object DBUtils {
def getConnection()={
new Random().nextInt(10)+""
}
def returnConnection(connection:String): Unit ={
}
}
Sink
Code example (Scala):
package com.wj.flink.datasource
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
object DataSetSinkApp {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val data = 1 to 10
val text = env.fromCollection(data)
val filePath = "C:\\Users\\Administrator\\Desktop\\sinkout"
text.writeAsText(filePath,WriteMode.OVERWRITE).setParallelism(2)
env.execute("DataSetSinkApp")
}
}
Flink counters (accumulators)
Code example (Scala):
package com.wj.flink.datasource
import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.FileSystem.WriteMode
/**
 * Three steps to use a counter in Flink:
 * step1: define a counter
 * step2: register the counter
 * step3: fetch the counter result
 */
object CounterApp {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements("hadoop","spark","flink","pyspark","strom")
// data.map(new RichMapFunction[String,Long] {
// var counter = 0l
// override def map(in: String): Long = {
// counter+=1
// println("counter : "+counter)
// counter
// }
// }).setParallelism(3).print()
// data.print()
val info = data.map(new RichMapFunction[String,String] {
//step1: define a counter
val counter = new LongCounter()
override def open(parameters: Configuration): Unit = {
//step2: register the counter
getRuntimeContext.addAccumulator("ele-counts-scala",counter)
}
override def map(in: String): String = {
counter.add(1)
in
}
})
info.writeAsText("C:\\Users\\Administrator\\Desktop\\sink-scala-counter-out",
WriteMode.OVERWRITE).setParallelism(3)
val jobResult = env.execute("CounterApp")
//step3: fetch the counter result
val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")
println("num: "+num)
}
}
Flink distributed cache
Code example (Scala):
package com.wj.flink.datasource
import org.apache.commons.io.FileUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.configuration.Configuration
/**
 * step1: register a local or HDFS file
 *
 * step2: read the distributed-cache content in the open() method
 */
object DistributedCacheApp {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val filePath = "C:\\Users\\Administrator\\Desktop\\inputs\\hello.txt"
//step1: register a local or HDFS file
env.registerCachedFile(filePath,"pk-scala-dc")
import org.apache.flink.api.scala._
val data = env.fromElements("hadoop","spark","flink","pyspark","storm")
data.map(new RichMapFunction[String,String] {
//step2: read the distributed-cache content in the open() method
override def open(parameters: Configuration): Unit = {
val dcFile = getRuntimeContext.getDistributedCache().getFile("pk-scala-dc")
val lines = FileUtils.readLines(dcFile) //java
/**
 * note: Java and Scala collections are not compatible here, hence the JavaConverters import below
 */
import scala.collection.JavaConverters._
for (ele <- lines.asScala){ //scala
println(ele)
}
}
override def map(in: String): String = {
in
}
}).print()
}
}
DataStream API development
DataSource
Sources can be based on files, collections, individual elements, or a custom SourceFunction.
package com.wj.flink.datasource
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object DataStreamSourceApp{
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// socketFunction(env)
// nonparallelSourceFunction(env)
// parallelSourceFunction(env)
richparallelSourceFunction(env)
env.execute("DataStreamSourceApp")
}
/**
 * Use a custom data source.
 * Note: the parallelism can be set greater than 1.
 * @param env
 */
def richparallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
val data = env.addSource(new CustomRichParallelSourceFunction()).setParallelism(2)
data.print()
}
/**
 * Use a custom data source.
 * Note: the parallelism can be set greater than 1.
 * @param env
 */
def parallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
val data = env.addSource(new CustomParallelSourceFunction()).setParallelism(2)
data.print()
}
/**
 * Use a custom data source.
 * Note: the parallelism can only be set to 1.
 * @param env
 */
def nonparallelSourceFunction(env:StreamExecutionEnvironment): Unit ={
val data = env.addSource(new CustomNonparallelSourceFunction()).setParallelism(1)
data.print().setParallelism(1)
}
/**
 * The data source is a socket.
 * @param env the execution environment
 *
 * Start a server with: nc -lk 9999
 */
def socketFunction(env:StreamExecutionEnvironment): Unit ={
val data = env.socketTextStream("localhost", 9999)
data.print().setParallelism(1) //where you set the parallelism matters: the effect differs depending on the operator!
}
}
CustomNonparallelSourceFunction.scala
package com.wj.flink.datasource
import org.apache.flink.streaming.api.functions.source.SourceFunction
/**
 * A custom data source that is not parallel.
 */
class CustomNonparallelSourceFunction extends SourceFunction[Long] {
var count = 1L
var isRunning = true
override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
while (isRunning){
ctx.collect(count)
count+=1
Thread.sleep(1000)
}
}
override def cancel(): Unit = {
isRunning = false
}
}
CustomParallelSourceFunction.scala
package com.wj.flink.datasource
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
/**
 * A custom data source whose parallelism can be set.
 */
class CustomParallelSourceFunction extends ParallelSourceFunction[Long]{
var count = 1L
var isRunning = true
override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
while (isRunning){
sourceContext.collect(count)
count+=1
Thread.sleep(1000)
}
}
override def cancel(): Unit = {
isRunning=false
}
}
CustomRichParallelSourceFunction.scala
package com.wj.flink.datasource
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
var count = 1L
var isRunning = true
override def open(parameters: Configuration): Unit = super.open(parameters)
override def close(): Unit = super.close()
override def run(sourceContext: SourceFunction.SourceContext[Long]): Unit = {
while (isRunning){
sourceContext.collect(count)
count+=1
Thread.sleep(1000)
}
}
override def cancel(): Unit = {
isRunning = false
}
}
Transformation
Code example:
package com.wj.flink.datasource
import java.{lang, util}
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
object DataStreamTransformationApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// filterFunction(env)
// unionFunction(env)
splitSelectunionFunction(env)
env.execute("DataStreamTransformationApp")
}
def splitSelectunionFunction(env:StreamExecutionEnvironment): Unit ={
import org.apache.flink.api.scala._
val data = env.addSource(new CustomNonparallelSourceFunction)
//the returned type is SplitStream[T]
var splits = data.split(new OutputSelector[Long] {
val list = new util.ArrayList[String]()
override def select(value: Long): lang.Iterable[String] = {
if (value%2==0){
list.add("even")
}else{
list.add("odd")
}
list
}
})
splits.select("even","odd").print().setParallelism(1)
}
def unionFunction(env:StreamExecutionEnvironment): Unit ={
import org.apache.flink.api.scala._
val data1 = env.addSource(new CustomNonparallelSourceFunction)
val data2 = env.addSource(new CustomNonparallelSourceFunction)
data1.union(data2).print().setParallelism(1)
}
def filterFunction(env:StreamExecutionEnvironment): Unit ={
import org.apache.flink.api.scala._
val data = env.addSource(new CustomNonparallelSourceFunction)
data.map(x=>{
println("received: "+x)
x
}).filter(_%2==0).print().setParallelism(1)
}
}
Custom Sink
Custom sink summary:
1. Extend RichSinkFunction<T>, where T is the type of the object you want to write
2. Override the methods:
open/close: lifecycle methods
invoke: called once per record
Code example:
First add the mysql-connector-java dependency:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.25</version>
</dependency>
Then define the custom class SinkToMySQL:
package com.wj.flink.datasource;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
/**
 * Write Flink data into MySQL.
 *
 * Custom sink summary:
 * 1. RichSinkFunction<T>: T is the type of the object you want to write
 * 2. Override the methods:
 * open/close: lifecycle methods
 * invoke: called once per record
 */
public class SinkToMySQL extends RichSinkFunction<Student> {
Connection connection;
PreparedStatement pstmt;
private Connection getConnection(){
Connection conn = null;
try {
Class.forName("com.mysql.jdbc.Driver");
String url = "jdbc:mysql://localhost:3306/imooc_flink";
conn = DriverManager.getConnection(url,"root","root");
} catch (Exception e) {
e.printStackTrace();
}
return conn;
}
/**
 * Create the connection in the open method.
 * @param parameters
 * @throws Exception
 */
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
connection = getConnection();
String sql = "insert into student(id,name,age) values(?,?,?)";
pstmt = connection.prepareStatement(sql);
System.out.println("open");
}
//called once for every record to insert
public void invoke(Student value, Context context) throws Exception {
System.out.println("invoke~~~~~~~~~~~~~~~~~~~~");
//bind the placeholders
pstmt.setInt(1,value.getId());
pstmt.setString(2,value.getName());
pstmt.setInt(3,value.getAge());
pstmt.executeUpdate();
}
/**
 * Release the resources in close.
 * @throws Exception
 */
@Override
public void close() throws Exception {
super.close();
if (pstmt!=null){
pstmt.close();
}
if (connection!=null){
connection.close();
}
}
}
Finally, write a main method to test it:
package com.wj.flink.datasource;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
 * Test the custom sink by writing data into MySQL.
 *
 * On Windows, listen on the port with: nc -l -p 7777
 */
public class JavaCustomSinkToMySQL {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> source = env.socketTextStream("localhost", 7777);
SingleOutputStreamOperator<Student> studentStream = source.map(new MapFunction<String, Student>() {
public Student map(String value) throws Exception {
String[] splits = value.split(",");
Student stu = new Student();
stu.setId(Integer.parseInt(splits[0]));
stu.setName(splits[1]);
stu.setAge(Integer.parseInt(splits[2]));
return stu;
}
});
//write the data out to MySQL
studentStream.addSink(new SinkToMySQL());
env.execute("JavaCustomSinkToMySQL");
}
}
Table & SQL API
DataSet & DataStream API
- You need to be familiar with two APIs (DataSet / DataStream) in Java/Scala, just as:
MapReduce ==> Hive SQL
Spark ==> Spark SQL
Flink ==> SQL
- Flink supports both batch and stream processing; how can the two be unified at the API level?
Setting up the environment:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
Scala code example:
package com.wj.flink.datasource
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.types.Row
object TableSQLAPI {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
val filePath = "C:\\Users\\Administrator\\Desktop\\table-api\\sales.csv"
import org.apache.flink.api.scala._
//at this point we have a DataSet
val csv = env.readCsvFile[SalesLog](filePath,ignoreFirstLine = true)
// csv.print()
//DataSet ==> Table
val salesTable = tableEnv.fromDataSet(csv)
//register the Table under a name: Table ==> a registered table
tableEnv.registerTable("sales",salesTable)
//SQL query
val resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales group by customerId")
//output the result
tableEnv.toDataSet[Row](resultTable).print()
}
case class SalesLog(transactionId:String,
customerId:String,
itemId:String,
amountPaid:Double)
}
Java example:
package com.wj.flink.datasource;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;
public class JavaTableSQLAPI {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
String filePath = "C:\\Users\\Administrator\\Desktop\\table-api\\sales.csv";
DataSource<Sales> csv = env.readCsvFile(filePath)
.ignoreFirstLine()
.pojoType(Sales.class,"transactionId","customerId","itemId","amountPaid");
// csv.print();
Table sales = tableEnv.fromDataSet(csv);
tableEnv.registerTable("sales",sales);
Table resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales group by customerId");
DataSet<Row> rowDataSet = tableEnv.toDataSet(resultTable, Row.class);
rowDataSet.print();
}
public static class Sales{
public String transactionId;
public String customerId;
public String itemId;
public Double amountPaid;
}
}
Understanding time in Flink
- Event time
Every log record carries its own timestamp; the best and most accurate option
- Ingestion time
The time a record enters Flink, i.e. when the source picks it up; more reliable than processing time
- Processing time
Not necessarily accurate, since it depends on the local clock
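A minimal sketch of choosing among these three notions of time with the Flink 1.7 API (the project section later uses event time):
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

object TimeCharacteristicSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // event time: use the timestamp carried in each record (requires timestamps/watermarks)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // alternatives:
    // env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)  // the time the source saw the record
    // env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) // default: the operator's local clock
  }
}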
Window
Concepts: windows are either keyed or non-keyed; keyed windows process elements in parallel across multiple tasks, while non-keyed windows run with a parallelism of 1.
Window assigner: defines how elements are assigned to windows.
An element may be assigned to one or more windows.
Window types:
- Tumbling windows
Fixed size, windows do not overlap
- Sliding windows
Fixed size, windows may overlap, so an element can be assigned to multiple windows
- Session windows
- Global windows
Flink distinguishes two broad categories of windows, as shown in a figure in the original notes (not reproduced here).
Code example for tumbling and sliding windows:
package com.wj.flink.datasource
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._
object WindowsApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999)
text.flatMap(_.split(","))
.map((_,1))
.keyBy(0)
// .timeWindow(Time.seconds(5)) //tumbling window (processing time by default); elements are not repeated
.timeWindow(Time.seconds(10),Time.seconds(5))//sliding window; elements may appear in more than one window
.sum(1)
.print()
.setParallelism(1)
env.execute("WindowsApp")
}
}
Window functions: ReduceFunction
Code example:
package com.wj.flink.datasource
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
object WindowsReduceApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999)
//the incoming data are strings; here we use numbers instead to demonstrate the incremental aggregation
text.flatMap(_.split(","))
.map(x=>(1,x.toInt)) //1,2,3 ==> (1,1)(1,2)(1,3)
.keyBy(0) //every key is 1, so all elements go to the same task
.timeWindow(Time.seconds(5))
.reduce((v1,v2)=>{ //does not wait for all the data in the window; elements are reduced two at a time
println(v1+"......"+v2)
(v1._1,v1._2+v2._2)
})
.print()
.setParallelism(1)
env.execute("WindowsReduceApp")
}
}
Connectors: Kafka
Using Kafka as a Flink source
Code example:
package com.wj.flink.datasource
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode
/**
 * Consume a Kafka topic as the data source.
 */
object KafkaConnectorConsumerApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//common checkpoint settings; they apply on both the producer and the consumer side
env.enableCheckpointing(4000)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.setCheckpointTimeout(1000)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
val topic = "wjtest"
val properties = new Properties()
properties.setProperty("bootstrap.servers","192.168.162.128:9092")
properties.setProperty("group.id","test")
//FlinkKafkaConsumer manages offset commits for us automatically
val data = env.addSource(new FlinkKafkaConsumer[String](topic,new SimpleStringSchema(),
properties))
data.print()
env.execute("KafkaConnectorConsumerApp")
}
}
Using Kafka as a Flink sink
Code example:
package com.wj.flink.datasource
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper
object KafkaConnectorProducerApp {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val data = env.socketTextStream("localhost", 9999)
val topic = "wjtest"
val properties = new Properties()
properties.setProperty("bootstrap.servers","192.168.162.128:9092")
val kafkaSink = new FlinkKafkaProducer[String](topic,
new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),properties)
//sink the data to Kafka
data.addSink(kafkaSink)
env.execute("KafkaConnectorProducerApp")
}
}
Flink deployment
Single-machine deployment
Prerequisites: JDK 8, Maven 3
Build Flink from source instead of downloading the binary distribution directly.
Download:
- On the server: wget https://github.com/apache/flink/archive/release-1.7.0.tar.gz
- Locally: https://github.com/apache/flink/archive/release-1.7.0.tar.gz
Build command: mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.6.0-cdh5.15.1 -Dfast
Copy the build output (E:\IDEA_WORK_SPACE\flink-release-1.7.0\flink-release-1.7.0\flink-dist\target\flink-1.7.0-bin) to the server and run:
./bin/start-cluster.sh
Stop command: ./bin/stop-cluster.sh
Open the web UI in a browser: http://192.168.162.128:8081/#/overview
Running an example:
- On the server, run: nc -lk 9000
- Submit the job: ./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
Standalone cluster
Flink must be deployed under the same directory path on every machine in the cluster.
Every machine needs the IP-to-hostname mappings of the others.
Requirements:
- Java 1.8.x or higher
- SSH connectivity between all the machines:
ping hadoop000
ping hadoop001
ping hadoop002
- Configure flink-conf.yaml
jobmanager.rpc.address: 10.0.0.1   (the IP of the master node)
jobmanager: the master node
taskmanager: the worker nodes
slaves: one IP/hostname per line
- Common configuration keys (a sample flink-conf.yaml is sketched below)
jobmanager.rpc.address    address of the master node
jobmanager.heap.mb        memory available to the JobManager
taskmanager.heap.mb       memory available to each TaskManager
taskmanager.numberOfTaskSlots  number of CPU slots per machine, determines the parallelism
parallelism.default       default parallelism of jobs
taskmanager.tmp.dirs      temporary data directories of the TaskManager
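A sketch of how these keys might look together in flink-conf.yaml (the hostname, sizes and paths below are placeholder values, not taken from the original notes):
jobmanager.rpc.address: hadoop000
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
parallelism.default: 1
taskmanager.tmp.dirs: /tmp/flink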
Running ON YARN is the approach used most in production. *****
There are two ways to run on YARN: start a long-running YARN session shared by many jobs, or submit each job as its own YARN application (per-job mode); rough example commands are sketched below.
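A hedged sketch of both modes (flags as I recall them from the Flink 1.7 on-YARN documentation; verify against your version, container counts and memory are placeholders):
# 1) Session mode: start a long-running Flink cluster on YARN, then submit jobs to it
./bin/yarn-session.sh -n 2 -jm 1024m -tm 2048m
./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
# 2) Per-job mode: each job becomes its own YARN application
./bin/flink run -m yarn-cluster -yn 2 examples/streaming/SocketWindowWordCount.jar --port 9000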
Common optimization strategies in Flink
- Resources
- Parallelism
The default is 1; adjust it appropriately (it can be set in several places ==> see the project section)
- Data skew
e.g. 100 tasks where 98-99 finish quickly and 1-2 run very slowly (and may or may not ever finish)
group by: two-phase aggregation, first aggregate on (key + random salt), then strip the salt and aggregate on the original key (see the sketch after this list)
join on xxx=xxx
repartition-repartition strategy: large table joined with large table
broadcast-forward strategy: large table joined with small table
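A hedged Scala sketch of the two-phase (salted key) aggregation idea for a skewed group-by; the input data, salt factor and key format are illustrative assumptions:
import org.apache.flink.api.scala._
import scala.util.Random

object TwoPhaseAggSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // a skewed input: the key "hot" dominates
    val words = env.fromElements("hot", "hot", "hot", "hot", "cold")

    words
      .map(w => (w + "#" + Random.nextInt(4), 1)) // phase 1 key: original key + random salt (0..3)
      .groupBy(0).sum(1)                          // phase 1: partial counts per salted key
      .map(t => (t._1.split("#")(0), t._2))       // strip the salt
      .groupBy(0).sum(1)                          // phase 2: final counts per original key
      .print()                                    // print() triggers execution for a DataSet job
  }
}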
End-to-end project
The data being ingested are log records.
Offline: Flume ==> HDFS
Real-time: Kafka ==> stream processing engine ==> ES ==> Kibana
Project features
- Count the traffic generated by each domain within one minute
Flink receives the Kafka data and processes it
- Count the traffic generated by each user within one minute
Domains map to users, so: Flink receives the Kafka data + reads the domain-to-user mapping data, and processes them together
Data source
Mock ******
Project architecture
Mocking data: a skill you must master
The real data may be sensitive
In multi-team collaboration you depend on services or interfaces provided by other teams
So we mock the data by sending it to the Kafka broker ourselves
Java/Scala code: producer
Kafka console consumer: consumer
Project requirement: traffic per domain over the last minute
Project code
Mocking data into Kafka:
package com.wj.flink.project;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;
public class PKKafkaProducer {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers","192.168.162.128:9092");
properties.setProperty("key.serializer", StringSerializer.class.getName());
properties.setProperty("value.serializer", StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);
String topic = "wjtest";
//keep producing data to the Kafka broker in an endless loop
while (true){
//build a random record
StringBuilder builder = new StringBuilder();
builder.append("imooc").append("\t")
.append("CN").append("\t")
.append(getLevels()).append("\t")
.append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date())).append("\t")
.append(getIps()).append("\t")
.append(getDomains()).append("\t")
.append(getTrafffic()).append("\t");
System.out.println(builder.toString());
//send the record to Kafka
producer.send(new ProducerRecord<String, String>(topic,builder.toString()));
Thread.sleep(2000);
}
}
private static int getTrafffic() {
return new Random().nextInt(10000);
}
private static String getDomains() {
String[] domins = {
"v1.go2yd.com",
"v2.go2yd.com",
"v3.go2yd.com",
"v4.go2yd.com",
"vmi.go2yd.com"
};
return domins[new Random().nextInt(domins.length)];
}
private static String getIps() {
String[] ips = {
"223.104.18.110",
"113.101.75.194",
"27.17.127.135",
"183.225.139.16",
"112.1.66.34",
"175.148.211.190",
"183.227.58.21",
"59.83.198.84",
"117.28.38.28",
"117.59.39.169"
};
return ips[new Random().nextInt(ips.length)];
}
//generate the level field
public static String getLevels(){
String[] levels = {"M", "E"};
return levels[new Random().nextInt(levels.length)];
}
}
Cleansing the data and writing it to ES
package com.wj.flink.project
import java.text.SimpleDateFormat
import java.util
import java.util.Properties
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.flink.util.Collector
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.slf4j.LoggerFactory
object LogAnalysis {
def main(args: Array[String]): Unit = {
//log properly in production
val logger = LoggerFactory.getLogger("LogAnalysis")
val env = StreamExecutionEnvironment.getExecutionEnvironment
//use event time: the timestamp at which the log record was produced
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val topic = "wjtest"
val properties = new Properties()
properties.setProperty("bootstrap.servers","192.168.162.128:9092")
properties.setProperty("group.id","test")
val consumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), properties)
//receive the Kafka data
val data = env.addSource(consumer)
val logData = data.map(x => {
val splits = x.split("\t")
val level = splits(2)
val timeStr = splits(3)
var time = 0l
try {
val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
time = sourceFormat.parse(timeStr).getTime
} catch {
case e: Exception =>{
logger.error(s"time parse error: $timeStr",e.getMessage)
}
}
val domain = splits(5)
val traffic = splits(6).toLong
(level, time, domain, traffic)
}).filter(_._2!=0).filter(_._1=="E")
.map(x=>{
(x._2,x._3,x._4) //fields of the input: 1 level (dropped), 2 time, 3 domain, 4 traffic
})
/**
 * When doing business processing in production, always think about the robustness
 * of the processing and the correctness of your data.
 * Dirty data, or data that violates the business rules, must all be filtered out
 * before the actual business logic runs.
 *
 * For our business we only need to count records with level = E;
 * records with any other level are not part of the metrics we report.
 *
 * Data cleansing means processing the raw input according to our business rules
 * until it satisfies the business requirements.
 */
// logData.print().setParallelism(1)
//assign watermarks via an anonymous class to deal with out-of-order data
var resultData = logData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(Long, String, Long)] {
val maxOutOfOrderness = 10000L
var currentMaxTimestamp: Long = _ //in Scala, _ initializes a var with its default value
override def getCurrentWatermark: Watermark = {
new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}
override def extractTimestamp(t: (Long, String, Long), l: Long): Long = {
val timestamp = t._1
currentMaxTimestamp = Math.max(timestamp,currentMaxTimestamp)
timestamp
}
}).keyBy(1) //key by the domain (field index 1)
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
.apply(new WindowFunction[(Long, String, Long),(String, String, Long),Tuple,TimeWindow] { //type parameters: input, output, key, window
override def apply(key: Tuple, window: TimeWindow, input: Iterable[(Long, String, Long)], out: Collector[(String, String, Long)]): Unit = {
val domain = key.getField(0).toString
var sum = 0l
val iterator = input.iterator
val timeArr = new util.ArrayList[Long]()
val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm")
while (iterator.hasNext){
val next = iterator.next()
sum+=next._3 //sum up the traffic
//TODO: next._1 gives us an event time that falls inside this window
timeArr.add(next._1)
}
/**
 * first field: the minute, e.g. 2019-09-09 20:20
 * second field: the domain
 * third field: the sum of the traffic
 */
val time = sourceFormat.format(timeArr.get(0))
out.collect((time,domain,sum))
}
}) //.print().setParallelism(1)
//finally the data is written to ES and visualized in Kibana
val httpHosts = new util.ArrayList[HttpHost]
httpHosts.add(new HttpHost("192.168.162.128",9200,"http"))
val esSinkBuilder = new ElasticsearchSink.Builder[(String,String,Long)](httpHosts,
new ElasticsearchSinkFunction[(String, String, Long)] {
def createIndexRequest(element: (String, String, Long)):IndexRequest={
val json = new util.HashMap[String, Any]()
json.put("time",element._1)
json.put("domain",element._2)
json.put("traffics",element._3)
val id = element._1+"-"+element._2
Requests.indexRequest()
.index("cdn")
.`type`("traffic")
.id(id)
.source(json)
}
override def process(t: (String, String, Long), runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
requestIndexer.add(createIndexRequest(t))
}
})
esSinkBuilder.setBulkFlushMaxActions(1)
//finally sink the data to ES
resultData.addSink(esSinkBuilder.build())
env.execute("LogAnalysis")
}
}
ES deployment
Requirements:
- CentOS 7.x
- A non-root user (e.g. hadoop)
Add the dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
Download and extract ES (do not use the root user):
tar -zxvf elasticsearch-6.6.2.tar.gz -C …/install/
Edit the config file elasticsearch.yml:
network.host: 0.0.0.0
Start:
./elasticsearch
./elasticsearch -d    (start in the background)
Finally, check that http://192.168.162.128:9200/ responds.
Kibana deployment
Download and extract:
tar zxvf kibana-6.6.2-linux-x86_64.tar.gz -C …/install/
Edit the config file kibana.yml:
server.host: "wangjun"
elasticsearch.hosts: ["http://wangjun:9200"]
Start:
./kibana
Finally, check that http://192.168.162.128:5601/ loads.
Commands to create the index:
curl -XPUT 'http://wangjun:9200/cdn'    # create the index
curl -H "Content-Type: application/json" -XPOST 'http://wangjun:9200/cdn/traffic/_mapping' -d '{
"traffic":{
"properties":{
"domain":{"type":"text"},
"traffics":{"type":"long"},
"time":{"type":"date","format":"yyyy-MM-dd HH:mm"}
}
}
}'
# create the type mapping (note: types are deprecated from ES 6.x onward)
Requirement: CDN business
One user id maps to multiple domains.
The mapping between user id and domain:
The domain can be read from the log, but the userid-to-domain mapping has to be fetched
from another table (MySQL).
SQL statements:
create table user_domain_config(
id int unsigned auto_increment,
user_id varchar(40) not null,
domain varchar(40) not null,
primary key (id)
);
insert into user_domain_config(user_id,domain) values
('8000000','v1.go2yd.com');
insert into user_domain_config(user_id,domain) values
('8000000','v2.go2yd.com');
insert into user_domain_config(user_id,domain) values
('8000000','v3.go2yd.com');
insert into user_domain_config(user_id,domain) values
('8000000','v4.go2yd.com');
insert into user_domain_config(user_id,domain) values
('8000000','vmi.go2yd.com');
During real-time data cleansing we not only process the raw logs but also join them with the data in the MySQL table.
We define a custom Flink source that reads the MySQL data and then connect the two streams.
Code example:
package com.wj.flink.project
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.collection.mutable
/**
 * A custom data source that reads from MySQL.
 */
class PKMySQLSource extends RichParallelSourceFunction[mutable.HashMap[String,String]]{
var connection:Connection = null
var ps:PreparedStatement = null
//open(): establish the connection
override def open(parameters: Configuration): Unit = {
val driver = "com.mysql.jdbc.Driver"
Class.forName(driver)
var url = "jdbc:mysql://localhost:3306/flink"
var user = "root"
var password = "root"
connection = DriverManager.getConnection(url,user,password)
val sql = "select user_id,domain from user_domain_config"
ps = connection.prepareStatement(sql)
}
//release the resources
override def close(): Unit = {
if (ps!=null){
ps.close()
}
if (connection!=null){
connection.close()
}
}
/**
 * The key part: read the rows from the MySQL table and wrap them in a Map.
 * @param sourceContext
 */
override def run(sourceContext: SourceFunction.SourceContext[mutable.HashMap[String, String]]): Unit = {
println("run function invoke.....")
val set = ps.executeQuery()
val map = new mutable.HashMap[String, String]()
while (set.next()){
map.put(set.getString(2),set.getString(1));
}
sourceContext.collect(map)
}
override def cancel(): Unit = {
// nothing to cancel here: run() reads the table once and returns
}
}
LogAnalysis02.scala
package com.wj.flink.project
import java.text.SimpleDateFormat
import java.util
import java.util.Properties
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.slf4j.LoggerFactory
import scala.collection.mutable
object LogAnalysis02 {
def main(args: Array[String]): Unit = {
//log properly in production
val logger = LoggerFactory.getLogger("LogAnalysis")
val env = StreamExecutionEnvironment.getExecutionEnvironment
//use event time: the timestamp at which the log record was produced
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val topic = "wjtest"
val properties = new Properties()
properties.setProperty("bootstrap.servers","192.168.162.128:9092")
properties.setProperty("group.id","test")
val consumer = new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), properties)
//receive the Kafka data
val data = env.addSource(consumer)
val logData = data.map(x => {
val splits = x.split("\t")
val level = splits(2)
val timeStr = splits(3)
var time = 0l
try {
val sourceFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
time = sourceFormat.parse(timeStr).getTime
} catch {
case e: Exception =>{
logger.error(s"time parse error: $timeStr",e.getMessage)
}
}
val domain = splits(5)
val traffic = splits(6).toLong
(level, time, domain, traffic)
}).filter(_._2!=0).filter(_._1=="E")
.map(x=>{
(x._2,x._3,x._4) //fields of the input: 1 level (dropped), 2 time, 3 domain, 4 traffic
})
/**
 * When doing business processing in production, always think about the robustness
 * of the processing and the correctness of your data.
 * Dirty data, or data that violates the business rules, must all be filtered out
 * before the actual business logic runs.
 *
 * For our business we only need to count records with level = E;
 * records with any other level are not part of the metrics we report.
 *
 * Data cleansing means processing the raw input according to our business rules
 * until it satisfies the business requirements.
 */
// logData.print().setParallelism(1)
val mysqlData = env.addSource(new PKMySQLSource).setParallelism(1)
// mysqlData.print().setParallelism(1)
val connectData = logData.connect(mysqlData)
.flatMap(new CoFlatMapFunction[(Long,String,Long),mutable.HashMap[String, String],String] {
var userDomainMap = mutable.HashMap[String,String]()
//log
override def flatMap1(value: (Long, String, Long), collector: Collector[String]): Unit = {
print("flatMap1 invoke .....")
val domain = value._2
val userId = userDomainMap.getOrElse(domain,"")
println("~~~~~~~~~~~~~"+userId)
collector.collect(value._1+"\t"+value._2+"\t"+value._3+"\t"+userId)
}
//mysql
override def flatMap2(value: mutable.HashMap[String, String], collector: Collector[String]): Unit = {
userDomainMap = value
}
})
connectData.print()
env.execute("LogAnalysis02")
}
}
Summary of data cleansing with Flink:
- Read the data from Kafka
- Read the data from MySQL
- connect the two streams
- Business logic: watermarks, WindowFunction
==> ES (mind the field data types) <= Kibana for graphical display of the aggregated results
- Kibana: graphical monitoring of every stage of the pipeline