Flink

Apache Flink

Overview

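Flink is a stateful computation framework built on top of data streams. Because it appeared relatively late (released in December 2014), it is commonly regarded as the third generation of stream processing frameworks.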

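First generation: MapReduce (2006), batch, disk-based, fixed Map -> Reduce direction | Storm (2014.9), streaming, low latency / low throughput

Second generation: Spark RDD (2014.2), batch, in-memory, DAG (several Stages) | DStream emulates stream processing with micro-batches, high latency / high throughput

Third generation: Flink DataStream (2014.12), streaming, in-memory, Dataflow Graph (several Tasks) | Flink DataSet builds batch processing on top of the streaming engine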

Application areas for stream processing: risk control, intelligent transportation, disease prediction, internet finance, …

Flink Architecture

The Big Picture

[Image: assets/1573616701836.png]

Flink VS Spark

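Spark's compute core is built on batch processing over RDDs and emulates streaming with batches. Flink is built on stream processing and emulates batch with streams.

Flink Runtime Architecture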

[Image: assets/1573617576378.png]

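JobManagers - the master. It coordinates distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures.

There is always at least one JobManager. A high-availability setup will have multiple JobManagers, one of which is always the leader, while the others are standby.

TaskManagers - the workers. They do the actual work by executing the subtasks of a Task (a Task is roughly equivalent to a Spark Stage), and they buffer and exchange (shuffle) the data of the streams. Each TaskManager connects to the JobManager, reports its own status, and reports the progress of the tasks assigned to it.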

There must always be at least one TaskManager.

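Client - translates the program into a Dataflow Graph before execution and submits that Dataflow Graph to the JobManager.

Task - Flink uses Operator Chains to divide a job into several Tasks. Each Task has its own parallelism, and Flink creates that many subtasks (threads) for it. Operator chaining reduces thread-to-thread communication cost and system overhead.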
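The way chaining turns operators into Tasks and subtasks can be sketched in plain Scala. This is only a model, not Flink's implementation: `Op`, `ChainModel`, and the `forwardFromPrevious` flag are hypothetical names, and the chaining rule is simplified to "same parallelism and a plain forward connection" (Flink checks further conditions).

```scala
// Plain-Scala model of operator chaining; not Flink's actual implementation.
final case class Op(name: String, parallelism: Int, forwardFromPrevious: Boolean)

object ChainModel {
  // Groups consecutive operators into tasks. An operator joins the previous
  // operator's chain only if the connection is a plain forward edge and both
  // operators have the same parallelism (simplified rule, assumed here).
  def chain(ops: Seq[Op]): Seq[Seq[Op]] =
    ops.foldLeft(Vector.empty[Vector[Op]]) { (tasks, op) =>
      tasks.lastOption match {
        case Some(task) if op.forwardFromPrevious && task.last.parallelism == op.parallelism =>
          tasks.init :+ (task :+ op)
        case _ =>
          tasks :+ Vector(op)
      }
    }

  // Each task runs as `parallelism` subtasks (threads).
  def subtaskCount(tasks: Seq[Seq[Op]]): Int = tasks.map(_.head.parallelism).sum
}
```

For a word count with a socket source (parallelism 1) followed by flatMap, map, a keyed sum, and print (all parallelism 3), this model yields three Tasks and seven subtasks: the map chains to the flatMap and the print chains to the sum, while the parallelism change and the keyBy each break the chain.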

[Image: assets/1573627188537.png]

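**Task Slots** - each Task Slot represents a subset of a TaskManager's computing resources. The slots divide the TaskManager's memory evenly: a TaskManager with 3 Task Slots gives each slot 1/3 of the memory. Subtasks of different jobs are isolated from each other by Task Slots, while subtasks of different tasks of the same job can share a slot. **By default all subtasks belong to the same resource group, default,** so the number of Task Slots a job needs equals the highest parallelism among its Tasks.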
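The slot arithmetic above can be written down as a small plain-Scala sketch. `TaskSpec` and `SlotEstimate` are hypothetical names used only for illustration. The model assumes that tasks in the same slot sharing group share slots and that different groups never do.

```scala
// Plain-Scala model of the slot requirement of one job; not a Flink API.
final case class TaskSpec(name: String, parallelism: Int, slotSharingGroup: String = "default")

object SlotEstimate {
  // Subtasks of different tasks in the same group can share a slot, so each
  // group needs as many slots as its most parallel task. Groups do not share
  // slots with each other, so the per-group requirements add up.
  def requiredSlots(tasks: Seq[TaskSpec]): Int =
    tasks.groupBy(_.slotSharingGroup).values.map(_.map(_.parallelism).max).sum
}
```

With everything in the default group, tasks of parallelism 1, 3, and 3 need 3 slots. Moving one task of parallelism 3 into its own group raises the requirement to 6.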

Setting Up the Flink Environment

  • Set the CentOS process and open-file limits (takes effect after a reboot) - optional
[root@Spark ~]# vi /etc/security/limits.conf

* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800
  • Configure the hostname (takes effect after a reboot)
[root@Spark ~]# cat /etc/hostname
Spark
  • Set up the IP mapping
[root@Spark ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.100  Spark
  • Stop and disable the firewall service
[root@Spark ~]# systemctl stop firewalld
[root@Spark ~]# systemctl disable firewalld
[root@Spark ~]# firewall-cmd --state
not running
  • Install JDK 1.8+
[root@Spark ~]# rpm -ivh jdk-8u171-linux-x64.rpm 
[root@Spark ~]# ls -l /usr/java/
total 4
lrwxrwxrwx. 1 root root   16 Mar 26 00:56 default -> /usr/java/latest
drwxr-xr-x. 9 root root 4096 Mar 26 00:56 jdk1.8.0_171-amd64
lrwxrwxrwx. 1 root root   28 Mar 26 00:56 latest -> /usr/java/jdk1.8.0_171-amd64
[root@Spark ~]# vi .bashrc 
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@Spark ~]# source ~/.bashrc
  • Configure passwordless SSH
[root@Spark ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4b:29:93:1c:7f:06:93:67:fc:c5:ed:27:9b:83:26:c0 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|         o   . . |
|      . + +   o .|
|     . = * . . . |
|      = E o . . o|
|       + =   . +.|
|        . . o +  |
|           o   . |
|                 |
+-----------------+
[root@Spark ~]# ssh-copy-id CentOS
The authenticity of host 'centos (192.168.40.128)' can't be established.
RSA key fingerprint is 3f:86:41:46:f2:05:33:31:5d:b6:11:45:9c:64:12:8e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'centos,192.168.40.128' (RSA) to the list of known hosts.
root@centos's password: 
Now try logging into the machine, with "ssh 'CentOS'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
[root@Spark ~]# ssh root@CentOS
Last login: Tue Mar 26 01:03:52 2019 from 192.168.40.1
[root@Spark ~]# exit
logout
Connection to CentOS closed.
  • Install and configure Flink
[root@Spark ~]# tar -zxf flink-1.8.1-bin-scala_2.11.tgz -C /usr/
[root@Spark ~]# cd /usr/flink-1.8.1/
[root@Spark ~]# vi conf/flink-conf.yaml
jobmanager.rpc.address: Spark
taskmanager.numberOfTaskSlots: 3

[root@Spark ~]#  vi conf/slaves
Spark
[root@Spark ~]# ./bin/start-cluster.sh

Quick Start

  • Add the dependencies
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
  • Quick start
import org.apache.flink.streaming.api.scala._

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3. Transform the data with operators
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

//4. Run the job
fsEnv.execute("FlinkWordCountsQuickStart")

Package the program, then run it from the web UI or with ./bin/flink run:

[root@Spark flink-1.8.1]# ./bin/flink run \
							-c com.baizhi.quickstart.FlinkWordCountsQuickStart \
							-p 3 \
							/root/flink-1.0-SNAPSHOT-jar-with-dependencies.jar
[root@Spark flink-1.8.1]# ./bin/flink list -m Spark:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
13.11.2019 16:49:36 : 36e8f1ec3173ccc2c5e1296d1564da87 : FlinkWordCountsQuickStart (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
[root@Spark flink-1.8.1]# ./bin/flink cancel -m Spark:8081 36e8f1ec3173ccc2c5e1296d1564da87
Cancelling job 36e8f1ec3173ccc2c5e1296d1564da87.
Cancelled job 36e8f1ec3173ccc2c5e1296d1564da87.

DataSource

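A DataSource defines the input of a streaming job. Users register one with StreamExecutionEnvironment.addSource(sourceFunction). Flink ships with several predefined sources. To write a custom one, implement the SourceFunction interface (non-parallel source), implement the ParallelSourceFunction interface (parallel source), or extend RichParallelSourceFunction.

File Based (for reference)

readTextFile(path) - reads a text file line by line using TextInputFormat and returns a DataStream[String]. The file is processed only once.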

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val filePath="file:///D:\\data"
val dataStream: DataStream[String] = fsEnv.readTextFile(filePath)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

readFile(fileInputFormat, path) - reads a file using the specified input format. The file is processed only once.

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val filePath="file:///D:\\data"
val inputFormat = new TextInputFormat(null)
val dataStream: DataStream[String] = fsEnv.readFile(inputFormat,filePath)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

readFile(fileInputFormat, path, watchType, interval, pathFilter) - the method that the two methods above call internally.

    //1. Create the StreamExecutionEnvironment
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStream
    val filePath="file:///D:\\data"
    val inputFormat = new TextInputFormat(null)

    inputFormat.setFilesFilter(new FilePathFilter {
      //Returning true excludes the file: files whose names start with "1" are skipped
      override def filterPath(path: Path): Boolean = path.getName.startsWith("1")
    })
    val dataStream: DataStream[String] = fsEnv.readFile(inputFormat,filePath,
      FileProcessingMode.PROCESS_CONTINUOUSLY,1000)
    //3. Transform the data
    dataStream.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(0)
      .sum(1)
      .print()

    fsEnv.execute("FlinkWordCountsQuickStart")

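The path is scanned periodically. If the content of a file is modified, the whole file is read again from the start, so elements may be processed more than once.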
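The effect of re-reading a modified file can be illustrated with a plain-Scala model. `ContinuousReadModel` is a hypothetical name, and the model assumes that every modification triggers one complete re-read, which is the behaviour described above for PROCESS_CONTINUOUSLY.

```scala
// Plain-Scala model of continuous file monitoring; not Flink code.
object ContinuousReadModel {
  // `versions` holds the full content of one file after each modification.
  // Every version is read from the start, so words that were already in the
  // file are counted again.
  def wordCounts(versions: Seq[String]): Map[String, Int] =
    versions
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

If a file first contains "a b" and is then changed to "a b c", the counts for a and b end up at 2 even though each word appears once in the final file.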

Collection (for testing)

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.fromCollection(List("this is a demo","hello world"))
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Custom SourceFunction (must know)

class UserDefineParallelSourceFunction extends ParallelSourceFunction[String]{

  val lines=Array("this is a demo","hello world","hello flink")
  @volatile
  var isRunning=true
  //Run loop: emit elements until the source is cancelled
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(1000)
      sourceContext.collect(lines(new Random().nextInt(lines.length)))
    }
  }
  //Cancel: stop the run loop
  override def cancel(): Unit = {
    isRunning=false
  }
}         
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.addSource(new UserDefineParallelSourceFunction)
dataStream.setParallelism(10)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Kafka Source (important)

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "Spark:9092")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

val flinkKafkaConsumer = new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props)

val dataStream: DataStream[String] = fsEnv.addSource(flinkKafkaConsumer)
dataStream.setParallelism(10)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

With SimpleStringSchema only the record value is available. To access the key, offset, or partition, implement a custom KafkaDeserializationSchema.

Accessing record metadata

class UserDefineKafkaDeserializationSchema
extends KafkaDeserializationSchema[(Int,Long,String,String,String)]{
    override def isEndOfStream(t: (Int, Long, String, String, String)): Boolean = false

    override def deserialize(r: ConsumerRecord[Array[Byte], Array[Byte]]): (Int, Long, String, String, String) = {
        //A record may have no key; otherwise decode the key bytes as a string
        val key = if (r.key() == null) "" else new String(r.key())
        (r.partition(), r.offset(), r.topic(), key, new String(r.value()))
    }
    //Tell Flink the type of the produced elements
    override def getProducedType: TypeInformation[(Int, Long, String, String, String)] = {
        createTypeInformation[(Int, Long, String, String, String)]
    }
}

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val props = new Properties()
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "Spark:9092")
props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

val flinkKafkaConsumer = new FlinkKafkaConsumer[(Int,Long,String,String,String)]("topic01",
                                                                                 new UserDefineKafkaDeserializationSchema(),props)

val dataStream: DataStream[(Int,Long,String,String,String)] = fsEnv.addSource(flinkKafkaConsumer)

dataStream.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Data Sink

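Data sinks consume a DataStream and write its data to external systems such as files, sockets, NoSQL stores, RDBMSs, or message queues. Flink predefines some common sinks, and users can build a custom sink by implementing SinkFunction or extending RichSinkFunction.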

https://ci.apache.org/projects/flink/flink-docs-release-1.9/zh/dev/datastream_api.html#data-sinks

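File Based (mainly for testing)

  • writeAsText() | writeAsCsv(…) | writeUsingOutputFormat() | writeToSocket (at-least-once)

    The write* methods are mainly intended for testing because they do not participate in checkpointing. When data is flushed depends on the OutputFormat implementation, so not all data is flushed to the file system immediately, and data can be lost on failure.

    The addSink method can take part in checkpointing and provide exactly-once output. It requires a user-defined sink or the Bucketing File Sink.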

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//Tuple2 here is Flink's Java tuple (org.apache.flink.api.java.tuple.Tuple2), which CsvOutputFormat requires
val output = new CsvOutputFormat[Tuple2[String, Int]](new Path("file:///D:/fink-results"))
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .sum(1)
    .map(t=> new Tuple2(t._1,t._2))
    .writeUsingOutputFormat(output)

fsEnv.execute("FlinkWordCountsQuickStart")
  • Bucketing File Sink (exactly-once)
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.2</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val bucketSink = new BucketingSink[String]("hdfs://Spark:9000/bucketSink")
bucketSink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HH", ZoneId.of("Asia/Shanghai")))

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.map(t=>t._1+"\t"+t._2)
.addSink(bucketSink)
//addSink() attaches a custom sink
fsEnv.execute("FlinkWordCountsQuickStart")

print()/ printToErr()

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print("error")
//The argument is a prefix added to every printed record, which helps tell the output of different streams apart
fsEnv.execute("FlinkWordCountsQuickStart")

Custom Sink (be proficient)

class UserDefineRichSinkFunction extends RichSinkFunction[(String,Int)]{
    override def open(parameters: Configuration): Unit = {
        println("open")
    }
    override def invoke(value: (String, Int)): Unit = {
        println("insert into xxx "+value)
    }

    override def close(): Unit = {
        println("close")
    }
}
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",7788)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new UserDefineRichSinkFunction)

fsEnv.execute("FlinkWordCountsQuickStart")

Redis Sink (must know)

Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/

<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment


val conf = new FlinkJedisPoolConfig.Builder()
.setHost("Spark")
.setPort(6379).build()

//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",7788)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new RedisSink(conf,new UserDefineRedisMapper))

fsEnv.execute("FlinkWordCountsQuickStart")
class UserDefineRedisMapper extends RedisMapper[(String,Int)]{
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"word-count")
  }

  override def getKeyFromData(t: (String, Int)): String = {
    t._1
  }

  override def getValueFromData(t: (String, Int)): String = {
    t._2.toString
  }
}

If Redis cannot be reached after installation, turn off protected mode by setting protected-mode no in redis.conf.

Kafka Sink (must know)

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"Spark:9092")
//Overriding the serializers is not recommended
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[ByteArraySerializer])
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[ByteArraySerializer])


props.put(ProducerConfig.RETRIES_CONFIG,"3")
props.put(ProducerConfig.ACKS_CONFIG,"-1")
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true")
props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.put(ProducerConfig.LINGER_MS_CONFIG,"500")


//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",7788)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new FlinkKafkaProducer[(String, Int)]("topicxx",new UserDefineKeyedSerializationSchema,props))

fsEnv.execute("FlinkWordCountsQuickStart")
class UserDefineKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
  override def serializeKey(t: (String, Int)): Array[Byte] = {
    t._1.getBytes()
  }

  override def serializeValue(t: (String, Int)): Array[Byte] = {
    t._2.toString.getBytes()
  }

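  //Overrides the default topic passed to FlinkKafkaProducer ("topicxx"); return null to use the default
  override def getTargetTopic(t: (String, Int)): String = "topic01"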
}

Operators (know how to use)

DataStream Transformations

DataStream → DataStream

Map

Takes one element and produces one element. A map function that doubles the values of the input stream:

dataStream.map { x => x * 2 }

FlatMap

Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words:

dataStream.flatMap { str => str.split(" ") }

Filter

Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values:

dataStream.filter { _ != 0 }

Union

Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream.

dataStream.union(otherStream1, otherStream2, ...)

DataStream, DataStream → ConnectedStreams

Connect

"Connects" two data streams retaining their types, allowing for shared state between the two streams.

val someStream : DataStream[Int] = ...
val otherStream : DataStream[String] = ...

val connectedStreams = someStream.connect(otherStream)

CoMap, CoFlatMap

Similar to map and flatMap on a connected data stream

connectedStreams.map(
    (_ : Int) => true,
    (_ : String) => false
)
connectedStreams.flatMap(
    (_ : Int) => Seq(true),
    (_ : String) => Seq(false)
)

Example

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val s1 = fsEnv.socketTextStream("Spark",9999)
val s2 = fsEnv.socketTextStream("Spark",8888)
s1.connect(s2).flatMap(
    (line:String)=>line.split("\\s+"),//transformation for stream s1
    (line:String)=>line.split("\\s+")//transformation for stream s2
)
.map((_,1))
.keyBy(0)
.sum(1)
.print()

fsEnv.execute("ConnectedStream")

DataStream → SplitStream

Split

Split the stream into two or more streams according to some criterion.

val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)

Select

Select one or more streams from a split stream.

val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")

Example (deprecated)

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val logStream = fsEnv.socketTextStream("Spark",9999)

val splitStream: SplitStream[String] = logStream.split(new OutputSelector[String] {
    override def select(out: String): lang.Iterable[String] = {
        if (out.startsWith("INFO")) {
            val array = new util.ArrayList[String]()
            array.add("info")
            return array
        } else  {
            val array = new util.ArrayList[String]()
            array.add("error")
            return array
        }
    }
})

splitStream.select("info").print("info")
splitStream.select("error").printToErr("error")

fsEnv.execute("ConnectedStream")

Approach 2: side outputs (preferred)

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val logStream = fsEnv.socketTextStream("Spark",9999)


val errorTag = new OutputTag[String]("error")

val dataStream = logStream.process(new ProcessFunction[String, String] {
    override def processElement(line: String,
                                context: ProcessFunction[String, String]#Context,
                                collector: Collector[String]): Unit = {
        if (line.startsWith("INFO")) {
            collector.collect(line)
        }else{
            context.output(errorTag,line)//emit to the side output
        }
    }
})

dataStream.print("info")
dataStream.getSideOutput(errorTag).print("error")

fsEnv.execute("ConnectedStream")

DataStream → KeyedStream

KeyBy

Logically partitions a stream into disjoint partitions, so that all elements with the same key go to the same partition. Internally, this is implemented with hash partitioning.

That is, one stream is logically split into several non-overlapping streams, and every element with a given key lands in exactly one of them.

dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
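The hash partitioning behind keyBy can be modelled in a few lines of plain Scala. This is a simplification and an assumption, not Flink's algorithm: Flink assigns keys to key groups before mapping them to subtasks, whereas the hypothetical `KeyByModel` below maps the key hash directly onto a partition index.

```scala
// Plain-Scala model of hash partitioning by key; not Flink's key-group logic.
object KeyByModel {
  // Same key, same hash, same partition. floorMod keeps the index
  // non-negative even when hashCode is negative.
  def partitionOf(key: Any, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Splits the records into disjoint partitions indexed by partition number.
  def partition[K, V](records: Seq[(K, V)], parallelism: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (key, _) => partitionOf(key, parallelism) }
}
```

Because the partition depends only on the key, all "hello" records end up in a single partition no matter how many partitions exist, which is what lets per-key state live on one subtask.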
Reduce

A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

In other words, on a keyed stream each new element is combined with the previously reduced value for its key, and the combined value is emitted.

keyedStream.reduce { _ + _ }
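The "rolling" behaviour can be made concrete with a plain-Scala model. `RollingReduceModel` is a hypothetical helper, not part of Flink: it keeps one reduced value per key and emits an updated value for every input element.

```scala
// Plain-Scala model of a rolling reduce on a keyed stream; not Flink code.
object RollingReduceModel {
  // Emits one output per input: the running reduction for that element's key.
  def rollingReduce[K, V](input: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] = {
    val state = scala.collection.mutable.Map.empty[K, V]
    input.map { case (key, value) =>
      // The first element of a key is emitted unchanged.
      val reduced = state.get(key).map(f(_, value)).getOrElse(value)
      state(key) = reduced
      key -> reduced
    }
  }
}
```

For the input ("a",1), ("b",2), ("a",3) with addition as the reduce function, the model emits ("a",1), ("b",2), ("a",4): every element produces an output, and keys do not influence each other.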
Fold

A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.

Similar to Reduce, but starts from an initial value.

val result: DataStream[String] =
keyedStream.fold("start")((str, i) => { str + "-" + i })
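A plain-Scala model shows what the fold above emits. `RollingFoldModel` is a hypothetical helper for illustration only. It assumes, as with reduce, that the folded value is tracked per key.

```scala
// Plain-Scala model of a rolling fold on a keyed stream; not Flink code.
object RollingFoldModel {
  // For each element, folds it into the last folded value of its key
  // (or into `initial` for the first element of that key) and emits the result.
  def rollingFold[K, V, R](input: Seq[(K, V)], initial: R)(f: (R, V) => R): Seq[R] = {
    val state = scala.collection.mutable.Map.empty[K, R]
    input.map { case (key, value) =>
      val folded = f(state.getOrElse(key, initial), value)
      state(key) = folded
      folded
    }
  }
}
```

Applied to the values 1, 2, 3 under one key with the initial value "start", the model emits "start-1", "start-1-2", "start-1-2-3".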
Aggregations

Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).

The By variants return the element that holds the extreme value. The plain variants return the extreme value itself.

keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
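The difference between min and minBy can be sketched in plain Scala for a single key. `Reading` and `MinModel` are hypothetical names. The model deliberately returns only the minimum value for min, since what Flink puts in the other fields of a min result is not something this sketch makes claims about.

```scala
// Plain-Scala model of rolling min vs. minBy for one key; not Flink code.
final case class Reading(sensor: String, temperature: Double, timestamp: Long)

object MinModel {
  // Rolling min: only the minimum of the chosen field is tracked.
  def rollingMin(readings: Seq[Reading]): Seq[Double] =
    readings.map(_.temperature)
      .scanLeft(Double.PositiveInfinity)((a, b) => math.min(a, b))
      .tail

  // Rolling minBy: the whole element that holds the minimum is kept.
  def rollingMinBy(readings: Seq[Reading]): Seq[Reading] =
    readings.headOption.toSeq.flatMap { first =>
      readings.tail.scanLeft(first) { (best, next) =>
        if (next.temperature < best.temperature) next else best
      }
    }
}
```

With temperatures 20.0, 18.5, 19.0 the rolling min is 20.0, 18.5, 18.5, while minBy emits the first reading, then the second, then the second again, including its sensor and timestamp.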

Physical partitioning

Flink offers several partitioning schemes to choose from. The purpose of partitioning is to keep the data evenly distributed across tasks.

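| Partitioning scheme | Description | Usage |
|---|---|---|
| Custom partitioning | The user implements the partitioning strategy | dataStream.partitionCustom(partitioner, "someKey") |
| Random partitioning | Distributes the elements randomly to the downstream tasks | dataStream.shuffle() |
| Rebalancing (round-robin partitioning) | Distributes the upstream elements evenly to the downstream tasks in round-robin fashion | dataStream.rebalance() |
| Rescaling | Distributes elements to a subset of the downstream tasks. For example, with upstream parallelism 2 and downstream parallelism 4, one upstream partition sends to the first two downstream partitions and the other sends to the last two | dataStream.rescale() |
| Broadcasting | Broadcasts every element of each upstream partition to all downstream partitions | dataStream.broadcast() |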
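The routing rules of rebalance, rescale, and broadcast can be compared with a plain-Scala model. `PartitionModel` is a hypothetical helper, and the rules are simplified assumptions: rebalance is modelled as starting at subtask 0, and rescale covers only the case where the downstream parallelism is a multiple of the upstream parallelism.

```scala
// Plain-Scala model of physical partitioning; not Flink's implementation.
object PartitionModel {
  // Rebalance: record number i (0-based) goes to downstream subtask i % n.
  def rebalanceTarget(recordIndex: Long, downstream: Int): Int =
    (recordIndex % downstream).toInt

  // Rescale: each upstream subtask feeds only its own contiguous block of
  // downstream subtasks, so no record crosses between blocks.
  def rescaleTargets(upstream: Int, downstream: Int): Map[Int, Seq[Int]] = {
    require(downstream % upstream == 0, "sketch covers only the evenly divisible case")
    val fanOut = downstream / upstream
    (0 until upstream).map(u => u -> (u * fanOut until (u + 1) * fanOut).toList).toMap
  }

  // Broadcast: every record goes to all downstream subtasks.
  def broadcastTargets(downstream: Int): Seq[Int] = (0 until downstream).toList
}
```

For upstream parallelism 2 and downstream parallelism 4, `rescaleTargets` maps upstream 0 to downstream 0 and 1, and upstream 1 to downstream 2 and 3, matching the example in the table.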

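Task chaining and resource groups (for reference)

Chaining two operator transformations means Flink tries to place them in the same thread, which saves threads and avoids unnecessary thread-to-thread communication. Chaining can be disabled for the whole job with StreamExecutionEnvironment.disableOperatorChaining().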

val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

    //3. Transform the data
    dataStream.filter(line => line.startsWith("INFO"))
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .map(t=>WordPair(t._1,t._2))
    .print()

[Image: assets/1573799970593.png]

For finer control, Flink provides the following operations for changing the chaining behavior:

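| Operation | Usage | Description |
|---|---|---|
| Start new chain | someStream.filter(…).map(…).startNewChain().map(…) | Begins a new chain at the first map, disconnecting it from filter; the two maps are chained together |
| Disable chaining | someStream.map(…).disableChaining() | Does not chain this operator with the operators before or after it |
| Set slot sharing group | someStream.filter(…).slotSharingGroup("name") | Sets the resource group of the operation, which affects how its tasks occupy Task Slots |

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream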
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3. Transform the data
dataStream.filter(line => line.startsWith("INFO"))
.flatMap(_.split("\\s+"))
.startNewChain()
.slotSharingGroup("g1")
.map((_,1))
.map(t=>WordPair(t._1,t._2))
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

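State & Fault Tolerance (important)

Flink divides the state of a streaming computation into two kinds: Keyed State and Operator State. Keyed State is bound to a key within an operator, whereas Operator State is bound only to the operator instance. Both Keyed State and Operator State come in two forms: Managed State and Raw State.

Managed State - the state is managed by Flink, so its structure and metadata are predefined by Flink. This lets Flink optimize how Managed State is stored.

Raw State - the state is raw data. It is only needed when implementing a custom operator, and Flink stores it purely as a byte array without knowing anything about its structure. It is therefore rarely used in practice.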

All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators. Using managed state (rather than raw state) is recommended, since with managed state Flink is able to automatically redistribute state when the parallelism is changed, and also do better memory management.

Managed Keyed State

For Keyed State, Flink provides a rich set of state types for storing state. The following are currently available:

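| Type | Description | Methods |
|---|---|---|
| ValueState<T> | Holds a single value that can be updated | update(T), T value(), clear() |
| ListState<T> | Holds a list of elements | add(T), addAll(List<T>), Iterable<T> get(), update(List<T>), clear() |
| ReducingState<T> | Holds a single value that is the aggregate of all values added to the state. Requires a user-supplied ReduceFunction | add(T), T get(), clear() |
| AggregatingState<IN, OUT> | Holds a single value that is the aggregate of all values added to the state. Requires a user-supplied AggregateFunction | add(IN), OUT get(), clear() |
| FoldingState<T, ACC> | Holds a single value that is the aggregate of all values added to the state. Deprecated since Flink 1.4 and will be removed in a future release; use AggregatingState instead. Requires a user-supplied FoldFunction | add(T), ACC get(), clear() |
| MapState<UK, UV> | Holds a map of key-value entries | put(UK, UV), putAll(Map<UK, UV>), get(UK), entries(), keys(), values(), clear() |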

To get a state handle, first create the corresponding StateDescriptor. Flink provides the following StateDescriptors: ValueStateDescriptor, ListStateDescriptor, ReducingStateDescriptor, FoldingStateDescriptor, MapStateDescriptor, and AggregatingStateDescriptor.

After creating the StateDescriptor, obtain the RuntimeContext inside a Rich Function and call the matching method on it to get the State object:

  • ValueState<T> getState(ValueStateDescriptor<T>)
  • ReducingState<T> getReducingState(ReducingStateDescriptor<T>)
  • ListState<T> getListState(ListStateDescriptor<T>)
  • AggregatingState<IN, OUT> getAggregatingState(AggregatingStateDescriptor<IN, ACC, OUT>)
  • FoldingState<T, ACC> getFoldingState(FoldingStateDescriptor<T, ACC>)
  • MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV>)
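What "keyed" means for a state handle can be illustrated with a plain-Scala model. `KeyedValueStateModel` and `KeyedStateDemo` are hypothetical names, not Flink classes. The model assumes that the runtime sets the current key before each element is processed, so a single handle transparently reads and writes a separate slot per key.

```scala
// Plain-Scala model: one logical ValueState is really one slot per key.
final class KeyedValueStateModel[K, V] {
  private val slots = scala.collection.mutable.Map.empty[K, V]
  private var currentKey: Option[K] = None

  // In Flink the runtime does this before each element is processed.
  def setCurrentKey(key: K): Unit = currentKey = Some(key)

  // All accessors operate on the slot of the current key only.
  def value(): Option[V] = currentKey.flatMap(slots.get)
  def update(v: V): Unit = currentKey.foreach(k => slots(k) = v)
  def clear(): Unit = currentKey.foreach(k => slots.remove(k))
}

object KeyedStateDemo {
  // Word count written against the model, mirroring a stateful map function.
  def wordCount(words: Seq[String]): Seq[(String, Int)] = {
    val state = new KeyedValueStateModel[String, Int]
    words.map { word =>
      state.setCurrentKey(word)
      val updated = state.value().getOrElse(0) + 1
      state.update(updated)
      word -> updated
    }
  }
}
```

Processing "a", "b", "a" yields ("a",1), ("b",1), ("a",2): the second "a" sees the count stored by the first, while "b" starts from an empty slot.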

ValueState

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.disableOperatorChaining()
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.map(new RichMapFunction[(String,Int),(String,Int)] {
    var valueState:ValueState[Int]=_

    override def open(parameters: Configuration): Unit = {
        val vsd = new ValueStateDescriptor[Int]("wordcount",createTypeInformation[Int])
        valueState= getRuntimeContext.getState(vsd)
    }

    override def map(value: (String, Int)): (String, Int) = {
        //value() yields 0 for a key with no stored count yet (a null Int unboxes to 0)
        val historyValue = valueState.value()
        //Update the stored count
        valueState.update(historyValue+value._2)
        (value._1,valueState.value())
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

ReducingState

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.disableOperatorChaining()
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.map(new RichMapFunction[(String,Int),(String,Int)] {
    var reduceState:ReducingState[Int]=_

    override def open(parameters: Configuration): Unit = {
        val rsd = new ReducingStateDescriptor[Int]("wordcount",new ReduceFunction[Int] {
            override def reduce(value1: Int, value2: Int): Int = value1+value2
        },createTypeInformation[Int])
        reduceState= getRuntimeContext.getReducingState(rsd)
    }

    override def map(value: (String, Int)): (String, Int) = {
        reduceState.add(value._2)
        (value._1,reduceState.get())
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

AggregatingState

    //1. Create the StreamExecutionEnvironment
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
    fsEnv.disableOperatorChaining()
    //2. Create the DataStream
    val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
    //3. Transform the data; sample input: 1 zhangsan sales 10000
    dataStream.map(_.split("\\s+"))
      .map(ts=>Employee(ts(0),ts(1),ts(2),ts(3).toDouble))
      .keyBy("dept")
      .map(new RichMapFunction[Employee,(String,Double)] {
        var aggregatingState:AggregatingState[Double,Double]= _

        override def open(parameters: Configuration): Unit = {
          val asd=new AggregatingStateDescriptor[Double,(Double,Int),Double]("agggstate",
            new AggregateFunction[Double,(Double,Int),Double] {
              override def createAccumulator(): (Double, Int) = (0.0,0)

              override def add(value: Double, accumulator: (Double, Int)): (Double, Int) = {
                var total=accumulator._1
                var count=accumulator._2
                (total+value,count+1)
              }
              override def merge(a: (Double, Int), b: (Double, Int)): (Double, Int) = {
                (a._1+b._1,a._2+b._2)
              }
              override def getResult(accumulator: (Double, Int)): Double = {
                accumulator._1/accumulator._2
              }

            }
            ,createTypeInformation[(Double,Int)])

          aggregatingState=getRuntimeContext.getAggregatingState(asd)
        }

        override def map(value: Employee): (String, Double) = {
          aggregatingState.add(value.salary)
          (value.dept,aggregatingState.get())
        }
      })
      .print()
    
    fsEnv.execute("FlinkWordCountsQuickStart")

ListState

//1. Create the StreamExecutionEnvironment
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data; sample input: zhangsan 123456
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1)))
.keyBy(0)
.map(new RichMapFunction[(String,String),String] {
    var historyPasswords:ListState[String]=_

    override def open(parameters: Configuration): Unit = {
        val lsd = new ListStateDescriptor[String]("pwdstate",createTypeInformation[String])
        historyPasswords=getRuntimeContext.getListState(lsd)
    }
    override def map(value: (String, String)): String = {
        //asScala/asJava require: import scala.collection.JavaConverters._
        var list = historyPasswords.get().asScala.toList
        list= list.::(value._2)
        list = list.distinct //Remove duplicates
        historyPasswords.update(list.asJava)

        value._1+"\t"+list.mkString(",")
    }
})
.print()
fsEnv.execute("FlinkWordCountsQuickStart")

MapState

package com.cjh.flink.keyedState

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

object MapStateDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val dataStream = env.socketTextStream("spark", 6666)
        dataStream
        .flatMap(_.split("\\s+"))
        .map((_, 1))
        .keyBy(0)
        .map(new RichMapFunction[(String, Int), (String, Int)] {
            var mapState: MapState[String, Int] = _

            override def map(value: (String, Int)): (String, Int) = {
                mapState.put(value._1, mapState.get(value._1) + value._2)
                (value._1, mapState.get(value._1))
            }

            override def open(parameters: Configuration): Unit = {
                val msd = new MapStateDescriptor("mapstate", createTypeInformation[String], createTypeInformation[Int])
                mapState = getRuntimeContext.getMapState(msd)
            }
        })
        .print()

        env.execute("mapstate")
    }
}

Managed Operator State

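To use Operator State, implement the general CheckpointedFunction interface or the ListCheckpointed<T extends Serializable> interface.

CheckpointedFunction

CheckpointedFunction currently supports only list-style state. Each operator instance maintains its own sublist, and the system logically concatenates the sublists of all operator instances. On recovery the system redistributes this state across the operator instances using one of two strategies: even-split redistribution or union redistribution (every instance receives the complete list).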
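The two redistribution strategies can be contrasted with a plain-Scala model. `RedistributionModel` is a hypothetical helper. The exact way Flink assigns elements to instances under even-split is not taken from the source, so the contiguous split below is an assumption made for illustration.

```scala
// Plain-Scala model of operator list-state redistribution; not Flink code.
object RedistributionModel {
  // Even-split: concatenate all sublists, then hand each new instance a
  // contiguous, near-equal share of the elements.
  def evenSplit[T](sublists: Seq[Seq[T]], newParallelism: Int): Seq[Seq[T]] = {
    val all = sublists.flatten
    val base = all.size / newParallelism
    val extra = all.size % newParallelism
    // The first `extra` instances receive one additional element.
    val sizes = (0 until newParallelism).map(i => if (i < extra) base + 1 else base)
    val starts = sizes.scanLeft(0)(_ + _)
    sizes.indices.map(i => all.slice(starts(i), starts(i) + sizes(i)))
  }

  // Union: every new instance receives the complete concatenated list.
  def union[T](sublists: Seq[Seq[T]], newParallelism: Int): Seq[Seq[T]] =
    Seq.fill(newParallelism)(sublists.flatten)
}
```

Restoring the sublists (1, 2) and (3, 4, 5) onto three instances gives (1, 2), (3, 4), (5) under even-split, whereas union gives every instance all five elements and leaves it to the function to pick what it needs.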

public interface CheckpointedFunction {
    void snapshotState(FunctionSnapshotContext var1) throws Exception;
    void initializeState(FunctionInitializationContext var1) throws Exception;
}

package com.cjh.flink.operatorstate

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

import scala.collection.mutable.ListBuffer

object CheckpointFunctionDemo {
      def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment
            env.setParallelism(1)
            val dataStream = env.socketTextStream("spark", 6666)

            dataStream
                .map(t => t + "   666\t")
                .addSink(new UDBufferSink(5))

            env.execute("operatorstate")
      }
}

class UDBufferSink(threshold: Int = 0) extends SinkFunction[String] with CheckpointedFunction {
      @transient
      private var checkpointedState: ListState[String] = _ // operator state that is written to checkpoints
      private val bufferedElements = ListBuffer[String]() // in-memory output buffer

      // called once for every incoming record
      override def invoke(value: String, context: SinkFunction.Context[_]): Unit = {
            bufferedElements += value // append the record to the output buffer
            //println(bufferedElements.size + " " + (bufferedElements.size >= threshold))
            // once the threshold is reached, emit the buffered records and clear the buffer;
            // >= (rather than ==) also flushes when the default threshold of 0 is used
            if (bufferedElements.size >= threshold) {
                  bufferedElements.foreach(print)
                  bufferedElements.clear()
            }
      }

      // snapshot: copy the current buffer into the checkpointed state
      override def snapshotState(context: FunctionSnapshotContext): Unit = {
            checkpointedState.clear() // drop the content of the previous checkpoint first
            for (elem <- bufferedElements) { // then store every buffered record
                  checkpointedState.add(elem)
            }
      }

      // state initialization, called on first start and on recovery
      override def initializeState(context: FunctionInitializationContext): Unit = {
            // describe the state: its name and element type
            val descriptor = new ListStateDescriptor[String]("lsd", createTypeInformation[String])

            // obtain the list state (even-split redistribution) from the operator state store
            checkpointedState = context.getOperatorStateStore.getListState(descriptor)

            if (context.isRestored) { // on recovery, refill the output buffer from the restored state
                  for (elem <- checkpointedState.get().asScala) {
                        bufferedElements += elem
                  }
            }
      }

}

Storing checkpoints on HDFS

Stop Flink

Look up the Hadoop classpath

[root@Spark bin]# hadoop classpath
/usr/hadoop-2.9.2/etc/hadoop:/usr/hadoop-2.9.2/share/hadoop/common/lib/*:/usr/hadoop-2.9.2/share/hadoop/common/*:/usr/hadoop-2.9.2/share/hadoop/hdfs:/usr/hadoop-2.9.2/share/hadoop/hdfs/lib/*:/usr/hadoop-2.9.2/share/hadoop/hdfs/*:/usr/hadoop-2.9.2/share/hadoop/yarn:/usr/hadoop-2.9.2/share/hadoop/yarn/lib/*:/usr/hadoop-2.9.2/share/hadoop/yarn/*:/usr/hadoop-2.9.2/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.9.2/share/hadoop/mapreduce/*:/usr/hadoop-2.9.2/contrib/capacity-scheduler/*.jar

Configure the environment variables

vi .bashrc

JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export HADOOP_HOME
export JAVA_HOME
export PATH
export CLASSPATH
HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CLASSPATH

[root@Spark ~]# source .bashrc
# verify that the variable is set
[root@Spark ~]# echo $HADOOP_CLASSPATH
/usr/hadoop-2.9.2/etc/hadoop:/usr/hadoop-2.9.2/share/hadoop/common/lib/*:/usr/hadoop-2.9.2/share/hadoop/common/*:/usr/hadoop-2.9.2/share/hadoop/hdfs:/usr/hadoop-2.9.2/share/hadoop/hdfs/lib/*:/usr/hadoop-2.9.2/share/hadoop/hdfs/*:/usr/hadoop-2.9.2/share/hadoop/yarn:/usr/hadoop-2.9.2/share/hadoop/yarn/lib/*:/usr/hadoop-2.9.2/share/hadoop/yarn/*:/usr/hadoop-2.9.2/share/hadoop/mapreduce/lib/*:/usr/hadoop-2.9.2/share/hadoop/mapreduce/*:/usr/hadoop-2.9.2/contrib/capacity-scheduler/*.jar

Start HDFS

start-dfs.sh

Edit the Flink configuration file

[root@Spark ~]# vim /usr/flink-1.8.1/conf/flink-conf.yaml
# note: each of the entries below is written with one leading space
 state.backend: rocksdb
 state.checkpoints.dir: hdfs:///flink-checkpoints # on a single machine there is no need to spell out hdfs://spark:9000/flink-checkpoints
 state.savepoints.dir: hdfs:///flink-savepoints # same as above
# enable incremental checkpoints
 state.backend.incremental: true

List the jobs currently running in Flink

[root@Spark bin]# ./flink list -m spark:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
18.11.2019 22:48:14 : 176aa8e55e2fa2f291381c98a6c7aac8 : operatorstate (RUNNING)
--------------------------------------------------------------
No scheduled jobs.

Cancel the job and save its state to the default savepoint directory

[root@Spark bin]# ./flink cancel -m spark:8081 -s 176aa8e55e2fa2f291381c98a6c7aac8
Cancelling job 176aa8e55e2fa2f291381c98a6c7aac8 with savepoint to default savepoint directory.
Cancelled job 176aa8e55e2fa2f291381c98a6c7aac8. Savepoint stored in hdfs://Spark:9000/flink-savepoints/savepoint-176aa8-c2050fd50de2.

Restore from the savepoint

[Image: assets/1574089131212.png]

ListCheckpointed

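This interface is a variant of CheckpointedFunction that supports only list-style state with the even-split redistribution scheme.

public interface ListCheckpointed<T extends Serializable> {
    // the returned list is the state to be stored
    List<T> snapshotState(long var1, long var3) throws Exception;
    // state initialization / recovery logic
    void restoreState(List<T> var1) throws Exception;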
}

package com.cjh.flink.operatorstate

import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import java.lang.{Long => JLong} // alias java.lang.Long as JLong to distinguish it from Scala's Long
import java.util
import java.util.Collections
import scala.collection.JavaConverters._

object ListCheckpointedDemo {
      def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment
            env.setParallelism(1)
            val dataStream = env.addSource(new UDCounterSource)

            dataStream
                .map(counter => "counter: " + counter)
                .print()

            env.execute("listcheckpoint")
      }
}

class UDCounterSource extends RichParallelSourceFunction[Long] with ListCheckpointed[JLong] {
      @volatile
      private var isRunning = true // cleared by cancel() to stop the source
      private var offset = 0L // the current offset
      // snapshot: return the state to store in the checkpoint
      override def snapshotState(checkpointId: Long, timestamp: Long): util.List[JLong] = {
            println("snapshot: " + offset)
            Collections.singletonList(offset) // a single-element list that holds the offset
      }

      // state recovery
      override def restoreState(state: util.List[JLong]): Unit = {
            // assign the restored value to offset (the list holds only one element)
            for (elem <- state.asScala) {
                  offset = elem
            }
      }

      // main loop of the source
      override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
            val lock = ctx.getCheckpointLock // checkpoint lock: emitting and updating the offset must not interleave with a checkpoint
            while (isRunning) {
                  Thread.sleep(1000)
                  lock.synchronized({
                        ctx.collect(offset) // emit the offset downstream
                        offset += 1
                  })
            }
      }

      override def cancel(): Unit = isRunning = false
}

State Time-To-Live(TTL)

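A time-to-live (TTL) can be assigned to keyed state of any type. If a TTL is configured and a state value has expired, Flink removes the expired value on a best-effort basis. The TTL applies per value; for collection types such as MapState and ListState, every element (list element or map entry) carries its own expiration time.

Basic usage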

dataStream.flatMap(_.split("\\s+"))
.map((_, 1))
.keyBy(0)
.map(new RichMapFunction[(String, Int), (String, Int)] {
    var valueState: ValueState[Int] = _

    override def open(parameters: Configuration): Unit = {
        val vsd = new ValueStateDescriptor[Int]("wordcount", createTypeInformation[Int])

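        val ttlConfig = StateTtlConfig
        .newBuilder(Time.seconds(5)) // time-to-live of the state
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) // when the TTL timestamp is refreshed
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) // never return expired values
        // the two settings above are the defaults and apply even when they are not set explicitly
        .cleanupFullSnapshot() // drop expired entries when a full snapshot is taken; does not shrink the state held at runtime
        .cleanupIncrementally(100, false) // incremental cleanup
        .cleanupInBackground() // enable background cleanup,
        // which picks a default strategy according to the state backend (memory, file system or RocksDB)
        .cleanupIncrementally(10,true) // heap-based backends:
        // 10 entries are checked per cleanup step; cleanup is triggered by state access, and true additionally triggers it for every processed record
        .cleanupInRocksdbCompactFilter(1000) // RocksDB backend:
        // the compaction filter drops expired entries and re-reads the current timestamp after every 1000 processed entries
        .build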

        //开启TTL特性
        vsd.enableTimeToLive(ttlConfig)

        valueState = getRuntimeContext.getState(vsd)
    }

    override def map(value: (String, Int)): (String, Int) = {
        // for ValueState[Int], an unset or expired state is read as 0 in Scala, so no null check is needed
        val historyValue = valueState.value()
        // update the running count
        valueState.update(historyValue + value._2)
        (value._1, valueState.value())
    }
})
.print()

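The first argument of newBuilder is mandatory; it is the time-to-live value.

setUpdateType() configures when the TTL timestamp of a state is refreshed. The options are:

StateTtlConfig.UpdateType.Disabled: TTL is disabled, the state never expires
StateTtlConfig.UpdateType.OnCreateAndWrite: refreshed only on creation and on write access (default)
StateTtlConfig.UpdateType.OnReadAndWrite: refreshed on read access as well as on write access

setStateVisibility() configures whether an expired value is returned if it has not been cleaned up yet. The options are:

StateTtlConfig.StateVisibility.NeverReturnExpired: expired values are never returned (default)
StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp: an expired value is returned as long as it still exists, that is, has not been cleaned up
Note: with TTL enabled, the state backend stores a last-access timestamp (processing time) along with every value, which increases the state size. If a job that ran without TTL is changed to enable TTL and is then restored from its earlier state, the restore fails with a compatibility error and a StateMigrationException.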
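How the update type and the visibility setting interact can be seen in a small simulation. This is a simplified plain-Java model of a single TTL-enabled value, not Flink's implementation; the class, enum and method names are hypothetical, and time is passed in explicitly so that the behaviour is reproducible.

```java
/** Simplified, hypothetical model of one TTL-enabled value; not Flink's implementation. */
final class TtlValueSketch {
    enum UpdateType { ON_CREATE_AND_WRITE, ON_READ_AND_WRITE }
    enum Visibility { NEVER_RETURN_EXPIRED, RETURN_EXPIRED_IF_NOT_CLEANED_UP }

    private final long ttlMillis;
    private final UpdateType updateType;
    private final Visibility visibility;
    private Integer value;     // null means that no value is stored
    private long lastAccess;   // timestamp stored next to the value

    TtlValueSketch(long ttlMillis, UpdateType updateType, Visibility visibility) {
        this.ttlMillis = ttlMillis;
        this.updateType = updateType;
        this.visibility = visibility;
    }

    /** Creating or writing the value always refreshes the timestamp. */
    void update(int newValue, long now) {
        value = newValue;
        lastAccess = now;
    }

    /** Reads the value at time 'now'; an expired value is removed by the read. */
    Integer value(long now) {
        if (value == null) {
            return null;
        }
        if (now - lastAccess >= ttlMillis) {
            Integer stale = value;
            value = null; // explicit read of an expired value cleans it up
            return visibility == Visibility.RETURN_EXPIRED_IF_NOT_CLEANED_UP ? stale : null;
        }
        if (updateType == UpdateType.ON_READ_AND_WRITE) {
            lastAccess = now; // a read also extends the lifetime
        }
        return value;
    }

    public static void main(String[] args) {
        TtlValueSketch state = new TtlValueSketch(
                5000, UpdateType.ON_CREATE_AND_WRITE, Visibility.NEVER_RETURN_EXPIRED);
        state.update(1, 0);
        System.out.println(state.value(4000)); // 1
        System.out.println(state.value(5000)); // null, the read at 4000 did not refresh the TTL
    }
}
```

With ON_READ_AND_WRITE the read at 4000 would have moved the expiration to 9000.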

清除Expired State

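By default, an expired value is removed only when it is read explicitly, for example by calling ValueState.value(). State that is never read again is therefore never removed, and the stored state can keep growing.

StateTtlConfig defines four cleanup strategies:

EMPTY_STRATEGY (no cleanup),

FULL_STATE_SCAN_SNAPSHOT (cleanup in full snapshot),

INCREMENTAL_CLEANUP (incremental cleanup),

ROCKSDB_COMPACTION_FILTER (cleanup in the RocksDB compaction filter).

FULL_STATE_SCAN_SNAPSHOT

Cleanup in full snapshot

Expired entries are filtered out at the moment a full state snapshot is taken, which reduces the snapshot size, so a job restored from that snapshot no longer contains them. The state held locally by the running job is not touched, so this strategy alone does not remove expired, unread entries at runtime.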

import org.apache.flink.api.common.state.StateTtlConfig
import org.apache.flink.api.common.time.Time
val ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .cleanupFullSnapshot
    .build

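This strategy works with the heap-based state backends; it does not apply to incremental checkpointing with the RocksDB state backend.

Code example (Java) of the incremental cleanup strategy: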

StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.minutes(5))
.cleanupIncrementally(100, false)
.build();

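Signature: cleanupIncrementally(int cleanupSize, boolean runCleanupForEveryRecord)

Cleanup in background

Background cleanup can be enabled with cleanupInBackground(). The strategy that actually runs is the default one of the configured state backend, so it differs between backends.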

import org.apache.flink.api.common.state.StateTtlConfig
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.cleanupInBackground
.build

Incremental cleanup

The corresponding strategy is INCREMENTAL_CLEANUP; it is activated by calling cleanupIncrementally().

  • Incremental cleanup (heap-based backends)
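import org.apache.flink.api.common.state.StateTtlConfig
import org.apache.flink.api.common.state.StateTtlConfig.{StateVisibility, UpdateType}
import org.apache.flink.api.common.time.Time
val ttlConfig = StateTtlConfig.newBuilder(Time.seconds(5))
.setUpdateType(UpdateType.OnCreateAndWrite)
.setStateVisibility(StateVisibility.NeverReturnExpired)
.cleanupIncrementally(100,true) // values used by default background cleanup: 5 | false
.build()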

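The first argument is the number of state entries checked each time cleanup is triggered, 100 in this example. Cleanup is always triggered by state access; if the second argument is true, it is additionally triggered for every processed record, even when the user code does not touch the state.

Notes:

If no state is accessed and no records are processed, expired state persists.

Time spent on incremental cleanup adds to the record processing latency.

Incremental cleanup is currently implemented only for the heap state backends (state held on the TaskManager JVM heap); setting it for RocksDB has no effect.

If a heap state backend is used with synchronous snapshots, the global iterator keeps a copy of all keys while iterating, because its implementation does not support concurrent modification. Enabling the feature then increases memory consumption. Asynchronous snapshots do not have this problem.

When background cleanup is enabled with its defaults, this strategy is activated for the heap backends with 5 checked entries per state access and without cleanup on every record.

For an existing job, this cleanup strategy can be activated or deactivated in StateTtlConfig at any time, for example after a restart from a savepoint.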
  • RocksDB compaction

RocksDB is an embedded key-value store in which keys and values are arbitrary byte streams. It runs asynchronous compactions in the background that merge updates of the same key to reduce the size of the state files, but compaction by itself does not remove expired state. A compaction filter can therefore be configured so that RocksDB drops expired entries while compacting. The filter is disabled by default; to enable it, set state.backend.rocksdb.ttl.compaction.filter.enabled: true in flink-conf.yaml, or call RocksDBStateBackend::enableTtlCompactionFilter in the application.

[Image: assets/1571212802688.png]

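import org.apache.flink.api.common.state.StateTtlConfig
import org.apache.flink.api.common.state.StateTtlConfig.{StateVisibility, UpdateType}
import org.apache.flink.api.common.time.Time
val ttlConfig = StateTtlConfig.newBuilder(Time.seconds(5))
.setUpdateType(UpdateType.OnCreateAndWrite)
.setStateVisibility(StateVisibility.NeverReturnExpired)
.cleanupInRocksdbCompactFilter(1000) // 1000 is the default
.build()

Here 1000 means that, during compaction, the filter re-reads the current timestamp from Flink after every 1000 processed state entries and uses it to decide which entries have expired; expired entries are dropped.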

Broadcast State(状态广播)

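Besides operator state and keyed state, Flink has a third kind of state, broadcast state. It lets the results computed on stream A be broadcast to stream B: stream B can only read the broadcast state, while the state is updated in real time from stream A.

  • non-keyed: a DataStream connected to a BroadcastStream

  • keyed: a KeyedStream connected to a BroadcastStream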

    package com.baizhi.broadcast
    
    import org.apache.flink.api.common.state.MapStateDescriptor
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.api.common.state.{MapStateDescriptor, ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.co.{BroadcastProcessFunction, KeyedBroadcastProcessFunction}
    import org.apache.flink.util.Collector
    import org.apache.flink.streaming.api.scala._
    
    object FlinkKeyedBroadcaststate {
        def main(args: Array[String]): Unit = {
            val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
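            // 2. create the DataStream
            // sample input: 001 zhangsan mobile
            // sample input: 002 zhangsan baby-products
            val userKeyedDatastream = fsEnv.socketTextStream("Spark", 9999)
            .map(line => line.split("\\s+"))
            .map(ts => (ts(2),1))
            .keyBy(_._1) // a key selector keeps the key type String, as KeyedBroadcastProcessFunction[String, ...] requires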
    
    
            val msd = new MapStateDescriptor[String,String]("rule",createTypeInformation[String],createTypeInformation[String])
            // sample rule: baby-products 2 -5
            val broadcastStream = fsEnv.socketTextStream("Spark", 8888)
            .map(line => line.split("\\s+"))
            .map(ts => (ts(0), ts(1)+":"+ts(2)))
            .broadcast(msd)
    
            userKeyedDatastream.connect(broadcastStream)
            .process(new UserDefineKeyedBroadcastProcessFunction(msd))
            .print()
    
            fsEnv.execute("FlinkWordCountsQuickStart")
        }
    }
    //============================================================================
    class UserDefineKeyedBroadcastProcessFunction(msd:MapStateDescriptor[String,String]) extends
    KeyedBroadcastProcessFunction[String,(String,Int),(String,String),String]{
    
        var counts:ValueState[Int]=_
    
        override def open(parameters: Configuration): Unit = {
            println("open")
            val vsd = new ValueStateDescriptor[Int] ("count",createTypeInformation[Int])
            counts=getRuntimeContext.getState(vsd)
        }
        override def processElement(in1: (String, Int),
                                    readOnlyContext: KeyedBroadcastProcessFunction[String, (String, Int), (String, String), String]#ReadOnlyContext,
                                    collector: Collector[String]): Unit = {
    
            // for ValueState[Int], an unset state is read as 0 in Scala, so no null check is needed
            val history=counts.value()
            counts.update(history+1)
            println(in1._1+" "+counts.value())
            val readOnlyState = readOnlyContext.getBroadcastState(msd)
            if(readOnlyState.get(in1._1)!=null){
                val value=readOnlyState.get(in1._1) // format: count threshold:discount amount
                val threshold=value.split(":")(0).toInt
                if(counts.value()> threshold){
                    println("push condition met")
                    collector.collect(in1._1+"\t"+value.split(":")(1))
                    counts.clear() // reset the counter state
                }else{
                    println("condition not met, current count: "+counts.value()+", required: "+threshold)
                }
            }
    
        }
    
        override def processBroadcastElement(in2: (String, String), context: KeyedBroadcastProcessFunction[String, (String, Int), (String, String), String]#Context, collector: Collector[String]): Unit = {
            val state = context.getBroadcastState(msd)
            // format: category -> count threshold:discount amount
            state.put(in2._1,in2._2)
        }
    }
    

Checkpoint & Savepoint

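Checkpointing is the mechanism by which Flink periodically persists the state of a streaming job; it is coordinated by the JobManager. The JobManager periodically triggers a checkpoint, upon which barrier markers are injected into the stream and flow downstream with the data. When a task receives the barrier it snapshots its own state and forwards the barrier; the tasks further downstream do the same and report the outcome of persisting their state to the JobManager. Only when all tasks have reported success does the JobManager mark the checkpoint as completed. (The process is triggered automatically and needs no manual intervention.)

A savepoint is a manually triggered checkpoint and requires manual intervention, for example flink cancel --withSavepoint.

Checkpointing is disabled by default; it has to be enabled in the job by the user.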
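The completion rule described above (a checkpoint counts only when every task has acknowledged it) can be sketched in a few lines. This is a hypothetical plain-Java model; the class and method names are illustrative and are not Flink's API.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of how a coordinator decides that a checkpoint is complete. */
final class PendingCheckpointSketch {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;
    private boolean declined;

    PendingCheckpointSketch(long checkpointId, Set<String> tasks) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(tasks); // defensive copy
    }

    long id() {
        return checkpointId;
    }

    /** A task reports that its state snapshot was persisted successfully. */
    void acknowledge(String task) {
        notYetAcknowledged.remove(task);
    }

    /** A task reports that it could not snapshot its state. */
    void decline(String task) {
        declined = true;
    }

    /** Complete only if nobody declined and every task has acknowledged. */
    boolean isCompleted() {
        return !declined && notYetAcknowledged.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> tasks = new HashSet<>();
        tasks.add("source");
        tasks.add("map");
        tasks.add("sink");
        PendingCheckpointSketch chk = new PendingCheckpointSketch(1L, tasks);
        chk.acknowledge("source");
        chk.acknowledge("map");
        System.out.println(chk.isCompleted()); // false, the sink has not reported yet
        chk.acknowledge("sink");
        System.out.println(chk.isCompleted()); // true
    }
}
```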

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

// checkpoint interval 5 s, i.e. 12 checkpoints per minute
fsEnv.enableCheckpointing(5000,CheckpointingMode.EXACTLY_ONCE)
// a checkpoint must complete within 4 s
fsEnv.getCheckpointConfig.setCheckpointTimeout(4000)
// at least 2 s must pass between the end of one checkpoint and the start of the next, so only one checkpoint runs at a time
fsEnv.getCheckpointConfig.setMinPauseBetweenCheckpoints(2000)
// keep the checkpoint data if the job is cancelled without --withSavepoint
fsEnv.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
// fail the task if an error occurs while taking a checkpoint
fsEnv.getCheckpointConfig.setFailOnCheckpointingErrors(true)

val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.map(new RichMapFunction[(String,Int),(String,Int)] {
    var valueState:ValueState[Int]=_

    override def open(parameters: Configuration): Unit = {
        val vsd = new ValueStateDescriptor[Int]("wordcount",createTypeInformation[Int])
        valueState= getRuntimeContext.getState(vsd)
    }

    override def map(value: (String, Int)): (String, Int) = {
        // for ValueState[Int], an unset state is read as 0 in Scala, so no null check is needed
        val historyValue = valueState.value()
        // update the running count
        valueState.update(historyValue+value._2)
        (value._1,valueState.value())
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

State backend(状态后端)

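Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html

MemoryStateBackend: stores snapshot data in the JobManager's memory. Each individual state is limited to 5 MB by default, and the total state must fit into the JobManager's memory. Typically used for development and testing, where the state is small; it is fast.

FsStateBackend: keeps the working state in the TaskManager's memory; on a checkpoint, the state is written asynchronously to the file system. The JobManager keeps only a small amount of metadata in memory. Typically used in production for large state.

RocksDBStateBackend: keeps the working state in RocksDB database files local to the TaskManager and can complete checkpoints incrementally. On a checkpoint, the TaskManager writes its local RocksDB data asynchronously to the remote file system. The JobManager keeps only a small amount of metadata in memory. Typically used in production for very large state. (A single key or value cannot exceed 2^31 bytes.)

FsStateBackend vs RocksDBStateBackend: FsStateBackend is limited by the TaskManager's memory but is fast. RocksDBStateBackend is limited only by the TaskManager's local disk; because the data lives on disk and every access involves serialization and deserialization, its performance can be lower.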

Window

https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/windows.html

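Window computation is at the heart of stream processing: windows split the unbounded stream into buckets of finite size, and computations are applied to the elements that fall into the same bucket (window). Flink distinguishes two categories of windowed computation.

The first is windowing on a keyed stream.

stream
       .keyBy(...)               <-  grouping
       .window(...)              <-  required: "assigner", the window assigner
      [.trigger(...)]            <-  optional: "trigger" (every window type has a default trigger)
      [.evictor(...)]            <-  optional: "evictor", can remove elements from the window
      [.allowedLateness(...)]    <-  optional: "lateness", allows late data to be processed
      [.sideOutputLateData(...)] <-  optional: "output tag", sends late elements to a side output
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: retrieves side-output data, for example late elements

The second is windowing directly on a non-keyed stream.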

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

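Window Lifecycle

In short, a window is created as soon as the first element that belongs to it arrives, and it is completely removed when the time (the watermark for event time, the system clock for processing time) passes its end timestamp plus the user-specified allowed lateness. A window can fire only once the time has passed its end timestamp; at that point the window is ready, and Flink performs the actual computation and emits the result.

Trigger: monitors the window; the window fires only when the trigger's condition is met (for example, based on the watermark).

Evictor: removes elements from the window after the trigger fires, before or after the window function is applied.

Window Assigners

Window assigners define how elements are assigned to windows. Once the window is defined, operators such as reduce/aggregate/fold/apply perform the aggregation over it.

  • Tumbling Windows: the window length equals the slide interval, so windows do not overlap. (time-based)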
dataStream.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
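    .reduce((v1,v2)=>(v1._1,v1._2+v2._2)) // aggregate two elements at a time
    .print()
  • Sliding Windows: the window length is greater than the slide interval, so consecutive windows overlap and share elements. (time-based)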
dataStream.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
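    .fold(("",0))((z,v)=>(v._1,z._2+v._2)) // aggregate starting from an initial value
    .print()
  • Session Windows: windows have no fixed size. Every element initially forms its own window, and windows whose distance is less than the configured gap are merged. (time-based)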
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.aggregate(new AggregateFunction[(String,Int),(String,Int),(String,Int)] { // aggregation through a user-defined AggregateFunction
    override def createAccumulator(): (String, Int) = {
        ("",0)
    }
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
        (value._1,value._2+accumulator._2)
    }
    override def getResult(accumulator: (String, Int)): (String, Int) = {
        accumulator
    }
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
})
.print()
  • Global Windows: windows are not divided by time, so there is no notion of window length or time. The window fires only if the user supplies a trigger.
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(GlobalWindows.create()) // create a GlobalWindow
.trigger(CountTrigger.of(4)) // use the count trigger provided by the API
.apply(new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
        out.collect((key,inputs.map(_._2).sum))
    }
})
.print()
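To make the difference between tumbling and sliding assignment concrete, the following plain-Java sketch computes the window or windows that a timestamp falls into. It is an illustration under stated assumptions (non-negative millisecond timestamps, zero offset, size divisible by slide); the class and method names are hypothetical and no Flink classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of time-window assignment; windows are [start, end) pairs in milliseconds. */
final class WindowAssignmentSketch {

    /** Start of the aligned interval of the given length that contains the timestamp. */
    static long alignedStart(long timestamp, long length) {
        return timestamp - (timestamp % length); // assumes timestamp >= 0 and zero offset
    }

    /** Tumbling: exactly one window per element. */
    static long[] tumbling(long timestamp, long size) {
        long start = alignedStart(timestamp, size);
        return new long[] {start, start + size};
    }

    /** Sliding: the element belongs to size / slide overlapping windows. */
    static List<long[]> sliding(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = alignedStart(timestamp, slide);
        // walk backwards over all window starts whose window still covers the timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] t = tumbling(7000, 5000);
        System.out.println(t[0] + " ~ " + t[1]); // 5000 ~ 10000
        for (long[] w : sliding(7000, 4000, 2000)) {
            System.out.println(w[0] + " ~ " + w[1]); // 6000 ~ 10000, then 4000 ~ 8000
        }
    }
}
```

An element at 7000 ms therefore contributes to one 5 s tumbling window, but to two windows when the size is 4 s and the slide is 2 s.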

Window Function

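After the window assigner is defined, the computation to perform on each window has to be specified. This is the responsibility of the window function, which processes the elements of each window once the system determines that the window is ready. Flink provides the following window functions: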

  • ReduceFunction/reduce
new ReduceFunction[(String, Int)] {
    override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
        (v1._1,v1._2+v2._2)
    }
}
  • AggregateFunction
new AggregateFunction[(String,Int),(String,Int),(String,Int)] {
    override def createAccumulator(): (String, Int) = {
        ("",0)
    }
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
        (value._1,value._2+accumulator._2)
    }
    override def getResult(accumulator: (String, Int)): (String, Int) = {
        accumulator
    }
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
}
  • FoldFunction (deprecated)
new FoldFunction[(String,Int),(String,Int)] {
    override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
        (value._1,accumulator._2+value._2)
    }
}

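fold() cannot be used with session windows or other mergeable windows.

  • apply/WindowFunction (legacy, generally not recommended)

Gives access to all elements of the window and to some metadata, but cannot work with window state.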

new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
        out.collect((key,inputs.map(_._2).sum))
    }
}

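With a WindowFunction, keyBy must not use a field index; use a key selector such as keyBy(_._1), so that the key type matches the key type parameter of the function.

  • ProcessWindowFunction (the one to master)

Gives access to all elements of the window and to some metadata. It supersedes WindowFunction because it can additionally work with per-window state and global state.

Accessing window metadata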

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3. transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {

    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {

        val w = context.window
        val sdf = new SimpleDateFormat("HH:mm:ss")

        println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))

        val total = elements.map(_._2).sum
        out.collect((key,total))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Combined with ReduceFunction|AggregateFunction|FoldFunction

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3. transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
        new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {

    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {

        val w = context.window
        val sdf = new SimpleDateFormat("HH:mm:ss")

        println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))

        val total = elements.map(_._2).sum
        out.collect((key,total))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Working with WindowState|GlobalState

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3. transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
        new ProcessWindowFunction[(String,Int),String,String,TimeWindow] {
            var windowStateDescriptor:ReducingStateDescriptor[Int]=_
            var globalStateDescriptor:ReducingStateDescriptor[Int]=_

            override def open(parameters: Configuration): Unit = {
                windowStateDescriptor = new ReducingStateDescriptor[Int]("wcs",new ReduceFunction[Int] {
                    override def reduce(value1: Int, value2: Int): Int = value1+value2
                },createTypeInformation[Int])
                globalStateDescriptor = new ReducingStateDescriptor[Int]("gcs",new ReduceFunction[Int] {
                    override def reduce(value1: Int, value2: Int): Int = value1+value2
                },createTypeInformation[Int])
            }

            override def process(key: String,
                                 context: Context,
                                 elements: Iterable[(String, Int)],
                                 out: Collector[String]): Unit = {

                val w = context.window
                val sdf = new SimpleDateFormat("HH:mm:ss")

                val windowState = context.windowState.getReducingState(windowStateDescriptor)
                val globalState = context.globalState.getReducingState(globalStateDescriptor)

                elements.foreach(t=>{
                    windowState.add(t._2)
                    globalState.add(t._2)
                })
                out.collect(key+"\t"+windowState.get()+"\t"+globalState.get())
            }
        })
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Trigger (触发器)

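A trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Every window assigner comes with a default trigger. If the default trigger does not fit your needs, a custom one can be specified with trigger(…).

| Window type | Trigger | Fires when |
| --- | --- | --- |
| event-time window (Tumbling/Sliding/Session) | EventTimeTrigger | the watermark passes the end of the window |
| processing-time window (Tumbling/Sliding/Session) | ProcessingTimeTrigger | the system time passes the end of the window |
| GlobalWindow (not time-based) | NeverTrigger | never fires |

public class UserDefineDeltaTrigger<T, W extends Window> extends Trigger<T, W> {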

    private final DeltaFunction<T> deltaFunction;
    private final double threshold;
    private final ValueStateDescriptor<T> stateDesc;

    private UserDefineDeltaTrigger(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
        this.deltaFunction = deltaFunction;
        this.threshold = threshold;
        this.stateDesc = new ValueStateDescriptor("last-element", stateSerializer);
    }

    public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ValueState<T> lastElementState = (ValueState)ctx.getPartitionedState(this.stateDesc);
        if (lastElementState.value() == null) {
            lastElementState.update(element);
            return TriggerResult.CONTINUE;
        } else if (this.deltaFunction.getDelta(lastElementState.value(), element) > this.threshold) {
            lastElementState.update(element);
            return TriggerResult.FIRE_AND_PURGE; // fire, then purge the window contents
        } else {
            // TriggerResult.FIRE would fire and keep the window contents
            return TriggerResult.CONTINUE; // keep waiting
        }
    }

    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    public void clear(W window, TriggerContext ctx) throws Exception {
        ((ValueState)ctx.getPartitionedState(this.stateDesc)).clear();
    }

    public String toString() {
        return "DeltaTrigger(" + this.deltaFunction + ", " + this.threshold + ")";
    }

    public static <T, W extends Window> UserDefineDeltaTrigger<T, W> of(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
        return new UserDefineDeltaTrigger(threshold, deltaFunction, stateSerializer);
    }
}

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

var deltaTrigger=UserDefineDeltaTrigger.of[(String,Double),GlobalWindow](10.0,new DeltaFunction[(String, Double)] {
    override def getDelta(lastData: (String, Double), newData: (String, Double)): Double = {
        newData._2-lastData._2
    }
},createTypeInformation[(String,Double)].createSerializer(fsEnv.getConfig))

//3. transform the data; the trigger threshold is 10
// sample input: a  100.0
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(deltaTrigger)
.apply(new WindowFunction[(String,Double),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Double)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
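The firing rule of the delta trigger above can be traced with plain numbers. The following Java sketch reproduces only the decision logic of onElement (remember the first element, fire when the new value exceeds the remembered one by more than the threshold, then remember the new value); the class name is hypothetical and no Flink classes are involved.

```java
/** Plain-Java trace of the delta trigger decision logic; not a Flink Trigger. */
final class DeltaTriggerSketch {
    private final double threshold;
    private Double last; // null until the first element has been seen

    DeltaTriggerSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true when the window would fire for this element. */
    boolean onElement(double value) {
        if (last == null) {
            last = value;          // first element: remember it, do not fire
            return false;
        }
        if (value - last > threshold) {
            last = value;          // fire and use this element as the new reference
            return true;
        }
        return false;              // the reference value stays unchanged
    }

    public static void main(String[] args) {
        DeltaTriggerSketch trigger = new DeltaTriggerSketch(10.0);
        double[] input = {100.0, 105.0, 111.0, 115.0, 122.0};
        for (double v : input) {
            System.out.println(v + " -> " + trigger.onElement(v));
        }
        // fires for 111.0 (delta 11 against 100.0) and for 122.0 (delta 11 against 111.0)
    }
}
```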

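Evictors

An evictor can remove elements from a window after the trigger fires, before and/or after the window function is applied. For this purpose the Evictor interface has two methods: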

public interface Evictor<T, W extends Window> extends Serializable {

	/**
	 * Optionally evicts elements. Called before windowing function.
	 *
	 * @param elements The elements currently in the pane.
	 * @param size The current number of elements in the pane.
	 * @param window The {@link Window}
	 * @param evictorContext The context for the Evictor
     */
	void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

	/**
	 * Optionally evicts elements. Called after windowing function.
	 *
	 * @param elements The elements currently in the pane.
	 * @param size The current number of elements in the pane.
	 * @param window The {@link Window}
	 * @param evictorContext The context for the Evictor
	 */
	void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
}

public class UserDefineErrorEvictor<W extends  Window> implements Evictor<String, W> {
    private  boolean isEvictorBefore;
    private  String  content;

    public UserDefineErrorEvictor(boolean isEvictorBefore, String content) {
        this.isEvictorBefore = isEvictorBefore;
        this.content=content;
    }

    public void evictBefore(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        if(isEvictorBefore){
            evict(elements,  size,  window,  evictorContext);
        }
    }

    public void evictAfter(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        if(!isEvictorBefore){
            evict(elements,  size,  window,  evictorContext);
        }
    }
    private  void evict(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        Iterator<TimestampedValue<String>> iterator = elements.iterator();
        while(iterator.hasNext()){
            TimestampedValue<String> next = iterator.next();
            String value = next.getValue();
            if(value.contains(content)){
                iterator.remove();
            }
        }
    }
}

EventTime Window

Flink supports different notions of time in streaming programs: Processing Time, Event Time, and Ingestion Time.

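If the user does not specify a time characteristic, Flink defaults to ProcessingTime. Both Ingestion Time and Processing Time are generated by the system: Ingestion Time is assigned by the Source Function, while Processing Time comes from the node that performs the computation. Neither requires a user-defined timestamp extraction strategy.

The mechanism Flink uses to measure progress in event time is the watermark. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that no more elements with a timestamp t' <= t should arrive on that stream.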

watermark(T) = max event time seen by the processing node - maxOrderness (the maximum allowed out-of-orderness)
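To make the formula concrete, here is a minimal stand-alone sketch in plain Java (no Flink dependency). The class `WatermarkSketch` and its methods are hypothetical names made up for this illustration, and the 2000 ms bound and the sample timestamps are arbitrary example values.

```java
// Hypothetical stand-alone sketch; not part of the Flink API.
final class WatermarkSketch {
    private final long maxOrderness;            // maximum allowed out-of-orderness, in ms
    private long maxSeenTime = Long.MIN_VALUE;  // largest event time seen so far

    WatermarkSketch(long maxOrderness) {
        this.maxOrderness = maxOrderness;
    }

    // Record the event time of an incoming element.
    void observe(long eventTime) {
        maxSeenTime = Math.max(maxSeenTime, eventTime);
    }

    // watermark = max event time seen - maxOrderness
    long currentWatermark() {
        if (maxSeenTime == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet
        }
        return maxSeenTime - maxOrderness;
    }

    public static void main(String[] args) {
        WatermarkSketch sketch = new WatermarkSketch(2000L);
        long[] eventTimes = {1000L, 5000L, 3000L, 7000L}; // 3000 arrives out of order
        for (long t : eventTimes) {
            sketch.observe(t);
            System.out.println("event=" + t + " watermark=" + sketch.currentWatermark());
        }
    }
}
```

The out-of-order element 3000 does not move the watermark backwards: it stays at 3000 (5000 - 2000) until the element with timestamp 7000 arrives.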

Computing the watermark

 .assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks|AssignerWithPunctuatedWatermarks)
  • AssignerWithPeriodicWatermarks: computes the watermark periodically
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
class UserDefineAssignerWithPeriodicWatermarks extends AssignerWithPeriodicWatermarks[(String,Long)] {

    var  maxOrderness=2000L
    var  maxSeenTime=0L
    var sdf=new SimpleDateFormat("HH:mm:ss")
    override def getCurrentWatermark: Watermark = {
        // println("watermarker:"+sdf.format(maxSeenTime-maxOrderness))
        new Watermark(maxSeenTime-maxOrderness)
    }

    override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
        maxSeenTime=Math.max(element._2,maxSeenTime)
        element._2
    }
}
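  • AssignerWithPunctuatedWatermarks: the watermark computation is triggered for every element the system receives
class UDAssignTimeStampsAndWatermarks extends AssignerWithPunctuatedWatermarks[(String, Long)] {
      var intervalTime= 3000L // maximum allowed gap between the watermark and the largest event time
      var maxSeenTime = 0L // largest event time seen so far

      // Called for every element, right after extractTimestamp
      override def checkAndGetNextWatermark(lastElement: (String, Long), extractedTimestamp: Long): Watermark = {
            new Watermark(maxSeenTime - intervalTime)
      }

      override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
            maxSeenTime = Math.max(element._2, maxSeenTime)
            element._2
      }
}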

Basic example

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
// 2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
// 3. Transform the data
// Input line: a <timestamp>
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toLong))
.assignTimestampsAndWatermarks(new UserDefineAssignerWithPeriodicWatermarks)
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new AllWindowFunction[(String,Long),String,TimeWindow] {
    var sdf=new SimpleDateFormat("HH:mm:ss")

    override def apply(window: TimeWindow,
                       input: Iterable[(String, Long)],
                       out: Collector[String]): Unit = {
        println(sdf.format(window.getStart)+" ~ "+ sdf.format(window.getEnd))
        out.collect(input.map(t=>t._1+"->" +sdf.format(t._2)).reduce((v1,v2)=>v1+" | "+v2))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
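To reason about which 5-second window an input line lands in, and when that window fires, the following plain-Java sketch may help. It is an illustration only: `TumblingWindowSketch` and its methods are hypothetical names, the window offset is assumed to be 0, and the firing rule (the watermark reaches the window end minus one millisecond) is my understanding of the event-time trigger rather than something stated in this article.

```java
// Hypothetical stand-alone sketch; not part of the Flink API.
final class TumblingWindowSketch {
    // Start of the tumbling window [start, start + size) that contains the timestamp (offset 0).
    static long windowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Smallest "max event time seen" that lets the window fire, assuming
    // watermark = maxSeenTime - maxOrderness and firing once watermark >= windowEnd - 1.
    static long firingEventTime(long windowEnd, long maxOrderness) {
        return windowEnd - 1 + maxOrderness;
    }

    public static void main(String[] args) {
        long size = 5000L; // 5-second tumbling windows
        long[] timestamps = {1000L, 4999L, 5000L};
        for (long ts : timestamps) {
            long start = windowStart(ts, size);
            System.out.println("ts=" + ts + " window=[" + start + ", " + (start + size) + ")");
        }
        System.out.println("window [0, 5000) can fire once an element with timestamp >= "
                + firingEventTime(5000L, 2000L) + " has been seen");
    }
}
```

With a periodic assigner the watermark is only emitted once per interval (1s here), so the actual firing can lag slightly behind the element that makes it possible.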

Handling late data

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// 2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

// 3. Transform the data
// Input line: a <timestamp>
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toLong))
.assignTimestampsAndWatermarks(new UserDefineAssignerWithPeriodicWatermarks)
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.allowedLateness(Time.seconds(2)) // while watermark - window end < 2s, late elements still take part in the computation
.apply(new AllWindowFunction[(String,Long),String,TimeWindow] {
    var sdf=new SimpleDateFormat("HH:mm:ss")

    override def apply(window: TimeWindow,
                       input: Iterable[(String, Long)],
                       out: Collector[String]): Unit = {
        println(sdf.format(window.getStart)+" ~ "+ sdf.format(window.getEnd))
        out.collect(input.map(t=>t._1+"->" +sdf.format(t._2)).reduce((v1,v2)=>v1+" | "+v2))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

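When window end time < watermark < window end time + allowed lateness, an element that falls into the already-fired window is treated as late data and can still take part in that window's computation.

Data that is too late

If the current watermark T - window end time >= allowed lateness, an element that falls into that window is dropped by Flink by default. To retrieve the elements that did not take part in the computation, use a side output.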
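The two rules above can be summarized as a small decision function. This is a plain-Java sketch of the rule as stated in the text, not Flink's internal check (whose exact boundary may differ by a millisecond); `LatenessSketch`, `Status`, and `classify` are hypothetical names.

```java
// Hypothetical stand-alone sketch; not part of the Flink API.
final class LatenessSketch {
    enum Status { ON_TIME, LATE, TOO_LATE }

    // Classifies an element that belongs to the window ending at windowEnd,
    // given the watermark at the moment the element arrives (all values in ms).
    static Status classify(long watermark, long windowEnd, long allowedLateness) {
        if (watermark < windowEnd) {
            return Status.ON_TIME;  // the window has not fired yet
        }
        if (watermark - windowEnd < allowedLateness) {
            return Status.LATE;     // the window already fired but is still kept, so the element is added
        }
        return Status.TOO_LATE;     // dropped by default, or sent to a side output if one is configured
    }

    public static void main(String[] args) {
        long windowEnd = 5000L;
        long allowedLateness = 2000L;
        long[] watermarks = {4000L, 5500L, 7000L};
        for (long wm : watermarks) {
            System.out.println("watermark=" + wm + " -> " + classify(wm, windowEnd, allowedLateness));
        }
    }
}
```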

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// 2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
val lateTag = new OutputTag[(String,Long)]("late")
// 3. Transform the data
// Input line: a <timestamp>
val stream = dataStream.map(_.split("\\s+"))
.map(ts => (ts(0), ts(1).toLong))
.assignTimestampsAndWatermarks(new UserDefineAssignerWithPeriodicWatermarks)
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.allowedLateness(Time.seconds(2)) // while watermark - window end < 2s, late elements still take part in the computation
.sideOutputLateData(lateTag) // send data that is too late to the side output
.apply(new AllWindowFunction[(String, Long), String, TimeWindow] {
    var sdf = new SimpleDateFormat("HH:mm:ss")

    override def apply(window: TimeWindow,
                       input: Iterable[(String, Long)],
                       out: Collector[String]): Unit = {
        println(sdf.format(window.getStart) + " ~ " + sdf.format(window.getEnd))
        out.collect(input.map(t => t._1 + "->" + sdf.format(t._2)).reduce((v1, v2) => v1 + " | " + v2))
    }
})
stream.print("window")
stream.getSideOutput(lateTag).print("late data:")

fsEnv.execute("FlinkWordCountsQuickStart")

Watermarks in Parallel Streams

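Watermarks are generated directly after the Source Function. Each parallel subtask of a Source Function usually generates its watermarks independently. As the watermarks flow through the streaming program, they advance the event time of the operators they reach. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators. When a downstream operator receives watermarks from several inputs, it takes the smallest one.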
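The "take the smallest watermark" rule can be sketched in plain Java as follows. `MinWatermarkSketch` is a hypothetical helper written for this illustration (not Flink code): it remembers the latest watermark per input channel and reports the minimum as the operator's event time.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of the Flink API.
final class MinWatermarkSketch {
    private final long[] channelWatermarks; // latest watermark received on each input channel

    MinWatermarkSketch(int channels) {
        channelWatermarks = new long[channels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE); // nothing received yet
    }

    // Updates one channel and returns the operator's event time: the minimum over all channels.
    long onWatermark(int channel, long watermark) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        return Arrays.stream(channelWatermarks).min().getAsLong();
    }

    public static void main(String[] args) {
        MinWatermarkSketch operator = new MinWatermarkSketch(2);
        System.out.println(operator.onWatermark(0, 29L)); // channel 1 is still silent
        System.out.println(operator.onWatermark(1, 14L)); // slower channel decides
        System.out.println(operator.onWatermark(1, 33L)); // now channel 0 is the slower one
    }
}
```

In this run the operator's event time is held back until both channels have reported, then follows the slower channel: 14 after the second call and 29 after the third.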

Join

Window Join

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)

Tumbling Window Join

When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed on to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, elements of one stream that do not have elements from another stream in their tumbling window are not emitted!

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// Input line: 001 zhangsan <timestamp>
val userStrem: DataStream[(String,String,Long)] = fsEnv.socketTextStream("Spark",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toLong))
.assignTimestampsAndWatermarks(new UserAssignerWithPunctuatedWatermarks)
// Input line: 001 100.0 <timestamp>
val orderStream: DataStream[(String,Double,Long)] = fsEnv.socketTextStream("Spark",8888)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble,ts(2).toLong))
.assignTimestampsAndWatermarks(new OrderAssignerWithPunctuatedWatermarks)

userStrem.join(orderStream)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.apply((v1,v2,out:Collector[String])=>{
    out.collect(v1._1+"\t"+v1._2+"\t"+v2._2)
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
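The inner-join semantics of the tumbling window join can be imitated on in-memory lists. The plain-Java sketch below is only a model of the behaviour described above, not how Flink implements it; `TumblingJoinSketch`, `Event`, and the sample record for key 002 are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Flink API.
final class TumblingJoinSketch {
    // Minimal event: key, payload, event-time timestamp in ms.
    static final class Event {
        final String key;
        final String value;
        final long timestamp;

        Event(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static long windowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Inner join: one output row per (user, order) pair with the same key in the same window.
    static List<String> join(List<Event> users, List<Event> orders, long size) {
        List<String> out = new ArrayList<>();
        for (Event u : users) {
            for (Event o : orders) {
                if (u.key.equals(o.key)
                        && windowStart(u.timestamp, size) == windowStart(o.timestamp, size)) {
                    out.add(u.key + "\t" + u.value + "\t" + o.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> users = Arrays.asList(
                new Event("001", "zhangsan", 1000L),
                new Event("002", "lisi", 1500L));   // no order with key 002: not emitted
        List<Event> orders = Arrays.asList(
                new Event("001", "100.0", 1800L),   // same key, same window [0, 2000)
                new Event("001", "50.0", 2500L));   // same key, but window [2000, 4000)
        for (String row : join(users, orders, 2000L)) {
            System.out.println(row);
        }
    }
}
```

Only the pair (zhangsan, 100.0) is printed: the second order falls into the next window, and the user with key 002 has no partner at all.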

Sliding Window Join

When performing a sliding window join, all elements with a common key and common sliding window are joined as pairwise combinations and passed on to the JoinFunction or FlatJoinFunction. Elements of one stream that do not have elements from the other stream in the current sliding window are not emitted!

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// Input line: 001 zhangsan <timestamp>
val userStrem: DataStream[(String,String,Long)] = fsEnv.socketTextStream("Spark",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toLong))
.assignTimestampsAndWatermarks(new UserAssignerWithPunctuatedWatermarks)
// Input line: 001 100.0 <timestamp>
val orderStream: DataStream[(String,Double,Long)] = fsEnv.socketTextStream("Spark",8888)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble,ts(2).toLong))
.assignTimestampsAndWatermarks(new OrderAssignerWithPunctuatedWatermarks)

userStrem.join(orderStream)
.where(_._1)
.equalTo(_._1)
.window(SlidingEventTimeWindows.of(Time.seconds(2),Time.seconds(1)))
.apply((v1,v2,out:Collector[String])=>{
    out.collect(v1._1+"\t"+v1._2+"\t"+v2._2)
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
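With a window size of 2s and a slide of 1s, every element belongs to two overlapping windows, so a matching pair can be emitted once per window that both elements share. The plain-Java sketch below lists the windows one timestamp falls into; `SlidingWindowSketch` and `windowStarts` are hypothetical names and the window offset is assumed to be 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Flink API.
final class SlidingWindowSketch {
    // Start timestamps of every sliding window [start, start + size) that contains the timestamp.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide); // latest window that contains it
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 2000L;
        long slide = 1000L;
        for (long start : windowStarts(2500L, size, slide)) {
            System.out.println("ts=2500 is in window [" + start + ", " + (start + size) + ")");
        }
    }
}
```

For the timestamp 2500 this yields the windows [2000, 4000) and [1000, 3000).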

Session Window Join

When performing a session window join, all elements with the same key that when “combined” fulfill the session criteria are joined in pairwise combinations and passed on to the JoinFunction or FlatJoinFunction. Again this performs an inner join, so if there is a session window that only contains elements from one stream, no output will be emitted!

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// Input line: 001 zhangsan <timestamp>
val userStrem: DataStream[(String,String,Long)] = fsEnv.socketTextStream("Spark",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toLong))
.assignTimestampsAndWatermarks(new UserAssignerWithPunctuatedWatermarks)
// Input line: 001 100.0 <timestamp>
val orderStream: DataStream[(String,Double,Long)] = fsEnv.socketTextStream("Spark",8888)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble,ts(2).toLong))
.assignTimestampsAndWatermarks(new OrderAssignerWithPunctuatedWatermarks)

userStrem.join(orderStream)
.where(_._1)
.equalTo(_._1)
.window(EventTimeSessionWindows.withGap(Time.seconds(2)))
.apply((v1,v2,out:Collector[String])=>{
    out.collect(v1._1+"\t"+v1._2+"\t"+v2._2)
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
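How elements of one key are grouped into sessions with a 2s gap can be sketched in plain Java. This is a simplified model, not Flink's merging-window implementation: `SessionWindowSketch` is a hypothetical name, the timestamps of both streams are simply pooled, and treating two elements exactly one gap apart as the same session is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Flink API.
final class SessionWindowSketch {
    // Groups timestamps into sessions. Each element opens the window [ts, ts + gap);
    // windows that touch or overlap are merged. Returns {start, end} pairs.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts <= last[1]) {
                last[1] = Math.max(last[1], ts + gap); // extend the current session
            } else {
                result.add(new long[] {ts, ts + gap}); // start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pooled timestamps of both streams for one key: users at 1000 and 6000, orders at 2500 and 9000.
        long[] timestamps = {1000L, 6000L, 2500L, 9000L};
        for (long[] s : sessions(timestamps, 2000L)) {
            System.out.println("session [" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Here only the first session, [1000, 4500), contains elements from both streams, so it is the only one for which the join would produce output.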

Interval Join

The interval join joins elements of two streams (we’ll call them A & B for now) with a common key and where elements of stream B have timestamps that lie in a relative time interval to timestamps of elements in stream A.

This can also be expressed more formally as b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound] or a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound

where a and b are elements of A and B that share a common key. Both the lower and upper bound can be either negative or positive as long as the lower bound is always smaller than or equal to the upper bound. The interval join currently only performs inner joins.

When a pair of elements are passed to the ProcessJoinFunction, they will be assigned with the larger timestamp (which can be accessed via the ProcessJoinFunction.Context) of the two elements.

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
// Set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// Compute the watermark every 1s
fsEnv.getConfig.setAutoWatermarkInterval(1000)

// Input line: 001 zhangsan <timestamp>
val userkeyedStrem: KeyedStream[(String,String,Long),String] = fsEnv.socketTextStream("Spark",9999)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1),ts(2).toLong))
.assignTimestampsAndWatermarks(new UserAssignerWithPunctuatedWatermarks)
.keyBy(t=>t._1)
// Input line: 001 100.0 <timestamp>
val orderStream: KeyedStream[(String,Double,Long),String] = fsEnv.socketTextStream("Spark",8888)
.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble,ts(2).toLong))
.assignTimestampsAndWatermarks(new OrderAssignerWithPunctuatedWatermarks)
.keyBy(t=>t._1)

userkeyedStrem.intervalJoin(orderStream)
.between(Time.seconds(-2),Time.seconds(2))
.process(new ProcessJoinFunction[(String,String,Long),(String,Double,Long),String] {
    override def processElement(left: (String, String, Long),
                                right: (String, Double, Long),
                                ctx: ProcessJoinFunction[(String, String, Long), (String, Double, Long), String]#Context,
                                out: Collector[String]): Unit = {
        val leftTimestamp = ctx.getLeftTimestamp
        val rightTimestamp = ctx.getRightTimestamp
        val timestamp = ctx.getTimestamp
        println(s"left:${leftTimestamp},right:${rightTimestamp},timestamp:${timestamp}")
        out.collect(left._1+"\t"+left._2+"\t"+right._2)
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")
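The interval condition and the timestamp of the joined result can be checked with a few lines of plain Java. `IntervalJoinSketch` is a hypothetical helper for illustration; the bounds of -2s and +2s match the `between` call in the example above.

```java
// Hypothetical stand-alone sketch; not part of the Flink API.
final class IntervalJoinSketch {
    // a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound
    static boolean matches(long aTimestamp, long bTimestamp, long lowerBound, long upperBound) {
        return aTimestamp + lowerBound <= bTimestamp && bTimestamp <= aTimestamp + upperBound;
    }

    // The joined pair carries the larger of the two timestamps.
    static long resultTimestamp(long aTimestamp, long bTimestamp) {
        return Math.max(aTimestamp, bTimestamp);
    }

    public static void main(String[] args) {
        long lower = -2000L;
        long upper = 2000L;
        long userTs = 5000L;
        long[] orderTimestamps = {2999L, 3000L, 7000L, 7001L};
        for (long orderTs : orderTimestamps) {
            System.out.println("order@" + orderTs + " joins user@" + userTs + ": "
                    + matches(userTs, orderTs, lower, upper));
        }
        System.out.println("timestamp of the pair (5000, 7000): " + resultTimestamp(5000L, 7000L));
    }
}
```

Both bounds are inclusive, so orders at 3000 and 7000 still join a user element at 5000, while 2999 and 7001 do not.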

Setting up Flink HA

The general idea of JobManager high availability for standalone clusters is that there is a single leading JobManager at any time and multiple standby JobManagers to take over leadership in case the leader fails. This guarantees that there is no single point of failure and programs can make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and master JobManager instances. Each JobManager can take the role of master or standby.

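Prerequisites

  • Clock synchronization

  • IP-to-hostname mapping

  • Passwordless SSH login

  • Disable the firewall

  • Install JDK 8

    Configure the environment variables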

    # .bashrc
    
    # User specific aliases and functions
    
    alias rm='rm -i'
    alias cp='cp -i'
    alias mv='mv -i'
    
    # Source global definitions
    if [ -f /etc/bashrc ]; then
            . /etc/bashrc
    fi
    export JAVA_HOME=/home/java/jdk1.8.0_181
    export MAVEN_HOME=/home/maven/apache-maven-3.3.9
    export M2_HOME=/home/maven/apache-maven-3.3.9
    export FINDBUGS_HOME=/home/findbugs/findbugs-3.0.1
    export PROTOBUF_HOME=/home/protobuf/protobuf-2.5.0
    export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
    export HBASE_HOME=/home/hbase/hbase-1.2.4
    export HBASE_MANAGES_ZK=false
    export PATH=$PATH:$JAVA_HOME/bin:$M2_HOME/bin:$MAVEN_HOME/bin:$FINDBUGS_HOME/bin:$PROTOBUF_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
    export HADOOP_CLASSPATH=`hadoop classpath`
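    # Note: the previous line must come last, otherwise it does not take effect
    
  • Install and start ZooKeeper

  • Install and start HDFS HA

  • Set up Flink HA

    1. Extract the package into the target directory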

    tar -zxf flink-1.8.1-bin-scala_2.11.tgz -C /usr/
    

    2. Edit the Flink configuration file

    [root@hadoopnode0* flink-1.8.1]# vim conf/flink-conf.yaml
    jobmanager.rpc.address: localhost	# leave this unchanged in a cluster setup
    taskmanager.numberOfTaskSlots: 3	# task slots per node; the total parallelism is the sum over all nodes
    high-availability: zookeeper
    high-availability.storageDir: hdfs:///flink/ha/	# no connection parameters are needed in the cluster environment
    high-availability.zookeeper.quorum: hadoopnode01:2181,hadoopnode02:2181,hadoopnode03:2181	# ZooKeeper quorum addresses
    # Recommended, but not required
    high-availability.zookeeper.path.root: /flink	# store Flink's data under the /flink node at the ZooKeeper root
    high-availability.cluster-id: /default_ns
    state.backend: rocksdb	# state backend
    state.checkpoints.dir: hdfs:///flink-checkpoints	# checkpoint directory
    state.savepoints.dir: hdfs:///flink-savepoints	# savepoint directory; it may be the same as the checkpoint directory, kept separate here to tell the two apart
    state.backend.incremental: true	# enable incremental state storage
    state.backend.rocksdb.ttl.compaction.filter.enabled: true	# let RocksDB drop expired data during compaction
    

    Configure slaves

    [root@hadoopnode0* flink-1.8.1]# vim conf/slaves
    # List every node in the cluster
    hadoopnode01
    hadoopnode02
    hadoopnode03
    

    Configure masters

    [root@hadoopnode0* flink-1.8.1]# vim conf/masters
    # List every node in the cluster
    hadoopnode01:8081
    hadoopnode02:8081
    hadoopnode03:8081
    

    Start the Flink cluster

[root@hadoopnode01 flink-1.8.1]# ./bin/start-cluster.sh
[root@hadoopnode01 flink-1.8.1]# jps
13600 QuorumPeerMain
3415 Jps
313 DataNode
3114 TaskManagerRunner
65406 NameNode
750 DFSZKFailoverController
2591 StandaloneSessionClusterEntrypoint
527 JournalNode


Open the web UI

> hadoopnode0*:8081

Identifying the leader JobManager

1. Only the leader JobManager receives the log information reported by the TaskManagers.

2. Only the leader JobManager's log contains the leadership record.

Kill the leader JobManager

[root@hadoopnode01 flink-1.8.1]# ./bin/flink-daemon.sh stop standalonesession

ZooKeeper then elects a new leader JobManager.
