1 Keyed State

1.1 How keys are partitioned

1. Compute the KeyGroup index from the key and the maximum parallelism (maxParallelism)

First, two concepts need to be understood:

/**
 * Assigns the given key to a key-group index.
 *
 * @param key the key to assign
 * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
 * @return the key-group to which the given key is assigned
 */
public static int assignToKeyGroup(Object key, int maxParallelism) {
    return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
}

/**
 * Assigns the given key to a key-group index.
 *
 * @param keyHash the hash of the key to assign
 * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
 * @return the key-group to which the given key is assigned
 */
public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
    return MathUtils.murmurHash(keyHash) % maxParallelism;
}

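2. Compute the index of the downstream operator instance from the downstream operator's parallelism, its maximum parallelism, and the KeyGroup index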

/**
 * Computes the index of the operator to which a key-group belongs under the given parallelism and maximum
 * parallelism.
 *
 * IMPORTANT: maxParallelism must be <= Short.MAX_VALUE to avoid rounding problems in this method. If we ever want
 * to go beyond this boundary, this method must perform arithmetic on long values.
 *
 * @param maxParallelism Maximal parallelism that the job was initially created with.
 *                       0 < parallelism <= maxParallelism <= Short.MAX_VALUE must hold.
 * @param parallelism    The current parallelism under which the job runs. Must be <= maxParallelism.
 * @param keyGroupId     Id of a key-group. 0 <= keyGroupID < maxParallelism.
 * @return The index of the operator to which elements from the given key-group should be routed under the given
 * parallelism and maxParallelism.
 */
public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
    return keyGroupId * parallelism / maxParallelism;
}
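
To make the integer arithmetic concrete, here is a small standalone sketch that applies the same formula as computeOperatorIndexForKeyGroup without depending on Flink. The class and method names (KeyGroupRoutingSketch, operatorIndex, route) are made up for this illustration. The twelve key-group indices are the ones that appear in the prepared data of section 1.3.2, and maxParallelism is assumed to be 128 as in the test program there.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Standalone sketch: routes key-group indices to operator subtasks. */
final class KeyGroupRoutingSketch {

    /** Same integer arithmetic as the computeOperatorIndexForKeyGroup method quoted above. */
    static int operatorIndex(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }

    /** Maps each key group to the subtask index it is routed to under the given parallelism. */
    static Map<Integer, Integer> route(int[] keyGroups, int maxParallelism, int parallelism) {
        Map<Integer, Integer> routing = new LinkedHashMap<>();
        for (int keyGroup : keyGroups) {
            routing.put(keyGroup, operatorIndex(maxParallelism, parallelism, keyGroup));
        }
        return routing;
    }

    public static void main(String[] args) {
        // Key-group indices taken from the prepared data in section 1.3.2.
        int[] keyGroups = {3, 14, 31, 35, 44, 62, 67, 85, 94, 102, 112, 120};
        System.out.println("parallelism 12: " + route(keyGroups, 128, 12));
        System.out.println("parallelism 6:  " + route(keyGroups, 128, 6));
    }
}
```

With parallelism 12 each of these key groups lands on its own subtask; with parallelism 6 neighbouring pairs share a subtask, which is the behaviour verified in section 1.3.3.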



1.3 Verification with code

1.3.1 Scenario and implementation

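package com.hollysys.flink.streaming.state.redistribution

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

/**
 * Demonstrates how keyed state is redistributed when the parallelism changes.
 * Every key produced by keyBy is assigned to a key group, and state migrates
 * together with its key group.
 *
 * @author shirukai
 */
object KeyedStateRedistributionExample {

  case class Device(id: String, value: Double)

  case class Result(taskNumber: Int, deviceId: String, sum: Double)

  def main(args: Array[String]): Unit = {
    // 1. Create the local execution environment.
    // The original setup code is not shown; this minimal line is an assumption.
    // Restoring state across restarts also needs checkpoint/savepoint settings that are not shown here.
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    // 2. Read text from a socket
    val streamText: DataStream[String] = env.socketTextStream("127.0.0.1", 9000)
      .name("SocketSource")
      .uid("SocketSource")

    // 3. Convert each line into a Device case class
    val deviceStream = streamText.map(text => {
      val items = text.split(" ")
      Device(items(0), items(1).toDouble)
    }).setParallelism(1)
      .name("FormatDevice")
      .uid("FormatDevice")

    // 4. Compute the running sum per key
    val resultStream = deviceStream.keyBy(_.id).map(new ValueAccumulator)
      // Set the parallelism to 12
      .setParallelism(12)
      .name("ValueAccumulator")
      .uid("ValueAccumulator")

    // 5. Print to the console
    resultStream.print().name("Print").uid("Print")

    // 6. Execute
    env.execute("KeyedStateRedistributionExample")
  }

  class ValueAccumulator extends RichMapFunction[Device, Result] {
    private var accumulatorState: ValueState[Double] = _

    override def open(parameters: Configuration): Unit = {
      // Obtain the state handle
      accumulatorState = getRuntimeContext.getState(new ValueStateDescriptor[Double]("sum-state", classOf[Double]))
    }

    override def map(value: Device): Result = {
      val sum = accumulatorState.value() + value.value

      // Update the state
      accumulatorState.update(sum)
      // Emit the subtask index, the key, and the running sum
      Result(getRuntimeContext.getIndexOfThisSubtask, value.id, sum)
    }
  }

}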



1.3.2 Preparing the data

  [
"device-1 3 0",
"device-97 3 0",
"device-19 14 1",
"device-77 14 1",
"device-5 31 2",
"device-7 31 2",
"device-2 35 3",
"device-433 35 3",
"device-27 44 4",
"device-146 44 4",
"device-16 62 5",
"device-62 62 5",
"device-37 67 6",
"device-360 67 6",
"device-32 85 7",
"device-69 85 7",
"device-17 94 8",
"device-53 94 8",
"device-8 102 9",
"device-71 102 9",
"device-12 112 10",
"device-256 112 10",
"device-13 120 11",
"device-222 120 11"
]

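| Key | KeyGroup index | Operator index at parallelism 12 |
| --- | --- | --- |
| device-1 | 3 | 0 |
| device-97 | 3 | 0 |
| device-19 | 14 | 1 |
| device-77 | 14 | 1 |
| device-5 | 31 | 2 |
| device-7 | 31 | 2 |
| device-2 | 35 | 3 |
| device-433 | 35 | 3 |
| device-27 | 44 | 4 |
| device-146 | 44 | 4 |
| device-16 | 62 | 5 |
| device-62 | 62 | 5 |
| device-37 | 67 | 6 |
| device-360 | 67 | 6 |
| device-32 | 85 | 7 |
| device-69 | 85 | 7 |
| device-17 | 94 | 8 |
| device-53 | 94 | 8 |
| device-8 | 102 | 9 |
| device-71 | 102 | 9 |
| device-12 | 112 | 10 |
| device-256 | 112 | 10 |
| device-13 | 120 | 11 |
| device-222 | 120 | 11 |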


import org.apache.flink.runtime.state.KeyGroupRangeAssignment
import org.json4s.{Formats, NoTypeHints}
// Assumption: json4s-jackson is used; json4s-native exposes the same Serialization API.
import org.json4s.jackson.Serialization

import scala.collection.mutable

/**
 * KeyGroup assignment test
 *
 * @author shirukai
 */
object KeyGroupRangeAssignmentTest {

  case class KeyGroup(key: String, group: Int)

  def main(args: Array[String]): Unit = {
    // Maximum parallelism
    val maxParallelism = 128
    // Operator parallelism
    val parallelism = 12
    val map = mutable.SortedMap[Int, mutable.ListBuffer[KeyGroup]]()
    for (elem <- 1.until(1000)) {
      val key = s"device-$elem"
      // Compute the KeyGroup index
      val keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism)
      // Compute the operator index
      val index = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup)
      if (!map.contains(index)) {
        map.put(index, mutable.ListBuffer(KeyGroup(key, keyGroup)))
      } else {
        val keyList = map(index)
        // Keep at most two keys per operator index, both from the same KeyGroup
        if (keyList.size < 2) {
          val group = keyList.head.group
          if (group == keyGroup) {
            keyList.append(KeyGroup(key, keyGroup))
          }
        }
      }
    }
    implicit val formats: AnyRef with Formats = Serialization.formats(NoTypeHints)
    // Each entry is "<key> <keyGroup> <operatorIndex>"
    println(Serialization.write(map.flatMap(i => {
      i._2.map(k => s"${k.key} ${k.group} ${i._1}")
    })))
  }
}


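1.3.3 Verification

1. Start the program for the first time and send the prepared records one by one. The operator index in each output record matches the expected value.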

[Image: https://cdn.jsdelivr.net/gh/shirukai/images/20200828171730.gif]

| Key | KeyGroup index | Operator index at parallelism 12 | Input | Expected output at parallelism 12 |
| --- | --- | --- | --- | --- |
| device-1 | 3 | 0 | device-1 1.0 | Result(0,device-1,1.0) |
| device-97 | 3 | 0 | device-97 1.0 | Result(0,device-97,1.0) |
| device-19 | 14 | 1 | device-19 1.0 | Result(1,device-19,1.0) |
| device-77 | 14 | 1 | device-77 1.0 | Result(1,device-77,1.0) |
| device-5 | 31 | 2 | device-5 1.0 | Result(2,device-5,1.0) |
| device-7 | 31 | 2 | device-7 1.0 | Result(2,device-7,1.0) |
| device-2 | 35 | 3 | device-2 1.0 | Result(3,device-2,1.0) |
| device-433 | 35 | 3 | device-433 1.0 | Result(3,device-433,1.0) |
| device-27 | 44 | 4 | device-27 1.0 | Result(4,device-27,1.0) |
| device-146 | 44 | 4 | device-146 1.0 | Result(4,device-146,1.0) |
| device-16 | 62 | 5 | device-16 1.0 | Result(5,device-16,1.0) |
| device-62 | 62 | 5 | device-62 1.0 | Result(5,device-62,1.0) |
| device-37 | 67 | 6 | device-37 1.0 | Result(6,device-37,1.0) |
| device-360 | 67 | 6 | device-360 1.0 | Result(6,device-360,1.0) |
| device-32 | 85 | 7 | device-32 1.0 | Result(7,device-32,1.0) |
| device-69 | 85 | 7 | device-69 1.0 | Result(7,device-69,1.0) |
| device-17 | 94 | 8 | device-17 1.0 | Result(8,device-17,1.0) |
| device-53 | 94 | 8 | device-53 1.0 | Result(8,device-53,1.0) |
| device-8 | 102 | 9 | device-8 1.0 | Result(9,device-8,1.0) |
| device-71 | 102 | 9 | device-71 1.0 | Result(9,device-71,1.0) |
| device-12 | 112 | 10 | device-12 1.0 | Result(10,device-12,1.0) |
| device-256 | 112 | 10 | device-256 1.0 | Result(10,device-256,1.0) |
| device-13 | 120 | 11 | device-13 1.0 | Result(11,device-13,1.0) |
| device-222 | 120 | 11 | device-222 1.0 | Result(11,device-222,1.0) |

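2. Stop the program, change the parallelism of the map operator to 6, restart it, and send the prepared data again. Records that belong to the same KeyGroup are handled by the same operator instance.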

| Key | KeyGroup index | Operator index at parallelism 12 | Input | Expected output at parallelism 6 |
| --- | --- | --- | --- | --- |
| device-1 | 3 | 0 | device-1 1.0 | Result(0,device-1,2.0) |
| device-97 | 3 | 0 | device-97 1.0 | Result(0,device-97,2.0) |
| device-19 | 14 | 1 | device-19 1.0 | Result(0,device-19,2.0) |
| device-77 | 14 | 1 | device-77 1.0 | Result(0,device-77,2.0) |
| device-5 | 31 | 2 | device-5 1.0 | Result(1,device-5,2.0) |
| device-7 | 31 | 2 | device-7 1.0 | Result(1,device-7,2.0) |
| device-2 | 35 | 3 | device-2 1.0 | Result(1,device-2,2.0) |
| device-433 | 35 | 3 | device-433 1.0 | Result(1,device-433,2.0) |
| device-27 | 44 | 4 | device-27 1.0 | Result(2,device-27,2.0) |
| device-146 | 44 | 4 | device-146 1.0 | Result(2,device-146,2.0) |
| device-16 | 62 | 5 | device-16 1.0 | Result(2,device-16,2.0) |
| device-62 | 62 | 5 | device-62 1.0 | Result(2,device-62,2.0) |
| device-37 | 67 | 6 | device-37 1.0 | Result(3,device-37,2.0) |
| device-360 | 67 | 6 | device-360 1.0 | Result(3,device-360,2.0) |
| device-32 | 85 | 7 | device-32 1.0 | Result(3,device-32,2.0) |
| device-69 | 85 | 7 | device-69 1.0 | Result(3,device-69,2.0) |
| device-17 | 94 | 8 | device-17 1.0 | Result(4,device-17,2.0) |
| device-53 | 94 | 8 | device-53 1.0 | Result(4,device-53,2.0) |
| device-8 | 102 | 9 | device-8 1.0 | Result(4,device-8,2.0) |
| device-71 | 102 | 9 | device-71 1.0 | Result(4,device-71,2.0) |
| device-12 | 112 | 10 | device-12 1.0 | Result(5,device-12,2.0) |
| device-256 | 112 | 10 | device-256 1.0 | Result(5,device-256,2.0) |
| device-13 | 120 | 11 | device-13 1.0 | Result(5,device-13,2.0) |
| device-222 | 120 | 11 | device-222 1.0 | Result(5,device-222,2.0) |
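
Why do the sums continue at 2.0 after rescaling? Because state travels with its key group, and the key group is re-routed by the same formula. The following standalone sketch (no Flink dependency; the names KeyedRescaleSketch and inheritedFrom are invented for this illustration) walks over all 128 key groups and records, for each subtask at the new parallelism, which old subtasks previously owned its key groups. A maxParallelism of 128 is assumed, as in the test program above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Standalone sketch: which old subtasks hand their key groups to each new subtask. */
final class KeyedRescaleSketch {

    /** Same integer arithmetic as computeOperatorIndexForKeyGroup. */
    static int operatorIndex(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }

    /** Element i of the result lists the old subtask indices whose key groups new subtask i takes over. */
    static List<TreeSet<Integer>> inheritedFrom(int maxParallelism, int oldParallelism, int newParallelism) {
        List<TreeSet<Integer>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new TreeSet<>());
        }
        for (int keyGroup = 0; keyGroup < maxParallelism; keyGroup++) {
            int oldIndex = operatorIndex(maxParallelism, oldParallelism, keyGroup);
            int newIndex = operatorIndex(maxParallelism, newParallelism, keyGroup);
            result.get(newIndex).add(oldIndex);
        }
        return result;
    }

    public static void main(String[] args) {
        // Rescale from 12 to 6 subtasks, as in the verification steps.
        System.out.println(inheritedFrom(128, 12, 6));
    }
}
```

For the 12 to 6 case, new subtask j takes over the key groups of old subtasks 2j and 2j+1, which matches the pairs of rows that share an output index in the parallelism-6 table.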

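2 Operator List State (ListCheckpointed)

2.3 Verification with code

2.3.1 Scenario and implementation

1. First set the parallelism of the flatMap operator to 2 and send the prepared data. The output matches the expectation.
2. Then set the parallelism of the flatMap operator to 3. Two operator instances continue counting from the previous state, and the third starts counting from scratch.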

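package com.hollysys.flink.streaming.state.redistribution

import java.util
import java.util.Collections

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * Demonstrates how list state is redistributed when the parallelism changes.
 *
 * @author shirukai
 */
object ListStateRedistributionExample {

  case class Device(id: String, value: Double)

  case class Result(taskNumber: Int, count: Long)

  def main(args: Array[String]): Unit = {
    // 1. Create the local execution environment.
    // The original setup code is not shown; this minimal line is an assumption.
    // Restoring state across restarts also needs checkpoint/savepoint settings that are not shown here.
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    // 2. Read text from a socket
    val streamText: DataStream[String] = env.socketTextStream("127.0.0.1", 9000)
      .name("SocketSource")
      .uid("SocketSource")

    // 3. Convert each line into a Device case class
    val deviceStream = streamText.map(text => {
      val items = text.split(" ")
      Device(items(0), items(1).toDouble)
    }).setParallelism(1)
      .name("FormatDevice")
      .uid("FormatDevice")

    // 4. Count the records whose value is greater than 6.0
    val resultStream = deviceStream
      // Distribute records evenly
      .rescale
      .flatMap(new HighValueCounter(6.0))
      .setParallelism(2)
      .name("HighValueCounter")
      .uid("HighValueCounter")

    // 5. Print the results to the console
    resultStream.print()
      .setParallelism(1)
      .name("Print")
      .uid("Print")

    // 6. Submit for execution
    env.execute("ListStateRedistributionExample")
  }

  class HighValueCounter(threshold: Double) extends RichFlatMapFunction[Device, Result] with ListCheckpointed[java.lang.Long] {
    // Index of this subtask
    private lazy val subtaskIndex = getRuntimeContext.getIndexOfThisSubtask
    // Local counter
    private var highValueCounter = 0L

    override def flatMap(value: Device, out: Collector[Result]): Unit = {
      if (value.value > threshold) {
        // Increment the counter when the threshold is exceeded
        highValueCounter += 1
        // Emit the subtask index and the current counter value
        out.collect(Result(subtaskIndex, highValueCounter))
      }
    }

    /**
     * Returns the current state so it can be stored in the snapshot.
     *
     * @param checkpointId checkpoint ID
     * @param timestamp    checkpoint timestamp
     * @return the counter wrapped in a single-element list
     */
    override def snapshotState(checkpointId: Long, timestamp: Long): util.List[java.lang.Long] = {
      Collections.singletonList(highValueCounter)
    }

    /**
     * Restores the state from a previous checkpoint.
     *
     * @param state the state stored in the checkpoint
     */
    override def restoreState(state: util.List[java.lang.Long]): Unit = {
      import scala.collection.JavaConverters._
      for (cnt <- state.asScala) {
        highValueCounter += cnt
      }
    }
  }

}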



2.3.2 Preparing the data

device-1 7.0
device-1 8.0
device-1 9.0


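2.3.3 Verification

1. Set the parallelism of the flatMap operator to 2 and send the three prepared records in order. The output matches the expectation.

| Input | Expected output at parallelism 2 |
| --- | --- |
| device-1 7.0 | Result(0,1) |
| device-1 8.0 | Result(1,1) |
| device-1 9.0 | Result(0,2) |

2. Set the parallelism of the flatMap operator to 3 and send the three prepared records in order again. The output matches the expectation.

| Input | Expected output at parallelism 3 |
| --- | --- |
| device-1 7.0 | Result(0,3) |
| device-1 8.0 | Result(1,2) |
| device-1 9.0 | Result(2,1) |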
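
The numbers at parallelism 3 can be reproduced with a simplified model of list state redistribution: all list elements from the old subtasks are collected and handed out one by one to the new subtasks. The sketch below is plain Java with no Flink dependency. The round-robin order is an assumption of this model rather than a description of Flink's actual repartitioner, and the names (ListStateRedistributionSketch, redistribute, restoreCounter) are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of how list state elements are split across new subtasks. */
final class ListStateRedistributionSketch {

    /** Hands out every snapshotted element to the new subtasks in round-robin order (assumed order). */
    static List<List<Long>> redistribute(List<List<Long>> oldStates, int newParallelism) {
        List<List<Long>> newStates = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            newStates.add(new ArrayList<>());
        }
        int next = 0;
        for (List<Long> subtaskState : oldStates) {
            for (Long element : subtaskState) {
                newStates.get(next % newParallelism).add(element);
                next++;
            }
        }
        return newStates;
    }

    /** Mirrors restoreState in HighValueCounter: the counter is the sum of the restored elements. */
    static long restoreCounter(List<Long> restored) {
        long counter = 0L;
        for (long value : restored) {
            counter += value;
        }
        return counter;
    }

    public static void main(String[] args) {
        // After the parallelism-2 run: subtask 0 counted 2 records, subtask 1 counted 1.
        List<List<Long>> oldStates = Arrays.asList(Arrays.asList(2L), Arrays.asList(1L));
        List<List<Long>> newStates = redistribute(oldStates, 3);
        for (int i = 0; i < newStates.size(); i++) {
            // Each subtask then receives one new record above the threshold.
            long counter = restoreCounter(newStates.get(i)) + 1;
            System.out.println("Result(" + i + "," + counter + ")");
        }
    }
}
```

Two of the three new subtasks receive one old counter each and continue from it, while the third receives nothing and starts from zero, which is the scenario described in 2.3.1.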

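3 Operator Union List State (CheckpointedFunction)

3.2 Union list state migration diagram

[Image: image-20200904151220088.png]

3.3 Verification with code

3.3.1 Scenario and implementation

1. With parallelism 2, send four records. The state of each operator instance holds two records.
2. After changing the parallelism to 3, send three records. The state of each operator instance then holds five records: four are the full union obtained from the migration when scaling out, and one is the record that just arrived.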

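package com.hollysys.flink.streaming.state.redistribution

import java.util

import org.apache.commons.collections.IteratorUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.scala._

/**
 * Demonstrates how union list state is migrated when the parallelism changes.
 *
 * @author shirukai
 */
object UnionListStateRedistributionExample {

  case class Device(id: String, value: Double)

  case class Result(taskId: Int, devices: List[Device])

  def main(args: Array[String]): Unit = {
    // 1. Create the local execution environment.
    // The original setup code is not shown; this minimal line is an assumption.
    // Restoring state across restarts also needs checkpoint/savepoint settings that are not shown here.
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    // 2. Read text from a socket
    val streamText: DataStream[String] = env.socketTextStream("127.0.0.1", 9000)
      .name("SocketSource")
      .uid("SocketSource")

    // 3. Convert each line into a Device case class
    val deviceStream = streamText.map(text => {
      val items = text.split(" ")
      Device(items(0), items(1).toDouble)
    }).setParallelism(1)
      .name("FormatDevice")
      .uid("FormatDevice")

    val resultStream = deviceStream
      // Distribute records evenly
      .rescale
      .map(new DeviceCollector)
      .setParallelism(3)
      .name("DeviceCollector")
      .uid("DeviceCollector")

    resultStream
      .print()
      .setParallelism(1)
      .name("Print")
      .uid("Print")

    env.execute("UnionListStateRedistributionExample")
  }

  class DeviceCollector extends RichMapFunction[Device, Result] with CheckpointedFunction {
    private var deviceCollectorState: ListState[Device] = _
    private var deviceCollectorCache: util.List[Device] = _

    override def map(value: Device): Result = {
      import scala.collection.JavaConverters._
      // Cache the record and emit everything this subtask has collected so far
      deviceCollectorCache.add(value)
      Result(getRuntimeContext.getIndexOfThisSubtask, deviceCollectorCache.asScala.toList)
    }

    /**
     * Called when a checkpoint snapshot is requested; stores the current state.
     *
     * @param context snapshot context
     */
    override def snapshotState(context: FunctionSnapshotContext): Unit = {
      // Clear the previous state
      deviceCollectorState.clear()
      // Flush the cache into the state
      deviceCollectorState.addAll(deviceCollectorCache)
    }

    /**
     * Called when a parallel instance is created; initializes the state.
     *
     * @param context initialization context
     */
    override def initializeState(context: FunctionInitializationContext): Unit = {
      val deviceCollectorStateDesc = new ListStateDescriptor[Device]("device-collector-state", classOf[Device])
      deviceCollectorState = context.getOperatorStateStore.getUnionListState(deviceCollectorStateDesc)
      if (context.isRestored) {
        // Load the restored state into the cache
        deviceCollectorCache = IteratorUtils.toList(deviceCollectorState.get().iterator()).asInstanceOf[util.List[Device]]
      } else {
        deviceCollectorCache = new util.ArrayList[Device]()
      }
    }
  }

}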



3.3.2 Preparing the data

device-1 7.0
device-1 8.0
device-1 9.0
device-1 10.0

device-1 11.0
device-1 12.0
device-1 13.0


3.3.3 Verification

1. Set the parallelism of the map operator to 2 and send the data below. The output matches the expectation.

a. Set the parallelism

val resultStream = deviceStream
  // Distribute records evenly
  .rescale
  .map(new DeviceCollector)
  .setParallelism(2)
  .name("DeviceCollector")
  .uid("DeviceCollector")


b. Data

| Input | Expected output at parallelism 2 |
| --- | --- |
| device-1 7.0 | Result(0,List(Device(device-1,7.0))) |
| device-1 8.0 | Result(1,List(Device(device-1,8.0))) |
| device-1 9.0 | Result(0,List(Device(device-1,7.0), Device(device-1,9.0))) |
| device-1 10.0 | Result(1,List(Device(device-1,8.0), Device(device-1,10.0))) |

c. Topology

d. Sample run

2. Set the parallelism of the map operator to 3 and send the data below. The output matches the expectation.

a. Set the parallelism

val resultStream = deviceStream
  // Distribute records evenly
  .rescale
  .map(new DeviceCollector)
  .setParallelism(3)
  .name("DeviceCollector")
  .uid("DeviceCollector")


b. Data

| Input | Expected output at parallelism 3 |
| --- | --- |
| device-1 11.0 | Result(0,List(Device(device-1,7.0), Device(device-1,9.0), Device(device-1,8.0), Device(device-1,10.0), Device(device-1,11.0))) |
| device-1 12.0 | Result(1,List(Device(device-1,7.0), Device(device-1,9.0), Device(device-1,8.0), Device(device-1,10.0), Device(device-1,12.0))) |
| device-1 13.0 | Result(2,List(Device(device-1,7.0), Device(device-1,9.0), Device(device-1,8.0), Device(device-1,10.0), Device(device-1,13.0))) |

c. Topology

d. Sample run
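
The difference from plain list state is that nothing is split: on restore, every subtask sees the concatenation of all old subtasks' lists. A minimal model of that behaviour, in plain Java without Flink, could look like the sketch below. The names UnionListStateSketch and redistributeUnion are invented for this illustration, and device values stand in for full Device records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of union list state: every new subtask receives the complete union. */
final class UnionListStateSketch {

    /** Concatenates all old subtask lists and gives each new subtask its own copy of the result. */
    static <T> List<List<T>> redistributeUnion(List<List<T>> oldStates, int newParallelism) {
        List<T> union = new ArrayList<>();
        for (List<T> subtaskState : oldStates) {
            union.addAll(subtaskState);
        }
        List<List<T>> newStates = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            newStates.add(new ArrayList<>(union));
        }
        return newStates;
    }

    public static void main(String[] args) {
        // After the parallelism-2 run: subtask 0 holds values 7.0 and 9.0, subtask 1 holds 8.0 and 10.0.
        List<List<Double>> oldStates = Arrays.asList(
                Arrays.asList(7.0, 9.0),
                Arrays.asList(8.0, 10.0));
        List<List<Double>> newStates = redistributeUnion(oldStates, 3);
        for (int i = 0; i < newStates.size(); i++) {
            System.out.println("subtask " + i + " restores " + newStates.get(i));
        }
    }
}
```

Each of the three subtasks starts with all four old records, so after one new record arrives its list holds five entries, as in the expected output at parallelism 3.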

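4 Broadcast State

4.3 Verification with code

4.3.1 Scenario and implementation

1. First set the parallelism of the process operator to 2 and send the prepared data. The output matches the expectation.
2. Then set the parallelism of the process operator to 4 and send the prepared data. The newly added tasks 3 and 4 copy the state of the original tasks 1 and 2 respectively, and the output matches the expectation.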

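package com.hollysys.flink.streaming.state.redistribution

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * Demonstrates how broadcast state is migrated when the parallelism changes.
 *
 * @author shirukai
 */
// The object declaration is not shown in the original; the name below is an
// assumption that follows the naming of the earlier examples.
object BroadcastStateRedistributionExample {

  case class Device(id: String, value: Double)

  case class Rule(id: String, rule: String, var taskId: Int)

  case class Result(taskId: Int, device: Device, rules: List[Rule])

  private val stateDescriptor = new MapStateDescriptor("rule-state",
    createTypeInformation[String],
    createTypeInformation[Rule])

  def main(args: Array[String]): Unit = {
    // 1. Create the local execution environment.
    // The original setup code is not shown; this minimal line is an assumption.
    // Restoring state across restarts also needs checkpoint/savepoint settings that are not shown here.
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    // 2. Read text from the sockets
    val deviceStreamText: DataStream[String] = env.socketTextStream("127.0.0.1", 9000)
      .name("DeviceSocketSource")
      .uid("DeviceSocketSource")

    val ruleStreamText: DataStream[String] = env.socketTextStream("127.0.0.1", 9001)
      .name("RuleSocketSource")
      .uid("RuleSocketSource")

    // 3. Convert each line into a case class
    val deviceStream = deviceStreamText.map(text => {
      val items = text.split(" ")
      Device(items(0), items(1).toDouble)
    }).setParallelism(1)
      .name("FormatDevice")
      .uid("FormatDevice")

    val ruleStream = ruleStreamText.map(text => {
      val items = text.split(" ")
      Rule(items(0), items(1), -1)
    }).setParallelism(1)
      .name("FormatRule")
      .uid("FormatRule")

    // 4. Bind rules to devices.
    // The broadcast/connect step and env.execute are not shown in the original
    // and are reconstructed here as an assumption.
    val resultStream = deviceStream
      .rescale
      .connect(ruleStream.broadcast(stateDescriptor))
      .process(new RuleBinding)
      .setParallelism(2)
      .name("RuleBinding")
      .uid("RuleBinding")

    resultStream
      .print()
      .setParallelism(1)
      .name("Print")
      .uid("Print")

    env.execute("BroadcastStateRedistributionExample")
  }

  class RuleBinding extends BroadcastProcessFunction[Device, Rule, Result] {
    // The context parameters and state lookups below are reconstructed; the original lines are incomplete.
    override def processElement(value: Device,
                                ctx: BroadcastProcessFunction[Device, Rule, Result]#ReadOnlyContext,
                                out: Collector[Result]): Unit = {
      import scala.collection.JavaConverters._
      val taskId = getRuntimeContext.getIndexOfThisSubtask
      val state = ctx.getBroadcastState(stateDescriptor)
      val result = Result(taskId, value, state.immutableEntries().asScala.map(_.getValue).toList)
      out.collect(result)
    }

    override def processBroadcastElement(value: Rule,
                                         ctx: BroadcastProcessFunction[Device, Rule, Result]#Context,
                                         out: Collector[Result]): Unit = {
      val state = ctx.getBroadcastState(stateDescriptor)
      // Record which subtask stored the rule, so the origin of restored state stays visible
      value.taskId = getRuntimeContext.getIndexOfThisSubtask
      state.put(value.id, value)
    }
  }

}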



4.3.2 Preparing the data

# device stream
device-1 8.0
# rule stream
id-1 rule-1


4.3.3 Verification

1. Set the parallelism of the process operator to 2 and send the data below. The output matches the expectation.

a. Set the parallelism

val resultStream = deviceStream
  .rescale
  .process(new RuleBinding)
  .setParallelism(2)
  .name("RuleBinding")
  .uid("RuleBinding")


b. Data

First, send the following to the rule socket:

id-1 rule-1


Then send the following to the device socket:

| Input | Expected output at parallelism 2 |
| --- | --- |
| device-1 8.0 | Result(0,Device(device-1,8.0),List(Rule(id-1,rule-1,0))) |
| device-1 8.0 | Result(1,Device(device-1,8.0),List(Rule(id-1,rule-1,1))) |

c. Topology

d. Sample run

2. Set the parallelism of the process operator to 4 and send the data below. The output matches the expectation.

a. Set the parallelism

val resultStream = deviceStream
  .rescale
  .process(new RuleBinding)
  .setParallelism(4)
  .name("RuleBinding")
  .uid("RuleBinding")


b. Data

| Input | Expected output at parallelism 4 |
| --- | --- |
| device-1 8.0 | Result(0,Device(device-1,8.0),List(Rule(id-1,rule-1,0))) |
| device-1 8.0 | Result(1,Device(device-1,8.0),List(Rule(id-1,rule-1,1))) |
| device-1 8.0 | Result(2,Device(device-1,8.0),List(Rule(id-1,rule-1,0))) |
| device-1 8.0 | Result(3,Device(device-1,8.0),List(Rule(id-1,rule-1,1))) |

c. Topology

d. Sample run
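
The expected output suggests a simple rule for scaling out broadcast state: each new subtask starts from a copy of one old subtask's map. The sketch below models that in plain Java with no Flink dependency. Choosing the source as index modulo the old parallelism is an assumption derived from the output table above, not a statement about Flink internals, and the names BroadcastStateSketch and rescale are invented for this illustration. Rules are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of broadcast state on rescale: new subtasks copy an existing subtask's map. */
final class BroadcastStateSketch {

    /** New subtask i starts from a copy of old subtask (i % oldParallelism); the modulo choice is assumed. */
    static <K, V> List<Map<K, V>> rescale(List<Map<K, V>> oldStates, int newParallelism) {
        List<Map<K, V>> newStates = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            Map<K, V> source = oldStates.get(i % oldStates.size());
            newStates.add(new HashMap<>(source));
        }
        return newStates;
    }

    public static void main(String[] args) {
        // After the parallelism-2 run each subtask stored the rule tagged with its own index.
        Map<String, String> task0 = new HashMap<>();
        task0.put("id-1", "Rule(id-1,rule-1,0)");
        Map<String, String> task1 = new HashMap<>();
        task1.put("id-1", "Rule(id-1,rule-1,1)");

        List<Map<String, String>> newStates = rescale(Arrays.asList(task0, task1), 4);
        for (int i = 0; i < newStates.size(); i++) {
            System.out.println("subtask " + i + " restores " + newStates.get(i));
        }
    }
}
```

In this model subtasks 2 and 3 restore the rule tagged with task ids 0 and 1, matching the last two rows of the expected output.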
