Flink Notes 1

Article link: https://blog.csdn.net/longlovefilm/article/details/110123445

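1 Going from the API to the logical operators (Transformation) and then to the physical operators (Operator) produces the StreamGraph. In the next step, Flink generates the JobGraph from the StreamGraph and its StreamOperators.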

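2 The JobGraph is the only data structure describing a job that Flink's dataflow engine recognizes, and this shared abstraction is what unifies stream and batch processing at runtime. At this point the conversion from user business code to the Flink runtime is complete.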

3 Does the -yn count have no effect in per-job mode?

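4 process() must come after keyBy() whenever the function uses keyed state or timers. The reasons:
A process function used this way has to run on a KeyedStream, for two reasons.
First, the StreamingRuntimeContext returned by getRuntimeContext only exposes a KeyedStateStore, so only keyed state can be accessed.
Second, registering a timer needs three dimensions (namespace, key, time), so a key is required. This is why a ProcessFunction can register timers only on a KeyedStream.
Flink 1.8.0 offers both ProcessFunction and KeyedProcessFunction as user-facing APIs, but a ProcessFunction applied to a non-keyed stream cannot register timers: in the ProcessOperator source, the registration methods throw an exception.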
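
To make the (key, time) part of the timer argument concrete, here is a plain-Java model. It is only a sketch and not Flink's timer service: the class name KeyedTimerModel and its methods are made up for illustration, and the namespace dimension is left out. Timers are stored per (key, timestamp), duplicates collapse into one, and everything with a timestamp at or below the watermark fires in timestamp order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model, not a Flink class.
public class KeyedTimerModel {
    // timestamp -> keys that registered a timer for that timestamp
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();

    /** Registers a timer for (key, timestamp); registering the same pair twice is a no-op. */
    public void register(String key, long timestamp) {
        timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(key);
    }

    /** Fires every timer whose timestamp is <= watermark, in timestamp order. */
    public List<String> advanceWatermark(long watermark) {
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            long ts = timers.firstKey();
            for (String key : timers.pollFirstEntry().getValue()) {
                fired.add(key + "@" + ts);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        KeyedTimerModel model = new KeyedTimerModel();
        model.register("a", 10);
        model.register("a", 10); // duplicate of the same (key, time) pair
        model.register("b", 5);
        model.register("a", 20);
        System.out.println(model.advanceWatermark(10)); // [b@5, a@10]
        System.out.println(model.advanceWatermark(30)); // [a@20]
    }
}
```

Without a key there would be nothing to tell the timer of "a" from the timer of "b", which is the intuition behind the keyed-stream requirement.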

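5 keyBy().flatMap() vs. keyBy().process()
A flatMap can be replaced by process: the same logic can go directly into processElement.

Beyond that, process can also register timers and implement the timed logic in the onTimer method.

The code below shows where the two overlap in functionality.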

package state;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedStateDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<Tuple2<Long, Long>> input = env.fromElements(
                Tuple2.of(1L, 4L),
                Tuple2.of(1L, 2L),
                Tuple2.of(1L, 6L),
                Tuple2.of(2L, 4L),
                Tuple2.of(2L, 4L),
                Tuple2.of(3L, 5L),
                Tuple2.of(3L, 5L),
                Tuple2.of(3L, 5L),
                Tuple2.of(2L, 3L),
                Tuple2.of(1L, 4L)
        );
        // Both variants compute the same result; swap the comment to run the flatMap version.
        //input.keyBy(0).flatMap(new KeyedStateAgvFlatMap()).setParallelism(1).print();
        input.keyBy(0).process(new ProcessAgvFlatMap()).setParallelism(1).print();
        env.execute();
    }

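    // process() variant: emits the average of every three values per key.
    public static class ProcessAgvFlatMap extends ProcessFunction<Tuple2<Long,Long>,Tuple2<Long,Long>>{
        // Keyed state: f0 = number of values seen so far, f1 = their sum.
        private ValueState<Tuple2<Long, Long>> valueState;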

        @Override
        public void processElement(Tuple2<Long,Long> value, Context ctx, Collector<Tuple2<Long, Long>> out) throws Exception {
            Tuple2<Long, Long> currentValue = valueState.value();
            if(currentValue==null){
                currentValue= Tuple2.of(0L,0L);
            }
            currentValue.f0+=1;
            currentValue.f1+=value.f1;
            valueState.update(currentValue);
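            // After three values for this key, emit (key, average) and reset the state.
            if(currentValue.f0>=3){
                // Long division: the average is truncated.
                out.collect(Tuple2.of(value.f0, currentValue.f1/currentValue.f0));
                valueState.clear();
            }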
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
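            // State expires 30 seconds after it is created or last written; expired values are never returned.
            StateTtlConfig config = StateTtlConfig.newBuilder(Time.seconds(30))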
                    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .build();
            ValueStateDescriptor<Tuple2<Long, Long>> valueStateDescriptor = new ValueStateDescriptor<>("agvKeyedState",
                    TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {
                    }));
            valueStateDescriptor.enableTimeToLive(config);
            this.valueState=getRuntimeContext().getState(valueStateDescriptor);
        }

    }

    // flatMap() variant: same logic and same state layout as ProcessAgvFlatMap.
    public static class KeyedStateAgvFlatMap extends RichFlatMapFunction<Tuple2<Long,Long>,Tuple2<Long,Long>>{
        private ValueState<Tuple2<Long, Long>> valueState;
        @Override
        public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out)
                throws Exception {
            Tuple2<Long, Long> currentValue = valueState.value();
            if(currentValue==null){
                currentValue= Tuple2.of(0L,0L);
            }
            currentValue.f0+=1;
            currentValue.f1+=input.f1;
            valueState.update(currentValue);
            if(currentValue.f0>=3){
                out.collect(Tuple2.of(input.f0, currentValue.f1/currentValue.f0));
                valueState.clear();
            }
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            StateTtlConfig config = StateTtlConfig.newBuilder(Time.seconds(30))
                    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .build();
            ValueStateDescriptor<Tuple2<Long, Long>> valueStateDescriptor = new ValueStateDescriptor<>("agvKeyedState",
                    TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {
                    }));
            valueStateDescriptor.enableTimeToLive(config);
            this.valueState=getRuntimeContext().getState(valueStateDescriptor);
        }
    }

}
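
To see what the demo computes without starting a Flink job, the per-key logic can be replayed in plain Java. This is a sketch under assumptions: KeyedAverageModel is a made-up name, a HashMap stands in for the keyed ValueState, and the TTL is ignored. It feeds in the same ten elements as the demo, in order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the demo's logic, not Flink code.
public class KeyedAverageModel {
    // Stand-in for keyed ValueState: key -> {count, sum}
    private final Map<Long, long[]> state = new HashMap<>();

    /** Handles one (key, value); returns {key, average} on every third value of a key, otherwise null. */
    public long[] process(long key, long value) {
        long[] acc = state.computeIfAbsent(key, k -> new long[] {0L, 0L});
        acc[0] += 1;
        acc[1] += value;
        if (acc[0] >= 3) {
            long[] result = {key, acc[1] / acc[0]}; // long division, as in the demo
            state.remove(key);                      // mirrors valueState.clear()
            return result;
        }
        return null;
    }

    /** Replays a list of (key, value) pairs and collects the emitted results as "(key,avg)". */
    public static List<String> run(long[][] input) {
        KeyedAverageModel model = new KeyedAverageModel();
        List<String> out = new ArrayList<>();
        for (long[] element : input) {
            long[] result = model.process(element[0], element[1]);
            if (result != null) {
                out.add("(" + result[0] + "," + result[1] + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] input = {
            {1, 4}, {1, 2}, {1, 6}, {2, 4}, {2, 4},
            {3, 5}, {3, 5}, {3, 5}, {2, 3}, {1, 4}
        };
        System.out.println(run(input)); // [(1,4), (3,5), (2,3)]
    }
}
```

The last element (1, 4) produces no output because key 1 was reset after its third value and has only one value in the new round.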

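6 Next, we distinguish the two kinds of state, operator state and keyed state, along four dimensions.

6.1. Whether there is a current key: operator state has no notion of a current key, while a keyed state value always belongs to a current key.
6.2. Whether the state is stored on heap: at present the operator state backend has only an on-heap implementation, while keyed state backends come in on-heap and off-heap (RocksDB) implementations.
6.3. Whether snapshot and restore methods must be declared by hand: operator state requires the user to implement snapshot and restore, while keyed state is handled by the backend and is transparent to the user.
6.4. Data size: operator state is generally assumed to be fairly small and keyed state fairly large. Note that this is a rule of thumb, not a hard criterion for telling them apart.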

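7 As of Flink 1.10, -yn is no longer needed; the number of TaskManagers is determined automatically. Slots per TaskManager * number of TaskManagers = the maximum parallelism Flink can run.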

flink run -m yarn-cluster -c com.xxx.WordCount ./xxxx.jar

-yn,--container <arg>: the number of containers to allocate, i.e. the number of TaskManagers.

-d,--detached: run in the background.

-yjm,--jobManagerMemory <arg>: JobManager memory, in MB.

-ytm,--taskManagerMemory <arg>: memory per TaskManager, in MB.

-ynm,--name: the name of this Flink application on YARN.

-yq,--query: show the resources available in YARN (memory, CPU cores).

-yqu,--queue <arg>: the YARN resource queue to use.

-ys,--slots <arg>: the number of slots per TaskManager.

-yz,--zookeeperNamespace <arg>: the namespace to create in ZooKeeper for HA mode.

-yid,--applicationID <yarnAppId>: the YARN application ID; attaches to a YARN session that is already running in the background.
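
The slot arithmetic from this note, written out as a small Java sketch. SlotMath and its method names are made up for illustration, and the second method assumes the default behaviour where all operators of a job share slots, so the job needs as many slots as its highest operator parallelism.

```java
// Hypothetical helper for the slot arithmetic, not part of Flink.
public class SlotMath {
    /** Maximum parallelism the cluster can host: slots per TaskManager times number of TaskManagers. */
    public static int maxParallelism(int slotsPerTaskManager, int taskManagers) {
        return slotsPerTaskManager * taskManagers;
    }

    /** TaskManagers needed for a given parallelism: ceil(parallelism / slotsPerTaskManager). */
    public static int taskManagersNeeded(int parallelism, int slotsPerTaskManager) {
        return (parallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        System.out.println(maxParallelism(2, 3));      // 6
        System.out.println(taskManagersNeeded(5, 2));  // 3
    }
}
```

So a job with parallelism 5 submitted with -ys 2 would need three TaskManagers, which is the number that no longer has to be passed by hand.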

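8 Flink's watermark handling and propagation algorithm ensures that operator tasks emit records and watermarks with properly aligned timestamps. However, it relies on every partition continuously supplying increasing watermarks. Once a partition's watermark stops advancing, or the partition goes completely idle (sending neither records nor watermarks), the task's event-time clock stops advancing and its timers no longer fire. This is a problem for time-based operators that depend on an advancing clock to compute results and clean up state. The end result is processing delay and rapidly growing state whenever new watermarks do not arrive regularly from all input tasks.

A large gap between the watermarks of two input streams has a similar effect. In a task with two input streams, the event-time clock follows the slower stream, and records or intermediate results from the faster stream are usually buffered in state until the event-time clock allows them to be processed.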
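
The stalling behaviour can be reproduced with a small plain-Java model. This is a sketch, not Flink's implementation: MinWatermarkModel is a made-up class that keeps the latest watermark per input and sets the task's event-time clock to the minimum over all inputs.

```java
import java.util.Arrays;

// Hypothetical model of watermark propagation in a multi-input task.
public class MinWatermarkModel {
    private final long[] inputWatermarks;
    private long currentWatermark = Long.MIN_VALUE;

    public MinWatermarkModel(int numInputs) {
        inputWatermarks = new long[numInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input; returns true if the task's event-time clock advanced. */
    public boolean onWatermark(int input, long watermark) {
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        if (min > currentWatermark) {
            currentWatermark = min;
            return true;
        }
        return false;
    }

    public long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        MinWatermarkModel task = new MinWatermarkModel(2);
        System.out.println(task.onWatermark(0, 100)); // false: input 1 has sent nothing yet
        System.out.println(task.onWatermark(1, 50));  // true: clock moves to 50
        System.out.println(task.onWatermark(0, 200)); // false: clock is held back by input 1
        System.out.println(task.currentWatermark());  // 50
    }
}
```

However far input 0 runs ahead, the clock stays at 50 until input 1 sends a higher watermark, which is exactly the idle-partition and skewed-stream situation described above.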

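9 User-defined timestamp assignment functions should generally be applied as close to the source operator as possible, because once records have passed through an operator it becomes very hard to reason about their original order. This is also why overriding timestamps and watermarks in the middle of a streaming program should be avoided, even though a user-defined function makes it possible.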

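10 Every record produced by a Flink event-time streaming application must carry a timestamp. The timestamp associates the record with a specific point in time, usually the moment the event behind the record occurred. The application is free to choose what the timestamp means, as long as the timestamps of the records increase as the stream advances.

When Flink processes a stream in event-time mode, it evaluates time-based operators using the records' timestamps. For example, a time-window operator assigns records to windows according to their timestamps. Flink encodes a timestamp as an 8-byte Long value and attaches it to the record as metadata. Its built-in operators interpret this Long as a Unix timestamp with millisecond precision, i.e. the number of milliseconds elapsed since 1970-01-01-00:00:00.000. User-defined operators may apply their own interpretation, for example microsecond precision.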
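
As a small illustration of both points, the sketch below interprets a long as epoch milliseconds and assigns it to a tumbling window. It is plain Java rather than Flink code: TimestampWindowModel and windowStart are made-up names, and the window is assumed to have no offset.

```java
import java.time.Instant;

// Hypothetical model of timestamp interpretation and tumbling-window assignment.
public class TimestampWindowModel {
    /** Start of the tumbling window (size in ms, no offset) that contains the timestamp. */
    public static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    public static void main(String[] args) {
        long ts = 125_000L; // 125 seconds after the epoch
        System.out.println(Instant.ofEpochMilli(ts));  // 1970-01-01T00:02:05Z
        System.out.println(windowStart(ts, 60_000L));  // 120000, i.e. the window [120000, 180000)
    }
}
```

A user-defined operator that treats the same long as microseconds would simply read 125_000 as 0.125 seconds; the value stored on the record does not change, only its interpretation.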

11 The wrapper class around every record on a stream carries the event-time timestamp as a field

public final class StreamRecord<T> extends StreamElement {

	/** The actual value held by this record. */
	private T value;

	/** The timestamp of the record. */
	private long timestamp;

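12 The State Processor API cannot process window state.
13 At present, a dimension-table JOIN in Flink SQL can only join against the current snapshot of the dimension table (processing-time semantics); it cannot join against the dimension-table snapshot that corresponds to the fact table's rowtime (event-time semantics).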

14
At present, State TTL only works in processing-time mode. According to discussions with the developers, Flink will also support State TTL for event time in the near future.

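15
Flink watermarks and sideOutputLateData: late data is emitted through an OutputTag. A record ends up in the OutputTag only when the watermark has already passed the end of the record's window plus the allowed lateness, that is, when window end time + allowed lateness <= current watermark.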
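
The lateness condition can be written down as a one-line check. The sketch below is a plain-Java model, not Flink's window operator: LateDataModel and isLate are made-up names, and it assumes the usual convention that a window [start, end) has end - 1 as its last timestamp.

```java
// Hypothetical model of the late-data check for an event-time window.
public class LateDataModel {
    /** True when the window is past its cleanup time, so the record would go to the late-data side output. */
    public static boolean isLate(long windowEnd, long allowedLateness, long currentWatermark) {
        long maxTimestamp = windowEnd - 1; // last timestamp that still belongs to the window
        return maxTimestamp + allowedLateness <= currentWatermark;
    }

    public static void main(String[] args) {
        // Window [0, 10000) with 2000 ms of allowed lateness.
        System.out.println(isLate(10_000L, 2_000L, 9_999L));  // false: window has only just closed
        System.out.println(isLate(10_000L, 2_000L, 11_998L)); // false: still within allowed lateness
        System.out.println(isLate(10_000L, 2_000L, 11_999L)); // true: goes to the side output
    }
}
```

A record that arrives after the watermark passed the window end but before the allowed lateness runs out is not sent to the side output; it is still added to the window.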

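16
Flink + Kafka
When Kafka distributes records to partitions by hash, skewed data can make the distribution uneven, and some partitions may receive no data at all. The watermark downstream then stops advancing.
One workaround is to periodically send artificial records to the Kafka partitions so that the watermark keeps rising.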
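
One way to decide where such artificial records are needed is sketched below in plain Java. Everything here is an assumption made for illustration: HeartbeatPlanner is a made-up class, the last-record times per partition are assumed to be tracked by the producer side, and actually sending the records to Kafka is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: picks the partitions that have been silent for too long.
public class HeartbeatPlanner {
    /** Returns the partitions whose last record is older than maxIdleMillis and so need a synthetic record. */
    public static List<Integer> partitionsNeedingHeartbeat(
            long[] lastRecordTimeMillis, long nowMillis, long maxIdleMillis) {
        List<Integer> result = new ArrayList<>();
        for (int partition = 0; partition < lastRecordTimeMillis.length; partition++) {
            if (nowMillis - lastRecordTimeMillis[partition] > maxIdleMillis) {
                result.add(partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lastRecordTime = {9_500L, 2_000L, 9_900L}; // partition 1 has been idle
        System.out.println(partitionsNeedingHeartbeat(lastRecordTime, 10_000L, 5_000L)); // [1]
    }
}
```

The synthetic records need a timestamp close to the current event time and must be recognizable downstream so that the job can filter them out after they have advanced the watermark.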