Deduplication in Flink

Preface

Flink jobs often have to handle duplicate records. Based on hands-on experience, this post summarizes three ways to deduplicate.

1. Approach 1: Convert the stream to a table

Core code logic:

// compute the IOPV
SingleOutputStreamOperator<FundIopvIndicators> streamOperator = EtfIopvFunction.calculateRealTimeIopv(stringKeyedStream);

// convert the stream to a table
Table table = tableEnv.fromDataStream(streamOperator, "fundCode,realTimeIopv,computationTime,strTime");
// deduplicate: keep exactly one row per strTime (concatenating the Table object into the
// SQL string works because Table.toString() registers it in the catalog under a generated name)
Table duplicateRemoval = tableEnv.sqlQuery(" select strTime,fundCode,realTimeIopv,computationTime from ( "
        + " select strTime,fundCode,realTimeIopv,computationTime, ROW_NUMBER() OVER "
        + " (PARTITION BY strTime ORDER BY strTime desc) AS rownum from "
        + table
        + ") where rownum=1"
);
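
The snippet above depends on project classes (EtfIopvFunction, FundIopvIndicators, the tableEnv setup) that are not shown in this post. For reference, here is a minimal self-contained sketch of the same stream-to-table ROW_NUMBER pattern, using plain tuples and the same older string-based fromDataStream overload (deprecated in recent Flink releases); the class name and sample values are made up:

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class StreamToTableDedup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // two records share the same strTime on purpose
        DataStream<Tuple3<String, String, Double>> stream = env.fromElements(
                Tuple3.of("09:30:00", "510300", 4.012),
                Tuple3.of("09:30:00", "510300", 4.013), // duplicate strTime, dropped by the query
                Tuple3.of("09:30:01", "510300", 4.015));

        Table table = tableEnv.fromDataStream(stream, "strTime, fundCode, realTimeIopv");

        // keep exactly one row per strTime
        Table dedup = tableEnv.sqlQuery("SELECT strTime, fundCode, realTimeIopv FROM ("
                + " SELECT strTime, fundCode, realTimeIopv, ROW_NUMBER() OVER"
                + " (PARTITION BY strTime ORDER BY strTime DESC) AS rownum FROM "
                + table
                + ") WHERE rownum = 1");

        // the dedup query produces updates, so print it as a retract stream
        tableEnv.toRetractStream(dedup, Row.class).print();
        env.execute("stream-to-table dedup");
    }
}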
2. Approach 2: Flink SQL

-- dedup query
-- kafka source
CREATE TABLE user_log (
  user_id VARCHAR
  ,item_id VARCHAR
  ,category_id VARCHAR
  ,behavior INT
  ,ts TIMESTAMP(3)
  ,process_time as proctime() -- processing-time attribute; the dedup query below orders by it
  ,WATERMARK FOR ts AS ts -- event-time attribute; ordering by ts instead gives event-time dedup
) WITH (
  'connector' = 'kafka'
  ,'topic' = 'user_behavior'
  ,'properties.bootstrap.servers' = 'localhost:9092'
  ,'properties.group.id' = 'user_log'
  ,'scan.startup.mode' = 'group-offsets'
  ,'format' = 'json'
);

-- sink table
CREATE TABLE user_log_sink (
  user_id VARCHAR
  ,item_id VARCHAR
  ,category_id VARCHAR
  ,behavior INT
  ,ts TIMESTAMP(3)
  ,num BIGINT
  ,primary key (user_id) not enforced
) WITH (
  'connector' = 'upsert-kafka'
  ,'topic' = 'user_behavior_sink'
  ,'properties.bootstrap.servers' = 'localhost:9092'
  ,'properties.group.id' = 'user_log'
  ,'key.format' = 'json'
  ,'key.json.ignore-parse-errors' = 'true'
  ,'value.format' = 'json'
  ,'value.json.fail-on-missing-field' = 'false'
  ,'value.fields-include' = 'ALL'
);

-- insert
INSERT INTO user_log_sink(user_id, item_id, category_id, behavior, ts, num)
SELECT user_id, item_id, category_id, behavior, ts, rownum
FROM (
   SELECT user_id, item_id, category_id, behavior, ts,
     ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY process_time desc) AS rownum -- desc keeps the latest row per key
   FROM user_log)
WHERE rownum = 1
-- Only rownum = 1 deduplicates. With rownum = 2 (or rownum < 10) the query becomes a Top-N:
-- each partition emits only the row(s) at that rank (several rows in the < case), which can
-- look as if the data were deduplicated globally.
-- Note: the dedup key here (category_id) differs from the sink's primary key (user_id);
-- aligning the two avoids surprising upsert semantics.
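
To try this pattern without a Kafka cluster, the same statements can be submitted from Java against stand-in connectors. A minimal sketch, assuming Flink's built-in datagen source and print sink (the table layout mirrors the DDL above, simplified and bounded; DedupSqlDemo is a made-up class name):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DedupSqlDemo {
    public static void main(String[] args) throws Exception {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // bounded datagen source standing in for the Kafka user_log table
        tableEnv.executeSql("CREATE TABLE user_log ("
                + "  user_id STRING, item_id STRING, category_id STRING, behavior INT,"
                + "  process_time AS PROCTIME()"
                + ") WITH ("
                + "  'connector' = 'datagen',"
                + "  'number-of-rows' = '100',"
                + "  'fields.category_id.length' = '1'" // single-character keys so duplicates occur
                + ")");

        // print sink standing in for upsert-kafka; it accepts the update stream the dedup produces
        tableEnv.executeSql("CREATE TABLE user_log_sink ("
                + "  user_id STRING, item_id STRING, category_id STRING, behavior INT, num BIGINT"
                + ") WITH ('connector' = 'print')");

        // same ROW_NUMBER pattern as above: keep the latest row per category_id
        tableEnv.executeSql("INSERT INTO user_log_sink"
                + " SELECT user_id, item_id, category_id, behavior, rownum FROM ("
                + "   SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY process_time DESC) AS rownum"
                + "   FROM user_log)"
                + " WHERE rownum = 1").await();
    }
}

Running it prints an update stream (+I/-U/+U rows) in which each category_id key is repeatedly overwritten by its latest record.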
3. Approach 3: Flink's MapState

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class IopvDeduplicateProcessFunction extends RichFlatMapFunction<FundIopvIndicators, FundIopvIndicators> {
    private MapState<String, FundIopvIndicators> mapState;

    /** initialize the state */
    @Override
    public void open(Configuration parameters) throws Exception {
        MapStateDescriptor<String, FundIopvIndicators> descriptor =
                new MapStateDescriptor<>("MapDescriptor", String.class, FundIopvIndicators.class);
        mapState = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(FundIopvIndicators iopvIndicators, Collector<FundIopvIndicators> collector) throws Exception {
        String strTime = iopvIndicators.getStrTime();
        // deduplicate: emit only the first record seen for this strTime
        if (mapState.get(strTime) == null) {
            mapState.put(strTime, iopvIndicators);
            collector.collect(iopvIndicators);
        }
    }
}
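
Wiring the function into a job looks like the sketch below. MapState is keyed state, so the stream must be keyed first; keying by fund code is an assumption based on the field names. Adding a TTL to the descriptor in open() keeps the per-key map from growing without bound (StateTtlConfig and Time are from org.apache.flink.api.common):

// in the job: key by fund code, then deduplicate per strTime
SingleOutputStreamOperator<FundIopvIndicators> deduped = iopvStream // iopvStream: the upstream indicator stream (assumed)
        .keyBy(FundIopvIndicators::getFundCode) // assumes a getFundCode() accessor
        .flatMap(new IopvDeduplicateProcessFunction());

// in open(): optional TTL so old strTime entries expire (one-day retention is an assumption)
MapStateDescriptor<String, FundIopvIndicators> descriptor =
        new MapStateDescriptor<>("MapDescriptor", String.class, FundIopvIndicators.class);
descriptor.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(1)).build());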
Flink can also maintain a full-history distinct count with keyed state. (Flink has no dedicated set state; a MapState whose keys are the seen elements serves as one.) The steps:

1. Define a MapState to record the elements already seen, plus a ValueState counter for the running distinct count:

```
MapState<String, Long> countState = getRuntimeContext().getMapState(
        new MapStateDescriptor<>("countState", String.class, Long.class));
ValueState<Long> distinctCount = getRuntimeContext().getState(
        new ValueStateDescriptor<>("distinctCount", Long.class));
```

2. In the processElement method of a KeyedProcessFunction, check whether the current element already exists in the state; if not, add it, increment the counter, and emit the new count:

```
@Override
public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Long> out) throws Exception {
    String element = value.f1.toString();
    // only count elements that have not been seen before on this key
    if (!countState.contains(element)) {
        countState.put(element, 1L);
        Long current = distinctCount.value();
        long updated = (current == null ? 0L : current) + 1;
        distinctCount.update(updated);
        out.collect(updated);
    }
}
```

3. Set a state backend in the job and run it:

```
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new RocksDBStateBackend("hdfs://localhost:9000/flink/checkpoints"));

DataStream<Tuple2<String, Long>> input = env.fromElements(
        Tuple2.of("key", 1L), Tuple2.of("key", 1L),
        Tuple2.of("key", 2L), Tuple2.of("key", 3L),
        Tuple2.of("key", 2L), Tuple2.of("key", 4L),
        Tuple2.of("key", 5L), Tuple2.of("key", 3L));

input.keyBy(t -> t.f0)
        .process(new CountDistinct())
        .print();

env.execute();
```

This yields a full-history distinct count: for the sample input the job prints 1, 2, 3, 4, 5, because the repeated 1L, 2L, and 3L elements are never counted twice.