Flink Real-Time Project

CODE20220318

Category: Flink    Tags: flink, hbase, big data

Article link: https://blog.csdn.net/m0_50689246/article/details/126363659

Real-time statistical metrics.

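Window processing

Window processing is built from several elements, among them the window function. The common combinations are:

1. Without keyBy: use windowAll with a window process function, which handles the data of all keys in each window.

2. With keyBy followed by window: use aggregate to aggregate the data belonging to each unique (key, window) pair.

3. With keyBy followed by window: use a window process function (or a full-window function) to aggregate the data of each unique (key, window) pair.

Function types of equal standing: (1) incremental aggregation functions (reduce, aggregate), (2) full-window functions, (3) window process functions.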
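The practical difference between incremental aggregation and full-window processing is how much state a window holds before it fires. The following is a minimal plain-Java sketch of that difference, not Flink code; the class and method names (WindowFunctionStyles, IncrementalCount, BufferedCount, stateSize) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFunctionStyles {

    /** Incremental aggregation: the window state is a single running count. */
    public static final class IncrementalCount {
        private long count;

        public void add(int productId) { count++; }   // runs once per arriving record
        public long result() { return count; }        // runs when the window fires
        public int stateSize() { return 1; }          // one accumulator, whatever the input size
    }

    /** Full-window processing: every record is buffered until the window fires. */
    public static final class BufferedCount {
        private final List<Integer> buffer = new ArrayList<>();

        public void add(int productId) { buffer.add(productId); }
        public long result() { return buffer.size(); } // computed over the whole buffer at fire time
        public int stateSize() { return buffer.size(); }
    }

    public static void main(String[] args) {
        IncrementalCount incremental = new IncrementalCount();
        BufferedCount buffered = new BufferedCount();
        for (int productId : new int[] {101, 102, 101, 103}) {
            incremental.add(productId);
            buffered.add(productId);
        }
        System.out.println(incremental.result() + " records, state size " + incremental.stateSize());
        System.out.println(buffered.result() + " records, state size " + buffered.stateSize());
    }
}
```

Both styles return the same count; the buffered variant pays for it with state that grows with the window, which is the price of having every record (and the window metadata) available at fire time.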

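Log structure

Login logs
Login-failure logs

# Page data, event data, startup data, and error data

Of the event, page, exposure, error, and startup modules, this Flink project needs events (for collaborative filtering), pages (for UV/PV analysis), exposures (for exposure counts), and startup logs (for daily-active-user records).

In addition, business logs of product transactions are needed to compute the Top N products and first-order users.

Some data can come from either the business database or the behavior logs. Adding an item to the cart or favoriting it, for example, can be recorded by front-end logging or read from the business database.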

Top N products

Every 10 s, compute the product popularity ranking over the last minute and output the current Top N products.
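With a 60 s window that slides every 10 s, each record belongs to six overlapping windows. The sketch below shows how the window start times for one event timestamp can be derived; it is plain Java written for illustration (the class and method names are hypothetical), it assumes non-negative timestamps and a zero window offset, and it is not Flink's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    /** Returns the start timestamps (ms) of every sliding window that contains the event. */
    public static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // latest window start that is <= the event timestamp
        long lastStart = timestampMs - (timestampMs % slideMs);
        // walk backwards one slide at a time while the window [start, start + size) still covers the event
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // event at t = 65 s, 60 s windows sliding every 10 s
        System.out.println(windowStarts(65_000L, 60_000L, 10_000L));
    }
}
```

Because every record is copied into size/slide windows, a smaller slide gives fresher rankings at the cost of proportionally more window state and computation.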

windowAll

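No keyBy

# The original idea: skip keyBy and run everything on a single partition, maintaining a HashMap whose key is the URL (the product ID in the code below) and whose value is its popularity, measured for example by access count.
# Here windowAll opens the window directly, and a window process function performs the full computation inside each window.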
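// Flink and JDK imports used below; LogEntity, TopProductEntity and ClickSource are project classes.
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class ProcessAllWindowTopN {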
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
        SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                           .withTimestampAssigner(new SerializableTimestampAssigner<LogEntity>() {
                               @Override
                               public long extractTimestamp(LogEntity element, long recordTimestamp) {
                                   return element.getTime();
                               }
                           })
                );

//        SingleOutputStreamOperator<Integer> result=sourceStream.map(new MapFunction<LogEntity, Integer>() {
//            @Override
//            public Integer map(LogEntity value) throws Exception {
//                return value.getProductId();
//            }
//        });
        SingleOutputStreamOperator<TopProductEntity> result=sourceStream.map(logEntity->logEntity.getProductId())
                .windowAll(SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
                .process(new ProcessAllWindowFunction<Integer, TopProductEntity, TimeWindow>() {
                             @Override
                             public void process(Context context, Iterable<Integer> iterable, Collector<TopProductEntity> collector) throws Exception {
                                 HashMap<Integer,Long> productCountMap=new HashMap<>();

                                 for(Integer productId:iterable){
                                     if(productCountMap.containsKey(productId))
                                     {
                                         Long oldValue=productCountMap.get(productId);
                                         productCountMap.put(productId,oldValue+1L);
                                     }
                                     else{
                                         productCountMap.put(productId,1L);
                                     }
                                 }

                                 ArrayList<Tuple2<Integer,Long>> productIdCountList=new ArrayList<>();
                                 for(Map.Entry<Integer,Long> entry:productCountMap.entrySet()){
                                     productIdCountList.add(Tuple2.of(entry.getKey(),entry.getValue()));
                                 }
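                                 // sort by count, descending; Long.compare avoids truncating the counts to int
                                 productIdCountList.sort(new Comparator<Tuple2<Integer, Long>>() {
                                     @Override
                                     public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
                                         return Long.compare(o2.f1, o1.f1);
                                     }
                                 });
                                 // emit at most 10 entries; a window may contain fewer than 10 distinct products
                                 for(int i=0;i<Math.min(10,productIdCountList.size());i++){
                                     Tuple2<Integer,Long> temp=productIdCountList.get(i);
                                     collector.collect(TopProductEntity.of(temp.f0,context.window().getEnd(),temp.f1));
                                 }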

                             }
                         }
                );
        result.print();
        env.execute();
    }
}
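The job above sorts every distinct product in the window and then takes the first ten. When the number of distinct products is large, a bounded min-heap is a common alternative that keeps only the current best n. The following plain-Java sketch illustrates that variation; TopNSketch and topN are hypothetical names, and this is a swapped-in technique rather than what the job above does.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopNSketch {

    /** Returns up to n (productId, count) entries with the highest counts, highest first. */
    public static List<Map.Entry<Integer, Long>> topN(Map<Integer, Long> counts, int n) {
        // min-heap ordered by count: the head is the weakest member of the current top n
        PriorityQueue<Map.Entry<Integer, Long>> heap = new PriorityQueue<>(
                Comparator.comparingLong((Map.Entry<Integer, Long> e) -> e.getValue()));
        for (Map.Entry<Integer, Long> entry : counts.entrySet()) {
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(entry));
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so the heap never holds more than n entries
            }
        }
        List<Map.Entry<Integer, Long>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> counts = new HashMap<>();
        counts.put(101, 5L);
        counts.put(102, 9L);
        counts.put(103, 1L);
        counts.put(104, 7L);
        System.out.println(topN(counts, 2));
    }
}
```

The heap never grows beyond n entries, so selecting the top n of m products costs O(m log n) instead of the O(m log m) of a full sort, and asking for more entries than exist simply returns all of them.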

Elasticsearch Sink

PUT /topproduct
{
    "settings": { 
    "number_of_shards": 3
 },
    "mappings": {
          "properties":{
             "productid":{
                "type":"keyword"
             },
             "times":{
                "type":"keyword"
             },
             "windowEnd":{
                "type":"date"
             }
          }
        }
    
}

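Real-time throughput

Running on a Windows host, the job used about 60% of an i7-9700 CPU and 16 GB of memory while processing 7.04 million × 10 ≈ 70 million records per minute. A custom SourceFunction emits about 70 million simulated log records per minute; the program computes the 1-minute Top N ranking in real time, with the sliding window advancing every 10 s.

On a cluster, throughput should be higher still.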

2> TopProductEntity{productId=110, actionTimes=7022101, windowEnd=1651199620000, rankName='1651199620000'}
6> TopProductEntity{productId=104, actionTimes=7049810, windowEnd=1651199630000, rankName='1651199630000'}
7> TopProductEntity{productId=108, actionTimes=7049275, windowEnd=1651199630000, rankName='1651199630000'}
8> TopProductEntity{productId=102, actionTimes=7049250, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=105, actionTimes=7049997, windowEnd=1651199630000, rankName='1651199630000'}
3> TopProductEntity{productId=106, actionTimes=7047867, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=109, actionTimes=7048695, windowEnd=1651199630000, rankName='1651199630000'}
2> TopProductEntity{productId=101, actionTimes=7048377, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=107, actionTimes=7045300, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=103, actionTimes=7051025, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=110, actionTimes=7047220, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=101, actionTimes=7075255, windowEnd=1651199640000, rankName='1651199640000'}

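# Submit the job to the cluster in session mode
# The application has only one job, so per-job mode and application mode differ little here and give similar resource isolation. The test cluster has no other jobs competing for CPU or other resources, so only session mode is tested.
bin/flink run -h
# Standalone mode: specify which jar to submit, which class to run, and where to submit it
# YARN mode: choose session, per-job, or application mode directly; no JobManager address is needed. -m sets the JobManager address, -c the entry class, -d runs detached.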
bin/flink run  -m hbase1:8081  -c com.demo.task.practice.ProcessAllWindowTopN  /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar

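Top N with a window process function

Count how often each product occurs, then sort.

Keyed Top N: first keyBy productId, then compute over a sliding window.

Once the per-product counts are computed, keyBy the window end so that results of the same window land together, then rank and emit them with a KeyedProcessFunction.

All results of one window are stored in state and a timer is registered. When the watermark reaches the window end, the timer fires, the per-product counts are ranked, and the hot products are emitted.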

DataStream<TopProductEntity> topProduct = dataStream.map(new TopProductMapFunction()).
                // extract the timestamp for watermarks; the log time is in seconds
                assignTimestampsAndWatermarks(new AscendingTimestampExtractor<LogEntity>() {
                    @Override
                    public long extractAscendingTimestamp(LogEntity logEntity) {
                        return logEntity.getTime() * 1000;
                    }
                })
                // key by productId, then apply a sliding window
                .keyBy("productId").timeWindow(Time.seconds(60),Time.seconds(5))
//                count the records per product and wrap the result as a TopProductEntity;
//                windowEnd is carried along so that results can be regrouped by window below
                .aggregate(new CountAgg(), new WindowResultFunction())
//                group the results of the same time window together
                .keyBy("windowEnd")
//                when the watermark reaches windowEnd, the timer fires, sorts the results, and outputs them as an ArrayList
//                flatMap then unpacks each list and emits one TopProductEntity per element, in rank order
//                what is sent downstream is the list for a single windowEnd, not accumulated state
                .process(new TopNHotItems(topSize))
                .flatMap(new FlatMapFunction<List<TopProductEntity>, TopProductEntity>() {
                    @Override
                    public void flatMap(List<TopProductEntity> TopProductEntitys, Collector<TopProductEntity> collector) throws Exception {
                        System.out.println("-------------Top N Product------------");
                        for (int i = 0; i < TopProductEntitys.size(); i++) {
                            TopProductEntity top = TopProductEntitys.get(i);
                            // print and emit the ranked result
                            System.out.println(top);
                            collector.collect(top);
                        }
                    }
                });

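Daily active users (DAU)

For each incoming record, check whether the user has already logged in today. A first login is recorded, and the DAU count is derived from those records.

Counting the distinct users who logged in during the day requires deduplication by user, using a HashSet or an external store such as Redis.

The deduplicated user records are retained.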
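The deduplication step can be pictured as one set of user IDs per calendar day, where adding an ID reports whether it was new. The sketch below uses an in-memory HashSet as a stand-in for the Redis set; the class and method names are hypothetical, and bucketing days in UTC is an assumption made only to keep the example deterministic.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DailyDedupSketch {
    // one set of user IDs per day; plays the role of the per-day Redis set
    private final Map<LocalDate, Set<Long>> seenByDay = new HashMap<>();

    private static LocalDate dayOf(long timestampMs) {
        // assumption: days are cut in UTC
        return Instant.ofEpochMilli(timestampMs).atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** Returns true only for a user's first login on the day of the given timestamp. */
    public boolean isFirstLoginToday(long userId, long timestampMs) {
        // Set.add returns false when the ID is already present, much like SADD returning 0
        return seenByDay.computeIfAbsent(dayOf(timestampMs), d -> new HashSet<>()).add(userId);
    }

    /** Number of distinct users seen on the day of the given timestamp. */
    public int dailyActiveUsers(long timestampMs) {
        return seenByDay.getOrDefault(dayOf(timestampMs), Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        DailyDedupSketch dedup = new DailyDedupSketch();
        System.out.println(dedup.isFirstLoginToday(1L, 0L));
        System.out.println(dedup.isFirstLoginToday(1L, 3_600_000L));
        System.out.println(dedup.dailyActiveUsers(0L));
    }
}
```

An in-memory set like this is lost on restart and is not shared between parallel instances, which is why an external store such as Redis (or Flink keyed state) is used in a real job.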

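# IDs of all daily active users
# Check whether the Redis state exists and create it if not. If it exists, look up the user ID in it and keep only users logging in for the first time today. Newly seen users are written to the state and inserted into Elasticsearch as external storage.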
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import com.demo.util.Property;
import com.demo.util.RedisUtil;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;

import org.apache.http.HttpHost;
import org.apache.http.client.methods.HttpPost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import redis.clients.jedis.Jedis;

import java.text.SimpleDateFormat;
import java.util.*;

public class DailyActiveUser {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);
        SingleOutputStreamOperator<StartLog> sourceStream=env.addSource(new StartLogSource())
                .process(new ProcessFunction<StartLog, StartLog>() {
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        jedis= RedisUtil.connectRedis(Property.getStrValue("redis.host"));
                        if(jedis!=null){
                            System.out.println("jedis connected: "+jedis);
                        }
                        simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd");
                    }
                    private Jedis jedis;
                    private SimpleDateFormat simpleDateFormat;
                    @Override
                    public void processElement(StartLog value, Context ctx, Collector<StartLog> out) throws Exception {
//                        look the user up in today's Redis set
                        Long userId=value.getUserId();
                        Long time=value.getTs();
//                        one Redis set per day, keyed by the formatted date (dt)
                        String key="flinkdau"+simpleDateFormat.format(new Date(time));
                        Long flag=jedis.sadd(key, String.valueOf(userId));
//                        a TTL of -1 means the key has no expiry yet: let it expire after 24 hours
                        if(jedis.ttl(key)==-1L)
                        {
                            jedis.expire(key, 3600 * 24);
                        }
                        if(flag==1){
                            out.collect(value);
                        }
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        jedis.close();
                    }
                });
//        sourceStream.print();

//        httppost
//elasticsearchSinkFunction
        ElasticsearchSinkFunction<StartLog> elasticsearchSinkFunction=new ElasticsearchSinkFunction<StartLog>() {
            @Override
            public void process(StartLog element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Long.valueOf(element.getUserId()).toString());
                result.put("entry",element.getEntry());
                result.put("ts",Long.valueOf(element.getTs()).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkdaustartlog").type("startLog").source(result).id(Long.valueOf(element.getUserId()).toString());
                indexer.add(indexRequest);
            }
        };
        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<StartLog> builder=new ElasticsearchSink.Builder<StartLog>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        sourceStream.addSink(builder.build());
        sourceStream.print();
        env.execute("flinkdau");
    }
}

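Number of first-order users today

In e-commerce, the requirement is to record which users placed their first order today and how many there are. Users who have ordered before must be filtered out. Why not use Redis?

On a content platform, the equivalent is the number of users making their first purchase today.

Because the first-order flag has to be kept permanently, Redis is a poor fit, so HBase records whether a user has already placed a first order.

Create the Elasticsearch index: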
PUT /newpurchaseuser
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "productId":{
          "type":"keyword"
         },
         "time":{
          "type":"date"
         }
      }
    }
}

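# Create the table with Phoenix
create table npuser(userid varchar not null primary key, flag varchar) salt_buckets=16;
upsert into npuser values('101','1');
# Check whether this is the user's first order; if so, keep the record and insert the user into HBase via Phoenix.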
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import com.demo.util.Property;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.*;
import java.util.concurrent.Executors;

public class NewPurchaseUser {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);
        SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
//                filter: keep only first-order users
                .process(new ProcessFunction<LogEntity, LogEntity>() {
                    private Connection conn;
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
                        String url = "jdbc:phoenix:hbase,hbase1,hbase2:2181";
                        conn = DriverManager.getConnection(url);
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        conn.close();
                    }

                    @Override
                    public void processElement(LogEntity value, Context ctx, Collector<LogEntity> out) throws Exception {
                        String userId=String.valueOf(value.getUserId());
//                        first order? look the user up in the npuser table
                        boolean seen;
                        try(java.sql.PreparedStatement query=conn.prepareStatement("select userid from npuser where userid = ?")){
                            query.setString(1,userId);
                            try(ResultSet resultSet=query.executeQuery()){
                                seen=resultSet.next();
                            }
                        }

//                        not seen before: mark the user in HBase and emit the record
                        if(!seen){
                            try(java.sql.PreparedStatement insert=conn.prepareStatement("upsert into npuser values(?,'1')")){
                                insert.setString(1,userId);
                                insert.executeUpdate();
                            }
                            conn.commit();
                            out.collect(value);
                        }
                    }

                });

        ElasticsearchSinkFunction<LogEntity> elasticsearchSinkFunction=new ElasticsearchSinkFunction<LogEntity>() {
            @Override
            public void process(LogEntity element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Integer.valueOf(element.getUserId()).toString());
                result.put("productId",Integer.valueOf(element.getProductId()).toString());
                result.put("time",Long.valueOf(element.getTime()).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("newpurchaseuser").source(result).id(Integer.valueOf(element.getUserId()).toString());
                indexer.add(indexRequest);
            }
        };

        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost(new Property().getElasProperties().getProperty("host"),9201,"http"));
        ElasticsearchSink.Builder<LogEntity> builder=new ElasticsearchSink.Builder<LogEntity>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);

        sourceStream.addSink(builder.build());
        sourceStream.print();
        env.execute("npuser");
    }
}

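UV PV

UV (Unique Visitors): the number of distinct users who viewed a page.

PV (Page Views): the number of times a page was viewed.

PV/UV is the average number of views per user, that is, how many times each user views the page on average. To some extent it reflects user stickiness.

UV and PV are also frequency counts over an object, in this case a page. UV additionally needs deduplication.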
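A PV/UV accumulator needs two pieces of state: a running view count and the set of distinct users. The following plain-Java sketch shows the add, merge, and result steps under that assumption; PvUvAccumulator and its methods are illustrative names, not part of the project or of Flink.

```java
import java.util.HashSet;
import java.util.Set;

public class PvUvAccumulator {
    private final Set<Integer> users = new HashSet<>(); // distinct visitors (UV)
    private long pageViews;                              // total views (PV)

    /** Records one page view by the given user. */
    public void add(int userId) {
        users.add(userId);
        pageViews++;
    }

    /** Folds another accumulator into this one: union of users, sum of views. */
    public PvUvAccumulator merge(PvUvAccumulator other) {
        users.addAll(other.users);
        pageViews += other.pageViews;
        return this;
    }

    public long pv() { return pageViews; }

    public int uv() { return users.size(); }

    /** Average views per distinct user; 0 when nothing has been recorded. */
    public double pvPerUv() {
        return users.isEmpty() ? 0.0 : (double) pageViews / users.size();
    }

    public static void main(String[] args) {
        PvUvAccumulator acc = new PvUvAccumulator();
        acc.add(1);
        acc.add(1);
        acc.add(2);
        System.out.println(acc.pv() + " views by " + acc.uv() + " users, PV/UV = " + acc.pvPerUv());
    }
}
```

Keeping every user ID in a set is exact but grows with the number of visitors; for very large pages an approximate structure is sometimes substituted, at the cost of an estimated rather than exact UV.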

Create the Elasticsearch index:
PUT /uvpv
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "uvpv":{
          "type":"keyword"
         },
         "productId":{
          "type":"keyword"
         },
         "ts":{
          "type":"date"
         }
      }
    }
}

package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.*;

public class UVPV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<Integer,Double>> stream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                        .withTimestampAssigner(
                                new SerializableTimestampAssigner<LogEntity>() {
                                    @Override
                                    public long extractTimestamp(LogEntity element, long recordTimestamp) {
                                        return element.getTime();
                                    }
                                }
                        )
                )
                .keyBy(LogEntity-> LogEntity.getProductId())
                .window( SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
                .aggregate(new AggregateFunction<LogEntity, Tuple2<HashSet<String>, Long>, Double>() {
                    @Override
                    public Tuple2<HashSet<String>, Long> createAccumulator() {
                        return Tuple2.of(new HashSet<String>(), 0L);
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> add(LogEntity value, Tuple2<HashSet<String>, Long> accumulator) {
                        accumulator.f0.add(Integer.valueOf(value.getUserId()).toString());
                        return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
                    }

                    @Override
                    public Double getResult(Tuple2<HashSet<String>, Long> accumulator) {
                        return (double) accumulator.f1 / accumulator.f0.size();
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
                        // union the user sets and add up the view counts
                        a.f0.addAll(b.f0);
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                }, new WindowFunction<Double, Tuple2<Integer,Double>, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer integer, TimeWindow window, Iterable<Double> input, Collector<Tuple2<Integer,Double>> out) throws Exception {
                        out.collect(Tuple2.of(integer,input.iterator().next()));
                    }
                });

        ElasticsearchSinkFunction<Tuple2<Integer,Double>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple2<Integer,Double>>() {
            @Override
            public void process(Tuple2<Integer,Double> element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("productId",Integer.valueOf(element.f0).toString());
                result.put("pvuv",Double.valueOf(element.f1).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkpvuv").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
                indexer.add(indexRequest);
            }
        };

        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<Tuple2<Integer,Double>> builder=new ElasticsearchSink.Builder<Tuple2<Integer,Double>>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        stream.print();
        stream.addSink(builder.build());
        env.execute("UVPV");
    }
}

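# Flink SQL version

CEP: consecutive login failures

Next, consider a concrete requirement: monitor user behavior and emit an alert when a user fails to log in three times in a row. This is clearly a complex-event detection task, and Flink CEP can implement it.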
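Before turning to the CEP pattern, the rule itself can be stated as a small per-user state machine: remember the timestamps of the current run of failures, reset on a success, and alert when the run reaches three. The sketch below is plain Java written for illustration, with hypothetical names, and it ignores out-of-order events and time limits, which a CEP library handles for you.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConsecutiveFailureSketch {
    private final int threshold;
    // per user: timestamps of the most recent uninterrupted failures
    private final Map<Integer, Deque<Long>> failureRuns = new HashMap<>();

    public ConsecutiveFailureSketch(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Feeds one login event. Returns the timestamps of the last `threshold` consecutive
     * failures when this event completes such a run, otherwise an empty list.
     */
    public List<Long> onEvent(int userId, long timestampMs, boolean failed) {
        if (!failed) {
            failureRuns.remove(userId); // a success breaks the run
            return Collections.emptyList();
        }
        Deque<Long> run = failureRuns.computeIfAbsent(userId, id -> new ArrayDeque<>());
        run.addLast(timestampMs);
        if (run.size() > threshold) {
            run.removeFirst(); // keep only the latest `threshold` failures
        }
        return run.size() == threshold ? new ArrayList<>(run) : Collections.<Long>emptyList();
    }

    public static void main(String[] args) {
        ConsecutiveFailureSketch detector = new ConsecutiveFailureSketch(3);
        detector.onEvent(7, 1_000L, true);
        detector.onEvent(7, 2_000L, true);
        System.out.println(detector.onEvent(7, 3_000L, true));
    }
}
```

A fourth consecutive failure produces another alert covering failures two to four, which mirrors how a strict three-step pattern matches again on each new trailing event.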

PUT /cepfailbehavior
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "first":{
          "type":"date"
         },
         "second":{
          "type":"date"
         },
         "third":{
          "type":"date"
         }
      }
    }
}

package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FailBehavior {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        System.setProperty("HADOOP_USER_NAME", "root");
        System.setProperty("user.name", "root");

        env.setParallelism(4);
//        checkpoint
        env.enableCheckpointing(1000);
        CheckpointConfig config=env.getCheckpointConfig();
        config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        config.setMinPauseBetweenCheckpoints(500);
        config.setCheckpointTimeout(60000);
        config.setMaxConcurrentCheckpoints(1);
        config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        config.enableUnalignedCheckpoints();
        config.setCheckpointStorage("hdfs://hbase:9000/flink/checkpoints");

        DataStream<LogEntity> sourceStream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                                .withTimestampAssigner(
                                        new SerializableTimestampAssigner<LogEntity>() {
                                            @Override
                                            public long extractTimestamp(LogEntity value, long l)
                                            {
                                                return value.getTime();
                                            }
                                        }
                                )
                )
                .keyBy(LogEntity->LogEntity.getUserId());
//        sourceStream.print();
        Pattern<LogEntity,LogEntity> pattern=Pattern
                .<LogEntity>begin("first")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("1");
                    }
                })
                .next("second")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("1");

                    }
                })
                .next("third")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("2");
                    }
                });
        PatternStream<LogEntity> patternStream=CEP.pattern(sourceStream,pattern);
        DataStream<Tuple4<Integer,Long,Long,Long>> stream=patternStream.select(new PatternSelectFunction<LogEntity, Tuple4<Integer,Long,Long,Long>>() {
            @Override
            public Tuple4<Integer,Long,Long,Long> select(Map<String, List<LogEntity>> map) throws Exception {
                LogEntity first=map.get("first").get(0);
                LogEntity second=map.get("second").get(0);
                LogEntity third=map.get("third").get(0);
                return Tuple4.of(first.getUserId(),first.getTime(),second.getTime(),third.getTime());
            }
        });
        stream.print("warning");
//        sinkfunciton
//        httphost
//        essinkbulder
        ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>>() {
            @Override
            public void process(Tuple4<Integer,Long,Long,Long> element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Integer.valueOf(element.f0).toString());
                result.put("first",Long.valueOf(element.f1).toString());
                result.put("second",Long.valueOf(element.f2).toString());
                result.put("third",Long.valueOf(element.f3).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkwarning").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
                indexer.add(indexRequest);
            }
        };
        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>> builder=new ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        stream.addSink(builder.build());

        env.execute("FailBehavior");
    }
}

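Checkpoints

The CEP example above uses checkpoints.

1. When the streaming job has to be stopped, a checkpoint is saved. 2. When the job is restored from the checkpoint, in-memory state and external storage are brought back to a correct state after a failure.

Keyword filtering with broadcast state

Broadcast state provides global configuration: settings that may change are distributed to all parallel instances as a broadcast variable. In this example, the broadcast configuration defines the job's processing rule.

https://blog.csdn.net/wangpei1949/article/details/99698978

The configuration is fetched from MySQL periodically and broadcast.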

PUT /boradcastf
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "first":{
          "type":"date"
         },
         "second":{
          "type":"date"
         },
         "third":{
          "type":"date"
         }
      }
    }
}

# Main function
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class KeywordsFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<LogEntity> sourceStream=env.addSource(new ClickSource());
        DataStream<String> configStream=env.addSource(new MysqlSource("hbase",3306,"con","root","root",1));

        MapStateDescriptor<String,String> configStateDescriptor=new MapStateDescriptor<String, String>("config", Types.STRING,Types.STRING);
        BroadcastStream<String> broadcastConfigStream=configStream.broadcast(configStateDescriptor);
        BroadcastConnectedStream<LogEntity,String> broadcastConnectedStream=sourceStream.connect(broadcastConfigStream);

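        DataStream<LogEntity> filterStream=broadcastConnectedStream.process(new BroadcastProcessFunction<LogEntity,String,LogEntity>(){
            @Override
            public void processElement(LogEntity value, ReadOnlyContext ctx, Collector<LogEntity> out) throws Exception {
                // Read the latest keyword from broadcast state; it is null until the first config record arrives.
                String keyword=ctx.getBroadcastState(configStateDescriptor).get("keyword");
                if(keyword!=null && value.getProductId()==Integer.parseInt(keyword))
                {
                    out.collect(value);
                }
            }

            @Override
            public void processBroadcastElement(String value, Context ctx, Collector<LogEntity> out) throws Exception {
                // Keep the keyword in broadcast state (not in a plain field) so that it is included in checkpoints.
                ctx.getBroadcastState(configStateDescriptor).put("keyword",value);
            }
        });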
        filterStream.print();
        env.execute("keywordFilter");
    }
}

#Custom MysqlSource
package com.demo.task.practice;


import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MysqlSource extends RichSourceFunction<String> {
    // volatile: cancel() is called from a different thread than run()
    private volatile boolean running=true;
    private Connection connection;

    private String host;
    private Integer port;
    private String db;
    private String user;
    private String passwd;
    private Integer secondInterval;
    private PreparedStatement preparedStatement;

    public MysqlSource(String host, Integer port, String db, String user, String passwd, Integer secondInterval) {
        this.host = host;
        this.port = port;
        this.db = db;
        this.user = user;
        this.passwd = passwd;
        this.secondInterval = secondInterval;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        Class.forName("com.mysql.jdbc.Driver");
        connection= DriverManager.getConnection("jdbc:mysql://"+host+":"+port+"/"+db+"?useUnicode=true&characterEncoding=UTF-8",user,passwd);
        String sql="select keyword from config";
        preparedStatement=connection.prepareStatement(sql);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running){
            // Re-run the query on every cycle; try-with-resources closes the ResultSet each time.
            try (ResultSet resultset=preparedStatement.executeQuery()){
                while (resultset.next()){
                    ctx.collect(resultset.getString("keyword"));
                }
            }
            // Wait secondInterval seconds before polling MySQL again.
            Thread.sleep(1000L*secondInterval);
        }
    }

    @Override
    public void cancel() {
        running=false;
    }

    @Override
    public void close() throws Exception {
        super.close();
        // Close the statement before the connection that owns it.
        if(preparedStatement!=null){
            preparedStatement.close();
        }
        if(connection!=null){
            connection.close();
        }
    }
}

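Savepoints

Once checkpointing is enabled, you configure the checkpoint interval, where checkpoints are stored (heap memory or external storage such as HDFS), and the exactly-once mode. Checkpoints are managed by Flink itself: when the trigger interval elapses, Flink automatically saves the application's state. If a failure occurs, Flink restores the application state from the saved checkpoint and continues running, which provides failure recovery and improves fault tolerance.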
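The three settings mentioned above (interval, storage location, exactly-once mode) might be configured roughly as follows. This is a sketch under assumptions: it presumes Flink 1.13 or later on the classpath (for `setCheckpointStorage`), and the 10-second interval, the HDFS path, and the class name `CheckpointConfigSketch` are illustrative values, not taken from the project.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 s in exactly-once mode (the interval is an assumed value).
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Store checkpoints on HDFS rather than in JobManager heap memory (hypothetical path).
        config.setCheckpointStorage("hdfs://hbase:9000/flink/checkpoints");
        // Leave at least 500 ms between the end of one checkpoint and the start of the next.
        config.setMinPauseBetweenCheckpoints(500L);
        // Abort a checkpoint that takes longer than one minute.
        config.setCheckpointTimeout(60_000L);
        // Allow only one checkpoint in flight at a time.
        config.setMaxConcurrentCheckpoints(1);

        // Placeholder pipeline so that the job has something to execute.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-config-sketch");
    }
}
```

Exactly-once here refers to the consistency of Flink's own state; end-to-end exactly-once additionally depends on the source and sink connectors.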

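Savepoints, by contrast, must be triggered and managed by the user, including when they are taken and how they are used. We can take savepoints of an application in a planned way and later restore the application from one. So that an application can be restored directly, a savepoint stores some extra metadata compared with a checkpoint, such as operator IDs.

Common use cases:

1. Version management and archiving
2. Upgrading the Flink version
3. Updating the application
4. Changing the parallelism
5. Pausing the application

Note that program changes are compatible only under certain conditions: the topology of the stateful operators and the data types of the state must not change. When using savepoints, set an ID (uid) on each operator.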
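Why the operator IDs matter can be shown with a small plain-Java model. It is only a sketch: `SavepointUidSketch`, `restore`, and `orphans` are hypothetical names, and a savepoint is reduced to a map from operator ID to one state value. It mirrors the rule that state is handed back to the operator carrying the same ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of how savepoint state is matched to operators by ID; not Flink code. */
class SavepointUidSketch {
    /**
     * Returns the state each operator of the new job starts with.
     * An operator whose ID is not in the savepoint starts without state (null here).
     */
    static Map<String, Long> restore(Map<String, Long> savepoint, List<String> operatorIds) {
        Map<String, Long> restored = new LinkedHashMap<>();
        for (String id : operatorIds) {
            restored.put(id, savepoint.get(id));
        }
        return restored;
    }

    /** Returns the savepoint entries that no operator in the new job claims. */
    static List<String> orphans(Map<String, Long> savepoint, List<String> operatorIds) {
        List<String> result = new ArrayList<>();
        for (String id : savepoint.keySet()) {
            if (!operatorIds.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> savepoint = new LinkedHashMap<>();
        savepoint.put("source-id", 42L);
        savepoint.put("mapper-id", 7L);

        // The new version of the job renamed "mapper-id" to "mapper-v2".
        List<String> newJob = Arrays.asList("source-id", "mapper-v2");
        System.out.println(restore(savepoint, newJob));
        System.out.println(orphans(savepoint, newJob));
    }
}
```

In this model the renamed operator starts empty and its old state is left unclaimed. In Flink itself, unclaimed savepoint state makes the restore fail unless `--allowNonRestoredState` is passed to `flink run`, which is one more reason to assign stable IDs up front.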

package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SavePoint {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new ClickSource())
                // Stable operator ID, used to match this operator's state in a savepoint.
                .uid("source-id")
                .map(logEntity->logEntity)
                .uid("mapper-id")
                .print();
        env.execute("savepoint");
    }
}

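#Submit the job
bin/flink run -m hbase1:8081 -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
#Trigger a savepoint for the running job (jobId is a placeholder), written either to the local file system or to HDFS
bin/flink savepoint jobId  file:///opt/module/flink/savepoints
bin/flink savepoint jobId  hdfs://hbase:9000/flink/savepoints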

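#Restore from a savepoint; -s takes the concrete savepoint directory printed by the savepoint command, not its parent directory (savepointDir is a placeholder)
bin/flink run -s hdfs://hbase:9000/flink/savepoints/savepointDir -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar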