Flink Real-Time Project

Recommendation Module

In flink-2-hbase, the recommendation part is mainly split into the following Flink tasks.

MySQL mainly stores user information and product information, acting as dimension tables used to enrich (join) the log records.

HBase stores the results produced by the Flink jobs.

Log Import

Data received from Kafka is written directly into an HBase fact table, preserving the complete log. Each log record contains the user id, the product id the user acted on, the operation time, and the action (e.g. purchase, click, recommend).

Data is aggregated over time windows to produce the metrics needed by the dashboard and returned to the front end for display.

The data is stored in the HBase con table.

Read data from the Kafka con topic, implement a MapFunction that parses each log into a LogEntity (userId, productId, time, action), build the HBase rowkey by concatenating user id, product id and timestamp, and insert each record into the HBase con table.
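
A minimal sketch of this step, assuming the project's LogEntity POJO, a column family named "log", and the ZooKeeper quorum used elsewhere in this post; the family and class names are illustrative, not the project's actual code:

// Hedged sketch: parse-and-write step of the log-import task.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ConTableWriter extends RichMapFunction<LogEntity, LogEntity> {
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // ZooKeeper quorum taken from the Phoenix URL used later in this post
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hbase,hbase1,hbase2");
        connection = ConnectionFactory.createConnection(conf);
    }

    @Override
    public LogEntity map(LogEntity log) throws Exception {
        // rowkey = userId_productId_timestamp, as described above
        String rowKey = log.getUserId() + "_" + log.getProductId() + "_" + log.getTime();
        try (Table table = connection.getTable(TableName.valueOf("con"))) {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("action"), Bytes.toBytes(log.getAction()));
            table.put(put);
        }
        return log;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}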

User-Product Browsing History -> implements the collaborative-filtering recommendation logic

Flink records which products in a category each user has browsed, preparing for the later item-based collaborative filtering, and writes the user's ratings into HBase in real time for subsequent offline processing.

Read data from the Kafka con topic, implement a MapFunction, and write the user id and product id into the corresponding rows of the u_history and p_history tables, incrementing the counters.
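
A minimal sketch of the counter update, assuming an already-open HBase Connection and column families named "p" / "u"; the family and qualifier layout is an assumption, not the project's actual schema:

// Hedged sketch: maintain user->product and product->user browse counters.
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HistoryCounter {
    public static void record(Connection hbase, int userId, int productId) throws Exception {
        try (Table uHistory = hbase.getTable(TableName.valueOf("u_history"));
             Table pHistory = hbase.getTable(TableName.valueOf("p_history"))) {
            // row = user, column = product: how many times this user viewed this product
            uHistory.incrementColumnValue(Bytes.toBytes(String.valueOf(userId)),
                    Bytes.toBytes("p"), Bytes.toBytes(String.valueOf(productId)), 1L);
            // row = product, column = user: how many times this product was viewed by this user
            pHistory.incrementColumnValue(Bytes.toBytes(String.valueOf(productId)),
                    Bytes.toBytes("u"), Bytes.toBytes(String.valueOf(userId)), 1L);
        }
    }
}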

User Portrait Computation -> implements the tag-based recommendation logic

The user portrait is computed along three dimensions: the user's color preference, origin (country) preference, and style preference. The portrait is continuously updated from the logs and recorded in HBase; the data lives in the HBase user table.

Read data from the Kafka con topic, look up the corresponding product information (country, color, style), and in the row keyed by userId increment the cells whose columns are the (country, color, style) values. The destination is the user table.
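
A minimal sketch of the portrait update, assuming a MySQL dimension table product(id, country, color, style) and an interest column family "i" in the HBase user table; the table, column and family names are assumptions:

// Hedged sketch: look up the product's tags in MySQL and bump the matching
// interest counters in the HBase "user" table (row = userId, column = tag value).
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UserPortraitUpdater {
    public static void update(Connection mysql,
                              org.apache.hadoop.hbase.client.Connection hbase,
                              int userId, int productId) throws Exception {
        try (PreparedStatement ps = mysql.prepareStatement(
                "select country, color, style from product where id = ?")) {
            ps.setInt(1, productId);
            try (ResultSet rs = ps.executeQuery();
                 Table user = hbase.getTable(TableName.valueOf("user"))) {
                if (rs.next()) {
                    byte[] row = Bytes.toBytes(String.valueOf(userId));
                    // one counter cell per interest value
                    user.incrementColumnValue(row, Bytes.toBytes("i"), Bytes.toBytes(rs.getString("country")), 1L);
                    user.incrementColumnValue(row, Bytes.toBytes("i"), Bytes.toBytes(rs.getString("color")), 1L);
                    user.incrementColumnValue(row, Bytes.toBytes("i"), Bytes.toBytes(rs.getString("style")), 1L);
                }
            }
        }
    }
}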

Product Portrait Recording -> implements the tag-based recommendation logic

The product portrait is recorded along two dimensions: the age group and the gender of the users who like the product. The data is stored in the HBase prod table.

Read data from the Kafka con topic, implement a MapFunction, query the user's sex and age from the MySQL user table, and increment the corresponding counters in the HBase prod table.

Real-Time Hot List -> implements the popularity-based recommendation logic

Using Flink's time-window mechanism, the current real-time popularity is computed; a ListState holds one snapshot of the hot list while it is being assembled, and the results are cached in Redis as lists keyed by timestamp.

# Aggregate the data and write it to Redis for the real-time hot list
First key by productId, and inside each productId apply a sliding window with an aggregate operation that counts the records and wraps the count as a TopProductEntity.
The aggregate simply adds one for every incoming record. Then keyBy(windowEnd) brings together the products belonging to the same window, they are sorted to obtain the ranking, and the result is written out. To see why the second keyBy on windowEnd is needed, think about what a window function emits downstream: one record uniquely identified by its partition (key) plus its window interval.
# Understanding the stream-processing model
When keyBy is applied to a stream, it essentially creates n partitions and routes incoming records to them. Any operator applied after the keyBy runs on each of those partitions. If that operator is an aggregation, there is one aggregation result per key. Is that result accumulated over the whole stream, or only over the current record? It is accumulated over the whole stream: each new record updates the state and the updated result is sent downstream. With a window added, the state belongs to one window: once that window's computation finishes it is done, and no state accumulates across windows.
The question here is how to collect, for the same window, the aggregated results coming from different partitions. Without keying by the window, all records would be handled by the same task; would that accumulate? Yes, every aggregation without a window accumulates.
The key to keeping this straight is to determine: 1. when the computation is triggered; 2. what data it operates on; 3. whether its result is accumulated state; 4. where the result is sent.

With a window, the computation is triggered when the watermark reaches the window end. It operates on the records inside that window (restricted to the corresponding partition if the stream is keyed). The result is not accumulated state; the number of results is (number of keys) x (number of windows), and all of them are sent to the downstream task. If a downstream task needs the records belonging to one particular window, it can only group them by the window end time. Grouping by itself never triggers a computation, so a timer is used: when the watermark reaches the window end, the timer fires and all data of that window-end group is processed (and can then be unpacked with flatMap).
aggregate: performs incremental aggregation over the data.
process: allows complex operations on the stream, including registering and firing timers. Besides the element-processing method, hooks such as open and onTimer can be overridden; it is a relatively low-level but feature-rich API.

Real-Time Metrics

Window Processing

There are several ways to structure windowed processing:

1. Without keyBy: use windowAll plus a window process function to process, for each window, the data of all keys.

2. With keyBy, then window: use an aggregate function to aggregate the data uniquely identified by (key, window).

3. With keyBy, then window: use a window process function (or full-window function) to aggregate the data uniquely identified by (key, window).

Functions that play an equivalent role here: 1. incremental aggregation functions (reduce, aggregate); 2. full-window functions; 3. window process functions. A minimal sketch of approach 3 is shown below.
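
For comparison, a minimal sketch of approach 3 (keyBy, then window, then a window process function), written against the LogEntity stream used elsewhere in this post; it counts records per (productId, window) and assumes the usual Flink imports:

// Hedged sketch of approach 3: per-key, per-window full processing.
stream.keyBy(LogEntity::getProductId)
        .window(SlidingEventTimeWindows.of(Time.seconds(60), Time.seconds(10)))
        .process(new ProcessWindowFunction<LogEntity, Tuple2<Integer, Long>, Integer, TimeWindow>() {
            @Override
            public void process(Integer productId, Context ctx,
                                Iterable<LogEntity> elements, Collector<Tuple2<Integer, Long>> out) {
                long count = 0L;
                for (LogEntity ignored : elements) {
                    count++;
                }
                // one result per (key, window)
                out.collect(Tuple2.of(productId, count));
            }
        });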

Log Structure

Login log
Login-failure log

# Page data, event data, startup data, and error data

Of the event, page, exposure, error, and startup log types, this Flink project needs the event logs (for collaborative filtering), page logs (for UV/PV analysis), exposure logs (for exposure counting), and startup logs (for daily-active-user tracking).

In addition, the business logs of product transactions are needed to compute Top N products and users' first-order information.

Some data can be obtained from either the business database or the behavior logs; for example, adding to the cart or clicking "favorite" can be captured by front-end logging or read from the business database.

Top N Products

Every 10 s, compute the product-popularity ranking over the last minute and output the current Top N.

windowAll

Without keyBy

# The initial idea: skip keyBy, run everything in one partition, and maintain a HashMap whose key is the url (product) and whose value is its popularity, represented by the access count.
# Here windowAll opens the window directly, and a process-all-window function does the full computation inside each window.
public class ProcessAllWindowTopN {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
        SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                           .withTimestampAssigner(new SerializableTimestampAssigner<LogEntity>() {
                               @Override
                               public long extractTimestamp(LogEntity element, long recordTimestamp) {
                                   return element.getTime();
                               }
                           })
                );

//        SingleOutputStreamOperator<Integer> result=sourceStream.map(new MapFunction<LogEntity, Integer>() {
//            @Override
//            public Integer map(LogEntity value) throws Exception {
//                return value.getProductId();
//            }
//        });
        SingleOutputStreamOperator<TopProductEntity> result=sourceStream.map(logEntity->logEntity.getProductId())
                .windowAll(SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
                .process(new ProcessAllWindowFunction<Integer, TopProductEntity, TimeWindow>() {
                             @Override
                             public void process(Context context, Iterable<Integer> iterable, Collector<TopProductEntity> collector) throws Exception {
                                 HashMap<Integer,Long> productCountMap=new HashMap<>();

                                 for(Integer productId:iterable){
                                     if(productCountMap.containsKey(productId))
                                     {
                                         Long oldValue=productCountMap.get(productId);
                                         productCountMap.put(productId,oldValue+1L);
                                     }
                                     else{
                                         productCountMap.put(productId,1L);
                                     }
                                 }

                                 ArrayList<Tuple2<Integer,Long>> productIdCountList=new ArrayList<>();
                                 for(Map.Entry<Integer,Long> entry:productCountMap.entrySet()){
                                     productIdCountList.add(Tuple2.of(entry.getKey(),entry.getValue()));
                                 }
                                 // sort descending by count (Long.compare avoids int overflow)
                                 productIdCountList.sort(new Comparator<Tuple2<Integer, Long>>() {
                                     @Override
                                     public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
                                         return Long.compare(o2.f1, o1.f1);
                                     }
                                 });
                                // emit the top 10, guarding against windows with fewer than 10 products
                                for(int i=0;i<Math.min(10,productIdCountList.size());i++){
                                    Tuple2<Integer,Long> temp=productIdCountList.get(i);
                                    collector.collect(TopProductEntity.of(temp.f0,context.window().getEnd(),temp.f1));
                                }

                             }
                         }
                );
        result.print();
        env.execute();
    }
}
Elasticsearch Sink
PUT /topproduct
{
    "settings": { 
    "number_of_shards": 3
 },
    "mappings": {
          "properties":{
             "productid":{
                "type":"keyword"
             },
             "times":{
                "type":"keyword"
             },
             "windowEnd":{
                "type":"date"
             }
          }
        }
    
}
Real-Time Processing Throughput

Running on a Windows host, the job used about 60% of an i7-9700 CPU and 16 GB of memory, processing roughly 7.04 million x 10 ≈ 70 million records per minute. With a custom SourceFunction emitting about 70 million simulated log records per minute, the program keeps up with the 1-minute Top N ranking in real time, with the sliding window firing every 10 s.

Testing on a cluster would give even higher throughput.

2> TopProductEntity{productId=110, actionTimes=7022101, windowEnd=1651199620000, rankName='1651199620000'}
6> TopProductEntity{productId=104, actionTimes=7049810, windowEnd=1651199630000, rankName='1651199630000'}
7> TopProductEntity{productId=108, actionTimes=7049275, windowEnd=1651199630000, rankName='1651199630000'}
8> TopProductEntity{productId=102, actionTimes=7049250, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=105, actionTimes=7049997, windowEnd=1651199630000, rankName='1651199630000'}
3> TopProductEntity{productId=106, actionTimes=7047867, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=109, actionTimes=7048695, windowEnd=1651199630000, rankName='1651199630000'}
2> TopProductEntity{productId=101, actionTimes=7048377, windowEnd=1651199630000, rankName='1651199630000'}
5> TopProductEntity{productId=107, actionTimes=7045300, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=103, actionTimes=7051025, windowEnd=1651199630000, rankName='1651199630000'}
4> TopProductEntity{productId=110, actionTimes=7047220, windowEnd=1651199630000, rankName='1651199630000'}
1> TopProductEntity{productId=101, actionTimes=7075255, windowEnd=1651199640000, rankName='1651199640000'}
# Submit the job to the cluster in session mode
# Since this application contains only one job, per-job mode and application mode make little difference (resource isolation is roughly the same), and the test cluster has no other jobs competing for CPU, so only session mode is tested.
bin/flink run -h
# Standalone-mode submission: specify which jar to submit, which class to run, and which cluster to submit to
# In YARN mode you choose session, per-job, or application mode directly and do not need to specify the JobManager address. -m specifies the JobManager, -c the entry class, -d detached (background) execution.
bin/flink run  -m hbase1:8081  -c com.demo.task.practice.ProcessAllWindowTopN  /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
Top N with Window Functions

Count the frequency of each product, then sort.

keyBy Top N: first group by productId with keyBy, then compute over a sliding window.

After the window computation, key the results of the same window together and use a KeyedProcessFunction to compute and emit the ranking.

All records belonging to the same window are buffered in keyed state, and a timer is registered; when the watermark reaches the window end, the timer fires, the per-product counts are sorted, and the ranking of hot products is produced (a sketch of such a function is given right after this paragraph).
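
TopNHotItems, used in the code below, is not shown in this post; a minimal sketch of what such a KeyedProcessFunction could look like, under the assumption that TopProductEntity exposes getWindowEnd() and getActionTimes() getters:

// Hedged sketch of a TopNHotItems-style function; the TopProductEntity getters are assumptions.
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopNHotItems extends KeyedProcessFunction<Tuple, TopProductEntity, List<TopProductEntity>> {
    private final int topSize;
    private transient ListState<TopProductEntity> itemState;

    public TopNHotItems(int topSize) {
        this.topSize = topSize;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        itemState = getRuntimeContext().getListState(
                new ListStateDescriptor<>("item-state", TopProductEntity.class));
    }

    @Override
    public void processElement(TopProductEntity value, Context ctx,
                               Collector<List<TopProductEntity>> out) throws Exception {
        // buffer the per-product count and wait until the whole window has arrived
        itemState.add(value);
        ctx.timerService().registerEventTimeTimer(value.getWindowEnd() + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<List<TopProductEntity>> out) throws Exception {
        List<TopProductEntity> allItems = new ArrayList<>();
        for (TopProductEntity item : itemState.get()) {
            allItems.add(item);
        }
        itemState.clear();
        // sort by count descending and emit the top N as one list
        allItems.sort(Comparator.comparingLong(TopProductEntity::getActionTimes).reversed());
        out.collect(new ArrayList<>(allItems.subList(0, Math.min(topSize, allItems.size()))));
    }
}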

DataStream<TopProductEntity> topProduct = dataStream.map(new TopProductMapFunction()).
                // extract the timestamp (in seconds) and use it for the watermark
                assignTimestampsAndWatermarks(new AscendingTimestampExtractor<LogEntity>() {
                    @Override
                    public long extractAscendingTimestamp(LogEntity logEntity) {
                        return logEntity.getTime() * 1000;
                    }
                })
                // key by productId, then apply a sliding window
                .keyBy("productId").timeWindow(Time.seconds(60),Time.seconds(5))
//                count the records of each (productId, window) and wrap the count as a TopProductEntity; windowEnd is kept for the grouping below
                .aggregate(new CountAgg(), new WindowResultFunction())
//                group the results of the same time window together
                .keyBy("windowEnd")
//                when the watermark reaches windowEnd, the timer fires, the counts are sorted, and the ranking is emitted as an ArrayList
//                flatMap then unpacks each ArrayList, emitting one TopProductEntity per entry together with its rank
//                what is sent downstream is the ArrayList for one windowEnd, not accumulated state
                .process(new TopNHotItems(topSize))
                .flatMap(new FlatMapFunction<List<TopProductEntity>, TopProductEntity>() {
                    @Override
                    public void flatMap(List<TopProductEntity> TopProductEntitys, Collector<TopProductEntity> collector) throws Exception {
                        System.out.println("-------------Top N Product------------");
                        for (int i = 0; i < TopProductEntitys.size(); i++) {
                            TopProductEntity top = TopProductEntitys.get(i);
                            // print the ranking result
                            System.out.println(top);
                            collector.collect(top);
                        }
                    }
                });
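
CountAgg and WindowResultFunction above are project classes that are also not shown in this post; a minimal sketch of what they could look like for this pipeline, assuming the TopProductEntity.of(productId, windowEnd, count) factory used in the windowAll version above:

// Hedged sketches of the incremental aggregate and the window function it is paired with.
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// counts one per incoming record
class CountAgg implements AggregateFunction<LogEntity, Long, Long> {
    @Override public Long createAccumulator() { return 0L; }
    @Override public Long add(LogEntity value, Long acc) { return acc + 1; }
    @Override public Long getResult(Long acc) { return acc; }
    @Override public Long merge(Long a, Long b) { return a + b; }
}

// attaches the key (productId) and the window end to the count produced by CountAgg
class WindowResultFunction implements WindowFunction<Long, TopProductEntity, Tuple, TimeWindow> {
    @Override
    public void apply(Tuple key, TimeWindow window, Iterable<Long> counts, Collector<TopProductEntity> out) {
        Integer productId = key.getField(0);
        long count = counts.iterator().next();
        out.collect(TopProductEntity.of(productId, window.getEnd(), count));
    }
}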

Daily Active Users

For every incoming record, check whether the user has already logged in today; if it is a new login, record it, and then count the daily-active total.

To count how many distinct users have logged in so far today, users must be deduplicated, using a HashSet or an external store such as Redis.

The per-user results are retained.

# All user ids of today's active users
# Check whether the Redis state exists and create it if not; if it exists, look up the state by user id and keep only newly logged-in users. Write new users into the state and insert them into the external ES store.
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import com.demo.util.Property;
import com.demo.util.RedisUtil;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;

import org.apache.http.HttpHost;
import org.apache.http.client.methods.HttpPost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;
import redis.clients.jedis.Jedis;

import java.text.SimpleDateFormat;
import java.util.*;

public class DailyActiveUser {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);
        SingleOutputStreamOperator<StartLog> sourceStream=env.addSource(new StartLogSource())
                .process(new ProcessFunction<StartLog, StartLog>() {
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        jedis= RedisUtil.connectRedis(Property.getStrValue("redis.host"));
                        if(jedis!=null){
                            System.out.println("jedis连接成功"+jedis);
                        }
                        simpleDateFormat=new SimpleDateFormat("yyyy-MM-dd");
                    }
                    private Jedis jedis;
                    private SimpleDateFormat simpleDateFormat;
                    @Override
                    public void processElement(StartLog value, Context ctx, Collector<StartLog> out) throws Exception {
//                        look up / update today's de-duplication set in Redis
                        Long userId=value.getUserId();
                        Long time=value.getTs();
//                        build the key from the formatted date
                        Date date=new Date(time);
                        String dauKey="flinkdau"+simpleDateFormat.format(date);
                        Long flag=jedis.sadd(dauKey, String.valueOf(userId));
//                        expire the set after one day; note: use the same formatted key as sadd
                        if(jedis.ttl(dauKey)==-1L)
                        {
                            jedis.expire(dauKey, 3600 * 24);
                        }
                        if(flag==1){
                            out.collect(value);
                        }
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        jedis.close();
                    }
                });
//        sourceStream.print();

//        httppost
//elasticsearchSinkFunction
        ElasticsearchSinkFunction<StartLog> elasticsearchSinkFunction=new ElasticsearchSinkFunction<StartLog>() {
            @Override
            public void process(StartLog element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Long.valueOf(element.getUserId()).toString());
                result.put("entry",element.getEntry());
                result.put("ts",Long.valueOf(element.getTs()).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkdaustartlog").type("startLog").source(result).id(Long.valueOf(element.getUserId()).toString());
                indexer.add(indexRequest);
            }
        };
        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<StartLog> builder=new ElasticsearchSink.Builder<StartLog>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        sourceStream.addSink(builder.build());
        sourceStream.print();
        env.execute("flinkdau");
    }
}

Number of First-Order Users Today

In e-commerce, the requirement is to record which users placed their first order today and how many there are. Non-first-order users must be filtered out (deduplicated). Why not use Redis for this?

On a content platform, the equivalent metric is the number of users making their first purchase today.

Because the first-order flag has to be kept permanently, Redis is not a good fit; HBase is used instead to record whether a user has ordered before.

Create the Elasticsearch index:
PUT /newpurchaseuser
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "productId":{
          "type":"keyword"
         },
         "time":{
          "type":"date"
         }
      }
    }
}
# Create the flag table with Phoenix (the table name must match the one queried in the code, i.e. npuser)
create table npuser(userid varchar not null primary key, flag varchar) SALT_BUCKETS=16;
upsert into npuser values('101','1')
# Check whether this is the user's first order; if so, keep the record and write the flag to HBase via Phoenix.
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import com.demo.util.Property;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.*;
import java.util.concurrent.Executors;

public class NewPurchaseUser {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env= StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);
        SingleOutputStreamOperator<LogEntity> sourceStream=env.addSource(new ClickSource())
//                filter: keep only first-order users
                .process(new ProcessFunction<LogEntity, LogEntity>() {
                    private Connection conn;
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
                        String url = "jdbc:phoenix:hbase,hbase1,hbase2:2181";
                        conn = DriverManager.getConnection(url);
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        conn.close();
                    }

                    @Override
                    public void processElement(LogEntity value, Context ctx, Collector<LogEntity> out) throws Exception {
                        int userId=value.getUserId();
//                        check whether the user already has an order recorded
                        Statement statement=conn.createStatement();
                        String sql="select userid from npuser where userid ='"+userId+"'";
                        ResultSet resultSet=statement.executeQuery(sql);

//                        first order: record the flag and emit the record
                        if(!resultSet.next()){
                            Statement insertStatement=conn.createStatement();
                            String insertsql="upsert into npuser values('"+userId+"','"+"1')";
                            System.out.println(insertsql);
                            insertStatement.execute(insertsql);
                            conn.commit();
                            out.collect(value);
                        }
                    }

                });

        ElasticsearchSinkFunction<LogEntity> elasticsearchSinkFunction=new ElasticsearchSinkFunction<LogEntity>() {
            @Override
            public void process(LogEntity element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Integer.valueOf(element.getUserId()).toString());
                result.put("productId",Integer.valueOf(element.getProductId()).toString());
                result.put("time",Long.valueOf(element.getTime()).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("newpurchaseuser").source(result).id(Integer.valueOf(element.getUserId()).toString());
                indexer.add(indexRequest);
            }
        };

        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost(new Property().getElasProperties().getProperty("host"),9201,"http"));
        ElasticsearchSink.Builder<LogEntity> builder=new ElasticsearchSink.Builder<LogEntity>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);

        sourceStream.addSink(builder.build());
        sourceStream.print();
        env.execute("npuser");
    }
}

UV PV

UV (Unique Visitor): the number of distinct users who viewed the page, i.e. the independent-visitor count.

PV (Page View): the number of views of the page.

PV/UV is the average number of visits per user, i.e. how many times each user views the page on average, which to some extent reflects user stickiness.

UV and PV are again frequency statistics, but over pages; UV additionally requires deduplication.

Create the Elasticsearch index:
PUT /uvpv
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "uvpv":{
          "type":"keyword"
         },
         "productId":{
          "type":"keyword"
         },
         "ts":{
          "type":"date"
         }
      }
    }
}
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.flink.util.Collector;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.*;

public class UVPV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<Integer,Double>> stream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                        .withTimestampAssigner(
                                new SerializableTimestampAssigner<LogEntity>() {
                                    @Override
                                    public long extractTimestamp(LogEntity element, long recordTimestamp) {
                                        return element.getTime();
                                    }
                                }
                        )
                )
                .keyBy(LogEntity-> LogEntity.getProductId())
                .window( SlidingEventTimeWindows.of(Time.seconds(60),Time.seconds(10)))
                .aggregate(new AggregateFunction<LogEntity, Tuple2<HashSet<String>, Long>, Double>() {
                    @Override
                    public Tuple2<HashSet<String>, Long> createAccumulator() {
                        return Tuple2.of(new HashSet<String>(), 0L);
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> add(LogEntity value, Tuple2<HashSet<String>, Long> accumulator) {
                        accumulator.f0.add(Integer.valueOf(value.getUserId()).toString());
                        return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
                    }

                    @Override
                    public Double getResult(Tuple2<HashSet<String>, Long> accumulator) {
                        return (double) accumulator.f1 / accumulator.f0.size();
                    }

                    @Override
                    public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
                        // merge is only called for merging windows (e.g. session windows); combine both accumulators anyway
                        a.f0.addAll(b.f0);
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                }, new WindowFunction<Double, Tuple2<Integer,Double>, Integer, TimeWindow>() {
                    @Override
                    public void apply(Integer integer, TimeWindow window, Iterable<Double> input, Collector<Tuple2<Integer,Double>> out) throws Exception {
                        out.collect(Tuple2.of(integer,input.iterator().next()));
                    }
                });

        ElasticsearchSinkFunction<Tuple2<Integer,Double>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple2<Integer,Double>>() {
            @Override
            public void process(Tuple2<Integer,Double> element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("productId",Integer.valueOf(element.f0).toString());
                result.put("pvuv",Double.valueOf(element.f1).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkpvuv").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
                indexer.add(indexRequest);
            }
        };

        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<Tuple2<Integer,Double>> builder=new ElasticsearchSink.Builder<Tuple2<Integer,Double>>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        stream.print();
        stream.addSink(builder.build());
        env.execute("UVPV");
    }
}

# Flink SQL version

CEP: Consecutive Login Failures

Next, consider a concrete requirement: monitor user behavior and emit an alert when a user fails to log in three times in a row. This is clearly complex-event detection, which can be implemented with Flink CEP.

PUT /cepfailbehavior
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "first":{
          "type":"date"
         },
         "second":{
          "type":"date"
         },
         "third":{
          "type":"date"
         }
      }
    }
}
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FailBehavior {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        System.setProperty("HADOOP_USER_NAME", "root");
        System.setProperty("user.name", "root");

        env.setParallelism(4);
//        checkpoint
        env.enableCheckpointing(1000);
        CheckpointConfig config=env.getCheckpointConfig();
        config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        config.setMinPauseBetweenCheckpoints(500);
        config.setCheckpointTimeout(60000);
        config.setMaxConcurrentCheckpoints(1);
        config.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        config.enableUnalignedCheckpoints();
        config.setCheckpointStorage("hdfs://hbase:9000/flink/checkpoints");

        DataStream<LogEntity> sourceStream=env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LogEntity>forMonotonousTimestamps()
                                .withTimestampAssigner(
                                        new SerializableTimestampAssigner<LogEntity>() {
                                            @Override
                                            public long extractTimestamp(LogEntity value, long l)
                                            {
                                                return value.getTime();
                                            }
                                        }
                                )
                )
                .keyBy(LogEntity->LogEntity.getUserId());
//        sourceStream.print();
        Pattern<LogEntity,LogEntity> pattern=Pattern
                .<LogEntity>begin("first")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("1");
                    }
                })
                .next("second")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("1");

                    }
                })
                .next("third")
                .where(new SimpleCondition<LogEntity>() {
                    @Override
                    public boolean filter(LogEntity value) throws Exception {
                        return value.getAction().equals("2");
                    }
                });
        PatternStream<LogEntity> patternStream=CEP.pattern(sourceStream,pattern);
        DataStream<Tuple4<Integer,Long,Long,Long>> stream=patternStream.select(new PatternSelectFunction<LogEntity, Tuple4<Integer,Long,Long,Long>>() {
            @Override
            public Tuple4<Integer,Long,Long,Long> select(Map<String, List<LogEntity>> map) throws Exception {
                LogEntity first=map.get("first").get(0);
                LogEntity second=map.get("second").get(0);
                LogEntity third=map.get("third").get(0);
                return Tuple4.of(first.getUserId(),first.getTime(),second.getTime(),third.getTime());
            }
        });
        stream.print("warning");
//        sink function
//        http host
//        es sink builder
        ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>> elasticsearchSinkFunction=new ElasticsearchSinkFunction<Tuple4<Integer,Long,Long,Long>>() {
            @Override
            public void process(Tuple4<Integer,Long,Long,Long> element, RuntimeContext ctx, RequestIndexer indexer) {
                Map<String,String> result=new HashMap<>();
                result.put("userId",Integer.valueOf(element.f0).toString());
                result.put("first",Long.valueOf(element.f1).toString());
                result.put("second",Long.valueOf(element.f2).toString());
                result.put("third",Long.valueOf(element.f3).toString());
                IndexRequest indexRequest= Requests.indexRequest().index("flinkwarning").type("logEntity").source(result).id(Integer.valueOf(element.f0).toString());
                indexer.add(indexRequest);
            }
        };
        List<HttpHost> httpPosts=new ArrayList<>();
        httpPosts.add(new HttpHost("hbase",9200,"http"));
        ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>> builder=new ElasticsearchSink.Builder<Tuple4<Integer,Long,Long,Long>>(httpPosts, elasticsearchSinkFunction);
        builder.setBulkFlushMaxActions(1);
        stream.addSink(builder.build());

        env.execute("FailBehavior");
    }
}

Checkpoints

Checkpoints are enabled in the CEP example above.

1. In scenarios where the streaming job must be stopped, complete a checkpoint save. 2. When recovering from a checkpoint, make sure the job is restored to the correct in-memory state and the correct state of the external storage systems. A hedged example of restarting from a retained checkpoint is shown below.
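
Because the CEP job retains externalized checkpoints on cancellation, it can be restarted from one with bin/flink run -s; the job-id and chk-n parts of the path below are placeholders, not values from this project:

# restart from a retained checkpoint on HDFS (path components are placeholders)
bin/flink run -s hdfs://hbase:9000/flink/checkpoints/<job-id>/chk-<n> -c com.demo.task.practice.FailBehavior /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar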

Filtering Keywords with Broadcast State

Broadcast state provides a global configuration mechanism: settings that may need to change at runtime are broadcast to all parallel tasks. In this example the broadcast configuration defines the job's processing rule.

https://blog.csdn.net/wangpei1949/article/details/99698978

The configuration is read from MySQL periodically and broadcast to the stream.

PUT /boradcastf
{
  "settings": { 
  "number_of_shards": 3
 },
  "mappings": {
      "properties":{
         "userId":{
          "type":"keyword"
         },
         "first":{
          "type":"date"
         },
         "second":{
          "type":"date"
         },
         "third":{
          "type":"date"
         }
      }
    }
}
# Main class
package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;


public class KeywordsFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<LogEntity> sourceStream=env.addSource(new ClickSource());
        DataStream<String> configStream=env.addSource(new MysqlSource("hbase",3306,"con","root","root",1));

        MapStateDescriptor<String,String> configStateDescriptor=new MapStateDescriptor<String, String>("config", Types.STRING,Types.STRING);
        BroadcastStream<String> broadcastConfigStream=configStream.broadcast(configStateDescriptor);
        BroadcastConnectedStream<LogEntity,String> broadcastConnectedStream=sourceStream.connect(broadcastConfigStream);

        DataStream<LogEntity> filterStream=broadcastConnectedStream.process(new BroadcastProcessFunction<LogEntity,String,LogEntity>(){

            @Override
            public void processElement(LogEntity value, ReadOnlyContext ctx, Collector<LogEntity> out) throws Exception {
                // read the current keyword from the broadcast state; drop everything until a rule has arrived
                String keyword=ctx.getBroadcastState(configStateDescriptor).get("keyword");
                if(keyword!=null && value.getProductId()==Integer.parseInt(keyword))
                {
                    out.collect(value);
                }
            }

            @Override
            public void processBroadcastElement(String value, Context ctx, Collector<LogEntity> out) throws Exception {
                // update the broadcast state so every parallel task sees the new keyword
                ctx.getBroadcastState(configStateDescriptor).put("keyword",value);
            }
        });
        filterStream.print();
        env.execute("keywordFilter");
    }
}

# Custom MysqlSource
package com.demo.task.practice;


import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MysqlSource extends RichSourceFunction<String> {
    private boolean running=true;
    private Connection connection;

    private String host;
    private Integer port;
    private String db;
    private String user;
    private String passwd;
    private Integer secondInterval;
    private PreparedStatement preparedStatement;

    public MysqlSource(String host, Integer port, String db, String user, String passwd, Integer secondInterval) {
        this.host = host;
        this.port = port;
        this.db = db;
        this.user = user;
        this.passwd = passwd;
        this.secondInterval = secondInterval;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        Class.forName("com.mysql.jdbc.Driver");
        connection= DriverManager.getConnection("jdbc:mysql://"+host+":"+port+"/"+db+"?useUnicode=true&characterEncoding=UTF-8",user,passwd);
        String sql="select keyword from config";
        preparedStatement=connection.prepareStatement(sql);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running){
            ResultSet resultset=preparedStatement.executeQuery();
            String keyword;
            while (resultset.next()){
//                action1=resultset.getString("action1");
//                action2=resultset.getString("action2");
//                action3=resultset.getString("action3");
                keyword=resultset.getString("keyword");
                ctx.collect(keyword);
            }
            Thread.sleep(1000*secondInterval);
        }
    }

    @Override
    public void cancel() {
        running=false;
    }

    @Override
    public void close() throws Exception {
        super.close();
        if(connection!=null){
            connection.close();
        }
        if(preparedStatement!=null){
            preparedStatement.close();
        }
    }
}

Savepoints

Once checkpointing is enabled, you configure the checkpoint interval, where checkpoints are stored (JVM heap or external storage such as HDFS), and exactly-once mode. Checkpoints are managed by Flink itself: when the trigger time arrives, the program's in-memory state is saved automatically, and on failure the saved checkpoint is used to restore the application state and continue execution, providing fault tolerance.

Savepoints, in contrast, must be triggered and managed by the user, including when to take them and how to use them. We can take savepoints of an application according to a plan and later restore the application from them. To allow direct restoration, a savepoint stores some extra metadata compared with a checkpoint, such as operator IDs.

Common use cases:

1. Version management and archiving

2. Upgrading the Flink version

3. Updating the application

4. Changing the parallelism

5. Pausing the application

Note that compatibility after changing the program is conditional: the topology of the state and its data types must not change. To use savepoints, operator IDs (uid) should be set.

package com.demo.task.practice;

import com.demo.domain.LogEntity;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SavePoint {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env=StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new ClickSource())
                .uid("source-id")
                .map(LogEntity->LogEntity)
                .uid("mapper-id")
                .print();
        env.execute("savepoint");
    }
}
bin/flink run -m hbase1:8081 -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar
bin/flink savepoint jobId  file:///opt/module/flink/savepoints
bin/flink savepoint jobId  hdfs://hbase:9000/flink/savepoints

# Restore from a savepoint
bin/flink run -s hdfs://hbase:9000/flink/savepoints -c com.demo.task.practice.SavePoint /opt/software/jars/flink-2-hbase-1.0-SNAPSHOT.jar

Project Challenges

Changing configuration in real time

Define a custom SourceFunction that reads the configuration source and broadcasts it.

The broadcast-connected stream reads the configuration from it and updates the broadcast state; the processing stream reads that state, so the rule can be changed dynamically.

Stopping a job gracefully

The job watches a file path; when the file appears, the job stops itself. A minimal sketch of this idea follows.
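
A minimal sketch of that idea; both the marker-file path and the emitted records below are illustrative assumptions, not the project's actual code:

// Hedged sketch: a source that exits its run() loop once a marker file appears,
// letting the whole job drain and finish gracefully.
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.File;

public class StoppableSource implements SourceFunction<String> {
    private volatile boolean running = true;
    private static final String STOP_FLAG_PATH = "/tmp/stop-flink-job";   // assumed marker file

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running && !new File(STOP_FLAG_PATH).exists()) {
            ctx.collect("log-" + System.currentTimeMillis());             // placeholder record
            Thread.sleep(100);
        }
        // returning from run() ends the source; downstream operators then finish and the job stops
    }

    @Override
    public void cancel() {
        running = false;
    }
}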
