Big Data Compute Engines: A Flink Basics Tutorial

I. Flink Basics

1. What Is Flink? Data Model, Architecture, Ecosystem

Official definition:
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Flink processes two kinds of data sets:
(1) Unbounded data stream: a data set with no defined end —> stream / real-time processing —> Flink DataStream API
(2) Bounded data stream: a data set with a defined start and end —> offline / batch processing —> Flink DataSet API
Flink architecture:
(figure: Flink architecture diagram)
Flink's architecture is very similar to that of Spark, Storm, and the other engines discussed earlier: it follows a master/worker design. The master node is the JobManager and the worker nodes are TaskManagers. Each worker node runs a process, and the tasks inside that process are the smallest unit of execution.

Flink ecosystem:
(figure: Flink ecosystem diagram)
As the figure shows, Flink covers both offline (batch) and real-time (stream) computation.

2. Deploying Flink

(1) Standalone mode

tar -zxvf flink-1.11.0-bin-scala_2.12.tgz -C ~/training/
Core configuration file: conf/flink-conf.yaml
Web UI: port 8081

  • Pseudo-distributed environment
    Just run it directly; Flink's default configuration already works:
    bin/start-cluster.sh

  • Fully distributed environment (a configuration sketch follows below)
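For the fully distributed setup, the original does not give details; as a rough sketch (hostnames and memory sizes are placeholders, not from the original), the key changes are in conf/flink-conf.yaml plus the worker list (conf/slaves in older releases, conf/workers in newer ones):

# conf/flink-conf.yaml (key entries only; values are illustrative)
jobmanager.rpc.address: bigdata111        # host that runs the JobManager
taskmanager.numberOfTaskSlots: 2          # task slots per TaskManager

# conf/slaves (one TaskManager host per line)
bigdata112
bigdata113

After copying the configured installation to every node, bin/start-cluster.sh starts the JobManager on the master and a TaskManager on each worker.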

(2) Flink on YARN

Flink jobs are submitted to YARN for execution. There are two modes; we normally use mode two.
Note: to use Hadoop you must add flink-shaded-hadoop-2-uber-2.8.3-10.0.jar to the lib directory, because starting with Flink 1.10 the Hadoop dependencies are no longer bundled.

(Mode one) Centrally managed memory (yarn-session)
YARN initializes one cluster with a fixed amount of resources, and every job is submitted into this Flink yarn-session. No matter how many jobs there are, they all share the resources requested from YARN.

bin/yarn-session.sh -n 2  -jm 1024 -tm 1024 -d

(Mode two) Per-job memory management
Every job submitted to YARN creates a new Flink cluster. Jobs are independent of each other, do not interfere, and are easy to manage; when a job finishes, its cluster disappears.

bin/flink run -m yarn-cluster -p 1 -yjm 1024 -ytm 1024 examples/batch/WordCount.jar
Note: -p 1 sets the job's parallelism.


(3) HA mode: based on ZooKeeper (see the configuration sketch below)
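The original does not show the HA configuration; as a hedged sketch (quorum hosts, HDFS path, and cluster id are placeholders), ZooKeeper-based HA is typically configured in conf/flink-conf.yaml and conf/masters:

# conf/flink-conf.yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: bigdata111:2181,bigdata112:2181,bigdata113:2181
high-availability.storageDir: hdfs://bigdata111:9000/flink/ha
high-availability.cluster-id: /flink-cluster

# conf/masters (candidate JobManagers, one per line)
bigdata111:8081
bigdata112:8081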

3. Running a Flink Job: WordCount

Flink's examples directory ships with several sample jobs; they are worth a look.
(1) Offline computing (batch processing)

bin/flink run examples/batch/WordCount.jar -input hdfs://bigdata111:9000/input/data.txt -output hdfs://bigdata111:9000/flink/wc

(2) Stream processing (real-time)

bin/flink run examples/streaming/SocketWindowWordCount.jar --port 1234

4. Writing Your Own Flink Program: WordCount (Java)

Add the dependencies:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-streaming-java_2.11</artifactId>
	<version>1.11.0</version>
	<!--<scope>provided</scope>-->
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-clients_2.11</artifactId>
	<version>1.11.0</version>
</dependency>	

(1) Offline computing (batch processing)

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;


public class WordCountBatchExample {
    public static void main(String[] args) throws Exception{
        ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
        //create a DataSet representing the data to process
        DataSource<String> data = env.fromElements("i love beijing",
                "i love china", "beijing is the capital of china");

        data.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new Tuple2<String, Integer>(word,1));
                }
            }
        }).groupBy(0).sum(1).print();

    }
}

(2) Stream processing (real-time)

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStreamExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment sen = StreamExecutionEnvironment.getExecutionEnvironment();

        //receive the input
        DataStreamSource<String> source = sen.socketTextStream("bigdata111", 1234);

        source.flatMap(new FlatMapFunction<String, WordCount>() {
            public void flatMap(String s, Collector<WordCount> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new WordCount(word,1));
                }
            }
        }).keyBy("word").sum("count").print().setParallelism(1);
        

        sen.execute("WordCountStreamExample");

    }
}

On the Linux host bigdata111, start: nc -l 1234
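The WordCount class used by the streaming example above (and by the Redis sink example later) is not shown in the original. A minimal sketch of the POJO it appears to assume, with public fields named word and count and a no-argument constructor so that keyBy("word") and sum("count") can access them by name:

public class WordCount {
    public String word;   // the word
    public long count;    // its count

    // Flink POJO types need a public no-argument constructor
    public WordCount() {}

    public WordCount(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordCount [word=" + word + ", count=" + count + "]";
    }
}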

5. Comparison: Technical Characteristics of Storm, Spark Streaming, and Flink

(figure: comparison table of Storm, Spark Streaming, and Flink)

II. Flink DataSet API ----> Offline Computing (Batch Processing)

Reading from and writing to MySQL

Add the dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-jdbc_2.11</artifactId>
    <version>1.9.1</version>
</dependency>

Java implementation:

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //read from MySQL
        DataSource<Row> dataSource = env.createInput(JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("com.mysql.jdbc.Driver")
                .setDBUrl("jdbc:mysql://localhost:3306")
                .setUsername("root")
                .setPassword("root")
                .setQuery("select id,dq from flink.ajxx_xs ")
                .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
                .finish());
 
        final BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
        tableEnv.registerDataSet("ods_tb01", dataSource);
 
        Table query = tableEnv.sqlQuery("select * from ods_tb01");
 
        DataSet<Row> result = tableEnv.toDataSet(query, Row.class);
 
        result.print();
 
        result.output(JDBCOutputFormat.buildJDBCOutputFormat()
                .setDrivername("com.mysql.jdbc.Driver")
                .setDBUrl("jdbc:mysql://localhost:3306")
                .setUsername("root")
                .setPassword("root")
                .setQuery("insert into flink.ajxx_xs2 (id,dq) values (?,?)")
                .setSqlTypes(new int[]{Types.INTEGER, Types.NCHAR})
                .finish());
 
        env.execute("flink-test");

Operator reference (operator — description):

  • map — takes one element and returns one element; cleaning and transformation can happen in between
  • flatMap — takes one element and returns zero, one, or more elements
  • map vs. flatMap — map produces exactly one output per input; flatMap can produce several outputs per input
  • mapPartition — like map, but processes a whole partition of data at a time
  • filter — evaluates each element and keeps the ones that satisfy the condition
  • reduce — aggregates the data: combines the current element with the previous reduce result and returns a new value
  • aggregate — built-in aggregations such as sum, max, min
  • distinct — returns the elements after de-duplication
  • join — inner join
  • outerJoin — outer join
  • cross — Cartesian product of two data sets
  • union — union of two data sets; the data types must match
  • first-n — returns the first N elements of the data set
  • sortPartition — sorts all partitions of the data set locally; chain sortPartition() calls to sort on multiple fields
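The numbered examples below cover most of these operators; reduce and the aggregate shortcuts are not demonstrated there, so here is a minimal hedged sketch (the data is illustrative) of grouping tuples and summing them both with an explicit ReduceFunction and with the built-in sum aggregation:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ReduceAggregateDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (department number, salary)
        DataSet<Tuple2<Integer, Integer>> data = env.fromElements(
                new Tuple2<Integer, Integer>(10, 1000),
                new Tuple2<Integer, Integer>(20, 2000),
                new Tuple2<Integer, Integer>(10, 1800));

        // reduce: combine the current element with the previous reduce result, summing salaries per department
        data.groupBy(0).reduce(new ReduceFunction<Tuple2<Integer, Integer>>() {
            public Tuple2<Integer, Integer> reduce(Tuple2<Integer, Integer> v1,
                                                   Tuple2<Integer, Integer> v2) throws Exception {
                return new Tuple2<Integer, Integer>(v1.f0, v1.f1 + v2.f1);
            }
        }).print();

        // aggregate: the built-in sum/min/max shortcuts express the same kind of computation
        data.groupBy(0).sum(1).print();
    }
}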

(1) Map, FlatMap, and MapPartition — note the differences

public class FlinkDemo1 {
    public static void main(String[] args) throws Exception {

        // prepare the data
        ArrayList<String> data = new ArrayList<String>();
        data.add("I love Beijing");
        data.add("I love China");
        data.add("Beijing is the capital of China");
        
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> dataSource = env.fromCollection(data);

        /*
        * map: takes one element and returns one element; cleaning/transformation happens in between
        *
        * */
        MapOperator<String, List<String>> map = dataSource.map(new MapFunction<String, List<String>>() {
            public List<String> map(String value) throws Exception {
                String[] words = value.split(" ");

                //create a List
                List<String> list = new ArrayList<String>();
                for (String w : words) {
                    list.add(w);
                }
                list.add("-----------");
                return list;
            }
        });
        map.print();
        System.out.println("**************************");

        /*
        * flatMap: takes one element and returns zero, one, or more elements
        *
        * */
        dataSource.flatMap(new FlatMapFunction<String, String>() {
            public void flatMap(String value, Collector<String> out) throws Exception {
                String[] words = value.split(" ");
                for(String w:words) {
                    out.collect(w);
                }
            }
        }).print();
        System.out.println("************************");

        /*
        * mapPartition
        * receives all elements of a partition at once; often used when something expensive (e.g. a database connection) should be done once per partition
        * */
        dataSource.mapPartition(new MapPartitionFunction<String, String>() {
            public void mapPartition(Iterable<String> iterable, Collector<String> out) throws Exception {
                Iterator<String> ite = iterable.iterator();
                while (ite.hasNext()){
                    String next = ite.next();
                    String[] s = next.split(" ");
                    for (String s1 : s) {
                        out.collect(s1);
                    }

                }
            }
        }).print();

    }
}

(2) Filter and Distinct

public class FlinkDemo2 {
    public static void main(String[] args) throws Exception {
        // prepare the data
        ArrayList<String> data = new ArrayList<String>();
        data.add("I love Beijing");
        data.add("I love China");
        data.add("Beijing is the capital of China");
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSource<String> dataSource = env.fromCollection(data);

        final FlatMapOperator<String, String> word = dataSource.flatMap(new FlatMapFunction<String, String>() {
            public void flatMap(String s, Collector<String> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(word);
                }
            }
        });
        // keep only words whose length is at least 3
        FilterOperator<String> filterword = word.filter(new FilterFunction<String>() {
            public boolean filter(String s) throws Exception {
                return s.length() >= 3;
            }
        });

        filterword.distinct().print();


    }
}

(3) Join (inner join)

Inner join: joins two tables and returns only the rows that satisfy the match condition (primary key to foreign key).

public class FlinkDemo3 {
    public static void main(String[] args) throws Exception {
        //create the first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(2,"Mike"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //create the second table: user cities (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(3,"北京"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);


        //perform the join
        //where(0).equalTo(0) means: join on the first column of the first table and the first column of the second table
        //equivalent to: where table1.userID = table2.userID
        table1.join(table2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>,
                Tuple2<Integer, String>,
                Tuple2<String, String>>() {
            public Tuple2<String, String> join(Tuple2<Integer, String> t1, Tuple2<Integer, String> t2) throws Exception {
                return new Tuple2<String, String>(t1.f1,t2.f1);
            }
        }).print();
    }
}

(4) Outer Joins and Full Join

Outer join: unlike an inner join, rows that fail the match condition (primary key to foreign key) are also returned.

public class FlinkDemo4 {
    public static void main(String[] args) throws Exception {
        //create the first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //create the second table: user cities (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);
        
        // left outer join
        System.out.println("Left outer join:");
        table1.leftOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        return right == null ? new Tuple3<Integer, String, String>(left.f0 , left.f1 , null) :
                                new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
                    }
                }).print();

        // right outer join
        System.out.println("Right outer join:");
        table1.rightOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer,String,String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        return left == null ? new Tuple3<Integer, String, String>(right.f0 , null , right.f1) :
                                new Tuple3<Integer, String, String>(left.f0,left.f1,right.f1);
                    }
                }).print();

        // full outer join
        System.out.println("Full outer join:");
        table1.fullOuterJoin(table2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> left, Tuple2<Integer, String> right) throws Exception {
                        if(left == null) {
                            return new Tuple3<Integer, String, String>(right.f0,null,right.f1);
                        }else if(right == null) {
                            return new Tuple3<Integer, String, String>(left.f0,left.f1,null);
                        }else {
                            return new Tuple3<Integer, String, String>(right.f0,left.f1,right.f1);
                        }
                    }
                }).print();

    }
}

(5) Cartesian Product (cross)

public class FlinkDemo5 {
    public static void main(String[] args) throws Exception {
        //create the first table: users (user ID, name)
        ArrayList<Tuple2<Integer, String>> data1 = new ArrayList<Tuple2<Integer,String>>();
        data1.add(new Tuple2<Integer, String>(1,"Tom"));
        data1.add(new Tuple2<Integer, String>(2,"Mike"));
        data1.add(new Tuple2<Integer, String>(3,"Mary"));
        data1.add(new Tuple2<Integer, String>(4,"Jone"));

        //create the second table: user cities (user ID, city)
        ArrayList<Tuple2<Integer, String>> data2 = new ArrayList<Tuple2<Integer,String>>();
        data2.add(new Tuple2<Integer, String>(1,"北京"));
        data2.add(new Tuple2<Integer, String>(2,"上海"));
        data2.add(new Tuple2<Integer, String>(3,"北京"));
        data2.add(new Tuple2<Integer, String>(4,"深圳"));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> table1 = env.fromCollection(data1);
        DataSet<Tuple2<Integer, String>> table2 = env.fromCollection(data2);

        // Cartesian product
        table1.cross(table2).print();


    }
}

(6) First-N and sortPartition (SQL: Top-N)

Note: be careful to import the right classes (it is easy to pick the wrong Order/Tuple imports).

public class FlinkDemo6 {
    public static void main(String[] args) throws Exception {
        //Tuple3: name, salary, department number
        ArrayList<Tuple3<String,Integer,Integer>> data1 = new ArrayList<Tuple3<String,Integer,Integer>>();

        data1.add(new Tuple3<String,Integer,Integer>("Tom",1000,10));
        data1.add(new Tuple3<String,Integer,Integer>("Mary",2000,20));
        data1.add(new Tuple3<String,Integer,Integer>("Mike",1500,30));
        data1.add(new Tuple3<String,Integer,Integer>("Jone",1800,10));

        // create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //build a DataSet
        DataSet<Tuple3<String,Integer,Integer>> table = env.fromCollection(data1);

        //take the first three records
        table.first(3).print();

        System.out.println("********************");
        //sort by department number first, then by salary
        table.sortPartition(2, Order.ASCENDING).sortPartition(1,Order.DESCENDING).print();


    }
}

III. Flink DataStream API ----> Stream Processing (Real-Time)

1. Data Sources

(1) Custom data sources, implementing one of these interfaces:

SourceFunction: parallelism of 1
ParallelSourceFunction: multiple parallelism

Create the MySingleDataSourceTest class for testing:

public class MySingleDataSourceTest  {
    public static void main(String[] args) throws Exception {

        // create an execution environment
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //a data source with a parallelism of 1
        DataStreamSource<Integer> source = sEnv.addSource( new MySingleDataSource());
        DataStream<Integer> step1 =  source.map(new MapFunction<Integer, Integer>() {

            public Integer map(Integer value) throws Exception {
                System.out.println("收到的数据是:"+ value);
                return value*10;
            }
        });

        //sum the values every two seconds
        step1.timeWindowAll(Time.seconds(2)).sum(0).setParallelism(1).print();

        sEnv.execute("MySingleDataSourceTest");
    }
}

Create the MySingleDataSource class, which implements the SourceFunction interface to provide a custom data source.

public class MySingleDataSource implements SourceFunction<Integer> {

    //counter
    private Integer count = 1;
    //switch
    private boolean isRunning = true;

    public void run(SourceContext<Integer> ctx) throws Exception {

        // how the data is produced
        while(isRunning) {
            //emit the value
            ctx.collect(count);

            //wait one second
            Thread.sleep(1000);
            //increment the counter
            count ++;
        }

    }

    public void cancel() {
        // how data production is stopped
        isRunning = false;
    }
}
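The test class above uses the single-parallelism source. A multi-parallelism source only differs in the interface it implements; a minimal sketch (not in the original), assuming the same counting logic as MySingleDataSource:

import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

public class MyParallelDataSource implements ParallelSourceFunction<Integer> {

    private Integer count = 1;        // counter
    private boolean isRunning = true; // switch

    public void run(SourceContext<Integer> ctx) throws Exception {
        // every parallel instance of this source runs its own copy of this loop
        while (isRunning) {
            ctx.collect(count);
            Thread.sleep(1000);
            count++;
        }
    }

    public void cancel() {
        // stop producing data
        isRunning = false;
    }
}

With this source, sEnv.addSource(new MyParallelDataSource()).setParallelism(2) runs two independent instances, whereas setting a parallelism greater than 1 on a plain SourceFunction is rejected at runtime.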

(2) Kafka data source

The Java code:

public class FlinkStreamWithKafka {
    public static void main(String[] args) throws Exception {
        // create a Kafka stream
        Properties props = new Properties();
        // broker address
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
        // consumer group
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");

        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1  = sEnv.addSource(source);
        source1.print();

        sEnv.execute("FlinkStreamWithKafka");
    }
}

2. Transformations

Operator reference (operator — description):

  • map — takes one element and returns one element; cleaning/transformation happens in between
  • flatMap — takes one element and returns zero, one, or more elements
  • filter — evaluates each element and keeps the ones that satisfy the condition
  • keyBy — groups by the given key; records with the same key go to the same partition
  • reduce — aggregates the data: combines the current element with the previous reduce result and returns a new value
  • aggregations — sum(), min(), max(), etc.
  • window — covered in detail later
  • union — union of the streams; the data types must match
  • connect — similar to union, but connects exactly two streams; the two streams may have different data types and are processed with different functions
  • coMap / coFlatMap — the functions used on ConnectedStreams, analogous to map and flatMap
  • split — splits one stream into several by tagging each record according to a rule
  • select — used together with split: selects the tagged sub-streams to form a new stream
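keyBy and reduce are listed above but not covered by the numbered examples below; a minimal hedged sketch, reusing the MySingleDataSource from section 1 and keying the numbers by parity (the parity key is illustrative):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByReduceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        sEnv.addSource(new MySingleDataSource())
            // turn each number into (parity, value) so there is a key to group on
            .map(new MapFunction<Integer, Tuple2<Integer, Integer>>() {
                public Tuple2<Integer, Integer> map(Integer value) throws Exception {
                    return new Tuple2<Integer, Integer>(value % 2, value);
                }
            })
            .keyBy(0) // records with the same key go to the same partition
            // reduce: combine the current element with the previous reduce result for this key
            .reduce(new ReduceFunction<Tuple2<Integer, Integer>>() {
                public Tuple2<Integer, Integer> reduce(Tuple2<Integer, Integer> v1,
                                                       Tuple2<Integer, Integer> v2) throws Exception {
                    return new Tuple2<Integer, Integer>(v1.f0, v1.f1 + v2.f1);
                }
            })
            .print().setParallelism(1);

        sEnv.execute("KeyByReduceDemo");
    }
}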

(1) union: merging two streams

public class FlinkDemo1 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //create two DataStream sources
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());
        DataStreamSource<Integer> source2 = sEnv.addSource(new MySingleDataSource());
        //union: the data of the two streams is merged into one stream
        DataStream<Integer> result = source1.union(source2);
        result.print().setParallelism(1);

        sEnv.execute("FlinkDemo1");
    }
}

(2)connect

public class FlinkDemo2 {
    public static void main(String[] args) throws Exception {
        // test with the parallelism-1 source created earlier
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

        //create two DataStream sources
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        DataStream<String> source2 = sEnv.addSource(new MySingleDataSource())
                .map(new MapFunction<Integer, String>() {
                    public String map(Integer value) throws Exception {
                        // convert the Integer to a String
                        return "String"+value;
                    }
                });


        // the two streams may carry different data types
        ConnectedStreams<Integer, String> connect = source1.connect(source2);
        // handle each data type separately and return corresponding results
        connect.map(new CoMapFunction<Integer, String, Object>() {
            public Object map1(Integer value) throws Exception {
                // process the first stream
                return "Processing the Integer stream: "+ value;
            }

            public Object map2(String value) throws Exception {
                // process the second stream
                return "Processing the String stream: "+ value;
            }
        }).print().setParallelism(1);


        sEnv.execute("FlinkDemo2");

    }
}

(3) split and select

public class FlinkDemo3 {
    public static void main(String[] args) throws Exception {
        // test with the parallelism-1 source created earlier
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        //separate the odd and even numbers in the source by tagging them
        SplitStream<Integer> split = source1.split(new OutputSelector<Integer>() {
            public Iterable<String> select(Integer value) {
                //a collection holding all tags for this record; one record may carry several tags
                ArrayList<String> selector = new ArrayList<String>();
                /*
                 * String: a tag attached to the record
                 * Iterable: a record may carry more than one tag
                 */

                if (value % 2 == 0) {
                    selector.add("even"); //even number
                } else {
                    selector.add("odd");  //odd number
                }

                return selector;

            }
        });
        //select all the odd numbers
        split.select("odd").print().setParallelism(1);

        sEnv.execute("FlinkDemo3");


    }
}

(4) Custom Partitioning

Flink has its own set of partitioning rules, but sometimes they are not what we want, so we can define our own (the MyPartitioner class used below is sketched after the example).

public class MyPartitionerTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Integer> source1 = sEnv.addSource(new MySingleDataSource());

        SingleOutputStreamOperator<Tuple1<Integer>> data  = source1.map(new MapFunction<Integer, Tuple1<Integer>>() {
            public Tuple1<Integer> map(Integer integer) throws Exception {
                return new Tuple1<Integer>(integer);
            }
        });
        DataStream<Tuple1<Integer>> partitioner = data.partitionCustom(new MyPartitioner(), 0);
        partitioner.map(new MapFunction<Tuple1<Integer>, Integer>() {
            public Integer map(Tuple1<Integer> value) throws Exception {
                //get the value
                Integer data = value.f0;
                long threadID = Thread.currentThread().getId();
                System.out.println("Thread ID: "+ threadID +"\t data: " + data);
                return data;
            }
        }).print().setParallelism(1);


        sEnv.execute("MyPartitionerTest");
    }
}
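The MyPartitioner class passed to partitionCustom above is not shown in the original. A minimal sketch of what it could look like; the actual partitioning rule is an assumption (here, keys are routed by whether they are even or odd):

import org.apache.flink.api.common.functions.Partitioner;

public class MyPartitioner implements Partitioner<Integer> {

    // numPartitions is the parallelism of the downstream operator
    public int partition(Integer key, int numPartitions) {
        // assumed rule: even keys go to partition 0, odd keys to partition 1 (if it exists)
        if (key % 2 == 0) {
            return 0;
        } else {
            return numPartitions > 1 ? 1 : 0;
        }
    }
}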


3. Data Sinks

(1) Writing data to Redis

Read data from the Kafka source, process it, and write the result to Redis.

public class FlinkStreamWithKafka {
    public static void main(String[] args) throws Exception {
        // create a Kafka stream
        Properties props = new Properties();
        // broker address
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.92.111:9093");
        // consumer group
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "mygroup1");

        FlinkKafkaConsumer<String> source = new FlinkKafkaConsumer<String>("mydemotopic1", new SimpleStringSchema(), props);

        StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source1  = sEnv.addSource(source);
        SingleOutputStreamOperator<WordCount> result = source1.flatMap(new FlatMapFunction<String, WordCount>() {
            public void flatMap(String s, Collector<WordCount> out) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    out.collect(new WordCount(word, 1));
                }
            }
        }).keyBy("word").sum("count");


        FlinkJedisPoolConfig conf = new  FlinkJedisPoolConfig.Builder()
                .setHost("192.168.92.111").setPort(6379).build();
        RedisSink<WordCount> redisSink = new RedisSink<WordCount>(conf , new MyRedisMapper());

        //write the result to Redis
        result.addSink(redisSink);

        sEnv.execute("FlinkStreamWithKafka");
    }
}

Create the MyRedisMapper class as the Redis mapper:

public class MyRedisMapper implements RedisMapper<WordCount> {
    public RedisCommandDescription getCommandDescription() {
        return new RedisCommandDescription(RedisCommand.HSET,"myflink");
    }

    public String getKeyFromData(WordCount wordCount) {
        return wordCount.word;
    }

    public String getValueFromData(WordCount wordCount) {
        return String.valueOf(wordCount.count);
    }
}

IV. Advanced Features

1. Distributed Cache: similar to a map-side join, improves performance

(figure: distributed cache)
The data is cached once on every TaskManager node instead of being loaded separately by every task, which saves memory and improves performance.
Note: the data is cached on the node, and every task running there can then use it.

Java implementation:

public class DistributedCacheDemo {
    public static void main(String[] args) throws Exception {
        // create the DataSet API execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //register the file to cache
        //the path can be on HDFS or local
        //if it is on HDFS, the HDFS dependencies must be on the classpath
        env.registerCachedFile("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\demo3\\data.txt"
                , "localfile");

        //run a simple computation
        //create a DataSet
        DataSet<Integer> source = env.fromElements(1,2,3,4,5,6,7,8,9,10);

        /*
         * a RichMapFunction is needed so that its open method can read the cached file during initialization
         */
        source.map(new RichMapFunction<Integer, String>() {

            String shareData ;
            /**
             * open
             * this method runs only once,
             * so it is the place to do initialization work,
             * such as reading the distributed cache (or a broadcast variable)
             */
            @Override
            public void open(Configuration parameters) throws Exception {
                // read the file from the distributed cache
                File localfile = this.getRuntimeContext().getDistributedCache().getFile("localfile");

                List<String> lines  = FileUtils.readLines(localfile);

                //take the first line as the shared data
                shareData = lines.get(0);


            }

            public String map(Integer integer) throws Exception {
                return shareData + integer;
            }
        }).print();

        // a batch (DataSet) job that ends in print() does not need an explicit execute() call
        // env.execute("DistributedCacheDemo");


    }
}

Looking at the printed output, you can see that every task shares the cached data.

2. Setting Parallelism

See the earlier sections; a short recap sketch follows.
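A hedged recap fragment (reusing the MySingleDataSource from the DataStream section): operator-level settings override the environment-level default, which in turn overrides flink-conf.yaml and the -p flag used at submission time.

StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

// environment level: default parallelism for every operator in this job
sEnv.setParallelism(2);

// operator level: overrides the environment default for this operator only
sEnv.addSource(new MySingleDataSource())
    .map(new MapFunction<Integer, Integer>() {
        public Integer map(Integer value) throws Exception {
            return value * 10;
        }
    }).setParallelism(1)
    .print();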

3. Broadcast Variables

Essentially the same idea as the distributed cache.
The difference:

  • distributed cache -------> a file
  • broadcast variable --------> a variable (a data set)

Java implementation:

public class BroadCastDemo {
    public static void main(String[] args) throws Exception {
        // create the DataSet API execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //the data to broadcast: name and age
        List<Tuple2<String, Integer>> people = new ArrayList<Tuple2<String,Integer>>();
        people.add(new Tuple2<String, Integer>("Tom",23));
        people.add(new Tuple2<String, Integer>("Mike",20));
        people.add(new Tuple2<String, Integer>("Jone",25));

        DataSet<Tuple2<String, Integer>> peopleData = env.fromCollection(people);

        //convert peopleData into a HashMap ----> the data to broadcast
        DataSet<HashMap<String,Integer>> broadCast = peopleData.map(new MapFunction<Tuple2<String,Integer>, HashMap<String,Integer>>() {

            public HashMap<String, Integer> map(Tuple2<String, Integer> value) throws Exception {
                HashMap<String, Integer> result = new HashMap<String, Integer>();
                result.put(value.f0,value.f1);
                return result;
            }
        });


        //look up the age (value) by the name (key)
        DataSet<String> source = env.fromElements("Tom","Mike","Jone");
        DataSet<String> result = source.map(new RichMapFunction<String, String>() {
            //a local variable holding the broadcast data
            HashMap<String,Integer> allMap = new HashMap<String, Integer>();


            public void open(Configuration parameters) throws Exception {
                //fetch the broadcast variable
                List<HashMap<String,Integer>> broadVariable = getRuntimeContext().getBroadcastVariable("mydata");
                for(HashMap<String,Integer> x:broadVariable) {
                    allMap.putAll(x);
                }
            }


            public String map(String name) throws Exception {
                // look up the age by name
                Integer age = allMap.get(name);
                return "Name: " + name + "\t Age: " + age;
            }
        }).withBroadcastSet(broadCast, "mydata");


        result.print();

    }
}

4. Accumulators and Counters

Purpose: maintain a single value globally across the whole job.
Note: the accumulator result is only available after the job has finished.

Java implementation:

public class AccumulatorDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // create a DataSet
        DataSet<String> data = env.fromElements("Tom", "Mike", "Mary", "Jone");

        MapOperator<String, Integer> result = data.map(new RichMapFunction<String, Integer>() {

            //step 1: create an accumulator
            private IntCounter intCounter = new IntCounter();

            @Override
            public void open(Configuration parameters) throws Exception {

                //step 2: register the accumulator
                this.getRuntimeContext().addAccumulator("myIntCounter", intCounter);
            }

            public Integer map(String s) throws Exception {
                //step 3: count
                intCounter.add(1);
                return 0;
            }
        }).setParallelism(4);

        // a sink is required, otherwise the batch job reports an error
        result.writeAsText("D:\\InstallDev\\Java\\MyJavaProject\\flinkDemo\\src\\main\\java\\data");

        JobExecutionResult finalResult  = env.execute("AccumulatorDemo");
        //step 4: read the accumulator value
        Object total  = finalResult.getAccumulatorResult("myIntCounter");
        System.out.println("Final accumulator value: " + total);
    }
}

5. State Management

Three backends are supported for persisting state: memory, a filesystem, and RocksDB.

(1) What is state?

Stateful computation means that, while a program runs, Flink stores intermediate results inside the program and makes them available to subsequent functions or operators. The state can live in local storage, i.e. Flink's heap or off-heap memory, or in a third-party store such as the RocksDB backend that ships with Flink; users can also plug in their own storage to support more complex logic.
Stateless computation, by contrast, stores no intermediate results and feeds nothing into later steps: each record is processed on its own, the result is emitted, and then the next record is handled independently.

(2) Checkpoints

Informally, a checkpoint is a timer: we configure how often it fires, and each time it does, the state is persisted to the location we specify.
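A minimal sketch of turning checkpointing on (the same calls appear in the full example later in this section; the interval and mode here are illustrative):

StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();

// take a checkpoint every 1000 ms
sEnv.enableCheckpointing(1000);

// exactly-once is the default checkpointing guarantee
sEnv.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// where the checkpointed state is stored, e.g. an HDFS directory (see the next subsection)
sEnv.setStateBackend(new FsStateBackend("hdfs://bigdata111:9000/flink/ckpt"));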

(3) Checkpoint state backends

Three backends are supported for persisting state: memory, a filesystem, and RocksDB.
To use HDFS as the backend, add the dependency:

<dependency>
	<groupId>org.apache.hadoop</groupId>
	<artifactId>hadoop-client</artifactId>
	<version>3.1.2</version>
	<scope>provided</scope>
</dependency>	

(4) Restart strategies

Flink supports several restart strategies that control how a job restarts after a failure. The common ones:

  • Fixed delay
  • Failure rate
  • No restart

If checkpointing is not enabled, the no-restart strategy is used.

  • Option 1: global configuration in flink-conf.yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
  • Option 2: set it in the application code
sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
		  3, // number of restart attempts
		  Time.of(10, TimeUnit.SECONDS) // delay between attempts
		));

A comprehensive state-management example

Java implementation:

//Computation: emit the sum of every three numbers
public class CountWindowWithState {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//enable checkpointing
		sEnv.enableCheckpointing(1000);//take a checkpoint every second
		// state backend for the checkpoints
		sEnv.setStateBackend(new FsStateBackend("hdfs://bigdata111:9000/flink/ckpt"));
		// restart strategy
		sEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(
				  3, // number of restart attempts
				  Time.of(10, TimeUnit.SECONDS) // delay between attempts
				));
		
		//create a KeyedStream
		sEnv.fromElements(
				Tuple2.of(1, 1),
				Tuple2.of(1, 2),
				Tuple2.of(1, 3),
				Tuple2.of(1, 4),
				Tuple2.of(1, 5),
				Tuple2.of(1, 6),
				Tuple2.of(1, 7),
				Tuple2.of(1, 8),
				Tuple2.of(1, 9))
		.keyBy(0)
		.flatMap(new MyFlatMapFunction())
		.print().setParallelism(1);
		
		sEnv.execute("CountWindowWithState");
	}
}

class MyFlatMapFunction extends RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>{
	//define the state
	//the first Integer is the element count; the second Integer is the running sum
	private ValueState<Tuple2<Integer,Integer>> state;
	
	@Override
	public void open(Configuration parameters) throws Exception {
		//initialize the state
		ValueStateDescriptor<Tuple2<Integer,Integer>> descriptor = 
				new ValueStateDescriptor<Tuple2<Integer,Integer>>("mystate",    //name of the state
						                                          TypeInformation.of(new TypeHint<Tuple2<Integer,Integer>>() {
																  }),     //type of the state
						                                          Tuple2.of(0, 0));   //default value
		state = getRuntimeContext().getState(descriptor);
	}

	@Override
	public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
		// sum every three elements
		//read the current state
		Tuple2<Integer,Integer> current = state.value();
		
		//increment the element count
		current.f0 += 1;
		//add the incoming value to the running sum
		current.f1 += value.f1;
		
		//update the state
		state.update(current);
		
		//check whether three elements have been seen
		if(current.f0 >= 3) {
			//emit the result: (count, sum)
			out.collect(new Tuple2<Integer, Integer>(current.f0,current.f1));
			//clear the state
			state.clear();
		}
		}
	}
}



An operator may produce intermediate results that have to be kept and managed; that is exactly what state is for.

6. Window Computation and Watermarks (out-of-order data)

Window computation was covered earlier and is not repeated here.
To understand watermarks, first understand three notions of time:

  • Event Time: the time at which the data source produced the record
  • Ingestion Time: the time at which the record arrived in Flink
  • Processing Time: the local machine time at which Flink processes the record

When processing out-of-order data you must choose a time characteristic ----> usually Event Time.

Here is how I think about the watermark: it is a delayed-trigger mechanism that waits for data. The source produces records in order, but by the time they reach Flink, limited network bandwidth may have put them out of order. Flink still wants to process ordered data, so we configure a waiting time, which is our estimate of how late the slowest record will arrive. So how is the computation inside a window actually triggered? See the figure below.
(figure: how watermarks trigger window computation)

A high watermark is like the water level in a container: as water is poured in, the level slowly rises. In Flink, as data flows through, the watermark rises with it; the event-time watermark keeps increasing, and its value at any moment measures how far event time has progressed.
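A compact sketch of the watermark API used in the full example later in this section: declare how out-of-order the stream may be and which field carries the event time (the 3-second bound is illustrative, and `stream` is assumed to be a DataStream<StationLog>):

DataStream<StationLog> withWatermarks = stream
    .assignTimestampsAndWatermarks(
        // allow events to arrive up to 3 seconds late
        WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(3))
            // tell Flink which field carries the event time
            .withTimestampAssigner((element, recordTimestamp) -> element.getCallTime()));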

7. allowedLateness

By default, data that arrives later than the watermark allows is dropped. With allowedLateness we can handle such late data separately.
Core code:

//define the tag for the side output
OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
result.getSideOutput(lateTag).print();

A comprehensive window-computation example

Java implementation:

package day1113.datastream;

import java.time.Duration;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// For every base station, every 3 seconds, output the call record with the longest duration over the past 5 seconds.
public class WaterMarkDemo {
	public static void main(String[] args) throws Exception {
		//get the streaming execution environment
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//set the time characteristic
		env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
		
		env.setParallelism(1);
		//set how often watermarks are generated; on a busy stream, generating a watermark for every event would hurt performance
		env.getConfig().setAutoWatermarkInterval(100);//default is 100 ms
		//define the tag for the side output
		OutputTag<StationLog> lateTag = new OutputTag<StationLog>("late-Data") {};
		
		//get the input stream
		DataStreamSource<String> stream = env.socketTextStream("bigdata111", 1234);

		SingleOutputStreamOperator<String> result = stream.flatMap(new FlatMapFunction<String, StationLog>() {

			public void flatMap(String data, Collector<StationLog> output) throws Exception {
				String[] words = data.split(",");
				//                            station ID, from,    to,      duration,                  callTime
				output.collect(new StationLog(words[0], words[1],words[2], Long.parseLong(words[3]), Long.parseLong(words[4])));
			}
			}
		}).filter(new FilterFunction<StationLog>() {
			
			@Override
			public boolean filter(StationLog value) throws Exception {
				return value.getDuration() > 0;
			}
		}).assignTimestampsAndWatermarks(WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(3)) //allow events to be up to 3 seconds late
				.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
					@Override
					public long extractTimestamp(StationLog element, long recordTimestamp) {
						return element.getCallTime(); //use the call time as the event-time field
					}
				})
		).keyBy(new KeySelector<StationLog, String>(){
			@Override
			public String getKey(StationLog value) throws Exception {
				return value.getStationID();  //key by base station
			}}
		).timeWindow(Time.seconds(5),Time.seconds(3)) //sliding window: 5-second size, 3-second slide
		// send data that arrives too late for the watermark to the side output
		.allowedLateness(Time.seconds(30)).sideOutputLateData(lateTag)
		.reduce(new MyReduceFunction(),new MyProcessWindows());
		//get the side output
		//data the watermark could not handle is processed separately -----> here simply printed
		result.getSideOutput(lateTag).print();
		result.print();

		env.execute();
	}
}
//how to combine the records inside a window: find the record with the longest call duration.
class MyReduceFunction implements ReduceFunction<StationLog> {
	@Override
	public StationLog reduce(StationLog value1, StationLog value2) throws Exception {
		// keep the call record with the longest duration
		return value1.getDuration() >= value2.getDuration() ? value1 : value2;
	}
}
//what the window emits after it has been processed
class MyProcessWindows extends ProcessWindowFunction<StationLog, String, String, TimeWindow> {
	@Override
	public void process(String key, ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
			Iterable<StationLog> elements, Collector<String> out) throws Exception {
		StationLog maxLog = elements.iterator().next();

		StringBuffer sb = new StringBuffer();
		sb.append("Window range: ").append(context.window().getStart()).append("----").append(context.window().getEnd()).append("\n");
		sb.append("Station ID: ").append(maxLog.getStationID()).append("\t")
		  .append("Call time: ").append(maxLog.getCallTime()).append("\t")
		  .append("Caller: ").append(maxLog.getFrom()).append("\t")
		  .append("Callee: ").append(maxLog.getTo()).append("\t")
		  .append("Duration: ").append(maxLog.getDuration()).append("\n");
		out.collect(sb.toString());
	}
}

The StationLog class:

//station1,18688822219,18684812319,10,1595158485855
public class StationLog {
    private String stationID;   //base station ID
    private String from;		//caller
    private String to;			//callee
    private long duration;		//call duration
    private long callTime;		//time of the call
    public StationLog(String stationID, String from,
                      String to, long duration,
                      long callTime) {
        this.stationID = stationID;
        this.from = from;
        this.to = to;
        this.duration = duration;
        this.callTime = callTime;
    }
    public String getStationID() {
        return stationID;
    }
    public void setStationID(String stationID) {
        this.stationID = stationID;
    }
    public long getCallTime() {
        return callTime;
    }
    public void setCallTime(long callTime) {
        this.callTime = callTime;
    }
    public String getFrom() {
        return from;
    }
    public void setFrom(String from) {
        this.from = from;
    }

    public String getTo() {
        return to;
    }
    public void setTo(String to) {
        this.to = to;
    }
    public long getDuration() {
        return duration;
    }
    public void setDuration(long duration) {
        this.duration = duration;
    }
}

Sample data:

station1,18688822219,18684812319,10,1595158485855
station5,13488822219,13488822219,50,1595158490856
station5,13488822219,13488822219,50,1595158495856
station5,13488822219,13488822219,50,1595158500856
station5,13488822219,13488822219,50,1595158505856
station2,18464812121,18684812319,20,1595158507856
station3,18468481231,18464812121,30,1595158510856
station5,13488822219,13488822219,50,1595158515857
station2,18464812121,18684812319,20,1595158517857
station4,18684812319,18468481231,40,1595158521857
station0,18684812319,18688822219,0,1595158521857
station2,18464812121,18684812319,20,1595158523858
station6,18608881319,18608881319,60,1595158529858
station3,18468481231,18464812121,30,1595158532859
station4,18684812319,18468481231,40,1595158536859
station2,18464812121,18684812319,20,1595158538859
station1,18688822219,18684812319,10,1595158539859
station5,13488822219,13488822219,50,1595158544859
station4,18684812319,18468481231,40,1595158548859
station3,18468481231,18464812121,30,1595158551859
station1,18688822219,18684812319,10,1595158552859
station3,18468481231,18464812121,30,1595158555859
station0,18684812319,18688822219,0,1595158555859
station2,18464812121,18684812319,20,1595158557859
station4,18684812319,18468481231,40,1595158561859

V. Data Analytics Engine: Flink Table & SQL (not a focus; just a few examples)

Flink Table & SQL is not yet mature and is still under active development (as of 2020-11-20):
Please note that the Table API and SQL are not yet feature complete and are being actively developed. Not all operations are supported by every combination of [Table API, SQL] and [stream, batch] input.

Add the dependencies:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-api-java-bridge_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-planner_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-table-planner-blink_2.11</artifactId>
	<version>1.11.0</version>
	<scope>provided</scope>
</dependency>

Java demos

(1) Batch processing

The WordCountBatchTableAPI class:

public class WordCountBatchTableAPI {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        //create a DataSet representing the data to process
        DataSet<String> source = env.fromElements("I love Beijing","I love China",
                "Beijing is the capital of China");
        DataSet<WordCount> input  = source.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String s, Collector<WordCount> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new WordCount(word, 1));
                }
            }
        });

        // create the table environment
        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
        // convert the DataSet into a Table
        Table table = tEnv.fromDataSet(input);
        // process the data
        Table data  = table.groupBy("word").select("word,frequency.sum as frequency");
        // convert the result back into a DataSet
        DataSet<WordCount> result  = tEnv.toDataSet(data, WordCount.class);
        result.print();

    }

}

class WordCount {
    public String word;
    public long frequency;

    public WordCount() {
    }

    public WordCount(String word, int frequency) {
        this.word = word;
        this.frequency = frequency;
    }

    @Override
    public String toString() {
        return "WordCount [word=" + word + ", frequency=" + frequency + "]";
    }
}

The WordCountBatchSQL class:

public class WordCountBatchSQL {

	public static void main(String[] args) throws Exception {
		// create the DataSet API execution environment
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		
		//create a DataSet representing the data to process
		DataSet<String> source = env.fromElements("I love Beijing","I love China",
				                                  "Beijing is the capital of China");
		
		//turn each line into WordCount records
		DataSet<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// e.g. I love Beijing
				String[] words = value.split(" ");
				for(String w:words) {
					// emit (word, 1)
					out.collect(new WordCount(w,1));
				}
			}
		});
		
		//create a Table execution environment
		BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
		
		//register the table
		tEnv.registerDataSet("WordCount", input,"word,frequency");
		
		//run the SQL query
		Table table = tEnv.sqlQuery("select word,sum(frequency) as frequency from WordCount group by word");
		
		DataSet<WordCount> result = tEnv.toDataSet(table, WordCount.class);
		result.print();
	}

	//a class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}

(2) Stream processing

The WordCountStreamTableAPI class:

public class WordCountStreamTableAPI {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
	
		//receive the input
		DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);

		DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// e.g. I love Beijing
				String[] words = value.split(" ");
				for(String word:words) {
					out.collect(new WordCount(word,1));
				}
				
			}
		});
		
		//create a table environment for the stream
		StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
		
		Table table = stEnv.fromDataStream(input,"word,frequency");
		Table result = table.groupBy("word").select("word,frequency.sum").as("word", "frequency");
		
		//output
		stEnv.toRetractStream(result, WordCount.class).print();
		
		sEnv.execute("WordCountStreamTableAPI");
	}

	//a class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}

The WordCountStreamSQL class:

public class WordCountStreamSQL {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
		
		//receive the input
		DataStreamSource<String> source = sEnv.socketTextStream("bigdata111", 1234);

		DataStream<WordCount> input = source.flatMap(new FlatMapFunction<String, WordCount>() {

			@Override
			public void flatMap(String value, Collector<WordCount> out) throws Exception {
				// e.g. I love Beijing
				String[] words = value.split(" ");
				for(String word:words) {
					out.collect(new WordCount(word,1));
				}
				
			}
		});
		
		//create a table environment for the stream
		StreamTableEnvironment stEnv = StreamTableEnvironment.create(sEnv);
		Table table = stEnv.fromDataStream(input,"word,frequency");
		
		//run the SQL query
		Table result = stEnv.sqlQuery("select word,sum(frequency) as frequency from " + table + " group by word");
		stEnv.toRetractStream(result, WordCount.class).print();
		
		sEnv.execute("WordCountStreamSQL");
	}
	//a class describing the table schema
	public static class WordCount{
		public String word;
		public long frequency;
		
		public WordCount() {}
		public WordCount(String word,int frequency) {
			this.word = word;
			this.frequency = frequency;
		}
		@Override
		public String toString() {
			return "WordCount [word=" + word + ", frequency=" + frequency + "]";
		}
	}
}
