Custom Source

江湖侠客

Tags: flink

Article link: https://blog.csdn.net/weixin_39868387/article/details/118424485

Previous post: Defining a Third-Party Connector Source

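1. Overview

In essence, you define a class that implements the SourceFunction interface and its run and cancel methods.

The run method:

First, run implements the logic that fetches the data.
Then it calls collect on the SourceContext to emit each record downstream. Passing an instance of the class to env.addSource returns a new DataStreamSource.

Note: a class that implements only this interface is a non-parallel Source (its parallelism is always 1).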

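2. How can a custom Source meet production requirements?

2.1 Requirement

The Source should read data in parallel so that data is read faster.

2.2 Approach (recommended)

a. Implement the ParallelSourceFunction interface or extend the abstract class RichParallelSourceFunction.
b. Implement the run and cancel methods as before; the Source can then run in parallel.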

2.3 Example 1 (non-parallel custom Source)

Implement the SourceFunction interface; the parallelism is 1.

package cn._51doit.flink.day01;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Arrays;
import java.util.List;

/**
 * Custom source.
 * Bounded vs. unbounded streams:
 *  (1) bounded stream: the program exits once all data has been processed
 *  (2) unbounded stream: the program keeps running until it is cancelled
 *
 * Console output:
 * Parallelism of this custom source: 1
 * 1> aaa
 * 1> ccc
 * 1> eee
 *
 * 2> bbb
 * 2> ddd
 *
 */
public class CustomSource01 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);  // default parallelism of the environment

        // Register the custom source MySource1
        DataStreamSource<String> streamSource = env.addSource(new MySource1());

        System.out.println("Parallelism of this custom source: " + streamSource.getParallelism());

        streamSource.print();  // print each record to the console

        env.execute();   // submit and run the job (may throw an exception)

    }

    // Implements SourceFunction, so the parallelism is 1 (a non-parallel source)
    public static class MySource1 implements SourceFunction<String> {

        // run contains the logic that produces the data
        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            List<String> words = Arrays.asList("aaa", "bbb", "ccc", "ddd", "eee"); // input data
            for (String word : words) {
                ctx.collect(word);  // emit the record (bounded stream: the source finishes when run returns)
            }

        }

        // Nothing to stop here, because run returns on its own
        @Override
        public void cancel() {

        }
    }
}

Console output when the program runs (a bounded stream):

Parallelism of this custom source: 1

2> aaa
2> ccc
2> eee
1> bbb
1> ddd
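
The prefixes `1>` and `2>` identify the print sink subtask, not the source. The source runs with parallelism 1 while the sink runs with parallelism 2, so the records are redistributed between the two sink subtasks, and the alternating pattern above is consistent with round-robin distribution. The following plain-Java sketch only models that pattern. It does not use Flink, and `RoundRobinSketch`, `distribute`, and `firstChannel` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of round-robin distribution; not Flink code.
class RoundRobinSketch {

    // Assigns each record to one of `parallelism` channels in turn,
    // starting at `firstChannel` (0-based), and formats it like the console output.
    static List<String> distribute(List<String> records, int parallelism, int firstChannel) {
        List<String> lines = new ArrayList<>();
        int channel = firstChannel;
        for (String record : records) {
            lines.add((channel + 1) + "> " + record); // console prefixes are 1-based
            channel = (channel + 1) % parallelism;
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("aaa", "bbb", "ccc", "ddd", "eee");
        // Assumption: the first record happens to go to the second sink subtask.
        for (String line : distribute(words, 2, 1)) {
            System.out.println(line);
        }
    }
}
```

With a starting channel of 1 the sketch assigns aaa, ccc, and eee to subtask 2 and bbb and ddd to subtask 1, the same grouping as in the output above. A different starting channel swaps the two groups, which would explain why the listing in the code comment shows them the other way round.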

Example 2 (parallel custom Source)

Implement the ParallelSourceFunction interface; the source can then run with a parallelism greater than 1.

package cn._51doit.flink.day01;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

import java.util.Arrays;
import java.util.List;

/**
 * Custom source.
 * Bounded vs. unbounded streams:
 *  (1) bounded stream: the program exits once all data has been processed
 *  (2) unbounded stream: the program keeps running until it is cancelled
 *
 *  Parallel vs. non-parallel sources:
 *  (1) a non-parallel source creates only one instance;
 *  (2) a parallel source creates several instances, each handling its own data.
 *      For example, when reading from Kafka, each source instance has its own active partitions
 *      (it processes its own data and tracks its own offsets, independently of the others).
 *      Suppose a source has parallelism 4: each instance reads its own Kafka partitions, which is more efficient.
 *
 *
 *  Console output:
 * Parallelism of this custom source: 4 (the local default is 4 on this machine)
 *
 * 2> aaa
 * 2> bbb
 * 2> ccc
 * 2> ddd
 * 2> eee
 * 3> aaa
 * 3> bbb
 * 3> ccc
 * 3> ddd
 * 3> eee
 * 4> aaa
 * 4> bbb
 * 4> ccc
 * 4> ddd
 * 4> eee
 * 1> aaa
 * 1> bbb
 * 1> ccc
 * 1> ddd
 * 1> eee
 *
 * This example is a bounded stream: the program exits once all data has been processed.
 */
public class CustomSource02 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // env.setParallelism(1);  // Note: if a parallelism is set here, the source uses it even though it implements ParallelSourceFunction

        // Register the custom source MySource2
        DataStreamSource<String> streamSource = env.addSource(new MySource2());

        System.out.println("Parallelism of this custom source: " + streamSource.getParallelism());

        streamSource.print();  // print each record to the console

        env.execute();   // submit and run the job (may throw an exception)

    }

    // Implements ParallelSourceFunction, so it can run in parallel.
    // With no parallelism set, it uses the environment default (4 on the author's machine).
    public static class MySource2 implements ParallelSourceFunction<String> {

        // run is executed once by every parallel instance
        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            List<String> words = Arrays.asList("aaa", "bbb", "ccc", "ddd", "eee"); // input data
            for (String word : words) {
                ctx.collect(word);  // emit the record (bounded stream: the source finishes when run returns)
            }

        }

        // Nothing to stop here, because run returns on its own
        @Override
        public void cancel() {

        }
    }
}
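
In the console output listed in the comment of this example, every parallel instance emits all five words, because each instance runs the same run method over the same list. When each record should be read only once, each instance has to take its own share of the input, as in the Kafka case described in that comment. Below is a minimal plain-Java sketch of one common way to do this: assign records by index modulo the parallelism. It does not use Flink. `SubtaskSplitSketch` and `sliceFor` are hypothetical names, and in a real job the instance index and the parallelism would have to come from Flink's runtime context (for example in a class that extends RichParallelSourceFunction).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of splitting input across parallel source instances; not Flink code.
class SubtaskSplitSketch {

    // Returns the records that the instance with the given 0-based index would emit
    // when `parallelism` instances share the same input list.
    static List<String> sliceFor(List<String> records, int subtaskIndex, int parallelism) {
        List<String> slice = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            if (i % parallelism == subtaskIndex) {
                slice.add(records.get(i));
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("aaa", "bbb", "ccc", "ddd", "eee");
        int parallelism = 4; // assumed, matching the parallelism reported in the example
        for (int subtask = 0; subtask < parallelism; subtask++) {
            System.out.println("instance " + subtask + " emits " + sliceFor(words, subtask, parallelism));
        }
    }
}
```

With four instances, instance 0 would emit aaa and eee, and instances 1 to 3 would emit bbb, ccc, and ddd respectively, so each word appears exactly once.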

Example 3 (rich custom Source)

This example is an unbounded stream: the program keeps running until it is cancelled.

package cn._51doit.flink.day01;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;


import java.util.UUID;

/**
 * Custom source.
 *
 * This example is an unbounded stream: the program keeps running until it is cancelled.
 *
 * Console output:
 *
 * constructor invoked
 *
 * Parallelism of this custom source: 1  (RichSourceFunction is non-parallel, so this is always 1)
 *
 * First: open method invoked
 * Then: the run method keeps producing data (the prefixes 1> to 4> are the print sink subtasks)
 * 1> b4245254-5526-4163-a929-d1462f038def
 * 2> c42cb876-96e2-4a84-8aa2-f6105e2b11a4
 * 3> ecde8814-2b3c-4dbd-9f5a-9a72dcd84cfa
 * 4> 50eed7d9-8e38-4a0f-a1c1-e4f2af499132
 * 1> 4e68d62f-8811-438f-9aff-4444358e3ce3
 * 2> a09bad70-1f95-4121-90d0-c822ddafb515
 * 3> 332f1806-2259-4f5f-9ae0-2bf7d787dc8e
 * 4> cffc5e0a-c2a7-42c1-a691-589cdd1a691b
 *
 */
public class CustomSource03 {
    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Register the custom source MySource3
        DataStreamSource<String> streamSource = env.addSource(new MySource3());

        System.out.println("Parallelism of this custom source: " + streamSource.getParallelism());

        streamSource.print();  // print each record to the console

        env.execute();   // submit and run the job (may throw an exception)

    }

    // Extends the abstract class RichSourceFunction, which is non-parallel (parallelism 1).
    // Extend RichParallelSourceFunction instead for a source that can run in parallel.
    public static class MySource3 extends RichSourceFunction<String> {
        // 1. The constructor is called first
        public MySource3() {
            System.out.println("constructor invoked");
        }

        // volatile, because cancel is called from a different thread than run
        private volatile boolean flag = true;

        // 2. open is called once, before run
        @Override
        public void open(Configuration parameters) throws Exception {
            System.out.println("open method invoked");

        }

        // 5. close is called last and releases resources
        @Override
        public void close() throws Exception {
            System.out.println("close method invoked");
        }

        // 3. run keeps producing data until the flag is cleared
        @Override
        public void run(SourceContext<String> ctx) throws Exception {

            while (flag) {
                ctx.collect(UUID.randomUUID().toString());
                Thread.sleep(2000);
            }

        }

        // 4. cancel stops the loop in run
        @Override
        public void cancel() {
            flag = false;

        }
    }
}

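Note: a Source that extends the abstract class RichSourceFunction can also produce a bounded stream; the stream ends when run returns.

To view the running job in a browser, create the environment with the local web UI enabled (code):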

 Configuration configuration = new Configuration();

 StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(configuration);

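Notes (execution order):

While the program is running, open is called first (printing "open method invoked"), then run, which keeps producing data.
When the program is stopped, cancel is called first, then close releases the resources.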
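
The order described above can be modelled without Flink. The sketch below is a plain-Java stand-in, assuming that cancel is called from a different thread than the one executing run, which is why the flag is declared `volatile`. `LifecycleSketch`, `DemoSource`, and `runAndCancel` are hypothetical names, and the small harness only imitates what the framework does.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Plain-Java model of the open -> run -> cancel -> close order; not Flink code.
class LifecycleSketch {

    static class DemoSource {
        final List<String> events = new CopyOnWriteArrayList<>();
        final CountDownLatch started = new CountDownLatch(1);
        // volatile so that the write made by the cancelling thread is visible to run()
        private volatile boolean running = true;

        void open() { events.add("open"); }

        void run() throws InterruptedException {
            events.add("run");
            started.countDown();
            while (running) {
                Thread.sleep(10); // stands in for producing one record
            }
        }

        void cancel() { events.add("cancel"); running = false; }

        void close() { events.add("close"); }
    }

    // Imitates the framework: a worker thread drives the source, the caller cancels it.
    static List<String> runAndCancel(long runMillis) {
        DemoSource source = new DemoSource();
        Thread worker = new Thread(() -> {
            try {
                source.open();
                source.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                source.close();
            }
        });
        worker.start();
        try {
            source.started.await();  // wait until run() has begun
            Thread.sleep(runMillis); // let the source produce for a while
            source.cancel();         // called from a different thread than run()
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return source.events;
    }

    public static void main(String[] args) {
        System.out.println(runAndCancel(100)); // prints [open, run, cancel, close]
    }
}
```

The loop in run only ends because cancel clears the flag; close then runs in the `finally` block, mirroring the cancel-then-close order noted above.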
