Flink: Broadcast Variables, Accumulators, and Distributed Cache (Part 8)

Latest recommended article published 2022-02-28 19:23:05

CurryYoung11

Category: flink. Tags: big data, flink

Article link: https://blog.csdn.net/CurryYoung11/article/details/105208232


Broadcast

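/*
1. Convert the data to be broadcast into a DataSet:
     DataSet<Tuple2<String, Integer>> tupleData = env.fromCollection(broadData);

2. Define a rich function (here, a RichMapFunction)
   and read the broadcast variable in its open() method.
3. After the operator, call withBroadcastSet(broadcastDataSet, broadcastName).
*/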
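import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class BatchDemoBroadcast {
    public static void main(String[] args) throws Exception {
        // Get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1: prepare the data to be broadcast
        ArrayList<Tuple2<String, Integer>> broadData = new ArrayList<>();
        broadData.add(new Tuple2<>("zs", 18));
        broadData.add(new Tuple2<>("ls", 20));
        broadData.add(new Tuple2<>("ww", 17));

        // Convert the data to be broadcast into a DataSet
        DataSet<Tuple2<String, Integer>> tupleData = env.fromCollection(broadData);

        // Convert each tuple into a HashMap<String, Integer> with key = name, value = age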
        DataSet<HashMap<String, Integer>> toBroadcast = tupleData.map(new MapFunction<Tuple2<String, Integer>, HashMap<String, Integer>>() {
            @Override
            public HashMap<String, Integer> map(Tuple2<String, Integer> value) throws Exception {
                HashMap<String, Integer> res = new HashMap<>();
                res.put(value.f0, value.f1);
                return res;
            }
        });

        // Source data
        DataSource<String> data = env.fromElements("zs", "ls", "ww");
        // Note: a RichMapFunction is required here to access the broadcast variable
        DataSet<String> result = data.map(new RichMapFunction<String, String>() {
            List<HashMap<String, Integer>> broadCastMap = new ArrayList<HashMap<String, Integer>>();
            HashMap<String, Integer> allMap = new HashMap<String, Integer>();
            /**
             * Called once per parallel instance, before any call to map(),
             * so it is the place for initialization work such as
             * reading the broadcast variable.
             */
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // 3: read the broadcast data
                this.broadCastMap = getRuntimeContext().getBroadcastVariable("broadCastMapName");
                for (HashMap<String, Integer> map : broadCastMap) {
                    allMap.putAll(map);
                }
            }
            @Override
            public String map(String value) throws Exception {
                Integer age = allMap.get(value);
                return value + "," + age;
            }
        }).withBroadcastSet(toBroadcast, "broadCastMapName"); // 2: attach the broadcast DataSet under this name

        result.print();
    }
}
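
In `open()`, the broadcast set arrives as a `List` of single-entry maps, which the function folds into one lookup table before `map()` uses it. The following is a minimal plain-Java sketch of just that merge-and-lookup logic, with no Flink dependency. The class `BroadcastLookupSketch` and its methods are hypothetical names chosen for illustration, and the hand-built list only stands in for what `getBroadcastVariable` would return.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: mimics what the RichMapFunction
// does with the broadcast data, using plain collections.
class BroadcastLookupSketch {

    // Fold a list of small maps into one lookup table,
    // as open() does with the broadcast variable.
    static Map<String, Integer> mergeAll(List<Map<String, Integer>> broadcast) {
        Map<String, Integer> all = new HashMap<>();
        for (Map<String, Integer> part : broadcast) {
            all.putAll(part);
        }
        return all;
    }

    // Same output format as map(): "name,age".
    // A name missing from the table yields "name,null".
    static String enrich(String name, Map<String, Integer> lookup) {
        return name + "," + lookup.get(name);
    }

    public static void main(String[] args) {
        // Stand-in for the broadcast variable: one single-entry map per person.
        List<Map<String, Integer>> broadcast = new ArrayList<>();
        broadcast.add(Collections.singletonMap("zs", 18));
        broadcast.add(Collections.singletonMap("ls", 20));
        broadcast.add(Collections.singletonMap("ww", 17));

        Map<String, Integer> lookup = mergeAll(broadcast);
        for (String name : new String[] {"zs", "ls", "ww"}) {
            System.out.println(enrich(name, lookup));
        }
    }
}
```

Run on its own, this prints `zs,18`, `ls,20`, and `ww,17`. Merging once up front keeps each later lookup to a single `get` call instead of a scan over the list.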

Accumulators

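An accumulator serves much the same purpose as a MapReduce counter: both let you observe how data changes while tasks are running.
Accumulators can be updated inside the operator functions of a Flink job, but the final result is available only after the job has finished.
A Counter is a concrete Accumulator implementation:
IntCounter, LongCounter, and DoubleCounter.
Usage
Define a rich function, for example a RichMapFunction.
Inside the rich function:
1: Create the accumulator
private IntCounter numLines = new IntCounter();
2: Register the accumulator in the open() method
getRuntimeContext().addAccumulator("num-lines", this.numLines);
3: Use the accumulator
this.numLines.add(1);
4: Read the accumulator result
myJobExecutionResult.getAccumulatorResult("num-lines")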

Partial code

        DataSet<String> result = data.map(new RichMapFunction<String, String>() {
            // 1: create the accumulator
            private IntCounter numLines = new IntCounter();
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // 2: register the accumulator
                getRuntimeContext().addAccumulator("intCounter", this.numLines);
            }

            // int sum = 0;
            @Override
            public String map(String value) throws Exception {
                // 3: use the accumulator
                // With parallelism 1 a plain field counter would do, but with
                // parallelism > 1 each subtask has its own copy of the field,
                // so a plain sum would be wrong.
                // sum++;
                // System.out.println("sum: " + sum);
                this.numLines.add(1);
                return value;
            }
        }).setParallelism(8);

        // Note: a sink on result (for example writeAsText) must be defined
        // before execute(); it is not shown in this excerpt.
        JobExecutionResult ex = env.execute("test");
        int intCounter = ex.getAccumulatorResult("intCounter"); // 4: read the accumulator value
        System.out.println("acc=" + intCounter);
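
To make the comment about parallelism concrete: each parallel subtask works on its own copy of the function, so a plain field such as `sum` only counts the records that one subtask saw. An accumulator keeps a local value per subtask, and the local values are merged into one result when the job finishes. The plain-Java sketch below simulates this without Flink. `AccumulatorMergeSketch` and `LocalCounter` are hypothetical names (this is not Flink's `IntCounter`), and the round-robin split is an assumption made only to spread records across the simulated subtasks.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation, not Flink code.
class AccumulatorMergeSketch {

    // Hypothetical stand-in for a per-subtask accumulator.
    static class LocalCounter {
        private int value;

        void add(int n) { value += n; }

        int getLocalValue() { return value; }

        // Fold another subtask's partial count into this one.
        void merge(LocalCounter other) { value += other.value; }
    }

    // Spread `records` over `parallelism` simulated subtasks (round-robin),
    // let each one count locally, and return the per-subtask counters.
    static List<LocalCounter> countPerSubtask(int records, int parallelism) {
        List<LocalCounter> counters = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            counters.add(new LocalCounter());
        }
        for (int r = 0; r < records; r++) {
            counters.get(r % parallelism).add(1);
        }
        return counters;
    }

    // Merge all partial counts into one global result.
    static int mergeAll(List<LocalCounter> counters) {
        LocalCounter total = new LocalCounter();
        for (LocalCounter c : counters) {
            total.merge(c);
        }
        return total.getLocalValue();
    }

    public static void main(String[] args) {
        List<LocalCounter> counters = countPerSubtask(20, 8);
        // A single subtask sees only its own share of the records.
        System.out.println("subtask 0 saw: " + counters.get(0).getLocalValue());
        // Only the merged value reflects all records.
        System.out.println("merged total: " + mergeAll(counters));
    }
}
```

With 20 records and 8 simulated subtasks, subtask 0 counts 3 records while the merged total is 20, which is why the per-subtask `sum` field cannot replace the accumulator.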

Differences between Broadcast and Accumulators

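A broadcast variable lets the programmer cache a read-only variable on each machine instead of passing it between tasks. Broadcast variables can be shared but not modified.
Accumulators let different tasks add to the same logical variable; the partial values are merged into a single result.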

Distributed Cache

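Flink provides a distributed cache, similar to Hadoop's, that makes it easy for parallel function instances to read files locally.
It works as follows: the program registers a file or directory (on a local or remote file system such as HDFS or S3) through the ExecutionEnvironment and gives it a name. When the program runs, Flink automatically copies the file or directory to the local file system of every TaskManager node. User code can then look it up by that name and read it from the TaskManager's local file system.
Usage
1: Register a file
env.registerCachedFile("hdfs://node01:9000/cachefile", "hdfsFile");
2: Access the data
File myFile = getRuntimeContext().getDistributedCache().getFile("hdfsFile");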

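import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Assumption: FileUtils is the Apache Commons IO class
import org.apache.commons.io.FileUtils;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.configuration.Configuration;

public class BatchDemoDisCache {

    public static void main(String[] args) throws Exception {

        // Get the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Register a file; a file on HDFS or S3 can be used as well
        env.registerCachedFile("c://data/cachedfile1.txt", "cachedfile1");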

        DataSource<String> data = env.fromElements("java", "scala", "python");

        DataSet<String> result = data.map(new RichMapFunction<String, String>() {
            private ArrayList<String> dataList = new ArrayList<String>();
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // 2. Read the cached file (explicit charset instead of the platform default)
                File cachedfile = getRuntimeContext().getDistributedCache().getFile("cachedfile1");
                List<String> lines = FileUtils.readLines(cachedfile, "UTF-8");
                for (String line : lines) {
                    this.dataList.add(line);
                    System.out.println("line:" + line);
                }
            }
            @Override
            public String map(String value) throws Exception {
                // 3. dataList can now be used here
                StringBuilder sb = new StringBuilder();
                for (String str : dataList) {
                    sb.append(str).append(value).append(" ");
                }
                return sb.toString();
            }
        });

        result.print();
    }
}
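
The logic inside `open()` and `map()` can be tried out without a cluster. The sketch below is plain Java with no Flink dependency: a temporary file stands in for the local copy that Flink would place on each TaskManager, and `CachedFileSketch`, `loadLines`, and `joinWith` are hypothetical names used only for illustration. The file contents (`hello`, `flink`) are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation, not Flink code.
class CachedFileSketch {

    // Read the whole file once, as open() does with the cached file.
    static List<String> loadLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // Same output format as map(): every cached line followed
    // by the input value and a trailing space.
    static String joinWith(List<String> cachedLines, String value) {
        StringBuilder sb = new StringBuilder();
        for (String line : cachedLines) {
            sb.append(line).append(value).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the locally cached copy.
        Path tmp = Files.createTempFile("cachedfile1", ".txt");
        try {
            Files.write(tmp, Arrays.asList("hello", "flink"), StandardCharsets.UTF_8);
            List<String> cached = loadLines(tmp);
            for (String value : Arrays.asList("java", "scala", "python")) {
                System.out.println(joinWith(cached, value));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading the file once and reusing the in-memory list mirrors the original design: the file is loaded in `open()`, so `map()` never touches the disk per record.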
