Parameter Passing and Data Sharing in Flink

꧁꫞ND꫞꧂

Original article: https://www.cnblogs.com/029zz010buct/p/10362451.html

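Flink is a stream processing engine, built to process unbounded data. While processing that data, a job often needs parameters passed into its functions. What mechanisms are available, and when does each one apply? This post is a brief summary.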

Using Configuration

Define the parameters in the main function:

// Class in Flink to store parameters
Configuration configuration = new Configuration();
configuration.setString("genre", "Action");

lines.filter(new FilterGenreWithParameters())
        // Pass parameters to a function
        .withParameters(configuration)
        .print();

A function that uses the parameters must extend a rich function so that it can read them in its open method.

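import java.util.stream.Stream;

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;

class FilterGenreWithParameters extends RichFilterFunction<Tuple3<Long, String, String>> {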

    String genre;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read the parameter
        genre = parameters.getString("genre", "");
    }

    @Override
    public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
        String[] genres = movie.f2.split("\\|");

        return Stream.of(genres).anyMatch(g -> g.equals(genre));
    }
}
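The matching logic inside filter can be tried without a cluster. The following is a minimal plain-Java sketch of it; the class and method names (GenreMatchSketch, hasGenre) are made up for illustration and no Flink classes are involved. It also shows why the separator is written as "\\|": a bare "|" is a regex alternation operator.

```java
import java.util.stream.Stream;

// Plain-Java stand-in for the filter body; hypothetical names, no Flink classes.
class GenreMatchSketch {

    // Returns true when the pipe-separated genre field contains the wanted genre.
    static boolean hasGenre(String genreField, String wanted) {
        // "|" is a regex metacharacter, so it has to be escaped as "\\|"
        String[] genres = genreField.split("\\|");
        return Stream.of(genres).anyMatch(g -> g.equals(wanted));
    }

    public static void main(String[] args) {
        System.out.println(hasGenre("Action|Comedy", "Action"));
        System.out.println(hasGenre("Drama", "Action"));
        // An empty wanted genre (the default used in open()) matches no non-empty genre
        System.out.println(hasGenre("Action|Comedy", ""));
    }
}
```

Because open() falls back to an empty string when the "genre" key is missing, a misspelled key would silently filter out every movie rather than fail.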

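Using ParameterTool

Configuration does pass the parameter, but it is not dynamic: every parameter change requires changing the program. Since main already accepts arguments, Flink provides a mechanism to pick them up: ParameterTool.

With ParameterTool, parameters are passed as follows: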

public static void main(String... args) throws Exception {
    // Read command line arguments
    ParameterTool parameterTool = ParameterTool.fromArgs(args);

    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Make the parameters available to every rich function in the job
    env.getConfig().setGlobalJobParameters(parameterTool);
    ...

    // This function will be able to read these global parameters
    lines.filter(new FilterGenreWithGlobalEnv())
            .print();
}

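As the code above shows, ParameterTool picks up the arguments of main, and registering it on the environment as global job parameters distributes it to the tasks. Any function that extends a rich function can then read these global parameters.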

class FilterGenreWithGlobalEnv extends RichFilterFunction<Tuple3<Long, String, String>> {

    @Override
    public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
        String[] genres = movie.f2.split("\\|");
        // Get global parameters
        ParameterTool parameterTool = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        // Read parameter
        String genre = parameterTool.get("genre");

        return Stream.of(genres).anyMatch(g -> g.equals(genre));
    }
}
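On the command line the genre would be supplied as a key/value pair such as --genre Action. The sketch below is a simplified plain-Java stand-in that only handles strict "--key value" pairs; it is not Flink's parser, and the names ArgsSketch and parse are hypothetical. It illustrates the lookup that parameterTool.get("genre") performs, and why a default is worth considering when the key may be absent.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for command line parsing; NOT Flink's ParameterTool.
class ArgsSketch {

    // Turns {"--genre", "Action"} into a map {genre=Action}.
    static Map<String, String> parse(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            String key = args[i];
            if (key.startsWith("--")) {
                key = key.substring(2); // drop the leading dashes
            }
            params.put(key, args[i + 1]);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(new String[] {"--genre", "Action"});
        System.out.println(params.get("genre"));
        // A missing key yields null unless a default is supplied
        System.out.println(params.getOrDefault("year", "unknown"));
    }
}
```

In the Flink function above, parameterTool.get("genre") likewise returns null when the argument was not passed; ParameterTool also offers a get(key, defaultValue) variant for that case.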

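Using broadcast variables

Configuration and ParameterTool are convenient, but they are only suited to a small number of parameters. For larger amounts of data Flink offers other mechanisms. One of them is the broadcast variable, an approach that is also widely used in other compute engines.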

DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);
// Get a dataset with words to ignore
DataSet<String> wordsToIgnore = ...

data.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {

    // A collection to store words. This will be stored in memory
    // of a task manager
    Collection<String> wordsToIgnore;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read a collection of words to ignore
        wordsToIgnore = getRuntimeContext().getBroadcastVariable("wordsToIgnore");
    }


    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words)
            // Emit only the words that are not in the ignore list
            if (!wordsToIgnore.contains(word))
                out.collect(new Tuple2<>(word, 1));
    }
    // Pass a dataset via a broadcast variable
}).withBroadcastSet(wordsToIgnore, "wordsToIgnore");

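The data set to broadcast, wordsToIgnore, is defined at the top of the snippet, and the withBroadcastSet call at the end attaches it to the function that will receive it.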

Broadcast variables are kept in TaskManager (TM) memory, which is limited, so they should not be used for very large data sets.
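Since the broadcast data sits in memory anyway, it can pay to index it once. getBroadcastVariable returns a List, so contains is a linear scan per word. The plain-Java sketch below (hypothetical names IgnoreWordsSketch and keepWords, no Flink classes) copies the words into a HashSet and keeps only the words that are not in the ignore list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for the flatMap body; hypothetical names, no Flink classes.
class IgnoreWordsSketch {

    // Returns the words of a line that are not in the ignore collection.
    static List<String> keepWords(String line, Collection<String> broadcastWords) {
        // In a Flink job this copy would be made once in open(), not once per line
        Set<String> ignore = new HashSet<>(broadcastWords);
        List<String> kept = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            // split can produce an empty first token when the line starts with punctuation
            if (!word.isEmpty() && !ignore.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ignore = Arrays.asList("the", "and");
        System.out.println(keepWords("the cat and the dog", ignore)); // prints [cat, dog]
    }
}
```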

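How can larger data be shared, then? Flink also provides a file caching mechanism, shown below.

Using the distributed cache

First, the cached file must be registered when the job graph (DAG) is defined: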

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Register a file from HDFS
env.registerCachedFile("hdfs:///path/to/file", "machineLearningModel");

...

env.execute();

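Flink can also register a local file, but in general a file on distributed storage such as HDFS is recommended. The file is registered under a name.

Using it is simple: retrieve the file in the open method of a rich function.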

class MyClassifier extends RichMapFunction<String, Integer> {

    @Override
    public void open(Configuration config) {
      File machineLearningModel = getRuntimeContext().getDistributedCache().getFile("machineLearningModel");
      ...
    }

    @Override
    public Integer map(String value) throws Exception {
      ...
    }
}

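The code above omits the processing of the file contents.

In all of the approaches above the parameters are essentially static and do not change. What if the parameters themselves change over time?

In that case, use ConnectedStreams, which in effect combines two streams.

Using ConnectedStreams

ConnectedStreams requires a second, dynamic stream. For example, besides the main data there may be rule data that is published through a RESTful service. If the main data comes from Kafka, the job can be written as follows: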

DataStreamSource<String> input = (DataStreamSource<String>) KafkaStreamFactory
                .getKafka08Stream(env, srcCluster, srcTopic, srcGroup);

DataStream<Tuple2<String, String>> appkeyMeta = env.addSource(new AppKeySourceFunction(), "appkey");

ConnectedStreams<String, Tuple2<String, String>> connectedStreams = input.connect(appkeyMeta.broadcast());

DataStream<String> cleanData = connectedStreams.flatMap(new DataCleanFlatMapFunction());

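The code above does four things. It defines the stream that reads the main data (input), then the stream that reads the rule data (appkeyMeta), where AppKeySourceFunction implements the logic for reading from the RESTful service. It then broadcasts the rule data to the main data with connect(appkeyMeta.broadcast()), and finally obtains the processed data from the connected streams through flatMap. The key piece is DataCleanFlatMapFunction.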

public class DataCleanFlatMapFunction extends RichCoFlatMapFunction<String, Tuple2<String, String>, String> {

    // Receives records from the main data stream
    @Override
    public void flatMap1(String s, Collector<String> collector) {...}

    // Receives records from the broadcast rule stream
    @Override
    public void flatMap2(Tuple2<String, String> s, Collector<String> collector) {...}
}

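This is an abbreviated version of the code. The key is the first line: the function must extend the abstract class RichCoFlatMapFunction. Within the class, flatMap2 receives the rule data and flatMap1 receives the main data.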
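How the two callbacks cooperate can be simulated in plain Java. The sketch below is not a Flink operator; CoFlatMapSketch, onData and onRule are hypothetical names, and the rule format (appkey mapped to a tag) is an assumption made only for this illustration. onRule plays the role of flatMap2 and updates the rules held by the function, while onData plays the role of flatMap1 and applies whatever rules are current.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the two-input pattern; hypothetical names, not a Flink operator.
class CoFlatMapSketch {

    // Latest rules seen so far (assumed format: appkey -> tag)
    private final Map<String, String> rules = new HashMap<>();
    // Stands in for the Collector of the real function
    private final List<String> output = new ArrayList<>();

    // Role of flatMap1: a record from the main stream arrives
    void onData(String appkey) {
        String tag = rules.get(appkey);
        if (tag != null) {
            output.add(appkey + ":" + tag);
        }
        // Records without a known rule are dropped in this sketch
    }

    // Role of flatMap2: a rule update arrives
    void onRule(String appkey, String tag) {
        rules.put(appkey, tag);
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        CoFlatMapSketch op = new CoFlatMapSketch();
        op.onData("app1");          // no rule yet, dropped
        op.onRule("app1", "games");
        op.onData("app1");          // emits app1:games
        op.onRule("app1", "news");  // the rule changes while the job runs
        op.onData("app1");          // emits app1:news
        System.out.println(op.output()); // prints [app1:games, app1:news]
    }
}
```

The first onData call shows a point worth planning for: the arrival order of the two inputs is not guaranteed, so main records can arrive before the rule they need. Dropping them, as here, is only one possible policy; buffering them until a rule arrives is another.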

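Using accumulators

Parameters flow from the client to the tasks, but sometimes data also needs to flow back from the tasks to the client. Accumulators are generally used for this.

First, a simple example that counts words and also counts the number of lines processed: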

DataSet<String> lines = ...

// Word count algorithm
lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            out.collect(new Tuple2<>(word, 1));
        }
    }
})
.groupBy(0)
.sum(1)
.print();

// Count a number of lines in the text to process
long linesCount = lines.count();
System.out.println(linesCount);

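In the code above, the pipeline that ends in print() computes the word counts and the lines.count() call counts the input lines. Unfortunately this produces two jobs: that single count() call triggers a job of its own, which is clearly inefficient.

Flink provides accumulators to send data back, that is, from the TaskManagers (TM) to the JobManager (JM).

Flink ships with several built-in accumulators: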

IntCounter, LongCounter, DoubleCounter – allows summing together int, long, double values sent from task managers
AverageAccumulator – calculates an average of double values
LongMaximum, LongMinimum, IntMaximum, IntMinimum, DoubleMaximum, DoubleMinimum – accumulators to determine maximum and minimum values for different types
Histogram – used to compute the distribution of values from task managers

First define an accumulator, then register it in a user-defined function. The client can then retrieve its value.

lines.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {

    // Create an accumulator
    private IntCounter linesNum = new IntCounter();

    @Override
    public void open(Configuration parameters) throws Exception {
        // Register accumulator
        getRuntimeContext().addAccumulator("linesNum", linesNum);
    }

    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            out.collect(new Tuple2<>(word, 1));
        }

        // Increment after each line is processed
        linesNum.add(1);
    }
})
.groupBy(0)
.sum(1)
.print();

// Get accumulator result
int linesNum = env.getLastJobExecutionResult().getAccumulatorResult("linesNum");
System.out.println(linesNum);

If the built-in accumulators are not enough, a custom accumulator can be written by implementing one of two interfaces: Accumulator or SimpleAccumulator.
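The idea behind a custom accumulator is that each parallel task keeps a local partial result and the partial results are merged into one value for the client. The sketch below illustrates this in plain Java. It borrows the method names add, getLocalValue, merge and resetLocal from Flink's Accumulator interface, but it deliberately does not implement that interface, and the class name MaxLineLengthSketch is hypothetical.

```java
// Plain-Java stand-in for the accumulator idea; it does NOT implement Flink's interfaces.
class MaxLineLengthSketch {

    private int max = 0;

    // Called by a task for every value it sees
    void add(int value) {
        max = Math.max(max, value);
    }

    // The partial result held by one parallel task
    int getLocalValue() {
        return max;
    }

    // Folds the partial result of another task into this one
    void merge(MaxLineLengthSketch other) {
        max = Math.max(max, other.max);
    }

    // Clears the local state
    void resetLocal() {
        max = 0;
    }

    public static void main(String[] args) {
        MaxLineLengthSketch task1 = new MaxLineLengthSketch();
        MaxLineLengthSketch task2 = new MaxLineLengthSketch();
        task1.add(3);
        task1.add(7);
        task2.add(9);
        task2.add(2);
        // Merging the per-task results gives the job-wide maximum
        task1.merge(task2);
        System.out.println(task1.getLocalValue()); // prints 9
    }
}
```

The merge step only gives a correct result when the operation does not depend on the order in which partial results are combined, as with maximum, minimum or sum.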

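The sections above cover several ways to pass parameters. In day-to-day use they are often combined, for example passing an HDFS path through ParameterTool and then reading the file through the distributed cache.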
