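I want to implement machine learning algorithms with Flink. As is well known, many machine learning algorithms involve loops, for example iterating until the loss drops below some threshold, so the most basic step in implementing them with Flink is learning how to write a loop in Flink.
Here is the most basic loop example, using the DataSet API: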
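import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Number of iterations
    int iterativeNum = 10;
    // Define the data set to iterate over
    IterativeDataSet<String> input = env.fromElements("").iterate(iterativeNum);
    // The loop body: the data set produced by one pass becomes the input of the next pass
    DataSet<String> iterativeBody = input.map(new RichMapFunction<String, String>() {
        @Override
        public String map(String s) throws Exception {
            // Append the current superstep number (it starts at 1)
            s = s + String.format("step: %d \n", getIterationRuntimeContext().getSuperstepNumber());
            return s;
        }
    });
    // Close the loop by connecting the iterated data set with the loop body
    DataSet<String> result = input.closeWith(iterativeBody);
    result.print();
}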
The output is:
step: 1
step: 2
step: 3
step: 4
step: 5
step: 6
step: 7
step: 8
step: 9
step: 10
[Figure: flowchart of the iteration]
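To make the feedback step concrete, here is a minimal plain-Java sketch of what a fixed-count bulk iteration does conceptually. It is only a model, not Flink code: the class `FixedIterationModel` and its `iterate` helper are hypothetical names made up for illustration, and a single `String` stands in for the DataSet.

```java
import java.util.function.BiFunction;

// Plain-Java model of a fixed-count bulk iteration (illustrative only; not Flink code).
class FixedIterationModel {

    // Applies `step` maxIterations times. The result of each pass is fed back
    // as the input of the next pass, which is the role closeWith(body) plays in Flink.
    static String iterate(String initial, int maxIterations,
                          BiFunction<String, Integer, String> step) {
        String partialSolution = initial;
        // Superstep numbers start at 1, matching the output shown above.
        for (int superstep = 1; superstep <= maxIterations; superstep++) {
            partialSolution = step.apply(partialSolution, superstep);
        }
        return partialSolution;
    }

    public static void main(String[] args) {
        String result = iterate("", 10, (s, n) -> s + String.format("step: %d \n", n));
        System.out.print(result);
    }
}
```

The real Flink job differs in that the loop body is a dataflow executed by the runtime rather than a Java `for` loop, but the data dependency between passes is the same.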
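That covers running a loop in Flink for a fixed number of iterations. Often, though, we need to end the loop based on some condition. In a machine learning program, for example, we usually decide whether to stop by checking whether the loss has converged or by looking at the validation accuracy. For loops that must end on a condition like this, Flink provides another method: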
/**
 * Closes the iteration and specifies a termination criterion. This method defines the end of
 * the iterative program part.
 *
 * <p>The termination criterion is a means of dynamically signaling the iteration to halt. It is
 * expressed via a data set that will trigger to halt the loop as soon as the data set is empty.
 * A typical way of using the termination criterion is to have a filter that filters out all
 * elements that are considered non-converged. As soon as no more such elements exist, the
 * iteration finishes.
 *
 * @param iterationResult The data set that will be fed back to the next iteration.
 * @param terminationCriterion The data set that being used to trigger halt on operation once it
 *     is empty.
 * @return The DataSet that represents the result of the iteration, after the computation has
 *     terminated.
 * @see DataSet#iterate(int)
 */
public DataSet<T> closeWith(DataSet<T> iterationResult, DataSet<?> terminationCriterion) {
    return new BulkIterationResultSet<T>(
            getExecutionEnvironment(), getType(), this, iterationResult, terminationCriterion);
}
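In other words, we can pass a terminationCriterion argument to the closeWith method. It is the condition that decides whether the loop ends: the loop halts as soon as this data set is empty. Here is a small example: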
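import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class Test {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<Integer> integerDataSource = env.fromElements(-8, -5, -1, 6, 7, 10);
        // Set the maximum number of iterations: 100
        IterativeDataSet<Integer> iterativeDataSet = integerDataSource.iterate(100);
        // Loop body
        DataSet<Integer> iterativeBody = iterativeDataSet.map(x -> x + 2);
        // Termination criterion: the loop ends once this data set is empty
        DataSet<Integer> terminationCriterion = iterativeDataSet.filter(x -> (x < 0));
        // Close the loop with the loop body and the termination criterion
        DataSet<Integer> result = iterativeDataSet.closeWith(iterativeBody, terminationCriterion);
        result.print();
    }
}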
Here I set the maximum number of iterations to 100, but the loop never gets anywhere near 100 passes: because of terminationCriterion, it stops once no element is less than 0.
There is also a nasty pitfall here. At first glance, the result of the Test program above should be:
0, 3, 7, 14, 15, 18
But the actual result is:
2, 5, 9, 16, 17, 20
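I was confused when I first saw this result. By rights, after four passes terminationCriterion should contain no element less than 0, so the loop should stop and output 0, 3, 7, 14, 15, 18. After reading the source code, I found that Flink runs iterativeBody one more time once the termination condition is met. More precisely, the criterion is checked at the end of the same superstep in which the body runs, and in this program terminationCriterion is built from iterativeDataSet (the input of the superstep) rather than from iterativeBody (its output). The first superstep whose input has no negative element is the fifth, and by the time the loop halts its body has already added another 2. Still, I find this design pretty silly; to me it totally does not make sense.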
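The effect can be reproduced with a small plain-Java model of the superstep semantics. This is a sketch based on my reading of the behavior, not Flink code: `TerminationModel`, `iterate`, and the `criterionOnOutput` flag are hypothetical names introduced only for illustration. The flag switches between checking the criterion on the input of a superstep (as the Test program does) and checking it on the output of the body.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Plain-Java model of a bulk iteration with a termination criterion (not Flink code).
class TerminationModel {

    // Runs supersteps until no checked element is "not converged" or maxIterations is reached.
    // criterionOnOutput == false: the criterion is checked on the superstep's input.
    // criterionOnOutput == true:  the criterion is checked on the body's output.
    static List<Integer> iterate(List<Integer> initial, int maxIterations,
                                 UnaryOperator<Integer> body,
                                 Predicate<Integer> notConverged,
                                 boolean criterionOnOutput) {
        List<Integer> current = initial;
        for (int superstep = 1; superstep <= maxIterations; superstep++) {
            // The body always runs as part of the superstep, before the loop can halt.
            List<Integer> next = current.stream().map(body).collect(Collectors.toList());
            List<Integer> checked = criterionOnOutput ? next : current;
            boolean criterionEmpty = checked.stream().noneMatch(notConverged);
            current = next;
            if (criterionEmpty) {
                break;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(-8, -5, -1, 6, 7, 10);
        // Criterion on the input: prints [2, 5, 9, 16, 17, 20]
        System.out.println(iterate(data, 100, x -> x + 2, x -> x < 0, false));
        // Criterion on the output: prints [0, 3, 7, 14, 15, 18]
        System.out.println(iterate(data, 100, x -> x + 2, x -> x < 0, true));
    }
}
```

If this model matches Flink's behavior, the corresponding change in the Test program would be to derive the criterion from the loop body, i.e. `iterativeBody.filter(x -> x < 0)` instead of `iterativeDataSet.filter(x -> x < 0)`, which should stop the loop after four supersteps with the originally expected result.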