Flink DataSet operators: a complete guide
Overview
DataSet is mainly used for batch processing.
This chapter covers the DataSet operators. Take care to distinguish them from DataStream operators; for those, see:
DataStream API operators
Worth noting: the DataSet API will be dropped in a future version. The official docs recommend replacing the DataSet API entirely with the Table API, unifying stream and batch processing, instead of today's split where streaming uses DataStream and Table while batch uses DataSet.
1.ExecutionEnvironment.getExecutionEnvironment(): the entry point of the DataSet API; sources created from it are DataSet objects
2.StreamExecutionEnvironment.getExecutionEnvironment(): the entry point of the DataStream API; sources created from it are DataStream objects
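A quick illustration of the two entry points (class and variable names here are just for demonstration):
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class envDemo {
    public static void main(String[] args) throws Exception {
        // batch entry point: sources created from it are DataSets
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> dataSet = batchEnv.fromElements("a", "b");
        // streaming entry point: sources created from it are DataStreams
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStream = streamEnv.fromElements("a", "b");
    }
}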
1.map
Same as map on DataStream; nothing more to add.
2.mapPartition
Much like map, except that map is called once per record while mapPartition is called once per partition and receives an iterator over that partition's records. It can be more efficient than map because the data is handled in bulk rather than one element per call; a clear advantage is that per-call overhead (for example, opening a connection once per partition instead of once per record) is reduced.
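A minimal sketch of mapPartition (the doubling logic is just an example):
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;
public class mapPartitionDemo {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3, 4, 5)
                .mapPartition(new MapPartitionFunction<Integer, Integer>() {
                    @Override
                    public void mapPartition(Iterable<Integer> values, Collector<Integer> out) throws Exception {
                        // called once per partition; values iterates all records of the partition
                        for (Integer v : values) {
                            out.collect(v * 2); // the same per-element logic map would apply
                        }
                    }
                }).print();
    }
}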
3.filter
Same as filter on DataStream; nothing more to add.
4.flatmap
Same as flatMap on DataStream; nothing more to add.
5.groupBy
Groups the data by some field of the elements so that subsequent operators can work on each group; groupBy is therefore almost always combined with another operator. Typical uses: grouped sum, grouped max, grouped min.
The DataSet API offers four ways to specify the grouping key:
- groupBy(KeySelector<T, K> keyExtractor) // a custom key extractor that derives the key from the element's own fields
- groupBy(int... fields) // field index positions; the varargs allow grouping by several fields
- groupBy(String... fields) // field names; the varargs allow grouping by several fields
- Case Class fields (Scala case classes only) // reference a field of a case class
Three ways of writing a grouped sum are shown below (the first two are in the comments; a KeySelector variant follows the output):
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple2;
public class groupbyDemo {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
                Tuple2.of("a", 1L),
                Tuple2.of("b", 2L),
                Tuple2.of("a", 3L),
                Tuple2.of("b", 4L),
                Tuple2.of("c", 5L)
        );
        // way 1: dataSet.groupBy(0).sum(1).print(); // group, then sum within each group
        // way 2: dataSet.groupBy(0).aggregate(Aggregations.SUM, 1).print(); // same as way 1
        // way 3: group, then reduce within each group:
        dataSet.groupBy(0).reduce(new ReduceFunction<Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) throws Exception {
                return Tuple2.of(value1.f0, value1.f1 + value2.f1);
            }
        }).print();
    }
}
Result:
(a,4)
(b,6)
(c,5)
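The KeySelector variant from the list above, applied to the same data, might look like this (requires import org.apache.flink.api.java.functions.KeySelector):
dataSet.groupBy(new KeySelector<Tuple2<String, Long>, String>() {
    @Override
    public String getKey(Tuple2<String, Long> value) throws Exception {
        return value.f0; // derive the grouping key from the first field
    }
}).sum(1).print(); // same grouped sum as above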
6.reduce
Combines the first two elements into a new element according to the given logic, then combines that result with the next element in the same way, and so on, until a single element is left.
A sum example:
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.ReduceOperator;
public class reduceDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Integer> text = env.fromElements(1, 2, 3, 4);
ReduceOperator<Integer> reduceer = text.reduce(new ReduceFunction<Integer>() {
@Override
public Integer reduce(Integer value1, Integer value2) throws Exception {
return value1 + value2;
}
});
reduceer.print();
}
}
Result: 10
7.reduceGroup
As its name suggests, reduceGroup processes each group after a grouping; on its own it is of little use, so it is combined with groupBy. The difference from reduce: reduce is called with two elements at a time, while reduceGroup is called once per group and receives all of that group's elements through an iterator.
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import java.util.HashSet;
import java.util.Set;
public class reduceGroupDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, String>> dataSet = env.fromElements(
Tuple2.of("a", "<a0>"),
Tuple2.of("b", "<b01>"),
Tuple2.of("a", "<a02>"),
Tuple2.of("b", "<b02"),
Tuple2.of("c", "<c01>")
);
dataSet.groupBy(0).reduceGroup(new GroupReduceFunction<Tuple2<String, String>, Tuple2<String,String>>() {
@Override
public void reduce(Iterable<Tuple2<String, String>> values, Collector<Tuple2<String, String>> out) throws Exception {
Set<String> uniqStrings = new HashSet<String>();
String key = null;
// add all strings of the group to the set
for (Tuple2<String, String> t : values) {
key = t.f0;
uniqStrings.add(t.f1);
}
// emit all unique strings.
out.collect(new Tuple2<String, String>(key, uniqStrings.toString()));
}
}).print();
}
}
Output:
(a,[<a0>, <a02>])
(b,[<b02, <b01>])
(c,[<c01>])
8.Aggregate
Aggregation here is not the same thing as grouping; be clear about the difference. groupBy only partitions the data and defines no computation over each group, while Aggregate defines the computation applied to the data. Typical uses: the max, min, or sum of the aggregated data. So the two are usually combined: groupBy defines the groups, Aggregate defines the per-group logic.
The Java source of aggregate is below. It takes two parameters: an Aggregations aggregator and the position of the field to aggregate:
public AggregateOperator<T> aggregate(Aggregations agg, int field) {
return new AggregateOperator<>(this, agg, field, Utils.getCallLocationName());
}
The Aggregations aggregator already implements max, min, and sum for us. Its source:
public enum Aggregations {
    SUM(new SumAggregationFunction.SumAggregationFunctionFactory()),
    MIN(new MinAggregationFunction.MinAggregationFunctionFactory()),
    MAX(new MaxAggregationFunction.MaxAggregationFunctionFactory());
    // --------------------------------------------------------------------------------------------
    private final AggregationFunctionFactory factory;
    private Aggregations(AggregationFunctionFactory factory) {
        this.factory = factory;
    }
    // note getFactory(): through the factory you could in principle supply your own
    // aggregation logic, e.g. keep only even numbers when summing and drop the odd ones;
    // rarely needed, but available if you want tailor-made behavior
    public AggregationFunctionFactory getFactory() {
        return this.factory;
    }
}
8.1 Example 1
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple2;
public class AggregateDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("a", 1L),
Tuple2.of("b", 2L),
Tuple2.of("a", 3L),
Tuple2.of("b", 4L),
Tuple2.of("c", 5L)
);
dataSet.aggregate(Aggregations.SUM,1).print();
dataSet.groupBy(0).aggregate(Aggregations.SUM,1).print();
}
}
8.2 Example 2: chaining with and()
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple3;
public class AggregateAnd {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String,Integer,Integer>> dataSet = env.fromElements(
Tuple3.of("鸡蛋",1,4),
Tuple3.of("鸡蛋",3,5),
Tuple3.of("鸡蛋",2,6)
);
dataSet.aggregate(Aggregations.SUM,1).and(Aggregations.MAX,2).and(Aggregations.MIN,2).print();
}
}
Result:
(鸡蛋,6,4)
Note that and(Aggregations.MAX,2) and and(Aggregations.MIN,2) both target field 2, so the later registration wins: field 2 ends up holding the minimum (4), while field 1 holds the sum (6).
9.Aggregate: complex aggregations
9.1 The limits of aggregate's per-group computation
Opening the source, aggregate has only one signature:
- .aggregate(Aggregations agg, int field)
Its Aggregations parameter:
public enum Aggregations {
    SUM(new SumAggregationFunction.SumAggregationFunctionFactory()),
    MIN(new MinAggregationFunction.MinAggregationFunctionFactory()),
    MAX(new MaxAggregationFunction.MaxAggregationFunctionFactory());
}
See: it is an enum offering only SUM, MIN, and MAX, with no complex aggregation at all.
That means aggregate has exactly three uses: per-group sum, per-group max, per-group min.
What if you need something richer, say excluding part of the data while summing?
Good question. We know that aggregate's logic runs within each group, so all we need is another
method whose computation also runs within each group, and reduceGroup is exactly that method.
9.2 A complex aggregation, implemented
Requirement: group the products by name and compute, per product, the total number of units sold together with the highest and lowest price over the whole sales period.
public class Prodect{
private String name;//product name
private Integer count;//units sold
private Double price;//unit price
public Prodect(String name, Integer count, Double price) {
this.name = name;
this.count = count;
this.price = price;
}
public Prodect() {
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Integer getCount() {
return count;
}
public void setCount(Integer count) {
this.count = count;
}
public Double getPrice() {
return price;
}
public void setPrice(Double price) {
this.price = price;
}
@Override
public String toString() {
return "Prodect{" +
"name='" + name + '\'' +
", count=" + count +
", price=" + price +
'}';
}
}
import com.pg.flink.dataSet.enty.Prodect;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;
public class AggregateMoreFieldsDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Prodect> dataSet = env.fromElements(
new Prodect("鸡蛋",1,2.1),
new Prodect("鸡蛋",3,2.3),
new Prodect("鸡蛋",2,2.2),
new Prodect("牛奶",1,2.6),
new Prodect("牛奶",2,2.9),
new Prodect("牛奶",3,2.8)
);
dataSet.groupBy(Prodect::getName).sortGroup(Prodect::getPrice, Order.ASCENDING).reduceGroup(new GroupReduceFunction<Prodect, String>() {
@Override
public void reduce(Iterable<Prodect> values, Collector<String> out) throws Exception {
String name = "";//product name
int total = 0;//total units sold
double max_price = 0.0;//highest price reached during the sales period
double min_price = 0.0;//lowest price reached during the sales period
int index = 1;//index of the current element within the group
for (Prodect p:values){
//the group is sorted by price ascending, so the first product has the lowest price and the last one the highest.
max_price = p.getPrice();
name = p.getName();
total += p.getCount();
if(index == 1){
min_price = p.getPrice();
}
index +=1;
}
out.collect(String.format("name=%s,total=%s,min_price=%s,max_price=%s",name,total,min_price,max_price ));
}
}).print();
//
}
}
Result:
name=鸡蛋,total=6,min_price=2.1,max_price=2.3
name=牛奶,total=6,min_price=2.6,max_price=2.9
Tip: if the dataSet.groupBy(Prodect::getName) style is unclear, see:
Flink custom KeySelector
10.join
Similar to an inner join: rows that find no match are dropped. See the demo below.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
public class joinDemo {
public static void main(String[] args) throws Exception{
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 1L),
Tuple2.of("李四", 2L),
Tuple2.of("张三", 3L),
Tuple2.of("33",4L)
);
DataSet<Tuple2<String, String>> dataSet2 = env.fromElements(
Tuple2.of("张三", "北京"),
Tuple2.of("李四", "上海"),
Tuple2.of("张三", "中国")
);
dataSet.join(dataSet2).where(0).equalTo(0).print();
}
}
Result:
((张三,1),(张三,北京))
((张三,3),(张三,北京))
((张三,1),(张三,中国))
((张三,3),(张三,中国))
((李四,2),(李四,上海))
The counterparts of MySQL's left join and right join are:
leftOuterJoin and rightOuterJoin
11.rightOuterJoin
rightOuterJoin is used together with with(), odd as this style may look.
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
public class rightOuterJoinDemo {
public static void main(String[] args) throws Exception{
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 1L),
Tuple2.of("李四", 2L),
Tuple2.of("张三", 3L),
Tuple2.of("33",4L)
);
DataSet<Tuple2<String, String>> dataSet2 = env.fromElements(
Tuple2.of("张三", "北京"),
Tuple2.of("李四", "上海"),
Tuple2.of("张三", "中国"),
Tuple2.of("88","99")
);
dataSet.rightOuterJoin(dataSet2).where(0).equalTo(0).with(new JoinFunction<Tuple2<String, Long>, Tuple2<String, String>, String>() {
@Override
public String join(Tuple2<String, Long> first, Tuple2<String, String> second) throws Exception {
if(first!=null){//first must be null-checked here or the job throws; a pity the outer join does not handle this internally.
return first.toString() + second.toString();
}else {
return second.toString();
}
}
}).print();
}
}
Result:
(88,99)
(张三,1)(张三,北京)
(张三,3)(张三,北京)
(张三,1)(张三,中国)
(张三,3)(张三,中国)
(李四,2)(李四,上海)
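For reference, leftOuterJoin is symmetric: unmatched rows from the left side survive, so it is second that can be null instead. A sketch against the same two data sets (same JoinFunction import, not run here):
dataSet.leftOuterJoin(dataSet2).where(0).equalTo(0).with(new JoinFunction<Tuple2<String, Long>, Tuple2<String, String>, String>() {
    @Override
    public String join(Tuple2<String, Long> first, Tuple2<String, String> second) throws Exception {
        if (second != null) { // unmatched left rows arrive with second == null
            return first.toString() + second.toString();
        } else {
            return first.toString();
        }
    }
}).print();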
12.coGroup
A.coGroup(B)
.where(0)//choose the key of A
.equalTo(1)//choose the key of B; equal keys land in the same group
.with(new CoGroupFunction<String, String, String>() {
public void coGroup(Iterable<String> in1, Iterable<String> in2, Collector<String> out) {
out.collect(...);
}
});
coGroup first connects the two data sets A and B, groups them by the chosen keys, and hands each group to coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out). The parameters are two iterators: the first carries A's elements, the second B's. Note that the iterators do not carry all the data, only the elements belonging to the current key's group.
Below, coGroup is used to implement an inner join:
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class coGroup {
public static void main(String[] args) throws Exception{
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 1L),
Tuple2.of("抛弃", 1000L),
// Tuple2.of("张三", 66L),
Tuple2.of("李四", 2L)
);
DataSet<Tuple2<String, Long>> dataSet2 = env.fromElements(
Tuple2.of("张三", 8L),
Tuple2.of("李四", 9L),
Tuple2.of("张三", 999999999L),
Tuple2.of("被抛弃", 999999999L)
);
dataSet.coGroup(dataSet2).where(0).equalTo(0).with(new CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, Long>> first, Iterable<Tuple2<String, Long>> second, Collector<String> out) throws Exception {
// HashMap<String,Long> map = new HashMap<>();
for(Tuple2<String,Long> aaa:first){
for(Tuple2<String,Long> bbb:second){
out.collect(aaa.toString()+"@@@@@@@@@@@@@@@"+bbb.toString());
}
}
}
}).print();
}
}
Result:
(张三,1)@@@@@@@@@@@@@@@(张三,8)
(张三,1)@@@@@@@@@@@@@@@(张三,999999999)
(李四,2)@@@@@@@@@@@@@@@(李四,9)
Note: this nested-loop implementation fails when the first data set contains duplicate keys (try un-commenting Tuple2.of("张三", 66L) above), while join does not. The likely cause is not coGroup itself: the iterables handed to coGroup can typically be traversed only once, so a second element in first makes the inner loop traverse second again, which throws. This detail is easy to miss.
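If that explanation holds, the workaround is to materialize one side before nesting the loops. A sketch of the coGroup body (assuming java.util.ArrayList and java.util.List are imported):
// cache the second side so it can be traversed once per element of first
List<Tuple2<String, Long>> cached = new ArrayList<>();
for (Tuple2<String, Long> b : second) {
    cached.add(b);
}
for (Tuple2<String, Long> a : first) {
    for (Tuple2<String, Long> b : cached) {
        out.collect(a.toString() + "@@@@@@@@@@@@@@@" + b.toString());
    }
}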
13.cross (Cartesian product)
import org.apache.flink.api.common.functions.CrossFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
public class crossDemo {
public static void main(String[] args) throws Exception{
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 1L),
Tuple2.of("抛弃", 1000L),
Tuple2.of("李四", 2L)
);
DataSet<Tuple2<String, Long>> dataSet2 = env.fromElements(
Tuple2.of("张三", 8L),
Tuple2.of("李四", 9L)
);
dataSet.cross(dataSet2).with(new CrossFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public String cross(Tuple2<String, Long> val1, Tuple2<String, Long> val2) throws Exception {
return val1.toString()+val2.toString();
}
}).print();
}
}
Result:
(张三,1)(张三,8)
(张三,1)(李四,9)
(抛弃,1000)(张三,8)
(抛弃,1000)(李四,9)
(李四,2)(张三,8)
(李四,2)(李四,9)
14. project
Keeps the fields at the given index positions and drops all the others:
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
public class projectDemo {
public static void main(String[] args) throws Exception{
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long,Long>> dataSet = env.fromElements(
Tuple3.of("张三", 1L,9L),
Tuple3.of("抛弃", 1000L,9L),
Tuple3.of("李四", 2L,9L)
);
dataSet.project(0,2).print();//keep fields 0 and 2
}
}
Output:
(张三,9)
(抛弃,9)
(李四,9)
15.reduceGroup and sorting
Sometimes the per-group processing after groupBy needs its input in a certain order. In that case you can sort each group first and then call reduceGroup; the iterator handed to reduceGroup then yields the group's records already sorted.
For example, below: for each name, take the records with the largest and the smallest age:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;
public class reduceGroupSortDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple3<String, Long,String>> dataSet = env.fromElements(
Tuple3.of("张三", 3L,"北京"),
Tuple3.of("张三", 1L,"上海"),
Tuple3.of("张三", 2L,"天津"),
Tuple3.of("李四", 5L,"南京"),
Tuple3.of("李四", 8L,"武汉"),
Tuple3.of("李四", 10L,"郑州")
);
//for each name, take the records with the largest and smallest age
dataSet.groupBy(0).sortGroup(1, Order.DESCENDING).reduceGroup(new GroupReduceFunction<Tuple3<String, Long,String>, String>() {
@Override
public void reduce(Iterable<Tuple3<String, Long,String>> values, Collector<String> out) throws Exception {
Integer index = 1;
Tuple3<String,Long,String> max = null;
Tuple3<String,Long,String> min = null;
// the group is sorted descending, so the first record is the maximum and the last one the minimum
for (Tuple3<String, Long,String> t : values) {
if (index == 1){
max = t;
}
min = t;
index +=1;
}
// emit the max and min of this group
out.collect("[max"+max.toString()+""+" min"+min.toString()+"]");
}
}).print();
}
}
Result:
[max(张三,3,北京) min(张三,1,上海)]
[max(李四,10,郑州) min(李四,5,南京)]
16.combineGroup
combineGroup and reduceGroup are so similar that many people cannot tell them apart: both are used after groupBy, and the code can even be identical. Time to clear this up. The key difference: combineGroup is a partial, local pre-aggregation that may run on subsets of a group before any shuffle, so it is not guaranteed to see a whole group; reduceGroup always receives the complete group after a full shuffle. (A diagram in the original post illustrates this.)
Now the code:
import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.GroupCombineOperator;
import org.apache.flink.api.java.operators.GroupReduceOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class reduceGroupCombine {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 3L),
Tuple2.of("张三", 1L),
Tuple2.of("张三", 2L),
Tuple2.of("李四", 5L),
Tuple2.of("李四", 8L),
Tuple2.of("李四", 10L)
);
//step 1: combineGroup computes partial per-name sums locally
GroupCombineOperator<Tuple2<String, Long>, Tuple2<String, Long>> combinedWords = dataSet.groupBy(0).combineGroup(new GroupCombineFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
@Override
public void combine(Iterable<Tuple2<String, Long>> values, Collector<Tuple2<String, Long>> out) throws Exception {
Long sum = 0L;
String name = null;
for (Tuple2<String, Long> t : values) {
name = t.f0;
sum += t.f1;
}
out.collect(Tuple2.of(name,sum));
}
});
GroupReduceOperator<Tuple2<String, Long>, Tuple2<String, Long>> out = combinedWords.groupBy(0).reduceGroup(new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
@Override
public void reduce(Iterable<Tuple2<String, Long>> values, Collector<Tuple2<String, Long>> out) throws Exception {
Long sum = 0L;
String name = null;
for (Tuple2<String, Long> t : values) {
name = t.f0;
sum += t.f1;
}
out.collect(Tuple2.of(name,sum));
}
});
out.print();
}
}
Result:
(张三,6)
(李四,23)
16.1 A second way to write combineGroup
In the code above, combineGroup and reduceGroup are written separately; there is also a combined form, shown below:
import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class reduceGroupCombine2 {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<String, Long>> dataSet = env.fromElements(
Tuple2.of("张三", 3L),
Tuple2.of("张三", 1L),
Tuple2.of("张三", 2L),
Tuple2.of("李四", 5L),
Tuple2.of("李四", 8L),
Tuple2.of("李四", 10L)
);
dataSet.groupBy(0).combineGroup(new MyCombinableGroupReducer()).print();
// or: dataSet.groupBy(0).reduceGroup(new MyCombinableGroupReducer()).print();
}
}
class MyCombinableGroupReducer implements GroupReduceFunction<Tuple2<String, Long>, String>, GroupCombineFunction<Tuple2<String, Long>, Tuple2<String, Long>>{
@Override
public void combine(Iterable<Tuple2<String, Long>> values, Collector<Tuple2<String, Long>> out) throws Exception {
Long sum = 0L;
String name = null;
for (Tuple2<String, Long> t : values) {
name = t.f0;
sum += t.f1;
}
out.collect(Tuple2.of(name,sum));
}
@Override
public void reduce(Iterable<Tuple2<String, Long>> values, Collector<String> out) throws Exception {
Long sum = 0L;
String name = null;
for (Tuple2<String, Long> t : values) {
name = t.f0;
sum += t.f1;
}
out.collect(Tuple2.of(name,sum).toString());
}
}
Result:
(张三,6)
(李四,23)
Note: with this combined style, the GroupCombineFunction's generic input and output types must both equal the GroupReduceFunction's generic input type. The separate style of section 16 is decoupled, meaning the combine step's output type is free there. Try both variants for yourself.
17.max/min/sum methods
Note: these are the max/min/sum methods, not the SUM/MAX/MIN aggregators introduced under aggregate.
Where "arbitrary" is mentioned below, it means no guarantee whatsoever is made about the returned value.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple4;
public class maxMinSumDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple4<String,Integer,Integer,Integer>> dataSet = env.fromElements(
Tuple4.of("鸡蛋",9,8,44),
Tuple4.of("鸡蛋",2,3,55),
Tuple4.of("鸡蛋",1,6,66),
Tuple4.of("香蕉",21,22,23),
Tuple4.of("香蕉",16,17,18),
Tuple4.of("香蕉",81,82,83)
);
UnsortedGrouping<Tuple4<String,Integer,Integer,Integer>> group = dataSet.groupBy(0);
group.min(1).print();
group.max(1).print();
group.sum(1).print();
}
}
Result (min(1), then max(1), then sum(1)):
(香蕉,16,82,83)
(鸡蛋,1,6,66)
(香蕉,81,82,83)
(鸡蛋,9,6,66)
(香蕉,118,82,83)
(鸡蛋,12,6,66)
Apart from the column selected by min(1)/max(1)/sum(1), whose value is guaranteed, the remaining fields are arbitrary and carry no meaning.
18. maxBy/minBy
max/min only guarantee the value of the selected field; the other fields are arbitrary, i.e. the operator does not return a specific row. To get back the whole row that carries a field's max or min, use maxBy/minBy, which preserve row consistency.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple4;
public class maxByMinByDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple4<String,Integer,Integer,Integer>> dataSet = env.fromElements(
Tuple4.of("鸡蛋",9,8,44),
Tuple4.of("鸡蛋",2,3,55),
Tuple4.of("鸡蛋",1,6,66),
Tuple4.of("香蕉",21,22,23),
Tuple4.of("香蕉",16,17,18),
Tuple4.of("香蕉",81,82,83)
);
UnsortedGrouping<Tuple4<String,Integer,Integer,Integer>> group = dataSet.groupBy(0);
group.minBy(1).print();
group.maxBy(1).print();
}
}
Result:
(香蕉,16,17,18)
(鸡蛋,1,6,66)
(香蕉,81,82,83)
(鸡蛋,9,8,44)
19.iterate
Iteration is commonly used for model training: the output of one pass becomes the input of the next (a bit like reduce), until the configured maximum number of iterations is reached, or until the termination condition we set is met.
19.1 Bulk iteration
The demo below is a bulk iteration: every pass consumes the entire output of the previous pass, regardless of whether some elements already satisfy the termination condition.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;
public class iterate {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> integerDataSource = env.fromElements(6, 7, 10);
        // maximum number of iterations: 100
        IterativeDataSet<Integer> iterativeDataSet = integerDataSource.iterate(100);
        // iteration body
        DataSet<Integer> iterativeBody = iterativeDataSet.map(x -> x - 1);
        // termination criterion: iterate as long as this data set is non-empty,
        // i.e. as long as any element is still greater than 0
        DataSet<Integer> terminationCriterion = iterativeDataSet.filter(x -> (x > 0));
        // wire up the body and the termination criterion
        DataSet<Integer> result = iterativeDataSet.closeWith(iterativeBody, terminationCriterion);
        result.print();
    }
}
Note: one extra pass runs after the condition is met. You would expect -4, -3, 0, but the actual result is -5, -4, -1.
The likely reason: the criterion above is derived from the iteration input (iterativeDataSet) rather than from the body's output, so the emptiness check lags one pass behind; building it from the body instead, e.g. iterativeBody.filter(x -> x > 0), should stop at -4, -3, 0.
19.2 Delta iteration
There is also delta iteration: elements that already satisfy the termination condition are kept aside (cached in the solution set), and only the remaining ones feed the next pass. Less data is moved around, so the program runs faster.
The official example code is hard to follow; a minimal sketch of the mechanics is below instead. If you have better runnable code, please leave a comment.
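A hedged sketch of iterateDelta (not the official example; the countdown logic, key position, and class name are illustrative choices): each element counts down by 1 per pass, drops out of the workset once it reaches 0, and keeps its last value in the solution set.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
public class deltaIterateSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<Long, Long>> initial = env.fromElements(
                Tuple2.of(1L, 6L), Tuple2.of(2L, 7L), Tuple2.of(3L, 10L));
        // solution set = workset = initial; at most 100 passes; key is field 0
        DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
                initial.iterateDelta(initial, 100, 0);
        // body: decrement every element still in the workset
        DataSet<Tuple2<Long, Long>> delta = iteration.getWorkset().map(
                new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> map(Tuple2<Long, Long> v) {
                        return Tuple2.of(v.f0, v.f1 - 1);
                    }
                });
        // only elements that have not reached 0 go into the next pass
        DataSet<Tuple2<Long, Long>> nextWorkset = delta.filter(
                new FilterFunction<Tuple2<Long, Long>>() {
                    @Override
                    public boolean filter(Tuple2<Long, Long> v) {
                        return v.f1 > 0;
                    }
                });
        // the delta updates the solution set by key; iteration ends when the workset is empty
        iteration.closeWith(delta, nextWorkset).print(); // each value should end at 0
    }
}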
20.zipping (assigning element IDs)
In some situations every element needs a unique identifier, much like a primary key. Flink has built-in utilities for this, in two flavors: consecutive indexes and unique (non-consecutive) IDs. It is very simple; see the code:
The Prodect POJO defined in section 9.2 is reused here.
//assigning consecutive indexes
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;
public class zipDemo {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Prodect> dataSet = env.fromElements(
new Prodect("鸡蛋",1,2.1),
new Prodect("鸡蛋",3,2.3),
new Prodect("牛奶",2,2.9),
new Prodect("牛奶",3,2.8)
);
DataSet<Tuple2<Long, Prodect>> re = DataSetUtils.zipWithIndex(dataSet);
re.print();
}
}
Output:
(0,Prodect{name='鸡蛋', count=1, price=2.1})
(1,Prodect{name='鸡蛋', count=3, price=2.3})
(2,Prodect{name='牛奶', count=2, price=2.9})
(3,Prodect{name='牛奶', count=3, price=2.8})
For unique but non-consecutive IDs, simply replace zipWithIndex with zipWithUniqueId.
Summary
The DataSet module will eventually be dropped from Flink, because the DataStream and Table/SQL APIs increasingly cover batch processing as well; switching mostly comes down to changing one execution-mode parameter.
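For example, on newer Flink versions (1.12+) a DataStream job is switched to batch execution like this:
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeExecutionMode(RuntimeExecutionMode.BATCH); // run the DataStream program with batch semantics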