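1. A DataSet join links two datasets on a key. Unless stated otherwise, a join is an inner join, analogous to INNER JOIN in SQL.
The key can be specified in any of the following ways: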
a key expression
a key-selector function
one or more field position keys (Tuple DataSet only).
case class fields (Scala API only)
2. Variants of the inner join
2.1 Default join: each pair of matching elements is returned in a Tuple2
public static class User { public String name; public int zip; }
public static class Store { public Manager mgr; public int zip; }

DataSet<User> input1 = // [...]
DataSet<Store> input2 = // [...]

// result dataset is typed as Tuple2
DataSet<Tuple2<User, Store>> result = input1.join(input2)
    .where("zip")     // key of the first input (users)
    .equalTo("zip");  // key of the second input (stores)
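To make the semantics of the default join concrete, here is a minimal plain-Java sketch that runs without Flink. It is not how Flink executes a join; it only mimics the result: every pair of elements whose keys are equal is emitted, and elements without a partner are dropped. The names `InnerJoinSketch`, `Pair` and `innerJoin` are hypothetical, the two lambdas play the role of key-selector functions, and `Store.mgr` is simplified to a `String`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class InnerJoinSketch {
    // Simplified stand-ins for the User and Store POJOs (assumption: mgr is a String here).
    static class User {
        final String name;
        final int zip;
        User(String name, int zip) { this.name = name; this.zip = zip; }
    }

    static class Store {
        final String mgr;
        final int zip;
        Store(String mgr, int zip) { this.mgr = mgr; this.zip = zip; }
    }

    // Hypothetical pair type playing the role of Tuple2.
    static class Pair<A, B> {
        final A f0;
        final B f1;
        Pair(A f0, B f1) { this.f0 = f0; this.f1 = f1; }
    }

    // Inner join of two lists on the keys extracted by the two selector functions.
    static <L, R, K> List<Pair<L, R>> innerJoin(
            List<L> left, List<R> right,
            Function<L, K> leftKey, Function<R, K> rightKey) {
        // Build phase: index the right input by its key.
        Map<K, List<R>> index = new HashMap<>();
        for (R r : right) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe phase: emit one pair per match; unmatched elements produce nothing.
        List<Pair<L, R>> out = new ArrayList<>();
        for (L l : left) {
            List<R> matches = index.getOrDefault(leftKey.apply(l), Collections.<R>emptyList());
            for (R r : matches) {
                out.add(new Pair<>(l, r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(new User("ann", 10115), new User("bob", 99999));
        List<Store> stores = Arrays.asList(
                new Store("carol", 10115), new Store("dave", 10115), new Store("erin", 20095));
        for (Pair<User, Store> p : innerJoin(users, stores, u -> u.zip, s -> s.zip)) {
            System.out.println(p.f0.name + " <-> " + p.f1.mgr);
        }
    }
}
```

In this sample data, "bob" and "erin" have no partner with the same zip, so neither appears in the result, which is the behaviour expected of an inner join.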
2.2 User-defined JoinFunction, applied with with()
// some POJO
public class Rating {
    public String name;
    public String category;
    public int points;
}

// Join function that joins a custom POJO with a Tuple
public class PointWeighter
        implements JoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

    @Override
    public Tuple2<String, Double> join(Rating rating, Tuple2<String, Double> weight) {
        // multiply the points by the weight and construct a new output tuple
        return new Tuple2<String, Double>(rating.name, rating.points * weight.f1);
    }
}

DataSet<Rating> ratings = // [...]
DataSet<Tuple2<String, Double>> weights = // [...]
DataSet<Tuple2<String, Double>> weightedRatings = ratings.join(weights)
    // key of the first input
    .where("category")
    // key of the second input
    .equalTo("f0")
    // applying the JoinFunction on joining pairs
    .with(new PointWeighter());
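The effect of a join function can be sketched in plain Java without Flink: it is called once for every matching pair and returns exactly one output element. In the sketch below, `PairJoiner`, `joinWeighted` and `tuple` are hypothetical helpers, a `Map.Entry<String, Double>` stands in for `Tuple2<String, Double>`, and a simple nested loop replaces whatever join strategy Flink would choose.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class JoinFunctionSketch {
    static class Rating {
        final String name;
        final String category;
        final int points;
        Rating(String name, String category, int points) {
            this.name = name;
            this.category = category;
            this.points = points;
        }
    }

    // Hypothetical stand-in for a join function: one output per matching pair.
    interface PairJoiner<L, R, O> {
        O join(L left, R right);
    }

    // Helper that builds a (String, Double) entry, used here in place of Tuple2.
    static Map.Entry<String, Double> tuple(String key, double value) {
        return new SimpleEntry<>(key, value);
    }

    // Joins ratings with (category, weight) entries and applies the joiner to each match.
    static List<Map.Entry<String, Double>> joinWeighted(
            List<Rating> ratings,
            List<Map.Entry<String, Double>> weights,
            PairJoiner<Rating, Map.Entry<String, Double>, Map.Entry<String, Double>> joiner) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        for (Rating rating : ratings) {
            for (Map.Entry<String, Double> weight : weights) {
                // Key of the first input: category; key of the second input: entry key (f0).
                if (rating.category.equals(weight.getKey())) {
                    out.add(joiner.join(rating, weight));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rating> ratings = Arrays.asList(
                new Rating("laptop-a", "electronics", 4),
                new Rating("novel-b", "books", 5),
                new Rating("chair-c", "furniture", 3));
        List<Map.Entry<String, Double>> weights = Arrays.asList(
                tuple("electronics", 0.5), tuple("books", 2.0));
        // Same logic as PointWeighter: multiply the points by the weight.
        List<Map.Entry<String, Double>> weighted = joinWeighted(
                ratings, weights, (r, w) -> tuple(r.name, r.points * w.getValue()));
        for (Map.Entry<String, Double> e : weighted) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The "furniture" rating has no weight, so it is dropped by the join before the function is ever called; the function only decides what each matched pair turns into.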
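2.3 Using a FlatJoinFunction. JoinFunction relates to FlatJoinFunction in the same way that MapFunction relates to FlatMapFunction: instead of returning exactly one element per matching pair, a FlatJoinFunction emits zero, one, or more elements through a Collector.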
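The map versus flat-map distinction can be illustrated outside Flink with Java streams, as a rough analogy only. In this sketch, `scaleAll` always produces one output per input (map style), while `scaleIfSignificant` produces zero or one output per input (flat-map style). Both method names are hypothetical; the 0.1 threshold used in `main` mirrors the Flink example in this section.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapVsFlatMapSketch {
    // Map style: exactly one output element for every input weight.
    static List<Double> scaleAll(List<Double> weights, int points) {
        return weights.stream()
                .map(w -> points * w)
                .collect(Collectors.toList());
    }

    // Flat-map style: an input weight may contribute nothing to the output.
    static List<Double> scaleIfSignificant(List<Double> weights, int points, double threshold) {
        return weights.stream()
                .flatMap(w -> w > threshold ? Stream.of(points * w) : Stream.<Double>empty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> weights = Arrays.asList(0.05, 0.5, 2.0);
        System.out.println(scaleAll(weights, 4).size());                // three outputs
        System.out.println(scaleIfSignificant(weights, 4, 0.1).size()); // two outputs
    }
}
```

A flat-join function offers the same freedom per matching pair, which makes it possible to filter and transform in a single step.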
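public class PointWeighter
        implements FlatJoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

    @Override
    public void join(Rating rating, Tuple2<String, Double> weight,
                     Collector<Tuple2<String, Double>> out) {
        // emit a result only for pairs whose weight exceeds 0.1
        if (weight.f1 > 0.1) {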
out.collect(new Tuple2(rating.name, rating.poi