Flink left and inner joins in Java: notes on Flink DataSet join

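1. A DataSet join associates the elements of two data sets by key. Unless stated otherwise, a join is an inner join, comparable to INNER JOIN in SQL.

Keys can be specified in any of the following ways:

a key expression

a key-selector function

one or more field position keys (Tuple DataSet only)

case class fields (Scala API only)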

2. Forms of inner join

2.1 Default join: each pair of matching elements is joined into a Tuple2

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public static class User { public String name; public int zip; }
public static class Store { public Manager mgr; public int zip; }

DataSet<User> input1 = // [...]
DataSet<Store> input2 = // [...]

// result dataset is typed as Tuple2
DataSet<Tuple2<User, Store>> result = input1.join(input2)
    .where("zip")     // key of the first input (users)
    .equalTo("zip");  // key of the second input (stores)

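2.2 User-defined JoinFunction, applied with with()

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

// some POJO
public class Rating {
    public String name;
    public String category;
    public int points;
}

// Join function that joins a custom POJO with a Tuple
public class PointWeighter
        implements JoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

    @Override
    public Tuple2<String, Double> join(Rating rating, Tuple2<String, Double> weight) {
        // multiply the points by the weight and construct a new output tuple
        return new Tuple2<String, Double>(rating.name, rating.points * weight.f1);
    }
}

DataSet<Rating> ratings = // [...]
DataSet<Tuple2<String, Double>> weights = // [...]

DataSet<Tuple2<String, Double>> weightedRatings = ratings.join(weights)
    .where("category")           // key of the first input
    .equalTo("f0")               // key of the second input
    .with(new PointWeighter());  // apply the JoinFunction to each joined pair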

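2.3 Flat-join function. JoinFunction relates to FlatJoinFunction the same way MapFunction relates to FlatMapFunction: the flat variant may emit zero, one, or more elements per joined pair.

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.util.Collector;

public class PointWeighter
        implements FlatJoinFunction<Rating, Tuple2<String, Double>, Tuple2<String, Double>> {

    @Override
    public void join(Rating rating, Tuple2<String, Double> weight,
                     Collector<Tuple2<String, Double>> out) {
        // emit a result only for weights above the threshold
        if (weight.f1 > 0.1) {
            out.collect(new Tuple2<String, Double>(rating.name, rating.points * weight.f1));
        }
    }
}

DataSet<Tuple2<String, Double>> weightedRatings = ratings.join(weights) // [...]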

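2.4 Join with projection, which builds a custom result tuple

DataSet<Tuple3<Integer, Byte, String>> input1 = // [...]
DataSet<Tuple2<Integer, Double>> input2 = // [...]

DataSet<Tuple4<Integer, String, Double, Byte>> result = input1.join(input2)
    .where(0)    // key definition on the first DataSet using a field position key
    .equalTo(0)  // key definition on the second DataSet using a field position key
    // select and reorder fields of matching tuples
    .projectFirst(0, 2).projectSecond(1).projectFirst(1);

projectFirst(int...) and projectSecond(int...) select the fields of the first and second join input that should be assembled into the output tuple. The order of the indexes defines the order of the fields in the output tuple.

Join projection also works for non-Tuple DataSets. In that case, projectFirst() or projectSecond() must be called without arguments to add the joined element itself to the output tuple.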
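To make the field-ordering rule concrete, here is a plain-Java sketch that does not use Flink at all. It mimics which fields the chain projectFirst(0, 2).projectSecond(1).projectFirst(1) picks from one matched pair. The class ProjectionSketch and its Side, Step, and project helpers are hypothetical names invented for this illustration, and rows are modeled as Object[] rather than Flink tuples.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionSketch {

    /** Which join input a projection step reads from. */
    public enum Side { FIRST, SECOND }

    /** One projectFirst(...) or projectSecond(...) call: a side plus field indexes. */
    public static final class Step {
        final Side side;
        final int[] fields;

        public Step(Side side, int... fields) {
            this.side = side;
            this.fields = fields;
        }
    }

    /** Builds the output row by appending the selected fields step by step, in call order. */
    public static List<Object> project(Object[] first, Object[] second, Step... steps) {
        List<Object> out = new ArrayList<>();
        for (Step step : steps) {
            Object[] source = (step.side == Side.FIRST) ? first : second;
            for (int index : step.fields) {
                out.add(source[index]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] first = {1, (byte) 7, "a"};  // shaped like Tuple3<Integer, Byte, String>
        Object[] second = {1, 2.5};           // shaped like Tuple2<Integer, Double>

        // Mirrors projectFirst(0, 2).projectSecond(1).projectFirst(1)
        List<Object> row = project(first, second,
                new Step(Side.FIRST, 0, 2),
                new Step(Side.SECOND, 1),
                new Step(Side.FIRST, 1));

        System.out.println(row); // field order: first.0, first.2, second.1, first.1
    }
}
```

The resulting field order (Integer, String, Double, Byte) is why the Flink result in the example is typed as a Tuple4 with those four types in that order.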

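2.5 Join with a data set size hint. The hint helps the optimizer pick the right execution strategy and so improves join performance.

DataSet<Tuple2<Integer, String>> input1 = // [...]
DataSet<Tuple2<Integer, String>> input2 = // [...]

DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> result1 =
    // hint that the second DataSet is very small
    input1.joinWithTiny(input2)
        .where(0)
        .equalTo(0);

DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> result2 =
    // hint that the second DataSet is very large
    input1.joinWithHuge(input2)
        .where(0)
        .equalTo(0);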

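2.6 Join algorithm hints. The Flink runtime can execute a join in several ways, and each of them outperforms the others under some circumstances. The system tries to pick a reasonable strategy automatically, but you can choose one manually when you want to enforce a specific way of executing the join.

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;

// SomeType and AnotherType are placeholder POJO types
DataSet<SomeType> input1 = // [...]
DataSet<AnotherType> input2 = // [...]

DataSet<Tuple2<SomeType, AnotherType>> result =
    input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
        .where("id").equalTo("key");

OPTIMIZER_CHOOSES: equivalent to giving no hint at all; the choice is left to the system.

BROADCAST_HASH_FIRST: broadcasts the first input and builds a hash table from it, which is probed by the second input. A good strategy if the first input is very small.

BROADCAST_HASH_SECOND: broadcasts the second input and builds a hash table from it, which is probed by the first input. A good strategy if the second input is very small.

REPARTITION_HASH_FIRST: the system partitions (shuffles) each input (unless the input is already partitioned) and builds a hash table from the first input. A good strategy if the first input is smaller than the second, but both inputs are still large.

Note: this is the default fallback strategy the system uses when no size estimates can be made and no pre-existing partitioning or sort order can be reused.

REPARTITION_HASH_SECOND: the system partitions (shuffles) each input (unless the input is already partitioned) and builds a hash table from the second input. A good strategy if the second input is smaller than the first, but both inputs are still large.

REPARTITION_SORT_MERGE: the system partitions (shuffles) each input (unless the input is already partitioned) and sorts each input (unless it is already sorted). The inputs are joined by a streamed merge of the sorted inputs. A good strategy if one or both inputs are already sorted.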
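The hash-based strategies above all come down to a build phase and a probe phase. The following plain-Java sketch shows only that idea on in-memory lists. It is not Flink code and it ignores broadcasting, partitioning, and spilling to disk. The names HashJoinSketch, row, and hashJoin are made up for this illustration, and rows are modeled as Map.Entry<Integer, String> key/value pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {

    /** Convenience factory for a (key, value) row. */
    public static Map.Entry<Integer, String> row(int key, String value) {
        return new SimpleEntry<>(key, value);
    }

    /**
     * Inner-joins two lists of (key, value) rows on the key.
     * The build side is loaded into a hash table; the probe side is streamed past it.
     * Each result is formatted as "buildValue|probeValue".
     */
    public static List<String> hashJoin(List<Map.Entry<Integer, String>> build,
                                        List<Map.Entry<Integer, String>> probe) {
        // Build phase: group the (ideally small) build side by key.
        Map<Integer, List<String>> table = new HashMap<>();
        for (Map.Entry<Integer, String> r : build) {
            table.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }

        // Probe phase: look up every probe row; rows without a match are dropped (inner join).
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, String> r : probe) {
            List<String> matches = table.get(r.getKey());
            if (matches == null) {
                continue;
            }
            for (String buildValue : matches) {
                result.add(buildValue + "|" + r.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> small = Arrays.asList(row(1, "DE"), row(2, "FR"));
        List<Map.Entry<Integer, String>> large = Arrays.asList(
                row(1, "alice"), row(3, "bob"), row(2, "carol"), row(1, "dave"));

        // The small list plays the role of the hash-table side.
        System.out.println(hashJoin(small, large)); // [DE|alice, FR|carol, DE|dave]
    }
}
```

Swapping the two arguments changes which side is held in memory but not which pairs match, which is why the choice between the FIRST and SECOND variants is a performance hint rather than a change in semantics.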

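3. FlatJoinFunction versus FlatMapFunction (the same reasoning applies to JoinFunction and MapFunction)

1. In practice the two can do the same job.

2. The difference in use is that a FlatJoinFunction takes two inputs (one element from each of the two joined data sets) and one output collector, whereas a FlatMapFunction takes a single input. That single input can itself be a composite value that holds both joined elements, so the end result is the same.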
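The equivalence can be illustrated without Flink. In the plain-Java sketch below, PairJoiner and FlatMapper are hypothetical interfaces that only imitate the shapes of FlatJoinFunction and FlatMapFunction, with java.util.function.Consumer standing in for Flink's Collector. The asFlatMapper adapter, also an invented helper, shows that a two-input function can always be re-expressed as a one-input function over a pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class FlatJoinVsFlatMapSketch {

    /** Hypothetical two-input shape, modeled on FlatJoinFunction. */
    public interface PairJoiner<A, B, O> {
        void join(A first, B second, Consumer<O> out);
    }

    /** Hypothetical one-input shape, modeled on FlatMapFunction. */
    public interface FlatMapper<T, O> {
        void flatMap(T value, Consumer<O> out);
    }

    /** Wraps a two-input function so that it accepts the joined pair as a single value. */
    public static <A, B, O> FlatMapper<Map.Entry<A, B>, O> asFlatMapper(PairJoiner<A, B, O> joiner) {
        return (pair, out) -> joiner.join(pair.getKey(), pair.getValue(), out);
    }

    /** Emits "name:weight" only when the weight is above 0.1, like the filter in section 2.3. */
    public static final PairJoiner<String, Double, String> WEIGH =
            (name, weight, out) -> {
                if (weight > 0.1) {
                    out.accept(name + ":" + weight);
                }
            };

    public static void main(String[] args) {
        List<String> viaJoin = new ArrayList<>();
        List<String> viaFlatMap = new ArrayList<>();

        // Two-input call, as a join function would receive the pair.
        WEIGH.join("a", 0.5, viaJoin::add);
        // One-input call, with both elements packed into a single pair value.
        asFlatMapper(WEIGH).flatMap(new SimpleEntry<>("a", 0.5), viaFlatMap::add);

        System.out.println(viaJoin.equals(viaFlatMap)); // true: both paths collect the same elements
    }
}
```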

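3.1 Example: a FlatMapFunction applied to a join result

DataSet<Long> pagesInput = // [...]
DataSet<Tuple2<Long, Long>> linksInput = // [...]
DataSet<Tuple2<Long, Double>> pagesWithRanks = // [...]
DataSet<Tuple2<Long, Long[]>> adjacencyListInput = // [...]
IterativeDataSet<Tuple2<Long, Double>> iteration = // [...]

DataSet<Tuple2<Long, Double>> newRanks = iteration.join(adjacencyListInput)
    .where(0).equalTo(0)
    .flatMap(new JoinVertexWithEdgesMatch())
    // the rest of the chain is not relevant here
    .groupBy(0)
    .aggregate(Aggregations.SUM, 1)
    .map(new Dampener(PageRank.DAMPENING_FACTOR, numPages));

public static final class JoinVertexWithEdgesMatch
        implements FlatMapFunction<Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>>, Tuple2<Long, Double>> {

    @Override
    public void flatMap(Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>> value,
                        Collector<Tuple2<Long, Double>> out) {
        Long[] neighbors = value.f1.f1;  // adjacency list from the second join input
        double rank = value.f0.f1;       // current rank from the first join input
        double rankToDistribute = rank / ((double) neighbors.length);
        for (Long neighbor : neighbors) {
            out.collect(new Tuple2<Long, Double>(neighbor, rankToDistribute));
        }
    }
}

As the example shows, the FlatMapFunction has only one input, but that input is a Tuple2 containing two nested Tuple2 values, and those two nested tuples are the elements from the two joined data sets.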

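3.2 FlatJoinFunction and JoinFunction example; both are applied with with()

DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset().join(edges)
    .where(0).equalTo(0)
    .with(new NeighborWithComponentIDJoin())
    .groupBy(0).aggregate(Aggregations.MIN, 1)
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .with(new ComponentIdFilter());

public static final class NeighborWithComponentIDJoin
        implements JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {

    @Override
    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent, Tuple2<Long, Long> edge) {
        // pass the vertex's component ID on to the edge's target vertex
        return new Tuple2<>(edge.f1, vertexWithComponent.f1);
    }
}

public static final class ComponentIdFilter
        implements FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {

    @Override
    public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> old,
                     Collector<Tuple2<Long, Long>> out) {
        // emit the candidate only if its component ID is smaller than the current one
        if (candidate.f1 < old.f1) {
            out.collect(candidate);
        }
    }
}

As the example shows, a FlatJoinFunction or JoinFunction takes two input parameters, one element from each of the two joined data sets.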

3.3 Judging from the source code, FlatJoinFunction and FlatMapFunction do not differ much

@Public
@FunctionalInterface
public interface FlatJoinFunction<IN1, IN2, OUT> extends Function, Serializable {

    /**
     * The join method, called once per joined pair of elements.
     *
     * @param first The element from first input.
     * @param second The element from second input.
     * @param out The collector used to return zero, one, or more elements.
     *
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
     *                   to fail and may trigger recovery.
     */
    void join(IN1 first, IN2 second, Collector<OUT> out) throws Exception;
}

@Public
@FunctionalInterface
public interface FlatMapFunction<T, O> extends Function, Serializable {

    /**
     * The core method of the FlatMapFunction. Takes an element from the input data set and transforms
     * it into zero, one, or more elements.
     *
     * @param value The input value.
     * @param out The collector for returning result values.
     *
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
     *                   to fail and may trigger recovery.
     */
    void flatMap(T value, Collector<O> out) throws Exception;
}

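4. Outer join, comparable to LEFT JOIN, RIGHT JOIN, and FULL JOIN in SQL

OuterJoin performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in the case of a full join) are preserved if no matching key is found on the other side. Each matching pair of elements (or one element and a null value for the other input) is given either to a JoinFunction, which turns the pair into a single element, or to a FlatJoinFunction, which turns the pair into arbitrarily many (including zero) elements.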
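The null-handling rule is easiest to see in a small plain-Java sketch of a left outer join. This is not Flink code. LeftOuterJoinSketch and leftOuterJoin are hypothetical names, and the right side is simplified to a Map, which assumes at most one rating per title.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftOuterJoinSketch {

    /**
     * Left outer join of movie titles against a title -> points lookup.
     * Every movie on the left produces exactly one output; when the right side
     * has no match, the join logic sees null and substitutes -1.
     */
    public static List<String> leftOuterJoin(List<String> movies, Map<String, Integer> pointsByTitle) {
        List<String> result = new ArrayList<>();
        for (String movie : movies) {
            Integer points = pointsByTitle.get(movie); // null when there is no matching rating
            result.add(movie + "=" + (points == null ? -1 : points));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> points = new HashMap<>();
        points.put("Alien", 9);

        // "Heat" has no rating, so it is kept with the placeholder value.
        System.out.println(leftOuterJoin(Arrays.asList("Alien", "Heat"), points)); // [Alien=9, Heat=-1]
    }
}
```

An inner join over the same data would drop the unmatched left element entirely instead of emitting it with a placeholder.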

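4.1 OuterJoin with a JoinFunction

OuterJoin calls a user-defined join function to process joined tuples. The join function receives one element of the first input DataSet and one element of the second input DataSet and returns exactly one element. Depending on the type of outer join (left, right, full), one of the two input elements of the join function can be null.

The following code performs a left outer join of a Tuple DataSet with a DataSet of custom Java objects, using field expression keys, and shows how to use a user-defined join function:

// some POJO
public class Rating {
    public String name;
    public String category;
    public int points;
}

// Join function that joins a Tuple with a custom POJO
public class PointAssigner
        implements JoinFunction<Tuple2<String, String>, Rating, Tuple2<String, Integer>> {

    @Override
    public Tuple2<String, Integer> join(Tuple2<String, String> movie, Rating rating) {
        // Assigns the rating points to the movie.
        // NOTE: rating might be null
        return new Tuple2<String, Integer>(movie.f0, rating == null ? -1 : rating.points);
    }
}

DataSet<Tuple2<String, String>> movies = // [...]
DataSet<Rating> ratings = // [...]

DataSet<Tuple2<String, Integer>> moviesWithPoints = movies.leftOuterJoin(ratings)
    .where("f0")                 // key of the first input
    .equalTo("name")             // key of the second input
    .with(new PointAssigner());  // apply the JoinFunction to each joined pair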

4.2 OuterJoin with a FlatJoinFunction

public class PointAssigner
        implements FlatJoinFunction<Tuple2<String, String>, Rating, Tuple2<String, Integer>> {

    @Override
    public void join(Tuple2<String, String> movie, Rating rating,
                     Collector<Tuple2<String, Integer>> out) {
        if (rating == null) {
            // no rating matched this movie
            out.collect(new Tuple2<String, Integer>(movie.f0, -1));
        } else if (rating.points < 10) {
            out.collect(new Tuple2<String, Integer>(movie.f0, rating.points));
        } else {
            // do not emit
        }
    }
}

DataSet<Tuple2<String, Integer>> moviesWithPoints = movies.leftOuterJoin(ratings) // [...]
