DataSet transformation operators
Common DataSet transformation operators:
- Map
  - Takes one element as input and returns one element; cleaning and transformation logic can be applied in between (see the combined sketch after this list)
- FlatMap
  - Takes one element as input and returns zero, one, or more elements
- MapPartition
  - Similar to Map, but processes one whole partition at a time (if the map logic needs a third-party resource such as a connection, MapPartition is recommended)
- Filter
  - Filter function: each input element is tested and only the elements that satisfy the condition are kept
- Reduce
  - Aggregates the data by combining the current element with the value returned by the previous reduce call and returning a new value
- Aggregate
  - Built-in aggregations such as sum, max, and min
- Distinct
  - Returns the deduplicated elements of a dataset, e.g. data.distinct()
- Join
  - Inner join
- OuterJoin
  - Outer join
- Cross
  - Computes the Cartesian product of two datasets
- Union
  - Concatenates two datasets; their element types must be identical
- First-n
  - Returns the first N elements of a dataset
- Sort Partition
  - Sorts all partitions of a dataset locally; chaining sortPartition() calls sorts on multiple fields
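The basic operators above (Map, FlatMap, Filter, Reduce, Union, First-n) are not demonstrated later in this section, so here is a minimal combined sketch; the object name basicTransform and the sample data are made up for illustration:

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

object basicTransform {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val words: DataSet[String] = environment.fromElements("hadoop flink", "hadoop spark")

    // flatMap: one input line can produce multiple output words
    val splitWords: DataSet[String] = words.flatMap(line => line.split(" "))
    // map: one element in, one element out
    val pairs: DataSet[(String, Int)] = splitWords.map(word => (word, 1))
    // filter: keep only the elements that satisfy the predicate
    val noSpark: DataSet[(String, Int)] = pairs.filter(pair => pair._1 != "spark")
    // groupBy + reduce: combine the current element with the previous reduce result
    val counts: DataSet[(String, Int)] = noSpark
      .groupBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))
    counts.print()

    // union: concatenate two datasets of the same element type
    val more: DataSet[(String, Int)] = environment.fromElements(("flink", 10))
    // first(n): take the first n elements of the result
    counts.union(more).first(2).print()
  }
}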
mapPartition
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

import scala.collection.mutable.ArrayBuffer

object map {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val arrayBuffer = new ArrayBuffer[String]()
    arrayBuffer += "hadoop"
    arrayBuffer += "flink"
    val collectionDataSet: DataSet[String] = environment.fromCollection(arrayBuffer)
    // mapPartition receives an Iterator over one whole partition instead of a single element
    val resultPartition: DataSet[String] = collectionDataSet.mapPartition(eachPartition => {
      eachPartition.map(eachLine => {
        val returnValue = eachLine + " result"
        returnValue
      })
    })
    resultPartition.print()
  }
}
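As mentioned in the operator list, mapPartition is the better choice when each record needs a third-party resource such as a database connection, because the resource can be opened once per partition instead of once per element. A minimal sketch of that pattern, where FakeConnection is a hypothetical stand-in for a real client:

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

// Hypothetical stand-in for a real third-party client (e.g. a database or HTTP connection)
class FakeConnection {
  def lookup(value: String): String = value + " enriched"
  def close(): Unit = ()
}

object mapPartitionWithResource {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val input: DataSet[String] = environment.fromElements("hadoop", "flink", "spark")

    val enriched: DataSet[String] = input.mapPartition(partition => {
      // The connection is opened once per partition, not once per element
      val connection = new FakeConnection
      // Materialize before closing, because the partition Iterator is lazy
      val result = partition.map(line => connection.lookup(line)).toList
      connection.close()
      result
    })

    enriched.print()
  }
}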
distinct
distinct deduplicates the records across the entire DataSet.
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

import scala.collection.mutable.ArrayBuffer

object distinct {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val arrayBuffer = new ArrayBuffer[String]()
    arrayBuffer += "hello world1"
    arrayBuffer += "hello world2"
    arrayBuffer += "hello world3"
    arrayBuffer += "hello world4"
    val collectionDataSet: DataSet[String] = environment.fromCollection(arrayBuffer)
    // Split every line into words, then drop the duplicate words
    val dsDataSet: DataSet[String] = collectionDataSet.flatMap(x => x.split(" ")).distinct()
    dsDataSet.print()
  }
}
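Besides deduplicating whole records, distinct can also be applied to selected tuple fields. A small sketch (sample data made up) that keeps one record per id by deduplicating on field 0:

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

object distinctByField {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val users: DataSet[(Int, String)] =
      environment.fromElements((1, "张三"), (1, "张三(重复)"), (2, "李四"))

    // Deduplicate on the first tuple field only; which duplicate survives is arbitrary
    users.distinct(0).print()
  }
}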
Output (ordering may vary): hello, world1, world2, world3, world4 — the duplicated "hello" token is kept only once.
Join operations
join
- Inner join: only records whose join keys exist on both sides are kept
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

import scala.collection.mutable.ArrayBuffer

object join {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    // Tuples of (id, value); the first field is used as the join key
    val array1 = ArrayBuffer((1, "张三"), (2, "李四"), (3, "王五"))
    val array2 = ArrayBuffer((1, "15"), (4, "20"), (3, "25"))
    val firstDataStream: DataSet[(Int,