flink-10 DataSet Operators

Common DataSet transformation operators (the ones not demonstrated later in this post are covered by the combined sketch right after this list)

  • Map

    • Takes one element and returns one element; useful for record-level cleaning and transformation
  • FlatMap

    • Takes one element and returns zero, one, or more elements
  • MapPartition

    • Like map, but processes one whole partition at a time [if the processing needs a third-party resource such as a connection, prefer MapPartition so the resource is acquired once per partition rather than once per element]
  • Filter

    • Applies a predicate to each element; only elements that satisfy it are kept
  • Reduce

    • Aggregates the dataset by combining the current element with the previous reduce result, returning a new value
  • Aggregate

    • Built-in aggregations such as sum, max, and min
  • Distinct

    • Returns the dataset with duplicate elements removed, e.g. data.distinct()
  • Join

    • Inner join
  • OuterJoin

    • Outer join
  • Cross

    • Computes the Cartesian product of two datasets
  • Union

    • Concatenates two datasets; their element types must match
  • First-n

    • Returns the first N elements of a dataset
  • Sort Partition

    • Sorts all partitions of the dataset locally; chain sortPartition() calls to sort on multiple fields
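
Most of these operators are demonstrated individually below. The ones that are not (Map, FlatMap, Filter, Reduce, Aggregate, Union) are simple enough to show in one combined sketch; the datasets `words` and `nums` here are made up for illustration:

import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala._

object basicOperators {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    val words: DataSet[String] = environment.fromElements("hadoop", "flink", "spark")
    val nums: DataSet[(String, Int)] = environment.fromElements(("a", 1), ("a", 2), ("b", 3))

    //map: one element in, one element out
    val upper: DataSet[String] = words.map(word => word.toUpperCase)

    //flatMap: one element in, zero or more elements out
    val letters: DataSet[String] = words.flatMap(word => word.split(""))

    //filter: keep only the elements that satisfy the predicate
    val fWords: DataSet[String] = words.filter(word => word.startsWith("f"))

    //reduce: combine the current element with the previous reduce result
    val summed: DataSet[(String, Int)] = nums.groupBy(0).reduce((a, b) => (a._1, a._2 + b._2))

    //aggregate: built-in sum/max/min on a tuple field
    val total: DataSet[(String, Int)] = nums.aggregate(Aggregations.SUM, 1)

    //union: concatenate two datasets of the same element type
    val unioned: DataSet[String] = upper.union(fWords)

    summed.print()
  }
}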

mapPartition

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import scala.collection.mutable.ArrayBuffer
object map {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._ //brings in implicit TypeInformation

    val arrayBuffer = new ArrayBuffer[String]()
    arrayBuffer += "hadoop"
    arrayBuffer += "flink"
    val collectionDataSet: DataSet[String] = environment.fromCollection(arrayBuffer)

    //mapPartition receives an iterator over one whole partition at a time;
    //here each line in the partition is transformed and the iterator returned
    val resultPartition: DataSet[String] = collectionDataSet.mapPartition(eachPartition => {
      eachPartition.map(eachLine => eachLine + " result")
    })
    resultPartition.print()
  }
}
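
The operator list above recommends mapPartition when each element needs a third-party resource: the resource can be acquired once per partition instead of once per element. A minimal sketch of that pattern, as a fragment for the main method above (reusing `collectionDataSet`); the `Connection` class is a made-up stand-in for a real client:

//hypothetical stand-in for a real client (database, HTTP, ...)
class Connection {
  def lookup(line: String): String = line + " enriched"
  def close(): Unit = ()
}

val enriched: DataSet[String] = collectionDataSet.mapPartition(eachPartition => {
  val conn = new Connection() //acquired once for the whole partition
  //materialize with toList before closing the connection, because the
  //iterator returned by map is lazy
  val out = eachPartition.map(line => conn.lookup(line)).toList
  conn.close()
  out
})
enriched.print()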


distinct

distinct deduplicates the records of the entire dataset.

import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ArrayBuffer

object distinct {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val arrayBuffer = new ArrayBuffer[String]()
    arrayBuffer += "hello world1"
    arrayBuffer += "hello world2"
    arrayBuffer += "hello world3"
    arrayBuffer += "hello world4"

    val collectionDataSet: DataSet[String] = environment.fromCollection(arrayBuffer)
    //split into words, then deduplicate across the whole dataset
    val dsDataSet: DataSet[String] = collectionDataSet.flatMap(x => x.split(" ")).distinct()
    dsDataSet.print()
  }
}
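
distinct also accepts field positions, deduplicating tuples on a subset of their fields rather than on the whole record. A small sketch, as a fragment for the main method above (which of the duplicate records survives is not defined):

val people: DataSet[(Int, String)] = environment.fromElements((1, "张三"), (1, "李四"), (2, "王五"))
//deduplicate by the first tuple field only
people.distinct(0).print()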

Output (screenshot)

Join operations

join

  • Inner join: only records whose key exists on both sides are kept
import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ArrayBuffer

object join {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    import org.apache.flink.api.scala._
    val array1 = ArrayBuffer((1,"张三"),(2,"李四"),(3,"王五"))
    val array2 = ArrayBuffer((1,"15"),(4,"20"),(3,"25"))

    val firstDataStream: DataSet[(Int, String)] = environment.fromCollection(array1)
    val secondDataStream: DataSet[(Int, String)] = environment.fromCollection(array2)

    val joinResult: UnfinishedJoinOperation[(Int, String), (Int, String)] = firstDataStream.join(secondDataStream)

    //where() selects the key field of the left input; equalTo() selects the matching key field of the right input
    val resultDataSet: DataSet[(Int, String, String)] = joinResult.where(0).equalTo(0).map(x => {
      (x._1._1, x._1._2, x._2._2)
    })

    resultDataSet.print()
  }
}
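
Without the trailing map, the joined dataset holds nested pairs of type ((Int, String), (Int, String)). The same projection can be written in one step with apply on the joined dataset; a sketch, as a fragment reusing firstDataStream and secondDataStream from above:

val resultViaApply: DataSet[(Int, String, String)] = firstDataStream
  .join(secondDataStream)
  .where(0).equalTo(0)
  .apply((left, right) => (left._1, left._2, right._2))
resultViaApply.print()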

Output (screenshot)

leftOuterJoin and rightOuterJoin

  • leftOuterJoin
    • All records of the left side are kept; when there is no matching record on the right, you must handle that case yourself
  • rightOuterJoin
    • All records of the right side are kept; when there is no matching record on the left, you must handle that case yourself
import org.apache.flink.api.common.functions.JoinFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ArrayBuffer

object leftOuterJoin_rightOuterJoin {
  def main(args: Array[String]): Unit = {

    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    import org.apache.flink.api.scala._
    val array1 = ArrayBuffer((1,"张三"),(2,"李四"),(3,"王五"),(4,"张飞"))
    val array2 = ArrayBuffer((1,"18"),(2,"35"),(3,"42"),(5,"50"))

    val firstDataStream: DataSet[(Int, String)] = environment.fromCollection(array1)
    val secondDataStream: DataSet[(Int, String)] = environment.fromCollection(array2)

    //left outer join
    val leftOuterJoin: UnfinishedOuterJoinOperation[(Int, String), (Int, String)] = firstDataStream.leftOuterJoin(secondDataStream)
    //where() selects the key field of the left input; equalTo() selects the matching key field of the right input
    val leftDataSet: JoinFunctionAssigner[(Int, String), (Int, String)] = leftOuterJoin.where(0).equalTo(0)
    //apply a join function to each matched pair
    //the three type parameters of JoinFunction are: left input, right input, output
    val leftResult: DataSet[(Int, String,String)] = leftDataSet.apply(new JoinFunction[(Int, String), (Int, String), (Int,String, String)] {
      override def join(left: (Int, String), right: (Int, String)): (Int, String, String) = {
        val result = if (right == null) {
          Tuple3[Int, String, String](left._1, left._2, "null")
        } else {
          Tuple3[Int, String, String](left._1, left._2, right._2)
        }
        result
      }
    })
    leftResult.print()


    //right outer join
    val rightOuterJoin: UnfinishedOuterJoinOperation[(Int, String), (Int, String)] = firstDataStream.rightOuterJoin(secondDataStream)
    //where() selects the key field of the left input; equalTo() selects the matching key field of the right input
    val rightDataSet: JoinFunctionAssigner[(Int, String), (Int, String)] = rightOuterJoin.where(0).equalTo(0)
    //apply a join function to each matched pair
    //the three type parameters of JoinFunction are: left input, right input, output
    val rightResult: DataSet[(Int, String,String)] = rightDataSet.apply(new JoinFunction[(Int, String), (Int, String), (Int,String, String)] {
      override def join(left: (Int, String), right: (Int, String)): (Int, String, String) = {
        val result = if (left == null) {
          Tuple3[Int, String, String](right._1, right._2, "null")
        } else {
          Tuple3[Int, String, String](right._1, right._2, left._2)
        }
        result
      }
    })
    rightResult.print()
  }
}
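
The operator list at the top also mentions outer joins generally; a fullOuterJoin keeps unmatched records from both sides, so the join function must guard against null on either side. A sketch, as a fragment reusing firstDataStream and secondDataStream from the main method above:

//full outer join: either side may be null for unmatched keys
val fullResult: DataSet[(Int, String, String)] = firstDataStream
  .fullOuterJoin(secondDataStream)
  .where(0).equalTo(0)
  .apply(new JoinFunction[(Int, String), (Int, String), (Int, String, String)] {
    override def join(left: (Int, String), right: (Int, String)): (Int, String, String) = {
      if (left == null) (right._1, "null", right._2)
      else if (right == null) (left._1, left._2, "null")
      else (left._1, left._2, right._2)
    }
  })
fullResult.print()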

leftOuterJoin output (screenshot)

rightOuterJoin output (screenshot)

  • If the no-match-on-the-right case is not handled:
    //left outer join
    val leftOuterJoin: UnfinishedOuterJoinOperation[(Int, String), (Int, String)] = firstDataStream.leftOuterJoin(secondDataStream)
    //where() selects the key field of the left input; equalTo() selects the matching key field of the right input
    val leftDataSet: JoinFunctionAssigner[(Int, String), (Int, String)] = leftOuterJoin.where(0).equalTo(0)
    //apply a join function to each matched pair
    val leftResult: DataSet[(Int, String,String)] = leftDataSet.apply(new JoinFunction[(Int, String), (Int, String), (Int,String, String)] {
      override def join(left: (Int, String), right: (Int, String)): (Int, String, String) = {
        val result = if (right == null) {
          //Tuple3[Int, String, String](left._1, left._2, "null")
          Tuple3[Int, String, String](left._1, left._2, right._2) //bug: right is null here
        } else {
          Tuple3[Int, String, String](left._1, left._2, right._2)
        }
        result
      }
    })
    leftResult.print()

The job fails with a NullPointerException (screenshot): for a left-side key with no match on the right, `right` is null, so `right._2` dereferences null.

cross

Computes the Cartesian product of the two inputs. If the left side has n records and the right side has m records, the result has n*m records.

import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ArrayBuffer

object cross {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    import org.apache.flink.api.scala._
    val array1 = ArrayBuffer((1,"张三"),(2,"李四"))
    val array2 = ArrayBuffer((1,"18"),(2,"35"),(3,"42"))

    val firstDataStream: DataSet[(Int, String)] = environment.fromCollection(array1)
    val secondDataStream: DataSet[(Int, String)] = environment.fromCollection(array2)

    //cross: Cartesian product of the two datasets
    val crossDataSet: CrossDataSet[(Int, String), (Int, String)] = firstDataStream.cross(secondDataStream)
    crossDataSet.print()
  }
}
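
Because the Cartesian product grows as n*m, the DataSet API also provides the size-hint variants crossWithTiny and crossWithHuge, which tell the optimizer which input is the (much) smaller or larger one. A sketch, as a fragment reusing the datasets from the main method above:

//hint that the second input is very small / very large so the optimizer
//can choose a better execution strategy
val tinyCross = firstDataStream.crossWithTiny(secondDataStream)
val hugeCross = firstDataStream.crossWithHuge(secondDataStream)
tinyCross.print()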

Output (screenshot)

first-N and sortPartition

  • first-N: take the first N elements
  • sortPartition: sort each partition locally
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment

import scala.collection.mutable.ArrayBuffer

object first_n_sortPartition {
  def main(args: Array[String]): Unit = {

    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    //source data: (id, name, value)
    val array = ArrayBuffer((1,"张三",10),(2,"李四",20),(3,"王五",30),(3,"赵6",40))

    val collectionDataSet: DataSet[(Int, String,Int)] = environment.fromCollection(array)

    //take the first 3 elements
    collectionDataSet.first(3).print()

    collectionDataSet
      .groupBy(0) //group by the first field
      .sortGroup(2, Order.DESCENDING) //sort each group by the third field, descending
      .first(1) //take the first element of each group
      .print()

    /**
      * Sort without grouping: first field descending, then third field ascending.
      * Note that sortPartition sorts within each partition; this is only a global
      * sort when all the data ends up in a single partition.
      */
    collectionDataSet.sortPartition(0, Order.DESCENDING).sortPartition(2, Order.ASCENDING).print()
  }
}

first-N output (screenshot)

Output when each group is sorted by the third field and the first element per group is taken (screenshot):

  • (3,"王五",30) is missing from the result: 王五 and 赵6 fall into the same group, and after the descending sort 赵6 comes first

Ungrouped sort, first field descending, third field ascending (screenshot):

  • The first rule (descending on the first field) takes precedence; the second rule only orders records whose keys tie under the first
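
Because sortPartition only sorts within each partition, the output above is only globally ordered when all data sits in one partition. For a small dataset, a global order can be forced by running the sort with parallelism 1; a sketch, as a fragment reusing collectionDataSet from the main method above:

//force a single partition so the local sort is effectively a global sort
collectionDataSet
  .sortPartition(0, Order.DESCENDING)
  .setParallelism(1)
  .print()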

partition

import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ArrayBuffer

object partition {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val array = ArrayBuffer((1,"hello"),
      (2,"hello"),
      (2,"hello"),
      (3,"hello"),
      (3,"hello"),
      (3,"hello"),
      (4,"hello"),
      (4,"hello"),
      (4,"hello"),
      (4,"hello"),
      (5,"hello"),
      (5,"hello"),
      (5,"hello"),
      (5,"hello"),
      (5,"hello"),
      (6,"hello"),
      (6,"hello"),
      (6,"hello"),
      (6,"hello"),
      (6,"hello"),
      (6,"hello"))
    environment.setParallelism(2)
    val sourceDataSet: DataSet[(Int, String)] = environment.fromCollection(array)

    //partitionByHash: partition by the hash of the given field
    sourceDataSet.partitionByHash(0).mapPartition(eachPartition => {
      //log each record with the thread that processes it, then pass it through;
      //mapping (rather than foreach-ing) avoids exhausting the partition iterator
      //before it is returned
      eachPartition.map(t => {
        println("current thread id " + Thread.currentThread().getId + "=============" + t._1)
        t
      })
    }).print()


    //partitionByRange: range-partition by the given key
    sourceDataSet.partitionByRange(x => x._1).mapPartition(eachPartition => {
      eachPartition.map(t => {
        println("current thread id " + Thread.currentThread().getId + "=============" + t._1)
        t
      })
    }).print()
  }

}
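
Besides hash and range partitioning, the DataSet API offers rebalance() for round-robin redistribution and partitionCustom for a user-defined Partitioner. A sketch routing records by their first field, as a fragment for the main method above (reusing sourceDataSet):

import org.apache.flink.api.common.functions.Partitioner

//rebalance: round-robin redistribution to even out skewed partitions
val rebalanced = sourceDataSet.rebalance()

//partitionCustom: decide the target partition ourselves from the key
val custom = sourceDataSet.partitionCustom(new Partitioner[Int] {
  override def partition(key: Int, numPartitions: Int): Int = key % numPartitions
}, 0)
custom.print()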

Partial output (screenshot)

Official documentation
