Flink Notes 6: DataStream API (2) - Transform and Sink: Introduction and Usage

Transform and Sink: introduction and usage

3. Transform

(1) Basic transformation operators

(2) Keyed stream transformation operators

(3) Multi-stream transformation operators

4. Sink


3. Transform

(1) Basic transformation operators

①  Map: takes one element and returns one element; cleaning, conversion, and similar operations can be applied in between. (The examples in this subsection use plain Scala collections to illustrate the semantics; the same map/flatMap/filter operators exist on DataStream.)

object Transform1 {
  def main(args: Array[String]): Unit = {
    
    val inputdata = List(1,2,3,4,5)
    val resultdata = inputdata.map(_+10)
    print(resultdata)

  }
}

// Result: List(11, 12, 13, 14, 15)

object Transform1 {
  def main(args: Array[String]): Unit = {

    val inputdata = List(1,2,3,4,5)
    val resultdata = inputdata.map(data=>data*10)
    print(resultdata)

  }
}

// Result: List(10, 20, 30, 40, 50)

②  FlatMap: takes one element and can return zero, one, or more elements

object Transform1 {
  def main(args: Array[String]): Unit = {

    val inputdata = List("i like english","hello world")
    val resultdata = inputdata.flatMap(_.split(" "))
    print(resultdata)

  }
}

// Result: List(i, like, english, hello, world)

③  Filter: applies a predicate to each incoming element; only elements that satisfy the condition are kept

object Transform1 {
  def main(args: Array[String]): Unit = {

    val inputdata = List(1,2,3,4,5,6,7,8)
    val resultdata = inputdata.filter(_%2==0)
    print(resultdata)

  }
}

// Result: List(2, 4, 6, 8)

(2) Keyed stream transformation operators

①  KeyBy (grouping for aggregation): partitions the stream by the specified key; records with the same key go to the same partition (but one partition may contain more than one key). DataStream -> KeyedStream (the original stream plus a key grouping). See the sketch right below.
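
A minimal sketch of keyBy on its own (it assumes the sensorReading case class defined in the examples further down and builds a small in-memory stream instead of reading test.txt; a key-selector function is used here as an alternative to the string-based keyBy("id") shown later):

import org.apache.flink.streaming.api.scala.{KeyedStream, StreamExecutionEnvironment, createTypeInformation}

object KeyBySketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // a small in-memory stream instead of reading test.txt
    val data_stream = env.fromCollection(Seq(
      sensorReading("sensor_1", 199908188L, 36.2),
      sensorReading("sensor_2", 199908189L, 36.0),
      sensorReading("sensor_2", 199908192L, 35.6)
    ))

    // key-selector function: records with the same id are routed to the same logical partition
    val keyed: KeyedStream[sensorReading, String] = data_stream.keyBy(_.id)

    // the rolling aggregations listed next operate on this KeyedStream,
    // e.g. a running minimum of the temperature per sensor id
    keyed.min("temperature").print()

    env.execute("keyBy sketch")
  }
}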

②  Rolling Aggregation (rolling aggregation operators): aggregate each keyed sub-stream of a KeyedStream

  • sum()   running sum of the field
  • min()   running minimum of the field (the other fields keep the values of the first record)
  • max()   running maximum of the field (the other fields keep the values of the first record)
  • minBy() emits the whole record that currently holds the minimum value of the field
  • maxBy() emits the whole record that currently holds the maximum value of the field

Contents of test.txt:

sensor_1,199908188,36.2
sensor_2,199908189,36.0
sensor_3,199908190,36.5
sensor_4,199908191,36.3
sensor_2,199908192,35.6
sensor_2,199908196,36.8
sensor_2,199908197,30.6
sensor_2,199908180,33.4
sensor_2,199908199,36.6
sensor_2,199908200,35.3

Example (keyBy + rolling aggregation):

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    // convert to the case class type
    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
        }
      )

    // group the stream by key
    val keyby_stream = data_stream
      .keyBy("id")   // an Int argument selects the field by position; a String selects it by the field name of the case class

    //Rolling Aggregation
    val RA_stream = keyby_stream
      .min("temperature")
//      .minBy("temperature")

    RA_stream.print()

    env.execute("test keyby")

  }
}


Comparing the two outputs, the timestamps differ: with min, the timestamp is always that of the first sensor_2 record and does not change even when a lower temperature arrives later, whereas minBy updates the timestamp together with the record that holds the minimum.
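
For illustration, a hand-traced expectation for the sensor_2 records (an assumed trace, not captured program output; it assumes parallelism 1 and that the lines of test.txt are processed in file order):

// min("temperature"): the timestamp stays at the first sensor_2 record
sensorReading(sensor_2,199908189,36.0)
sensorReading(sensor_2,199908189,35.6)
sensorReading(sensor_2,199908189,35.6)
sensorReading(sensor_2,199908189,30.6)
...

// minBy("temperature"): the whole record with the minimum temperature so far
sensorReading(sensor_2,199908189,36.0)
sensorReading(sensor_2,199908192,35.6)
sensorReading(sensor_2,199908192,35.6)
sensorReading(sensor_2,199908197,30.6)
...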

③  Reduce: an aggregation on a keyed stream that combines the current element with the last aggregated result to produce a new value; the returned stream contains the result of every aggregation step, not only the final result. KeyedStream -> DataStream

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    // convert to the case class type
    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // Reduce: output the current minimum temperature, paired with the latest timestamp
    val reduce_stream = data_stream
      .keyBy("id")
      .reduce((curstate,newstate) =>
          sensorReading(curstate.id,newstate.timestamp,curstate.temperature.min(newstate.temperature))
      )

    reduce_stream.print()

    env.execute()

  }
}
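
A hand-traced expectation for the sensor_2 records (again an assumed trace rather than captured output, assuming parallelism 1 and file order): the temperature stays at the running minimum while the timestamp is always taken from the most recent record.

sensorReading(sensor_2,199908189,36.0)
sensorReading(sensor_2,199908192,35.6)
sensorReading(sensor_2,199908196,35.6)
sensorReading(sensor_2,199908197,30.6)
...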

You can also define your own ReduceFunction (the result is the same as above):

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // Reduce: output the current minimum temperature, paired with the latest timestamp
    val reduce_stream = data_stream
      .keyBy("id")
      .reduce(new MyReduce())
    reduce_stream.print()

    env.execute()

  }
}

class MyReduce extends ReduceFunction[sensorReading]{
  override def reduce(t: sensorReading, t1: sensorReading): sensorReading =
    sensorReading(t.id,t1.timestamp,t.temperature.min(t1.temperature))
}

(3) Multi-stream transformation operators

① Split & Select

Split: splits a DataStream into two or more DataStreams based on some characteristic. DataStream -> SplitStream

After splitting, the result is actually still a single overall stream, a SplitStream.

Select: retrieves one or more DataStreams from a SplitStream. SplitStream -> DataStream

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object split_select {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    // convert to the case class type
    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // split & select
    // split the data by whether the temperature exceeds 36.5
    val split_stream = data_stream
      .split(data =>
        if (data.temperature > 36.5) Seq("high") else Seq("low")
      )

    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")
    val allStream = split_stream.select("high","low")

    highStream.print("high")
    lowStream.print("low")
    allStream.print("all")

    env.execute()

  }
}

② Connect & CoMap

Connect: connects two data streams while preserving their types. After being connected, the two streams are merely placed inside the same stream; internally each keeps its own data and form completely unchanged, and the two streams remain independent of each other.

DataStream, DataStream -> ConnectedStreams

CoMap / CoFlatMap: applied to a ConnectedStreams; they work like map and flatMap, applying a separate map / flatMap function to each of the two streams inside the ConnectedStreams.

ConnectedStreams -> DataStream

A ConnectedStreams cannot be printed directly; it can only be output after applying map or flatMap.

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object Connect_Comap {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // split & select
    val split_stream = data_stream
      .split(data =>
        if (data.temperature > 36.5) Seq("high") else Seq("low")
      )

    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")
    val allStream = split_stream.select("high","low")

    // connect & CoMap
    val warning_stream = highStream.map(data => (data.id,data.temperature))
    val connected_stream = warning_stream.connect(lowStream)

    val comap_stream = connected_stream
      .map(
        warning_data => (warning_data._1,warning_data._2,"warning"),
        low_data => (low_data.id,low_data.temperature,"normal")
      )

    comap_stream.print()

    env.execute()

  }
}

③ Union

Union: performs a union of two or more DataStreams, producing a new DataStream that contains all elements of the input streams. DataStream -> DataStream. Unlike connect, union requires all input streams to have the same element type, but it can merge more than two streams at once.

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object union_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // split & select
    val split_stream = data_stream
      .split(data =>
        if (data.temperature >= 36.5) Seq("high") else Seq("low")
      )

    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")

    val newHighStream = highStream.map(data => (data.id,data.timestamp,data.temperature,"high"))
    val newLowStream = lowStream.map(data => (data.id,data.timestamp,data.temperature,"low"))

    //union
    val union_stream = newHighStream.union(newLowStream)

    union_stream.print()

    env.execute()
  }
}

4. Sink

(1) writeAsText() / writeAsCsv()

Writes the elements line by line as strings, which are obtained by calling each element's toString() method.

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object sink_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // sink
    data_stream.writeAsCsv("src/main/resources/output.csv")   // not recommended

    env.execute("sink test")
  }
}

Since writeAsCsv() is not recommended, the suggested approach is the StreamingFileSink:

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object sink_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      }
      )

    // recommended sink: StreamingFileSink
    data_stream.addSink(StreamingFileSink.forRowFormat(
      new Path("src/main/resources/output2.csv"),new SimpleStringEncoder[sensorReading]()
      ).build()
    )

    env.execute("sink test")
  }
}
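
Note that StreamingFileSink treats the given path as a base directory and writes bucketed part files under it rather than a single CSV file; part files are finalized on checkpoints, so enable checkpointing if you need completed files.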

(2) print() / printToErr()

Prints the value of each element's toString() method to standard output or to the standard error stream.
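
A minimal usage sketch (reusing the data_stream from the examples above; the optional string argument is an identifier prefixed to every printed line):

data_stream.print("sensor")            // to standard output
data_stream.printToErr("sensor-err")   // to standard error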

(3) Kafka (custom output with addSink)

Output from Flink to external systems is written through sinks, added with addSink() (input, correspondingly, comes in through sources).

① If the Maven dependency has not been added yet, remember to add it first:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.12</artifactId>
    <version>1.10.1</version>
</dependency>

② Write the code

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011

case class sensorReading(id:String,timestamp:Long,temperature:Double)

object sink_kafka {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val path = "src/main/resources/test.txt"
    val  inputdata = env.readTextFile(path)

    val data_stream = inputdata
      .map ( data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble).toString
      }
      )

    data_stream.addSink(new FlinkKafkaProducer011[String](
      "192.168.100.3:9092","kafkasinktest",new SimpleStringSchema()))

    env.execute("kafka sink test")
  }
}

③ Start ZooKeeper and Kafka, then run the program; you can see the results written to Kafka:

[root@master ~]# cd /home/hadoop//softs/zookeeper-3.5.6/
[root@master zookeeper-3.5.6]# ./bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/hadoop/softs/zookeeper-3.5.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@master zookeeper-3.5.6]# jps
6893 QuorumPeerMain
6925 Jps
[root@master zookeeper-3.5.6]# cd ..
[root@master softs]# cd kafka_2.11-0.11.0.3/
[root@master kafka_2.11-0.11.0.3]# ./bin/kafka-server-start.sh -daemon ./config/server.properties
[root@master kafka_2.11-0.11.0.3]# jps
7169 Kafka
7244 Jps
6893 QuorumPeerMain
[root@master kafka_2.11-0.11.0.3]# ./bin/kafka-console-consumer.sh --bootstrap-server 192.168.100.3:9092 --topic kafkasinktest
sensorReading(sensor_1,199908188,36.2)
sensorReading(sensor_2,199908189,36.0)
sensorReading(sensor_3,199908190,36.5)
sensorReading(sensor_4,199908191,36.3)
