Transform and Sink: Introduction and Usage
3. Transform
(1) Basic transformation operators
① Map: takes one element and returns one element; cleaning and conversion logic can be applied in between.
object Transform1 {
  def main(args: Array[String]): Unit = {
    val inputdata = List(1, 2, 3, 4, 5)
    val resultdata = inputdata.map(_ + 10)
    print(resultdata)
  }
}
// Result: List(11, 12, 13, 14, 15)
object Transform1 {
  def main(args: Array[String]): Unit = {
    val inputdata = List(1, 2, 3, 4, 5)
    val resultdata = inputdata.map(data => data * 10)
    print(resultdata)
  }
}
// Result: List(10, 20, 30, 40, 50)
② FlatMap: takes one element and returns zero, one, or more elements.
object Transform1 {
  def main(args: Array[String]): Unit = {
    val inputdata = List("i like english", "hello world")
    val resultdata = inputdata.flatMap(_.split(" "))
    print(resultdata)
  }
}
// Result: List(i, like, english, hello, world)
③ Filter: applies a predicate to each incoming element; only elements that satisfy the condition are kept.
object Transform1 {
  def main(args: Array[String]): Unit = {
    val inputdata = List(1, 2, 3, 4, 5, 6, 7, 8)
    val resultdata = inputdata.filter(_ % 2 == 0)
    print(resultdata)
  }
}
// Result: List(2, 4, 6, 8)
(2) Keyed-stream transformation operators
① KeyBy: partitions the stream by the given key; records with the same key go to the same partition (though one partition may hold records of more than one key). DataStream -> KeyedStream (the original stream plus a key for grouping).
② Rolling aggregation operators: aggregate each substream of a KeyedStream.
- sum(): running sum
- min(): running minimum of the given field
- max(): running maximum of the given field
- minBy(): the whole record holding the minimum of the given field
- maxBy(): the whole record holding the maximum of the given field
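The rolling-aggregation idea above can be sketched without Flink at all. The following is a minimal plain-Scala sketch (names are mine, not Flink API) that emulates keyBy + sum: each key keeps its own running total, and every input element emits the current aggregate for its key, one output per input.

```scala
object RollingSumSketch {
  // Emulate a keyed rolling sum: one running total per key,
  // and one output element emitted per input element.
  def rollingSum(records: Seq[(String, Int)]): Seq[(String, Int)] = {
    val totals = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    records.map { case (key, value) =>
      totals(key) += value   // update this key's running total
      (key, totals(key))     // emit the current aggregate for this key
    }
  }

  def main(args: Array[String]): Unit =
    println(rollingSum(Seq(("a", 1), ("b", 2), ("a", 3))))
}
```

Note how the second "a" element emits ("a", 4): the aggregate is rolling, so every step's result appears in the output, just as in the Flink examples below.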
Contents of test.txt:
sensor_1,199908188,36.2
sensor_2,199908189,36.0
sensor_3,199908190,36.5
sensor_4,199908191,36.3
sensor_2,199908192,35.6
sensor_2,199908196,36.8
sensor_2,199908197,30.6
sensor_2,199908180,33.4
sensor_2,199908199,36.6
sensor_2,199908200,35.3
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    // Convert each line to the case-class type
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Group by key
    val keyby_stream = data_stream
      .keyBy("id") // an Int selects the field by position; a String selects the case-class field of that name
    // Rolling aggregation
    val RA_stream = keyby_stream
      .min("temperature")
    // .minBy("temperature")
    RA_stream.print()
    env.execute("test keyby")
  }
}
Comparing the two outputs, min and minBy differ in the timestamp field: with min, the timestamp always stays that of the first sensor_2 record, even after a lower temperature arrives; with minBy, the timestamp changes, because minBy returns the entire record that holds the minimum.
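The difference can be made concrete with a plain-Scala sketch (not the Flink API; the two helper functions are mine): min updates only the aggregated field and keeps the rest of the first record, while minBy keeps whichever whole record is smaller.

```scala
object MinVsMinBy {
  case class Reading(id: String, ts: Long, temp: Double)

  // Flink-style rolling "min": keep the running record, replace only the temp field
  def rollingMin(a: Reading, b: Reading): Reading =
    a.copy(temp = a.temp.min(b.temp))

  // Flink-style rolling "minBy": keep whichever whole record holds the minimum
  def rollingMinBy(a: Reading, b: Reading): Reading =
    if (b.temp < a.temp) b else a

  def main(args: Array[String]): Unit = {
    val readings = List(Reading("sensor_2", 100L, 36.0), Reading("sensor_2", 200L, 35.6))
    println(readings.reduce(rollingMin))   // temp updated, but ts stays 100
    println(readings.reduce(rollingMinBy)) // the whole second record wins: ts is 200
  }
}
```

With min the result is Reading(sensor_2,100,35.6) (old timestamp, new minimum); with minBy it is Reading(sensor_2,200,35.6) (the record that actually held the minimum).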
③ Reduce: a grouped-stream aggregation that combines the current element with the previous aggregation result to produce a new value. The returned stream contains the result of every aggregation step, not just the final one. KeyedStream -> DataStream
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    // Convert each line to the case-class type
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Reduce: output the running minimum temperature, paired with the latest timestamp
    val reduce_stream = data_stream
      .keyBy("id")
      .reduce((curstate, newstate) =>
        sensorReading(curstate.id, newstate.timestamp, curstate.temperature.min(newstate.temperature))
      )
    reduce_stream.print()
    env.execute()
  }
}
You can also define your own ReduceFunction (the result is the same as above):
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object Transform1 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Reduce: output the running minimum temperature, paired with the latest timestamp
    val reduce_stream = data_stream
      .keyBy("id")
      .reduce(new MyReduce())
    reduce_stream.print()
    env.execute()
  }
}

class MyReduce extends ReduceFunction[sensorReading] {
  override def reduce(t: sensorReading, t1: sensorReading): sensorReading =
    sensorReading(t.id, t1.timestamp, t.temperature.min(t1.temperature))
}
(3) Multi-stream transformation operators
① Split & Select
Split: splits one DataStream into two or more DataStreams according to some criterion. DataStream -> SplitStream
After the split, the data still lives in a single SplitStream.
Select: retrieves one or more DataStreams from a SplitStream. SplitStream -> DataStream
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object split_select {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    // Convert each line to the case-class type
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Split & Select
    // Split the data by whether the temperature exceeds 36.5
    val split_stream = data_stream
      .split(data =>
        if (data.temperature > 36.5) Seq("high") else Seq("low")
      )
    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")
    val allStream = split_stream.select("high", "low")
    highStream.print("high")
    lowStream.print("low")
    allStream.print("all")
    env.execute()
  }
}
② Connect & CoMap
Connect: joins two data streams while each keeps its own type. After connect, the two streams merely sit in the same container; internally each keeps its own data and form unchanged, and the two remain independent.
DataStream, DataStream -> ConnectedStreams
CoMap: operates on ConnectedStreams the way map and flatMap operate on a DataStream, applying a separate map/flatMap function to each of the two underlying streams.
ConnectedStreams -> DataStream
A ConnectedStreams cannot be printed directly; it must first go through map or flatMap.
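The connect + CoMap idea can be sketched with plain Scala collections (not the Flink API; `coMap` and the tagging via Either are my own illustration): elements of two different types sit side by side in one sequence, and a pair of map functions turns both into one common output type.

```scala
object ConnectCoMapSketch {
  // Emulate CoMap: apply f to elements of the first "stream" (Left)
  // and g to elements of the second "stream" (Right), producing one output type C.
  def coMap[A, B, C](merged: Seq[Either[A, B]])(f: A => C, g: B => C): Seq[C] =
    merged.map {
      case Left(a)  => f(a) // first stream's map function
      case Right(b) => g(b) // second stream's map function
    }

  def main(args: Array[String]): Unit = {
    // First stream: (id, temperature) tuples; second stream: plain ids.
    val merged: Seq[Either[(String, Double), String]] =
      Seq(Left(("sensor_2", 36.8)), Right("sensor_1"))
    println(coMap(merged)(w => s"${w._1} warning", id => s"$id normal"))
  }
}
```

As in the Flink example below, the two inputs keep different element types until the pair of map functions unifies them into one output stream.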
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object Connect_Comap {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Split & Select
    val split_stream = data_stream
      .split(data =>
        if (data.temperature > 36.5) Seq("high") else Seq("low")
      )
    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")
    val allStream = split_stream.select("high", "low")
    // Connect & CoMap
    val warning_stream = highStream.map(data => (data.id, data.temperature))
    val connected_stream = warning_stream.connect(lowStream)
    val comap_stream = connected_stream
      .map(
        warning_data => (warning_data._1, warning_data._2, "warning"),
        low_data => (low_data.id, low_data.temperature, "normal")
      )
    comap_stream.print()
    env.execute()
  }
}
③ Union
Union of two or more DataStreams produces a new DataStream containing all of their elements; all inputs must have the same element type. DataStream -> DataStream
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object union_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Split & Select
    val split_stream = data_stream
      .split(data =>
        if (data.temperature >= 36.5) Seq("high") else Seq("low")
      )
    val highStream = split_stream.select("high")
    val lowStream = split_stream.select("low")
    val newHighStream = highStream.map(data => (data.id, data.timestamp, data.temperature, "high"))
    val newLowStream = lowStream.map(data => (data.id, data.timestamp, data.temperature, "low"))
    // Union: both inputs now have the same tuple type
    val union_stream = newHighStream.union(newLowStream)
    union_stream.print()
    env.execute()
  }
}
4. Sink
(1) writeAsText() / writeAsCsv()
Writes the elements line by line as strings, obtained by calling each element's toString() method.
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object sink_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Sink: writeAsCsv is deprecated and not recommended
    data_stream.writeAsCsv("src/main/resources/output.csv")
    env.execute("sink test")
  }
}
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object sink_test {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })
    // Recommended sink: StreamingFileSink
    data_stream.addSink(StreamingFileSink.forRowFormat(
      new Path("src/main/resources/output2.csv"), new SimpleStringEncoder[sensorReading]()
    ).build())
    env.execute("sink test")
  }
}
(2) print() / printToErr()
Prints the value of each element's toString() method to standard output or standard error.
(3) Kafka (custom sink via addSink)
Flink's output to external systems is handled by sinks.
① If the Maven dependency has not been added yet, add it first:
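Since print() relies on each element's toString(), a Scala case class prints readably out of the box, which is why the sensorReading records in this article appear as "sensorReading(sensor_1,...)". A minimal plain-Scala illustration (no Flink involved):

```scala
object ToStringSketch {
  case class Reading(id: String, ts: Long, temp: Double)

  def main(args: Array[String]): Unit = {
    val r = Reading("sensor_1", 199908188L, 36.2)
    // This is the same text a print() sink would emit for this element
    println(r.toString)  // prints "Reading(sensor_1,199908188,36.2)"
  }
}
```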
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.12</artifactId>
<version>1.10.1</version>
</dependency>
② Write the code:
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, createTypeInformation}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011

case class sensorReading(id: String, timestamp: Long, temperature: Double)

object sink_kafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val path = "src/main/resources/test.txt"
    val inputdata = env.readTextFile(path)
    val data_stream = inputdata
      .map(data => {
        val arr = data.split(",")
        sensorReading(arr(0), arr(1).toLong, arr(2).toDouble).toString
      })
    data_stream.addSink(new FlinkKafkaProducer011[String](
      "192.168.100.3:9092", "kafkasinktest", new SimpleStringSchema()))
    env.execute("kafka sink test")
  }
}
③ Start ZooKeeper and Kafka, then run the program; the results appear in Kafka:
[root@master ~]# cd /home/hadoop//softs/zookeeper-3.5.6/
[root@master zookeeper-3.5.6]# ./bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/hadoop/softs/zookeeper-3.5.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@master zookeeper-3.5.6]# jps
6893 QuorumPeerMain
6925 Jps
[root@master zookeeper-3.5.6]# cd ..
[root@master softs]# cd kafka_2.11-0.11.0.3/
[root@master kafka_2.11-0.11.0.3]# ./bin/kafka-server-start.sh -daemon ./config/server.properties
[root@master kafka_2.11-0.11.0.3]# jps
7169 Kafka
7244 Jps
6893 QuorumPeerMain
[root@master kafka_2.11-0.11.0.3]# ./bin/kafka-console-consumer.sh --bootstrap-server 192.168.100.3:9092 --topic kafkasinktest
sensorReading(sensor_1,199908188,36.2)
sensorReading(sensor_2,199908189,36.0)
sensorReading(sensor_3,199908190,36.5)
sensorReading(sensor_4,199908191,36.3)