Design philosophies of Spark and Flink
Spark: designed around batch processing; stream processing was bolted on later as micro-batching (pseudo real-time).
Flink: designed around stream processing from the start.
DataSet API: the core API for Flink batch-processing applications.
DataStream API: the core API for stream processing.
Gelly: a scalable graph-processing and analysis library.
Usage of execute
Batch processing: when sinking data to a destination (e.g., writeAsText), execute must be called; eager sinks such as print trigger execution on their own.
Stream processing: execute must always be called, otherwise the job never runs.
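A minimal sketch of the contrast, assuming a standard Flink Scala project (the object name and sample elements are illustrative):

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object ExecuteDemo {
  def main(args: Array[String]): Unit = {
    // Batch: print() is an eager sink and triggers execution by itself;
    // a sink like writeAsText(...) would need an explicit env.execute().
    val batchEnv = ExecutionEnvironment.getExecutionEnvironment
    batchEnv.fromElements("hadoop", "spark", "hive").print()

    // Streaming: nothing runs until execute() is called.
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    streamEnv.fromElements("hadoop", "spark", "hive").print()
    streamEnv.execute("execute-demo")
  }
}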
Flink architecture diagram
Flink cluster architecture
Flink cluster startup commands
- Start the whole cluster (typically bin/start-cluster.sh)
- Start the high-availability master (JobManager) nodes on hadoop101 and hadoop102 (typically bin/jobmanager.sh start on each)
- Start the worker (TaskManager) nodes (typically bin/taskmanager.sh start)
The JobManager
Where the four graphs are created:
Four graphs: StreamGraph, JobGraph, ExecutionGraph, physical execution graph
- On the client: StreamGraph → JobGraph
- On the JobManager: JobGraph → ExecutionGraph
- On the TaskManager: ExecutionGraph → physical execution graph
JobManager responsibilities
1: builds the ExecutionGraph from the JobGraph
2: coordinates the distribution of the job's tasks
3: coordinates the TaskManagers to take checkpoints
4: manages the worker (TaskManager) nodes
5: receives execution status and heartbeat messages from the worker nodes
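As a small illustration of point 3: checkpointing is enabled on the job side, and the JobManager's checkpoint coordinator then triggers checkpoints at that interval. A minimal sketch, with the interval and job name chosen for illustration:

import org.apache.flink.streaming.api.scala._

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // ask the JobManager's checkpoint coordinator to trigger a
    // checkpoint every 5 seconds (interval chosen for illustration)
    env.enableCheckpointing(5000)
    env.fromElements("hadoop", "spark", "hive").print()
    env.execute("checkpoint-demo")
  }
}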
Parallelism
.setParallelism(N): a dynamic concept; it sets how many parallel instances an operator (or the whole job) actually runs with.
Slot: a static concept; the number of slots configured determines the cluster's capacity for executing tasks in parallel.
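A short sketch of the dynamic side, assuming the streaming Scala API (all numbers and names are illustrative):

import org.apache.flink.streaming.api.scala._

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2) // default parallelism for every operator in this job

    env.fromElements("hadoop spark", "hive hbase")
      .flatMap(_.split(" "))
      .setParallelism(4) // overrides the job default for this operator only
      .print()
      .setParallelism(1) // a single sink instance keeps the output in one stream

    env.execute("parallelism-demo")
  }
}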
Operator chains
Conditions for two operators to be chained into one task: one-to-one (forward) data exchange and the same parallelism; see the sketch below.
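A minimal sketch, assuming the streaming Scala API: flatMap and filter below satisfy both conditions and are chained into one task, while startNewChain() deliberately breaks the chain (names and data are illustrative):

import org.apache.flink.streaming.api.scala._

object ChainDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2)

    env.fromElements("hadoop spark", "hive hbase")
      .flatMap(_.split(" ")) // one-to-one exchange with the filter below and
      .filter(_.nonEmpty)    // the same parallelism -> chained into one task
      .map(_.toUpperCase)
      .startNewChain()       // force this map to start a fresh chain
      .print()

    env.execute("chain-demo")
  }
}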
The union and connect operators
union: the input streams must have the same element type; more than two DataStreams can be merged at once.
connect: the two input streams may have different element types, which are unified afterwards with CoMap or CoFlatMap; only two streams can be connected.
// union: merge streams of the same element type
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val listDataStream1: DataStream[String] = env.fromCollection(List("1", "2", "3", "4"))
val listDataStream2: DataStream[String] = env.fromCollection(List("mysql hive", "hbase", "hadoop", "hbase"))
val dataStream3 = env.fromCollection(List("tom jerry", "hauhau dahuang", "xiaoming", "xiaohong"))
val result: DataStream[String] = listDataStream1.union(listDataStream2)
// result.print().setParallelism(1)
val result1: DataStream[String] = listDataStream1.union(listDataStream2, dataStream3)
result1.print().setParallelism(1)
env.execute("union-demo")
// connect: merge two streams whose element types may differ;
// CoMap / CoFlatMap then operate on the connected stream to unify the types
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment
val listDataStream1: DataStream[Int] = env.fromCollection(List(1, 2, 3, 4))
val listDataStream2: DataStream[String] = env.fromCollection(List("mysql hive", "hbase", "hadoop", "hbase"))
val result: ConnectedStreams[Int, String] = listDataStream1.connect(listDataStream2)
val result1: DataStream[String] = result.flatMap(new CoFlatMapFunction[Int, String, String] {
  // called for every element of the first (Int) stream
  override def flatMap1(in1: Int, collector: Collector[String]): Unit = {
    collector.collect(in1.toString)
  }
  // called for every element of the second (String) stream
  override def flatMap2(in2: String, collector: Collector[String]): Unit = {
    for (x <- in2.split(" ")) {
      collector.collect(x)
    }
  }
})
result1.print().setParallelism(1)
env.execute("connect-demo")
Custom UDF functions in Flink (filter functions)
Function Classes
package flink.chapter3

import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.api.scala._

// a standalone function class implementing Flink's FilterFunction interface
class FilterFun extends FilterFunction[String] {
  override def filter(value: String): Boolean = {
    // value.startsWith("h")
    value.contains("o")
  }
}

object Enter3 {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.fromElements("hadoop", "spark", "hive")
    val result: DataSet[String] = data.filter(new FilterFun)
    result.print()
  }
}
Second approach (implementing the function as an anonymous class):
package flink.chapter3

import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.api.scala._

object RichFilter_Demo {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.fromElements("hadoop", "spark", "hive")
    // the function is implemented inline as an anonymous class
    val result: DataSet[String] = data.filter(new RichFilterFunction[String] {
      override def filter(t: String): Boolean = {
        t.contains("i")
      }
    })
    result.print()
  }
}
Passing parameters to a custom function class (here an ordinary FilterFunction, via its constructor)
package flink.chapter3

import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.api.scala._

// the filter keyword is passed in through the constructor
class FilterFun1(word: String) extends FilterFunction[String] {
  override def filter(value: String): Boolean = {
    value.startsWith(word)
  }
}

object Enter4 {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.fromElements("hadoop", "spark", "hive")
    val result: DataSet[String] = data.filter(new FilterFun1("s"))
    result.print()
  }
}
Anonymous functions (Lambda Functions)
package flink.chapter3

import org.apache.flink.api.scala._

object DefineFun {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.fromElements("hadoop", "spark", "hive")
    // the filter logic is passed directly as a lambda
    val result: DataSet[String] = data.filter(_.startsWith("s"))
    result.print()
  }
}
Rich Functions
package flink.chapter3

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// rich functions expose a lifecycle (open/close) and the runtime context
class DefineMap extends RichFlatMapFunction[String, (Int, String)] {
  var subTask = 0

  override def open(parameters: Configuration): Unit = {
    // get the index of this parallel subtask from the runtime context
    subTask = getRuntimeContext.getIndexOfThisSubtask
  }

  override def flatMap(in: String, collector: Collector[(Int, String)]): Unit = {
    // emit results through collector.collect
    collector.collect((subTask, in))
  }

  override def close(): Unit = {
    // nothing to release here
  }
}

object Enter5 {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[String] = env.fromElements("hadoop", "spark", "hive")
    val result: DataSet[(Int, String)] = data.flatMap(new DefineMap)
    result.print()
  }
}