Flink Notes

FLINK

1 Introduction to Flink

1.1 Flink's module lineage

Hadoop MR (map/reduce) -> Tez (DAG, batch processing) -> Spark (DAG, batch; Spark Streaming: micro-batch) -> Flink (batch and streaming)

[Figure: 001.png]

1.2 Flink's underlying architecture

[Figure: 002.png]

1.3 Common components

1.3.1 JobManager

The JobManager is the coordinator of a Flink cluster. It collects the status of every job and of every worker node (TaskManager) in the cluster, and is also responsible for task management, checkpointing and automatic failover. The JobManager contains three important components:

  • Actor system: a container providing scheduling and other services.
  • Scheduler: in Flink, executors are called task slots; every TaskManager must have one or more task slots. Internally, Flink decides which tasks share a slot.
  • Checkpoint: fault tolerance.

1.3.2 TaskManager

A TaskManager is similar to a worker in Spark. It runs one or more threads inside a JVM. The parallelism of task execution is determined by the number of task slots on the TaskManager.

[Figure: 003.png]

1.3.3 Client

When a user submits an application to Flink, a client is created first. The client pre-processes the user's Flink program, so it needs some parameters, for example the address of the JobManager the program will be submitted to. In other words, the client assembles the user's Flink program into a job.

1.4 Flink's stream processing and batch processing

Internally, Flink transfers data over the network in units of buffer blocks, and the timeout of these buffers is user-configurable. If you set the buffer timeout to 0, Flink transfers data "fully in real time" and processes with minimal latency. If you set the timeout to a very large value, Flink behaves much more like a batch processor.
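
A minimal sketch of tuning this trade-off in code (the timeout values and the socket address are only illustrative, not from the original notes):

import org.apache.flink.streaming.api.scala._

object DemoBufferTimeout {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setBufferTimeout(0) // flush after every record: "fully real-time", lowest latency
    // env.setBufferTimeout(100) // flush at the latest after 100 ms: better throughput

    env.socketTextStream("10.206.0.4", 6666).print()
    env.execute("buffer timeout demo")
  }
}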

2 Installing Flink (****)

2.1 local

  1. jdk1.8
  2. hadoop-2.7.6/2.8.1
##1. Install
[root@chancechance software]# tar -zxvf flink-1.6.1-bin-hadoop27-scala_2.11.tgz -C /opt/apps/

[root@chancechance flink-1.6.1]# vi /etc/profile
export FLINK_HOME=/opt/apps/flink-1.6.1
export PATH=$PATH:$FLINK_HOME/bin

##2. Start
[root@chancechance flink-1.6.1]# start-cluster.sh
[root@chancechance flink-1.6.1]# stop-cluster.sh

##3. Test (open the web UI)
ip:8081

2.2 standalone

3 virtual machines

qphone01: JobManager

qphone02/qphone03: TaskManagers

Tip: the setup shown here is a pseudo-distributed (single-node) one

##1. flink-conf.yaml
#==============================================================================
# Common
#==============================================================================
# IP of the JobManager
jobmanager.rpc.address: 10.206.0.4
# RPC port of the JobManager
jobmanager.rpc.port: 6123
# JVM heap size of the JobManager
jobmanager.heap.size: 1024m
# JVM heap size of each TaskManager
taskmanager.heap.size: 1024m
# number of task slots per TaskManager
taskmanager.numberOfTaskSlots: 1
# default parallelism of Flink programs
parallelism.default: 1

#==============================================================================
# Web Frontend
#==============================================================================
# port of the Flink web UI
rest.port: 8081

#==============================================================================
# Advanced
#==============================================================================
# directory for Flink's temporary data
io.tmp.dirs: /opt/apps/flink-1.6.1/tmp
# whether to pre-allocate TaskManager memory at startup
taskmanager.memory.preallocate: false

##2. slaves
10.206.0.4

##3. For a fully distributed setup, copy the flink-1.6.1 directory to the other nodes
##4. Start the cluster
##4.1 Option 1
[root@chancechance flink-1.6.1]# start-cluster.sh
[root@chancechance flink-1.6.1]# stop-cluster.sh

##4.2 Option 2
jobmanager.sh start/start-foreground cluster | stop | stop-all
taskmanager.sh start|start-foreground | stop | stop-all

##5. Run a test program
[root@chancechance flink-1.6.1]# flink run /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
> --input /opt/apps/hive-1.2.1/logs/metastore.log \
> --output /home/output/00

2.3 YARN mode installation (*****)

Requires at least Hadoop 2.2

HDFS and YARN must be running

2.3.1 The two Flink-on-YARN modes

There are two modes:

  1. Reserve a block of YARN resources dedicated to a long-running Flink cluster; these resources stay occupied until the session is stopped manually (yarn-session).
  2. Start a fresh Flink cluster for every submitted job; each run gets its own resources and runs do not affect each other, which makes management easier. The latter is recommended.

[Figure: 004.png]

##1. Start HDFS/YARN
start-dfs.sh/start-yarn.sh

##2. Launch in YARN mode
##2.1 Option 1 (yarn-session):
yarn-session.sh -n 2 -jm 1024 -tm 1024

flink run /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
--input hdfs://10.206.0.4:9000/input/1.data \
--output hdfs://10.206.0.4:9000/output/out.data

##2.2 Option 2 (per-job cluster)
flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 1024 /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
--input hdfs://10.206.0.4:9000/input/1.data \
--output hdfs://10.206.0.4:9000/output/out2.data

Tip:
One of the environment variables HADOOP_HOME, HADOOP_CONF_DIR or YARN_CONF_DIR must be set.

2.4 HA setup

[root@chancechance bin]# start-zookeeper-quorum.sh/stop-zookeeper-quorum.sh

flink-conf.yaml
#==============================================================================
# High Availability
#==============================================================================
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.206.0.4:2181
high-availability.zookeeper.client.acl: open
high-availability.storageDir: hdfs:///flink/ha/

2.5 How Flink on YARN executes under the hood

[Figure: 005.png]

2.6 flink scala shell

start-scala-shell.sh [local|remote|yarn] [options] <args>...

[root@chancechance bin]# start-cluster.sh
[root@chancechance bin]# start-scala-shell.sh remote 10.206.0.4 8081

scala> val text = benv.fromElements("hello wangjunjie", "hello junjie")
text: org.apache.flink.api.scala.DataSet[String] = org.apache.flink.api.scala.DataSet@41492479

scala> val cnts = text.flatMap(_.toLowerCase.split("\\s+")).map((_,1)).groupBy(0).sum(1)
cnts: org.apache.flink.api.scala.AggregateDataSet[(String, Int)] = org.apache.flink.api.scala.AggregateDataSet@75ff2b6d

scala> cnts.print
(hello,2)
(junjie,1)
(wangjunjie,1)

3 Flink API

3.1 Project setup

<!-- flink java -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
</dependency>

<!-- flink scala -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
</dependency>

3.2 WordCount_java

 package cn.qphone.flink.day1;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class Demo1_Wordcount_Java {
    public static void main(String[] args) throws Exception {
        //1. Prepare the parameters
        int port;
        try {
            //1.1 get the parameter tool, which parses the arguments passed in
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        }catch (Exception e) {
            System.err.println("no port set, default:port is 6666");
            port = 6666;
        }

        String hostname = "10.206.0.4";

        //2. Get the execution environment (the programming entry point)
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //3. Read the data from a socket
        DataStreamSource<String> data = env.socketTextStream(hostname, port);

        SingleOutputStreamOperator<WordWithCount> pairWords = data.flatMap(new FlatMapFunction<String, WordWithCount>() {
            public void flatMap(String line, Collector<WordWithCount> out) throws Exception {
                String[] split = line.split("\\s+");
                for (String word : split) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        });
        KeyedStream<WordWithCount, Tuple> grouped = pairWords.keyBy("word");
        WindowedStream<WordWithCount, Tuple, TimeWindow> window = grouped.timeWindow(Time.seconds(2), Time.seconds(1));
        SingleOutputStreamOperator<WordWithCount> cnts = window.sum("count");
        cnts.print().setParallelism(1);

        env.execute("wordcount");
    }

    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}

3.3 WordCount_scala

package cn.qphone.flink.day1

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time


object Demo1_WordCount_Scala {
  def main(args: Array[String]): Unit = {
    //1. Prepare the parameters
    var port:Int = try {
      ParameterTool.fromArgs(args).getInt("port")
    } catch {
      case e:Exception => System.err.println("no port set, default:port is 6666")
      6666
    }

    //2. Create the environment and read the data from the socket
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val data: DataStream[String] = env.socketTextStream("146.56.208.76", port)

    //3. Import implicit conversions
    import org.apache.flink.api.scala._

    //4. Compute
    val cnts: DataStream[WordWithScalaCount] = data.flatMap(_.split("\\s+")).map(WordWithScalaCount(_, 1)).keyBy("word")
      .timeWindow(Time.seconds(2), Time.seconds(1))
      .reduce((a, b) => WordWithScalaCount(a.word, a.count + b.count))

    //5. Print the result
    cnts.print().setParallelism(1)

    //6. Execute
    env.execute("wordcount scala")
  }
  case class WordWithScalaCount(word:String, count:Int)
}

3.4 Packaging

##1. Package the code, upload the jar to the target server node and run it
[root@chancechance bin]# flink run /opt/software/flink-parent.jar --port 6666

##2. Exception
org.apache.flink.client.program.ProgramInvocationException: Neither a 'Main-Class', nor a 'program-class' entry was found in the jar file
Cause: when executing the jar, Flink cannot find its entry point (main class)
Fix: package the jar differently (specify the main class)

##3. Use IDEA's own artifact packaging to specify the main class. After running, the results end up in the following log files:
standalone
/opt/apps/flink-1.6.1/log/flink-root-taskexecutor-0-chancechance.out

yarn
/opt/apps/hadoop-2.8.1/logs/userlogs/application_xxxx/container_xxxx_000002/taskmanager.out

3.5 Understanding windows

Every so often, compute statistics over an interval of data.

Two important parameters: window size and slide interval.

[Figure: 006.png]

3.6 Flink API layers

[Figure: 007.png]

4 Streaming API operations (*****)

4.1 DataStream sources

4.1.1 Built-in sources

package cn.qphone.flink.day2

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

import scala.collection.mutable.ListBuffer
import org.apache.flink.api.scala._

object Demo1_DataStreamSource {
  def main(args: Array[String]): Unit = {
    //1. Get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2. source

    //2.1 list source
    val list: ListBuffer[Int] = ListBuffer(10, 20, 30)
    val dStream: DataStream[Int] = env.fromCollection(list)
    val mapStream1: DataStream[Int] = dStream.map(_ * 100)
    mapStream1.print().setParallelism(1)
    println("=============================================")

    //2.2 string source
    val dStream2: DataStream[String] = env.fromElements("wangjunjie hen shuai")
    dStream2.print().setParallelism(1)
    println("=============================================")

    //2.3 File source
    val dStream3: DataStream[String] = env.readTextFile("file:///C:\\real_win10\\day30-flink\\doc\\笔记.md", "utf-8")
    dStream3.print().setParallelism(1)
    println("=============================================")

    //2.4 Socket source
    val dStream4: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
    dStream4.print().setParallelism(1)
    println("=============================================")

    //Start
    env.execute("collection source")
  }
}

4.1.2 Custom sources

Custom DataStream sources:

  1. Extend SourceFunction: non-parallel; its parallelism cannot be changed (a parallelism greater than 1 is not allowed), e.g. SocketTextStreamFunction
  2. Extend ParallelSourceFunction: a parallel SourceFunction whose parallelism can be set
  3. Extend RichParallelSourceFunction: implements ParallelSourceFunction; besides running in parallel it adds open(), close(), getRuntimeContext(), etc.
package cn.qphone.flink.day2
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

import scala.util.Random

/**
 * Custom DataStream sources:
 * 1. Extend SourceFunction: non-parallel; its parallelism cannot be changed (a parallelism greater than 1 is not allowed), e.g. SocketTextStreamFunction
 * 2. Extend ParallelSourceFunction: a parallel SourceFunction whose parallelism can be set
 * 3. Extend RichParallelSourceFunction: implements ParallelSourceFunction; besides running in parallel it adds open(), close(), getRuntimeContext(), etc.
 */
object Demo2_DataStreamCustomSource {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //1. Add the custom source
    val dataStream1: DataStream[String] = env.addSource(new MyRichParallelSourceFunction)
    dataStream1.print().setParallelism(1)

    env.execute("custom source")
  }
}

class MySourceFunction extends SourceFunction[String] {
  /**
   * Emit data downstream
   */
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val random = new Random()
    while (true) {
      val num: Int = random.nextInt(100)
      ctx.collect(s"random:${num}")
      Thread.sleep(500)
    }
  }

  /**
   * Cancel; used to stop the run() loop
   */
  override def cancel(): Unit = {

  }
}

class MyRichParallelSourceFunction extends RichParallelSourceFunction[String] {
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val random = new Random()
    while (true) {
      val num: Int = random.nextInt(1000)
      ctx.collect(s"random_rich:${num}")
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {} // no-op; leaving ??? here would throw NotImplementedError when cancel() is called

  /**
   * Initialization hook
   */
  override def open(parameters: Configuration): Unit = super.open(parameters)

  /**
   * Called on shutdown; suitable for cleanup
   */
  override def close(): Unit = super.close()
}

4.1.3 Custom source: MySQL

4.1.3.1 Create the table
create table `flink`.`stu1`(
    `id` int(11) default null,
    `name` varchar(32) default null
) engine=InnoDB default charset=utf8;

<!-- jdbc driver -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
4.1.3.2 The custom source
package cn.qphone.flink.day2

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.beans.BeanProperty


object Demo3_DataStreamCustomSource_Mysql {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val mysqlStream: DataStream[Stu1] = env.addSource(new MysqlSourceFunction)
    mysqlStream.print().setParallelism(1)
    env.execute("mysql source")
  }
}
case class Stu1(id:Int, name:String)
class MysqlSourceFunction extends RichParallelSourceFunction[Stu1] {

  @BeanProperty var ps:PreparedStatement = _
  @BeanProperty var conn:Connection = _
  @BeanProperty var rs:ResultSet = _

  /**
   * Initialization: open the JDBC connection
   */
  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
    val username = "root"
    val password = "wawyl1314bb*"
    Class.forName(driver)
    try {
      conn = DriverManager.getConnection(url, username, password)
      val sql = "select * from stu1"
      ps = conn.prepareStatement(sql)
    } catch {
      case e:Exception => e.printStackTrace()
    }
  }

  override def run(ctx: SourceFunction.SourceContext[Stu1]): Unit = {
    try {
      rs = ps.executeQuery
      while (rs.next()) {
        val stu1: Stu1 = Stu1(rs.getInt("id"), rs.getString("name"))
        ctx.collect(stu1)
      }
    } catch {
      case e:Exception => e.printStackTrace()
    }
  }

  override def cancel(): Unit = {

  }

  override def close(): Unit = {
    super.close()
    if (conn != null) conn.close()
    if (ps != null) ps.close()
    if (rs != null) rs.close()
  }
}

4.1.4 flink-jdbc-InputFormat

4.1.4.1 Add the dependency
<!-- For Flink 1.6.1 use the former; from 1.7.0 onwards use the latter -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-jdbc</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-jdbc_2.11</artifactId>
    <version>1.7.0</version>
</dependency>
4.1.4.2 Code
package cn.qphone.flink.day2

import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.types.Row


object Demo4_DataStreamJdbcInputFormat {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //1. Configure the InputFormat
    //1.1 create the RowTypeInfo
    val fieldsType: Array[TypeInformation[_]] = Array[TypeInformation[_]](
      BasicTypeInfo.INT_TYPE_INFO,
      BasicTypeInfo.STRING_TYPE_INFO
    )
    val rowTypeInfo = new RowTypeInfo(fieldsType:_*)

    //1.2 build the JDBCInputFormat
    val jdbcInputFormat: JDBCInputFormat = JDBCInputFormat.buildJDBCInputFormat()
      .setDBUrl("jdbc:mysql://146.56.208.76:3306/flink?useSSL=false")
      .setDrivername("com.mysql.jdbc.Driver")
      .setUsername("root")
      .setPassword("wawyl1314bb*")
      .setQuery("select * from stu1")
      .setRowTypeInfo(rowTypeInfo)
      .finish()

    val jdbcStream: DataStream[Row] = env.createInput(jdbcInputFormat)
    jdbcStream.print().setParallelism(1)
    env.execute("jdbc input format")
  }
}

4.1.5 Kafka source

4.1.5.1 Add the dependency (upgrade Flink to 1.9.1 for this part)
<!-- flink 2 kafka -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.9.1</version>
</dependency>
4.1.5.2 consumer.properties
bootstrap.servers=146.56.208.76:9092
group.id=hzbigdata_flink
auto.offset.reset=latest
4.1.5.3 Kafka
##1. Start Kafka
##2. Create the topic

[root@chancechance apps]# kafka-topics.sh --create --topic flink --zookeeper 10.206.0.4/kafka --partitions 1 --replication-factor 1
Created topic flink.
4.1.5.4 Code
package cn.qphone.flink.day2

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._

/**
 * kafka source
 */
object Demo5_DataStreamCusomSource_Kafka {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val properties = new Properties()
    properties.load(this.getClass.getClassLoader.getResourceAsStream("consumer.properties"))
    val topic = "flink"
    val kafkaDataStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer(topic, new SimpleStringSchema(), properties))
    kafkaDataStream.print("kafka source--->").setParallelism(1)
    env.execute("kafka source")
  }
}

4.1.5.5 Start a console producer to test
[root@chancechance apps]# kafka-console-producer.sh --topic flink --broker-list 10.206.0.4:9092

4.2 DataStream transformations

4.2.1 flatMap/map/filter/keyBy/Split/select/reduce/aggregation/union

package cn.qphone.flink.day2

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object Demo6_Transformation_Filter {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
    val wcStream: DataStream[WordCount] = socketStream.flatMap(_.split("\\s+")).filter(_.length > 4).map(WordCount(_, 1))
      .keyBy("word").timeWindow(Time.seconds(2), Time.seconds(1))
      .sum("cnt")
    wcStream.print().setParallelism(1)
    env.execute(this.getClass.getSimpleName)
  }
  case class WordCount(word:String, cnt:Int)
}

4.2.2 split and select

package cn.qphone.flink.day2

import org.apache.flink.streaming.api.scala.{DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._


/**
 * split : DataStream -> SplitStream
 * Purpose: splits a DataStream into several tagged sub-streams, represented by a SplitStream
 * select : SplitStream -> DataStream
 * Purpose: used together with split; picks one or more tagged sub-streams out of the SplitStream and forms a new DataStream
 *
 * sample input: 1,wangjunjie,man,180,180
 */
object Demo7_Transformation_Split_Select {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)

    val splitStream: SplitStream[User] = socketStream.map(info => {
      val arr: Array[String] = info.split(",")
      val uid: String = arr(0).trim
      val name: String = arr(1)
      val sex: String = arr(2)
      val height: Double = arr(3).toDouble
      val weight: Double = arr(4).toDouble
      User(uid, name, sex, height, weight)
    }).split((user: User) => {
      if (user.name.equals("wangjunjie")) Seq("old")
      else Seq("new")
    })

    splitStream.select("old").print("wangjunjie666").setParallelism(1) // 判断子流被标记为大佬,打汪俊杰
    splitStream.select("new").print("didid").setParallelism(1) // 判断子流被标记为马仔,打所有人都是弟弟!!!!!

    env.execute("split select transformation")
  }
}

case class User(id:String, name:String, sex:String, height:Double, weight:Double)

4.2.3 union and connect

4.2.3.1 union
package cn.qphone.flink.day2

import org.apache.flink.streaming.api.scala.{DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._

/**
 * union: DataStream* --> DataStream
 * Purpose: similar to UNION in Spark SQL
 *
 * connect : * --> ConnectedStreams
 * Purpose: connects two streams, which may have different element types; the two streams can share state
 */
object Demo8_Transformation_Union_Connect {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
    val splitStream: SplitStream[User] = socketStream.map(info => {
      val arr: Array[String] = info.split(",")
      val uid: String = arr(0).trim
      val name: String = arr(1)
      val sex: String = arr(2)
      val height: Double = arr(3).toDouble
      val weight: Double = arr(4).toDouble
      User(uid, name, sex, height, weight)
    }).split((user: User) => {
      if (user.name.equals("wangjunjie")) Seq("old")
      else Seq("new")
    })

    val oldStream: DataStream[User] = splitStream.select("old")
    val newStream: DataStream[User] = splitStream.select("new")

    val unionStream: DataStream[User] = oldStream.union(newStream)
    unionStream.print("union 合并结果 :").setParallelism(1)

    env.execute("union")
  }
}

4.2.3.2 connect
package cn.qphone.flink.day2

import org.apache.flink.streaming.api.scala.{ConnectedStreams, DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._

/**
 * union: DataStream* --> DataStream
 * Purpose: similar to UNION in Spark SQL
 *
 * connect : * --> ConnectedStreams
 * Purpose: connects two streams, which may have different element types; the two streams can share state
 */
object Demo8_Transformation_Union_Connect {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
    val splitStream: SplitStream[User] = socketStream.map(info => {
      val arr: Array[String] = info.split(",")
      val uid: String = arr(0).trim
      val name: String = arr(1)
      val sex: String = arr(2)
      val height: Double = arr(3).toDouble
      val weight: Double = arr(4).toDouble
      User(uid, name, sex, height, weight)
    }).split((user: User) => {
      if (user.name.equals("wangjunjie")) Seq("old")
      else Seq("new")
    })

    val bigStream: DataStream[(String, String)] = splitStream.select("old").map(e => (e.name, s"大佬"))
    val smallStream: DataStream[(String, String)] = splitStream.select("new").map(e => (e.name, s"马仔"))

    val connectedStream: ConnectedStreams[(String, String), (String, String)] = bigStream.connect(smallStream)
    //a ConnectedStreams cannot be printed directly; map it first
    connectedStream.map(
      big => ("name is " + big._1, "info is " + big._2)
      ,
      small => ("small is" + small._1, "info is " + small._2)
    ).print()

    env.execute("connect")
  }
}
4.2.4 keyBy and reduce
package cn.qphone.flink.day3
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

/**
 * keyBy : DataStream --> KeyedStream
 * Purpose: assigns records with the same key to the same partition (hash partitioning internally), similar to GROUP BY in SQL; every subsequent operation on the KeyedStream works within a group.
 * reduce : aggregation
 * Purpose: combines elements into a new value and returns a single result; reduce always produces a new value and must be applied to a keyed stream or a window.
 */
object Demo1_Transformation_Keyby_Reduce {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(Tuple2(200, 33), Tuple2(100,66), Tuple2(100, 56), Tuple2(200, 666))
      .keyBy(0) // group by the first element of the tuple
//      .reduce((t1, t2) => (t1._1, t1._2+t2._2))
      .sum(1) // aggregation operator
      .print()
      .setParallelism(1)

    env.execute()
  }
}

4.3 DataStream sinks

print

writeAsText

writeAsCsv

writeUsingOutputFormat

writeToSocket

addSink

4.3.1 Built-in sinks

package cn.qphone.flink.day3

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, FlinkKafkaProducer011}
import org.apache.kafka.common.serialization.ByteArraySerializer

object Demo1_Sink_Basic {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//    val dStream: DataStream[(Int, Int)] = env.fromElements(Tuple2(200, 33), Tuple2(100, 66), Tuple2(100, 56), Tuple2(200, 666))
    val dStream: DataStream[String] = env.fromElements("李熙", "利息")
    dStream.print() // print to the console
    dStream.writeAsText("file:///C:\\real_win10\\day31-flink\\resource\\out\\1.txt")
    dStream.writeAsCsv("file:///C:\\real_win10\\day31-flink\\resource\\out\\2.csv")
    dStream.writeToSocket("146.56.208.76", 9999, new SimpleStringSchema())

    env.execute()
  }
}

4.3.2 Kafka sink

package cn.qphone.flink.day3

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, FlinkKafkaProducer011}
import org.apache.kafka.common.serialization.ByteArraySerializer

object Demo1_Sink_Basic {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dStream: DataStream[String] = env.fromElements("李熙", "利息")

    //kafka
    val topic = "flink"
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "146.56.208.76:9092")
    properties.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
    properties.setProperty("value.serializer", classOf[ByteArraySerializer].getName)
    val sink = new FlinkKafkaProducer[String](topic, new SimpleStringSchema(), properties)
    dStream.addSink(sink)
    env.execute()
  }
}

4.3.3 MySQL OutputFormat sink

4.3.3.1 Create the table
create table `flink`.`obtain_employment`(
`id` int not null,
`name` varchar(32) not null,
`salary` double not null,
`address` varchar(32) not null
);
4.3.3.2 Code
package cn.qphone.flink.day3

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.io.jdbc.{JDBCInputFormat, JDBCOutputFormat}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.beans.BeanProperty

/**
 * 1 liujiahao 30000 shanghai
 * 2 张辉 23000 北京
 * 3 程志远 25000 杭州
 */
object Demo3_Mysql_OutputFormat {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
      .map(line => {
        val obtainEmployment: Array[String] = line.split("\\s+")
        println(obtainEmployment.mkString(","))
        ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
      })
    obtainEmploymentStream.print().setParallelism(1)
    obtainEmploymentStream.writeUsingOutputFormat(new MysqlOutputFormat)
    env.execute()
  }
}

case class ObtainEmployment(id:Int, name:String, salary:Double, address:String)

class MysqlOutputFormat extends OutputFormat[ObtainEmployment] {

  @BeanProperty var ps:PreparedStatement = _
  @BeanProperty var conn:Connection = _
  @BeanProperty var rs:ResultSet = _

  /**
   * Configuration-related initialization
   */
  override def configure(parameters: Configuration): Unit = {
    
  }

  /**
   * Business initialization: open the JDBC connection
   */
  override def open(taskNumber: Int, numTasks: Int): Unit = {
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
    val username = "root"
    val password = "wawyl1314bb*"
    Class.forName(driver)
    try {
      conn = DriverManager.getConnection(url, username, password)
    } catch {
      case e:Exception => e.printStackTrace()
    }
  }

  /**
   * Write one record
   */
  override def writeRecord(record: ObtainEmployment): Unit = {
    ps = conn.prepareStatement("insert into obtain_employment values(?, ?, ?, ?)")
    ps.setInt(1, record.id)
    ps.setString(2, record.name)
    ps.setDouble(3, record.salary)
    ps.setString(4, record.address)
    ps.execute()
  }

  /**
   * Called last: release resources
   */
  override def close(): Unit = {
    if (conn != null) conn.close()
    if (ps != null) ps.close()
    if (rs != null) rs.close()
  }
}

4.3.4 Two-phase commit with Kafka in Flink

4.3.4.1 2PC

Two-phase commit (2PC) is the most basic distributed consistency protocol. In a distributed system, so that every node can learn how the transactions on the other nodes have gone, a central node, the coordinator, is introduced to drive the execution logic of all nodes. The other business nodes scheduled by the coordinator are called participants.

Put simply, 2PC splits a distributed transaction into two phases: 1) prepare (the commit request) and 2) execute (the commit). The coordinator decides whether to really execute the transaction based on the participants' responses: if the participants answer OK/yes, it commits; otherwise it aborts. Note that ZooKeeper's data consistency also relies on a 2PC-style protocol. The flow is illustrated below:

[Figure: 008.png]

4.3.4.2 2PC in Flink

In Flink, 2PC is used by FlinkKafkaProducer.

[Figure: 009.png]

Consider the following scenario: data is pulled from a Kafka source, aggregated by a window, and finally written back to a Kafka sink.

  1. The JobManager sends a checkpoint barrier to the source, starting the pre-commit phase. While everything is still internal, no external action is taken; state variables are only initialized and are written out only when the checkpoint succeeds, otherwise the transaction is aborted.
  2. When the source receives the barrier, it snapshots its own state (for a Kafka source this is the consumed offset of every partition; the state backend used is configurable) and forwards the barrier to the next operator.
  3. When the window operator receives the barrier, it snapshots its own state (here, the aggregation result) and forwards the barrier. When the sink receives the barrier, it likewise snapshots its own state first and then performs a pre-commit.
  4. Once the pre-commit succeeds, the JobManager notifies every operator that this round of checkpointing is complete. At that point the Kafka sink issues the real transaction commit to Kafka.

That is the full two-phase flow. If something fails during the commit process there are two cases (a sketch of an exactly-once Kafka sink follows this list):

  1. If the pre-commit fails, the job is restored from the most recent checkpoint.
  2. Once the pre-commit has succeeded, the commit must be guaranteed to succeed as well. Therefore all operators have to reach consensus with the checkpoint and take the commit as the standard: either all of them succeed, or all of them abort and roll back.
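
A minimal sketch, assuming Flink 1.9 and the universal Kafka connector, of a FlinkKafkaProducer configured for EXACTLY_ONCE, i.e. the two-phase commit described above (topic name and broker address follow the earlier examples; the checkpoint interval and timeout values are illustrative):

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

object DemoExactlyOnceKafkaSink {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000) // the 2PC protocol is driven by checkpoints

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "146.56.208.76:9092")
    // must be smaller than the broker-side transaction.max.timeout.ms
    properties.setProperty("transaction.timeout.ms", "60000")

    val sink = new FlinkKafkaProducer[String](
      "flink",                                                          // target topic
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()),
      properties,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE)                         // enables two-phase commit

    env.fromElements("a", "b", "c").addSink(sink)
    env.execute("exactly-once kafka sink")
  }
}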

4.3.5 Redis sink

4.3.5.1 Add the dependency
<!-- flink2 redis -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
</dependency>
4.3.5.2 Code
package cn.qphone.flink.day4

import cn.qphone.flink.day3.ObtainEmployment
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

object Demo1_Sink_Reids {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
      .map(line => {
        val obtainEmployment: Array[String] = line.split("\\s+")
        println(obtainEmployment.mkString(","))
        ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
      })

    val tStream: DataStream[(String, String)] = obtainEmploymentStream.map(oe => (oe.name, oe.address))
    tStream.print().setParallelism(1)

    // create the Redis sink
    val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
      .setHost("146.56.208.76")
      .setPort(6379)
      .build()


    val sink = new RedisSink(config, new MyRedisSink)

    tStream.addSink(sink)

    env.execute("redis sink")

  }
}

/**
 * Custom Redis sink mapper
 */
class MyRedisSink extends RedisMapper[(String, String)] {

  /**
   *
   * @return
   */
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.SET, null)
  }

  /**
   * key
   * @param t
   * @return
   */
  override def getKeyFromData(t: (String, String)): String = {
    return t._1
  }

  /**
   * value
   * @param t
   * @return
   */
  override def getValueFromData(t: (String, String)): String = {
    return t._2
  }
}

4.3.6 ElasticSearch sink

4.3.6.1 Install Elasticsearch
4.3.6.2 Add the dependency
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
    <version>${flink-version}</version>
</dependency>
4.3.6.3 Code
package cn.qphone.flink.day4

import java.util

import cn.qphone.flink.day3.ObtainEmployment
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests

object Demo2_Sink_ElasticSearch {
  def main(args: Array[String]): Unit = {
    //1. source
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
      .map(line => {
        val obtainEmployment: Array[String] = line.split("\\s+")
        println(obtainEmployment.mkString(","))
        ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
      })

    obtainEmploymentStream.print().setParallelism(1)

    //2. Wire up the Elasticsearch sink
    //2.1 list the Elasticsearch hosts
    val httpHosts = new util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("146.56.208.76", 9200, "http"))
    //2.2 build the sink object
    val sink = new ElasticsearchSink.Builder[ObtainEmployment](httpHosts, new MyEsSink).build()
    //2.3 attach the sink
    obtainEmploymentStream.addSink(sink)

    env.execute("sind 2 es")
  }
}

class MyEsSink extends ElasticsearchSinkFunction[ObtainEmployment] {

  /**
   * Called once for every element that flows through the DataStream
   * @param element the record
   * @param runtimeContext
   * @param requestIndexer
   */
  override def process(element: ObtainEmployment, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
    //1. put the bean's fields into a Java map
    println(s"$element")
    val map = new util.HashMap[String, String]()
    map.put("name", element.name)
    map.put("address", element.address)
    //2. build an IndexRequest from the map
    val request: IndexRequest = Requests.indexRequest()
        .index("flink") // 索引库
        .`type`("info") // 索引类型
        .id(s"${element.id}") // docid
        .source(map) // 数据
    //3. hand the index request to the RequestIndexer
    requestIndexer.add(request)
  }
}

5 Batch API (*****)

5.1 Common sources

package cn.qphone.flink.day4

import org.apache.flink.api.scala._

object Demo3_DataSet_Source {
  def main(args: Array[String]): Unit = {
    //1. Get the entry point (ExecutionEnvironment)
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//    val ds: DataSet[String] = env.readTextFile("file:///C:\\real_win10\\day32-flink\\resource\\1.txt")
//    ds.print()

    val ds2: DataSet[String] = env.fromElements("lixi", "rock", "lee")
    ds2.print()

    val ds3: DataSet[Long] = env.generateSequence(1, 100)
    ds3.print()

    //2. print() already triggers execution for a DataSet; calling env.execute() again here
    //   would fail with "No new data sinks have been defined since the last execution"
    // env.execute("source batch")
  }
}

5.2 Common operators

map
flatmap
mappartition
filter
distinct
group by
reduce
max
min
sum
join
union
hashpartiton
RangeParition
...
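
A minimal sketch (the data and field meanings are made up for illustration) that strings together a few of the operators listed above: map, filter, distinct, groupBy, sum and join.

import org.apache.flink.api.scala._

object DemoDataSetOperators {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val scores = env.fromElements((1, "math", 90), (1, "english", 70), (2, "math", 80), (2, "math", 80))
    val names  = env.fromElements((1, "lixi"), (2, "rock"))

    val totals = scores
      .distinct()             // drop the duplicate (2, "math", 80)
      .filter(_._3 >= 75)     // keep scores of at least 75
      .map(t => (t._1, t._3)) // (id, score)
      .groupBy(0)
      .sum(1)                 // total score per id

    // join with the name DataSet on the id field
    totals.join(names).where(0).equalTo(0)
      .map(pair => (pair._2._2, pair._1._2))
      .print()                // print() triggers execution for a DataSet
  }
}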

5.3 Sink

writeAsText
writeAsCsv
print
write
output

5.3.1 Custom MySQL OutputFormat

5.3.1.1 Create the table
create table `flink`.`wc`(
`word` varchar(32),
`count` int(11)
)
5.3.1.2 Code
package cn.qphone.flink.day5

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.Configuration

import scala.beans.BeanProperty

object Demo1_DataSet_Source {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val text: DataSet[String] = env.fromElements("i love flink very much")
    val txt = text.flatMap(_.split("\\s+")).map((_, 1)).groupBy(0).sum(1).map(t => Wc(t._1, t._2))
    //1. attach the custom OutputFormat as a sink
    txt.output(new BatchMysqlOutputFormat)

    env.execute() // output() is a lazy sink in the DataSet API, so execute() is required here
  }
}

case class Wc(word:String, count:Int)

class BatchMysqlOutputFormat extends OutputFormat[Wc] {

  @BeanProperty var ps:PreparedStatement = _
  @BeanProperty var conn:Connection = _
  @BeanProperty var rs:ResultSet = _


  override def configure(parameters: Configuration): Unit = {

  }

  override def open(taskNumber: Int, numTasks: Int): Unit = {
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
    val username = "root"
    val password = "wawyl1314bb*"
    Class.forName(driver)
    try {
      conn = DriverManager.getConnection(url, username, password)
    } catch {
      case e:Exception => e.printStackTrace()
    }
  }

  override def writeRecord(record: Wc): Unit = {
    ps = conn.prepareStatement("insert into wc values(?, ?)")
    ps.setString(1, record.word)
    ps.setInt(2, record.count)
    ps.execute()
  }

  override def close(): Unit = {
    if (conn != null) conn.close()
    if (ps != null) ps.close()
    if (rs != null) rs.close()
  }
}

6 TaskManager and Task Slots (****)

6.1 Basic concepts

[Figure: 010.png]

In Flink, every TaskManager is roughly the counterpart of a Spark worker (a NodeManager in YARN); in other words, each TaskManager is a JVM process. A TaskManager is divided into a number of task slots, which correspond to Spark executors (YARN containers); their main purpose is to isolate the resource requirements of different tasks, and the default policy is to split the TaskManager's resources evenly among its slots. How many tasks a TaskManager can run concurrently is determined by its task slots (every TaskManager has at least one).

By default, two tasks in different task slots use separate resources. Flink also allows tasks to share a slot, with one precondition: the tasks must belong to the same job.

6.2 Parallelism

The parallelism of a task can be set at several levels in Flink (a short sketch follows the list):

  • Operator level: dataStream.print().setParallelism(1)
  • Execution environment: StreamExecutionEnvironment and ExecutionEnvironment; env.setParallelism(1)
  • Client (e.g. the -p option of flink run)
  • Configuration file: parallelism.default: 1
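
A minimal sketch of the precedence: the operator-level setting overrides the environment-level default.

import org.apache.flink.streaming.api.scala._

object DemoParallelism {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // environment level

    env.fromElements("a", "b", "c")
      .map(_.toUpperCase)        // runs with the environment parallelism (4)
      .print().setParallelism(1) // operator level: the print sink runs with parallelism 1

    env.execute()
  }
}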

6.3 Operator Chain

[Figure: 011.png]

  • StreamGraph (built on the client)
  • JobGraph (built on the client)
  • ExecutionGraph
  • Physical execution graph
object Demo1_WordCount_Scala {
  def main(args: Array[String]): Unit = {
    //1. Get the streaming execution environment (the batch counterpart is ExecutionEnvironment)
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Import implicit conversions
    import org.apache.flink.api.scala._

    //3. Compute
    //startNewChain: the map operator does not chain to its predecessor, but can chain forward, i.e. map and print are chained together
    env.fromElements("i love you").map((_,1)).startNewChain().print()

    //disableChaining: the map operator is not chained to any other operator
    env.fromElements("i hate you").map((_,1)).disableChaining().print()

    //slotSharingGroup: put the map operator into the given slot sharing group so it shares task slots
    env.fromElements("i hate you").map((_,1)).slotSharingGroup("default")
    //4. Trigger execution
    env.execute("wordcount scala")
  }
}

7 Partitioning (***)

7.1 The partitioner concept

Spark's RDDs have a notion of partitions, and Flink has one for DataStream as well. Flink's partitioning is implemented via the StreamPartitioner base class; there are 8 common partitioners:

  • ShufflePartitioner: sends each record to a randomly chosen downstream channel
  • BroadcastPartitioner: forwards every record to all downstream channels
  • CustomPartitionerWrapper: wraps a user-defined partitioner
  • ForwardPartitioner: forwards records to the locally running downstream operator
  • GlobalPartitioner: sends every record to the first (index 0) downstream instance
  • KeyGroupStreamPartitioner: chooses the target channel from the record's key (used by keyBy)
  • RebalancePartitioner: distributes records round-robin across all downstream channels
  • RescalePartitioner: distributes records round-robin across a local subset of downstream channels

7.2 Partitioner code

import org.apache.flink.streaming.api.scala._

object Demo2_Partitioner {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("i love you very much")
    data.shuffle.print("shuffle-->").setParallelism(4) // ShufflePartitoner
    data.rescale.print("rescale-->").setParallelism(4) // RescalePartitoner
    data.rebalance.print("rebalance-->").setParallelism(4)
    env.execute()
  }
}
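
A minimal sketch (made-up data) of the partitioners not shown above: broadcast, global, and a user-defined partitioner via partitionCustom.

import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala._

object DemoCustomPartitioner {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(("a", 1), ("b", 2), ("c", 3))

    data.broadcast.print("broadcast-->").setParallelism(4) // BroadcastPartitioner
    data.global.print("global-->").setParallelism(4)       // GlobalPartitioner: everything goes to subtask 0

    // CustomPartitionerWrapper: route by the first tuple field
    data.partitionCustom(new Partitioner[String] {
      override def partition(key: String, numPartitions: Int): Int =
        math.abs(key.hashCode) % numPartitions
    }, 0).print("custom-->").setParallelism(4)

    env.execute()
  }
}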

8 State, checkpoint, state backend, savepoint (***)

8.1 State

8.1.1 Introduction

State: functions and operators in Flink are stateful; the data they keep while processing records is their state.

8.1.2 Why state is needed

Flink is mainly used for streaming jobs that run 24/7 without interruption, and the consumed data should be processed without duplication or loss, without gaps, and exactly once. All of this depends on Flink's state: whenever we scale the parallelism in production or guard against server crashes, the state has to be preserved. Flink exposes APIs for managing it:

Keyed State

Operator State

|                        | Keyed State                                          | Operator State                           |
| ---------------------- | ---------------------------------------------------- | ---------------------------------------- |
| Usage scope            | only on a KeyedStream                                 | all operators                            |
| Handling               | one state per key; an operator may manage many states | one state per operator instance          |
| On parallelism change  | state migrates between instances along with its key   | you must choose a redistribution scheme  |
| Access                 | via RuntimeContext, i.e. inside a RichFunction        | via CheckpointedFunction                 |
| Supported structures   | ValueState, ListState, ...                            | ListState                                |

8.1.3 ValueState code

package cn.qphone.flink.day5

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object Dem3_State {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    env.fromElements((1,5), (1, 6), (1, 7), (2, 8), (2, 1))
      .keyBy(0)
      .flatMap(new MyFlatMapFunction)
      .print

    env.execute()
  }
}

/**
 * Custom state:
 * * every operator has a corresponding *Function or Rich*Function
 * * when customizing, we usually extend the rich variant (AbstractRichFunction)
 * * and implement its abstract methods
 */
class MyFlatMapFunction extends RichFlatMapFunction[(Int,Int), (Int,Int)] {

  var state:ValueState[(Int, Int)] = _

  /**
   * Initialization: create the state handle
   */
  override def open(parameters: Configuration): Unit = {
    val descriptor = new ValueStateDescriptor[(Int, Int)](
      "avg",
      TypeInformation.of(new TypeHint[(Int, Int)] {}),
      (0, 0)
    )
    state = getRuntimeContext.getState(descriptor) // int sum = 0
  }

  override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]): Unit = {
    // read the current state
    val currentState: (Int, Int) = state.value()
    val count: Int = currentState._1 + 1
    val sum: Int = currentState._2 + value._2

    // update the state
    state.update((count, sum))

    // emit the result
    out.collect(value._1, sum)

  }

  /**
   * Release resources
   */
  override def close(): Unit = super.close()
}

8.2 Checkpoint

8.2.1 Checkpoint concept and flow

A checkpoint is a periodic snapshot of a job's state, used to recover the job and keep it stable.

[Figure: 012.png]

Flink injects barriers into the data stream; a barrier travels from the sources to the sinks, and every operator it reaches automatically takes a snapshot.

[Figure: 013.png]

8.2.2 Global configuration in flink-conf.yaml

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
# state.backend: filesystem

# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints

# Default target directory for savepoints, optional.
#
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints

# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend).
#
# state.backend.incremental: false

8.3 state backend

8.3.1 Introduction

By default, state is kept in the TaskManager's memory and checkpoints are stored in the JobManager's memory.

There are three kinds of state backend:

  • MemoryStateBackend: the working state lives on the JVM heap; when a checkpoint runs, the state snapshot is stored in the JobManager's memory.
  • FsStateBackend: the working state lives in the TaskManager's memory; when a checkpoint runs, the state is written to the configured file system.
  • RocksDBStateBackend: the working state is kept in RocksDB on the local file system, written directly into RocksDB. It also needs a remote URI (usually on HDFS) to which the data is copied at checkpoint time. Its biggest advantage is that state is no longer limited by available memory, while checkpoints still go to a remote file system (see the sketch below).
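
A minimal sketch, assuming Flink 1.9, of choosing a state backend in code; the RocksDB variant additionally assumes the flink-statebackend-rocksdb_2.11 dependency is on the classpath, and the HDFS paths are placeholders.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DemoStateBackend {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Option 1: keep working state on the TaskManager heap, checkpoint to a file system
    env.setStateBackend(new FsStateBackend("hdfs://10.206.0.4:9000/flink/checkpoints"))

    // Option 2: keep working state in local RocksDB, checkpoint (incrementally) to HDFS
    // env.setStateBackend(new RocksDBStateBackend("hdfs://10.206.0.4:9000/flink/checkpoints", true))
  }
}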

8.3.2 Code

package cn.qphone.flink.day5

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object Demo4_RocketsDB_Backend {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    //get the checkpoint configuration object
    val config: CheckpointConfig = env.getCheckpointConfig
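    // Checkpointing is disabled by default; it has to be switched on for the settings
    // below to take effect (the 5-second interval here is only an example value).
    env.enableCheckpointing(5000)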

    /**
     * DELETE_ON_CANCELLATION: delete the checkpoint when the job is cancelled
     * RETAIN_ON_CANCELLATION: retain the checkpoint when the job is cancelled
     */
    config.enableExternalizedCheckpoints(
      CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // use EXACTLY_ONCE checkpointing mode
    config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    // minimum pause of 1 s between two checkpoints
    config.setMinPauseBetweenCheckpoints(1000)

    // each checkpoint snapshot must finish within 1 minute
    config.setCheckpointTimeout(60000)

    // allow only one checkpoint at a time
    config.setMaxConcurrentCheckpoints(1)

    val data = env.fromElements("i love you very much").print()

    env.execute()
  }
}

8.4 Differences between checkpoints and savepoints

  • Concept: a checkpoint is for fault tolerance; a savepoint stores a global snapshot of the state.
  • Purpose: checkpoints let a program recover quickly and automatically from failures; savepoints are for program changes or upgrades.
  • User interaction: checkpointing is Flink's own behavior, while savepoints are triggered by the user (e.g. with the flink savepoint <jobId> command). Put plainly, checkpoints are created, deleted and managed by Flink; savepoints are created, deleted and managed by the user.
  • Retention: checkpoints are deleted automatically by default; savepoints are kept until the user deletes them manually.
  • Checkpoints use the configured state backend.

9 Flink broadcast variables (broadcast state)

package cn.qphone.flink.day5

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.util.Collector

object Demo5_Broadcast {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val desc = new MapStateDescriptor(
      "sexinfo",
      BasicTypeInfo.INT_TYPE_INFO,
      BasicTypeInfo.STRING_TYPE_INFO
    )

    val sex: DataStream[(Int, String)] = env.fromElements((1, "男"), (2, "女"))
    val sexB: BroadcastStream[(Int, String)] = sex.broadcast(desc)
    /**
     * lixi 1
     * wangjunjie 1
     * lihao 2
     */
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
    val map: DataStream[(String, Int)] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val name: String = fields(0)
      val sexid: Int = fields(1).toInt
      (name, sexid)
    })

    map.connect(sexB).process(new BroadcastProcessFunction[(String, Int), (Int, String), (String, String)] {
        // process a regular element
        override def processElement(value: (String, Int),
                                    ctx: BroadcastProcessFunction[(String, Int), (Int, String), (String, String)]#ReadOnlyContext,
                                    out: Collector[(String, String)]): Unit = {
            val genderid: Int = value._2 // gender id carried by the main-stream record
            var gender: String = ctx.getBroadcastState(desc).get(genderid) // look the gender string up in the broadcast state
            if (gender == null) gender = "人妖"
          // emit (name, gender)
          out.collect((value._1, gender))
        }

      // process a broadcast element: update the broadcast state
      override def processBroadcastElement(value: (Int, String),
                                           ctx: BroadcastProcessFunction[(String, Int), (Int, String), (String, String)]#Context,
                                           out: Collector[(String, String)]): Unit = {
        ctx.getBroadcastState(desc).put(value._1, value._2)
      }
    }).print()
    env.execute()
  }
}

10 Flink's distributed cache

package cn.qphone.flink.day5

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

import scala.collection.mutable
import scala.io.{BufferedSource, Source}
import scala.collection.mutable.Map

/**
 * gender.txt:1 男,2 女
 */
object Demo6_Distribute_cache {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    //register the resource (HDFS or a local path) as a distributed cache file
    env.registerCachedFile("file:///c://flink_cache/gender.txt", "info")

    env.socketTextStream("146.56.208.76", 6666)
      .map(new RichMapFunction[String, (String,String)] {

        var bc:BufferedSource = _
        val map: mutable.Map[Int, String] = Map() // local lookup cache

        override def open(parameters: Configuration): Unit = {
          //read the data from the distributed cache
          val file: File = getRuntimeContext.getDistributedCache.getFile("info") // fetch the cached file by its registered name
          bc = Source.fromFile(file)
          val list: List[String] = bc.getLines().toList
          for(line <- list) {
            val fields: Array[String] = line.split("\\s+")
            val sexid: Int = fields(0).toInt
            val sex: String = fields(1)
            map.put(sexid, sex)
          }
        }
        override def map(line: String): (String, String) = {
          val fields: Array[String] = line.split("\\s+")
          val name: String = fields(0)
          val sexid: Int = fields(1).toInt
          val sex: String = map.getOrElse(sexid, "妖")
          (name, sex)
        }

        override def close(): Unit = {
          if(bc != null) bc.close()
        }
      }).print()
    env.execute()
  }
}

11 Accumulators

import org.apache.flink.api.common.JobExecutionResult
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object Demo1_Accumulator {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val data = env.fromElements("i love you very much")
    data.flatMap(new MyFlatMapFunction(env)).print()
    //4. fetch the accumulated result
    val result: JobExecutionResult = env.execute()
    println(result.getAccumulatorResult("count_name").toString)
  }
}
class MyFlatMapFunction extends RichFlatMapFunction[String, String] {

  private var env:StreamExecutionEnvironment = _

  def this(env:StreamExecutionEnvironment) {
    this()
    this.env = env
  }

  override def flatMap(value: String, out: Collector[String]): Unit = {
    //1. create the accumulator
    val counter = new IntCounter
    //2. register the accumulator
    getRuntimeContext.addAccumulator("count_name", counter)
    //3. add to it whenever the condition is met
    counter.add(1)
  }
}

12 Window and Time (****)

12.1 window

12.1.1 Scenarios

Statistics over the last period of time, at fixed intervals, or over the last N records.

Real-time computation where the results do not have to be strictly real-time.

Some latency in the data is acceptable; in such cases a window can be used.

12.1.2 Concept

A window splits an unbounded stream into bounded chunks. Broadly, Flink supports two kinds of windows: time-driven and count-driven.

12.1.3 Classification

  • Time windows
    • Tumbling time window
    • Sliding time window
    • Session window
  • Count windows
    • Sliding count window
    • Tumbling count window
12.1.3.1 Tumbling windows

Characteristics:

Time-aligned

Fixed window size

No overlap

For example:

Aggregating over a specific period of time is a good fit for tumbling windows.

12.1.3.2 Sliding windows

Characteristics:

Fixed window size

Overlapping

For example:

Statistics over the last few days.

12.1.3.3 Code test
package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.windowing.time.Time

object Demo2_Window {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    /**
     * 20201111 chongqing 3
     * 20201112 hangzhou 2
     */
    socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date+"_"+province, add)
    }).keyBy(0)
//      .timeWindow(Time.seconds(5)) // tumbling time window: only the data in the current window
//      .timeWindow(Time.seconds(5), Time.seconds(5)) // sliding time window
//      .countWindow(3) // tumbling count window
//      .countWindow(5,2) // sliding count window
//      .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000)))
      .sum(1).print()

    env.execute()

  }
}

12.1.3.4 Window aggregation functions
- sum
- reduce

package cn.qphone.flink.day6

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.windowing.time.Time

object Demo2_Window {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    /**
     * 20201111 chongqing 3
     * 20201112 hangzhou 2
     */
    socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date+"_"+province, add)
    }).keyBy(0)
      .timeWindow(Time.seconds(1)) // tumbling time window: only the data in the current window
//      .timeWindow(Time.seconds(5), Time.seconds(5)) // sliding time window
//      .countWindow(3) // tumbling count window
//      .countWindow(5,2) // sliding count window
//      .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000)))
        .aggregate(new AggregateFunction[(String,Int), (String,Int,Int), (String,Int)] {
          /**
           * Initialize the accumulator
           * @return
           */
          override def createAccumulator(): (String, Int, Int) = ("", 0, 0)

          /**
           * Called once per record: add it to the accumulator
           * @return
           */
          override def add(value: (String, Int), accumulator: (String, Int, Int)): (String, Int, Int) = {
            val cnt: Int = accumulator._2 + 1
            val sum: Int = accumulator._3 + value._2
            (value._1, cnt, sum)
          }

          /**
           * Produce the final result (the average) from the accumulator
           * @param accumulator
           * @return
           */
          override def getResult(accumulator: (String, Int, Int)): (String, Int) = {
            (accumulator._1, accumulator._3 / accumulator._2)
          }

          /**
           * Merge two partial accumulators
           * @return
           */
          override def merge(partition1: (String, Int, Int), partition2: (String, Int, Int)): (String, Int, Int) = {
            val cnt: Int = partition1._2 + partition2._2
            val sum: Int = partition1._3 + partition2._3
            (partition1._1, cnt, sum)
          }
        }).print()

    env.execute()
  }
}
12.1.3.5 Window process function
package cn.qphone.flink.day6

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object Demo2_Window {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    /**
     * 20201111 chongqing 3
     * 20201112 hangzhou 2
     */
    socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date+"_"+province, add)
    }).keyBy(0)
      .timeWindow(Time.seconds(1)) // tumbling window: only counts data in the current window
//      .timeWindow(Time.seconds(5), Time.seconds(5)) // sliding window (size, slide)
//      .countWindow(3) // tumbling count window
//      .countWindow(5,2) // sliding count window
//      .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000))) // session window
      .process[(String, Double)](new ProcessWindowFunction[(String, Int), (String, Double), Tuple, TimeWindow] {
        override def process(key: Tuple, context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Double)]): Unit = {
          var cnt:Int = 0
          var sum:Double = 0.0
          // iterate over all records buffered in the window
          elements.foreach(record => {
            cnt = cnt + 1
            sum = sum + record._2
          })
          out.collect((key.getField(0), sum / cnt))
        }
      }).print()

    env.execute()
  }
}
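
This version computes the same per-window average as the AggregateFunction demo, but note the difference: a ProcessWindowFunction buffers every element of the window and only runs when the window fires, whereas an AggregateFunction folds each record into the accumulator as it arrives, which is usually cheaper for large windows. The two can also be combined by passing both to aggregate().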

12.1.4 Triggers

A trigger decides when a window is evaluated (fired).

  • EventTimeTrigger
  • ProcessingTimeTrigger
  • CountTrigger

If the user does not set a trigger explicitly, Flink uses the default trigger of the window assigner.

12.1.4.1 Custom triggers

A custom trigger returns one of the following TriggerResult values:

  • CONTINUE : do nothing with the window
  • FIRE_AND_PURGE : evaluate the window with the window function we defined, emit the result, then clear the window's contents
  • FIRE : evaluate the window and emit the result; the window's contents are kept after the computation
  • PURGE : clear the window's contents without evaluating
package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object Demo3_Trigger {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    /**
     * 20201111 chongqing 3
     * 20201112 hangzhou 2
     */
    socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date+"_"+province, add)
    }).keyBy(0)
      .timeWindow(Time.seconds(5)) // tumbling window: only counts data in the current window
      .trigger(new MyTrigger) // set the custom trigger
      .sum(1)
      .print()

    env.execute()
  }
}

class MyTrigger extends Trigger[(String,Int), TimeWindow] {

  var cnt:Int = 0
  /**
   * Called automatically for every element that is added to a window
   */
  override def onElement(element: (String, Int), timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    // register a processing-time timer
    ctx.registerProcessingTimeTimer(window.maxTimestamp()) // the window's maximum timestamp (end - 1 ms)
    println(window.maxTimestamp())
    if (cnt > 5) {
      println("count threshold reached, firing window")
      cnt = 0
      TriggerResult.FIRE
    }else {
      cnt = cnt + 1
      TriggerResult.CONTINUE
    }
  }

  /**
   * Called when a processing-time timer registered via the context fires
   * @return
   */
  override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    println("processing-time timer fired, firing window")
    TriggerResult.FIRE
  }

  /**
   * Called when an event-time timer fires (unused here; we only register processing-time timers)
   */
  override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
    TriggerResult.CONTINUE
  }

  /**
   * Called when the window is cleared; clean up the timer we registered
   */
  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.deleteProcessingTimeTimer(window.maxTimestamp())
  }
}
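
With this trigger the window fires either when the element counter exceeds 5 or when the processing-time timer registered for window.maxTimestamp() goes off. Both paths return FIRE rather than FIRE_AND_PURGE, so the window's contents are kept until clear() is called. Also note that cnt is a plain field of the trigger instance, and a single trigger instance serves all keys and windows of a parallel subtask, so the counter is shared across them rather than being per-window.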

12.1.5 Watermarks

When Flink processes a stream in event time, time is driven by the timestamps carried in the data, and the data may arrive out of order. "Out of order" simply means that the order in which Flink receives events is not strictly the order of their event times.

A watermark is a mechanism for measuring the progress of event time; it is what ultimately causes event-time windows to fire. For example, 3-second tumbling windows cover event-time ranges such as:

[00:00:00, 00:00:03]

[00:00:03, 00:00:06]

[00:00:06, 00:00:09]

[00:00:57, 00:01:00]
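
As a worked example with the settings used in the code below (3-second tumbling windows, a maximum allowed delay of 10 seconds): watermark = largest event time seen so far - 10s. An event with timestamp 00:00:14 therefore pushes the watermark to 00:00:04, which is past the end of the first window (00:00:03), so that window fires; the window [00:00:03, 00:00:06] stays open until the watermark passes 00:00:06.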

12.1.5.1 Watermark on ordered data
package cn.qphone.flink.day6

import java.text.SimpleDateFormat

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.RichWindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object Demo10_WaterMark {
  /**
   * Sample input line: "lixi 12312312312312312" (name, event-time timestamp in milliseconds)
   * @param args
   */
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
    socket.filter(_.nonEmpty).map(line => {
      val fields: Array[String] = line.split("\\s+")
      //name, timestamp
      (fields(0), fields(1).toLong)
    }) // assign timestamps and watermarks
      .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
        var maxTimestamp = 0L // largest event timestamp seen so far
        val maxLazy = 10000L // maximum allowed delay (ms)
        val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

        // current watermark = largest timestamp seen so far minus the allowed delay
        override def getCurrentWatermark: Watermark = new Watermark(maxTimestamp - maxLazy)


        // extract the event timestamp from the record
        override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
          val now_ts: Long = element._2 // this record's event timestamp
          maxTimestamp = Math.max(now_ts, maxTimestamp)
          val now_water_ts: Long = getCurrentWatermark.getTimestamp
          println(s"eventTime->${format.format(now_ts)}")
          println(s"maxTimestamp->${format.format(maxTimestamp)}")
          println(s"now_water_ts->${format.format(now_water_ts)}")
          now_ts
        }
      }).keyBy(0).timeWindow(Time.seconds(3))
      .apply(new RichWindowFunction[(String,Long),String, Tuple, TimeWindow] {
        val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
          val lst: List[(String, Long)] = input.iterator.toList.sortBy(_._2)
          val startTime: String = format.format(window.getStart)
          val endTime: String = format.format(window.getEnd)
          val res = s"start eventTime--> ${format.format(lst.head._2)}," +
             s"end eventTime--> ${format.format(lst.last._2)}," +
             s"window startTime --> ${startTime}," +
             s"window endTime --> ${endTime}"
          out.collect(res)
        }
      }).print()
    env.execute()
  }
}
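
To exercise the watermark logic, send lines of the form `name eventTimestampMillis` to port 6666. A 3-second event-time window is only emitted once an event arrives whose timestamp is at least 10 seconds (maxLazy) past that window's end; apply() then prints the earliest and latest event times the window actually contains together with the window's start and end.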

13 Table & SQL(*****)

13.1 Dependencies

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_2.11</artifactId>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_2.11</artifactId>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
</dependency>
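
Note that the artifacts above are listed without <version> elements; in an actual pom.xml you would pin them to the Flink release you are running (for example via a shared Maven property), and the planner/bridge artifacts must match your Scala version (the _2.11 suffix here).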

13.2 table quick start

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row

object Demo4_Table_QuickStart {
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    //2. create the table environment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    //3. read the source data through the streaming environment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
    val data: DataStream[(String, Int)] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date + "_" + province, add)
    })
    //4. convert the DataStream into a Table
    var table: Table = tenv.fromDataStream(data)
    //5. query the data through the Table API (tuple fields default to _1, _2)
    table = table.select("_1,_2").where("_2>2")
    //6. convert the Table back into a DataStream and print it
    tenv.toAppendStream[Row](table).print("table ->")
    //7. execute
    env.execute()
  }
}

13.3 Table operations with named fields

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row

object Demo5_Table_QuickStart2 {
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._ // import Flink's Scala implicit conversions
    //2. create the table environment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    //3. read the source data through the streaming environment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
    val data: DataStream[(String, Int)] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      (date + "_" + province, add)
    })
    //4. convert the DataStream into a Table
    import org.apache.flink.table.api.scala._ // import the Flink Table API implicit conversions

    var table: Table = tenv.fromDataStream(data, 'date_province, 'cnt) // while converting, assign a field name (alias) to each tuple element
    //5. query the data through the Table API
    table = table.select("date_province,cnt").where("cnt>2")
    table.printSchema()
    //6. convert the Table back into a DataStream and print it
    tenv.toAppendStream[Row](table).print("table ->")
    //7. execute
    env.execute()
  }
}

13.4 Table API on the batch environment

package cn.qphone.flink.day6

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object Demo6_Table_Batch {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val tenv: BatchTableEnvironment = BatchTableEnvironment.create(env)
    val ds: DataSet[(Int, String, String, Int)] = env.fromElements("1,lixi,man,1").map(line => {
      val fields: Array[String] = line.split(",")
      (fields(0).toInt, fields(1), fields(2), fields(3).toInt)
    })
    val table: Table = tenv.fromDataSet(ds, 'id, 'name, 'sex, 'salary)
    table.groupBy("name")
      .select('name, 'salary.sum as 'sum_age)
      .toDataSet[Row]
      .print()
    env.execute()
  }
}

13.5 SQL queries

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // import Flink's Scala implicit conversions
import org.apache.flink.table.api.scala._ // import the Flink Table API implicit conversions

object Demo7_SQL_QuickStart {
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. create the table environment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    //3. read the source data through the streaming environment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    val data: DataStream[DPA] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      DPA(date+"_"+province, add)
    })
    //4. convert the DataStream into a Table
    var table: Table = tenv.fromDataStream[DPA](data) // the case-class field names (dp, add) become the column names
    //5. run a SQL query against the table
    tenv.sqlQuery(
      s"""
        |select
        |*
        |from
        |$table
        |where
        |`add` > 2
        |""".stripMargin).toAppendStream[Row].print()

    //7. execute
    env.execute()
  }
}

case class DPA(dp:String, add:Int)
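
The column add is escaped with backticks in the query above because ADD is listed among the reserved keywords in Flink's SQL documentation, so using it unquoted as a column name can fail to parse; backtick-quoting an identifier is harmless even when it is not reserved.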

13.6 SQL with time attributes

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // import Flink's Scala implicit conversions
import org.apache.flink.table.api.scala._ // import the Flink Table API implicit conversions

object Demo8_SQL_Time {
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. create the table environment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    //3. read the source data through the streaming environment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    val data: DataStream[DPAT] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      val date: String = fields(0).trim
      val province: String = fields(1)
      val add: Int = fields(2).toInt
      val ts: Long = fields(3).toLong
      DPAT(date+"_"+province, add, ts)
    })
    //4. convert the DataStream into a Table
    var table: Table = tenv.fromDataStream[DPAT](data) // the case-class field names (dp, add, ts) become the column names
    //5. SQL: tumble(timeAttribute, interval) normally goes in the GROUP BY clause; this query does a plain
    //   GROUP BY without a window (see the tumbling-window sketch after this listing)
    tenv.sqlQuery(
      s"""
        |select
        |dp,
        |sum(`add`) as sum_cnt
        |from
        |$table
        |group by dp
        |""".stripMargin).toRetractStream[Row].print() // a non-windowed GROUP BY keeps updating its results, so a retract stream is required

    //7. execute
    env.execute()
  }
}

case class DPAT(dp:String, add:Int, ts:Long)
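
The section title mentions time types, but the query above aggregates without any time attribute. As a minimal sketch (not from the original notes) of how TUMBLE could be used with a processing-time attribute in this legacy Table API: the object name Demo8a_SQL_Tumble, the attribute name pt, and the 5-second window size are illustrative assumptions; the DPAT case class defined above is reused.

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._
import org.apache.flink.table.api.scala._

object Demo8a_SQL_Tumble {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)

    val data: DataStream[DPAT] = socket.map(line => {
      val fields: Array[String] = line.split("\\s+")
      DPAT(fields(0).trim + "_" + fields(1), fields(2).toInt, fields(3).toLong)
    })

    // Append a processing-time attribute 'pt alongside the case-class fields.
    val table: Table = tenv.fromDataStream[DPAT](data, 'dp, 'add, 'ts, 'pt.proctime)

    // Aggregate per key over 5-second tumbling processing-time windows.
    // A windowed GROUP BY produces an append-only result, so toAppendStream is fine here.
    tenv.sqlQuery(
      s"""
        |select
        |dp,
        |sum(`add`) as sum_cnt
        |from
        |$table
        |group by dp, tumble(pt, interval '5' second)
        |""".stripMargin).toAppendStream[Row].print()

    env.execute()
  }
}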

13.7 wordcount

package cn.qphone.flink.day6

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // import Flink's Scala implicit conversions
import org.apache.flink.table.api.scala._ // import the Flink Table API implicit conversions

object Demo9_SQL_Wordcount{
  def main(args: Array[String]): Unit = {
    //1. get the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. create the table environment
    val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
    //3. read the source data through the streaming environment
    val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
    val word: DataStream[(String, Int)] = socket.flatMap(_.split("\\s+")).filter(_.nonEmpty).map((_,1))
    //4. convert the DataStream into a Table
    var table: Table = tenv.fromDataStream(word, 'word, 'cnt) // while converting, assign a field name (alias) to each tuple element
    //5. SQL: group by word and sum the counts
    tenv.sqlQuery(
      s"""
        |select
        |word,
        |sum(cnt)
        |from
        |${table}
        |group by word
        |""".stripMargin).toRetractStream[Row].print()

    //7. execute
    env.execute()
  }
}
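
toRetractStream is used here instead of toAppendStream because a non-windowed GROUP BY keeps updating its results as new words arrive: the stream carries (true, row) for inserts/updates and (false, row) for retractions of previously emitted rows, while an append-only stream could not express those updates.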

14 Differences between Flink DataStream and Spark Streaming

  1. Data model: Spark Streaming is micro-batch; Flink is a true record-at-a-time dataflow.
  2. Deployment on YARN: Spark distinguishes client and cluster mode; Flink offers yarn-session and per-job mode.
  3. Resource units: Spark uses executors; Flink uses task slots.
  4. Different APIs.
  5. Flink offers two-phase-commit (2PC) sinks for end-to-end exactly-once delivery; Spark Streaming does not provide this.
  6. Different windowing models.
  7. Latency: strictly speaking Spark Streaming only reaches second-level latency, while Flink can reach millisecond-level latency.
  8. Execution graph: Spark builds a DAG; Flink builds operator chains.
  9. Table engine: Spark SQL uses Catalyst; the Flink Table API uses Calcite.