Flink Real-Time Processing: DataStream

This article walks through Flink's DataStream data sources, including socket, file, collection, and custom sources, as well as the built-in connectors. It then covers the DataStream operators: transformations, partitioning, and sinks, including how to union, connect, and split streams, and the various partitioning strategies. Finally, it discusses saving and restoring stream state: keyed state, operator state, and the concepts, configuration, and use of checkpoints and savepoints.

Overview of the Flink API

1. DataStream data sources

1.1 Socket source: socketTextStream

Receive data from a socket and count how many times each word appeared in the last 5 seconds.

Step 1: start a socket service on node01
Run the following command on node01 to open the socket service:

nc -lk 9000

Step 2: implement the code

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkSource1 {

  def main(args: Array[String]): Unit = {

    // Get the execution environment
    val streamExecution: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val socketText: DataStream[String] = streamExecution.socketTextStream("node01", 9000)
    // Note: this implicit-conversion import is required; without it the flatMap call below does not compile
    import org.apache.flink.api.scala._
    val result: DataStream[(String, Int)] = socketText.flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5), Time.seconds(5)) // count the data of the last 5 seconds
      .sum(1)

    // Print the result
    result.print().setParallelism(1)
    // Launch the job
    streamExecution.execute()
  }
}
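With the job running, type lines of words into the nc session on node01; at the end of each 5-second window the job prints the counts for that window. The input and output below are illustrative:

# typed into the nc session
hello world
hello flink

# printed by the job once the window fires
(hello,2)
(world,1)
(flink,1)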

1.2 File source: readTextFile

Read and process all the files under an HDFS path.
Step 1: add the Maven dependencies

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0-mr1-cdh5.14.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0-cdh5.14.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0-cdh5.14.2</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0-cdh5.14.2</version>
</dependency>
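
Note that the listing above only adds the Hadoop client jars. The Flink dependencies themselves are assumed to be in the pom already; if not, something like the following is also needed (the Scala 2.11 suffix and the 1.9.2 version are assumptions, match them to your installation):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.9.2</version>
</dependency>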

Step 2: implement the code

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object FlinkSource2 {

  def main(args: Array[String]): Unit = {

    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    // Read the data from the files under the HDFS path
    val hdfStream: DataStream[String] = executionEnvironment.readTextFile("hdfs://node01:8020/flink_input/")
    val result: DataStream[(String, Int)] = hdfStream.flatMap(x => x.split(" ")).map(x => (x, 1)).keyBy(0).sum(1)

    result.print().setParallelism(1)

    executionEnvironment.execute("hdfsSource")
  }
}
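readTextFile reads the contents of the path once and the job then finishes on its own. If the directory should instead be watched for newly arriving files, Flink also offers readFile with FileProcessingMode.PROCESS_CONTINUOUSLY. A minimal sketch, assuming the same HDFS path as above (the 1000 ms scan interval is an arbitrary choice):

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object FlinkSource2Continuous {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val path = "hdfs://node01:8020/flink_input/"
    // Re-scan the directory every 1000 ms and read files that have appeared or changed
    val hdfsStream: DataStream[String] = env.readFile(
      new TextInputFormat(new Path(path)), path,
      FileProcessingMode.PROCESS_CONTINUOUSLY, 1000L)
    hdfsStream.print().setParallelism(1)
    env.execute("continuousHdfsSource")
  }
}

Note that in this mode a modified file is re-read in full, so appending to an existing file reprocesses its whole content.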

1.3 Collection source: fromElements

Code implementation

// Scala: word count
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object FlinkSource3 {

  def main(args: Array[String]): Unit = {

    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val value: DataStream[String] = environment.fromElements[String]("hello world", "spark flink")
    val result2: DataStream[(String, Int)] = value.flatMap(x => x.split(" "))
      .map(x => (x, 1))
      .keyBy(0)
      .sum(1)

    result2.print().setParallelism(1)
    environment.execute()
  }
}
// Java: prefix each word
import java.util.ArrayList;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingSourceFromCollection {

    public static void main(String[] args) throws Exception {
        // Step 1: get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Step 2: build some test data (the sample words are illustrative)
        ArrayList<String> data = new ArrayList<>();
        data.add("hadoop");
        data.add("spark");
        // Step 3: create a stream from the collection, prefix each word, and print
        DataStreamSource<String> dataStream = env.fromCollection(data);
        dataStream.map(word -> "prefix_" + word).print().setParallelism(1);
        env.execute("StreamingSourceFromCollection");
    }
}
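Collection sources such as fromElements and fromCollection run as non-parallel sources (parallelism 1) and are mainly meant for local development and testing; production jobs usually read from a socket, files, or one of the built-in connectors instead.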