Classic Spark Streaming demos

mustafa3264

Published 2024-04-21 22:57:37


Article link: https://blog.csdn.net/fanghailiang2016/article/details/138046148


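Streaming WordCount: window operations & watermark

Joining user profiles with the video interaction stream: batch data joined with stream data

Joining the video stream with the interaction stream: stream data joined with stream data

Periodic averaging of reported CPU and memory data: Kafka + Spark Streaming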

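Building on the Scala and Java implementations of the classic Spark demos, this post adds the classic demos for Spark Streaming (the code uses the Structured Streaming API). Only the Java versions are given here, with detailed comments; the Scala versions are left to the reader.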

Streaming WordCount: window operations & watermark

package com.mustafa.mynetty;

import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;

import java.util.concurrent.TimeoutException;

public class StreamingWordCount {

    public static void main(String[] args) {

        SparkSession session = SparkSession
                .builder()
                .appName("StreamingWordCount")
                .master("local[*]")
                .config("spark.executor.instances", "1")
                .config("spark.cores.max", "8")
                .config("spark.executor.cores", "8")
                .config("spark.executor.memory", "4g")
                .config("spark.memory.fraction", "0.9")
                .config("spark.memory.storageFraction", "0.1")
                .getOrCreate();

        // Local address and port of the socket to read from
        String host = "127.0.0.1";
        String port = "9999";

        // Create a streaming DataFrame from the socket source
        Dataset<Row> df = session.readStream()
                .format("socket")
                .option("host", host)
                .option("port", port)
                .load();

        /**
         * Word Count with the DataFrame API
         */
        // Each input line looks like "eventTime,text": split it on the comma first
        df = df.withColumn("inputs", functions.split(df.col("value"), ","))
                // The first element is the event time
                .withColumn("eventTime", functions.element_at(new Column("inputs"), 1).cast("timestamp"))
                // Split the second element on spaces to get the array words
                .withColumn("words", functions.split(functions.element_at(new Column("inputs"), 2), " "))
                // Flatten the array words into one row per word
                .withColumn("word", functions.explode(new Column("words")))
                // Create the watermark with a maximum lateness tolerance of 10 minutes
                .withWatermark("eventTime", "10 minutes")
                // Group by 5-minute time window, with word as the key
                .groupBy(functions.window(new Column("eventTime"), "5 minutes"), new Column("word"))
                // Count per group
                .count();

        /**
         * Write the Word Count result to the console
         */
        try {
            df.writeStream()
                    // Cut the stream at a fixed interval: one micro-batch every 5 seconds
                    .trigger(Trigger.ProcessingTime("5 seconds"))
                    // Checkpoint location; the timestamp suffix gives every run a fresh directory,
                    // so state is not recovered across restarts
                    .option("checkpointLocation", "hdfs://node1:8020/software/spark/check/" + System.currentTimeMillis())
                    // Use the console as the sink
                    .format("console")
                    // Do not truncate long cell values
                    .option("truncate", "false")
                    // Output mode: complete re-emits the whole result table on every trigger and
                    // keeps all aggregation state, so the watermark does not evict old windows here
                    .outputMode("complete")
                    // Start the streaming query
                    .start()
                    // Block until the query is stopped
                    .awaitTermination();
        } catch (StreamingQueryException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }

}

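The job simply reads socket input on port 9999. Spark's socket source connects to that port as a client, so a listener has to be running before the job starts. On a Mac, start one with nc -lk 9999 (the original post used nc -l -p 9999, the form expected by traditional netcat builds), then type the following lines: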

2021-10-01 09:30:00,Apache Spark
2021-10-01 09:34:00,Spark Logo
2021-10-01 09:36:00,Structured Streaming
2021-10-01 09:39:00,Spark Streaming
2021-10-01 09:41:00,AMP Lab
2021-10-01 09:44:00,Spark SQL
2021-10-01 09:29:00,Test Test
2021-10-01 09:33:00,Spark is cool

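Each line is first split on the comma to obtain eventTime. eventTime then drives 5-minute time windows, with a maximum tolerance of 10 minutes for late data on eventTime. Spark SQL functions split the text into words, and the rows are grouped by window and word and counted. The input is a socket, the output is the console, a checkpoint location is specified, and the output mode is complete.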
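To make the window and watermark arithmetic concrete, here is a minimal plain-Java sketch that replays the eight sample timestamps. It does not use Spark: WatermarkSketch, windowStart and droppedEvents are hypothetical names, and the watermark is advanced after every record, whereas Spark advances it between micro-batches. It also shows the behaviour of the update and append output modes; according to the Spark documentation, complete mode (used in the demo above) keeps all aggregation state and does not drop late rows.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: mimics 5-minute tumbling windows
// and a 10-minute watermark on the sample input.
final class WatermarkSketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    static final long WINDOW_SECONDS = Duration.ofMinutes(5).getSeconds();
    static final Duration TOLERANCE = Duration.ofMinutes(10);

    // Start of the 5-minute tumbling window that contains the event time.
    static LocalDateTime windowStart(LocalDateTime eventTime) {
        long epoch = eventTime.toEpochSecond(ZoneOffset.UTC);
        long start = epoch - Math.floorMod(epoch, WINDOW_SECONDS);
        return LocalDateTime.ofEpochSecond(start, 0, ZoneOffset.UTC);
    }

    // Returns the event times that would be dropped as too late.
    // Simplification: the watermark moves after every record.
    static List<String> droppedEvents(List<String> eventTimes) {
        List<String> dropped = new ArrayList<>();
        LocalDateTime maxSeen = null;
        for (String raw : eventTimes) {
            LocalDateTime t = LocalDateTime.parse(raw, FMT);
            if (maxSeen != null) {
                LocalDateTime watermark = maxSeen.minus(TOLERANCE);
                LocalDateTime windowEnd = windowStart(t).plusSeconds(WINDOW_SECONDS);
                // Late: the record's whole window ends at or before the watermark.
                if (!windowEnd.isAfter(watermark)) {
                    dropped.add(raw);
                    continue;
                }
            }
            if (maxSeen == null || t.isAfter(maxSeen)) {
                maxSeen = t;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
                "2021-10-01 09:30:00", "2021-10-01 09:34:00",
                "2021-10-01 09:36:00", "2021-10-01 09:39:00",
                "2021-10-01 09:41:00", "2021-10-01 09:44:00",
                "2021-10-01 09:29:00", "2021-10-01 09:33:00");
        System.out.println(droppedEvents(times)); // prints [2021-10-01 09:29:00]
    }
}
```

When 09:29:00 arrives, the largest event time seen is 09:44:00, so the watermark sits at 09:34:00. The 09:25 to 09:30 window is already behind it and the record is dropped, while 09:33:00 still lands in the open 09:30 to 09:35 window.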

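Joining user profiles with the video interaction stream: batch data joined with stream data

With short video so popular today, recommendation engines play an extremely important role, and to recommend well they must rely on real-time user feedback. Real-time feedback is simply the interactions we take for granted, such as likes, comments and forwards; the point to stress is timeliness. With ever more choices available, users' interests and preferences shift over time, so capturing what a user has been interested in recently matters more. Suppose we need to join offline user attributes with real-time user feedback to build a user feature vector. In this vector we want both the user's own attribute fields, such as age, gender, education and occupation, and the user's real-time interaction data, such as the number of likes and forwards within the last hour, so that the user is characterized more fully. Real-time feedback generally comes from an online data stream, whereas user attributes are usually stored in an offline data warehouse or a distributed file system. Joining real-time feedback with user attributes is therefore a typical stream-batch join scenario.

The user profile file userProfile.csv contains: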

id,name,age,gender
1,Alice,26,Female
2,Bob,32,Male
3,Cassie,18,Female
4,David,40,Male
5,Emma,16,Female

The interaction stream has the schema userId,videoId,event,eventTime, where event is one of Forward, Like or Comment. The data is supplied in batches over time.

interactions0.csv

userId,videoId,event,eventTime
1,1,Forward,2021-10-01 09:30:00
3,5,Like,2021-10-01 09:30:25
4,2,Comment,2021-10-01 09:31:02
2,1,Comment,2021-10-01 09:31:20
3,3,Like,2021-10-01 09:31:50

interactions1.csv

userId,videoId,event,eventTime
5,1,Like,2021-10-01 09:32:10
1,2,Like,2021-10-01 09:32:40
4,4,Forward,2021-10-01 09:33:11
5,2,Like,2021-10-01 09:33:52
2,3,Like,2021-10-01 09:34:02

interactions2.csv

userId,videoId,event,eventTime
1,5,Comment,2021-10-01 09:34:35
5,4,Like,2021-10-01 09:34:56
2,3,Forward,2021-10-01 09:35:20
3,5,Forward,2021-10-01 09:35:33
4,5,Comment,2021-10-01 09:35:54

The code is as follows:

package com.mustafa.mynetty;

import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.StructType;

import java.util.concurrent.TimeoutException;

public class StreamingUserProfile {

    public static void main(String[] args) {

        SparkSession session = SparkSession
                .builder()
                .appName("StreamingUserProfile")
                .master("local[*]")
                .config("spark.executor.instances", "1")
                .config("spark.cores.max", "8")
                .config("spark.executor.cores", "8")
                .config("spark.executor.memory", "4g")
                .config("spark.memory.fraction", "0.9")
                .config("spark.memory.storageFraction", "0.1")
                .getOrCreate();

        // Root directory that holds the staging, interactions and userProfile folders
        String rootPath = "hdfs://node1:8020/bigdata/spark/StreamingUserProfile/datas";

        // Read the offline data with the read API to create a static DataFrame
        Dataset<Row> staticDF = session.read()
                .format("csv")
                .option("header", true)
                .load(rootPath + "/userProfile/userProfile.csv");

        // Schema of the user feedback files
        StructType actionSchema = new StructType()
                .add("userId", "integer")
                .add("videoId", "integer")
                .add("event", "string")
                .add("eventTime", "timestamp");

        // Load the stream with the readStream API; compare it with the read API above
        Dataset<Row> streamingDF = session.readStream()
                // File format
                .format("csv")
                .option("header", true)
                // Directory to monitor for new files
                .option("path", rootPath + "/interactions")
                // File sources require an explicit schema
                .schema(actionSchema)
                .load();

        // Group and aggregate the interaction data (step 4 of the flow chart)
        streamingDF = streamingDF
                // Create the watermark with a maximum lateness tolerance of 30 minutes
                .withWatermark("eventTime", "30 minutes")
                // Group by time window, userId and interaction type event
                .groupBy(functions.window(new Column("eventTime"), "1 hours"), new Column("userId"), new Column("event"))
                // Count each user's interactions of each type per time window
                .count();

        /**
         Stream-batch join (step 5 of the flow chart).
         It looks no different from an ordinary join between two DataFrames.
         */
        Dataset<Row> jointDF = streamingDF.join(staticDF, streamingDF.col("userId").equalTo(staticDF.col("id")));

        /**
         * Write the StreamingUserProfile result to the console
         */
        try {
            jointDF.writeStream()
                    // Cut the stream at a fixed interval: one micro-batch every 5 seconds
                    .trigger(Trigger.ProcessingTime("5 seconds"))
                    // Checkpoint location; the timestamp suffix gives every run a fresh directory
                    .option("checkpointLocation", "hdfs://node1:8020/software/spark/check/" + System.currentTimeMillis())
                    // Use the console as the sink
                    .format("console")
                    // Do not truncate long cell values
                    .option("truncate", "false")
                    // Output mode
                    .outputMode("complete")
                    // Start the streaming query
                    .start()
                    // Block until the query is stopped
                    .awaitTermination();
        } catch (StreamingQueryException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }

}

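The user profiles are read from a CSV file on HDFS and are batch data. The interaction data is obtained as an input stream by monitoring the CSV files under the interactions directory on HDFS. The stream gets a watermark with a 30-minute tolerance, and the number of each type of interaction per user is counted in 1-hour windows on eventTime. Finally the aggregated interaction stream is joined with the user profiles. The stream is cut and aggregated every 5 seconds, and the result is written to the console.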
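As a rough illustration of what the aggregation and the join produce, the plain-Java sketch below replays interactions0.csv against userProfile.csv without Spark. StreamStaticJoinSketch and its methods are hypothetical names; the 1-hour window is left out of the key because every sample row falls into the same 09:00 to 10:00 window, and only the name column of the profile is carried over.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: counts interactions per user and
// event type, then inner-joins the counts with a static profile table.
final class StreamStaticJoinSketch {

    // Static side: id -> name, taken from userProfile.csv.
    static Map<Integer, String> profiles() {
        Map<Integer, String> p = new LinkedHashMap<>();
        p.put(1, "Alice");
        p.put(2, "Bob");
        p.put(3, "Cassie");
        p.put(4, "David");
        p.put(5, "Emma");
        return p;
    }

    // Streaming side: count rows per "userId|event".
    static Map<String, Integer> countByUserAndEvent(String[] rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            String[] f = row.split(","); // userId,videoId,event,eventTime
            counts.merge(f[0] + "|" + f[2], 1, Integer::sum);
        }
        return counts;
    }

    // Inner join on userId = id: rows without a matching profile are discarded.
    static Map<String, Integer> joinWithProfiles(Map<String, Integer> counts,
                                                 Map<Integer, String> profiles) {
        Map<String, Integer> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] key = e.getKey().split("\\|");
            String name = profiles.get(Integer.parseInt(key[0]));
            if (name != null) {
                joined.put(name + "|" + key[1], e.getValue());
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        String[] interactions0 = {
                "1,1,Forward,2021-10-01 09:30:00",
                "3,5,Like,2021-10-01 09:30:25",
                "4,2,Comment,2021-10-01 09:31:02",
                "2,1,Comment,2021-10-01 09:31:20",
                "3,3,Like,2021-10-01 09:31:50"};
        System.out.println(joinWithProfiles(countByUserAndEvent(interactions0), profiles()));
        // prints {Alice|Forward=1, Cassie|Like=2, David|Comment=1, Bob|Comment=1}
    }
}
```

In the Spark job the static side is re-read for each micro-batch while the counts accumulate in streaming state; here both sides are simply held in memory.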

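Joining the video stream with the interaction stream: stream data joined with stream data

Now we want to measure how popular each short video is some time after it is published (say 1 hour, 6 hours or 12 hours). Popularity here is simply the count of interactions such as forwards, comments and likes.

The video stream data, videoPosting.csv, is as follows: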

id,name,postTime
1,1分钟整理衣物,2021-10-01 09:10:00
2,你不笑算我输,2021-10-01 09:12:13
3,九大行星自转速率对比,2021-10-01 09:12:15
4,人类幼崽,2021-10-01 09:14:42
5,什么是第二大脑,2021-10-01 09:14:58

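The video stream is produced in batches and uploaded to the videoPosting directory on HDFS. The interaction data is the same as above and is uploaded to the interactions directory on HDFS. The code is as follows: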

package com.mustafa.mynetty;

import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.StructType;

import java.util.concurrent.TimeoutException;

public class StreamingPostSchema {

    public static void main(String[] args) {
        SparkSession session = SparkSession
                .builder()
                .appName("StreamingPostSchema")
                .master("local[*]")
                .config("spark.executor.instances", "1")
                .config("spark.cores.max", "8")
                .config("spark.executor.cores", "8")
                .config("spark.executor.memory", "4g")
                .config("spark.memory.fraction", "0.9")
                .config("spark.memory.storageFraction", "0.1")
                .getOrCreate();

        // Root directory that holds the staging, interactions and userProfile folders
        String rootPath = "hdfs://node1:8020/bigdata/spark/StreamingUserProfile/datas";

        // Schema of the video stream
        StructType postSchema = new StructType()
                .add("id", "integer")
                .add("name", "string")
                .add("postTime", "timestamp");

        // Monitor the videoPosting directory and load newly added files as a stream
        Dataset<Row> postStream = session.readStream()
                .format("csv")
                .option("header", true)
                .option("path", rootPath + "/videoPosting")
                .schema(postSchema)
                .load();

        // Define the watermark, i.e. the tolerance for late data
        Dataset<Row> postStreamWithWatermark = postStream
                .withWatermark("postTime", "5 minutes");

        // Schema of the interaction stream
        StructType actionSchema = new StructType()
                .add("userId", "integer")
                .add("videoId", "integer")
                .add("event", "string")
                .add("eventTime", "timestamp");

        // Load the stream with the readStream API; compare it with the read API
        Dataset<Row> actionStream = session.readStream()
                // File format
                .format("csv")
                .option("header", true)
                // Directory to monitor for new files
                .option("path", rootPath + "/interactions")
                // File sources require an explicit schema
                .schema(actionSchema)
                .load();

        // Define the watermark, i.e. the tolerance for late data
        Dataset<Row> actionStreamWithWatermark = actionStream.withWatermark("eventTime", "1 hours");

        // Stream-stream join: same video, and the interaction happens within 1 hour of posting
        Dataset<Row> jointDF = actionStreamWithWatermark
                .join(postStreamWithWatermark,
                        functions.expr("videoId = id AND eventTime >= postTime AND eventTime <= postTime + interval 1 hour"));

        /**
         * Write the StreamingPostSchema result to the console
         */
        try {
            jointDF.writeStream()
                    // Cut the stream at a fixed interval: one micro-batch every 5 seconds
                    .trigger(Trigger.ProcessingTime("5 seconds"))
                    // Checkpoint location; the timestamp suffix gives every run a fresh directory
                    .option("checkpointLocation", "hdfs://node1:8020/software/spark/check/" + System.currentTimeMillis())
                    // Use the console as the sink
                    .format("console")
                    // Do not truncate long cell values
                    .option("truncate", "false")
                    // Output mode
                    .outputMode("append")
                    // Start the streaming query
                    .start()
                    // Block until the query is stopped
                    .awaitTermination();
        } catch (StreamingQueryException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }

}

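Both streams have a watermark: the video stream tolerates 5 minutes of lateness and the interaction stream tolerates 1 hour. The join condition is videoId = id AND eventTime >= postTime AND eventTime <= postTime + interval 1 hour. Besides the primary and foreign keys, the join condition also needs to constrain the event times of the two tables, so that Spark can bound how much join state it has to keep. Here postTime is the event time of the video stream and eventTime is the event time of the interaction stream. The code means that for any published video we only care about interactions within one hour of posting; interactions outside that hour no longer take part in the join. Finally the result is written to the console in append mode.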
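The join condition can be read as an ordinary predicate over one interaction row and one video row. The plain-Java sketch below evaluates it without Spark; TimeRangeJoinSketch and matches are hypothetical names, and the 10:40:00 interaction is made up to show a row that falls outside the one-hour range.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Spark: evaluates the join condition
// videoId = id AND eventTime >= postTime AND eventTime <= postTime + interval 1 hour
final class TimeRangeJoinSketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static boolean matches(int videoId, String eventTime, int id, String postTime) {
        LocalDateTime e = LocalDateTime.parse(eventTime, FMT);
        LocalDateTime p = LocalDateTime.parse(postTime, FMT);
        // Both bounds are inclusive, as in the SQL expression.
        return videoId == id && !e.isBefore(p) && !e.isAfter(p.plusHours(1));
    }

    public static void main(String[] args) {
        // Sample data: video 1 posted at 09:10:00, forwarded at 09:30:00.
        System.out.println(matches(1, "2021-10-01 09:30:00", 1, "2021-10-01 09:10:00")); // prints true
        // Made-up interaction 90 minutes after posting: outside the range.
        System.out.println(matches(1, "2021-10-01 10:40:00", 1, "2021-10-01 09:10:00")); // prints false
        // Same time range but a different video: the key does not match.
        System.out.println(matches(2, "2021-10-01 09:30:00", 1, "2021-10-01 09:10:00")); // prints false
    }
}
```

Every interaction in the three sample files happens within an hour of its video being posted, so all fifteen rows satisfy the condition once both streams have been loaded.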

Periodic averaging of reported CPU and memory data: Kafka + Spark Streaming

The CPU and memory reporter writes to Kafka, so add the following to pom.xml:

    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.8.0</version>
    </dependency>

The reporting code is as follows:

package com.mustafa.mynetty;

import com.sun.management.OperatingSystemMXBean;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.lang.management.ManagementFactory;
import java.util.Properties;

public class LocalUsageMonitor {

    // On HotSpot/OpenJDK the platform bean implements com.sun.management.OperatingSystemMXBean,
    // which exposes the memory and CPU getters directly. Calling them through the typed interface
    // avoids reflection with setAccessible(true), which newer JDKs reject for internal classes.
    private static final OperatingSystemMXBean OS_BEAN =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    // Kafka producer configuration: broker list, String serializers and client ID
    private static Properties initConfig(String clientID) {
        Properties props = new Properties();
        String brokerList = "localhost:9092";
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.CLIENT_ID_CONFIG, clientID);
        return props;
    }

    // Physical memory usage in percent
    // (both getters are deprecated since JDK 14 but still available)
    private static String getMemoryUsage() {
        long freeMemory = OS_BEAN.getFreePhysicalMemorySize();
        long totalMemory = OS_BEAN.getTotalPhysicalMemorySize();
        double usage = (double) (totalMemory - freeMemory) / totalMemory * 100;
        return Double.toString(usage);
    }

    // System-wide CPU usage in percent; getSystemCpuLoad() returns a value
    // in [0.0, 1.0], or a negative value when the load is not available
    private static String getCPUUsage() {
        double usage = OS_BEAN.getSystemCpuLoad() * 100;
        return Double.toString(usage);
    }

    // Invoked by the producer once a send has completed or failed
    static class UsageCallback implements Callback {

        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                exception.printStackTrace();
            } else {
                System.out.println(metadata.topic() + "-" + metadata.partition() + ":" + metadata.offset());
            }
        }
    }

    public static void main(String[] args) {
        String clientID = "usage.monitor.client";
        String cpuTopic = "cpu-monitor";
        String memTopic = "mem-monitor";
        Properties props = initConfig(clientID);
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            Callback usageCallback = new UsageCallback();

            while (true) {
                String cpuUsage = getCPUUsage();
                String memoryUsage = getMemoryUsage();
                System.out.println("cpuUsage: " + cpuUsage + ", memoryUsage: " + memoryUsage);

                // The client ID is the message key, so the consumer can group by reporter
                ProducerRecord<String, String> cpuRecord = new ProducerRecord<>(cpuTopic, clientID, cpuUsage);
                ProducerRecord<String, String> memRecord = new ProducerRecord<>(memTopic, clientID, memoryUsage);
                producer.send(cpuRecord, usageCallback);
                producer.send(memRecord, usageCallback);
                try {
                    // Report every 2 seconds
                    Thread.sleep(2000);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and leave the loop so the producer is closed
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }

    }

}

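Memory usage is derived from OperatingSystemMXBean's getFreePhysicalMemorySize and getTotalPhysicalMemorySize, and CPU usage from getSystemCpuLoad. Sending data to Kafka consists of configuring Properties, creating a KafkaProducer, creating ProducerRecord objects, sending them with send, and receiving the result of each send through a Callback.

Consuming Kafka messages with Spark Streaming

pom.xml needs the following dependency: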

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>3.5.1</version>
    </dependency>

The code is as follows:

package com.mustafa.mynetty;

import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StreamingCpuMemoryUseage {

    public static void main(String[] args) {
        SparkSession session = SparkSession
                .builder()
                .appName("StreamingCpuMemoryUseage")
                .master("local[*]")
                .config("spark.executor.instances", "1")
                .config("spark.cores.max", "8")
                .config("spark.executor.cores", "8")
                .config("spark.executor.memory", "4g")
                .config("spark.memory.fraction", "0.9")
                .config("spark.memory.storageFraction", "0.1")
                .getOrCreate();

        // Again based on the readStream API
        Dataset<Row> dfCPU = session.readStream()
                // The format must be set to kafka explicitly
                .format("kafka")
                // Kafka broker addresses; separate multiple brokers with commas
                .option("kafka.bootstrap.servers", "localhost:9092")
                // Topic to subscribe to, cpu-monitor in this example
                .option("subscribe", "cpu-monitor")
                .load();

        try {
            dfCPU
                    // Kafka delivers key and value as binary; cast both to string
                    .withColumn("key", dfCPU.col("key").cast("string"))
                    .withColumn("value", dfCPU.col("value").cast("string"))
                    // Group by server (the message key is the reporter's client ID)
                    .groupBy(new Column("key"))
                    // Average the readings as doubles, then cast back to string for the Kafka sink
                    .agg(functions.avg(new Column("value").cast("double")).cast("string").alias("value"))
                    .writeStream()
                    .outputMode("complete")
                    // Use Kafka as the sink
                    .format("kafka")
                    // Kafka cluster; in this example localhost is the only broker
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    // Target topic; cpu-monitor-agg-result must be created beforehand
                    .option("topic", "cpu-monitor-agg-result")
                    // Checkpoint location; the timestamp suffix gives every run a fresh directory
                    .option("checkpointLocation", "hdfs://node1:8020/software/spark/check/" + System.currentTimeMillis())
                    // Trigger one micro-batch every 10 seconds
                    .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
                    .start()
                    .awaitTermination();
        } catch (StreamingQueryException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }

}

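Both the input stream and the output stream are Kafka. The Kafka sink takes its payload from a column named value (plus an optional key column), which is why the intermediate columns keep the names key and value. The query groups by key and, on every 10-second trigger, writes the average of all values received so far for each key.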
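Because the query has no time window and runs in complete mode, each trigger emits an average over everything seen so far rather than over the last 10 seconds. The plain-Java sketch below imitates that behaviour without Spark or Kafka; RunningAverageSketch and processBatch are hypothetical names, and the readings 10.0, 20.0 and 30.0 are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: keeps a running sum and count per
// key, the way an unwindowed avg behaves in complete output mode.
final class RunningAverageSketch {
    // key -> {sum, count}; plays the role of the streaming aggregation state
    private final Map<String, double[]> state = new LinkedHashMap<>();

    // Feeds one micro-batch of (key, value-as-string) pairs and returns the
    // average per key over all batches processed so far, as strings.
    Map<String, String> processBatch(String[][] batch) {
        for (String[] kv : batch) {
            double[] s = state.computeIfAbsent(kv[0], k -> new double[2]);
            s[0] += Double.parseDouble(kv[1]); // Kafka values arrive as text
            s[1] += 1;
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : state.entrySet()) {
            result.put(e.getKey(), Double.toString(e.getValue()[0] / e.getValue()[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        RunningAverageSketch query = new RunningAverageSketch();
        String key = "usage.monitor.client";
        System.out.println(query.processBatch(new String[][]{{key, "10.0"}, {key, "20.0"}}));
        // prints {usage.monitor.client=15.0}
        System.out.println(query.processBatch(new String[][]{{key, "30.0"}}));
        // prints {usage.monitor.client=20.0}
    }
}
```

The second batch yields 20.0, the mean of all three readings, not 30.0. An average per fixed interval would need a time window in the groupBy, along the lines of the window and watermark demo earlier in the post.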