Structured Streaming Notes

Reference documentation: http://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html

2. Getting Started with Structured Streaming

The Socket Source is shown below.

2.1 A Simple Structured Streaming Template
Steps:
	1. Clarify the requirements
	2. Implement the Structured Streaming code
	3. Run it
	4. Verify the result

Requirements:
	1. Write a streaming application that continuously receives messages from an external system
	2. Count the frequency of each word in the messages
	3. Keep a global (running) count

Overall structure:
	1. A socket server waits for the Structured Streaming program to connect (the socket server is implemented with Netcat, e.g. nc -lk 6666)
	2. The Structured Streaming program starts, connects to the socket server, and waits for it to send data
	3. The socket server sends data and the Structured Streaming program receives it
	4. The Structured Streaming program processes the received data
	5. The processing produces a result set, which is printed to the console
package day10

import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.MyApp

/**
 * Use Structured Streaming to receive data from a socket stream
 */
object SocketSource extends MyApp{

  val spark = SparkSession.builder().appName("socketSource").master("local[*]")
    .getOrCreate()
  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")
  // 1. Source: read the external data source and turn it into a DataFrame
  val df: DataFrame = spark.readStream.format("socket")
    .option("host", "qianfeng01")
    .option("port", "6666")
    .load()
  // 2. Transformations
  df.printSchema()
  // df.show() // throws: Queries with streaming sources must be executed with writeStream.start()
  // 3. Sink: output
  df.writeStream.format("console") // write the result to the console
    .start()  // start the streaming query
    .awaitTermination() // block the main thread; data is processed continuously in background threads
}


Exception: a streaming DataFrame can only be output through a sink; df.show() is not allowed.


2.2 Word Count with Structured Streaming
package day10

import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import utils.MyApp

/**
 * Use Structured Streaming to receive data from a socket stream
 */
object SocketSource extends MyApp{

  System.setProperty("HADOOP_USER_NAME", "root")

  val spark = SparkSession.builder().appName("socketSource").master("local[*]")
    .getOrCreate()
  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")
  // 1. Source
  val df: DataFrame = spark.readStream.format("socket")
    .option("host", "qianfeng01")
    .option("port", "6666")
    .load()
  // 2. Transformations
  df.printSchema()

  import org.apache.spark.sql.functions._
  import spark.implicits._

  // DSL style
  val result= df.select(explode(split($"value"," ")).as("word"))
    .groupBy($"word")
    .count()

  // 3. Sink
  result.writeStream.format("console") // write the result to the console
    .outputMode(OutputMode.Complete()) // with an aggregation, the default Append mode is not allowed; use Complete (or Update)
    .start()  // start the streaming query
    .awaitTermination() // block the main thread; data is processed continuously in background threads
}


Exception: once the query contains an aggregation, Append output mode (the default) is no longer allowed; the other output modes are Complete and Update.
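As a sketch of the alternative to Complete mode (my own illustration, reusing the result DataFrame and imports from the word-count code above), Update mode only emits the rows whose counts changed in the current micro-batch instead of the whole result table:

```scala
// Minimal sketch, assuming `result` and OutputMode from the word-count example above.
result.writeStream
  .format("console")
  .outputMode(OutputMode.Update())  // Update() instead of Complete(): print only changed rows
  .start()
  .awaitTermination()
```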


Summary:

	1. The programming steps in Structured Streaming are still: read, then process, then write out
	2. The programming model in Structured Streaming is still DataFrame and Dataset
	3. Structured Streaming still has a framework for reading from and writing to external data sources, called readStream and writeStream
	4. Structured Streaming is almost identical to Spark SQL; the only difference is that readStream reads a stream and writeStream writes the stream out, whereas batch processing in Spark SQL uses read and write
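A minimal sketch of point 4, side by side (the people JSON directory, output path, and two-field schema here are illustrative assumptions, not from the original notes):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructType}

object ReadVsReadStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("readVsReadStream").master("local[*]").getOrCreate()

    // illustrative schema for a hypothetical people JSON directory
    val schema = new StructType()
      .add("name", DataTypes.StringType)
      .add("age", DataTypes.IntegerType)

    // Spark SQL batch: read once, write once
    val batchDf = spark.read.schema(schema).json("data/people")
    batchDf.write.mode("overwrite").json("out/people")

    // Structured Streaming: readStream produces an unbounded DataFrame,
    // writeStream starts a continuously running query
    val streamDf = spark.readStream.schema(schema).json("data/people")
    streamDf.writeStream.format("console").start().awaitTermination()
  }
}
```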

3. The Architecture of Structured Streaming

Key points:

	1. A DataFrame in Structured Streaming can be thought of as an unbounded, ever-growing two-dimensional table: every arriving record is appended as a new row at the end of the table.
	2. The core of Structured Streaming is the StreamExecution engine.

The discussion below revolves around these two points; for more detail, go back to the documentation.
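One small, hedged way to see that the same DataFrame abstraction covers both the bounded and the unbounded table (my own illustration, not from the original notes) is the Dataset.isStreaming flag:

```scala
import org.apache.spark.sql.SparkSession

object IsStreamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("isStreamingDemo").master("local[*]").getOrCreate()

    // batch DataFrame (illustrative local path) vs streaming DataFrame (the socket source from these notes)
    val batchDf  = spark.read.text("data/words")
    val streamDf = spark.readStream.format("socket")
      .option("host", "qianfeng01").option("port", "6666").load()

    println(batchDf.isStreaming)   // false: a bounded table
    println(streamDf.isStreaming)  // true: an unbounded, ever-growing input table
  }
}
```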


4. Source

The Socket Source was already covered in the getting-started template above.

4.1 HDFS Source
Example 1:

This example was meant to read JSON from HDFS; since that is not working yet, a local JSON directory is used instead.

Unlike batch processing, a streaming JSON source requires an explicit schema.

package day10

import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.MyApp

object HdfsSource02 extends MyApp{

  val spark = SparkSession.builder().appName("HdfsSource").master("local[*]")
    .getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")

  val schema = StructType(Seq(
    StructField("name",DataTypes.StringType),
    StructField("age",DataTypes.IntegerType),
    StructField("height",DataTypes.DoubleType),
    StructField("province",DataTypes.StringType)
  ))

  val df: DataFrame = spark.readStream.format("json")
    .schema(schema)
    .load("data")  // pass a directory, not a single file; every JSON file in it must match the four-field schema

  df.printSchema()

  df.writeStream.format("console") // write the result to the console
    .start()  // start the streaming query
    .awaitTermination() // block the main thread; data is processed continuously in background threads

}

Exception 1: unlike batch processing, a streaming JSON source requires an explicit schema.

Exception 2: load() must be given a directory of JSON files, not a single file path.

4.2 Kafka Source

Add the dependency:

<!--Structured Streaming + KAFKA-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>2.4.5</version>
        </dependency>

Start a console producer to send test data:

kafka-console-producer.sh --broker-list qianfeng01:9092 --topic test0621-1
package day10

import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.MyApp

object KafkaSource05 extends MyApp{
  val spark=SparkSession.builder().appName("socket-source").master("local[2]").getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")

  //Source
  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("subscribe", "test0621-1")
    .load()

  //op
  private val result: DataFrame = df.selectExpr("cast(value as string)")
  // keeps only one column: value, cast to a string

  //Sink
  result.writeStream.format("console")
    .start()
    .awaitTermination()
}

{"devices":{"cameras":{"device_id":"awJo6rH","last_event":{"has_sound":true,"has_motion":true,"has_person":true,"start_time":"2016-12-29T00:00:00.000Z","end_time":"2016-12-29T18:42:00.000Z"}}}}

注意:其实应该先执行IDEA 再执行 kafka的 这样才会有数据
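If that ordering is inconvenient, a hedged alternative (not in the original notes) is to ask the Kafka source for the earliest available offsets, so messages produced before the query started are still read:

```scala
// Sketch only: the same source as above, plus startingOffsets.
// "earliest" makes the first run read the topic from the beginning;
// later restarts resume from the checkpointed offsets instead.
val dfFromBeginning = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "qianfeng01:9092")
  .option("subscribe", "test0621-1")
  .option("startingOffsets", "earliest")
  .load()
```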


Test incrementally: first cast the Kafka value column to a string (deserialize it) and check the console output, then add a checkpointLocation option and check again.
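A minimal sketch of adding a checkpoint location to the console sink (reusing the result DataFrame from above; the path is illustrative, not the one from the notes, which appears commented out in the full code below):

```scala
// Sketch: persist offsets and state so the query can be restarted from where it left off.
result.writeStream
  .format("console")
  .option("checkpointLocation", "file:///tmp/checkpoints/kafka-source-demo")  // illustrative path
  .start()
  .awaitTermination()
```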

The complete code is as follows:

package day10

import org.apache.spark.sql.types.{DataTypes, StructType}
import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.MyApp

object KafkaSource05 extends MyApp{
  val spark=SparkSession.builder().appName("socket-source").master("local[2]").getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")

  //1. Source
  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("subscribe", "test0621-1")
    .load()

  df.printSchema()

  /* Sample message:
  {
    "devices": {
      "cameras": {
        "device_id": "awJo6rH",
        "last_event": {
          "has_sound": true,
          "has_motion": true,
          "has_person": true,
          "start_time": "2016-12-29T00:00:00.000Z",
          "end_time": "2016-12-29T18:42:00.000Z"
        }
      }
    }
  }
  */

  // schema of the nested JSON, built bottom-up
  val l4=new StructType()
    .add("has_sound",DataTypes.BooleanType)
    .add("has_motion",DataTypes.BooleanType)
    .add("has_person",DataTypes.BooleanType)
    .add("start_time",DataTypes.StringType)
    .add("end_time",DataTypes.StringType)

  val l3=new StructType()
    .add("device_id",DataTypes.StringType)
    .add("last_event",l4)

  val l2=new StructType()
    .add("cameras",l3)

  val l1=new StructType().add("devices",l2)
  val schema=l1

  import org.apache.spark.sql.functions._
  import spark.implicits._
  private val result: DataFrame = df.selectExpr("cast(value as string)") // keep only the value column, cast to a string
        .select(from_json('value,schema).as("value"))
        .where("value.devices.cameras.last_event.has_person=true")

  //3. Sink
   result.writeStream.format("console")
//    .option("checkpointLocation","/checkpints/20210624-2")  // problem: recovery from this checkpoint failed, so the option is commented out
    .option("truncate","false")  // show the full column content instead of truncating it
    .start()
    .awaitTermination()
}


5. Sink

Rather than using the Kafka sink or the HDFS (file) sink directly, foreachBatch (available since Spark 2.4) is usually the most convenient choice: it hands each micro-batch to you as an ordinary batch DataFrame, so the familiar batch APIs can be used for output.

foreachBatch

Key code:

 result.writeStream.outputMode(OutputMode.Complete()).foreachBatch((ds: Dataset[Row], bid: Long) => {
   ds.show()

   ds.write.mode(SaveMode.Overwrite).saveAsTable("wc624")
   // ds.write.save("/output/streaming/62417/")

 }).start().awaitTermination()

Note: this is far more flexible than the Kafka and HDFS sink examples in the documentation.
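As an illustration of that point, here is a hedged sketch (my own example, not from the original notes) of using foreachBatch to write each micro-batch out with ordinary batch writers, e.g. to Parquet and to another Kafka topic. It reuses the result DataFrame from the Kafka example above; the output path and topic name are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Inside foreachBatch, `batchDf` is a normal batch DataFrame and
// `batchId` identifies the micro-batch (useful for idempotent writes).
result.writeStream.foreachBatch((batchDf: DataFrame, batchId: Long) => {
  // 1. append the batch to a Parquet directory (illustrative path)
  batchDf.write.mode(SaveMode.Append).parquet("/output/streaming/device-events")

  // 2. write the batch to another Kafka topic (illustrative topic name);
  //    the Kafka writer expects a string/binary `value` column
  batchDf.selectExpr("to_json(struct(*)) AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("topic", "device-events-filtered")
    .save()
}).start().awaitTermination()
```

The full program from the notes, using foreachBatch as the sink, follows: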

package day10

import org.apache.spark.sql.types.{DataTypes, StructType}
import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.MyApp

object KafkaSource05 extends MyApp{
  val spark=SparkSession.builder().appName("socket-source").master("local[2]").getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")

  //1. Source
  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("subscribe", "test0621-1")
    .load()

  df.printSchema()

  /* Sample message:
  {
    "devices": {
      "cameras": {
        "device_id": "awJo6rH",
        "last_event": {
          "has_sound": true,
          "has_motion": true,
          "has_person": true,
          "start_time": "2016-12-29T00:00:00.000Z",
          "end_time": "2016-12-29T18:42:00.000Z"
        }
      }
    }
  }
  */

  // schema of the nested JSON, built bottom-up
  val l4=new StructType()
    .add("has_sound",DataTypes.BooleanType)
    .add("has_motion",DataTypes.BooleanType)
    .add("has_person",DataTypes.BooleanType)
    .add("start_time",DataTypes.StringType)
    .add("end_time",DataTypes.StringType)

  val l3=new StructType()
    .add("device_id",DataTypes.StringType)
    .add("last_event",l4)

  val l2=new StructType()
    .add("cameras",l3)

  val l1=new StructType().add("devices",l2)
  val schema=l1

  import org.apache.spark.sql.functions._
  import spark.implicits._
  private val result: DataFrame = df.selectExpr("cast(value as string)") // keep only the value column, cast to a string
        .select(from_json('value,schema).as("value"))
        .where("value.devices.cameras.last_event.has_person=true")

  //3. Sink
  // Console sink variant (kept for reference):
  //  result.writeStream.format("console")
  //    .option("checkpointLocation","/checkpints/20210624-2")  // problem: recovery from this checkpoint failed
  //    .option("truncate","false")  // show the full column content
  //    .start()
  //    .awaitTermination()

  // foreachBatch sink: far more flexible than the built-in Kafka/HDFS sink examples in the documentation
    result.writeStream.foreachBatch((ds: DataFrame, bid: Long)=>{
      ds.show()
    }).start()
      .awaitTermination()
}


Weekly Project


Preparation:

1. Produce the log data into Kafka

Approach: read the provided log file and send it line by line to Kafka, which then acts as the streaming source.

package day11_weekProject

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import utils.MyApp

import java.util.Properties
import scala.io.Source.fromFile

/**
 * Send the log data to Kafka
 */
object Log2Kafka extends MyApp{
  // read the log file
  private val lines: Iterator[String] = fromFile("./src/main/data/logs/access.log").getLines()

  // KafkaProducer configuration
  val properties = new Properties()
  properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.1.101:9092")
  properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String,String](properties)
  val topic="nginx-log"

  while(lines.hasNext){
    val line = lines.next()
    // send the line to Kafka
    send2Kafka(topic,line)
  }
  producer.close()

  // helper that sends one record to Kafka
  def send2Kafka(topic:String, line:String): Unit ={
    val producerRecord = new ProducerRecord[String, String](topic,line)
    producer.send(producerRecord)
  }
}

Test with a console consumer:

[root@qianfeng01 ~]# kafka-console-consumer.sh --bootstrap-server qianfeng01:9092 --topic nginx-log --from-beginning

2. The rate at which data is produced, per 5-minute window

Approach: use a tumbling window and count the records that fall into each window.

package day11_weekProject
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import utils.MyApp

object WeekProject extends MyApp{

  val spark = SparkSession.builder()
    .appName("WeekProject")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS","file:///")

  private val logs: DataFrame = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("subscribe", "nginx-log")
    .load()

  logs.printSchema()

  mei5min(logs)

  // 1. The rate at which data is produced, per 5-minute (tumbling) window
  import org.apache.spark.sql.functions.window
  import spark.implicits._
  def mei5min(logs:DataFrame): Unit ={
    val ret: DataFrame = logs.selectExpr("cast(value as String) as value", "timestamp")
      .groupBy(window($"timestamp","5 minute")).count()

    ret.writeStream.foreachBatch((ds:Dataset[Row], num:Long)=>{
      ds.show(false)
    }).outputMode(OutputMode.Complete())
      .start().awaitTermination()
  }
}


3. The rate at which data was produced over the most recent 5 minutes


// DSL style: same as above, but with a sliding window
def recent5min(logs: DataFrame): Unit = {
    val ret: DataFrame = logs.selectExpr("cast(value as String) as value", "timestamp")
      .groupBy(window($"timestamp", "5 minute", "1 second")).count()

    ret.writeStream.foreachBatch((ds: Dataset[Row], num: Long) => {
      ds.show()
    }).outputMode(OutputMode.Complete())
      .start()
  }
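The only difference from the tumbling window in mei5min is the third argument of window, the slide interval. A quick illustrative side-by-side (my own snippet, using col instead of the $ syntax from the surrounding code):

```scala
import org.apache.spark.sql.functions.{col, window}

// tumbling: non-overlapping 5-minute buckets; each event falls into exactly one window
val tumblingWindow = window(col("timestamp"), "5 minutes")

// sliding: a new 5-minute window starts every second, so one event can fall
// into many overlapping windows ("the most recent 5 minutes")
val slidingWindow  = window(col("timestamp"), "5 minutes", "1 second")
```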


4. Which page has the most visits

// 2. Which page has the most visits
    def pageMax(logs:DataFrame)={
      val ret: DataFrame = logs.selectExpr("cast(value as String) as value")
        .select(split(split($"value"," ").getItem(6),"\\?").getItem(0) as("page"))
        .groupBy($"page").count()
      ret.writeStream.format("console")
        .outputMode(OutputMode.Complete())
        .start()
    }


5. Count of non-200 (error) responses (200 means a normal request)

 errorsByDSL(logs)
  // DSL style
  // 3. Count of non-200 (error) responses (200 means a normal request)
  def errorsByDSL(logs:DataFrame)={
    // step 1: filter out the non-200 records
    val errorResult = logs.selectExpr("cast(value as string) as value")
      .select(split($"value"," ").getItem(8).as("status"))
      .where($"status"=!=200)

    // step 2: group and count to get the number of requests per error status
    val result = errorResult.groupBy($"status").count()

    // step 3: write to the console
    result.writeStream.format("console")
      .outputMode(OutputMode.Complete())
      .start()
  }
  // pure SQL style
  // 3. Count of non-200 (error) responses (200 means a normal request)
  def errorsBySQL(logs:DataFrame)={
    logs.createTempView("logss")
    val result = spark.sql(
      """
        |select a.status,count(1)
        |from
        |(
        |select split(cast(value as string)," ")[8] as status
        |from logss
        |) a
        |where a.status!=200
        |group by a.status
        |
        |""".stripMargin)
    result.writeStream.format("console")
      .outputMode(OutputMode.Complete())
      .start()

  }


6. Traffic per site module, sorted in descending order

def visitDesc(logs:DataFrame)={

      logs.createTempView("logss")

      val frame = spark.sql(
        """
          |select b.page,sum(b.va) as sum1
          |from
          |(
          |select
          |split(split(a.value,' ')[6],'\\?')[0] as page,
          |split(a.value,' ')[9] as va
          |from (
          |select cast(value as string) as value
          |from logss ) a
          |) b
          |group by b.page
          |order by sum1 desc
          |""".stripMargin)

      frame.writeStream.format("console")
        .outputMode(OutputMode.Complete())
        .option("truncate","false")
        .start()
    }


7. Count requests per user agent (the content inside the last pair of double quotes), using a user-defined function

Tip: the usual approach of splitting on spaces and indexing a fixed array position does not work here, so a regex-based UDF is used to extract the quoted user agent.

package day11_weekProject
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import utils.MyApp

object WeekProject extends MyApp {

  val spark = SparkSession.builder()
    .appName("WeekProject")
    .master("local[*]")
    .getOrCreate()

  spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "file:///")

  private val logs: DataFrame = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "qianfeng01:9092")
    .option("subscribe", "nginx-log")
    .load()

  logs.printSchema()

  import org.apache.spark.sql.functions.window
  import spark.implicits._

  // 4. Count requests per user agent (the content inside the last pair of double quotes)
  def userAgentCount(logs:DataFrame): Unit ={
    // 1. register a UDF that extracts the user agent from a log line
    spark.udf.register("get_user_agent",(row:String)=>{
      val userAgentPattern=""""[^"]*"""".r
      val strs = userAgentPattern.findAllIn(row).toList
      if(strs.size>0)
        StringUtils.strip(strs(strs.size-1).split("\\s")(0),"\"")
      else
        "-"
    })
    // 2. group by user agent and count
    logs.selectExpr("cast(value as string) as value").createOrReplaceTempView("logs")
    val ret = spark.sql(
      """
        |select a.user_agent,count(1) from (
        | select get_user_agent(value) as user_agent from logs
        |)a
        |group by a.user_agent
        |
        |""".stripMargin)
    ret.writeStream.format("console").outputMode("complete").start()
  }

  userAgentCount(logs)

  spark.streams.awaitAnyTermination()
}

Tip: the strip method from StringUtils

	Note: it lives in the org.apache.commons.lang3 package

	Purpose: removes the specified characters from the beginning and end of a string
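A tiny illustrative example of its behavior (my own sample values):

```scala
import org.apache.commons.lang3.StringUtils

// strips the given characters from both ends only, not from the middle
println(StringUtils.strip("\"Mozilla/5.0\"", "\""))   // Mozilla/5.0
println(StringUtils.strip("--hello--world--", "-"))   // hello--world
```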

rn=""""[^"]*"""".r
val strs = userAgentPattern.findAllIn(row).toList
if(strs.size>0)
StringUtils.strip(strs(strs.size-1).split("\s")(0),""")
else
“-”
})
//2、统计
logs.selectExpr(“cast(value as string) as value”).createOrReplaceTempView(“logs”)
val ret = spark.sql(
“”"
|select a.user_agent,count(1) from (
| select get_user_agent(value) as user_agent from logs
|)a
|group by a.user_agent
|
|""".stripMargin)
ret.writeStream.format(“console”).outputMode(“complete”).start()
})
}

spark.streams.awaitAnyTermination()
}


### 小技巧:strip方法,来自于StringUtils

​	**注意:org.apache.commons.lang3包下的方法**

​	**功能:去掉在字符串前后的指定字符**

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-NYDyizh6-1630938772425)(C:\Users\等待\AppData\Roaming\Typora\typora-user-images\1624678339840.png)]
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值