Spark Streaming: A Dataset Example

  Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams.

  Submit the job, then upload the sample file to the HDFS directory that the job monitors:

  bin/spark-submit --class Streaming /home/wx/Stream.jar
  hadoop fs -put /home/wx/123.txt /user/wx/

Contents of 123.txt:

NOTICE:07-26 logId[0072]
NOTICE:07-26 logId[0073]
NOTICE:07-26 logId[0074]
NOTICE:07-26 logId[0075]
NOTICE:07-26 logId[0076]

 

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
import org.apache.spark.sql.SparkSession

object Streaming {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("RegexpExtract")
    val ssc = new StreamingContext(conf, Seconds(1))

    println("hello world")

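    // Monitor the HDFS directory; only files created after the stream starts are processed
    val lines = ssc.textFileStream("hdfs://name-ha/user/wx/")

    // textFileStream already yields one record per line, so this split is only a safeguard
    val ds = lines.flatMap(_.split("\n"))

    // Print the first records of each batch
    ds.print()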

    ds.foreachRDD { rdd =>

      // Get the singleton instance of SparkSession
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Convert RDD[String] to a single-column DataFrame
      val linesDataFrame = rdd.toDF("str_col")

      // Register a temporary view so the batch can be queried with SQL
      linesDataFrame.createOrReplaceTempView("df")

      // Extract the notice prefix and the logId token with regexp_extract and print the result.
      // Group index 0 returns the whole match; index 1 would return only the text inside logId[...].
      val extractedDataFrame =
        spark.sql(raw"""
          select str_col,
          regexp_extract(str_col,"NOTICE:\\d{2}",0) notice,
          regexp_extract(str_col,"logId\\[(.*?)\\]",0) logId
          from df""")
      extractedDataFrame.show(false)
    }

    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
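
The two regexp_extract calls in the job above can be tried out without a cluster. The following is a minimal plain-Scala sketch of what they compute on one sample line. The object RegexpExtractSketch and its helper extract are hypothetical names introduced here for illustration; they only mimic the behaviour of the SQL function (requested group of the first match, empty string when nothing matches) and are not part of Spark.

```scala
import scala.util.matching.Regex

// Hypothetical, Spark-free stand-in for the SQL function regexp_extract(str, pattern, idx).
object RegexpExtractSketch {
  // Same patterns as in the SQL query, written without the extra SQL-level escaping.
  val noticePattern: Regex = """NOTICE:\d{2}""".r
  val logIdPattern: Regex = """logId\[(.*?)\]""".r

  // Returns group `idx` of the first match, or "" when the pattern does not match.
  def extract(input: String, pattern: Regex, idx: Int): String =
    pattern
      .findFirstMatchIn(input)
      .map(m => Option(m.group(idx)).getOrElse(""))
      .getOrElse("")

  def main(args: Array[String]): Unit = {
    val line = "NOTICE:07-26 logId[0072]"
    println(extract(line, noticePattern, 0)) // NOTICE:07
    println(extract(line, logIdPattern, 0))  // logId[0072]
    println(extract(line, logIdPattern, 1))  // 0072
  }
}
```

Group index 0 is what the query uses, which is why the logId column in the output contains the full logId[0072] token. Switching the index to 1 would keep only the captured digits.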

 

Output

hello world
-------------------------------------------
Time: 1501501752000 ms
-------------------------------------------

NOTICE:07-26 logId[0072]
NOTICE:07-26 logId[0073]
NOTICE:07-26 logId[0074]
NOTICE:07-26 logId[0075]
NOTICE:07-26 logId[0076]

+------------------------+---------+-----------+
|str_col                 |notice   |logId      |
+------------------------+---------+-----------+
|NOTICE:07-26 logId[0072]|NOTICE:07|logId[0072]|
|NOTICE:07-26 logId[0073]|NOTICE:07|logId[0073]|
|NOTICE:07-26 logId[0074]|NOTICE:07|logId[0074]|
|NOTICE:07-26 logId[0075]|NOTICE:07|logId[0075]|
|NOTICE:07-26 logId[0076]|NOTICE:07|logId[0076]|
+------------------------+---------+-----------+

-------------------------------------------
Time: 1501501770000 ms
-------------------------------------------

 

Reposted from: https://www.cnblogs.com/wwxbi/p/7265210.html
