Spark sparkSQL介绍 及 自定义 RDD集合数据转换DataFrame

SparkSQL是什么?

  SparkSQL 是 Spark 的一个模块
  相当于Hive框架(提供SQL,使用户不需要编写MapReduce程序),功能远远大于Hive
  Hive 2.x版本,底层推荐使用Spark

(1) 功能,前世今生

  取代Hive框架,提供SQL语句,将SQL语句转成SparkCore程序,提交运行

前世今生:
  hive ->shark(spark on hive )->sparkSQL(重写依赖Hive的代码)
  shark0.9版本,shark框架太依赖于Hive框架,就是将Hive框架源码拿过去修改了部分代码(将以前SQL转成MapReduce代码改成 将SQL转成SparkCore的代码)

(1)提供2中分析语句

  (a)SQL
    99%与Hive一样,将数据放到Table,使用SQL语句查询分析
  (b)DSL (domain specific)
    调用DateSet集合类中的函数进行分析(类似RDD)

代码实现:RDD集合数据转换DataFrame
  object SparkSQLAccessLog {

    def main(args: Array[String]): Unit = {

      // The entry point into all functionality in Spark is the SparkSession class.
      val spark: SparkSession = SparkSession
        .builder()
        .appName("SQLLogAnalyzerSpark Application")
        .master("local[2]")
        .getOrCreate()

      // 设置日志级别
      spark.sparkContext.setLogLevel("WARN")

      // For implicit conversions like converting RDDs to DataFrames
      import spark.implicits._

      // 读取 要处理的数据
      val accessLogRDD = spark.read.textFile("file:///F:/tmpData/access_log")
        .rdd
        .filter(ApacheAccessLog.isValidateLogLine)
        .map(line => ApacheAccessLog.parseLogLine(line))

      val accessLogsDS = accessLogRDD.toDS()
      accessLogsDS.printSchema()
      accessLogsDS.show(3,false)

      accessLogsDS.persist(StorageLevel.MEMORY_AND_DISK)

      accessLogsDS.createOrReplaceTempView("view_tmp_access_log")

      spark.sql("select count(1) as cnt from view_tmp_access_log").show()

//SQL
      spark.sql(
        """
          |select
          |   ipAddress,count(1) as total_ipAddress
          |from
          |   view_tmp_access_log
          |group by
          |    ipAddress
          |having total_ipAddress>20
        """.stripMargin).show()

//DSL
      accessLogsDS
          .select($"ipAddress")
          .groupBy($"ipAddress")
          .count()
          .filter("count>20")
          .show()

 Thread.sleep(1000000)
 spark.stop()
 
	 }
}
运行结果
root
 |-- ipAddress: string (nullable = true)
 |-- clientIdented: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- dateTime: string (nullable = true)
 |-- method: string (nullable = true)
 |-- endpoint: string (nullable = true)
 |-- protocol: string (nullable = true)
 |-- responseCode: integer (nullable = false)
 |-- contentSize: long (nullable = false)

+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
|ipAddress   |clientIdented|userId|dateTime                  |method|endpoint                                                                         |protocol|responseCode|contentSize|
+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
|64.242.88.10|-            |-     |07/Mar/2004:16:05:49 -0800|GET   |/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables|HTTP/1.1|401         |12846      |
|64.242.88.10|-            |-     |07/Mar/2004:16:06:51 -0800|GET   |/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2                         |HTTP/1.1|200         |4523       |
|64.242.88.10|-            |-     |07/Mar/2004:16:10:02 -0800|GET   |/mailman/listinfo/hsdivision                                                     |HTTP/1.1|200         |6291       |
+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
only showing top 3 rows

//SQL和 DSL 输出结果一致
+--------------------+---------------+
|           ipAddress|total_ipAddress|
+--------------------+---------------+
|cr020r01-3.sac.ov...|             44|
|ip68-228-43-49.tc...|             22|
|h24-71-236-129.ca...|             36|
|ts04-ip92.hevanet...|             28|
|          10.0.0.153|            188|
|market-mail.pandu...|             29|
|h24-70-69-74.ca.s...|             32|
|        64.242.88.10|            452|
+--------------------+---------------+

+--------------------+-----+
|           ipAddress|count|
+--------------------+-----+
|cr020r01-3.sac.ov...|   44|
|ip68-228-43-49.tc...|   22|
|h24-71-236-129.ca...|   36|
|ts04-ip92.hevanet...|   28|
|          10.0.0.153|  188|
|market-mail.pandu...|   29|
|h24-70-69-74.ca.s...|   32|
|        64.242.88.10|  452|
+--------------------+-----+

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值