SparkSQL是什么?
SparkSQL 是 Spark 的一个模块
相当于Hive框架(提供SQL,使用户不需要编写MapReduce程序),功能远远大于Hive
Hive 2.x版本,底层推荐使用Spark
(1) 功能,前世今生
取代Hive框架,提供SQL语句,将SQL语句转成SparkCore程序,提交运行
前世今生:
hive ->shark(spark on hive )->sparkSQL(重写依赖Hive的代码)
shark0.9版本,shark框架太依赖于Hive框架,就是将Hive框架源码拿过去修改了部分代码(将以前SQL转成MapReduce代码改成 将SQL转成SparkCore的代码)
(1)提供2中分析语句
(a)SQL
99%与Hive一样,将数据放到Table,使用SQL语句查询分析
(b)DSL (domain specific)
调用DateSet集合类中的函数进行分析(类似RDD)
代码实现:RDD集合数据转换DataFrame
object SparkSQLAccessLog {
def main(args: Array[String]): Unit = {
// The entry point into all functionality in Spark is the SparkSession class.
val spark: SparkSession = SparkSession
.builder()
.appName("SQLLogAnalyzerSpark Application")
.master("local[2]")
.getOrCreate()
// 设置日志级别
spark.sparkContext.setLogLevel("WARN")
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
// 读取 要处理的数据
val accessLogRDD = spark.read.textFile("file:///F:/tmpData/access_log")
.rdd
.filter(ApacheAccessLog.isValidateLogLine)
.map(line => ApacheAccessLog.parseLogLine(line))
val accessLogsDS = accessLogRDD.toDS()
accessLogsDS.printSchema()
accessLogsDS.show(3,false)
accessLogsDS.persist(StorageLevel.MEMORY_AND_DISK)
accessLogsDS.createOrReplaceTempView("view_tmp_access_log")
spark.sql("select count(1) as cnt from view_tmp_access_log").show()
//SQL
spark.sql(
"""
|select
| ipAddress,count(1) as total_ipAddress
|from
| view_tmp_access_log
|group by
| ipAddress
|having total_ipAddress>20
""".stripMargin).show()
//DSL
accessLogsDS
.select($"ipAddress")
.groupBy($"ipAddress")
.count()
.filter("count>20")
.show()
Thread.sleep(1000000)
spark.stop()
}
}
运行结果
root
|-- ipAddress: string (nullable = true)
|-- clientIdented: string (nullable = true)
|-- userId: string (nullable = true)
|-- dateTime: string (nullable = true)
|-- method: string (nullable = true)
|-- endpoint: string (nullable = true)
|-- protocol: string (nullable = true)
|-- responseCode: integer (nullable = false)
|-- contentSize: long (nullable = false)
+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
|ipAddress |clientIdented|userId|dateTime |method|endpoint |protocol|responseCode|contentSize|
+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
|64.242.88.10|- |- |07/Mar/2004:16:05:49 -0800|GET |/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables|HTTP/1.1|401 |12846 |
|64.242.88.10|- |- |07/Mar/2004:16:06:51 -0800|GET |/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 |HTTP/1.1|200 |4523 |
|64.242.88.10|- |- |07/Mar/2004:16:10:02 -0800|GET |/mailman/listinfo/hsdivision |HTTP/1.1|200 |6291 |
+------------+-------------+------+--------------------------+------+---------------------------------------------------------------------------------+--------+------------+-----------+
only showing top 3 rows
//SQL和 DSL 输出结果一致
+--------------------+---------------+
| ipAddress|total_ipAddress|
+--------------------+---------------+
|cr020r01-3.sac.ov...| 44|
|ip68-228-43-49.tc...| 22|
|h24-71-236-129.ca...| 36|
|ts04-ip92.hevanet...| 28|
| 10.0.0.153| 188|
|market-mail.pandu...| 29|
|h24-70-69-74.ca.s...| 32|
| 64.242.88.10| 452|
+--------------------+---------------+
+--------------------+-----+
| ipAddress|count|
+--------------------+-----+
|cr020r01-3.sac.ov...| 44|
|ip68-228-43-49.tc...| 22|
|h24-71-236-129.ca...| 36|
|ts04-ip92.hevanet...| 28|
| 10.0.0.153| 188|
|market-mail.pandu...| 29|
|h24-70-69-74.ca.s...| 32|
| 64.242.88.10| 452|
+--------------------+-----+