Lesson 70: Spark SQL Built-in Functions Demystified, with Hands-on Practice (Study Notes)

Topics in this lesson:

1. Spark SQL built-in functions explained

2. Spark SQL built-in functions in practice

 

Spark SQL's DataFrame API introduces a large number of built-in functions. These built-in functions generally support CG (code generation), so they are highly optimized both at compile time and at execution time.
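As a minimal illustration (not from the lesson itself) of what "built-in function" means here: the functions under org.apache.spark.sql.functions build Column expressions, which Catalyst can optimize and code-generate, instead of opaque user lambdas. The DataFrame df and the column names "url" and "amount" below are hypothetical:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Both of these are Column expressions, not computed values; they are only
// evaluated when used inside a DataFrame operation such as select or agg.
val upperUrl: Column = upper(col("url"))      // a string built-in function
val totalAmount: Column = sum(col("amount"))  // an aggregate built-in function

// For example: df.select(upperUrl) or df.groupBy("time").agg(totalAmount)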

Question: Is using Spark SQL to operate on Hive the same thing as Hive on Spark?

=> No. When Spark SQL operates on Hive, Hive is only used as the source of the data warehouse, and the compute engine is Spark SQL itself. Hive on Spark is a Hive subproject whose core idea is to swap Hive's execution engine for Spark. As is well known, Hive's current compute engine is MapReduce, and because of its poor performance the Hive community wants to replace it.

Using Spark SQL to process data stored in Hive is called Spark on Hive, whereas Hive on Spark still keeps Hive at its core and merely replaces the compute engine, MapReduce, with Spark.
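A minimal sketch of the Spark on Hive style (assuming a Hive-enabled Spark 1.x build and an existing Hive table, here hypothetically named src): Spark SQL only reads Hive's metadata and table data, while the query itself is planned and executed by Spark's engine, not by MapReduce.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("SparkOnHiveSketch"))
val hiveContext = new HiveContext(sc)

// "src" is a hypothetical Hive table used only for illustration; the SQL is
// parsed and optimized by Spark SQL and executed on Spark.
hiveContext.sql("SELECT key, value FROM src LIMIT 10").show()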

The DataFrame API docs on the Spark website:

http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame

class DataFrame extends Queryable with Serializable

Experimental

A distributed collection of data organized into named columns.

DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.

val people = sqlContext.read.parquet("...")  // in Scala

DataFrame people = sqlContext.read().parquet("...")  // in Java

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use apply method in Scala and col in Java.

val ageCol = people("age")  // in Scala

Column ageCol = people.col("age")  // in Java

Note that the Column type can also be manipulated through its various functions.

// The following creates a new column that increases everybody's age by 10.

people("age") + 10  // in Scala

people.col("age").plus(10);  // in Java

A more concrete example in Scala:

// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

 

people.filter("age > 30")

  .join(department, people("deptId") === department("id"))

  .groupBy(department("name"), "gender")

  .agg(avg(people("salary")), max(people("age")))

and in Java:

// To create DataFrame using SQLContext

DataFrame people = sqlContext.read().parquet("...");

DataFrame department = sqlContext.read().parquet("...");

 

people.filter(people.col("age").gt(30))

  .join(department, people.col("deptId").equalTo(department.col("id")))

  .groupBy(department.col("name"), "gender")

  .agg(avg(people.col("salary")), max(people.col("age")));

 

In the examples above, join, groupBy, and agg are all Spark SQL built-in functions.

Starting with Spark 1.5.x, many built-in functions were introduced; by an incomplete count, there are more than one hundred of them.
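As a quick illustrative sketch before the full example (not part of the lesson's code; the toy DataFrame and the column names "name" and "scores" are made up), here are a few built-in functions from different categories applied to a small DataFrame:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc = new SparkContext(new SparkConf().setAppName("BuiltInFunctionsDemo").setMaster("local"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A toy DataFrame: each row has a name and a list of scores.
val df = sc.parallelize(Seq(
  ("spark", Seq(90, 70)),
  ("hive",  Seq(60, 80))
)).toDF("name", "scores")

df.select(
  upper($"name"),        // string function
  sort_array($"scores"), // collection function
  size($"scores")        // collection function
).show()

// explode turns each array element into its own row; avg and max are aggregate functions.
df.select(explode($"scores").as("score"))
  .agg(avg($"score"), max($"score"))
  .show()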

Next, let's develop a hands-on example of an aggregation operation:

package com.dt.spark.sql

 

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

import org.apache.spark.sql.{Row, SQLContext}

import org.apache.spark.sql.hive.HiveContext

import org.apache.spark.{SparkContext, SparkConf}

import org.apache.spark.sql.functions._

 

/**

  * A Spark program developed in Scala (runnable locally or on a cluster) that analyzes data with Spark SQL built-in functions

  * @author DT大数据梦工厂

  * Sina Weibo: http://weibo.com/ilovepains/

  * Created by hp on 2016/3/28.

  *

  * Analyze data using the built-in functions in Spark SQL. Unlike the rest of the Spark SQL API, the built-in functions on a
  * DataFrame return a Column object, and a DataFrame is by nature "A distributed collection of data organized into named columns.",
  * which lays a solid foundation for complex data analysis and is extremely convenient. For example, while operating on a
  * DataFrame we can call built-in functions at any point to carry out whatever processing the business requires. When building
  * complex business logic this greatly cuts down unnecessary time spent (the code is essentially a direct mapping of the actual model)
  * and lets us focus on the data analysis itself, which is very valuable for engineers' productivity.
  * Starting with Spark 1.5.x, a large number of built-in functions are provided, for example agg:

  * def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame = {

  *  groupBy().agg(aggExpr, aggExprs : _*)

  *}

  * There are also max, mean, min, sum, avg, explode, size, sort_array, day, to_date, abs, acos, asin, atan, ...
  * Overall, the built-in functions fall into the following basic categories:
  * 1. Aggregate functions, e.g. countDistinct, sumDistinct;
  * 2. Collection functions, e.g. sort_array, explode;
  * 3. Date and time functions, e.g. hour, quarter, next_day;
  * 4. Math functions, e.g. asin, atan, sqrt, tan, round;
  * 5. Window functions, e.g. rowNumber;
  * 6. String functions, e.g. concat, format_number, regexp_extract;
  * 7. Other functions, e.g. isNaN, sha1, randn, callUDF.

  *

  */

object SparkSQLAgg {

 

  def main (args: Array[String]) {

    /**
      * Step 1: Create the Spark configuration object SparkConf and set the runtime configuration of the Spark program,
      * for example using setMaster to set the URL of the Master of the Spark cluster the program connects to.
      * If it is set to "local", the Spark program runs locally, which is especially suitable for beginners
      * whose machines have very limited resources (e.g. only 1 GB of memory).
      */

    val conf = new SparkConf() // Create the SparkConf object
    conf.setAppName("SparkSQLInnerFunctions") // Set the application name; it is shown in the monitoring UI while the program runs
    //    conf.setMaster("spark://Master:7077") // With this setting the program would run on the Spark cluster

    conf.setMaster("local")

    /**
      * Step 2: Create the SparkContext object.
      * SparkContext is the single entry point for all Spark functionality; whether you use Scala, Java, Python or R,
      * you must have a SparkContext.
      * Its core role is to initialize the core components needed to run the Spark application, including DAGScheduler,
      * TaskScheduler and SchedulerBackend, and it is also responsible for registering the program with the Master.
      * SparkContext is the single most important object in the whole Spark application.
      */

    val sc = new SparkContext(conf) // Create the SparkContext object, passing in the SparkConf instance to customize Spark's runtime parameters and configuration

 

    val sqlContext = new SQLContext(sc)   // Build the SQL context

 

 

    // To use Spark SQL's built-in functions, the implicit conversions under SQLContext must be imported

    import sqlContext.implicits._

 

 

    /**
      * Step 3: Mock up e-commerce access data (real data would be far more complex than this), then turn it into an RDD.
      */

 

    val userData = Array(

      "2016-3-27,001,http://spark.apache.org/,1000",

      "2016-3-27,001,http://hadoop.apache.org/,1001",

      "2016-3-27,002,http://fink.apache.org/,1002",

      "2016-3-28,003,http://kafka.apache.org/,1020",

      "2016-3-28,004,http://spark.apache.org/,1010",

      "2016-3-28,002,http://hive.apache.org/,1200",

      "2016-3-28,001,http://parquet.apache.org/,1500",

      "2016-3-28,001,http://spark.apache.org/,1800"

    )

 

 

    val userDataRDD = sc.parallelize(userData)  // Create the RDD (a distributed collection) from the local array

 

 

    /**
      * Step 4: Preprocess the data according to the business needs and produce a DataFrame.
      * To convert an RDD into a DataFrame, the RDD's element type must first be changed to Row,
      * and the metadata describing the DataFrame's columns must be provided at the same time.
      */

    val userDataRDDRow = userDataRDD.map(row => {
      val splited = row.split(",")
      Row(splited(0), splited(1).toInt, splited(2), splited(3).toInt)
    })

    val structTypes = StructType(Array(

      StructField("time", StringType, true),

      StructField("id", IntegerType, true),

      StructField("url", StringType, true),

      StructField("amount", IntegerType, true)

    ))

 

    val userDataDF = sqlContext.createDataFrame(userDataRDDRow,structTypes)

 

 

    /**
      * Step 5: Operate on the DataFrame using the built-in functions provided by Spark SQL.
      * Note in particular that the built-in functions produce Column objects, for which code generation (CG) is performed automatically.
      */

 

    // Number of distinct users per day (countDistinct is a built-in aggregate function)
    userDataDF.groupBy("time").agg('time, countDistinct('id))
      .map(row => Row(row(1), row(2))).collect.foreach(println)

    // Total purchase amount per day (sum is a built-in aggregate function)
    userDataDF.groupBy("time").agg('time, sum('amount)).show()

 

  }

}
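For reference (the results themselves are not shown in the log excerpt below), given the mock data above the first aggregation should report 2 distinct users for 2016-3-27 (ids 001 and 002) and 4 for 2016-3-28 (ids 001, 002, 003 and 004), and the second should report amount sums of 1000 + 1001 + 1002 = 3003 for 2016-3-27 and 1020 + 1010 + 1200 + 1500 + 1800 = 6530 for 2016-3-28.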

 

Running it in Eclipse produces output like the following (truncated):

16/04/10 23:54:04 INFO TaskSetManager: Finished task 58.0 in stage 6.0 (TID 461) in 18 ms on localhost (59/199)
16/04/10 23:54:04 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/04/10 23:54:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/10 23:54:04 INFO Executor: Finished task 59.0 in stage 6.0 (TID 462). 1652 bytes result sent to driver
16/04/10 23:54:04 INFO TaskSetManager: Starting task 60.0 in stage 6.0 (TID 463, localhost, partition 61,NODE_LOCAL, 1999 bytes)
16/04/10 23:54:04 INFO Executor: Running task 60.0 in stage 6.0 (TID 463)
... (the remaining tasks of stage 6.0 follow the same pattern; log truncated)
