DataFrames: Basic Operations, DSL and SQL Styles, UDF Functions, and Data Sources
SparkSQL Queries
Preparing the JSON Data
1 2 3 {"name" :"Michael" } {"name" :"Andy" , "age" :30 } {"name" :"Justin" , "age" :19 }
```scala
val df = spark.read.json("/input/sparksql/json/people.json")
df.show()
df.filter($"age" > 21).show()
df.createOrReplaceTempView("person")
spark.sql("SELECT * FROM person").show()
```
Creating a SparkSQL Program in IDEA
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.1</version>
    <scope>provided</scope>
</dependency>
```
```scala
package com.hph.sql

import org.apache.spark.sql.SparkSession

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.json("F:\\spark\\examples\\src\\main\\resources\\people.json")
    df.show()
    df.filter($"age" > 21).show()

    df.createOrReplaceTempView("persons")
    spark.sql("SELECT * FROM persons where age > 21").show()

    spark.stop()
  }
}
```
SparkSession
In older versions, SparkSQL provided two SQL query entry points: SQLContext, for Spark's own SQL queries, and HiveContext, for querying Hive. SparkSession is the newest SQL entry point in Spark and is essentially a combination of SQLContext and HiveContext, so any API available on SQLContext or HiveContext is also available on SparkSession. SparkSession wraps a SparkContext internally, so the actual computation is still carried out by the SparkContext.
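A minimal sketch of this relationship (the app name, master, and paths here are just placeholders, not values from the examples below): the SparkContext that performs the work can be obtained from the session, and Hive support is enabled on the builder only when it is actually needed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SparkSession example")
  .master("local[*]")
  // .enableHiveSupport()   // uncomment only when Hive classes are on the classpath
  .getOrCreate()

// The SparkContext that actually runs the jobs is wrapped inside the session
val sc = spark.sparkContext
println(sc.appName)
```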
```scala
import org.apache.spark.sql.SparkSession

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = spark.read.json("hdfs://datanode1:9000/input/sparksql/people.json")
    df.show()
    df.filter($"age" > 21).show()

    df.createOrReplaceTempView("persons")
    spark.sql("SELECT * FROM persons where age > 21").show()

    spark.stop()
  }
}
```
Creating DataFrames
SparkSession is the entry point for creating DataFrames and executing SQL. A DataFrame can be created in three ways: by converting an existing RDD, by querying a Hive table, or by loading data through one of Spark's data sources.
```scala
val df = spark.read.json("/input/sparksql/people.json")
df.show()

val peopleRdd = sc.textFile("/input/sparksql/people.txt")
val peopleDF = peopleRdd.map(_.split(","))
  .map(paras => (paras(0), paras(1).trim().toInt))
  .toDF("name", "age")
peopleDF.show()
```
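The snippet above covers the data-source path and the RDD path. For the third path, reading from a Hive table, a minimal sketch (assuming the SparkSession was built with Hive support enabled and a Hive table named people already exists) could look like this:

```scala
// Assumes enableHiveSupport() on the session builder and an existing Hive table "people"
val hiveDF = spark.sql("SELECT name, age FROM people")
hiveDF.show()

// Equivalent shortcut: read the table directly
val hiveDF2 = spark.table("people")
hiveDF2.show()
```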
Common Operations
DSL-Style Syntax
```scala
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()
```
SQL-Style Syntax
```scala
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
spark.newSession().sql("SELECT * FROM global_temp.people").show()
```
A temporary view is scoped to the session that created it; once that session ends, the view is gone. If the view should be visible across the whole application, use a global temporary view instead. Note that a global temporary view must be referenced by its fully qualified name, e.g. global_temp.people.
Creating a Dataset
Note: case classes in Scala 2.10 support at most 22 fields. To map rows with more fields, you can use custom classes instead, e.g. case class Person(name: String, age: Long).
```scala
case class Person(name: String, age: Long)

val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()

import spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect()

val path = "/input/sparksql/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
```
Converting Between RDD, DataFrame, and Dataset
For the details of these conversions, refer to the section on what RDD, DataFrame, and Dataset have in common; a rough sketch is given below.
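A minimal sketch of the conversions, assuming spark.implicits._ is imported and the Person case class defined earlier is in scope:

```scala
import spark.implicits._

// RDD -> DataFrame / Dataset (Person is the case class defined above)
val personRdd = sc.textFile("/input/sparksql/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toLong))
val personDF = personRdd.toDF()      // RDD -> DataFrame
val personDS = personRdd.toDS()      // RDD -> Dataset

// DataFrame <-> Dataset
val dsFromDF = personDF.as[Person]   // DataFrame -> Dataset
val dfFromDS = personDS.toDF()       // Dataset -> DataFrame

// Back to RDD
val rowRdd     = personDF.rdd        // RDD[Row]
val personRdd2 = personDS.rdd        // RDD[Person]
```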
UDFs
```scala
package com.hph.sql

import org.apache.spark.sql.SparkSession

object UDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL UDF example")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.json("hdfs://datanode1:9000/input/sparksql/people.json")
    df.show()

    // Register a UDF named AddOne that adds 1 to its input
    spark.udf.register("AddOne", (age: Int) => age + 1)

    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, AddOne(age), age FROM people").show()

    spark.stop()
  }
}
```
User-Defined Aggregate Functions
Both the strongly typed Dataset and the weakly typed DataFrame provide built-in aggregate functions such as count(), countDistinct(), avg(), max(), and min(). Beyond these, users can define their own aggregate functions.
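Before defining a custom one, the built-ins can be used directly via org.apache.spark.sql.functions. A quick sketch, assuming a SparkSession named spark and an employees.json with name and salary fields, as in the UDAF example below:

```scala
import org.apache.spark.sql.functions._

val employee = spark.read.json("hdfs://datanode1:9000/input/sparksql/employees.json")
employee.select(avg("salary"), max("salary"), min("salary"), countDistinct("name")).show()
```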
Weakly typed user-defined aggregate functions are implemented by extending UserDefinedAggregateFunction.
```scala
package com.hph.sql

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class AverageSal extends UserDefinedAggregateFunction {
  // Input schema: a single salary column
  override def inputSchema: StructType = StructType(StructField("salary", LongType) :: Nil)

  // Aggregation buffer: running sum and count
  override def bufferSchema: StructType =
    StructType(StructField("sum", LongType) :: StructField("count", IntegerType) :: Nil)

  // Result type of the aggregation
  override def dataType: DataType = DoubleType

  // Always returns the same result for the same input
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0
  }

  // Fold one input row into the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getInt(1) + 1
  }

  // Merge two partial buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }

  // Final result: average salary
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getInt(1)
  }
}

object AverageSal {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL UDF example")
      .master("local[*]")
      .getOrCreate()

    val employee = spark.read.json("hdfs://datanode1:9000/input/sparksql/employees.json")
    employee.createOrReplaceTempView("employee")

    spark.udf.register("average", new AverageSal)
    spark.sql("select average(salary) from employee").show()

    spark.stop()
  }
}
```
Strongly Typed Aggregate Functions
```scala
package com.hph.sql

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Aver(var sum: Long, var count: Int)

class Average extends Aggregator[Employee, Aver, Double] {
  // Zero value for the aggregation buffer
  override def zero: Aver = Aver(0L, 0)

  // Fold one input record into the buffer
  override def reduce(b: Aver, a: Employee): Aver = {
    b.sum = b.sum + a.salary
    b.count = b.count + 1
    b
  }

  // Merge two partial buffers
  override def merge(b1: Aver, b2: Aver): Aver = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }

  // Compute the final result
  override def finish(reduction: Aver): Double = {
    reduction.sum.toDouble / reduction.count
  }

  override def bufferEncoder: Encoder[Aver] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object Average {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL Strong Type UDF example")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val employee = spark.read.json("hdfs://datanode1:9000/input/sparksql/employees.json").as[Employee]
    val aver = new Average().toColumn.name("average")
    employee.select(aver).show()

    spark.close()
  }
}
```