Spark supports two ways to convert RDDs into SchemaRDDs (renamed DataFrame as of Spark 1.3): the first uses reflection to infer the schema of an RDD that contains a specific type of object; the second uses a programmatic interface that lets you construct a schema and then apply it to an existing RDD.
1. Inferring the Schema Using Reflection
The Scala interface of Spark SQL supports automatically converting an RDD that contains case classes into a SchemaRDD. The case class defines the schema of the table: the names of its parameters are read via reflection and become the names of the columns. Such an RDD can be implicitly converted to a SchemaRDD and then registered as a table.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
/**
 * Infers the schema from the Person case class via reflection,
 * converts the RDD to a DataFrame, and queries it with SQL.
 */
object UseCaseClass {
  case class Person(firstName: String, lastName: String, age: Int)

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext (note: in spark-shell, the pre-built sqlContext is a HiveContext, a superset of SQLContext)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = sc.textFile(args(0))
    val personRDD = data.map(_.split(",")).map(person => Person(person(0), person(1), person(2).toInt))
    // Convert the personRDD into the personDF DataFrame
    val personDF = personRDD.toDF()
    // Register the personDF as a temporary table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
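To try the reflection-based conversion interactively, the following sketch can be pasted into a Spark 1.x spark-shell, where sc and sqlContext are pre-defined; the sample rows are invented for illustration:

case class Person(firstName: String, lastName: String, age: Int)
import sqlContext.implicits._
// Build a small in-memory dataset instead of reading a file
val lines = sc.parallelize(Seq("Barack,Obama,53", "Michelle,Obama,51"))
val df = lines.map(_.split(",")).map(p => Person(p(0), p(1), p(2).toInt)).toDF()
df.registerTempTable("person")
sqlContext.sql("SELECT * FROM person WHERE age < 52").show()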
2. Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, when the structure of the records is encoded in a string), a case class will not work. In that situation, a SchemaRDD can be created in three steps:
- Create an RDD of Row objects from the original RDD;
- Create a schema, represented by a StructType, that matches the structure of the Row objects in the RDD created in the previous step;
- Apply the schema to the RDD of Rows to obtain a DataFrame.
StructType(fields: Array[StructField])
StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
- name: the name of the field;
- dataType: the data type of the field; the built-in primitive types include:
IntegerType, FloatType, BooleanType, ShortType, LongType, ByteType, DoubleType, StringType
- nullable: whether the field can be null; it defaults to true;
- metadata: the metadata of the field; Metadata wraps a Map[String, Any] and can store metadata of any type (a short example follows this list).
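As a rough illustration of these parameters, the sketch below (not part of the original example) builds a single StructField whose metadata is constructed with MetadataBuilder, the usual way to create a Metadata instance:

import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StructField}

// Hypothetical field: attach a human-readable description as metadata
val ageMeta: Metadata = new MetadataBuilder()
  .putString("description", "age in years")
  .build()
val ageField = StructField("age", IntegerType, nullable = true, metadata = ageMeta)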
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}
/**
 * Builds the schema programmatically with StructType/StructField,
 * applies it to an RDD of Rows, and queries the result with SQL.
 */
object SpecifySchema {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext (note: in spark-shell, the pre-built sqlContext is a HiveContext, a superset of SQLContext)
    val sqlContext = new SQLContext(sc)
    val data = sc.textFile(args(0))
    // Convert the RDD of Array[String] to an RDD of Row objects
    val personRow = data.map(_.split(",")).map(person => Row(person(0), person(1), person(2).toInt))
    /**
     * Create the schema using the StructType and StructField objects.
     * A StructField takes the field name, data type, and nullability.
     */
    val schema = StructType(
      Array(
        StructField("firstName", StringType, true),
        StructField("lastName", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to create the personDF DataFrame
    val personDF = sqlContext.createDataFrame(personRow, schema)
    // Register the personDF as a temporary table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
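For a quick interactive check of the programmatic path, this sketch (again for a Spark 1.x spark-shell, with invented data) verifies the applied schema and shows the DataFrame DSL equivalent of the SQL query:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rows = sc.parallelize(Seq(Row("Barack", "Obama", 53), Row("Michelle", "Obama", 51)))
val schema = StructType(Array(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", IntegerType, true)))
val df = sqlContext.createDataFrame(rows, schema)
df.printSchema()                 // confirm the schema was applied
df.where(df("age") < 52).show() // DSL equivalent of SELECT * FROM person WHERE age < 52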