SparkSQL: Converting an RDD to a DataFrame

Comparing the two ways SparkSQL turns an RDD into a DataFrame

1. Inferring the schema with reflection; see the official documentation excerpt below for details.

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.

Key lines:
import sqlContext.implicits._
case class Person(name: String, age: Int)

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
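
The excerpt above also notes that case classes can be nested or contain complex types such as Sequences or Arrays. A minimal sketch of that, assuming the same sc, sqlContext and implicits import as in the official example (the Employee and Address classes are made up for illustration):

case class Address(city: String, zip: String)
case class Employee(name: String, skills: Seq[String], address: Address)

val employees = sc.parallelize(Seq(
  Employee("Alice", Seq("scala", "spark"), Address("Beijing", "100000")),
  Employee("Bob", Seq("sql"), Address("Shanghai", "200000"))
)).toDF()

// skills becomes array<string>, address becomes struct<city:string,zip:string>
employees.printSchema()
employees.registerTempTable("employees")
// nested struct fields are addressed with dot notation in SQL
sqlContext.sql("SELECT name, address.city FROM employees").show()
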
2. Specifying the schema programmatically; see the official documentation excerpt below for details.

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

  1. Create an RDD of Rows from the original RDD;
  2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  3. Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.

Key lines:
val schemaString = "name age"
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row;

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType};

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)
Summary of the comparison

1. The case class approach is more readable: the schema is visible directly from the class definition.
2. A case class (in Scala 2.10) is capped at 22 fields, so this approach cannot be used when a table has many columns; see the sketch after this list.
3. The programmatic approach is more flexible and better suited to day-to-day development.
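
To make point 2 concrete, here is a minimal sketch (the field names are invented) of why the programmatic approach is not capped at 22 columns: the StructType is built from data at runtime, so the column count is unbounded.

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// 30 columns, more than a Scala 2.10 case class could carry
val manyColumns = (1 to 30).map(i => s"col$i")
val wideSchema = StructType(manyColumns.map(name => StructField(name, StringType, nullable = true)))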

Code implementation

Data source

Michael,20
Andy,30
Justin,19
1. Reflection
package Sql

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2016/12/6.
  * textFile -> table
  *   Michael,20
  *   Andy,30
  *   Justin,19
  * import sqlContext.implicits._ must be in scope before .toDF can be called.
  * Define the User case class, then
  * textFile -> map(p => User(p(0), p(1))) -> DataFrame -> table,
  * after which sqlContext.sql("<SQL statement>") can be used.
  */
object Sqltest3 {
  case class User(name:String,age:Int)
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf() // create the SparkConf object
    // set the application name shown in the monitoring UI
    conf.setAppName("SparkSql App!")
    // run locally; no Spark cluster installation required
    conf.setMaster("local[2]")
    // create the SparkContext, passing in the SparkConf to configure the run
    val ssc = new SparkContext(conf)
    ssc.setLogLevel("ERROR")
    // create the SQLContext
    val sqlContext = new SQLContext(ssc)

    // load the data
    val dr = ssc.textFile("d://sparktest/people2.txt")
    import sqlContext.implicits._
    val df = dr.map(_.split(","))
               .map(p => User(p(0), p(1).trim.toInt)).toDF()
    // register as a temporary table
    df.registerTempTable("user")
    // print df's schema
    df.printSchema()
    // show all names
    df.select("name").show()
    df.filter(df("age") > 21).show()
    df.groupBy("age").count().show()
    val teenagers = sqlContext.sql("SELECT name, age FROM user WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

    println("-------")
    ssc.stop()
  }
}
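
One detail worth calling out in the reflection version above: the User case class is declared at the object level, outside main. If the case class were defined inside main, the .toDF() call would typically fail to compile with a "No TypeTag available" error, so keeping it at the top level is the safer pattern.
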
2. Programmatic approach
package Sql

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2016/12/6.
  * textFile -> table
  *   Michael,20
  *   Andy,30
  *   Justin,19
  * Define the schema (a StructType of Rows), then
  * textFile -> map(p => Row(p(0), p(1))) -> apply the schema -> DataFrame -> table,
  * after which sqlContext.sql("<SQL statement>") can be used.
  */
object Sqltest2 {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf() // create the SparkConf object
    conf.set("spark.executor.memory", "1g")
    // set the application name shown in the monitoring UI
    conf.setAppName("Wow,TopNGroup App!")
    // run locally; no Spark cluster installation required
    conf.setMaster("local[2]")
    // create the SparkContext, passing in the SparkConf to configure the run

    val ssc = new SparkContext(conf)
    ssc.setLogLevel("ERROR")
    // create the SQLContext
    val sqlContext = new SQLContext(ssc)
    // load the data
    val dr = ssc.textFile("d://sparktest/people2.txt")
    val schemaString = "name age"
    /*  val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", StringType, true)
    ))*/
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val rowRDD = dr.map(_.split(",")).map(p => Row(p(0), p(1)))
    val df = sqlContext.createDataFrame(rowRDD, schema)

    // register as a temporary table
    df.registerTempTable("user")
    // print df's schema
    df.printSchema()
    // show all names
    df.select("name").show()
    df.filter(df("age") > 21).show()
    df.groupBy("age").count().show()
    val teenagers = sqlContext.sql("SELECT name, age FROM user WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
    println("-------")
    ssc.stop()
  }
}
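
In this programmatic version the age column stays a StringType, so df("age") > 21 relies on Spark's implicit cast during the comparison. A small variant (just a sketch, reusing dr and sqlContext from the code above) that types age as an integer up front:

import org.apache.spark.sql.types.IntegerType

val typedSchema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
// parse age to Int while building the Rows so it matches IntegerType
val typedRowRDD = dr.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val typedDf = sqlContext.createDataFrame(typedRowRDD, typedSchema)
typedDf.filter(typedDf("age") > 21).show() // numeric comparison, no string cast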