Contents
- Inferring the Schema Using Reflection
- Specifying the Schema Programmatically (StructType)

Inferring the Schema Using Reflection
In the Scala API, Spark SQL supports automatically converting an RDD of case class instances into a DataFrame. The case class defines the schema of the table: its field names, obtained via reflection, become the column names.
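Outside spark-shell (which pre-creates spark and imports the implicits automatically), the same conversion in a standalone application needs an explicit SparkSession and import spark.implicits._. A minimal sketch, with the application name and master setting as illustrative assumptions; the shell session below does the same thing interactively:

import org.apache.spark.sql.SparkSession

object ReflectionSchemaExample {
  // The case class defines the table schema; its fields become the column names
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReflectionSchema").master("local[*]").getOrCreate()
    // Needed for the RDD-to-DataFrame conversion (toDF) and for Encoders
    import spark.implicits._

    val df = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attrs => Person(attrs(0).trim, attrs(1).trim.toInt))
      .toDF()

    df.show()
    spark.stop()
  }
}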
scala> case class Person(name:String, age:Int)
defined class Person
scala> val df = spark.sparkContext.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0).trim(),attributes(1).trim().toInt)).toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
scala> df.createOrReplaceTempView("persons")
scala> val teen = spark.sql("select * from persons where age < 30")
teen: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> teen.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Justin| 19|
+-------+---+
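The same query can also be written with the DataFrame API instead of registering a temporary view; a short sketch (the $"age" column syntax comes from spark.implicits._, which spark-shell imports automatically, and teenViaApi is just an illustrative name):

// equivalent to: select * from persons where age < 30
val teenViaApi = df.filter($"age" < 30)   // or: df.filter(df("age") < 30)
teenViaApi.show()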
// Fields of a result Row can be accessed by index
scala> teen.map(row => "Name : " + row(0)).show
+--------------+
| value|
+--------------+
|Name : Michael|
| Name : Justin|
+--------------+
// Fields can also be accessed by column name
scala> teen.map(row => "Name : " + row.getAs[String]("name")).show
+--------------+
| value|
+--------------+
|Name : Michael|
| Name : Justin|
+--------------+
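Besides access by index and getAs, Row also provides typed getters and fieldIndex for resolving a column name to its position; a brief sketch:

// typed getter by position
teen.map(row => "Name : " + row.getString(0)).show
// resolve the index from the column name, then use a typed getter
teen.map(row => "Name : " + row.getString(row.fieldIndex("name"))).show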
Note: DataFrame = Dataset[Row]
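Because a DataFrame is just a Dataset[Row], it can be turned back into a strongly typed Dataset with as[...]; a sketch reusing the Person case class defined above (the column names and types of teen match its fields):

val teenDS = teen.as[Person]               // Dataset[Person]
teenDS.map(p => "Name : " + p.name).show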
Specifying the Schema Programmatically (StructType)
When case classes cannot be defined ahead of time, a DataFrame can be created with the following three steps (demonstrated in the shell session below):
1. Create an RDD of Rows from the original RDD.
2. Create a StructType that matches the structure of the Rows.
3. Apply the StructType to the RDD of Rows via the createDataFrame method.
scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[21] at textFile at <console>:24
scala> peopleRDD.collect
res6: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
scala> val schemaStr = "name age"
schemaStr: String = name age
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val fields = schemaStr.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,StringType,true))
scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> val rowRDD = peopleRDD.map(_.split(",")).map(attrs => Row(attrs(0).trim, attrs(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[23] at map at <console>:31
scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: string]
scala> peopleDF.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
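With the schema above, both columns come out as string (age included). A standalone sketch of the same three steps with age declared as IntegerType and parsed during the Row mapping; the application name and master setting are illustrative assumptions:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ProgrammaticSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ProgrammaticSchema").master("local[*]").getOrCreate()

    // 1. Create an RDD of Rows from the original RDD
    val rowRDD = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attrs => Row(attrs(0).trim, attrs(1).trim.toInt))

    // 2. Create the StructType that matches the Row structure
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // 3. Apply the schema to the RDD of Rows
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.printSchema()
    peopleDF.show()

    spark.stop()
  }
}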