Spark supports two ways to convert RDDs into SchemaRDDs (renamed DataFrame as of Spark 1.3): the first uses reflection to infer the schema of an RDD that contains a specific type of object; the second uses a programmatic interface that lets you construct a schema and then apply it to an existing RDD.
1. Inferring the Schema Using Reflection
The Scala interface of Spark SQL supports automatically converting an RDD that contains case classes into a SchemaRDD. The case class defines the schema of the table: the names of its parameters are read via reflection and become the names of the columns. Such an RDD can be implicitly converted to a SchemaRDD and then registered as a table.

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
/**
 * Infers the schema from the Person case class via reflection,
 * converts the RDD to a DataFrame, and queries it with SQL.
 */
object UseCaseClass {
  case class Person(firstName: String, lastName: String, age: Int)

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext (note: in spark-shell, the pre-built sqlContext is a HiveContext, a superset of SQLContext)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = sc.textFile(args(0))
    val personRDD = data.map(_.split(",")).map(person => Person(person(0), person(1), person(2).toInt))
    // Convert the personRDD into the personDF DataFrame
    val personDF = personRDD.toDF()
    // Register the personDF as a temporary table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
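To try the reflection-based conversion interactively, the following sketch can be pasted into a Spark 1.x spark-shell, where sc and sqlContext are pre-defined; the sample rows are invented for illustration:

case class Person(firstName: String, lastName: String, age: Int)
import sqlContext.implicits._
// Build a small in-memory dataset instead of reading a file
val lines = sc.parallelize(Seq("Barack,Obama,53", "Michelle,Obama,51"))
val df = lines.map(_.split(",")).map(p => Person(p(0), p(1), p(2).toInt)).toDF()
df.registerTempTable("person")
sqlContext.sql("SELECT * FROM person WHERE age < 52").show()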
2. Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, when the structure of the records is encoded in a string), a case class will not work. In that situation, a SchemaRDD can be created in three steps:
- Create an RDD of Row objects from the original RDD;
- Create a schema, represented by a StructType, that matches the structure of the Row objects in the RDD created in the previous step;
- Apply the schema to the RDD of Rows to obtain a DataFrame.
StructType(fields: Array[StructField])
StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
- name: the name of the field;
- dataType: the data type of the field; the built-in primitive types include:
IntegerType, FloatType, BooleanType, ShortType, LongType, ByteType, DoubleType, StringType
- nullable: whether the field can be null; it defaults to true;
- metadata: the metadata of the field; Metadata wraps a Map[String, Any] and can store metadata of any type (a short example follows this list).
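As a rough illustration of these parameters, the sketch below (not part of the original example) builds a single StructField whose metadata is constructed with MetadataBuilder, the usual way to create a Metadata instance:

import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StructField}

// Hypothetical field: attach a human-readable description as metadata
val ageMeta: Metadata = new MetadataBuilder()
  .putString("description", "age in years")
  .build()
val ageField = StructField("age", IntegerType, nullable = true, metadata = ageMeta)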
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}
/**
 * Builds the schema programmatically with StructType/StructField,
 * applies it to an RDD of Rows, and queries the result with SQL.
 */
object SpecifySchema {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext (note: in spark-shell, the pre-built sqlContext is a HiveContext, a superset of SQLContext)
    val sqlContext = new SQLContext(sc)
    val data = sc.textFile(args(0))
    // Convert the RDD of Array[String] to an RDD of Row objects
    val personRow = data.map(_.split(",")).map(person => Row(person(0), person(1), person(2).toInt))
    /**
     * Create the schema using the StructType and StructField objects.
     * A StructField takes the field name, data type, and nullability.
     */
    val schema = StructType(
      Array(
        StructField("firstName", StringType, true),
        StructField("lastName", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to create the personDF DataFrame
    val personDF = sqlContext.createDataFrame(personRow, schema)
    // Register the personDF as a temporary table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
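For a quick interactive check of the programmatic path, this sketch (again for a Spark 1.x spark-shell, with invented data) verifies the applied schema and shows the DataFrame DSL equivalent of the SQL query:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rows = sc.parallelize(Seq(Row("Barack", "Obama", 53), Row("Michelle", "Obama", 51)))
val schema = StructType(Array(
  StructField("firstName", StringType, true),
  StructField("lastName", StringType, true),
  StructField("age", IntegerType, true)))
val df = sqlContext.createDataFrame(rows, schema)
df.printSchema()                 // confirm the schema was applied
df.where(df("age") < 52).show() // DSL equivalent of SELECT * FROM person WHERE age < 52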