Spark SQL: DataSet
1. Creating a DataSet
- Create a DataSet from a sequence of case class instances:

scala> case class Person(name: String, age: Int)
defined class Person

scala> val ds = Seq(Person("zs",18), Person("ls",20)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

scala> ds.show
+----+---+
|name|age|
+----+---+
|  zs| 18|
|  ls| 20|
+----+---+
- Create a DataSet from a sequence of a basic type:

scala> val ds = Seq(1,2,3,4,5).toDS
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.show
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+
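The `toDS` calls above work in spark-shell because it auto-imports `spark.implicits._` for the pre-built `spark` session. In a standalone application you must create the session and do the import yourself, and the case class must be defined at top level so Spark can derive an encoder for it. A minimal sketch, assuming Spark is on the classpath (the object and app names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Case class defined at top level so Spark can derive an Encoder[Person].
case class Person(name: String, age: Int)

object DataSetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSetDemo")
      .master("local[*]")
      .getOrCreate()

    // Required for the toDS / toDF implicit conversions outside spark-shell.
    import spark.implicits._

    val ds = Seq(Person("zs", 18), Person("ls", 20)).toDS
    ds.show()

    spark.stop()
  }
}
```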
2. Interoperating Between RDDs and DataSets
2.1 From RDD to DataSet
// 1. Define a case class
scala> case class Student(name: String, age: Int)
defined class Student
// 2. Convert the RDD to a DataSet
scala> val rdd = sc.textFile("./stu.txt").map(line => {val paras = line.split(","); Student(paras(0),paras(1).toInt)})
rdd: org.apache.spark.rdd.RDD[Student] = MapPartitionsRDD[118] at map at <console>:26
scala> val ds = rdd.toDS
ds: org.apache.spark.sql.Dataset[Student] = [name: string, age: int]
scala> ds.show
+----+---+
|name|age|
+----+---+
| zgl| 23|
| zzx| 20|
| zzz| 18|
+----+---+
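For reference, the `stu.txt` file read above holds one comma-separated record per line; inferred from the `ds.show` output, it would look like this:

```
zgl,23
zzx,20
zzz,18
```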
2.2 From DataSet to RDD
- Just call the rdd method:

scala> ds.show
+----+---+
|name|age|
+----+---+
| zgl| 23|
| zzx| 20|
| zzz| 18|
+----+---+

scala> val rdd = ds.rdd
rdd: org.apache.spark.rdd.RDD[Student] = MapPartitionsRDD[123] at rdd at <console>:30

scala> rdd.collect
res41: Array[Student] = Array(Student(zgl,23), Student(zzx,20), Student(zzz,18))
3. Interoperating Between DataFrames and DataSets
3.1 From DataFrame to DataSet
scala> val df = spark.read.json("./stu.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> case class Student(name: String, age: BigInt)
defined class Student
scala> val ds = df.as[Student]
ds: org.apache.spark.sql.Dataset[Student] = [age: bigint, name: string]
scala> ds.show
+---+----+
|age|name|
+---+----+
| 22| zgl|
| 18| zzx|
| 20| zzz|
+---+----+
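Note that `as[Student]` matches columns by name, and the column types must line up with the case class fields; that is why the case class above declares `age: BigInt` to match the `bigint` type Spark infers from JSON numbers. If you would rather work with `Int`, you can cast the column first. A sketch, continuing the spark-shell session above (`StudentI` is an illustrative name):

```
scala> case class StudentI(name: String, age: Int)
defined class StudentI

scala> val ds2 = df.select($"name", $"age".cast("int").as("age")).as[StudentI]
ds2: org.apache.spark.sql.Dataset[StudentI] = [name: string, age: int]
```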
3.2 From DataSet to DataFrame
scala> case class People(name: String, age: Int)
defined class People
scala> val ds = Seq(People("zzz",18)).toDS
ds: org.apache.spark.sql.Dataset[People] = [name: string, age: int]
scala> val df = ds.toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.show
+----+---+
|name|age|
+----+---+
| zzz| 18|
+----+---+
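`toDF` also accepts new column names, which is convenient when the case class field names are not what you want in the resulting DataFrame. Continuing the session above (`stu_name`/`stu_age` are illustrative names):

```
scala> val df2 = ds.toDF("stu_name", "stu_age")
df2: org.apache.spark.sql.DataFrame = [stu_name: string, stu_age: int]
```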