My Big Data Journey - Spark Dataset and RDD Interoperation

Contents

Inferring the Schema via Reflection

Specifying the Schema Programmatically (StructType)


  • Inferring the Schema via Reflection

In the Scala API, Spark SQL can automatically convert an RDD containing case class instances into a DataFrame. The case class defines the table's schema: its field names, read via reflection, become the column names.
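The shell session below relies on the encoder implicits that spark-shell pre-imports; a standalone application must import them from its own SparkSession before toDF becomes available on an RDD. A minimal sketch of the same flow under that assumption (the object and app names are illustrative):

import org.apache.spark.sql.SparkSession

// Define the case class at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object ReflectionSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReflectionSchema").getOrCreate()
    import spark.implicits._ // enables rdd.toDF and typed Dataset operations

    val df = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(a => Person(a(0).trim, a(1).trim.toInt))
      .toDF()

    df.show()
    spark.stop()
  }
}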

scala> case class Person(name:String, age:Int)
defined class Person

scala> val df = spark.sparkContext.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(attributes => Person(attributes(0).trim(),attributes(1).trim().toInt)).toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

scala> df.createOrReplaceTempView("persons")

scala> val teen = spark.sql("select * from persons where age < 30")
teen: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> teen.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
| Justin| 19|
+-------+---+

// Fields of a result Row can be accessed by index
scala> teen.map(row => "Name : " + row(0)).show
+--------------+
|         value|
+--------------+
|Name : Michael|
| Name : Justin|
+--------------+


// Fields can also be accessed by column name
scala> teen.map(row => "Name : " + row.getAs[String]("name")).show
+--------------+
|         value|
+--------------+
|Name : Michael|
| Name : Justin|
+--------------+
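For completeness, Row also provides typed positional getters such as getString; the mapping above could equivalently be written as:

teen.map(row => "Name : " + row.getString(0)).show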

Note: DataFrame = Dataset[Row], i.e. a DataFrame is simply a Dataset of Row objects.
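Because of this equivalence, the untyped DataFrame can be turned back into a strongly typed Dataset with as[T]. A brief sketch, reusing the Person case class and the df from the session above (the encoders come from spark.implicits._, which spark-shell pre-imports):

val personDS: org.apache.spark.sql.Dataset[Person] = df.as[Person]
personDS.filter(_.age < 30).map(_.name).show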

  • Specifying the Schema Programmatically (StructType)

If case classes cannot be defined ahead of time, a DataFrame can be created programmatically in three steps (a consolidated sketch follows the shell transcript below):

1. Create an RDD of Rows from the original RDD.

2. Create a StructType that matches the structure of those Rows.

3. Apply the StructType schema to the RDD of Rows via the createDataFrame method.

scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[21] at textFile at <console>:24

scala> peopleRDD.collect
res6: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

scala> val schemaStr = "name age"
schemaStr: String = name age

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val fields = schemaStr.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,StringType,true))

scala> val schema = StructType(fields)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))                                                

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val rowRDD = peopleRDD.map(_.split(",")).map(attrs => Row(attrs(0).trim, attrs(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[23] at map at <console>:31

scala> val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: string]

scala> peopleDF.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+
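For reference, here are the same three steps as one consolidated sketch. Unlike the shell session above, it types age as IntegerType, converting the parsed string to Int while building each Row, rather than leaving every column a string; the object and app names are illustrative:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ProgrammaticSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ProgrammaticSchema").getOrCreate()

    // Step 1: create an RDD of Rows from the raw text lines.
    val rowRDD = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attrs => Row(attrs(0).trim, attrs(1).trim.toInt))

    // Step 2: build a StructType that matches the Row structure.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    // Step 3: apply the schema to the RDD of Rows.
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.printSchema()
    peopleDF.show()

    spark.stop()
  }
}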

