Spark SQL: Converting Between RDD, DataFrame, and Dataset


people.txt

Michael,29
Andy,30
Justin,19

RDD to DataFrame

scala> val rdd=sc.textFile("people.txt")
rdd: org.apache.spark.rdd.RDD[String] = people.txt MapPartitionsRDD[44] at textFile at <console>:24
Method 1: Specify the column names and data types directly
scala> val ds=rdd.map(_.split(",")).map(x=>(x(0),x(1).trim().toInt)).toDF("name","age")
ds: org.apache.spark.sql.DataFrame = [name: string, age: int]
Method 2: Convert via reflection (case class)
scala> case class people(name:String,age:Long)
defined class people

scala> rdd.map(_.split(",")).map(x=>(people(x(0),x(1).trim.toInt))).toDF()
res44: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
Method 3: Set the schema programmatically (StructType)
# Use this approach when a case class cannot be defined in advance
scala> val rdd=sc.textFile("people.txt")
rdd: org.apache.spark.rdd.RDD[String] = people.txt MapPartitionsRDD[97] at textFile at <console>:27

scala> val schemaString = "name age"
schemaString: String = name age

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
fields: Array[org.apache.spark.sql.types.StructField] = Array(StructField(name,StringType,true), StructField(age,StringType,true))

scala> val schems=StructType(fields)
schems: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val rowrdd=rdd.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
rowrdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[99] at map at <console>:35

scala> spark.createDataFrame(rowrdd,schems)
res46: org.apache.spark.sql.DataFrame = [name: string, age: string]
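
In the transcript above every field is declared as StringType, so age ends up as a string column. The following is a minimal sketch of the same programmatic approach with an integer age column. The object name TypedSchemaSketch, the helper build, and the in-memory input lines are assumptions made for illustration; they are not part of the original session.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object, not from the original post.
object TypedSchemaSketch {
  // Builds a DataFrame from "name,age" lines, declaring age as an integer column.
  def build(spark: SparkSession, lines: Seq[String]): DataFrame = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))
    // The value placed in each Row must match the declared type, hence toInt.
    val rowRdd = spark.sparkContext
      .parallelize(lines)
      .map(_.split(","))
      .map(a => Row(a(0), a(1).trim.toInt))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-schema-sketch")
      .getOrCreate()
    build(spark, Seq("Michael,29", "Andy,30", "Justin,19")).printSchema()
    spark.stop()
  }
}
```

Converting the value with toInt matters here: if the Row still held a string while the schema declared IntegerType, the mismatch would only surface at runtime when the DataFrame is evaluated.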

RDD to Dataset

scala> rdd.map(_.split(",")).map(x=>(x(0),x(1).trim().toInt)).toDS()
res17: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
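
Mapping to a tuple yields the generic column names _1 and _2, as the output above shows. A minimal sketch of getting named columns instead is to map each line to a case class before calling toDS(). The case class Person and the object NamedDatasetSketch are hypothetical names introduced for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class; its field names become the Dataset column names.
case class Person(name: String, age: Int)

object NamedDatasetSketch {
  // Parses "name,age" lines into a typed Dataset[Person].
  def build(spark: SparkSession, lines: Seq[String]): Dataset[Person] = {
    import spark.implicits._ // provides toDS() and the encoder for Person
    spark.sparkContext
      .parallelize(lines)
      .map(_.split(","))
      .map(a => Person(a(0), a(1).trim.toInt))
      .toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-dataset-sketch")
      .getOrCreate()
    build(spark, Seq("Michael,29", "Andy,30", "Justin,19")).printSchema()
    spark.stop()
  }
}
```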

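DataFrame/Dataset to RDD
# Calling .rdd on a DataFrame returns an RDD[Row]; on a Dataset[T] it returns an RDD[T]. Here df stands for any existing DataFrame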

scala> ds.rdd
scala> df.rdd
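
Because a DataFrame converts to an RDD[Row], individual fields have to be read back out of each Row. The sketch below shows one way to do that with getAs; the object name RowRddSketch and the small in-memory DataFrame are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not from the original post.
object RowRddSketch {
  // Converts a small DataFrame to an RDD[Row] and reads the name field from each Row.
  def names(spark: SparkSession): Seq[String] = {
    import spark.implicits._ // provides toDF() on local collections
    val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
    df.rdd
      .map(row => row.getAs[String]("name")) // look the field up by column name
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-rdd-sketch")
      .getOrCreate()
    names(spark).foreach(println)
    spark.stop()
  }
}
```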

DataFrame to Dataset
# df is an existing DataFrame whose column names and types match the fields of the case class

scala> case class people(name:String,age:Long)
defined class people

scala> df.as[people]
res39: org.apache.spark.sql.Dataset[people] = [age: bigint, name: string]
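
The output above lists age before name even though the case class declares name first, because as[...] matches columns to fields by name rather than by position. A minimal sketch of that behavior follows; the case class PersonRecord, the object AsDatasetSketch, and the in-memory rows are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class with the same fields as the one in the transcript.
case class PersonRecord(name: String, age: Long)

object AsDatasetSketch {
  // Builds a DataFrame whose column order differs from the case class, then converts it.
  def build(spark: SparkSession): Dataset[PersonRecord] = {
    import spark.implicits._ // provides toDF() and the encoder for PersonRecord
    val df = Seq((29L, "Michael"), (30L, "Andy"), (19L, "Justin")).toDF("age", "name")
    df.as[PersonRecord] // fields are resolved by column name
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("as-dataset-sketch")
      .getOrCreate()
    build(spark).collect().foreach(println)
    spark.stop()
  }
}
```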

Dataset to DataFrame

scala> ds.toDF()