DataFrame to RDD:
Converting a DataFrame to an RDD is straightforward: just call .rdd on it.
scala> val df = spark.table("people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val rdd1 = df.rdd
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[9] at rdd at <console>:25
scala> rdd1.collect
res8: Array[org.apache.spark.sql.Row] = Array([3,c,22], [2,b,24])
scala> rdd1.map(x=>(x(1),x(0),x(2))).collect
res7: Array[(Any, Any, Any)] = Array((c,3,22), (b,2,24))
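Note that indexing a Row with x(0) returns Any, which is why the tuple above comes back as (Any, Any, Any). For typed access, Row offers getter methods; a minimal sketch against the same two columns whose types we know from the schema above (id: Int, name: String), with output along these lines:
scala> rdd1.map(r => (r.getString(1), r.getInt(0))).collect
res9: Array[(String, Int)] = Array((c,3), (b,2))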
RDD to DataFrame:
There are two ways:
1. Infer the schema via reflection, then call toDF
2. Specify the schema explicitly with StructType, then call createDataFrame
For details, see the official guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
A sample table is stored on HDFS at /test/person.txt with the following contents:
[root@master ~]# hadoop fs -cat /test/person.txt
1,tingting,23,80
2,ningning,25,90
3,ruhua,27,60
4,mimi,33,85
Creating a DataFrame from the RDD:
(1) Create the RDD
scala> val line = sc.textFile("hdfs://master:9000/test/person.txt").map(_.split(","))
(2) Create the DataFrame
Attach the schema (header) information.
Inferring the schema via reflection:
scala> case class Person(id: Int, name: String, age: Int, faceValue: Int)
scala> val personRDD = line.map(x => Person(x(0).toInt, x(1), x(2).toInt, x(3).toInt))
scala> val personDF = personRDD.toDF
scala> personDF.show
Or specify the schema explicitly with StructType:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
Specify the schema directly; the third argument of StructField (true) marks the field as nullable (the printSchema output after the table below confirms this):
val schema = StructType(
  List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("faceValue", IntegerType, true)
  )
)
val rowRDD = line.map(x => Row(x(0).toInt, x(1), x(2).toInt, x(3).toInt))
val personDF = spark.createDataFrame(rowRDD, schema)
personDF.show
+---+--------+---+---------+
| id| name|age|faceValue|
+---+--------+---+---------+
| 1|tingting| 23| 80|
| 2|ningning| 25| 90|
| 3| ruhua| 27| 60|
| 4| mimi| 33| 85|
+---+--------+---+---------+
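To double-check the schema that was attached (including the nullable flags set in StructField), printSchema can be used; for this schema Spark prints something like:
scala> personDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- faceValue: integer (nullable = true)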
Here, case class Person defines a case class that acts as the constructor for each row; its field names and types supply the schema when toDF is called.
personRDD looks like this:
scala> personRDD.collect
res2: Array[Person] = Array(Person(1,tingting,23,80), Person(2,ningning,25,90), Person(3,ruhua,27,60), Person(4,mimi,33,85))
IDEA code:
Inferring the schema via reflection.
Since Spark 2.0, the SparkSession interface replaces SQLContext and HiveContext.
SparkSession is the entry point for Spark SQL programs.
The spark-shell command creates a SparkContext and a SparkSession automatically.
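For example, in spark-shell both are already bound to the names sc and spark, so evaluating them simply prints the ready-made instances (the exact object hash will differ per session):
scala> spark
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@<hash>
scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@<hash>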
import org.apache.spark.sql.SparkSession

// Case class defining the header fields, used when the DataFrame is created
case class Person(id: Int, name: String, age: Int, faceValue: Int)

object sparkSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql test")
      .master("local")
      .getOrCreate()
    // Implicit conversions; RDD-to-DataFrame methods such as toDF need this import
    import spark.implicits._
    // Read the input file and split each line on commas
    val line = spark.sparkContext.textFile("hdfs://master:9000/test/person.txt")
      .map(_.split(","))
    // Associate the RDD with the case class
    val personRDD = line.map(x => Person(x(0).toInt, x(1), x(2).toInt, x(3).toInt))
    // Convert the RDD to a DataFrame
    val personDF = personRDD.toDF
    // View the DataFrame contents
    personDF.show
    // Register the DataFrame as a temporary view
    personDF.createOrReplaceTempView("person")
    // Query
    val df = spark.sql("select * from person order by age desc limit 2")
    // Write the result out as JSON
    df.write.json("hdfs://master:9000/result")
  }
}
Things to note:
1. SparkSession:
Since Spark 2.0, SparkConf, SparkContext, and SQLContext are all encapsulated in SparkSession (see the sketch after this list for how to reach them), which is built via its builder:
val test = SparkSession.builder().getOrCreate()
2. import test.implicits._
This imports the implicit conversions; RDD-to-DataFrame conversions such as toDF require it.
3. Registration:
To use spark.sql("<sql statement>") against a DataFrame, it must first be registered as a temporary view
using createOrReplaceTempView.
Since Spark 2.0, registerTempTable is deprecated.
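Because the older entry points live inside the session, they can be reached from it directly; a minimal sketch (spark is the session built above):
// The underlying SparkContext is exposed as a field
val sc = spark.sparkContext
// The legacy SQLContext is still available for old APIs
val sqlContext = spark.sqlContext
// Runtime SQL configuration is set through spark.conf
spark.conf.set("spark.sql.shuffle.partitions", "4")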
Result:
Download the output from HDFS (the /result directory written above) and inspect it.
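Alternatively, the written JSON can be read back with Spark itself to verify it; a quick sketch:
val result = spark.read.json("hdfs://master:9000/result")
result.show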
IDEA code:
Specifying the schema explicitly with StructType.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

object sparkSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql test")
      .master("local")
      .getOrCreate()
    // Implicit conversions; not strictly needed here since createDataFrame is used instead of toDF
    import spark.implicits._
    // Read the input file and split each line on commas
    val line = spark.sparkContext.textFile("hdfs://master:9000/test/person.txt")
      .map(_.split(","))
    // With StructType, the case class association is no longer needed:
    //val personRDD = line.map(x => Person(x(0).toInt, x(1), x(2).toInt, x(3).toInt))
    // Specify the schema directly; the boolean marks each field as nullable
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true),
        StructField("faceValue", IntegerType, true)
      )
    )
    // Map each line to a Row matching the schema
    val rowRDD = line.map(x => Row(x(0).toInt, x(1), x(2).toInt, x(3).toInt))
    val personDF = spark.createDataFrame(rowRDD, schema)
    //val personDF = personRDD.toDF
    //personDF.show
    // Register the DataFrame as a temporary view
    personDF.createOrReplaceTempView("person")
    // Query
    val df = spark.sql("select * from person order by age desc limit 2")
    // Output
    df.show
  }
}
Result:
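Given the sample data, the two oldest people are mimi (33) and ruhua (27), so df.show should print something like:
+---+-----+---+---------+
| id| name|age|faceValue|
+---+-----+---+---------+
|  4| mimi| 33|       85|
|  3|ruhua| 27|       60|
+---+-----+---+---------+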