DataFrame
val df = spark.read.json("file:///home/xxxxxxxx.json") // under the hood this goes through the external data source API
df.show
df.printSchema
Defining a schema
val structType = StructType(Array(StructField("id",StringType,true),StructField("name",StringType,true)))
//StructField: field name, data type, nullable
val df = spark.createDataFrame(rdd,structType)
Json:
$SPARK_HOME/examples/src/main/resources/employees.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
val df = spark.read.json("file:///home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources/employees.json")
scala> df.show
+-------+-------+------+
|   city|   name|salary|
+-------+-------+------+
|   null|Michael|3000.0|
|   null|   Andy|4500.0|
|   null| Justin|3500.0|
|beijing|  Berta|4000.0|
+-------+-------+------+
scala> df.printSchema
root
|-- city: string (nullable = true)
|-- name: string (nullable = true)
|-- salary: double (nullable = true)
scala> df.createOrReplaceTempView("emp")
scala> spark.sql("select * from emp").show
19/06/04 09:29:50 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+-------+-------+------+
|   city|   name|salary|
+-------+-------+------+
|   null|Michael|3000.0|
|   null|   Andy|4500.0|
|   null| Justin|3500.0|
|beijing|  Berta|4000.0|
+-------+-------+------+
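The warning above mentions the global_temp database, which is where global temp views live (as opposed to session-scoped views like emp). A minimal, self-contained sketch of the difference, built on an in-memory Seq rather than employees.json and a throwaway local session (in spark-shell, spark already exists):

```scala
import org.apache.spark.sql.SparkSession

// Throwaway local session just for this sketch
val spark = SparkSession.builder().master("local[*]").appName("views").getOrCreate()
import spark.implicits._

val df = Seq(("Michael", 3000.0), ("Andy", 4500.0)).toDF("name", "salary")

// Session-scoped view: dropped when this SparkSession ends
df.createOrReplaceTempView("emp")
val high = spark.sql("select name from emp where salary > 3500")
high.show()

// Application-scoped view: must be qualified with the global_temp database
df.createGlobalTempView("emp_global")
spark.sql("select name from global_temp.emp_global").show()
```

createOrReplaceTempView is usually what you want interactively; createGlobalTempView only matters when several SparkSessions need to share a view.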
scala> df.select("name","salary").show
scala> df.select($"name",$"salary").show   // requires import spark.implicits._ beforehand
scala> df.select(df("name"),df("salary")).show
+-------+------+
|   name|salary|
+-------+------+
|Michael|3000.0|
|   Andy|4500.0|
| Justin|3500.0|
|  Berta|4000.0|
+-------+------+
scala> df.select(df("salary")+10).show
+-------------+
|(salary + 10)|
+-------------+
|       3010.0|
|       4510.0|
|       3510.0|
|       4010.0|
+-------------+
scala> df.filter(df("salary")>3600).show
+-------+-----+------+
|   city| name|salary|
+-------+-----+------+
|   null| Andy|4500.0|
|beijing|Berta|4000.0|
+-------+-----+------+
scala> df.filter("salary>3600").show
+-------+-----+------+
|   city| name|salary|
+-------+-----+------+
|   null| Andy|4500.0|
|beijing|Berta|4000.0|
+-------+-----+------+
Filtering out bad rows
student.filter("name='' or name='null' or name='NULL'")
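The string-expression filter above can also be written with the Column API, where null handling is explicit: isNull/isNotNull test for real SQL NULLs, whereas name='null' only matches the literal string "null". A sketch with made-up sample data (the student rows are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter").getOrCreate()
import spark.implicits._

val student = Seq(
  ("1", "zhangsan"),
  ("2", null),   // a real SQL NULL
  ("3", ""),     // empty string
  ("4", "NULL")  // the literal string "NULL"
).toDF("id", "name")

// Keep only rows whose name is a genuine value; =!= is the Column inequality operator
val clean = student.filter(
  student("name").isNotNull && (student("name") =!= "") && (student("name") =!= "NULL")
)
clean.show()
```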
Sorting
student.sort(student("name").desc,student("id").desc).show()
Renaming a column
student.select($"phone".as("mobile"))
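Note that select + as keeps only the selected column; withColumnRenamed renames one column while keeping all the others, which is often what you want. A sketch with hypothetical student data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rename").getOrCreate()
import spark.implicits._

val student = Seq(("1", "zhangsan", "13800000000")).toDF("id", "name", "phone")

// select + as: result has only the one renamed column
student.select($"phone".as("mobile")).show()

// withColumnRenamed: full row, with phone renamed to mobile
val renamed = student.withColumnRenamed("phone", "mobile")
renamed.show()
```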
join
val stu1 = spark.sparkContext.textFile("file://xxxx").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
val stu2 = spark.sparkContext.textFile("file://xxxx").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
stu1.join(stu2) // without a join condition, the result is the Cartesian product
stu1.join(stu2,stu1("id")===stu2("id"))
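With the stu1("id") === stu2("id") form, the result keeps both id columns. Passing the key as a Seq of column names does the same equi-join but keeps a single id. A self-contained sketch on in-memory data (the columns here are assumptions, a subset of the Student fields above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join").getOrCreate()
import spark.implicits._

val stu1 = Seq(("1", "zhangsan"), ("2", "lisi")).toDF("id", "name")
val stu2 = Seq(("1", "13800000000")).toDF("id", "phone")

// Column-expression join: output contains two id columns (one from each side)
stu1.join(stu2, stu1("id") === stu2("id")).show()

// USING-style join: the id key appears only once in the output
val joined = stu1.join(stu2, Seq("id"))
joined.show()
```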
Converting between RDD and DataFrame/Dataset **********
- reflection: typically used at the data source, because the structure of the data is known at that point
import spark.implicits._
spark.sparkContext.textFile("").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
case class Student(id:String,name:String,phone:String,email:String)
rdd.take(4) // the result is not an RDD but an Array
rdd.first
rdd.head(3).foreach(println)
- programmatic: create the schema yourself, then apply it to the RDD
1.create RDD of Rows from the original RDD
2.Create the schema by a StructType
3.apply the schema to the RDD of Rows via createDataFrame

1.create RDD of Rows from the original RDD -- convert the original RDD into an RDD of Row
val rdd = spark.sparkContext.textFile("file://xxx").map(_.split("\\|")).map(x=>Row(x(0),x(1),x(2),x(3)))
2.Create the schema by a StructType
StructType = Array[StructField]
StructField = {field name, data type, nullable}
val structType = StructType(Array(StructField("id",StringType,true),
  StructField("name",StringType,true),
  StructField("phone",StringType,true),
  StructField("email",StringType,true)))
3.apply the schema to the RDD of Rows via createDataFrame
val df = spark.createDataFrame(rdd,structType)
df.printSchema
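The three steps above as one runnable sketch, replacing the text file with an in-memory RDD (the id|name|phone|email line format is an assumption carried over from the earlier Student examples):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder().master("local[*]").appName("schema").getOrCreate()

// 1. create an RDD of Rows from the original RDD
val lines = spark.sparkContext.parallelize(Seq("1|zhangsan|13800000000|zs@test.com"))
val rowRdd = lines.map(_.split("\\|")).map(x => Row(x(0), x(1), x(2), x(3)))

// 2. create the schema with a StructType
val structType = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true),
  StructField("phone", StringType, true),
  StructField("email", StringType, true)))

// 3. apply the schema to the RDD of Rows
val df = spark.createDataFrame(rowRdd, structType)
df.printSchema()
df.show()
```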
RDD.map ==> RDD
DataFrame.map ==> DataSet
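A quick illustration of the last two lines: mapping an RDD yields another RDD, while mapping a DataFrame yields a Dataset (a DataFrame is just Dataset[Row], and map needs an Encoder from spark.implicits._). Sample data is made up:

```scala
import org.apache.spark.sql.{SparkSession, Dataset}

val spark = SparkSession.builder().master("local[*]").appName("map").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
val rdd2 = rdd.map(_ * 2)  // RDD.map ==> RDD[Int]

val df = Seq(("Michael", 3000.0)).toDF("name", "salary")
// DataFrame.map ==> Dataset[String]; the String encoder comes from spark.implicits._
val ds: Dataset[String] = df.map(row => row.getString(0))
ds.show()
```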