DataFrame
val df = spark.read.json("file:///home/xxxxxxxx.json") // under the hood this goes through the external data source API
df.show
df.printSchema
Defining a schema
val structType = StructType(Array(StructField("id",StringType,true),StructField("name",StringType,true)))
//StructField: field name, data type, nullable
val df = spark.createDataFrame(rdd,structType)
Json:
$SPARK_HOME/examples/src/main/resources/employees.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
val df = spark.read.json("file:///home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/examples/src/main/resources/employees.json")
scala> df.show
+-------+-------+------+
|   city|   name|salary|
+-------+-------+------+
|   null|Michael|3000.0|
|   null|   Andy|4500.0|
|   null| Justin|3500.0|
|beijing|  Berta|4000.0|
+-------+-------+------+
scala> df.printSchema
root
|-- city: string (nullable = true)
|-- name: string (nullable = true)
|-- salary: double (nullable = true)
scala> df.createOrReplaceTempView("emp")
scala> spark.sql("select * from emp").show
19/06/04 09:29:50 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+-------+-------+------+
|   city|   name|salary|
+-------+-------+------+
|   null|Michael|3000.0|
|   null|   Andy|4500.0|
|   null| Justin|3500.0|
|beijing|  Berta|4000.0|
+-------+-------+------+
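The warning above mentions the global_temp database, which is where global temp views live (as opposed to session-scoped views like emp). A minimal, self-contained sketch of the difference, built on an in-memory Seq rather than employees.json and a throwaway local session (in spark-shell, spark already exists):

```scala
import org.apache.spark.sql.SparkSession

// Throwaway local session just for this sketch
val spark = SparkSession.builder().master("local[*]").appName("views").getOrCreate()
import spark.implicits._

val df = Seq(("Michael", 3000.0), ("Andy", 4500.0)).toDF("name", "salary")

// Session-scoped view: dropped when this SparkSession ends
df.createOrReplaceTempView("emp")
val high = spark.sql("select name from emp where salary > 3500")
high.show()

// Application-scoped view: must be qualified with the global_temp database
df.createGlobalTempView("emp_global")
spark.sql("select name from global_temp.emp_global").show()
```

createOrReplaceTempView is usually what you want interactively; createGlobalTempView only matters when several SparkSessions need to share a view.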
scala> df.select("name","salary").show
scala> df.select($"name",$"salary").show   // requires import spark.implicits._ beforehand
scala> df.select(df("name"),df("salary")).show
+-------+------+
|   name|salary|
+-------+------+
|Michael|3000.0|
|   Andy|4500.0|
| Justin|3500.0|
|  Berta|4000.0|
+-------+------+
scala> df.select(df("salary")+10).show
+-------------+
|(salary + 10)|
+-------------+
|       3010.0|
|       4510.0|
|       3510.0|
|       4010.0|
+-------------+
scala> df.filter(df("salary")>3600).show
+-------+-----+------+
|   city| name|salary|
+-------+-----+------+
|   null| Andy|4500.0|
|beijing|Berta|4000.0|
+-------+-----+------+
scala> df.filter("salary>3600").show
+-------+-----+------+
|   city| name|salary|
+-------+-----+------+
|   null| Andy|4500.0|
|beijing|Berta|4000.0|
+-------+-----+------+
Filtering out bad rows
student.filter("name='' or name='null' or name='NULL'")
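The string-expression filter above can also be written with the Column API, where null handling is explicit: isNull/isNotNull test for real SQL NULLs, whereas name='null' only matches the literal string "null". A sketch with made-up sample data (the student rows are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter").getOrCreate()
import spark.implicits._

val student = Seq(
  ("1", "zhangsan"),
  ("2", null),   // a real SQL NULL
  ("3", ""),     // empty string
  ("4", "NULL")  // the literal string "NULL"
).toDF("id", "name")

// Keep only rows whose name is a genuine value; =!= is the Column inequality operator
val clean = student.filter(
  student("name").isNotNull && (student("name") =!= "") && (student("name") =!= "NULL")
)
clean.show()
```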
Sorting
student.sort(student("name").desc,student("id").desc).show()
Renaming a column
student.select($"phone".as("mobile"))
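Note that select + as keeps only the selected column; withColumnRenamed renames one column while keeping all the others, which is often what you want. A sketch with hypothetical student data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rename").getOrCreate()
import spark.implicits._

val student = Seq(("1", "zhangsan", "13800000000")).toDF("id", "name", "phone")

// select + as: result has only the one renamed column
student.select($"phone".as("mobile")).show()

// withColumnRenamed: full row, with phone renamed to mobile
val renamed = student.withColumnRenamed("phone", "mobile")
renamed.show()
```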
join
val stu1 = spark.sparkContext.textFile("file://xxxx").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
val stu2 = spark.sparkContext.textFile("file://xxxx").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
stu1.join(stu2) // without a join condition, the result is the Cartesian product
stu1.join(stu2,stu1("id")===stu2("id"))
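With the stu1("id") === stu2("id") form, the result keeps both id columns. Passing the key as a Seq of column names does the same equi-join but keeps a single id. A self-contained sketch on in-memory data (the columns here are assumptions, a subset of the Student fields above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join").getOrCreate()
import spark.implicits._

val stu1 = Seq(("1", "zhangsan"), ("2", "lisi")).toDF("id", "name")
val stu2 = Seq(("1", "13800000000")).toDF("id", "phone")

// Column-expression join: output contains two id columns (one from each side)
stu1.join(stu2, stu1("id") === stu2("id")).show()

// USING-style join: the id key appears only once in the output
val joined = stu1.join(stu2, Seq("id"))
joined.show()
```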
Converting between RDD and DataFrame/Dataset **********
- reflection: typically used at the data source, because the structure of the data is known at that point
import spark.implicits._
spark.sparkContext.textFile("").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF
case class Student(id:String,name:String,phone:String,email:String)
rdd.take(4) // the result is not an RDD but an Array
rdd.first
rdd.head(3).foreach(println)
- programmatic: create the schema yourself, then apply it to the RDD
1.create RDD of Rows from the original RDD
2.Create the schema by a StructType
3.apply the schema to the RDD of Rows via createDataFrame

1.create RDD of Rows from the original RDD -- convert the original RDD into an RDD of Row
val rdd = spark.sparkContext.textFile("file://xxx").map(_.split("\\|")).map(x=>Row(x(0),x(1),x(2),x(3)))
2.Create the schema by a StructType
StructType = Array[StructField]
StructField = {field name, data type, nullable}
val structType = StructType(Array(StructField("id",StringType,true),
  StructField("name",StringType,true),
  StructField("phone",StringType,true),
  StructField("email",StringType,true)))
3.apply the schema to the RDD of Rows via createDataFrame
val df = spark.createDataFrame(rdd,structType)
df.printSchema
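The three steps above as one runnable sketch, replacing the text file with an in-memory RDD (the id|name|phone|email line format is an assumption carried over from the earlier Student examples):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder().master("local[*]").appName("schema").getOrCreate()

// 1. create an RDD of Rows from the original RDD
val lines = spark.sparkContext.parallelize(Seq("1|zhangsan|13800000000|zs@test.com"))
val rowRdd = lines.map(_.split("\\|")).map(x => Row(x(0), x(1), x(2), x(3)))

// 2. create the schema with a StructType
val structType = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true),
  StructField("phone", StringType, true),
  StructField("email", StringType, true)))

// 3. apply the schema to the RDD of Rows
val df = spark.createDataFrame(rowRdd, structType)
df.printSchema()
df.show()
```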
RDD.map ==> RDD
DataFrame.map ==> DataSet
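A quick illustration of the last two lines: mapping an RDD yields another RDD, while mapping a DataFrame yields a Dataset (a DataFrame is just Dataset[Row], and map needs an Encoder from spark.implicits._). Sample data is made up:

```scala
import org.apache.spark.sql.{SparkSession, Dataset}

val spark = SparkSession.builder().master("local[*]").appName("map").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
val rdd2 = rdd.map(_ * 2)  // RDD.map ==> RDD[Int]

val df = Seq(("Michael", 3000.0)).toDF("name", "salary")
// DataFrame.map ==> Dataset[String]; the String encoder comes from spark.implicits._
val ds: Dataset[String] = df.map(row => row.getString(0))
ds.show()
```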