一、Creating a DataFrame
scala> import spark.implicits._
import spark.implicits._
scala> val df = spark.read.json("file:///usr/local/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
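A DataFrame can also be built directly from local data rather than a file, using the implicits imported above. A minimal sketch (the sample rows here are illustrative):

scala> val df2 = Seq((30, "Andy"), (19, "Justin")).toDF("age", "name")
df2: org.apache.spark.sql.DataFrame = [age: int, name: string]
scala> df2.show()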
二、Saving a DataFrame
Create a DataFrame from the file people.json, then save it as a CSV file and as a plain-text txt file.
scala> val peopleDF = spark.read.format("json").load("file:///usr/local/spark/examples/src/main/resources/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> peopleDF.select("name","age").write.format("csv").save("file:///usr/local/spark/newpeople.csv")
scala> peopleDF.rdd.saveAsTextFile("file:///usr/local/spark/newpeople.txt")
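Note that save() fails by default if the target directory already exists; a write mode can be specified to change this. A sketch, reusing the output path above:

scala> peopleDF.select("name","age").write.mode("overwrite").format("csv").save("file:///usr/local/spark/newpeople.csv")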
三、Common DataFrame operations
1、printSchema()
The printSchema() operation prints the schema of the DataFrame:
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
2、select()
The select() operation selects a subset of columns from the DataFrame. Here it selects the name and age columns, and adds 1 to every value in the age column:
scala> df.select(df("name"),df("age")+1).show()
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| null|
| Andy| 31|
| Justin| 20|
+-------+---------+
select() can also rename a column, e.g. renaming the name column to username:
scala> df.select(df("name").as("username"),df("age")).show()
+--------+----+
|username| age|
+--------+----+
| Michael|null|
|    Andy|  30|
|  Justin|  19|
+--------+----+
3、filter()
The filter() operation performs conditional queries, returning the records that satisfy a given condition. For example, to query all records with age greater than 20:
scala> df.filter(df("age")>20).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
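Filter conditions can also be combined with the && and || column operators. A minimal sketch:

scala> df.filter(df("age") > 20 && df("age") < 40).show()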
4、groupBy()
The groupBy() operation groups records. For example, we can group by the age field and count the number of records in each group:
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
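Besides count(), groupBy() can compute other aggregates through agg(), given the built-in aggregate functions. A sketch (the alias "n" is illustrative):

scala> import org.apache.spark.sql.functions._
scala> df.groupBy("age").agg(count("name").as("n")).show()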