Big Data Study Notes 03
Spark SQL Basic Operations
- Copy the following JSON data into /usr/local/spark on your Ubuntu system and save it as employee.json:
{ "id":1 ,"name":" Ella","age":36 }
{ "id":2,"name":"Bob","age":29 }
{ "id":3 ,"name":"Jack","age":29 }
{ "id":4 ,"name":"Jim","age":28 }
{ "id":5 ,"name":"Damon" }
{ "id":5 ,"name":"Damon" }
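One caveat worth noting: `spark.read.json` expects JSON Lines input, i.e. one complete JSON object per line, not a single JSON array. Each line above can be checked with the standard `json` module (a minimal sketch, independent of Spark):

```python
import json

# Each line of employee.json must parse as a standalone JSON object.
lines = [
    '{ "id":1 ,"name":" Ella","age":36 }',
    '{ "id":5 ,"name":"Damon" }',
]
for line in lines:
    record = json.loads(line)  # raises ValueError if the line is malformed
    print(record.get("name"), record.get("age"))
```

Missing keys (Damon's `age`) come back as `None` via `dict.get`, which is how they end up as nulls in the DataFrame.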
- First create a DataFrame from employee.json, then write Python statements to complete the following operations:
- Create the DataFrame
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# builder is a property, not a method: no parentheses after it
spark = SparkSession.builder.getOrCreate()
df = spark.read.json("file:///usr/local/spark/employee.json")
df.show()                          # show all records
df.distinct().show()               # show all records with duplicates removed
df.drop("id").show()               # show all records without the id column
df.filter(df.age > 30).show()      # show records with age over 30
df.groupBy("name").count().show()  # group by name and count each group
df.sort(df.name.asc()).show()      # sort by name in ascending order
df.take(3)  # first 3 rows as a list of Rows; df.head(3) is equivalent
- Query the name column of all records, aliased as username
df.select(df.name.alias("username")).show()
df.agg({"age": "mean"}).show()  # average age
df.agg({"age": "max"}).show()   # maximum age
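The two aggregations can be sanity-checked by hand: `mean` and `max` skip records whose `age` field is missing (Damon's rows), so only the four present ages count. A plain-Python check of that arithmetic, independent of Spark:

```python
# Ages present in employee.json; Damon's records have no age field,
# so Spark's mean/max aggregations ignore them (nulls are skipped).
ages = [36, 29, 29, 28]

mean_age = sum(ages) / len(ages)  # (36 + 29 + 29 + 28) / 4 = 30.5
max_age = max(ages)               # 36

print(mean_age, max_age)
```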