# PySpark imports
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql.functions import hour, when, col, date_format, to_timestamp
# Define the Spark context; SQLContext is the entry point for DataFrame reads here
sc = pyspark.SparkContext(appName="Homework")
sqlContext = SQLContext(sc)
# Function to load the trip data CSV (the header row supplies column names)
def load_data():
    df = sqlContext.read.option("header", True).csv("yellow_tripdata_2019-01_short.csv")
    return df

df = load_data()
orderBy documentation: https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html?highlight=orderby#pyspark.sql.DataFrame.orderBy
In PySpark you can do groupBy operations much like in pandas, and count works the same way. For example, the simple statement below groups on column1, counts the rows in each group, and displays the result:

df.groupBy("column1").count().show()

Now let's extend this statement with conditions and other variations to handle more realistic questions.