PySpark getting-started demo
Recently my dataset grew too large and processing it with pandas took too long, so I started learning PySpark for data processing.
Creating a DataFrame with PySpark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as fn
from pyspark.sql.functions import col, current_date, lit, when
# Create a SparkSession object
conf = SparkConf().setAppName("spark_1").setMaster("local[2]")
ss = SparkSession.builder.config(conf=conf).getOrCreate()
# Create a DataFrame
df1 = ss.createDataFrame([
    ("may", '2020-10-13', 5000, "F"), ("may", '2020-11-12', 8800, "F"), ("may", '2020-12-15', 6000, "F"),
    ("donce", '2020-10-10', 1800, "M"), ("donce", '2020-11-10', 6600, "M"), ("donce", '2020-12-10', 8800, "M")],
    ("name", "date", "exp", 'sex'))
# show() prints the DataFrame itself and returns None, so wrapping it in print() only adds a stray "None"
df1.show()
Syntax for adding, dropping, and updating columns
1. Add a column based on a condition
df1.withColumn('sex_id', fn.when(col('sex') == 'M', 1).otherwise(0)).show()
df1.withColumn('exp_id', when(col('exp') <= 2000, 1)
               .when((col('exp') > 2000) & (col('exp') <= 6000), 2)
               .otherwise(3)).show()
2. Add a custom column and drop the 'sex' column
# current_date() already returns a Column, so no lit() wrapper is needed
df1.withColumn('current', current_date()).drop('sex').show()
df1.withColumn('degree', lit('master')).drop('sex').show()
3. Window functions
window = Window.partitionBy("name").orderBy(df1["exp"].desc())
df2 = df1.withColumn('topn', fn.row_number().over(window))
df2.show()
4. Select the row with the highest exp for each name
df2.where(df2.topn <= 1).show()
These are small notes left over from studying before the new year; next year I hope to find more time to learn and keep improving!