1. Using window functions in PySpark
PySpark exposes SQL-style window functions through the Window class.
partitionBy defines the groups, and orderBy sorts the rows within each group, here by two columns. The ordering inside a group should be unique: if two rows carry the same date the running total is ambiguous, so add further columns to orderBy to break ties.
F.sum over a window gives a grouped running total, the equivalent of pandas groupby().cumsum(); the amount is accumulated up to and including the current row. Pay particular attention to this inclusive behaviour.
from pyspark.sql import Window
import pyspark.sql.functions as F
# Running total up to and including the current row
windowSpec = Window.partitionBy("ryid").orderBy("jsrq", "sjfsfyze")
js = js.withColumn("cum_sum", F.sum(js["sjfsfyze"]).over(windowSpec))  # cumulative sum per ryid
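The snippet above runs against a settlement table js that is not shown. Below is a minimal, self-contained sketch of the same pattern on made-up toy data; the column names ryid, jsrq and sjfsfyze simply mirror the snippet and carry no other meaning.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("window-demo").getOrCreate()
toy = spark.createDataFrame(
    [("p1", "2021-01-01", 100.0),
     ("p1", "2021-01-02", 50.0),
     ("p2", "2021-01-01", 30.0)],
    ["ryid", "jsrq", "sjfsfyze"])
# Partition by person id, order by date (amount breaks ties within a date)
w = Window.partitionBy("ryid").orderBy("jsrq", "sjfsfyze")
# The running total includes the current row, like pandas groupby().cumsum()
toy.withColumn("cum_sum", F.sum("sjfsfyze").over(w)).show()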
2. Stacking DataFrames in PySpark
unionAll appends the rows of one DataFrame to another (the equivalent of SQL UNION ALL) and does not remove duplicate rows.
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James","Sales","NY",90000,35,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000) \
]
columns= ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
simpleData2 = [("James","Sales","NY",90000,34,10000), \
("Maria","Finance","CA",90000,24,23000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
columns2= ["employee_name","department","state","salary","age","bonus"]
df2 = spark.createDataFrame(data = simpleData2, schema = columns2)
df2.printSchema()
df2.show(truncate=False)
# Stack the two DataFrames; duplicate rows are kept
output = df.unionAll(df2)
output.show(truncate=False)
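If duplicates should be dropped after stacking, or the two DataFrames list their columns in a different order, the sketch below (reusing df and df2 from above) shows the usual variants; this is an illustrative aside rather than part of the original example.
# union/unionAll keep duplicate rows; chain distinct() to drop them
df.union(df2).distinct().show(truncate=False)
# unionByName matches columns by name rather than by position
df.unionByName(df2).show(truncate=False)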
3. Splitting a string column into several columns
Use the split function, then getItem(i) to pick the i-th piece of the split (0-indexed).
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col,substring,regexp_replace
spark=SparkSession.builder.appName("sparkbyexamples").getOrCreate()
data = [('James','','Smith','1991-04-01'),
('Michael','Rose','','2000-05-19'),
('Robert','','Williams','1978-09-05'),
('Maria','Anne','Jones','1967-12-01'),
('Jen','Mary','Brown','1980-02-17')
]
columns=["firstname","middlename","lastname","dob"]
df=spark.createDataFrame(data,columns)
df.printSchema()
df.show(truncate=False)
df1 = df.withColumn('year', split(df['dob'], '-').getItem(0)) \
.withColumn('month', split(df['dob'], '-').getItem(1)) \
.withColumn('day', split(df['dob'], '-').getItem(2))
df1.printSchema()
df1.show(truncate=False)
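One caveat: the pieces returned by split() are still strings. The short follow-up sketch below (continuing from the df defined above, with an illustrative variable name df_parts) casts them to integers.
# getItem() returns string pieces; cast them when numeric values are needed
df_parts = df.withColumn('year', split(df['dob'], '-').getItem(0).cast('int')) \
    .withColumn('month', split(df['dob'], '-').getItem(1).cast('int')) \
    .withColumn('day', split(df['dob'], '-').getItem(2).cast('int'))
df_parts.printSchema()
df_parts.show(truncate=False)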
4. Grouped aggregation with groupBy and agg
Use the groupBy().agg() pattern; if no grouping is needed, call groupBy() with no columns and then agg() to aggregate over the whole DataFrame.
alias() renames the aggregated column, which differs from how renaming works in pandas.
F.when(...).otherwise(...) is the equivalent of SQL's CASE WHEN ... THEN ... ELSE ... END.
js.filter(
    col("医疗类别").isin("门诊慢性病")
    & col("险种类型").isin("职工")
    & col("医疗人员类别").isin('建国前老工人', '离休职工', '退休公务员', '退休职工')
).groupby("人员id", "医疗机构类型", "大病起付线降低标识").agg(
    F.sum('统筹支付金额').alias("统筹支付金额"),
    F.sum('统筹纳入范围').alias("统筹纳入范围"),
    F.sum('转外自理').alias("转外自理"),
    F.sum('大病纳入范围').alias("大病纳入范围"),
    F.sum('大病支付').alias("大病支付"),
    F.sum('大病药品支付').alias("大病药品支付"),
    F.sum('大病药品纳入范围').alias("大病药品纳入范围"),
    F.sum('ylzf').alias("乙类自付扣除超限价金额"),
    F.sum('zl').alias("自理金额"),
    F.sum('cxg').alias("重新计算超限价金额"),
    F.sum(col("统筹纳入范围") + col("转外自理") - col("统筹支付金额")
          - col("个人账户支付") - col("公务员补助")).alias("大病合规费用计算1"),
    F.sum(F.when(col("统筹纳入范围") < 200, 0)
           .otherwise(F.when(col("是否基层") == 1, (col("统筹纳入范围") - 200) * 0.85)
                       .otherwise((col("统筹纳入范围") - 400) * 0.7))).alias("统筹支付金额2"),
    F.count(col("人员id")).alias("人次")
)
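The snippet above depends on a claims table js whose schema is not shown, so here is a minimal self-contained sketch of the same groupBy().agg() pattern with alias and F.when on made-up toy data; every name in it is illustrative only.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("agg-demo").getOrCreate()
orders = spark.createDataFrame(
    [("p1", "A", 300.0), ("p1", "B", 150.0), ("p2", "A", 500.0)],
    ["person_id", "hospital_type", "amount"])
summary = orders.groupBy("person_id").agg(
    F.sum("amount").alias("total_amount"),        # plain sum, renamed with alias
    F.count("person_id").alias("visit_count"),    # number of rows per group
    # CASE WHEN amount < 200 THEN 0 ELSE amount * 0.85 END, summed per group
    F.sum(F.when(col("amount") < 200, 0)
           .otherwise(col("amount") * 0.85)).alias("payable"))
summary.show()
# groupBy() with no columns aggregates over the whole DataFrame
orders.groupBy().agg(F.sum("amount").alias("grand_total")).show()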
5. Date and time format conversion in PySpark
to_timestamp converts a string column into a timestamp.
date_format converts a date or timestamp column into a string in the desired format.
# Convert to timestamp; the pattern must match the source string, e.g. yyyy-MM-dd HH:mm:ss for year-month-day hour:minute:second
js = js.withColumn("newdate", F.to_timestamp(col("jsrq"), "yyyy-MM-dd HH:mm:ss")).withColumn(
    "month", F.month(col("newdate"))  # extract the month from the parsed timestamp
)
# F.min, F.max, F.avg and F.sum can be applied in the same way
# Convert a timestamp to a different string format, e.g. year-month-day
js.withColumn("ymd", F.date_format(col("jsrq"), "yyyy-MM-dd")).select("ymd", "jsrq").show()
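For completeness, a minimal self-contained sketch of the conversions above on a made-up one-row DataFrame:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("date-demo").getOrCreate()
demo = spark.createDataFrame([("2021-03-05 12:30:00",)], ["jsrq"])
demo = demo.withColumn("ts", F.to_timestamp(col("jsrq"), "yyyy-MM-dd HH:mm:ss"))  # string -> timestamp
demo = demo.withColumn("ymd", F.date_format(col("ts"), "yyyy-MM-dd"))             # timestamp -> formatted string
demo = demo.withColumn("month", F.month(col("ts")))                               # extract the month number
demo.show(truncate=False)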
References
https://sparkbyexamples.com/pyspark/pyspark-date_format-convert-date-to-string-format/