How to fix the pyspark join AnalysisException
Because pyspark builds a DAG of transformations, deriving df2 from df1 and then joining it back to df1 raises an error whose message lists the conflicting resolved attributes layer by layer.
Stack Overflow has solutions for this problem:
https://stackoverflow.com/questions/45713290/how-to-resolve-the-analysisexception-resolved-attributes-in-spark
In my own testing, the most effective one is:
import pyspark.sql.functions as F
# Re-select the column under the same name; the fresh alias assigns it a new
# attribute id and breaks the ambiguous lineage
df = df.select(F.col("colA").alias("colA"))
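For context, a minimal sketch of the pattern that triggers the error and of the re-aliasing workaround (df1, colA, and cnt are made-up names):
import pyspark.sql.functions as F

# df2 is derived from df1, so its columns reuse df1's attribute ids
df2 = df1.groupBy("colA").agg(F.count("*").alias("cnt"))

# df1.join(df2, df1["colA"] == df2["colA"])  # may raise "Resolved attribute(s) ... missing from ..."

# Re-selecting the column under a fresh alias gives it a new attribute id
df2 = df2.select(F.col("colA").alias("colA"), "cnt")
joined = df1.join(df2, on="colA")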
Summing a list of columns into a new column in pyspark
Two approaches:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df = df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Or:
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
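A quick sanity check of both approaches on made-up data (assumes an active SparkSession named spark):
from functools import reduce
from operator import add
from pyspark.sql.functions import col, expr

df = spark.createDataFrame([(1, 2, None), (4, None, 6)], ['a', 'b', 'c'])

# reduce-based sum: 1+2+0 = 3, 4+0+6 = 10
df.na.fill(0).withColumn('result', reduce(add, [col(x) for x in df.columns])).show()

# expr-based sum gives the same result
df.na.fill(0).withColumn('sum_cols', expr('a+b+c')).show()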
Deleting an HDFS file from pyspark
# Reach the Hadoop FileSystem API through the py4j JVM gateway
sc = spark.sparkContext
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("hdfs://hadoop-master:9000"), sc._jsc.hadoopConfiguration())
# We can now use the Hadoop FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
if fs.exists(Path(target_file)):
    print('delete %s' % target_file)  # or your own logging helper
    fs.delete(Path(target_file), True)  # second argument: recursive delete
HDFS function reference:
hdfs apis
Array functions
array_intersect
array_union
array_except
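A quick demonstration of the three set-style array functions (available since Spark 2.4; assumes an active SparkSession named spark):
import pyspark.sql.functions as F

df = spark.createDataFrame([([1, 2, 3], [2, 3, 4])], ['x', 'y'])
df.select(
    F.array_intersect('x', 'y').alias('intersect'),  # [2, 3]
    F.array_union('x', 'y').alias('union'),          # [1, 2, 3, 4]
    F.array_except('x', 'y').alias('except'),        # [1]
).show()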
rangeBetween() in pyspark window functions
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("freq").orderBy('count').rangeBetween(1, 10)
fqitm.select(F.col('*'), F.min("count").over(w).alias('mincount'), F.max("count").over(w).alias('maxcount')).show()
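Unlike rowsBetween, rangeBetween frames the window by the value of the orderBy column: each row's frame holds the rows whose count lies between count+1 and count+10 within the same partition. A toy check (the data and an active SparkSession named spark are assumptions):
from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([('a', 1), ('a', 3), ('a', 5), ('a', 12)], ['freq', 'count'])
w = Window.partitionBy('freq').orderBy('count').rangeBetween(1, 10)
# For count=1 the frame holds counts in [2, 11], i.e. 3 and 5 (12 is excluded)
df.select('*', F.min('count').over(w).alias('mincount'),
          F.max('count').over(w).alias('maxcount')).show()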
Interpolation in pyspark
Interpolation over plain groups:
https://zhuanlan.zhihu.com/p/143933094
Group-wise interpolation with a window:
https://blog.csdn.net/qq_38092934/article/details/97680140
Forward filling:
https://johnpaton.net/posts/forward-fill-spark/
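The trick from the last link amounts to taking last() with ignorenulls=True over a window that grows from the start of the partition; a minimal sketch, assuming columns named id, time, and value:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = (Window.partitionBy('id')
           .orderBy('time')
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# carry forward the last non-null value seen so far within each id, in time order
df = df.withColumn('value_filled', F.last('value', ignorenulls=True).over(w))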
Excluding rows by regex in pyspark
import pyspark.sql.functions as F

ex_diag = '宫颈上皮|高血压'  # e.g. 'cervical epithelium|hypertension'
tk07.where(~F.col('diag_name').rlike(ex_diag)).select('diag_name').show(10, False)