How to fix the pyspark join AnalysisException
Because pyspark builds a DAG of transformations, deriving df2 from df1 and then joining it back to df1 raises an error whose message lists the conflicting resolved attributes layer by layer.
Stack Overflow has solutions for this problem:
https://stackoverflow.com/questions/45713290/how-to-resolve-the-analysisexception-resolved-attributes-in-spark
In my own testing, the most effective one is:
import pyspark.sql.functions as F
# Re-select the column under the same name; the fresh alias assigns it a new
# attribute id and breaks the ambiguous lineage
df = df.select(F.col("colA").alias("colA"))
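For context, a minimal sketch of the pattern that triggers the error and of the re-aliasing workaround (df1, colA, and cnt are made-up names):
import pyspark.sql.functions as F

# df2 is derived from df1, so its columns reuse df1's attribute ids
df2 = df1.groupBy("colA").agg(F.count("*").alias("cnt"))

# df1.join(df2, df1["colA"] == df2["colA"])  # may raise "Resolved attribute(s) ... missing from ..."

# Re-selecting the column under a fresh alias gives it a new attribute id
df2 = df2.select(F.col("colA").alias("colA"), "cnt")
joined = df1.join(df2, on="colA")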
Summing a list of columns into a new column in pyspark
Two approaches:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df = df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Or:
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
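A quick sanity check of both approaches on made-up data (assumes an active SparkSession named spark):
from functools import reduce
from operator import add
from pyspark.sql.functions import col, expr

df = spark.createDataFrame([(1, 2, None), (4, None, 6)], ['a', 'b', 'c'])

# reduce-based sum: 1+2+0 = 3, 4+0+6 = 10
df.na.fill(0).withColumn('result', reduce(add, [col(x) for x in df.columns])).show()

# expr-based sum gives the same result
df.na.fill(0).withColumn('sum_cols', expr('a+b+c')).show()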
Deleting an HDFS file from pyspark
# Reach the Hadoop FileSystem API through the py4j JVM gateway
sc = spark.sparkContext
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("hdfs://hadoop-master:9000"), sc._jsc.hadoopConfiguration())
# We can now use the Hadoop FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
if fs.exists(Path(target_file)):
    print('delete %s' % target_file)  # or your own logging helper
    fs.delete(Path(target_file), True)  # second argument: recursive delete
HDFS function reference:
hdfs apis
Array functions
array_intersect
array_union
array_except
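A quick demonstration of the three set-style array functions (available since Spark 2.4; assumes an active SparkSession named spark):
import pyspark.sql.functions as F

df = spark.createDataFrame([([1, 2, 3], [2, 3, 4])], ['x', 'y'])
df.select(
    F.array_intersect('x', 'y').alias('intersect'),  # [2, 3]
    F.array_union('x', 'y').alias('union'),          # [1, 2, 3, 4]
    F.array_except('x', 'y').alias('except'),        # [1]
).show()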
rangeBetween() in pyspark window functions
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("freq").orderBy('count').rangeBetween(1, 10)
fqitm.select(F.col('*'), F.min("count").over(w).alias('mincount'), F.max("count").over(w).alias('maxcount')).show()
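Unlike rowsBetween, rangeBetween frames the window by the value of the orderBy column: each row's frame holds the rows whose count lies between count+1 and count+10 within the same partition. A toy check (the data and an active SparkSession named spark are assumptions):
from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.createDataFrame([('a', 1), ('a', 3), ('a', 5), ('a', 12)], ['freq', 'count'])
w = Window.partitionBy('freq').orderBy('count').rangeBetween(1, 10)
# For count=1 the frame holds counts in [2, 11], i.e. 3 and 5 (12 is excluded)
df.select('*', F.min('count').over(w).alias('mincount'),
          F.max('count').over(w).alias('maxcount')).show()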
Interpolation in pyspark
Interpolation over plain groups:
https://zhuanlan.zhihu.com/p/143933094
Group-wise interpolation with a window:
https://blog.csdn.net/qq_38092934/article/details/97680140
Forward filling:
https://johnpaton.net/posts/forward-fill-spark/
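The trick from the last link amounts to taking last() with ignorenulls=True over a window that grows from the start of the partition; a minimal sketch, assuming columns named id, time, and value:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = (Window.partitionBy('id')
           .orderBy('time')
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# carry forward the last non-null value seen so far within each id, in time order
df = df.withColumn('value_filled', F.last('value', ignorenulls=True).over(w))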
Excluding rows by regex in pyspark
import pyspark.sql.functions as F

ex_diag = '宫颈上皮|高血压'  # e.g. 'cervical epithelium|hypertension'
tk07.where(~F.col('diag_name').rlike(ex_diag)).select('diag_name').show(10, False)