1. Check a DataFrame column's data type
from pyspark.sql.types import ArrayType
isinstance(df.schema["col_name"].dataType, ArrayType)
2. Expand a list-valued DataFrame column into multiple rows
For example: ['qq', 'ww', 'ee'] becomes three rows: qq, ww, ee
import pyspark.sql.functions as F
exploded_df = df.select(F.explode("orig_col").alias("exploded_data"))
See: https://stackoverflow.com/questions/48822381/pyspark-convert-column-of-lists-to-rows
3. Deduplicate DataFrame rows
df = df.dropDuplicates(subset=['col1', 'col2'])
4. Split a string column into multiple columns
For example: the original value 'aa_bb' becomes two columns, 'aa' and 'bb'
from pyspark.sql.functions import split
split_col = split(exploded_df["exploded_data"], '_')
exploded_df = exploded_df.select("exploded_data", split_col.getItem(0).alias('col1_name'),
                                 split_col.getItem(1).alias('col2_name'))
See: https://stackoverflow.com/questions/45789489/how-to-split-a-list-to-multiple-columns-in-pyspark
5. Rename a DataFrame column
df = df.withColumnRenamed("orig_name", "new_name")
More on withColumn and related column methods:
https://sparkbyexamples.com/pyspark/pyspark-withcolumn/
6. Expand a list column into multiple columns
from pyspark.sql import Row
from pyspark.sql.functions import col
df = spark.createDataFrame([Row(index=1, finalArray=[1.1, 2.3, 7.5], c=4),
                            Row(index=2, finalArray=[9.6, 4.1, 5.4], c=4)])
# collect all the column names as a list
dlist = df.columns
# append one new column per array element
df.select(dlist + [(col("finalArray")[x]).alias("Value" + str(x + 1)) for x in range(0, 3)]).show()
See: https://stackoverflow.com/questions/45789489/how-to-split-a-list-to-multiple-columns-in-pyspark