1. Column type conversion
(1) Prepare the dataset
df = spark.read.option('sep','!')\
.option('ignoreLeadingWhiteSpace','true')\
.option('ignoreTrailingWhiteSpace','true')\
.csv('dbfs:/mnt/landing/uat/input/GPSE/GPSE_GPBILDS/GPSE_GPBILDS.ctl')
df = df.withColumnRenamed('_c0','src').withColumnRenamed('_c1','date').withColumnRenamed('_c2','cnt')
display(df)
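The three read options above can be previewed without a cluster. The following is a minimal plain-Python sketch, not Spark's actual CSV parser: it assumes a made-up input line (the real contents of the .ctl file are not shown here) and a hypothetical helper parse_line that splits on '!' and trims whitespace on both sides of each field, roughly what sep, ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace do together.

```python
def parse_line(line, sep="!"):
    """Approximate the read options: split on `sep`, then trim each field.

    strip() removes both leading and trailing whitespace, standing in for
    ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace combined.
    """
    return [field.strip() for field in line.split(sep)]


# Hypothetical sample line; the real file layout is an assumption.
raw = "GPSE !  2023-01-01 !  10"
fields = parse_line(raw)
print(fields)
```

Every field comes back as a string, including the count, which is why the cnt column needs an explicit cast in the next steps.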
(2) Conversion method 1: withColumn
Goal: convert the cnt column from string type to int type.
## Method 1: withColumn
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType,StringType,DateType
# Cast to integer type (the three forms below are equivalent)
df1_1 = df.withColumn("cnt",df.cnt.cast(IntegerType()))
df1_2 = df.withColumn("cnt",df.cnt.cast('int'))
df1_3 = df.withColumn("cnt",df.cnt.cast('integer'))
display(df1_1)
display(df1_2)
display(df1_3)
(3) Conversion method 2: select
## Method 2: select
df2 = df.select(col('src'),col('date'),col('cnt').cast('int').alias('cnt'))
display(df2)
(4) Conversion method 3: selectExpr
## Method 3: selectExpr
df3 = df.selectExpr("src","date","cast(cnt as int) cnt")
display(df3)
2. The withColumn function
(1) Prepare the dataset
# Create the dataset
data = [("Alice", 2), ("Bob", 5), ("Charlie", 7)]
df = spark.createDataFrame(data, ["Name", "Age"])
display(df)
(2) Add a column - lit
##1 withColumn with lit: add a new constant column to a DataFrame
from pyspark.sql.functions import lit
# Add a city column whose value is always '北京' (Beijing)
df1= df.withColumn('city',lit('北京'))
display(df1)
(3) Add a column - arithmetic operation
##2 withColumn: apply arithmetic to an existing column to derive a new column
from pyspark.sql.functions import col
# Column names are case-insensitive by default; "Age" matches the name used at creation
df2 = df.withColumn("new_age",col("Age") + 10)
display(df2)
(4) Add a column - regex match and replace
##3 withColumn: apply a regex match and replacement to an existing column
from pyspark.sql.functions import regexp_replace
# Replace every a, b, c, A, B or C in the original Name column with *
df3 = df.withColumn("new_name",regexp_replace("Name","[a-cA-C]","*"))
display(df3)
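To preview what the pattern does to the three names without running Spark, the same character class can be tried with Python's re module. This is a sketch under one assumption: Spark's regexp_replace uses Java regular expressions, and for a simple character class like [a-cA-C] Python's re behaves the same way. The variable names below are illustrative only.

```python
import re

# Same names as in the dataset above.
names = ["Alice", "Bob", "Charlie"]

# Same character class as the regexp_replace call: a, b, c in either case.
pattern = re.compile("[a-cA-C]")

# Replace every matching character with '*', one name at a time.
masked = [pattern.sub("*", name) for name in names]
print(masked)  # ['*li*e', '*o*', '*h*rlie']
```

Note that every occurrence is replaced, not just the first, which matches the way regexp_replace treats each value in the column.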