Filling missing values with the mean
Straight to the code:
import pyspark.sql.functions as fn
# fill with the column means
mean = df.agg(*(fn.mean(c).alias(c) for c in concate_col))
meaninfo = mean.first().asDict()
meandf = df.fillna(meaninfo)
To keep online and offline behavior consistent, you can persist the per-column means; at online prediction time, simply read back the means to fill with. This both reduces the chance of errors and speeds up prediction.
import json
meaninfo = mean.first().asDict()
# persist as a JSON file
with open(ModelParaspath + "/meaninfo.txt", "w") as f:
    f.write(json.dumps(meaninfo))
My ModelParaspath is an HDFS path; when running locally, a local path works just as well.
When you need it online, parse the JSON file back:
import os
import json
import pyspark.sql.functions as fn

if os.path.exists(ModelParaspath + "/meaninfo.txt"):  # the JSON file exists
    with open(ModelParaspath + "/meaninfo.txt", "r") as f:
        meaninfo = json.loads(f.read())
    meandf = meandf.fillna(meaninfo)
else:  # the JSON file is missing: compute the means over non-null values and fill
    # fill nulls in the continuous features with their means  # by 20191212
    print('filling with means...')
    mean = meandf.agg(*(fn.mean(c).alias(c) for c in concate_col))
    meandf = meandf.fillna(mean.first().asDict())
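The persist-and-reload round trip can be exercised locally with nothing but the standard library; the `meaninfo` dict and the temporary directory below are stand-ins for `mean.first().asDict()` and for the real ModelParaspath:

```python
import json
import os
import tempfile

# Stand-ins for mean.first().asDict() and the model-parameters directory.
meaninfo = {"a": 2.0, "b": 20.0}
ModelParaspath = tempfile.mkdtemp()

# Persist the means as JSON, exactly as in the training job.
with open(ModelParaspath + "/meaninfo.txt", "w") as f:
    f.write(json.dumps(meaninfo))

# Reload them as the online job would.
with open(ModelParaspath + "/meaninfo.txt", "r") as f:
    loaded = json.loads(f.read())
```

JSON keeps the keys as strings and the values as floats, so the reloaded dict can be handed to fillna unchanged.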
Filling missing values in multiple columns
oktz = ['templateid', 'order_num_l180_level', 'gvm_last180_level', 'maintenance_emergency_degree', 'activity_score',
'isactive', 'dxsenthour']
fill0 = ['isblacklist', 'istduser', 'isltuser', 'iscpuser', 'ismruser', 'isbyuser', 'isotheruser', 'dx_sendtime_pref']
fillwz = ['gradename', 'gender', 'ordertime_perf', 'vehiclebrand', 'fueltype', 'jointventure', 'vehiclelevel',
'week90day', 'life_cycle', 'long_city_type', 'active7day_platform', 'active30day_platform', 'mdtype_pref',
'active30day_hour', 'senthour']
df = df.fillna(0, subset=fill0)
df = df.fillna('unknown', subset=oktz + fillwz)
In short, to fill multiple columns at once, put the column names in a list and pass it as the subset argument.
One pitfall, though: in df = df.fillna(0, subset=fill0), if some of the columns in fill0 are string-typed, those columns simply will not be filled; a numeric fill value only applies to numeric columns. To fill nulls in string-typed columns you need a string value:
df = df.fillna('0', subset=fill0)