append() method: a summary and implementation of three statistical methods for filling missing data values


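In my previous article I wrote up some commonly used data-science processing methods that I have accumulated in day-to-day work, and the first major part of it dealt with missing data values. In practice, missing values are unavoidable and arise for many reasons, but what matters more is how to fill in the missing data so that its impact is reduced as much as possible.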

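Based on my own project experience, I have summarized three statistical methods for filling missing values: moving-average filling, moving weighted filling, and Kalman-filter filling. Introductions to the underlying principles are easy to find online, so I will not repeat them here. What follows is the concrete code. Feel free to use it if you are interested, and technical discussion is welcome so that we can improve together.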

The implementation is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''
__Author__: 沂水寒城
Purpose: missing-data filling module
Includes: moving-average filling, moving weighted filling, and Kalman-filter filling.
Missing values are encoded as 0 in the input series.
'''

from pykalman import KalmanFilter


def sliceWindowAverage(one_all_list, num=10, flag=False):
    '''
    Moving-average filling: slide a window over the time series and fill each
    missing value with the mean of the other values in the window.
    '''
    nozero_list = [one for one in one_all_list if one > 0]
    # Fallback means used for gaps too close to the start or end of the series
    before_avg, last_avg = sum(nozero_list[:num]) / num, sum(nozero_list[-1 * num:]) / num
    res_list = []
    for i in range(len(one_all_list)):
        if one_all_list[i] != 0:
            res_list.append(one_all_list[i])
        else:
            tmp = int(num / 2)
            if i <= tmp:
                if flag:
                    res_list.append(float(before_avg))
                else:
                    res_list.append(int(before_avg))
            elif i >= len(one_all_list) - tmp:
                if flag:
                    res_list.append(float(last_avg))
                else:
                    res_list.append(int(last_avg))
            else:
                # tmp neighbours on each side, skipping position i
                one_index_list = list(range(i - tmp, i)) + list(range(i + 1, i + tmp + 1))
                one_value = [one_all_list[h] for h in one_index_list]
                one_value = [one for one in one_value if one > 0]
                # Note: if every neighbour is missing, nothing is appended
                if len(one_value):
                    if flag:
                        res_list.append(float(sum(one_value) / len(one_value)))
                    else:
                        res_list.append(int(sum(one_value) / len(one_value)))
    return res_list


def weightGenerate(weight_list):
    '''
    Normalize the time-step based weights so that they sum to 1.
    '''
    total = sum(weight_list)
    return [one / total for one in weight_list]


def sliceWindowWeight(one_all_list, num=7, flag=False):
    '''
    Moving weighted filling: slide a window over the time series and fill each
    missing value with the weighted mean of the other values in the window.
    '''
    nozero_list = [one for one in one_all_list if one > 0]
    before_avg, last_avg = sum(nozero_list[:num]) / num, sum(nozero_list[-1 * num:]) / num
    res_list = []
    for i in range(len(one_all_list)):
        if one_all_list[i] > 0:
            res_list.append(one_all_list[i])
        else:
            tmp = int(num / 2) + 1
            if i <= tmp:
                if flag:
                    res_list.append(float(before_avg))
                else:
                    res_list.append(int(before_avg))
            elif i >= len(one_all_list) - tmp:
                if flag:
                    res_list.append(float(last_avg))
                else:
                    res_list.append(int(last_avg))
            else:
                # Skip position i (list() is required around range() on Python 3)
                one_index_list = list(range(i - tmp, i)) + list(range(i + 1, i + tmp))
                one_value = [one_all_list[h] for h in one_index_list]
                # Weight of each neighbour is the inverse of its distance to i
                weight_list = [abs(1 / (B - i)) for B in range(i - tmp, i)] + \
                              [abs(1 / (L - i)) for L in range(i + 1, i + tmp)]
                one_w = weightGenerate(weight_list)
                one_weight_value = [one_value[j] * one_w[j] for j in range(len(one_w))]
                res_list.append(int(sum(one_weight_value)))
    return res_list


def kaerman(one_all_list):
    '''
    Kalman-filter filling: recover the true system state from the observed
    values and the estimated values.
    '''
    observations = measurements = one_all_list
    kf = KalmanFilter(initial_state_mean=observations[0])
    kf = kf.em(observations, n_iter=5, em_vars='all')
    # smooth() returns (state means, state covariances); keep the first state dimension
    measurements_predicted = (kf.smooth(measurements)[0])[:, 0]
    res_list = measurements_predicted.tolist()
    return [int(one) for one in res_list]


if __name__ == '__main__':
    print('missingDataHandle!!!')
    data = [19, 18, 17, 17, 17, 17, 16, 16, 17, 17, 0, 18, 20, 22, 23, 24, 24, 24, 24, 26, 0, 28, 26, 25, 24, 25, 26, 26, 30, 33, 33, 32, 30, 32, 35, 33, 35, 39, 42, 43, 40, 35, 32, 35, 38, 36, 38, 38, 36, 33, 35, 33, 35, 36, 35, 33, 33, 33, 33, 30, 30, 32, 36, 38, 39, 40, 43, 35, 30, 32, 35, 38, 36, 35, 35, 39, 42, 45, 45, 45, 46, 48, 52, 60, 65, 68, 73, 74, 70, 68, 66, 65, 69, 73, 74, 75, 85, 95, 113, 122]
    D1 = sliceWindowAverage(data, num=7, flag=False)
    D2 = sliceWindowWeight(data, num=7, flag=False)
    D3 = kaerman(data)
    print('D1:', D1)
    print('D2:', D2)
    print('D3:', D3)
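To make the two window rules concrete, here is a small worked sketch for the missing point at index 10 of the sample data with num=7. It only re-traces the arithmetic of the functions above on the first 15 sample values. The variable names (`half`, `avg_idx`, `w_idx`, `raw`) are my own and do not appear in the original code. Note that the weighted window, as written in the code, takes four points before the gap and three after it.

```python
# First 15 values of the sample series; index 10 is the missing point (0).
data = [19, 18, 17, 17, 17, 17, 16, 16, 17, 17, 0, 18, 20, 22, 23]
i = 10

# Plain moving average: num=7 gives a half-window of 3 on each side.
half = 7 // 2
avg_idx = list(range(i - half, i)) + list(range(i + 1, i + half + 1))
avg_vals = [data[h] for h in avg_idx if data[h] > 0]
plain = sum(avg_vals) / len(avg_vals)

# Weighted version: num=7 gives tmp=4, so indexes i-4..i-1 and i+1..i+3.
tmp = 7 // 2 + 1
w_idx = list(range(i - tmp, i)) + list(range(i + 1, i + tmp))
raw = [1 / abs(h - i) for h in w_idx]          # inverse-distance weights
weights = [w / sum(raw) for w in raw]          # normalized to sum to 1
weighted = sum(data[h] * w for h, w in zip(w_idx, weights))

print(int(plain), int(weighted))  # 18 17
```

Both results are truncated with int(), so the weighted value of roughly 17.9 becomes 17 rather than 18. Rounding instead of truncating would be a reasonable alternative if that bias matters for your data.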

A simple test run produces the following output:

missingDataHandle!!!
D1: [19, 18, 17, 17, 17, 17, 16, 16, 17, 17, 18, 18, 20, 22, 23, 24, 24, 24, 24, 26, 25, 28, 26, 25, 24, 25, 26, 26, 30, 33, 33, 32, 30, 32, 35, 33, 35, 39, 42, 43, 40, 35, 32, 35, 38, 36, 38, 38, 36, 33, 35, 33, 35, 36, 35, 33, 33, 33, 33, 30, 30, 32, 36, 38, 39, 40, 43, 35, 30, 32, 35, 38, 36, 35, 35, 39, 42, 45, 45, 45, 46, 48, 52, 60, 65, 68, 73, 74, 70, 68, 66, 65, 69, 73, 74, 75, 85, 95, 113, 122]
D2: [19, 18, 17, 17, 17, 17, 16, 16, 17, 17, 17, 18, 20, 22, 23, 24, 24, 24, 24, 26, 25, 28, 26, 25, 24, 25, 26, 26, 30, 33, 33, 32, 30, 32, 35, 33, 35, 39, 42, 43, 40, 35, 32, 35, 38, 36, 38, 38, 36, 33, 35, 33, 35, 36, 35, 33, 33, 33, 33, 30, 30, 32, 36, 38, 39, 40, 43, 35, 30, 32, 35, 38, 36, 35, 35, 39, 42, 45, 45, 45, 46, 48, 52, 60, 65, 68, 73, 74, 70, 68, 66, 65, 69, 73, 74, 75, 85, 95, 113, 122]
D3: [18, 18, 17, 17, 17, 16, 16, 16, 16, 14, 9, 15, 19, 21, 22, 23, 23, 23, 23, 21, 13, 22, 24, 24, 24, 25, 26, 27, 29, 31, 32, 31, 31, 32, 33, 34, 35, 38, 40, 41, 39, 35, 34, 35, 36, 36, 37, 37, 35, 34, 34, 33, 34, 35, 34, 33, 33, 32, 32, 31, 31, 32, 35, 37, 38, 39, 39, 35, 32, 33, 34, 36, 36, 35, 36, 39, 41, 43, 44, 45, 46, 49, 53, 59, 63, 67, 70, 71, 69, 68, 66, 66, 69, 72, 74, 78, 86, 96, 109, 118]
[Finished in 0.9s]
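In the D3 output, indexes 10 and 20 come out as 9 and 13, noticeably below their neighbours. The likely reason is that kaerman passes the 0 placeholders to the filter as ordinary observations. One possible way around this is to skip the measurement update wherever a value is missing. The sketch below illustrates that idea with a hand-written scalar random-walk Kalman filter in plain Python. It is not the pykalman smoother used above: it is a forward-only filter, and the function name `kalman_fill` and the noise parameters `q` and `r` are assumptions chosen for illustration.

```python
def kalman_fill(values, missing=0, q=1.0, r=1.0):
    """Fill entries equal to `missing` with a 1-D random-walk Kalman prediction.

    q (process noise) and r (measurement noise) are illustrative defaults,
    not values taken from the original article.
    """
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("series contains no observed values")
    x = float(observed[0])  # state estimate, started at the first observation
    p = 1.0                 # variance of the state estimate
    filled = []
    for v in values:
        p += q                       # predict: state unchanged, uncertainty grows
        if v == missing:
            filled.append(x)         # no measurement: use the prediction
        else:
            k = p / (p + r)          # Kalman gain
            x += k * (v - x)         # update the estimate with the measurement
            p *= (1 - k)
            filled.append(float(v))  # observed points are kept unchanged
    return filled


if __name__ == '__main__':
    sample = [16, 16, 17, 17, 0, 18, 20, 22]
    print(kalman_fill(sample))
```

Unlike kaerman, this variant leaves observed points untouched and only replaces the gaps, which keeps it comparable with the two window methods. If you prefer to stay with pykalman, its documentation describes handling missing observations through masked arrays, which may be worth checking before writing your own filter.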

Recorded here for future reference.
