【缺失值填补】面板数据的缺失值补全方法

对于两端缺失数据使用前后填充,对于中间数据使用线性插值(可替换)

import numpy as np 
import pandas as pd 
from scipy.interpolate import interp1d
df = pd.read_excel('data.xlsx')
df['城市'].fillna(method = 'ffill',inplace = True)
df.drop('city',1,inplace=True)
df[df['城市'] == '晋中市']
城市yearpm2.5(达尔豪斯)lnpm2.5(达尔豪斯)第二产业增加值占GDP比重(%)人均地区生产总值(元)地区生产总值(万元)年末人口总数(万人)行政区域土地面积(全市)能源效率地方财政一般预算内支出科学支出年末实有城市道路面积(万平方米)行政区域土地面积(市辖区)
210晋中市200430.1173603.40510250.408300.02539148.0306.1216404.08.3377563050003503701327.0
211晋中市200538.5810123.65276040.4610772.03310171.0306.6316404.08.8766203944965323881311.0
212晋中市200645.3245753.81384942.3112491.03856794.0310.0016404.09.2150316640544974211311.0
213晋中市200736.5783863.59945841.9415157.04701976.0313.8816404.07.98984472106458665211311.0
214晋中市200834.3744823.53731438.7118219.05678066.0317.9016404.06.54799985679575695851311.0
215晋中市200938.0597343.63915733.4420335.06368106.0320.0016404.07.236680103087384955921311.0
216晋中市201049.9638073.91129936.4624275.07638366.0320.9616392.07.2218491205394101196161318.0
217晋中市201177.1098224.34523140.1927300.08902376.0323.3016392.08.4641271541989119396161318.0
218晋中市201257.9390104.05939140.6430093.09865596.0325.0016392.07.1952881787371140587691318.0
219晋中市201354.5319283.99878639.6930977.010222281.0330.5016392.06.76153221363331728710101318.0
220晋中市201445.4842013.81736539.6931434.010413000.0330.5016392.06.7615322168543191768501318.0
221晋中市201542.3429213.74580131.3931434.010461155.0329.6616392.0NaN2411647190369651318.0
222晋中市201635.9989733.58349030.0932646.010911041.0332.0016444.0NaN24716991513410951311.0
223晋中市201735.1702593.56020131.9538274.012849233.0332.0016392.07.34006227892312456901318.0
224晋中市201834.4520573.53956932.1442910.014476039.0334.0016444.07.55080032263682287114511311.0
df.isna().sum()
城市                    0
year                  0
pm2.5(达尔豪斯)           0
lnpm2.5(达尔豪斯)        15
第二产业增加值占GDP比重(%)     12
人均地区生产总值(元)          59
地区生产总值(万元)           34
年末人口总数(万人)            0
行政区域土地面积(全市)          0
能源效率                106
地方财政一般预算内支出           0
科学支出                  0
年末实有城市道路面积(万平方米)      0
行政区域土地面积(市辖区)         0
dtype: int64

def chazhi_(data,col,method = 'linear',thresdhold = 0.9):
    '''
    创建缺失值补全函数,针对每一个分组
    data:传入的每一个分组
    method:使用插值的方法,
    thresdhold:插值的门槛,即缺失值占总数据的比重,如果超过门槛者不进行补充
    '''
    chazhi_index = data[col][data[col].isna()].index
    if len(chazhi_index)/data.shape[0] >= 0.9:
        return data
    else:
        y = data[col][~(data[col].isna())].values
        x = data[col][~(data[col].isna())].index
        f = interp1d(x, y, kind = method)
        chazhi_index_2 = [c for c in chazhi_index if c<x.max() and c>x.min()]
        data.loc[chazhi_index_2,col] = f(chazhi_index_2)
        fill = data[col].fillna(method = 'ffill')
        bill = data[col].fillna(method = 'bfill')
        w = pd.DataFrame((fill,bill)).mean()
        data.iloc[:,0] = w
        return data
def chazhi(data,group_id):
    '''
    针对整个dataframe,
    data:dataframe文件
    group_id:样本的id
    '''
    na_col = data.columns[data.isna().sum() != 0]
    for col in na_col[1:]:
        test = data[[col,group_id]]
        data[col] = test.groupby(group_id).apply(chazhi_,col= col,method = 'linear')[col]
    return data[data.isna().sum(1)==0]
w = chazhi(df,'城市')
w.isna().sum()
城市                  0
year                0
pm2.5(达尔豪斯)         0
lnpm2.5(达尔豪斯)       0
第二产业增加值占GDP比重(%)    0
人均地区生产总值(元)         0
地区生产总值(万元)          0
年末人口总数(万人)          0
行政区域土地面积(全市)        0
能源效率                0
地方财政一般预算内支出         0
科学支出                0
年末实有城市道路面积(万平方米)    0
行政区域土地面积(市辖区)       0
dtype: int64
w[w['城市']== '晋中市']
城市yearpm2.5(达尔豪斯)lnpm2.5(达尔豪斯)第二产业增加值占GDP比重(%)人均地区生产总值(元)地区生产总值(万元)年末人口总数(万人)行政区域土地面积(全市)能源效率地方财政一般预算内支出科学支出年末实有城市道路面积(万平方米)行政区域土地面积(市辖区)
210晋中市200430.1173603.40510250.408300.02539148.0306.1216404.08.3377563050003503701327.0
211晋中市200538.5810123.65276040.4610772.03310171.0306.6316404.08.8766203944965323881311.0
212晋中市200645.3245753.81384942.3112491.03856794.0310.0016404.09.2150316640544974211311.0
213晋中市200736.5783863.59945841.9415157.04701976.0313.8816404.07.98984472106458665211311.0
214晋中市200834.3744823.53731438.7118219.05678066.0317.9016404.06.54799985679575695851311.0
215晋中市200938.0597343.63915733.4420335.06368106.0320.0016404.07.236680103087384955921311.0
216晋中市201049.9638073.91129936.4624275.07638366.0320.9616392.07.2218491205394101196161318.0
217晋中市201177.1098224.34523140.1927300.08902376.0323.3016392.08.4641271541989119396161318.0
218晋中市201257.9390104.05939140.6430093.09865596.0325.0016392.07.1952881787371140587691318.0
219晋中市201354.5319283.99878639.6930977.010222281.0330.5016392.06.76153221363331728710101318.0
220晋中市201445.4842013.81736539.6931434.010413000.0330.5016392.06.7615322168543191768501318.0
221晋中市201542.3429213.74580131.3931434.010461155.0329.6616392.06.9543752411647190369651318.0
222晋中市201635.9989733.58349030.0932646.010911041.0332.0016444.07.14721824716991513410951311.0
223晋中市201735.1702593.56020131.9538274.012849233.0332.0016392.07.34006227892312456901318.0
224晋中市201834.4520573.53956932.1442910.014476039.0334.0016444.07.55080032263682287114511311.0
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值