数据竞赛(二)数据清洗

一、缺失值分析

#找出有缺失值的列: pv uv 在训练集和测试集中均有缺失
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_a.csv')
train.info()
test.info()

输出:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41440 entries, 0 to 41439
Data columns (total 51 columns):
ID                    41440 non-null int64
area                  41440 non-null float64
rentType              41440 non-null object
houseType             41440 non-null object
houseFloor            41440 non-null object
totalFloor            41440 non-null int64
houseToward           41440 non-null object
houseDecoration       41440 non-null object
communityName         41440 non-null object
city                  41440 non-null object
region                41440 non-null object
plate                 41440 non-null object
buildYear             41440 non-null object
saleSecHouseNum       41440 non-null int64
subwayStationNum      41440 non-null int64
busStationNum         41440 non-null int64
interSchoolNum        41440 non-null int64
schoolNum             41440 non-null int64
privateSchoolNum      41440 non-null int64
hospitalNum           41440 non-null int64
drugStoreNum          41440 non-null int64
gymNum                41440 non-null int64
bankNum               41440 non-null int64
shopNum               41440 non-null int64
parkNum               41440 non-null int64
mallNum               41440 non-null int64
superMarketNum        41440 non-null int64
totalTradeMoney       41440 non-null int64
totalTradeArea        41440 non-null float64
tradeMeanPrice        41440 non-null float64
tradeSecNum           41440 non-null int64
totalNewTradeMoney    41440 non-null int64
totalNewTradeArea     41440 non-null int64
tradeNewMeanPrice     41440 non-null float64
tradeNewNum           41440 non-null int64
remainNewNum          41440 non-null int64
supplyNewNum          41440 non-null int64
supplyLandNum         41440 non-null int64
supplyLandArea        41440 non-null float64
tradeLandNum          41440 non-null int64
tradeLandArea         41440 non-null float64
landTotalPrice        41440 non-null int64
landMeanPrice         41440 non-null float64
totalWorkers          41440 non-null int64
newWorkers            41440 non-null int64
residentPopulation    41440 non-null int64
pv                    41422 non-null float64
uv                    41422 non-null float64
lookNum               41440 non-null int64
tradeTime             41440 non-null object
tradeMoney            41440 non-null float64
dtypes: float64(10), int64(30), object(11)
memory usage: 16.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Data columns (total 50 columns):
ID                    2469 non-null int64
area                  2469 non-null float64
rentType              2469 non-null object
houseType             2469 non-null object
houseFloor            2469 non-null object
totalFloor            2469 non-null int64
houseToward           2469 non-null object
houseDecoration       2469 non-null object
communityName         2469 non-null object
city                  2469 non-null object
region                2469 non-null object
plate                 2469 non-null object
buildYear             2469 non-null object
saleSecHouseNum       2469 non-null int64
subwayStationNum      2469 non-null int64
busStationNum         2469 non-null int64
interSchoolNum        2469 non-null int64
schoolNum             2469 non-null int64
privateSchoolNum      2469 non-null int64
hospitalNum           2469 non-null int64
drugStoreNum          2469 non-null int64
gymNum                2469 non-null int64
bankNum               2469 non-null int64
shopNum               2469 non-null int64
parkNum               2469 non-null int64
mallNum               2469 non-null int64
superMarketNum        2469 non-null int64
totalTradeMoney       2469 non-null int64
totalTradeArea        2469 non-null float64
tradeMeanPrice        2469 non-null float64
tradeSecNum           2469 non-null int64
totalNewTradeMoney    2469 non-null int64
totalNewTradeArea     2469 non-null int64
tradeNewMeanPrice     2469 non-null float64
tradeNewNum           2469 non-null int64
remainNewNum          2469 non-null int64
supplyNewNum          2469 non-null int64
supplyLandNum         2469 non-null int64
supplyLandArea        2469 non-null float64
tradeLandNum          2469 non-null int64
tradeLandArea         2469 non-null float64
landTotalPrice        2469 non-null int64
landMeanPrice         2469 non-null float64
totalWorkers          2469 non-null int64
newWorkers            2469 non-null int64
residentPopulation    2469 non-null int64
pv                    2467 non-null float64
uv                    2467 non-null float64
lookNum               2469 non-null int64
tradeTime             2469 non-null object
dtypes: float64(9), int64(30), object(11)
memory usage: 964.5+ KB

 对缺失值采用中值填充:

    # 对缺失值进行填充
    pv_median = df_train["pv"].mean()
    df_train["pv"].fillna(pv_median, inplace=True)
    df_test["pv"].fillna(pv_median, inplace=True)
    
    uv_median = df_train["uv"].mean()
    df_train["uv"].fillna(uv_median, inplace=True)
    df_test["uv"].fillna(uv_median, inplace=True)
CV Score: 0.85709

二、 异常值处理(根据测试集数据的分布处理训练集的数据分布)
参考 https://blog.csdn.net/MiMicoa/article/details/86749999 

一种经典的计算数据集中疑似异常值的方法是Tukey method。该方法先计算出数据集的四分之一分位数(Q1)和四分之三分位数(Q3),从而计算出四分位数间距(IQR),然后将小于Q1 - 1.5IQR或者大于Q3 + 1.5IQR的数据点当做是疑似异常值。我们可以借助这种方法在DataFrame中检测异常值。代码如下:
 

import numpy as np
from collections import Counter


# Outlier detection
def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # quartile spacing (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR

        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)

    return multiple_outliers

三、深度清洗

分析每一个communityName、city、region、plate的数据分布并对其进行数据清洗。

不太会操作。。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值