Python异常值检测——案例分析

目录

1.单个变量异常值检测

2. 双变量关系中的异常值检测

3. 使用线性回归来确定具有重大影响的数据点

4. 使用k最近邻算法找到离群值

5. 使用隔离森林算法查找异常


1.单个变量异常值检测

        如果某个值离平均值有多个标准偏差,并且远离近似标准正态分布的值,则我们更倾向于将将其视为异常值。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as scistat
data=pd.read_csv("F:\\python数据清洗\\04第四章\\covidtotals.csv")
data.columns
Index(['iso_code', 'lastdate', 'location', 'total_cases', 'total_deaths',
       'total_cases_pm', 'total_deaths_pm', 'population', 'pop_density',
       'median_age', 'gdp_per_capita', 'hosp_beds'],
      dtype='object')
data.set_index('iso_code',inplace=True)
totvars=['location', 'total_cases', 'total_deaths', 'total_cases_pm', 'total_deaths_pm']
demovars=['population', 'pop_density', 'median_age', 'gdp_per_capita', 'hosp_beds']

1.1 描述性统计 

data_b=data.loc[:,totvars]
data_b.describe()
total_casestotal_deathstotal_cases_pmtotal_deaths_pm
count2.100000e+02210.000000210.000000210.000000
mean2.921614e+041770.7142861355.35794355.659129
std1.363978e+058705.5658572625.277497144.785816
min0.000000e+000.0000000.0000000.000000
25%1.757500e+024.00000092.5415000.884750
50%1.242500e+0325.500000280.9285006.154000
75%1.011700e+04241.2500001801.39475031.777250
max1.790191e+06104383.00000019771.3480001237.551000

1.2 显示百分位数数据

data_b.quantile(np.arange(0.0,1.1,0.1))
total_casestotal_deathstotal_cases_pmtotal_deaths_pm
0.00.00.00.00000.0000
0.122.90.017.99860.0000
0.2105.22.056.29100.3752
0.3302.06.7115.43411.7183
0.4762.012.0213.97343.9566
0.51242.525.5280.92856.1540
0.62514.654.6543.956212.2452
0.76959.8137.21071.244225.9459
0.816847.2323.22206.298249.9658
0.946513.11616.93765.1363138.9045
1.01790191.0104383.019771.34801237.5510

1.3 偏度(skewness)&峰度(kurtosis) 

        偏度和峰度分别描述了分布的对称性和分布的尾部有多肥。如果变量以正态分布,则这两个指标都大大高于预期。

# 偏度skewness
data_b.skew()
----------------------------
total_cases        10.804275
total_deaths        8.929816
total_cases_pm      4.396091
total_deaths_pm     4.674417
dtype: float64


# 峰度kurtosis
data_b.kurtosis()
----------------------------
total_cases        134.979577
total_deaths        95.737841
total_cases_pm      25.242790
total_deaths_pm     27.238232
dtype: float64

1.4 测试数据的正态分布性

        使用scipy库中的Shapiro-Wilk测试,输出测试的p值。正态分布的null假设可以在低于0.05的任何p值和95%的置信水平上被拒绝。

def testnorm(var,df):
    stat,p=scistat.shapiro(df[var])
    return p

testnorm("total_cases",data_b) #3.753789128593843e-29
testnorm("total_deaths",data_b) #4.3427896631016077e-29
testnorm("total_cases_pm",data_b) #1.3972683006509067e-23
testnorm("total_deaths_pm",data_b) #1.361060423265974e-25

结果表明变量的分布在高显著水平下不是正态的。

# 总病例数的分位数图(qq图),直线显示的是正态分布应有的外观
sm.qqplot(data_b[["total_cases"]].sort_values(['total_cases']),line='s')
plt.title('QQ Plot of Total Cases')
plt.show()

# 每百万人口的总病例数的分位数图(qq图)
sm.qqplot(data_b[["total_cases_pm"]].sort_values(['total_cases_pm']),line='s')
plt.title('QQ Plot of Total Cases Per Million')
plt.show()

 

qqplot的结果也表明,total_cases(总病例数)和total_cases_pm(每百万人口的总病例数)的分布都与正态分布有显著差异(红色直线显示的是正态分布应有的外观)

1.5 显示离群值范围

        定义变量离群值的一种方法是按四分位数的距离计算。第三个四分位数(0.75)被称为上四分位数,第一个四分位数(0.25)被称为下四分位数,上四分位数和下四分位数之间的距离称为四分位距,也就是0.25~0.75分位数的范围。如果某个值离均值大于1.5倍,则认为该值是离群值。

def getoutliers():
    dfout=pd.DataFrame(columns=data.columns,data=None)
    for col in data_b.columns[1:]:
        thirdq,firstq=data_b[col].quantile(0.75),data_b[col].quantile(0.25) #计算四分位数
        range_q=1.5*(thirdq-firstq) #四分位距
        high,low=range_q+thirdq,firstq-range_q #离群阈值
        df=data.loc[(data[col]>high) | (data[col]<low)] #筛选出离群点
        df=df.assign(varname=col,threshlow=low,threshhigh=high) #添加新列来显示所检查的变量(varname)和离群范围
        dfout=pd.concat([dfout,df]) #合并表格
    return dfout

outliers=getoutliers()
outliers
lastdatelocationtotal_casestotal_deathstotal_cases_pmtotal_deaths_pmpopulationpop_densitymedian_agegdp_per_capitahosp_bedsvarnamethreshlowthreshhigh
BGD2020-06-01Bangladesh47153650286.3153.947164689383.01265.03627.53523.9840.80total_cases-14736.12525028.875
BLR2020-06-01Belarus425562354503.60424.8709449321.046.85840.317167.96711.00total_cases-14736.12525028.875
BEL2020-06-01Belgium5838194675037.354816.85211589616.0375.56441.842658.5765.64total_cases-14736.12525028.875
BRA2020-06-01Brazil514849293142422.142137.910212559409.025.04033.514103.4522.20total_cases-14736.12525028.875
CAN2020-06-01Canada9093672952409.401193.28537742157.04.03741.444017.5912.50total_cases-14736.12525028.875
.............................................
ESP2020-05-31Spain239429271275120.952580.19746754783.093.10545.534272.3602.97total_deaths_pm-45.45478.116
SWE2020-06-01Sweden3754243953717.298435.18010099270.024.71841.046949.2832.22total_deaths_pm-45.45478.116
CHE2020-06-01Switzerland3077916563556.367191.3438654618.0214.24343.157410.1664.53total_deaths_pm-45.45478.116
GBR2020-06-01United Kingdom274762384894047.403566.96567886004.0272.89840.839753.2442.54total_deaths_pm-45.45478.116
USA2020-06-01United States17901911043835408.389315.354331002647.035.60838.354225.4462.77total_deaths_pm-45.45478.116
outliers.varname.value_counts()
total_deaths       36
total_cases        33
total_deaths_pm    28
total_cases_pm     17
Name: varname, dtype: int64

2. 双变量关系中的异常值检测

        当某个值的分布与平均值没有明显偏离时,即使它不是极值,它也可能是异常值。当第二个变量具有某些值时,第一个变量的某些值就可能是异常值。(以下使用的还是之前的数据)

2.1 生成相关性矩阵

data.corr(method="pearson")
total_casestotal_deathstotal_cases_pmtotal_deaths_pmpopulationpop_densitymedian_agegdp_per_capitahosp_beds
total_cases1.0000000.9320790.1822460.2474640.270030-0.0287370.1626980.1868350.027601
total_deaths0.9320791.0000000.1798120.3948110.212619-0.0316450.2051280.1987290.019990
total_cases_pm0.1822460.1798121.0000000.586468-0.0560090.1100430.3138360.6512000.081449
total_deaths_pm0.2474640.3948110.5864681.000000-0.0139020.0302810.3895950.3836720.120488
population0.2700300.212619-0.056009-0.0139021.000000-0.0230840.024395-0.059555-0.038329
pop_density-0.028737-0.0316450.1100430.030281-0.0230841.0000000.1788780.3151990.314973
median_age0.1626980.2051280.3138360.3895950.0243950.1788781.0000000.6489050.662222
gdp_per_capita0.1868350.1987290.6512000.383672-0.0595550.3151990.6489051.0000000.296995
hosp_beds0.0276010.0199900.0814490.120488-0.0383290.3149730.6622220.2969951.000000

从上表可看出,总病例数与总死亡数之间的相关性非常高(0.93),而每百万人口总病例数与每百万人口总死亡数之间的相关性则较小(0.59),人均GDP与每百万人口总病例数之间也存在较强的相关关系(0.65)。

2.2 创建不同变量分位数的交叉表

        首先创建一个DataFrame,使用qcut创建一个将数据分解为分位数的列,显示总病例数分位数与总死亡数数分位数的交叉表。

data_b['total_cases_q']=pd.qcut(data_b['total_cases'],labels=['very low','low','medium','high','very high'],q=5,precision=0)
data_b['total_deaths_q']=pd.qcut(data_b['total_deaths'],labels=['very low','low','medium','high','very high'],q=5,precision=0)
pd.crosstab(data_b.total_cases_q,data_b.total_deaths_q)
total_deaths_qvery lowlowmediumhighvery high
total_cases_q
very low347100
low12191010
medium11315130
high0012246
very high002436
data.loc[(data_b.total_cases_q=="very high")&(data_b.total_deaths_q=="medium")].T
iso_codeQATSGP
lastdate2020-06-012020-06-01
locationQatarSingapore
total_cases5691034884
total_deaths3823
total_cases_pm19753.1465962.727
total_deaths_pm13.193.931
population2881060.05850343.0
pop_density227.3227915.731
median_age31.942.4
gdp_per_capita116935.685535.383
hosp_beds1.22.4

从表中可以看出,大多数国家都在对角线上或其附近,但有2个国家(卡塔尔,新加坡)的病例数很高,但死亡人数时中等的。

2.3 绘制两个变量间的散点图

        使用seaborn的regplot方法时,除生成散点图外,还可以生成线性回归线。

import seaborn as sns
ax=sns.regplot(x="total_cases",y="total_deaths",data=data)
ax.set(xlabel='Cases',ylabel='Deaths',title='Total Covid Cases and Deaths by Country')
plt.show()

# 检查回归线上方的意外值
data.loc[(data.total_cases<300000) & (data.total_deaths>20000)].T
iso_codeFRAITAESPGBR
lastdate2020-06-012020-06-012020-05-312020-06-01
locationFranceItalySpainUnited Kingdom
total_cases151753233019239429274762
total_deaths28802334152712738489
total_cases_pm2324.8793853.9855120.9524047.403
total_deaths_pm441.251552.663580.197566.965
population65273512.060461828.046754783.067886004.0
pop_density122.578205.85993.105272.898
median_age42.047.945.540.8
gdp_per_capita38605.67135220.08434272.3639753.244
hosp_beds5.983.182.972.54
# 检查回归线下方的意外值
data.loc[(data.total_cases>400000) & (data.total_deaths<10000)].T
iso_codeRUS
lastdate2020-06-01
locationRussia
total_cases405843
total_deaths4693
total_cases_pm2780.995
total_deaths_pm32.158
population145934460.0
pop_density8.823
median_age39.6
gdp_per_capita24765.954
hosp_beds8.05

从上表结果来看,有4个国家(法国,意大利,西班牙和英国)的病例数少于30w,死亡人数却超过2w;有1个国家(俄罗斯)的病例数超过30w,但死亡人数少于1w。

3. 使用线性回归来确定具有重大影响的数据点

        使用线性回归的优势在于他们对于变量的分布依赖性较小,并且能比单变量或双变量分析揭示更多的东西,识别出在其它方面不显著的异常值。

        使用statsmodels OLS方法拟合每百万人口总病例数的线性回归模型,然后确定那些对该模型具有较大影响的国家/地区。

3.1 描述性统计

xvars=['pop_density','median_age','gdp_per_capita']
data_x=data.loc[:,['total_cases_pm']+xvars].dropna()
data_x.describe()
total_cases_pmpop_densitymedian_agegdp_per_capita
count175.000000175.000000175.000000175.000000
mean1134.015709247.15186330.53714319008.385423
std2101.363772822.3989679.11775119673.386571
min0.0000001.98000015.100000661.240000
25%67.44800036.06600022.3000004458.202500
50%263.41300082.32800029.70000012951.839000
75%1357.506000207.96000038.70000027467.146000
max19753.1460007915.73100048.200000116935.600000

3.2 拟合线性回归模型 

def getlm(df):
    Y=df.total_cases_pm
    X=df[['pop_density','median_age','gdp_per_capita']]
    X=sm.add_constant(X)
    return sm.OLS(Y,X).fit()

lm=getlm(data_x)
lm.summary()
coefstd errtP>|t|[0.0250.975]
const944.4731426.7122.2130.028102.1721786.774
pop_density-0.20570.142-1.4470.150-0.4860.075
median_age-49.439816.013-3.0880.002-81.048-17.832
gdp_per_capita0.09210.00812.0150.0000.0770.107

3.3 确定对模型影响较大的国家/地区,并绘制影响图

        在线性回归中,库克距离描述了单个样本对整个回归模型的影响程度,库克距离越大,说明影响越大。当库克距离大于0.5应仔细检查。

influence=lm.get_influence().summary_frame()
influence.loc[influence.cooks_d>0.5,['cooks_d']]
iso_codecooks_d
HKG0.780662
QAT5.080180

data_x.loc[influence.cooks_d>0.5]
total_cases_pmpop_densitymedian_agegdp_per_capita
iso_code
HKG0.0007039.71444.856054.92
QAT19753.146227.32231.9116935.60
fig,ax=plt.subplots(figsize=(12,8))
sm.graphics.influence_plot(lm,ax=ax,criterion="cooks")
plt.show()

3.4 删除离群点,再进行回归拟合

data_x_new=data_x.loc[influence.cooks_d<0.5]
lm=getlm(data_x_new)
lm.summary()
coefstd errtP>|t|[0.0250.975]
const44.0854349.9240.1260.900-646.700734.870
pop_density0.24230.1451.6660.098-0.0450.529
median_age-2.516513.526-0.1860.853-29.21724.184
gdp_per_capita0.05570.0077.8750.0000.0420.070

删除之后,常数和median_age的估计值不再有效。回归输出的P>|t|值表明该系数是否与0有着显著不同。在第一次回归中,median_age和gdp_per_capita的系数在99%的置信水平上是显著的;即P>|t|小于0.01。而在删除两个离群值的情况下,只有gdp_per_capita是显著的。因此在处理此问题时,应分析诸如此类的离群值是否会增加重要信息或使模型失真。

4. 使用k最近邻算法找到离群值

        在使用线性回归中,我们使用了每百万人口病例数作为因变量,而在没有标签数据的情况下,即没有目标变量或因变量时,无监督的机器学习工具也可以识别与其它观察结果不同的观察结果。

from pyod.models.knn import KNN
from sklearn.preprocessing import StandardScaler
# 创建分析列的标准化DataFrame
standardizer=StandardScaler()
var1=['location','total_cases_pm', 'total_deaths_pm', 'pop_density','median_age', 'gdp_per_capita']
data_var1=data.loc[:,var1].dropna()
data_var1_stand=standardizer.fit_transform(data_var1.iloc[:,1:])
# KNN模型生成异常分数
clf_name='KNN'
clf=KNN(contamination=0.1) #contamination:异常样本的百分比,在[0,0.5]之间的float型,默认为0.5
clf.fit(data_var1_stand)
y_pred=clf.labels_ # 返回训练数据上的分类标签 (0: 正常值, 1: 异常值)
y_scores=clf.decision_scores_  # 返回训练数据上的异常值 (分值越大越异常)
pred=pd.DataFrame(zip(y_pred,y_scores),columns=['outlier','scores'],index=data_var1.index)
pred.outlier.value_counts()
0    157
1     18
Name: outlier, dtype: int64

显示离群的数据

data_var1.join(pred).loc[pred.outlier==1,['location','total_cases_pm','total_deaths_pm','scores']].sort_values(['scores'],ascending=False)
locationtotal_cases_pmtotal_deaths_pmscores
iso_code
SGPSingapore5962.7273.9319.228543
HKGHong Kong0.0000.0007.770444
QATQatar19753.14613.1902.643709
BHRBahrain6698.46811.1662.049642
LUXLuxembourg6418.776175.7261.525783
MDVMaldives3280.0419.2501.395496
MLTMalta1395.12015.8541.353541
BGDBangladesh286.3153.9471.117904
BRNBrunei322.2984.5721.108011
SAUSaudi Arabia2449.05314.4480.902926
AREUnited Arab Emirates3493.99426.6930.822334
KWTKuwait6332.42049.6420.803449
BRBBarbados320.14424.3590.740434
IRLIreland5060.962334.5620.721934
PSEPalestine122.9070.9800.717875
NORNorway1551.48943.5320.712298
GNQEquatorial Guinea930.8728.5530.695340
OMNOman2239.6419.2040.670568

从上表可以看出,新加坡、卡塔尔的决策分数显著高于其它国家/地区,与前面的分析结果一致。

5. 使用隔离森林算法查找异常

        隔离森林(isolation forest,也称孤立森林)找到离群点的方法是,对数据进行连续分区,直至某个数据点被隔离。如果某个数据点仅需要很少的分区即可隔离,那么它将获得较高的异常分数。

from sklearn.ensemble import IsolationForest
from mpl_toolkits.mplot3d import Axes3D
var2=['location','total_cases_pm', 'total_deaths_pm', 'pop_density','median_age', 'gdp_per_capita']
standardizer=StandardScaler()
data_var2=data.loc[:,var2].dropna()
data_var2_stand=standardizer.fit_transform(data_var2.iloc[:,1:])
clf=IsolationForest(n_estimators=100,max_samples='auto',contamination=0.1,max_features=1.0)
clf.fit(data_var2_stand)
data_var2['anomaly']=clf.predict(data_var2_stand) #查看隔离森林得出的结果,输出为一维矩阵,其中值为1代表为正常数据点,值为-1代表为异常点
data_var2['scores']=clf.decision_function(data_var2_stand)  #数据集中样本的异常分数值,注意其中值是以0为阈值,小于0视为异常,大于0视为正常,
data_var2.anomaly.value_counts()
 1    157
-1     18
Name: anomaly, dtype: int64
# 创建离群值和内围值的DataFrame
inlier,outlier=data_var2.loc[data_var2.anomaly==1],data_var2.loc[data_var2.anomaly==-1]
outlier[['location','total_cases_pm','total_deaths_pm','anomaly','scores']].sort_values(['scores']).head(10)
locationtotal_cases_pmtotal_deaths_pmanomalyscores
iso_code
QATQatar19753.14613.190-1-0.257773
SGPSingapore5962.7273.931-1-0.199578
HKGHong Kong0.0000.000-1-0.166863
BELBelgium5037.354816.852-1-0.116571
BHRBahrain6698.46811.166-1-0.110567
LUXLuxembourg6418.776175.726-1-0.097430
ITAItaly3853.985552.663-1-0.066504
ESPSpain5120.952580.197-1-0.060739
IRLIreland5060.962334.562-1-0.040109
MDVMaldives3280.0419.250-1-0.031414
ax=plt.axes(projection='3d')
ax.set_title('Isolation Forest Anomaly Detection')
ax.set_zlabel("Cases Per Million")
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Median Age")
ax.scatter3D(inlier.gdp_per_capita,inlier.median_age,inlier.total_cases_pm,label="inlier",c='blue')
ax.scatter3D(outlier.gdp_per_capita,outlier.median_age,outlier.total_cases_pm,label="outlier",c='red')
ax.legend()
plt.tight_layout()
plt.show()

 隔离森林的结果与k最近邻的结果非常相似,卡塔尔、新加坡的异常得分较高(准确地说是最小负分),因此,当进行多变量分析时,可以考虑删除离群值再进行分析。

  • 0
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值