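Isolation Forest finds outliers by repeatedly partitioning the data at random until each data point is isolated. Points that can be isolated with fewer partitions are more likely to be outliers.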
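The partitioning idea can be illustrated with a minimal one-dimensional sketch. This is a simplification, not scikit-learn's implementation: the helper names `isolation_depth` and `average_depth` and the sample values are made up for illustration, and a real isolation tree also picks a random feature at each split.

```python
import random

def isolation_depth(points, target, rng, depth=0):
    """Count the random splits needed until `target` is alone in its partition."""
    if len(points) <= 1:
        return depth
    lo, hi = min(points), max(points)
    if lo == hi:  # only duplicates left, cannot split further
        return depth
    split = rng.uniform(lo, hi)
    # keep only the points that fall on the same side of the split as the target
    same_side = [p for p in points if (p < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1)

def average_depth(points, target, trials=200, seed=0):
    """Average the isolation depth over many random trees (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(trials)) / trials

values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
depth_extreme = average_depth(values, 100)  # value far away from the rest
depth_typical = average_depth(values, 14)   # value inside the cluster
print(depth_extreme, depth_typical)
```

The extreme value is usually separated by the very first split, while a value inside the cluster needs several splits, so its average depth is larger.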
1) Load the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from mpl_toolkits.mplot3d import Axes3D
pd.set_option('display.float_format', lambda x: '%.2f' % x)
covidtotals = pd.read_csv(r"D:\日常文档\笔记\泰坦尼克号数据\covidtotals.csv")
covidtotals.set_index("iso_code", inplace=True)
covidtotals.head()
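2) Create a standardized analysis DataFrame.
First drop all rows that contain missing values, then standardize the numeric columns (every column except location).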
analysisvars = ['location','total_cases_pm','total_deaths_pm','pop_density','median_age','gdp_per_capita']
standardizer = StandardScaler()
covidtotals.isnull().sum()
covidanalysis = covidtotals.loc[:, analysisvars].dropna()
covidanalysisstand = standardizer.fit_transform(covidanalysis.iloc[:, 1:])
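As a quick illustration of what `fit_transform` returns, the sketch below applies `StandardScaler` to a small made-up array (the values are hypothetical, not taken from the COVID data). Each column of the result has mean 0 and standard deviation 1, and the result is a NumPy array rather than a DataFrame.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two made-up columns on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)
col_means = scaled.mean(axis=0)  # close to 0 for every column
col_stds = scaled.std(axis=0)    # close to 1 for every column
print(type(scaled).__name__, col_means, col_stds)
```

Because the scaler only accepts numeric input, the text column location has to be excluded, which is what `iloc[:, 1:]` does in the step above.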
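3) Run the Isolation Forest model to detect outliers
Pass the standardized data to the fit method. The value counts show that 18 countries/regions are flagged as outliers (their anomaly value is -1). The contamination parameter is again set to 0.1.
Parameters of IsolationForest:
IsolationForest(
n_estimators = number of isolation trees (iTrees) to build; default 100
max_samples = number of samples drawn to train each tree; 'auto' uses min(256, n_samples)
contamination = expected proportion of outliers in the data; the default is 'auto' in current scikit-learn releases (older releases used 0.1)
max_features = number (or fraction) of features drawn to train each tree; default 1.0, i.e. all features
bootstrap = whether each tree's samples are drawn with replacement (True) or without replacement (False); default False
n_jobs = number of jobs to run in parallel for fit and predict
)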
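To see how contamination drives the number of flagged rows, here is a minimal sketch on synthetic data. The random data, the helper name `count_flagged`, and the fixed `random_state` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))  # 200 synthetic rows, 3 features

def count_flagged(contamination):
    """Fit a forest and count the rows labelled -1 (hypothetical helper)."""
    model = IsolationForest(n_estimators=100, contamination=contamination,
                            random_state=0)
    labels = model.fit_predict(X)
    return int((labels == -1).sum())

flagged_10 = count_flagged(0.1)   # roughly 10% of 200 rows
flagged_05 = count_flagged(0.05)  # roughly 5% of 200 rows
print(flagged_10, flagged_05)
```

With a numeric contamination value the model sets its threshold so that about that share of the training rows is labelled -1. The number of flagged rows therefore reflects the chosen parameter, not only the data.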
# run an isolation forest model to detect outliers
# 100 trees; contamination=.1 means about 10% of the samples are expected to be outliers
clf = IsolationForest(n_estimators=100, max_samples='auto',
    contamination=.1, max_features=1.0)
clf.fit(covidanalysisstand)
covidanalysis['anomaly'] = clf.predict(covidanalysisstand)  # anomaly flag: -1 = outlier, 1 = inlier
covidanalysis['scores'] = clf.decision_function(covidanalysisstand)  # score: the lower the value, the more likely the sample is an outlier
# clf.get_params() returns the model's parameters
covidanalysis.anomaly.value_counts()
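The anomaly flag and the score are two views of the same result: `predict` returns -1 exactly where `decision_function` is negative. The sketch below checks this on synthetic data with one planted extreme point. The data and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
# 100 ordinary points plus one planted extreme point in the last row
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=1).fit(X)
labels = model.predict(X)            # -1 = outlier, 1 = inlier
scores = model.decision_function(X)  # negative = outlier

# rebuild the labels from the sign of the score
labels_from_scores = np.where(scores < 0, -1, 1)
print(np.array_equal(labels, labels_from_scores), labels[-1], scores[-1])
```

Sorting by the score in ascending order therefore lists the most anomalous rows first.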
4) Create DataFrames of outliers and inliers.
List the 10 most anomalous outliers, sorted by anomaly score.
# view the outliers
inlier, outlier = covidanalysis.loc[covidanalysis.anomaly==1],covidanalysis.loc[covidanalysis.anomaly==-1]
outlier[['location','total_cases_pm','total_deaths_pm','median_age','gdp_per_capita','scores']].sort_values(['scores']).head(10)
5) Plot the outliers and inliers.
# plot the inliers and outliers
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.set_title('Isolation Forest Anomaly Detection')
ax.set_zlabel("Cases Per Million")
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Median Age")
ax.scatter3D(inlier.gdp_per_capita, inlier.median_age, inlier.total_cases_pm, label="inliers", c="blue")
ax.scatter3D(outlier.gdp_per_capita, outlier.median_age, outlier.total_cases_pm, label="outliers", c="red")
ax.legend()
plt.tight_layout()
plt.show()
Countries flagged as inliers and outliers, plotted by GDP per capita, median age, and cases per million people