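I couldn't find a really good Chinese-language article on this topic, but I came across a very good one on detecting outliers with the Mahalanobis distance in Python, so I am reposting it here. The code in it runs as-is and is reasonably efficient.
The core code is: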
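import numpy as np
from scipy.stats import chi2

# indepvar is a 2-D NumPy array; every column except the last is used as a feature
# Covariance matrix
covariance = np.cov(indepvar[:,:-1] , rowvar=False)
# Covariance matrix power of -1
covariance_pm1 = np.linalg.matrix_power(covariance, -1)  # invert the covariance matrix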
# Center point
centerpoint = np.mean(indepvar[:,:-1] , axis=0)
# Distances between center point and other points
distances = []
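for i, val in enumerate(indepvar[:,:-1]):
    if i % 100000 == 0:
        print("Processed {0} rows so far.".format(i))
    p1 = val
    p2 = centerpoint
    # Squared Mahalanobis distance, which is the quantity compared with the chi-square cutoff
    distance = (p1-p2).T.dot(covariance_pm1).dot(p1-p2)
    distances.append(distance)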
distances = np.array(distances)
# Cutoff (threshold) value from Chi-Square Distribution for detecting outliers
cutoff = chi2.ppf(0.95, indepvar[:,:-1].shape[1])
# Index of outliers
outlierIndexes = np.where(distances > cutoff )
print('--- Index of Outliers ----')
print("Found {0} outlier data points, {1}% of the data.".format(len(outlierIndexes[0]) , round(len(outlierIndexes[0]) / len(indepvar)*100 , 2) ))
print(outlierIndexes)
# array([24, 35, 67, 81])
print('--- Observations found as outlier -----')
print(indepvar[ distances > cutoff , :])
# Remove the outliers found above
indepvar = indepvar[ distances <= cutoff , :]
print(len(indepvar))  # number of rows kept
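For large arrays the row-by-row loop above can be slow. The following is a minimal vectorized sketch of the same computation, not taken from the original article. The function name `mahalanobis_outliers`, its `quantile` parameter, and the synthetic data are assumptions made for illustration, and `np.linalg.inv` stands in for `matrix_power(..., -1)`.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outliers(features, quantile=0.95):
    """Hypothetical helper: return (squared distances, cutoff, outlier mask)."""
    center = features.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(features, rowvar=False))
    diff = features - center
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i, without a Python loop
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # Chi-square cutoff with one degree of freedom per feature column
    cutoff = chi2.ppf(quantile, features.shape[1])
    return d2, cutoff, d2 > cutoff


# Synthetic data (an assumption for this demo): 1000 Gaussian rows plus one planted outlier
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
data[0] = [10.0, 10.0, 10.0]

d2, cutoff, mask = mahalanobis_outliers(data)
print("Flagged {0} of {1} rows as outliers.".format(mask.sum(), len(data)))
cleaned = data[~mask]
```

To mirror the code above, one would pass `indepvar[:, :-1]` as `features` and then filter `indepvar` with the returned mask.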
Original article: