You can use the interquartile range (IQR) for simple outlier detection. From Wikipedia:

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.
In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data.
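It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.

# data is assumed to be a pandas Series of numeric values
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
lower = Q1 - 1.5 * IQR  # lower outlier bound
upper = Q3 + 1.5 * IQR  # upper outlier bound
for i in data.index:
    if (data[i] < lower) or (data[i] > upper):
        # outlier detected
        # do stuff ...
        pass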
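If a data point lies outside these bounds (below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR), it may be an outlier. So in your case, whether you compute the outliers per column or across all columns depends on your logic, on the data you have, and on how the columns relate to each other. Hope that helps.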
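For the per-column case, here is a minimal sketch of how the same rule could be applied to a whole DataFrame at once. The helper name `iqr_outlier_mask`, the parameter `k`, and the sample data are made up for illustration; only `DataFrame.quantile` and the comparison operators come from pandas.

```python
import pandas as pd

def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean DataFrame: True where a value lies outside its column's IQR bounds."""
    q1 = df.quantile(0.25)   # per-column first quartile
    q3 = df.quantile(0.75)   # per-column third quartile
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column labels
    return (df < lower) | (df > upper)

# Hypothetical sample data: column "a" contains one obvious extreme value
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [10, 11, 12, 13, 14],
})
mask = iqr_outlier_mask(df)
print(mask.sum())             # number of flagged values per column
print(df[mask.any(axis=1)])   # rows with at least one flagged value
```

Each column gets its own bounds here. If your columns are related and should be judged together, this per-column treatment is probably not what you want.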
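By the way, you can visualize the method above with a matplotlib boxplot. Just pass it the series of data you are running outlier detection on, and it will compute the quartiles and plot the result for you.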
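A minimal sketch of the boxplot approach, assuming a small made-up pandas Series. The output file name is arbitrary, and the `Agg` backend is selected only so the script also runs without a display. By default matplotlib draws the whiskers at 1.5 * IQR, so the points it marks as fliers correspond to the rule above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data with one obvious extreme value
data = pd.Series([1, 2, 3, 4, 100])

fig, ax = plt.subplots()
result = ax.boxplot(data)  # default whis=1.5, i.e. whiskers at 1.5 * IQR
ax.set_title("Boxplot with IQR-based fliers")

# The points drawn beyond the whiskers are available from the returned dict
fliers = result["fliers"][0].get_ydata()
print(fliers)

fig.savefig("boxplot.png")  # or plt.show() in an interactive session
```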
There are other methods as well, such as scikit-learn outlier detection.
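As a rough sketch of what that can look like, here is `LocalOutlierFactor`, one of several outlier detection estimators in scikit-learn. It is picked here only for illustration, and the sample data and `n_neighbors=2` are assumptions suited to this tiny example rather than recommended settings.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical sample data; scikit-learn expects a 2D array (n_samples, n_features)
X = np.array([1, 2, 3, 4, 100], dtype=float).reshape(-1, 1)

# n_neighbors is kept small only because the sample has five points
lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier

print(labels)
print(X[labels == -1].ravel())  # values flagged as outliers
```

Unlike the IQR rule, this compares the local density around each point with that of its neighbors, so it also works with more than one feature.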
This blog is also useful.