Data Preprocessing: Handling Outliers
Table of Contents
- Definition of Outliers
- Different Types of Outliers
- Ways to Deal with Outliers
- Optional Content about SD & Variance
- Standard Deviation Method
- Interquartile Range Method (IQR)
- Automatic Outlier Detection
Definition of Outliers
An outlier is an unlikely observation in a dataset: it is rare, distinct, or does not fit with the rest of the data in some way.
Different Types of Outliers
Outliers can have many causes, such as:
- Measurement or manual error
- Data generation flaw
- Data corruption
- True outlier observation (e.g. Sachin Tendulkar or Virat Kohli in cricket)
There is no precise way to identify an outlier. A domain expert needs to interpret the raw data and decide whether a given value is an outlier or not.
Ways to Deal with Outliers
- Standard Deviation Method
- Interquartile Range Method (IQR)
- Automatic Outlier Detection
Optional Content about SD & Variance
Variance: In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.
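Written out with the symbols defined below, and assuming the divide-by-N form that matches the expectation-based definition above, the variance would read:

```latex
S^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
```

Many texts instead divide by N − 1 when the variance is estimated from a sample (Bessel's correction). The standard deviation (SD) is the square root of the variance.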

- S² = sample variance
- X = the value of a single observation
- μ = the mean of all observations
- N = the number of observations
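As a rough illustration of these definitions, the sketch below computes the variance and standard deviation by hand and compares them with NumPy. The data values are hypothetical and chosen only so the arithmetic is easy to follow. The divide-by-N convention is an assumption here, and `ddof=1` shows the N − 1 alternative.

```python
import numpy as np

# Hypothetical observations, used only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()                    # mean of all observations
squared_dev = (values - mu) ** 2      # squared deviation of each observation
variance_n = squared_dev.sum() / len(values)  # divide by N (assumed convention)
std_dev = np.sqrt(variance_n)         # standard deviation = sqrt(variance)

# np.var divides by N by default (ddof=0); ddof=1 divides by N - 1 instead
variance_np = np.var(values)
variance_unbiased = np.var(values, ddof=1)

print("mean:", mu)
print("variance (N):", variance_n)
print("variance (N - 1):", variance_unbiased)
print("standard deviation:", std_dev)
```

For this particular data the squared deviations sum to 32, so dividing by N = 8 gives a variance of 4 and a standard deviation of 2, while dividing by N − 1 = 7 gives a slightly larger variance.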