Question: "How to decide the contamination value in isolation forest?" This confused me, because it comes down to how the decision-score threshold is determined.
There were two related analysis threads that I never managed to open, which was frustrating.
A GitHub issue asking the same question as mine: Using IForest in situations where training set does not contain any anomalies · Issue #482 · yzhao062/pyod (github.com)
From that issue: "I thought that contamination is used to define the threshold on the decision function, and the threshold is calculated for generating the binary outlier labels. Doesn't this automatically mean that the choice of 'contamination' determines the number of anomalies to be found? Also, I have read in many tutorials on Isolation Forests that the performance of the algorithm depends very much on the choice of the 'contamination' parameter. However, I could not read anything of the sort in the original paper.

Besides the training data, there are only two input parameters to the IForest algorithm: the subsampling size (corresponds to the max_samples parameter of pyod, I guess) and the number of trees (probably n_estimators). The isolation trees are built using randomly selected sub-samples of the given training data, recursively divided by randomly selecting an attribute and a split value until either the node has only one instance or all data at the node have the same values. At the end of the training process, a collection of trees - the forest - is returned. Then test instances are passed through the isolation trees in the evaluation stage to obtain an anomaly score for each instance. This anomaly score is calculated via the average path length over a number of trees; therefore, when a forest of random trees collectively produces shorter path lengths for particular samples, they are probably anomalies."
An isolation forest randomly partitions each point away from the others and builds a tree from the number of splits required, with each point corresponding to a node in a tree. Anomalies tend to be isolated close to the root, while normal points end up at greater depth.
In an isolation forest, a forest of trees is built according to the n_estimators and max_samples parameters, and the scores are derived from it.
We can use score_samples / decision_function to obtain a normalized anomaly score for each point; in sklearn's convention, the more negative the value, the more anomalous the point. Up to this step, the contamination factor has no effect on the scores. From here on, when we call predict to get the outlier/inlier labels (+1 inlier, -1 outlier in sklearn), contamination acts as the cutoff/percentile on the scores, and the bottom x percent of scores are returned as anomalies. (Example: with contamination set to 0.05, i.e. 5%, the 5% of points with the lowest scores are labeled anomalies.) (In other words, decision_function calls score_samples and then subtracts an offset; after the shift, normal samples score positive and anomalous samples score negative. With contamination="auto" the offset is computed by the algorithm itself, i.e. the threshold is set automatically.)
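A quick sketch of the point above (assuming scikit-learn is installed): contamination changes only the predict threshold, while score_samples is untouched by it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(200, 2)  # 200 roughly normal points

# Same data, same random_state, different contamination.
a = IsolationForest(contamination=0.05, random_state=0).fit(X)
b = IsolationForest(contamination=0.20, random_state=0).fit(X)

# score_samples is identical -- contamination plays no role here.
assert np.allclose(a.score_samples(X), b.score_samples(X))

# predict flags roughly contamination * n_samples points as -1 (outliers).
print((a.predict(X) == -1).mean())  # ~0.05
print((b.predict(X) == -1).mean())  # ~0.20
```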
This is visible in sklearn's source (simplified excerpt):

```python
# inside IsolationForest.fit():
if self.contamination != "auto":
    if not (0.0 < self.contamination <= 0.5):
        raise ValueError(
            "contamination must be in (0, 0.5], got: %f" % self.contamination
        )

if self.contamination == "auto":
    # 0.5 plays a special role as described in the original paper.
    # we take the opposite as we consider the opposite of their score.
    self.offset_ = -0.5
    return self

# else, define offset_ wrt contamination parameter
self.offset_ = np.percentile(self.score_samples(X), 100.0 * self.contamination)

def decision_function(self, X):
    return self.score_samples(X) - self.offset_

def score_samples(self, X):
    # code structure from ForestClassifier/predict_proba
    check_is_fitted(self)
    # Check data
    X = self._validate_data(X, accept_sparse="csr", reset=False)
    # Take the opposite of the scores as bigger is better (here less abnormal)
    return -self._compute_chunked_score_samples(X)
```
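To confirm the percentile reading of the excerpt above, a small check against scikit-learn's public attributes (offset_, score_samples, decision_function):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(1).randn(300, 3)
clf = IsolationForest(contamination=0.1, random_state=0).fit(X)

scores = clf.score_samples(X)
# offset_ is the contamination-quantile of the training scores...
assert np.isclose(clf.offset_, np.percentile(scores, 10.0))
# ...and decision_function is just score_samples shifted by it.
assert np.allclose(clf.decision_function(X), scores - clf.offset_)
```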
Finally: when contamination is set to "auto", the threshold follows the original paper's choice; correspondingly, in sklearn, offset_ becomes a fixed -0.5 (the paper's 0.5, negated because sklearn negates the scores).
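This can be verified directly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).randn(100, 2)
clf = IsolationForest(contamination="auto", random_state=0).fit(X)

# With "auto", no percentile is computed; the paper's 0.5 threshold
# (negated, since sklearn negates the scores) is used directly.
assert clf.offset_ == -0.5
```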
For real-time anomaly detection, combining statistical rules with an isolation forest works better: the model is trained once, then deployed to predict on future data streams, whose distribution may drift from time to time, so new data will score differently. How best to combine it with statistical rules such as Z-scores/IQR remains to be studied; discussion is welcome.
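One possible shape for such a hybrid (an assumption on my part, not an established recipe): train the forest once, then flag an incoming point if either the forest's decision_function is negative or its raw score is an IQR outlier relative to the scores seen so far.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

train = np.random.RandomState(0).randn(500, 2)
clf = IsolationForest(contamination=0.05, random_state=0).fit(train)

history = list(clf.score_samples(train))  # running record of scores

def flag(batch):
    """Hypothetical hybrid rule: forest threshold OR IQR rule on scores."""
    scores = clf.score_samples(batch)
    q1, q3 = np.percentile(history, [25, 75])
    iqr_outlier = scores < q1 - 1.5 * (q3 - q1)   # low score = more abnormal
    forest_outlier = clf.decision_function(batch) < 0
    history.extend(scores)                         # update the running window
    return iqr_outlier | forest_outlier

batch = np.vstack([np.random.RandomState(1).randn(20, 2), [[8.0, 8.0]]])
flags = flag(batch)
print(flags)  # the obvious outlier at (8, 8) should be flagged
```

Updating `history` with every batch lets the IQR cutoff adapt slowly to drift while the trained forest stays fixed; retraining cadence is a separate design question.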
Further reading on understanding contamination:
why specify contamination if our job is to find it out? · Issue #144 · yzhao062/pyod (github.com)
