IsolationForest-03Sklearn源码

最新推荐文章于 2022-03-16 10:00:03 发布

维格堂406小队

最新推荐文章于 2022-03-16 10:00:03 发布

阅读量746

点赞数 3

分类专栏： ★★★机器学习 # ★★异常检测

本文链接：https://blog.csdn.net/wendaomudong_l2d4/article/details/103988432

版权

★★★机器学习同时被 2 个专栏收录

62 篇文章 3 订阅

订阅专栏

★★异常检测

3 篇文章 1 订阅

订阅专栏

Intro

分析sklearn的IsolationForest源码，搞清楚代码结构和样本异常得分的计算逻辑。目前对python类、方法等概念不是很了解，只从直观上解释代码。

查看源码的方式

建议直接用Pycharm,方便代码跳转查看。可以copy一个副本在相同目录下，如_iforestTest.py，后面可以直接在这个脚本里做注释。另外可以在该脚本中加入print语句，观察过程性数据。

基本结构

_iforestTest.py在\anaconda\Lib\site-packages\sklearn\ensemble目录下，IsolationForestTest类下，主要的几个方法：

fit
predict
decision_function
score_samples
_compute_chunked_score_samples
_compute_score_samples

另外还有一个算是函数还是啥？不知道专业名词是什么:

_average_path_length: Paper里的c(n)，计算n个样本构建二叉树的平均深度

得分逻辑

介绍异常值得分的计算逻辑，不聚焦于代码细节~

_average_path_length: 实现Paper中的 $c (n)$ ，这个方法在计算单个样本的pathLength和score都有调用
_compute_score_samples: 计算每个样本的score,此时score和Paper中的一致
score_samples: 对score取相反数
decision_function: 对取相反数之后的score做了个偏移，-offset(默认0.5，否则根据异常值比例求分位数计算)
predict: decision_function返回值大于0则正常，小于0则异常

代码整体思路和Paper类似，前期计算的score和Paper里提出的计算公式一致，值域为[0,1]。此时，score越大，样本越可能是异常值。后面取反减去offset，因为offset默认是-0.5,所以值域范围变为[-0.5,0.5]，值越小，越可能是异常值。
其中offset和contamination有关，contamination默认为auto,此时offset默认为-0.5。如果contamination为float类型，那么表示异常值占比，此时会计算训练数据中所有score(取反之后)的分位数，把这个分位数值赋给offset。

这里有两个小问题，offset默认取-0.5是否合理？Paper上貌似也没有指定。另外分位数的计算需要再check。

代码结构

def _average_path_length

该函数是Paper中 $c (n)$ 的实现，主要用途有两个：

单个样本计算pathLength时，需要计算 $c (T . s i z e)$
score归一化，需要计算 $c (n)$ ，n为subSamlpeSize

from sklearn.utils import check_array
import numpy as np

def _average_path_length(n_samples_leaf):
    """
    The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,).
        The number of training samples in each test sample leaf, for
        each estimators.
        输入：array，一种是输入subsamplesize，另一种是subsamplesize个样本落入的叶子结点中所含的样本数  

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    输出：array，维度和输入的array一直  
    """
    # ensure_2d:default=True，Whether to raise a value error if array is not 2D.
    # 做数据检验和转换，非二维时，不报错
    # 可以把列表数据转换成array  
    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)
    # 获取维度
    n_samples_leaf_shape = n_samples_leaf.shape
    # 又是转置，1行N列
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    # 生成1行N列的数组，数组元素都是0
    average_path_length = np.zeros(n_samples_leaf.shape)

    # 下面是为公式计算做准备
    # c(n)=2[ln(n-1)+0.5772156649]-2(n-1)/n
    # n<=1时，c(n)=0;n=2时，c(n)=1;
    # n>2,此时not_mask=True,按照公式计算即可  
    mask_1 = n_samples_leaf <= 1
    mask_2 = n_samples_leaf == 2
    # 逻辑或，再取反，啥意思，两个为False计算为正，别的都为负
    # not_mask为true，则n_samples_leaf>2,为啥写的这么啰嗦，看不懂
    not_mask = ~np.logical_or(mask_1, mask_2)

    # 赋值计算，<=1的是0，==2的为1
    average_path_length[mask_1] = 0.
    average_path_length[mask_2] = 1.

    # 计算对应的averagePath
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma)
        - 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask]
    )
    return average_path_length.reshape(n_samples_leaf_shape)

check_array([10], ensure_2d=False)

array([10])

计算score时，如此调用_average_path_length([self.max_samples_])，列表作为参数传进去

_average_path_length([10])

array([3.74888048])

n = 10
2.0*(np.log(n-1)+ np.euler_gamma)-2*(n-1)/n

3.748880484475505

计算多个样本的pathLength时，统计 $c (T . s i z e)$
这里的计算逻辑需要涉及到_compute_score_samples，此时把每个样本在同一个iTree落入叶子结点样本数拼接成array,维度是1行N列，N为subSampleSize

_average_path_length(np.asanyarray([10,12]))

array([3.74888048, 4.11688854])

def _compute_score_samples

该函数是计算样本score的核心模块，中间涉及到tree生成的部分不提，还没有看，估计也看不懂~只分析tree返回的结果，以及如果用这个结果进行得分计算。

中间加了些print逻辑，方便查看过程性变量
只print三次循环的过程

    # _compute_score_samples返回和paper里同样逻辑的得分
    def _compute_score_samples(self, X, subsample_features):
        """
        Compute the score of each samples in X going through the extra trees.

        Parameters
        ----------
        X : array-like or sparse matrix

        subsample_features : bool,
            whether features should be subsampled
            是否对特征也抽样
        """
        # 样本数
        n_samples = X.shape[0]
        print("样本数:"+str(n_samples))
        # 初始化array，元素都为0，用于存放样本的pathLength
        depths = np.zeros(n_samples, order="f")
        print("------ depths ------",depths)
        # 下面做训练遍历，应该是把bagging的过程用来做循环
        # 所以self.estimators_是多个ExtraTreeRegressor组成的list
        # self.estimators_是多个arrat{0]组成的list
        # 需要进一步查看
        print("------ self.estimators_ ------"+"\n");print(type(self.estimators_))
        print(len(self.estimators_));print(type(self.estimators_[0]))
        print("------ self.estimators_features_ ------" + "\n");print(type(self.estimators_features_))
        print(len(self.estimators_features_));print(self.estimators_features_[0:3])
        kk=0
        print("------------------------------ 循环开始 ------------------------------")
        for tree, features in zip(self.estimators_, self.estimators_features_):
            kk+=1
            # print("------------第"+str(kk)+"次循环------------")
            # 完成抽样过程，如果子抽样，features里应该被抽中的特征，目前看，似乎不需要子抽样
            X_subset = X[:, features] if subsample_features else X
            # print("X_subset"+"\n");print(X_subset)

            # 下面都是tree的内容，tree的生成逻辑暂时不是很清楚，但是可以先观察输出的数据结构和含义
            # apply返回每个数据落在该树的哪一个叶子结点上
            leaves_index = tree.apply(X_subset) # 叶子结点编号
            # print("------ leaves_index ------"+"\n");print(type(leaves_index));print(leaves_index)
            # 返回的是稀疏矩阵,直接打印，则只把不为0的索引打印出来
            # toarray之后，可以看到完整的数据结构
            # 从根节点开始计数，N个节点，这个矩阵就有N列，M行表示M个样本，如果样本m在n列中，那么(m,n)为1，否则为0
            # 可以结果可视化的结果来看
            node_indicator = tree.decision_path(X_subset) # 内部节点编号？
            # print("node_indicator"+"\n");print( node_indicator)
            # print("node_indicator.shape" + "\n");print(node_indicator.shape);print("\n");print(type(node_indicator))
            # print(node_indicator.toarray())
            # 返回每个样本所落叶子结点样本数
            n_samples_leaf = tree.tree_.n_node_samples[leaves_index]
            # print("n_samples_leaf"+"\n");print( n_samples_leaf)

            # np.ravel把array转成一维
            # pathLength = e + c(T.size)
            # e = np.ravel(node_indicator.sum(axis=1))-1.0??
            # -1是因为根节点也被计算在其中，所以pathLength这个没问题，至于怎么生成tree又是另外的一个话题了，以后再说吧
            # c(T.size) = _average_path_length(n_samples_leaf)
            #           = 2[ln(n-1)+0.577]-2(n-1)/n
            # n_samples_leaf理论上存的是每个样本所在叶子结点的样本个数
            depths += (
                np.ravel(node_indicator.sum(axis=1))
                + _average_path_length(n_samples_leaf)
                - 1.0
            )
            if (kk<3):
                print("------------第" + str(kk) + "次循环------------")
                # 完成抽样过程，如果子抽样，features里应该被抽中的特征，目前看，似乎不需要子抽样
                print("X_subset" + "\n");
                print(X_subset)

                # 下面都是tree的内容，tree的生成逻辑暂时不是很清楚，但是可以先观察输出的数据结构和含义
                # apply返回每个数据落在该树的哪一个叶子结点上
                print("------ leaves_index ------" + "\n");
                print(type(leaves_index));
                print(leaves_index)
                # 返回的是稀疏矩阵,直接打印，则只把不为0的索引打印出来
                # toarray之后，可以看到完整的数据结构
                # 从根节点开始计数，N个节点，这个矩阵就有N列，M行表示M个样本，如果样本m在n列中，那么(m,n)为1，否则为0
                # 可以结果可视化的结果来看
                print("node_indicator" + "\n");
                print(node_indicator)
                print("node_indicator.shape" + "\n");
                print(node_indicator.shape);
                print("\n");
                print(type(node_indicator))
                print(node_indicator.toarray())
                # 返回每个样本所落叶子结点样本数
                print("n_samples_leaf" + "\n");
                print(n_samples_leaf)
                print("---depths---"+"\n");print(depths)
                print("---node_indicator.sum(axis=1)---" + "\n");
                print(node_indicator.sum(axis=1))
                print("---np.ravel(node_indicator.sum(axis=1))---"+"\n");print(np.ravel(node_indicator.sum(axis=1)))
                print("---_average_path_length(n_samples_leaf)---"+"\n");print(_average_path_length(n_samples_leaf))
                print("--- (len(self.estimators_) ---"+"\n");print((len(self.estimators_)))
        scores = 2 ** (
            -depths/ (len(self.estimators_) # 这个是pathlength均值
            * _average_path_length([self.max_samples_]))
        )
        return scores

import numpy as np
from sklearn.ensemble import IsolationForestTest
from sklearn.ensemble import _average_path_length
X = np.asarray([[-1.1], [0.3], [0.5], [100]])
clf = IsolationForestTest(random_state=0)
clf.fit(X)

IsolationForestTest(behaviour='deprecated', bootstrap=False,
                    contamination='auto', max_features=1.0, max_samples='auto',
                    n_estimators=100, n_jobs=None, random_state=0, verbose=0,
                    warm_start=False)

clf._compute_score_samples(X,False)

样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------

<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------

<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
------------第1次循环------------
X_subset

[[ -1.1]
 [  0.3]
 [  0.5]
 [100. ]]
------ leaves_index ------

<class 'numpy.ndarray'>
[2 3 3 4]
node_indicator

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (2, 0)	1
  (2, 1)	1
  (2, 3)	1
  (3, 0)	1
  (3, 4)	1
node_indicator.shape

(4, 5)


<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 0 0]
 [1 1 0 1 0]
 [1 1 0 1 0]
 [1 0 0 0 1]]
n_samples_leaf

[1 2 2 1]
---depths---

[2. 3. 3. 1.]
---node_indicator.sum(axis=1)---

[[3]
 [3]
 [3]
 [2]]
---np.ravel(node_indicator.sum(axis=1))---

[3 3 3 2]
---_average_path_length(n_samples_leaf)---

[0. 1. 1. 0.]
--- (len(self.estimators_) ---

100
------------第2次循环------------
X_subset

[[ -1.1]
 [  0.3]
 [  0.5]
 [100. ]]
------ leaves_index ------

<class 'numpy.ndarray'>
[2 3 3 4]
node_indicator

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (2, 0)	1
  (2, 1)	1
  (2, 3)	1
  (3, 0)	1
  (3, 4)	1
node_indicator.shape

(4, 5)


<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 0 0]
 [1 1 0 1 0]
 [1 1 0 1 0]
 [1 0 0 0 1]]
n_samples_leaf

[1 2 2 1]
---depths---

[4. 6. 6. 2.]
---node_indicator.sum(axis=1)---

[[3]
 [3]
 [3]
 [2]]
---np.ravel(node_indicator.sum(axis=1))---

[3 3 3 2]
---_average_path_length(n_samples_leaf)---

[0. 1. 1. 0.]
--- (len(self.estimators_) ---

100





array([0.46946339, 0.32529681, 0.33144271, 0.68006339])

从上述结果可得：

self.estimators_features_和self.estimators_features_都是list格式，长度都为100(默认参数n_estimators即为100)
estimators_features_元素为ExtraTreeRegressor，estimators_features_元素为array([0])
depths是和subSampleSize个维度的array，初始值元素都为0
在每一次循环中，完成以下操作
- X_subset:抽取样本，此时只做feature抽样(默认不做)，样本本身的子抽样，不需要做，传进来的X应该已经做过样本抽样
- leaves_index:生成叶子结点标号，array形式，应该是各个样本所落入的叶子结点编号
- node_indicator:从根节点开始，所有节点的标号信息，稀疏矩阵格式，toarray之后可以理解为行代表样本，列代表节点，如果样本落入这个节点为1否则0
- n_samples_leaf:取出leaves_index里对应叶子结点编号的样本数，位置顺序和样本一一对应
- depths:计算每一样本在这一轮迭代中的pathLength，并且累加到depths中，代码里np.ravel(node_indicator.sum(axis=1))，相当于对列求和，从而统计每一个样本经历的节点数，减1就可以得到样本split的次数。_average_path_length(n_samples_leaf)则计算c(n)，完成pathLength=e+c(T.size)的计算。
scores:为最返回值，depths/ (len(self.estimators_)为E(h(x))，得到pahtLength的均值，其他和Paper里逻辑一致

ExtraTreeRegressor返回结果解析

针对ExtraTreeRegressor返回的结果，借助可视化工具进行解析。

from sklearn.tree import ExtraTreeRegressor
tree = ExtraTreeRegressor(random_state=1234, max_depth=3)
xx = np.asarray([[21], [51],[-1.1], [0.3], [0.5], [10],  [110]])
# xx = np.asarray([[-1.1]])
model = tree.fit(xx, xx)

先生成树的结构

from IPython.display import Image  
from sklearn import tree
import pydotplus 
dot_data = tree.export_graphviz(model, out_file=None, 

                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

在这里插入图片描述

每个样本落入叶子结点的编号

model.get_n_leaves()

leaves_index = model.apply(xx)
print("leaves_index"+"\n");print(leaves_index)

leaves_index

[ 9  9  2  4  5  8 10]

原始数据:[[21], [51],[-1.1], [0.3], [0.5], [10], [110]]
叶子结点:[ 2 4 5 8 9 9 10]
这里的编号是对树中全部节点的编号，根节点为0，后面从上往下，一次排列。具体到上图，从根节点分裂出两个分支，先从左边的分支开始编号，这个分支继续分裂，编号延续，此时编号会沿着分支一直纵向进行，到达叶子结点再回头计数。具体来说：

value= 27.386 为根节点，编号0
value=-0.1，左分支，编号为1
value=-1.1，编号为2
value=0.4，编号为3
value=0.3，编号为4
value=0.5，编号为5
此时左分支统计完毕，从右分支开始
value=48.0，编号为6
value=27.33，编号为7
value=10.0，编号为8
value=36.0，编号为9
value=110.0，编号为10

每个样本落入叶子结点所含总样本数

n_samples_leaf = model.tree_.n_node_samples[leaves_index]
print("n_samples_leaf"+"\n");print( n_samples_leaf);print(type(n_samples_leaf))

n_samples_leaf

[2 2 1 1 1 1 1]
<class 'numpy.ndarray'>

从上面树的结构上可得，[21], [51]同时落入节点9，其余叶子结点均只含一个样本

树结构

node_indicator = model.decision_path(xx) # 内部节点编号？
print("node_indicator.shape" + "\n");print(node_indicator.shape);print("\n");print(type(node_indicator))
print("node_indicator"+"\n")#;print( node_indicator)
print(node_indicator.toarray())

node_indicator.shape

(7, 11)


<class 'scipy.sparse.csr.csr_matrix'>
node_indicator

[[1 0 0 0 0 0 1 1 0 1 0]
 [1 0 0 0 0 0 1 1 0 1 0]
 [1 1 1 0 0 0 0 0 0 0 0]
 [1 1 0 1 1 0 0 0 0 0 0]
 [1 1 0 1 0 1 0 0 0 0 0]
 [1 0 0 0 0 0 1 1 1 0 0]
 [1 0 0 0 0 0 1 0 0 0 1]]

对样本做如下标记：S[索引,value],其中索引从0计数
对矩阵做如下标记M[row,col]，其中行列从0计数
对节点做如下标记N[i]第i个节点，从0开始计数，N[0]为根节点
如上面矩阵：

行代表样本，列代表节点，因此7行11列
第一列代表根节点N[0]，后面依次类推
第一次split,从根节点出发，根据x<=2.686split，有三个样本{S[2,-1.1],S[3,0.3],S[4,0.5]}落入节点N[1],因此矩阵的{M[2,1],M[3,1],M[4,1]}元素的值为1
第2次split，根据x<=-0.581对节点[1]进行split，此时样本S[2,-1.1]落入节点N[2]，因此M[2,2]=1，其他为0;其余样本{S[3,0.3],S[4,0.5]}落入节点N[3]，因此M[(3,3),(4,3)]为1，其余为0
其他split过程依次类推，不赘述

def _compute_chunked_score_samples

该函数和_compute_score_samples计算结果一致，似乎只是做了效率上的提升？代码也没有看懂~

import numpy as np
from sklearn.ensemble import IsolationForestTest
from sklearn.ensemble import _average_path_length
X = np.asarray([[-1.1], [0.3], [0.5], [100]])
# X = np.asarray([[-1.1,1], [0.3,1], [0.5,1], [100,1]])
clf = IsolationForestTest(random_state=0)
clf.fit(X)

IsolationForestTest(behaviour='deprecated', bootstrap=False,
                    contamination='auto', max_features=1.0, max_samples='auto',
                    n_estimators=100, n_jobs=None, random_state=0, verbose=0,
                    warm_start=False)

clf._compute_chunked_score_samples(X)

------ subsample_features ------ 
 False
------- chunk_n_rows ------


<class 'int'>
------- slices ------


<class 'generator'>
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------

<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------

<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
------- score ----- 
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------

<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------

<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
[0.46946339 0.32529681 0.33144271 0.68006339]
[0.46946339 0.32529681 0.33144271 0.68006339]





array([0.46946339, 0.32529681, 0.33144271, 0.68006339])

clf._compute_score_samples(X,False)

样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------

<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------

<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------





array([0.46946339, 0.32529681, 0.33144271, 0.68006339])

def score_samples

对数据做了下校验，完了直接调用_compute_chunked_score_samples，取反

def decision_function

对取反之后的score做校正，把值域从[-1,0]映射到[-0.5,0.5]

def predict

该函数对校正之后的得分进行判断，如果小于0，返回-1，认为是异常值。

def fit

fit函数是核心模块，还没有搞明白~从代码角度看，最终都是在调用bagging框架的方法，后面再看吧

Ref

[1] ExtraTreeRegressor文档
 [2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

2020-01-09 于南京市江宁区九龙湖

维格堂406小队

关注

3
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
IsolationForest-03Sklearn源码

Intro 分析sklearn的IsolationForest源码，搞清楚代码结构和样本异常得分的计算逻辑。目前对python类、方法等概念不是很了解，只从直观上解释代码。查看源码的方式建议直接用Pycharm,方便代码跳转查看。可以copy一个副本在相同目录下，如_iforestTest.py，后面可以直接在这个脚本里做注释。另外可以在该脚本中加入print语句，观察过程性数据。基...
复制链接

扫一扫