Intro
分析sklearn的IsolationForest源码,搞清楚代码结构和样本异常得分的计算逻辑。目前对python类、方法等概念不是很了解,只从直观上解释代码。
查看源码的方式
建议直接用Pycharm,方便代码跳转查看。可以copy一个副本在相同目录下,如_iforestTest.py,后面可以直接在这个脚本里做注释。另外可以在该脚本中加入print语句,观察过程性数据。
基本结构
_iforestTest.py在\anaconda\Lib\site-packages\sklearn\ensemble
目录下,IsolationForestTest类下,主要的几个方法:
- fit
- predict
- decision_function
- score_samples
- _compute_chunked_score_samples
- _compute_score_samples
另外还有一个算是函数还是啥?不知道专业名词是什么:
- _average_path_length: Paper里的c(n),计算n个样本构建二叉树的平均深度
得分逻辑
介绍异常值得分的计算逻辑,不聚焦于代码细节~
- _average_path_length: 实现Paper中的 c ( n ) c(n) c(n),这个方法在计算单个样本的pathLength和score都有调用
- _compute_score_samples: 计算每个样本的score,此时score和Paper中的一致
- score_samples: 对score取相反数
- decision_function: 对取相反数之后的score做了个偏移,-offset(默认0.5,否则根据异常值比例求分位数计算)
- predict: decision_function返回值大于0则正常,小于0则异常
代码整体思路和Paper类似,前期计算的score和Paper里提出的计算公式一致,值域为[0,1]。此时,score越大,样本越可能是异常值。后面取反减去offset,因为offset默认是-0.5,所以值域范围变为[-0.5,0.5],值越小,越可能是异常值。
其中offset和contamination有关,contamination默认为auto,此时offset默认为-0.5。如果contamination为float类型,那么表示异常值占比,此时会计算训练数据中所有score(取反之后)的分位数,把这个分位数值赋给offset。
这里有两个小问题,offset默认取-0.5是否合理?Paper上貌似也没有指定。另外分位数的计算需要再check。
代码结构
def _average_path_length
该函数是Paper中 c ( n ) c(n) c(n)的实现,主要用途有两个:
- 单个样本计算pathLength时,需要计算 c ( T . s i z e ) c(T.size) c(T.size)
- score归一化,需要计算 c ( n ) c(n) c(n),n为subSamlpeSize
from sklearn.utils import check_array
import numpy as np
def _average_path_length(n_samples_leaf):
"""
The average path length in a n_samples iTree, which is equal to
the average path length of an unsuccessful BST search since the
latter has the same structure as an isolation tree.
Parameters
----------
n_samples_leaf : array-like, shape (n_samples,).
The number of training samples in each test sample leaf, for
each estimators.
输入:array,一种是输入subsamplesize,另一种是subsamplesize个样本落入的叶子结点中所含的样本数
Returns
-------
average_path_length : array, same shape as n_samples_leaf
输出:array,维度和输入的array一直
"""
# ensure_2d:default=True,Whether to raise a value error if array is not 2D.
# 做数据检验和转换,非二维时,不报错
# 可以把列表数据转换成array
n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)
# 获取维度
n_samples_leaf_shape = n_samples_leaf.shape
# 又是转置,1行N列
n_samples_leaf = n_samples_leaf.reshape((1, -1))
# 生成1行N列的数组,数组元素都是0
average_path_length = np.zeros(n_samples_leaf.shape)
# 下面是为公式计算做准备
# c(n)=2[ln(n-1)+0.5772156649]-2(n-1)/n
# n<=1时,c(n)=0;n=2时,c(n)=1;
# n>2,此时not_mask=True,按照公式计算即可
mask_1 = n_samples_leaf <= 1
mask_2 = n_samples_leaf == 2
# 逻辑或,再取反,啥意思,两个为False计算为正,别的都为负
# not_mask为true,则n_samples_leaf>2,为啥写的这么啰嗦,看不懂
not_mask = ~np.logical_or(mask_1, mask_2)
# 赋值计算,<=1的是0,==2的为1
average_path_length[mask_1] = 0.
average_path_length[mask_2] = 1.
# 计算对应的averagePath
average_path_length[not_mask] = (
2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma)
- 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask]
)
return average_path_length.reshape(n_samples_leaf_shape)
check_array([10], ensure_2d=False)
array([10])
计算score时,如此调用_average_path_length([self.max_samples_])
,列表作为参数传进去
_average_path_length([10])
array([3.74888048])
n = 10
2.0*(np.log(n-1)+ np.euler_gamma)-2*(n-1)/n
3.748880484475505
计算多个样本的pathLength时,统计
c
(
T
.
s
i
z
e
)
c(T.size)
c(T.size)
这里的计算逻辑需要涉及到_compute_score_samples
,此时把每个样本在同一个iTree落入叶子结点样本数拼接成array,维度是1行N列,N为subSampleSize
_average_path_length(np.asanyarray([10,12]))
array([3.74888048, 4.11688854])
def _compute_score_samples
该函数是计算样本score的核心模块,中间涉及到tree生成的部分不提,还没有看,估计也看不懂~只分析tree返回的结果,以及如果用这个结果进行得分计算。
中间加了些print逻辑,方便查看过程性变量
只print三次循环的过程
# _compute_score_samples返回和paper里同样逻辑的得分
def _compute_score_samples(self, X, subsample_features):
"""
Compute the score of each samples in X going through the extra trees.
Parameters
----------
X : array-like or sparse matrix
subsample_features : bool,
whether features should be subsampled
是否对特征也抽样
"""
# 样本数
n_samples = X.shape[0]
print("样本数:"+str(n_samples))
# 初始化array,元素都为0,用于存放样本的pathLength
depths = np.zeros(n_samples, order="f")
print("------ depths ------",depths)
# 下面做训练遍历,应该是把bagging的过程用来做循环
# 所以self.estimators_是多个ExtraTreeRegressor组成的list
# self.estimators_是多个arrat{0]组成的list
# 需要进一步查看
print("------ self.estimators_ ------"+"\n");print(type(self.estimators_))
print(len(self.estimators_));print(type(self.estimators_[0]))
print("------ self.estimators_features_ ------" + "\n");print(type(self.estimators_features_))
print(len(self.estimators_features_));print(self.estimators_features_[0:3])
kk=0
print("------------------------------ 循环开始 ------------------------------")
for tree, features in zip(self.estimators_, self.estimators_features_):
kk+=1
# print("------------第"+str(kk)+"次循环------------")
# 完成抽样过程,如果子抽样,features里应该被抽中的特征,目前看,似乎不需要子抽样
X_subset = X[:, features] if subsample_features else X
# print("X_subset"+"\n");print(X_subset)
# 下面都是tree的内容,tree的生成逻辑暂时不是很清楚,但是可以先观察输出的数据结构和含义
# apply返回每个数据落在该树的哪一个叶子结点上
leaves_index = tree.apply(X_subset) # 叶子结点编号
# print("------ leaves_index ------"+"\n");print(type(leaves_index));print(leaves_index)
# 返回的是稀疏矩阵,直接打印,则只把不为0的索引打印出来
# toarray之后,可以看到完整的数据结构
# 从根节点开始计数,N个节点,这个矩阵就有N列,M行表示M个样本,如果样本m在n列中,那么(m,n)为1,否则为0
# 可以结果可视化的结果来看
node_indicator = tree.decision_path(X_subset) # 内部节点编号?
# print("node_indicator"+"\n");print( node_indicator)
# print("node_indicator.shape" + "\n");print(node_indicator.shape);print("\n");print(type(node_indicator))
# print(node_indicator.toarray())
# 返回每个样本所落叶子结点样本数
n_samples_leaf = tree.tree_.n_node_samples[leaves_index]
# print("n_samples_leaf"+"\n");print( n_samples_leaf)
# np.ravel把array转成一维
# pathLength = e + c(T.size)
# e = np.ravel(node_indicator.sum(axis=1))-1.0??
# -1是因为根节点也被计算在其中,所以pathLength这个没问题,至于怎么生成tree又是另外的一个话题了,以后再说吧
# c(T.size) = _average_path_length(n_samples_leaf)
# = 2[ln(n-1)+0.577]-2(n-1)/n
# n_samples_leaf理论上存的是每个样本所在叶子结点的样本个数
depths += (
np.ravel(node_indicator.sum(axis=1))
+ _average_path_length(n_samples_leaf)
- 1.0
)
if (kk<3):
print("------------第" + str(kk) + "次循环------------")
# 完成抽样过程,如果子抽样,features里应该被抽中的特征,目前看,似乎不需要子抽样
print("X_subset" + "\n");
print(X_subset)
# 下面都是tree的内容,tree的生成逻辑暂时不是很清楚,但是可以先观察输出的数据结构和含义
# apply返回每个数据落在该树的哪一个叶子结点上
print("------ leaves_index ------" + "\n");
print(type(leaves_index));
print(leaves_index)
# 返回的是稀疏矩阵,直接打印,则只把不为0的索引打印出来
# toarray之后,可以看到完整的数据结构
# 从根节点开始计数,N个节点,这个矩阵就有N列,M行表示M个样本,如果样本m在n列中,那么(m,n)为1,否则为0
# 可以结果可视化的结果来看
print("node_indicator" + "\n");
print(node_indicator)
print("node_indicator.shape" + "\n");
print(node_indicator.shape);
print("\n");
print(type(node_indicator))
print(node_indicator.toarray())
# 返回每个样本所落叶子结点样本数
print("n_samples_leaf" + "\n");
print(n_samples_leaf)
print("---depths---"+"\n");print(depths)
print("---node_indicator.sum(axis=1)---" + "\n");
print(node_indicator.sum(axis=1))
print("---np.ravel(node_indicator.sum(axis=1))---"+"\n");print(np.ravel(node_indicator.sum(axis=1)))
print("---_average_path_length(n_samples_leaf)---"+"\n");print(_average_path_length(n_samples_leaf))
print("--- (len(self.estimators_) ---"+"\n");print((len(self.estimators_)))
scores = 2 ** (
-depths/ (len(self.estimators_) # 这个是pathlength均值
* _average_path_length([self.max_samples_]))
)
return scores
import numpy as np
from sklearn.ensemble import IsolationForestTest
from sklearn.ensemble import _average_path_length
X = np.asarray([[-1.1], [0.3], [0.5], [100]])
clf = IsolationForestTest(random_state=0)
clf.fit(X)
IsolationForestTest(behaviour='deprecated', bootstrap=False,
contamination='auto', max_features=1.0, max_samples='auto',
n_estimators=100, n_jobs=None, random_state=0, verbose=0,
warm_start=False)
clf._compute_score_samples(X,False)
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------
<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------
<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
------------第1次循环------------
X_subset
[[ -1.1]
[ 0.3]
[ 0.5]
[100. ]]
------ leaves_index ------
<class 'numpy.ndarray'>
[2 3 3 4]
node_indicator
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 1) 1
(1, 3) 1
(2, 0) 1
(2, 1) 1
(2, 3) 1
(3, 0) 1
(3, 4) 1
node_indicator.shape
(4, 5)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 0 0]
[1 1 0 1 0]
[1 1 0 1 0]
[1 0 0 0 1]]
n_samples_leaf
[1 2 2 1]
---depths---
[2. 3. 3. 1.]
---node_indicator.sum(axis=1)---
[[3]
[3]
[3]
[2]]
---np.ravel(node_indicator.sum(axis=1))---
[3 3 3 2]
---_average_path_length(n_samples_leaf)---
[0. 1. 1. 0.]
--- (len(self.estimators_) ---
100
------------第2次循环------------
X_subset
[[ -1.1]
[ 0.3]
[ 0.5]
[100. ]]
------ leaves_index ------
<class 'numpy.ndarray'>
[2 3 3 4]
node_indicator
(0, 0) 1
(0, 1) 1
(0, 2) 1
(1, 0) 1
(1, 1) 1
(1, 3) 1
(2, 0) 1
(2, 1) 1
(2, 3) 1
(3, 0) 1
(3, 4) 1
node_indicator.shape
(4, 5)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 0 0]
[1 1 0 1 0]
[1 1 0 1 0]
[1 0 0 0 1]]
n_samples_leaf
[1 2 2 1]
---depths---
[4. 6. 6. 2.]
---node_indicator.sum(axis=1)---
[[3]
[3]
[3]
[2]]
---np.ravel(node_indicator.sum(axis=1))---
[3 3 3 2]
---_average_path_length(n_samples_leaf)---
[0. 1. 1. 0.]
--- (len(self.estimators_) ---
100
array([0.46946339, 0.32529681, 0.33144271, 0.68006339])
从上述结果可得:
- self.estimators_features_和self.estimators_features_都是list格式,长度都为100(默认参数n_estimators即为100)
- estimators_features_元素为ExtraTreeRegressor,estimators_features_元素为array([0])
- depths是和subSampleSize个维度的array,初始值元素都为0
- 在每一次循环中,完成以下操作
- X_subset:抽取样本,此时只做feature抽样(默认不做),样本本身的子抽样,不需要做,传进来的X应该已经做过样本抽样
- leaves_index:生成叶子结点标号,array形式,应该是各个样本所落入的叶子结点编号
- node_indicator:从根节点开始,所有节点的标号信息,稀疏矩阵格式,toarray之后可以理解为行代表样本,列代表节点,如果样本落入这个节点为1否则0
- n_samples_leaf:取出leaves_index里对应叶子结点编号的样本数,位置顺序和样本一一对应
- depths:计算每一样本在这一轮迭代中的pathLength,并且累加到depths中,代码里np.ravel(node_indicator.sum(axis=1)),相当于对列求和,从而统计每一个样本经历的节点数,减1就可以得到样本split的次数。_average_path_length(n_samples_leaf)则计算c(n),完成pathLength=e+c(T.size)的计算。
- scores:为最返回值,depths/ (len(self.estimators_)为E(h(x)),得到pahtLength的均值,其他和Paper里逻辑一致
ExtraTreeRegressor返回结果解析
针对ExtraTreeRegressor返回的结果,借助可视化工具进行解析。
from sklearn.tree import ExtraTreeRegressor
tree = ExtraTreeRegressor(random_state=1234, max_depth=3)
xx = np.asarray([[21], [51],[-1.1], [0.3], [0.5], [10], [110]])
# xx = np.asarray([[-1.1]])
model = tree.fit(xx, xx)
先生成树的结构
from IPython.display import Image
from sklearn import tree
import pydotplus
dot_data = tree.export_graphviz(model, out_file=None,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
每个样本落入叶子结点的编号
model.get_n_leaves()
1
leaves_index = model.apply(xx)
print("leaves_index"+"\n");print(leaves_index)
leaves_index
[ 9 9 2 4 5 8 10]
原始数据:[[21], [51],[-1.1], [0.3], [0.5], [10], [110]]
叶子结点:[ 2 4 5 8 9 9 10]
这里的编号是对树中全部节点的编号,根节点为0,后面从上往下,一次排列。具体到上图,从根节点分裂出两个分支,先从左边的分支开始编号,这个分支继续分裂,编号延续,此时编号会沿着分支一直纵向进行,到达叶子结点再回头计数 。具体来说:
- value= 27.386 为根节点,编号0
- value=-0.1,左分支,编号为1
- value=-1.1,编号为2
- value=0.4,编号为3
- value=0.3,编号为4
- value=0.5,编号为5
- 此时左分支统计完毕,从右分支开始
- value=48.0,编号为6
- value=27.33,编号为7
- value=10.0,编号为8
- value=36.0,编号为9
- value=110.0,编号为10
每个样本落入叶子结点所含总样本数
n_samples_leaf = model.tree_.n_node_samples[leaves_index]
print("n_samples_leaf"+"\n");print( n_samples_leaf);print(type(n_samples_leaf))
n_samples_leaf
[2 2 1 1 1 1 1]
<class 'numpy.ndarray'>
从上面树的结构上可得,[21], [51]同时落入节点9,其余叶子结点均只含一个样本
树结构
node_indicator = model.decision_path(xx) # 内部节点编号?
print("node_indicator.shape" + "\n");print(node_indicator.shape);print("\n");print(type(node_indicator))
print("node_indicator"+"\n")#;print( node_indicator)
print(node_indicator.toarray())
node_indicator.shape
(7, 11)
<class 'scipy.sparse.csr.csr_matrix'>
node_indicator
[[1 0 0 0 0 0 1 1 0 1 0]
[1 0 0 0 0 0 1 1 0 1 0]
[1 1 1 0 0 0 0 0 0 0 0]
[1 1 0 1 1 0 0 0 0 0 0]
[1 1 0 1 0 1 0 0 0 0 0]
[1 0 0 0 0 0 1 1 1 0 0]
[1 0 0 0 0 0 1 0 0 0 1]]
对样本做如下标记:S[索引,value],其中索引从0计数
对矩阵做如下标记M[row,col],其中行列从0计数
对节点做如下标记N[i]第i个节点,从0开始计数,N[0]为根节点
如上面矩阵:
- 行代表样本,列代表节点,因此7行11列
- 第一列代表根节点N[0],后面依次类推
- 第一次split,从根节点出发,根据x<=2.686split,有三个样本{S[2,-1.1],S[3,0.3],S[4,0.5]}落入节点N[1],因此矩阵的{M[2,1],M[3,1],M[4,1]}元素的值为1
- 第2次split,根据x<=-0.581对节点[1]进行split,此时样本S[2,-1.1]落入节点N[2],因此M[2,2]=1,其他为0;其余样本{S[3,0.3],S[4,0.5]}落入节点N[3],因此M[(3,3),(4,3)]为1,其余为0
- 其他split过程依次类推,不赘述
def _compute_chunked_score_samples
该函数和_compute_score_samples计算结果一致,似乎只是做了效率上的提升?代码也没有看懂~
import numpy as np
from sklearn.ensemble import IsolationForestTest
from sklearn.ensemble import _average_path_length
X = np.asarray([[-1.1], [0.3], [0.5], [100]])
# X = np.asarray([[-1.1,1], [0.3,1], [0.5,1], [100,1]])
clf = IsolationForestTest(random_state=0)
clf.fit(X)
IsolationForestTest(behaviour='deprecated', bootstrap=False,
contamination='auto', max_features=1.0, max_samples='auto',
n_estimators=100, n_jobs=None, random_state=0, verbose=0,
warm_start=False)
clf._compute_chunked_score_samples(X)
------ subsample_features ------
False
------- chunk_n_rows ------
<class 'int'>
------- slices ------
<class 'generator'>
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------
<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------
<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
------- score -----
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------
<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------
<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
[0.46946339 0.32529681 0.33144271 0.68006339]
[0.46946339 0.32529681 0.33144271 0.68006339]
array([0.46946339, 0.32529681, 0.33144271, 0.68006339])
clf._compute_score_samples(X,False)
样本数:4
------ depths ------ [0. 0. 0. 0.]
------ self.estimators_ ------
<class 'list'>
100
<class 'sklearn.tree._classes.ExtraTreeRegressor'>
------ self.estimators_features_ ------
<class 'list'>
100
[array([0]), array([0]), array([0])]
------------------------------ 循环开始 ------------------------------
array([0.46946339, 0.32529681, 0.33144271, 0.68006339])
def score_samples
对数据做了下校验,完了直接调用_compute_chunked_score_samples,取反
def decision_function
对取反之后的score做校正,把值域从[-1,0]映射到[-0.5,0.5]
def predict
该函数对校正之后的得分进行判断,如果小于0,返回-1,认为是异常值。
def fit
fit函数是核心模块,还没有搞明白~从代码角度看,最终都是在调用bagging框架的方法,后面再看吧
Ref
[1] ExtraTreeRegressor文档
[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
2020-01-09 于南京市江宁区九龙湖