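I. How the RRCF algorithm works:
1. Model overview:
RRCF (Robust Random Cut Forest) comes from an ICML 2016 paper, "Robust Random Cut Forest Based Anomaly Detection on Streams".
Source code (an open-source Python implementation): https://github.com/kLabUM/rrcf
Paper: http://proceedings.mlr.press/v48/guha16.html
RRCF is made up of a set of RRCTs (robust random cut trees). Following the paper, an RRCT on a point set S is built as follows:
(1) Choose a dimension i at random with probability proportional to l_i / sum_j l_j, where l_i = max_{x in S} x_i - min_{x in S} x_i is the range of S along dimension i.
(2) Choose a cut value X_i uniformly at random from [min_{x in S} x_i, max_{x in S} x_i].
(3) Let S1 = {x in S : x_i <= X_i} and S2 = S \ S1, then recurse on S1 and S2.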
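The random cut that an RRCT applies at each node (a dimension picked with probability proportional to its range, then a cut value drawn uniformly inside that range, as described in the paper) can be sketched in plain Python. This is only an illustrative sketch and not the rrcf library's implementation: the names random_cut, build_tree and leaf_depths are hypothetical, and the points are assumed to be distinct tuples.

```python
import random

def random_cut(points, rng):
    """Pick one random cut (dimension, value) for a list of equal-length tuples."""
    dims = range(len(points[0]))
    lo = [min(p[i] for p in points) for i in dims]
    hi = [max(p[i] for p in points) for i in dims]
    spans = [h - l for l, h in zip(lo, hi)]
    # Dimension chosen with probability proportional to its range.
    dim = rng.choices(list(dims), weights=spans, k=1)[0]
    # Cut value drawn uniformly within that dimension's range.
    value = rng.uniform(lo[dim], hi[dim])
    return dim, value

def build_tree(points, rng):
    """Recursively partition distinct points until every leaf holds one point."""
    if len(points) == 1:
        return {"leaf": points[0]}
    while True:
        dim, value = random_cut(points, rng)
        left = [p for p in points if p[dim] <= value]
        right = [p for p in points if p[dim] > value]
        if left and right:  # retry the rare cut that lands exactly on the maximum
            break
    return {"dim": dim, "value": value,
            "left": build_tree(left, rng),
            "right": build_tree(right, rng)}

def leaf_depths(node, depth=0):
    """Map every point to the depth of its leaf."""
    if "leaf" in node:
        return {node["leaf"]: depth}
    out = leaf_depths(node["left"], depth + 1)
    out.update(leaf_depths(node["right"], depth + 1))
    return out

rng = random.Random(0)  # hypothetical seed, for repeatability only
pts = [(0.0, 0.0), (1.0, 0.5), (0.2, 3.0), (40.0, 40.0)]
tree = build_tree(pts, rng)
for point, depth in leaf_depths(tree).items():
    print(point, depth)
```

Because wide dimensions are cut more often, a point that sits far from the rest tends to be separated by an early cut and so ends up near the root.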
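2. Definition of an anomaly:
2.1 What is the detection idea?
The authors look at the problem from the perspective of model complexity: if adding a point makes the model's complexity increase significantly, that point is an anomaly.
2.2 How is complexity defined?
Complexity is defined through tree depth: compute the depth of every point and sum them to get the complexity of the tree; for a forest, average over the number of trees. The change in complexity caused by one point is then the total change in depth of all the other points, as in the following example: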
import numpy as np
import rrcf

# Seed tree with zero-mean, normally distributed data
np.random.seed(1)
X = np.random.randn(10, 2)
tree = rrcf.RCTree(X)
# Generate an inlier and outlier point
inlier = np.array([0, 0])
outlier = np.array([40, 40])
# Insert into tree
tree.insert_point(inlier, index='inlier')
tree.insert_point(outlier, index='outlier')
# Collusive displacement (CoDisp) score of each inserted point
print(tree.codisp('inlier'))
print(tree.codisp('outlier'))
# Print a text drawing of the tree
print(tree)
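Because the outlier lies far from everything else, the cut that separates it almost always ends up at the root, so inserting it pushes every other point one level deeper. The tree holds 11 other points (the 10 seed points plus the inlier), so the complexity changes by 11.
If there were 21 normal points, the score would be 21.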
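The arithmetic behind this score can be checked on a hand-built tree. The sketch below does not use rrcf: it represents a tree as nested 2-tuples (a hypothetical representation chosen for illustration), takes the sum of leaf depths as the complexity, and assumes the outlier is split off at the root.

```python
def leaf_depths(tree, depth=0):
    """Map leaf label -> depth for a tree written as nested 2-tuples."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    left, right = tree
    out = leaf_depths(left, depth + 1)
    out.update(leaf_depths(right, depth + 1))
    return out

def balanced(labels):
    """Build an arbitrary (roughly balanced) tree over the given labels."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (balanced(labels[:mid]), balanced(labels[mid:]))

# 11 normal points, as in the example: 10 seed points plus the inlier.
normal = balanced([f"p{i}" for i in range(11)])
# Assumption: the outlier is separated from all other points at the root.
with_outlier = (normal, "outlier")

before = leaf_depths(normal)
after = leaf_depths(with_outlier)
# Total change in depth of all the other points.
displacement = sum(after[label] - before[label] for label in before)
print(displacement)  # 11
```

Whatever shape the subtree of normal points has, each of its leaves moves down by exactly one level, so the displacement equals the number of normal points.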
II. Usage examples
Case 1:
import numpy as np
import pandas as pd
import rrcf

# Set sample parameters
np.random.seed(0)
n = 2010
d = 3
# Generate data: two clusters at x = +5 and x = -5;
# the last 10 points stay near the origin and act as outliers
X = np.zeros((n, d))
X[:1000, 0] = 5
X[1000:2000, 0] = -5
X += 0.01*np.random.randn(*X.shape)
# Set forest parameters
num_trees = 100
tree_size = 256
sample_size_range = (n // tree_size, tree_size)
# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly
    ixs = np.random.choice(n, size=sample_size_range,
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix)
             for ix in ixs]
    forest.extend(trees)
# Compute average CoDisp
avg_codisp = pd.Series(0.0, index=np.arange(n))
index = np.zeros(n)  # number of trees each point was sampled into
for tree in forest:
    codisp = pd.Series({leaf: tree.codisp(leaf)
                        for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)
avg_codisp /= index
# Once the scores are computed, plot them:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
import matplotlib.pyplot as plt

# Score of the 10th highest-scoring point
threshold = avg_codisp.nlargest(n=10).min()
fig = plt.figure(figsize=(12, 4.5))
ax = fig.add_subplot(121, projection='3d')
sc = ax.scatter(X[:, 0], X[:, 1], X[:, 2],
                c=np.log(avg_codisp.sort_index().values),
                cmap='gnuplot2')
plt.title('log(CoDisp)')
ax = fig.add_subplot(122, projection='3d')
sc = ax.scatter(X[:, 0], X[:, 1], X[:, 2],
                linewidths=0.1, edgecolors='k',
                c=(avg_codisp >= threshold).astype(float),
                cmap='cool')
plt.title('CoDisp above 99.5th percentile')
plt.show()
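In Case 1 each tree only sees a random subset of the points, so a point's score has to be averaged over the trees that actually contain it, and the top-scoring points are then flagged. The sketch below shows that bookkeeping with plain dictionaries instead of pandas; the function names and the per-tree scores are made up for illustration and are not rrcf output.

```python
from collections import defaultdict

def average_scores(per_tree_scores):
    """Average each point's score over the trees that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_tree_scores:
        for idx, score in scores.items():
            totals[idx] += score
            counts[idx] += 1
    return {idx: totals[idx] / counts[idx] for idx in totals}

def top_k(avg, k):
    """Return the k point indices with the highest average score."""
    return sorted(avg, key=avg.get, reverse=True)[:k]

# Hypothetical scores: three trees, each holding a different subset of points 0..3.
per_tree = [
    {0: 1.0, 1: 2.0, 3: 30.0},
    {1: 4.0, 2: 1.0, 3: 50.0},
    {0: 3.0, 2: 2.0},
]
avg = average_scores(per_tree)
print(avg[3])         # 40.0
print(top_k(avg, 1))  # [3]
```

Dividing by the per-point count rather than by the total number of trees keeps a point that was sampled only a few times from looking artificially normal.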
Case 2:
import numpy as np
import rrcf

# Generate data: a sine wave with a flat anomaly injected at t = 235..254
n = 730
A = 50
center = 100
phi = 30
T = 2*np.pi/100
t = np.arange(n)
sin = A*np.sin(T*t - phi*T) + center
sin[235:255] = 80
# Set tree parameters
num_trees = 40
shingle_size = 4
tree_size = 256
# Create a forest of empty trees
forest = []
for _ in range(num_trees):
    tree = rrcf.RCTree()
    forest.append(tree)
# Use the "shingle" generator to create rolling window
points = rrcf.shingle(sin, size=shingle_size)
# Create a dict to store anomaly score of each point
avg_codisp = {}
# For each shingle...
for index, point in enumerate(points):
    # For each tree in the forest...
    for tree in forest:
        # If the tree is full, drop the oldest point (FIFO).
        # Using >= rather than > keeps the tree at tree_size leaves
        # and makes sure index 0 is also forgotten.
        if len(tree.leaves) >= tree_size:
            tree.forget_point(index - tree_size)
        # Insert the new point into the tree
        tree.insert_point(point, index=index)
        # Compute codisp on the new point and take the average among all trees
        if index not in avg_codisp:
            avg_codisp[index] = 0
        avg_codisp[index] += tree.codisp(index) / num_trees
# Plot the results
import pandas as pd
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots(figsize=(10, 5))
color = 'tab:red'
ax1.set_ylabel('Data', color=color, size=14)
ax1.plot(sin, color=color)
ax1.tick_params(axis='y', labelcolor=color, labelsize=12)
ax1.set_ylim(0, 160)
ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('CoDisp', color=color, size=14)
ax2.plot(pd.Series(avg_codisp).sort_index(), color=color)
ax2.tick_params(axis='y', labelcolor=color, labelsize=12)
# grid('off') is a truthy string and would switch the grid on; pass False instead
ax2.grid(False)
ax2.set_ylim(0, 160)
plt.title('Sine wave with injected anomaly (red) and anomaly score (blue)', size=14)
plt.show()
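Case 2 feeds the forest "shingles", that is, rolling windows of consecutive values, so each point inserted into a tree describes a short stretch of the series rather than a single reading. The generator below is a minimal stand-in written with the standard library to show the idea; it is not rrcf.shingle itself (which, as used above, takes a sequence and a size), and it yields tuples for readability.

```python
from collections import deque
from itertools import islice

def shingle(sequence, size):
    """Yield each run of `size` consecutive values as a tuple."""
    it = iter(sequence)
    window = deque(islice(it, size), maxlen=size)
    if len(window) < size:
        return  # sequence shorter than one window: yield nothing
    yield tuple(window)
    for value in it:
        window.append(value)  # maxlen drops the oldest value automatically
        yield tuple(window)

series = [1, 2, 3, 80, 5, 6]
windows = list(shingle(series, size=4))
print(windows)
# [(1, 2, 3, 80), (2, 3, 80, 5), (3, 80, 5, 6)]
```

A series of n values produces n - size + 1 windows, so the 730-sample sine wave with shingle_size = 4 gives 727 points to insert.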