Unsupervised Algorithms in Anti-Fraud, Part 1: Isolation Forest, Principles and Implementation

鲸鱼在dn

Last modified: 2023-12-12 12:00:37

First published: 2023-11-21 17:30:10

Article link: https://blog.csdn.net/qq_41697157/article/details/134533832

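For the business background of ad anti-fraud and the data commonly used, see my previous post, "Sharing Approaches to Ad Anti-Fraud" (广告反作弊思路分享). This post covers the unsupervised algorithm involved: its principles, implementation, and evaluation.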

Reference book: 《智能风控：原理、算法与工程实践》 (Intelligent Risk Control: Principles, Algorithms, and Engineering Practice)

1. Algorithm Principles

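Isolation Forest (IF) is an ensemble algorithm based on random partitioning of the feature space. It grows many binary trees in parallel and averages their outputs. In a conventional decision tree, the feature and split value at each level are chosen by optimizing a criterion such as minimum mean squared error, and the tree is built by repeating this step. In each isolation tree (iTree) of IF, by contrast, both the feature and the split value are picked completely at random from the data.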

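The procedure can be summarized as follows:
1) Randomly draw a subset of samples from the sample space, and randomly pick one feature from the feature space;
2) Randomly pick a value of that feature, between its minimum and maximum in the current node, as the split point (the threshold);
3) Split the node: samples less than or equal to the threshold go to the left branch, samples greater than it go to the right branch;
4) Repeat the steps above until the data cannot be split any further (for example, all samples in the current leaf take the same value on every feature) or the tree reaches the depth limit set at the start.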
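
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any library: the names `build_itree` and `leaf_sizes`, the dictionary layout of a node, and the toy data are all assumptions made for this example. One further assumption is that the random feature is drawn only from features that still vary inside the node, so every split separates at least one sample.

```python
import numpy as np


def build_itree(X, depth, max_depth, rng):
    """Recursively grow one isolation tree on the rows of X (hypothetical helper)."""
    n = X.shape[0]
    # Step 4: stop at the depth limit or when the node cannot be split any further.
    if depth >= max_depth or n <= 1 or np.all(X == X[0]):
        return {"size": n}  # leaf: remember how many samples ended up here
    # Step 1 (feature part): pick a random feature among those that still vary here.
    varying = [j for j in range(X.shape[1]) if X[:, j].min() < X[:, j].max()]
    feature = int(rng.choice(varying))
    # Step 2: pick a random threshold between the feature's min and max in this node.
    lo, hi = X[:, feature].min(), X[:, feature].max()
    threshold = rng.uniform(lo, hi)
    # Step 3: values <= threshold go left, values > threshold go right.
    left_mask = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_itree(X[left_mask], depth + 1, max_depth, rng),
        "right": build_itree(X[~left_mask], depth + 1, max_depth, rng),
    }


def leaf_sizes(node):
    """Collect the sample counts stored in the leaves (a simple sanity check)."""
    if "feature" not in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


rng = np.random.default_rng(0)
# Toy data: four points close together and one point far away.
X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2], [8.0, 9.0]])
# Step 1 (sample part): each tree is trained on a random subset of the rows.
sample = X[rng.choice(len(X), size=4, replace=False)]
tree = build_itree(sample, depth=0, max_depth=3, rng=rng)
print(leaf_sizes(tree))
```

A forest would repeat the last three lines many times with different random subsets and keep all resulting trees.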

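Three formulas are involved:
1) Path length
The path length $h(x_i)$ of sample $x_i$ on an isolation tree, where $T$ is the number of samples in the node at which $x_i$ stops, is:
$h(x_i)=e_i+C(T)$, where $e_i$ is the number of edges $x_i$ traverses from the root of the isolation tree to that node, and $C(T)$ is a bias term.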

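2) Average path length
The bias term $C(T)$ is obtained from the following formula with $n=T$:
$C(n)=2H(n-1)-\frac{2(n-1)}{n}$, where $n$ is the number of samples at the root of the tree in question, $H(k)=\ln(k)+\varepsilon$, and $\varepsilon=0.5772156649$ is Euler's constant. The bias term is the average path length of a binary tree trained on $T$ samples.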
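
To make the two formulas concrete, here is a small sketch that evaluates $C(n)$ and the path length $e_i + C(T)$ on a hand-built tree. The function names (`harmonic`, `c`, `path_length`), the dictionary layout of the tree, and the toy values are assumptions for illustration. The special cases for $n \le 2$ are not stated in this post; they follow the convention of the original Isolation Forest paper, in which the closed form is used only for $n > 2$.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, as given in the text


def harmonic(k):
    """Approximate harmonic number H(k) = ln(k) + Euler's constant."""
    return math.log(k) + EULER_GAMMA


def c(n):
    """Average path length of a binary tree built on n samples."""
    if n <= 1:
        return 0.0  # assumed convention: nothing left to split
    if n == 2:
        return 1.0  # assumed convention: exactly one more split is needed
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def path_length(x, node, edges=0):
    """h(x) = e + C(T): edges walked so far plus the bias term of the final node."""
    if "feature" not in node:  # reached a leaf holding node["size"] samples
        return edges + c(node["size"])
    go_left = x[node["feature"]] <= node["threshold"]
    child = node["left"] if go_left else node["right"]
    return path_length(x, child, edges + 1)


# Hand-built tree with one split: three samples went left, one went right.
tree = {"feature": 0, "threshold": 5.0,
        "left": {"size": 3}, "right": {"size": 1}}

h_outlier = path_length([8.0], tree)  # 1 edge + c(1)
h_inlier = path_length([1.0], tree)   # 1 edge + c(3)
print(h_outlier, h_inlier, c(256))
```

The sample that is isolated after a single split gets the shorter path length, while the sample that lands in a leaf still holding three points is charged the extra $C(3)$.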

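3) Anomaly score
With the path length and the average path length defined, the IF anomaly score is:
$Score(x_i)=2^{-\frac{E(h(x_i))}{C(\phi)}}$
where $E(h(x_i))$ is the mean path length of $x_i$ over all isolation trees, $\phi$ is the number of training samples per isolation tree, and $C(\phi)$, the average path length of a binary tree trained on $\phi$ samples, serves as the normalization term.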

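From the way the score is computed:
The closer the score is to 1, the more anomalous the sample: its average path length across the isolation trees is shorter;
The closer the score is to 0, the more normal the sample: its average path length across the isolation trees is longer.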
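
A short numeric sketch of the score may help. The function `anomaly_score`, the per-tree path lengths, and the subsample size of 256 are made-up values for illustration, not results from real data. The helper `c` repeats the average path length formula so that the block runs on its own, with the $n \le 2$ cases assumed as in the original paper.

```python
import math


def c(n):
    """Average path length C(n); the cases n <= 2 follow an assumed convention."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, phi):
    """Score = 2 ** (-E(h) / C(phi)) for one sample, given its path length per tree."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / c(phi))


phi = 256  # hypothetical number of training samples per tree

short_paths = anomaly_score([2.0, 3.0, 2.5], phi)    # isolated after a few splits
typical = anomaly_score([c(phi)] * 3, phi)           # E(h) equal to C(phi)
long_paths = anomaly_score([18.0, 20.0, 19.0], phi)  # deep inside dense regions

print(short_paths, typical, long_paths)
```

Short paths push the score toward 1, long paths push it toward 0, and a sample whose mean path length equals $C(\phi)$ scores 0.5.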

2. Implementation

2.1 Python implementation

GitHub link: https://github.com/SilenceSengoku/IsolationFroest2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

# Read the data; the CSV must contain the columns 'Date', 'data', 'cr' and '7wr'
dataset = pd.read_csv('customers_nums.csv', engine='python')

# Settings
rs = np.random.RandomState(169)
outliers_fraction = 0.05  # assumed share of anomalies, used only to set the threshold
lendata = dataset.shape[0]
# Range that each feature is scaled into
nmlz_a = -1
nmlz_b = 1


def normalize(data, a, b):
    """Min-max scale each column of data into the range [a, b]."""
    scaler = MinMaxScaler(feature_range=(a, b))
    return scaler.fit_transform(data)


# The two features fed to the model, each scaled to [nmlz_a, nmlz_b]
x = normalize(dataset[['cr']], nmlz_a, nmlz_b)
y = normalize(dataset[['7wr']], nmlz_a, nmlz_b)

# max_samples=lendata: every tree is trained on all rows
# max_features=2: every tree sees both features
ifm = IsolationForest(n_estimators=100, verbose=2, n_jobs=2,
                      max_samples=lendata, random_state=rs, max_features=2)

if __name__ == '__main__':
    Iso_train_dt = np.column_stack((x, y))
    ifm.fit(Iso_train_dt)
    # decision_function: the lower the value, the more abnormal the sample
    # (the opposite direction to the Score formula in section 1)
    scores_pred = ifm.decision_function(Iso_train_dt)

    # Threshold = 5th percentile of the scores, so the lowest 5% are flagged
    # (treating anomalies as a 5% small-probability event); also used for plotting
    threshold = np.percentile(scores_pred, 100 * outliers_fraction)

    # Evaluate the decision function on a 50 x 50 grid to draw the regions
    xx, yy = np.meshgrid(np.linspace(nmlz_a, nmlz_b, 50), np.linspace(nmlz_a, nmlz_b, 50))
    Z = ifm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.title("IsolationForest")
    # Number of contour levels in the anomalous region (contourf needs at least 2)
    otl_proportion = max(int(outliers_fraction * lendata), 2)
    # Anomalous region: values from the minimum up to the threshold
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, otl_proportion), cmap=plt.cm.hot)
    # Boundary between the anomalous and the normal region
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # Normal region: values from the threshold up to the maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='palevioletred')

    # Label 1 = anomaly (score at or below the threshold), 0 = normal
    test_data = (scores_pred <= threshold).astype(int)
    anomaly = scores_pred[test_data == 1]

    # Build the result table column by column so that numeric columns stay numeric
    df = pd.DataFrame({'Date': dataset['Date'], 'data': dataset['data'],
                       'x': x.ravel(), 'y': y.ravel(),
                       'IsoFst_Score': scores_pred, 'label': test_data})

    b = plt.scatter(df['x'][df['label'] == 0], df['y'][df['label'] == 0], s=20, edgecolor='k', c='white')
    c = plt.scatter(df['x'][df['label'] == 1], df['y'][df['label'] == 1], s=20, edgecolor='k', c='black')
    df.to_csv('Iso_list.csv')

    plt.axis('tight')
    plt.xlim((nmlz_a, nmlz_b))
    plt.ylim((nmlz_a, nmlz_b))
    # legend_elements() is used because a.collections is deprecated in recent Matplotlib
    plt.legend([a.legend_elements()[0][0], b, c],
               ['learned decision function', 'predicted inliers', 'predicted outliers'],
               loc="upper left")
    print("Isolation forest threshold:", threshold)
    print("Total number of samples:", len(dataset))
    print("Number of anomalies detected:", len(anomaly))
    plt.show()


2.1.2 scikit-learn parameters

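class sklearn.ensemble.IsolationForest(*,
                                       n_estimators=100,
                                       max_samples='auto',
                                       contamination='auto',
                                       max_features=1.0,
                                       bootstrap=False,
                                       n_jobs=None,
                                       behaviour='deprecated',
                                       random_state=None,
                                       verbose=0,
                                       warm_start=False)

Note: this signature matches older scikit-learn releases. The behaviour parameter was already deprecated there, and newer releases no longer accept it.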

For the source code and parameter examples, see https://scikit-learn.org.cn/view/631.html
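
As a quick illustration of the main parameters, the sketch below fits the model on synthetic data. The data, the seed, and the choice of contamination=0.05 are assumptions for this example and are not taken from the post's dataset. It uses only `fit_predict` and `score_samples`, and it leaves out the behaviour parameter so that it also runs on newer scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Synthetic data: 200 points clustered near the origin plus 10 scattered points.
X_normal = 0.3 * rng.randn(200, 2)
X_scattered = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_normal, X_scattered])

model = IsolationForest(
    n_estimators=100,      # number of isolation trees
    max_samples='auto',    # samples drawn per tree: min(256, n_samples)
    contamination=0.05,    # expected share of anomalies; sets the decision threshold
    max_features=1.0,      # share of features drawn per tree
    random_state=42,
)

labels = model.fit_predict(X)    # 1 = inlier, -1 = outlier
scores = model.score_samples(X)  # the lower the value, the more abnormal

print("flagged as outliers:", int((labels == -1).sum()))
print("lowest score:", scores.min())
```

With contamination set to a float, roughly that share of the training samples ends up labelled -1; with 'auto', the threshold is fixed by the library instead of by an expected anomaly rate.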