CellPAD (Machine-Learning Anomaly Detection) Code Walkthrough: Notes for My Own Review

hxtyy

Article link: https://blog.csdn.net/hxtyy/article/details/119209254

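First, the GitHub link to the paper's code: https://github.com/XuJiaxin000/CellPAD#cellpad-detecting-performance-anomalies-in-cellular-networks-via-regression-analysis

I have recently been studying regression-based anomaly detection built on statistical modeling and machine learning. I had already put together a basic framework, but it was judged too crude, so I am going back over the CellPAD code. This post covers only Sudden_Drop anomaly detection.

Start by opening example_drop.py. The test() function is easy to follow: read the KPI, inject anomalies, detect Sudden_Drop anomalies, and evaluate performance. Reading the KPI is not covered here.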

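Injecting anomalies:

Anomaly injection uses syn_drop() of the DropSynthesiser class. point_fraction is the fraction of point anomalies in the dataset, lowest_drop_ratio is the minimum ratio by which an anomalous value suddenly drops, segment_cnt is the number of contiguous (segment) anomalies, and shortest_period and longest_period are the shortest and longest lengths of a segment anomaly.

add_point_anomalies() injects point anomalies, add_segment_anomalies() injects contiguous anomalies, and filter_by_rule() labels the obvious (sudden drop) anomalies already present in the dataset.

With that, the anomaly injection step is complete.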
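
The injection step can be sketched roughly as follows. This is a minimal illustration, not the DropSynthesiser code: the function names `inject_point_drops` and `inject_segment_drops`, the uniform choice of drop ratio, and the way positions are sampled are all assumptions; only the parameter meanings follow the description above.

```python
import random

def inject_point_drops(series, point_fraction, lowest_drop_ratio, rng):
    """Scale a random fraction of points down by at least lowest_drop_ratio."""
    out = list(series)
    labels = [False] * len(out)
    n_points = int(len(out) * point_fraction)
    for idx in rng.sample(range(len(out)), n_points):
        # Assumed: the drop ratio is uniform between the minimum and 100%.
        ratio = rng.uniform(lowest_drop_ratio, 1.0)
        out[idx] = out[idx] * (1.0 - ratio)
        labels[idx] = True
    return out, labels

def inject_segment_drops(series, labels, segment_cnt, shortest_period,
                         longest_period, lowest_drop_ratio, rng):
    """Scale segment_cnt contiguous runs down; each run has a random length."""
    out = list(series)
    labels = list(labels)
    for _ in range(segment_cnt):
        length = rng.randint(shortest_period, longest_period)
        start = rng.randrange(0, len(out) - length + 1)
        ratio = rng.uniform(lowest_drop_ratio, 1.0)
        for idx in range(start, start + length):
            out[idx] = out[idx] * (1.0 - ratio)
            labels[idx] = True
    return out, labels

rng = random.Random(0)
series = [100.0] * 336  # two weeks of hourly data, all "normal"
syn_series, syn_labels = inject_point_drops(series, 0.05, 0.3, rng)
syn_series, syn_labels = inject_segment_drops(syn_series, syn_labels,
                                              2, 3, 6, 0.3, rng)
```

The returned labels play the role of syn_labels in the evaluation step: they are the ground truth that the detector's scores are compared against.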

Training the model:

This is the key to the whole paper, and also the hardest part of the code.

    controller = DropController(timestamps=timestamps,
                                series=syn_series,
                                period_len=168,
                                feature_types=["Indexical", "Numerical"],
                                feature_time_grain=["Weekly"],
                                feature_operations=["Wma", "Ewma", "Mean", "Median"],
                                bootstrap_period_cnt=2,
                                to_remove_trend=True,
                                trend_remove_method="center_mean",
                                anomaly_filter_method="gauss",
                                anomaly_filter_coefficient=3.0)

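timestamps and series need no explanation: they are the timestamps and the KPI series (here syn_series, the series with injected anomalies). period_len is the period of the time series. The remaining parameters are analyzed in detail below:

In controller.py, the raw data is preprocessed first: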

        # the dict of the attributes of the time series.
        dict_series = {}
        if not to_remove_trend:
            dict_series["detected_series"] = np.array(series)
        else:
            preprocessor = Preprocessor()
            dict_series["detected_series"] = preprocessor.remove_trend(series, period_len, method=trend_remove_method)
        dict_series["timestamps"] = timestamps
        dict_series["series_len"] = len(dict_series["detected_series"])
        dict_series["period_len"] = period_len
        self.dict_series = dict_series

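to_remove_trend controls whether the trend is removed, and trend_remove_method selects how. The trend can be estimated with center_mean or past_mean, and it can be removed either by division or by subtraction. center_mean takes the average of one period of data spanning both sides of the point, while past_mean takes the average of the previous period.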
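
The two trend estimates can be illustrated with a small sketch. This is not the Preprocessor.remove_trend() implementation: the function name `remove_trend_sketch`, the `mode` argument, and the way windows are clipped at the edges of the series are assumptions made for illustration.

```python
def remove_trend_sketch(series, period_len, method="center_mean", mode="divide"):
    """Estimate a local trend with a one-period window and remove it."""
    n = len(series)
    half = period_len // 2
    detrended = []
    for i in range(n):
        if method == "center_mean":
            # One period of data centered on point i (clipped at the edges).
            lo, hi = max(0, i - half), min(n, max(i + half, i + 1))
        elif method == "past_mean":
            # The period just before point i; the first point uses itself.
            lo, hi = max(0, i - period_len), max(1, i)
        else:
            raise ValueError("unknown method: " + method)
        window = series[lo:hi]
        trend = sum(window) / len(window)
        if mode == "divide":
            detrended.append(series[i] / trend)  # assumes a nonzero trend
        else:
            detrended.append(series[i] - trend)
    return detrended

flat = remove_trend_sketch([5.0] * 20, 4)
ramp = remove_trend_sketch([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2,
                           method="past_mean", mode="subtract")
```

With division, a series that simply follows its trend ends up close to 1.0 everywhere; with subtraction it ends up close to 0.0. Either way a slow drift no longer looks like a drop to the later stages.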

Next, feature engineering sets up the features and creates the feature dictionary dict_feature:

        # the dict of features related variables.
        dict_feature = {}
        dict_feature["operations"] = feature_operations
        dict_feature["time_grain"] = feature_time_grain
        dict_feature["feature_types"] = feature_types
        dict_feature["feature_tool"] = FeatureTools()
        dict_feature["feature_list"] = dict_feature["feature_tool"].set_feature_names(
            dict_feature["feature_types"],
            dict_feature["time_grain"],
            dict_feature["operations"])

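feature_operations lists the operations used in feature engineering, i.e. how the data is processed: Mean is the average, Wma the weighted moving average, Ewma the exponentially weighted moving average, and Median the median.

feature_time_grain is the time granularity of the features. Time granularity is the smallest unit at which time is handled; put more concretely, it is the step at which data is extracted, that is, the sampling frequency of the data.

feature_types is the kind of features, either Indexical or Numerical. Indexical features are extracted from the index, i.e. the timestamps; Numerical features are extracted from the values, i.e. the KPI.

The FeatureTools() class is in feature.py, so let us go through it.

Go straight to set_feature_names():

If the feature type is Indexical, the timestamps are processed. With feature_time_grain=Weekly, data is taken once a week, so a week of data has both day and hour features; with feature_time_grain=Day, data is taken once a day, so a day of data has only the hour feature. The feature names are stored in feature_list.

If the feature type is Numerical, the KPI is processed. With operation=Raw the KPI is used directly as a feature without processing; otherwise a feature named win_(feature_time_grain)_operation is built from the window length win, feature_time_grain and operation. For example, 2_Weekly_Mean is the mean of 2 values taken at a weekly time grain.

In this post feature_types=["Indexical", "Numerical"], feature_time_grain=["Weekly"] and feature_operations=["Wma", "Ewma", "Mean", "Median"]. The extracted features are Hour, Day, and [3,5,7,10,13]_Weekly_["Wma","Ewma","Mean","Median"].

feature_list is built up in this way.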
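
The naming scheme can be reproduced with a short sketch. The function `build_feature_names_sketch` is hypothetical and is not the set_feature_names() code from feature.py; the window lengths (3, 5, 7, 10, 13) are taken from the feature list quoted in this post, and the ordering of the names is an assumption.

```python
def build_feature_names_sketch(feature_types, time_grains, operations,
                               windows=(3, 5, 7, 10, 13)):
    """Build feature names such as 'Hour', 'Day' and '3_Weekly_Mean'."""
    names = []
    if "Indexical" in feature_types:
        for grain in time_grains:
            # A weekly grain yields hour and day; a daily grain only hour.
            wanted = ["Hour", "Day"] if grain == "Weekly" else ["Hour"]
            for name in wanted:
                if name not in names:
                    names.append(name)
    if "Numerical" in feature_types:
        for grain in time_grains:
            for op in operations:
                if op == "Raw":
                    if "Raw" not in names:
                        names.append("Raw")  # the raw KPI, no window
                    continue
                for win in windows:
                    names.append("{}_{}_{}".format(win, grain, op))
    return names

feature_list = build_feature_names_sketch(
    ["Indexical", "Numerical"], ["Weekly"], ["Wma", "Ewma", "Mean", "Median"])
```

For the configuration used in this post that gives 2 indexical names plus 5 windows times 4 operations, so 22 features in total.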

Next, a bootstrap training set is taken from the dataset and stored in the dict_bootstrap dictionary. (I am not sure exactly what it is for.)

        # the dict of the bootstrap parameters.
        dict_bootstrap = {}
        dict_bootstrap["period_cnt"] = bootstrap_period_cnt
        dict_bootstrap["bootstrap_series_len"] = bootstrap_period_cnt * period_len
        dict_bootstrap["bootstrap_series"] = self.dict_series["detected_series"][:dict_bootstrap["bootstrap_series_len"]]
        self.dict_bootstrap = dict_bootstrap

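Here bootstrap_period_cnt is the number of periods used for bootstrap training, so the total amount of data is bootstrap_period_cnt * period_len, taken from the start of the dataset.

Then the dictionary for anomaly filtering is created.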

        # the dict of anomaly filter parameters.
        dict_filter = {}
        dict_filter["method"] = anomaly_filter_method
        dict_filter["coefficient"] = anomaly_filter_coefficient
        self.dict_filter = dict_filter

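anomaly_filter_method is the filtering method. In this post it is the Gaussian method; its implementation is covered later.

anomaly_filter_coefficient is the boundary of the confidence interval (3.0 here).

Finally, the dictionary that stores the training data is created.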

        # the dict of the storage for training data.
        dict_storage = {}
        dict_storage["normal_features_matrix"] = pd.DataFrame()
        dict_storage["normal_response_series"] = []
        self.dict_storage = dict_storage

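Here normal_features_matrix is the feature matrix of the normal data, used as input, and normal_response_series is the output for the normal data. In regression-based anomaly detection the output is the corresponding KPI value.

At this point the initialization of DropController() is complete.

Next comes the anomaly detection itself.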

controller.detect(predictor="RF")

Go back to controller.py and find detect().

This post uses the random forest (RF) as the example, so detect() goes on into __detect_by_regression().

        if predictor == "RT" or predictor == "RF" or predictor == "SLR" or predictor == "HR":
            self.__detect_by_regression(predictor=predictor)

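__detect_by_regression() has a default parameter n_estimators, the number of trees in the forest, set to 100.

First it calls self.__init_bootstrap() to initialize the bootstrap training set. That function creates the dict_result dictionary, which stores drop_ratios, drop_scores, drop_labels and predicted_series.

Next it calls RegressionPredictor(predictor) to build the regression model, model. This is defined in algorithm.py, where self.reg = RandomForestRegressor(n_estimators=100, criterion="mse") creates the regressor with 100 trees (n_estimators) and mean squared error (mse) as the criterion. Note that recent scikit-learn releases name this criterion "squared_error" instead of "mse".

Then the features are extracted. Below is the feature extraction for the bootstrap training set: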

first_train_features = self.dict_feature["feature_tool"].compute_feature_matrix(
                                     timestamps=self.dict_series["timestamps"],
                                     series=self.dict_bootstrap["bootstrap_series"],
                                     labels=[False] * self.dict_bootstrap["bootstrap_series_len"],
                                     ts_period_len=self.dict_series["period_len"],
                                     feature_list=self.dict_feature["feature_list"],
                                     start_pos=0,
                                     end_pos=self.dict_bootstrap["bootstrap_series_len"])

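start_pos is the start position, 0, and end_pos is the end position, which equals the length of the training series and therefore points one past the last data point. This call introduces self.dict_feature["feature_tool"].compute_feature_matrix(), defined in feature.py, which extracts the feature matrix.

A whole bootstrap set is fed in at once here, i.e. two periods or 168*2 data points. Control passes to compute_features() in the FeatureExtractor class, which iterates over feature_list and calls self.compute_one_feature(feature_name, start_pos, end_pos) to extract each feature vector. Inside that function:

If feature_name is Hour or Day, the corresponding part of the timestamp is added to feature_values and returned as a feature.

If feature_name is Raw, the raw KPI is returned as a feature.

If feature_name has the form 3_Weekly_Mean, the function iterates over all data points and calls the following: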

feature_period_len = self.compute_feature_period_len(period_grain) 
vs = self.get_sametime_instances(current_index=idx,
                                 feature_period_len=feature_period_len,
                                 ts_period_len=self.ts_period_len,
                                 instance_count=win)

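Inside self.compute_feature_period_len(period_grain), the variable time_delta is the time difference between adjacent data points, and weekly_time_delta/time_delta is 7*24 for hourly data. The function returns feature_period_len, the number of data points in one time grain.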
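
A minimal sketch of this calculation, using only the standard library. The name `feature_period_len_sketch` and the `GRAIN_DELTAS` mapping are hypothetical, and the code assumes evenly spaced timestamps.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a time grain name to its duration.
GRAIN_DELTAS = {"Weekly": timedelta(weeks=1), "Day": timedelta(days=1)}

def feature_period_len_sketch(timestamps, period_grain):
    """Return how many samples fit into one time grain."""
    time_delta = timestamps[1] - timestamps[0]  # spacing of adjacent samples
    return GRAIN_DELTAS[period_grain] // time_delta

start = datetime(2021, 1, 4)
hourly = [start + timedelta(hours=h) for h in range(48)]
weekly_len = feature_period_len_sketch(hourly, "Weekly")
daily_len = feature_period_len_sketch(hourly, "Day")
```

For hourly data this gives 168 samples per week and 24 per day, which matches the 7*24 mentioned above.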

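Inside self.get_sametime_instances(), current_index is the index of the current data point, feature_period_len is the number of data points in one time grain (168), ts_period_len is the period of the time series (168), and instance_count is the sliding-window length, one of [3,5,7,10,13].

As I understand it, there is an assumption here: the value collected at time t+period reflects the data at time t.

Since the period of the series is 168, the first 168 records of the dataset are effectively unusable: pos is set to the current index minus 168, and when pos<0 the function returns 0, so the first 168 data points all yield 0. For later points, after stepping back one period, one value is taken every time grain until the sliding-window count is reached or the data runs out.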
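
The look-back described above can be sketched as follows. This is an illustration based on the description in this post, not the code from feature.py: the name `sametime_instances_sketch` is hypothetical, and returning `[0]` when there is no history is an assumption about what "returns 0" means.

```python
def sametime_instances_sketch(series, current_index, feature_period_len,
                              ts_period_len, instance_count):
    """Collect values at the same position in earlier periods."""
    pos = current_index - ts_period_len  # step back one full period first
    if pos < 0:
        return [0]  # no history yet: the first period yields 0
    values = []
    while pos >= 0 and len(values) < instance_count:
        values.append(series[pos])
        pos -= feature_period_len  # one time grain further back
    return values

series = [float(v) for v in range(1000)]
vs = sametime_instances_sketch(series, 600, 168, 168, 3)
mean_feature = sum(vs) / len(vs)  # plays the role of 3_Weekly_Mean at index 600
short_history = sametime_instances_sketch(series, 400, 168, 168, 5)
no_history = sametime_instances_sketch(series, 100, 168, 168, 3)
```

At index 600 the three instances come from indices 432, 264 and 96. At index 400 only two earlier periods exist, so the window is shorter than requested.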

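Back in self.compute_one_feature(feature_name, start_pos, end_pos), the operation is applied to obtain the feature, and the feature values feature_values are returned.

The returned feature values are assigned to first_train_features as the model input, and dict_bootstrap["bootstrap_series"] is assigned to first_train_response as the model output. Finally the feature matrix and the output are stored in self.dict_storage["normal_features_matrix"] and self.dict_storage["normal_response_series"] respectively.

Then the model is trained with model.train.

round_cnt is set to the number of periods in the time series (rounded to an integer). For each period, the feature engineering above is repeated to obtain the inputs and outputs for the model.

For these inputs, the RF regression model predicts the output this_predicted_series, while the actual output this_practical_series is the original observed values.

Anomaly detection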

this_drop_ratios, this_drop_labels, this_drop_scores = \
                                    self.__filter_anomaly(predicted_series=this_predicted_series,
                                                        practical_series=this_practical_series)

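Here self.__filter_anomaly() is called to test the model.

predicted_series is the model's prediction, i.e. the expected output, and practical_series is the actual observed output. __filter_anomaly() looks like this: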

anomaly_filter = DropAnomalyFilter(rule=self.dict_filter["method"],
                                           coef=self.dict_filter["coefficient"])
drop_ratios, drop_labels, drop_scores = \
            anomaly_filter.detect_anomaly(predicted_series=predicted_series,
                                          practical_series=practical_series)
return drop_ratios, drop_labels, drop_scores

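Here anomaly_filter is initialized as an instance of the DropAnomalyFilter class, which is defined in filter.py.

From the earlier code, rule=gauss and coef=3.0.

Moving on into detect_anomaly(): it iterates over practical_series and computes the drop ratio dp of practical_series relative to predicted_series. It then iterates over the drop ratios: if dp is greater than or equal to 0 the anomaly score is 0, otherwise the score is the negation of dp. After that it calls self.filter_anomaly(drop_ratios) to produce the anomaly labels.

Inside filter_anomaly(), the drop ratios are standardized using their mean and standard deviation (the square root of the variance), and self.filter_by_threshold(drop_ratios, threshold) is called to produce the anomaly labels.

Inside filter_by_threshold(), a point is labeled anomalous if its standardized drop ratio exceeds the threshold and normal otherwise. The threshold is coef=3.0.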
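
The scoring and Gaussian filtering can be sketched as below. This is not the DropAnomalyFilter code: the function name `detect_drop_sketch`, the exact drop ratio formula (real - predicted) / predicted, the use of the population standard deviation, and the one-sided test (only unusually negative ratios are flagged) are assumptions chosen to match the description above.

```python
import statistics

def detect_drop_sketch(predicted_series, practical_series, coef=3.0):
    """Return drop ratios, anomaly labels and anomaly scores."""
    # Negative ratio means the observed value fell below the prediction.
    drop_ratios = [(real - pred) / pred
                   for pred, real in zip(predicted_series, practical_series)]
    # Score is the size of the drop; rises above the prediction score 0.
    drop_scores = [-dp if dp < 0 else 0.0 for dp in drop_ratios]
    mu = statistics.mean(drop_ratios)
    sigma = statistics.pstdev(drop_ratios)
    if sigma == 0:
        return drop_ratios, [False] * len(drop_ratios), drop_scores
    # Flag points whose standardized ratio is more than coef sigmas below mean.
    drop_labels = [(dp - mu) / sigma < -coef for dp in drop_ratios]
    return drop_ratios, drop_labels, drop_scores

predicted = [100.0] * 50
practical = [100.0] * 49 + [20.0]  # one sudden drop at the end
ratios, labels, scores = detect_drop_sketch(predicted, practical)
```

One consequence of standardizing within a batch: in a batch of n points a single outlier can reach at most about sqrt(n - 1) standard deviations, so with coef=3.0 very short batches can never be flagged.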

 self.__store_this_results(this_predicted_series, this_drop_ratios,
                                    this_drop_labels, this_drop_scores)
 self.__store_features_response(this_features_matrix=this_predicted_features,
                                this_response_series=this_practical_series,
                                this_labels=this_drop_labels)

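The results are stored and dict_result is updated: the newest results are appended at the end, so the later an entry sits in the dictionary, the better trained the model that produced it. dict_storage is updated at the same time.

The model is then trained again on the updated dict_storage, and iterating in this way yields better results.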
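
The overall predict, filter, store, retrain loop might look like the sketch below. Everything in it is a stand-in: `SlotMeanModel` replaces the random forest with a toy predictor, `is_anomaly` is a hypothetical rule in place of the Gaussian filter, and the choice to add only points labeled normal to the training storage is an assumption suggested by the names normal_features_matrix and this_labels, not something confirmed from the source.

```python
class SlotMeanModel:
    """Toy regressor: predicts the mean response seen for each feature value."""
    def train(self, features, responses):
        sums = {}
        for f, r in zip(features, responses):
            total, count = sums.get(f, (0.0, 0))
            sums[f] = (total + r, count + 1)
        self.means = {f: total / count for f, (total, count) in sums.items()}

    def predict(self, features):
        return [self.means[f] for f in features]

def detect_rounds_sketch(series, period_len, bootstrap_period_cnt, is_anomaly):
    boot_len = bootstrap_period_cnt * period_len
    features = [i % period_len for i in range(len(series))]  # slot in period
    normal_features = features[:boot_len]
    normal_responses = list(series[:boot_len])
    labels = [False] * boot_len  # bootstrap data is treated as normal
    model = SlotMeanModel()
    model.train(normal_features, normal_responses)
    for r in range(bootstrap_period_cnt, len(series) // period_len):
        start, end = r * period_len, (r + 1) * period_len
        this_features = features[start:end]
        this_practical = series[start:end]
        this_predicted = model.predict(this_features)
        this_labels = [is_anomaly(p, y)
                       for p, y in zip(this_predicted, this_practical)]
        labels.extend(this_labels)
        # Assumed: only points judged normal join the training storage.
        for f, y, bad in zip(this_features, this_practical, this_labels):
            if not bad:
                normal_features.append(f)
                normal_responses.append(y)
        model.train(normal_features, normal_responses)  # retrain each round
    return labels

kpi = [10.0, 20.0, 30.0, 40.0] * 5
kpi[13] = 1.0  # a sudden drop in the fourth period
round_labels = detect_rounds_sketch(kpi, 4, 2, lambda p, y: y < 0.5 * p)
```

Keeping flagged points out of the storage means a detected drop does not pull later predictions downward.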

Performance evaluation

    results = controller.get_results()

    auc, prauc = evaluate(results["drop_scores"][2*168:], syn_labels[2*168:])

    print("front_mean", "auc", auc, "prauc", prauc)

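The calls above evaluate the AUC and PR-AUC of the experiment. The slice [2*168:] skips the first two periods, which are the bootstrap data.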
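
For reference, both metrics can be computed by hand on a small example. The functions `roc_auc_sketch` and `average_precision_sketch` are hypothetical helpers and may differ from what evaluate() does in the repository, in particular in how ties are handled; average precision is used here as one common estimate of the area under the precision-recall curve.

```python
def roc_auc_sketch(scores, labels):
    """Probability that a random anomaly scores higher than a random normal."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def average_precision_sketch(scores, labels):
    """Mean of the precision values at the rank of each true anomaly."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

demo_scores = [0.9, 0.1, 0.8, 0.2]
demo_labels = [True, False, False, True]
demo_auc = roc_auc_sketch(demo_scores, demo_labels)
demo_prauc = average_precision_sketch(demo_scores, demo_labels)
```

Both functions assume that the labels contain at least one anomaly and at least one normal point. PR-AUC is usually the more informative of the two when anomalies are rare, as they are here.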
