Godaddy（1/5）时间序列异常检测

WayneBin

已于 2023-04-23 02:31:32 修改

阅读量447

点赞数

文章标签：算法

于 2023-04-05 03:20:04 首次发布

本文链接：https://blog.csdn.net/weixin_35731157/article/details/129965417

版权

欢迎关注公众号：WayneData智能世界

ChangePoint简介

变点和异常值检测是时间序列分析中的重要技术，因为它们可以帮助识别数据中的显著变化或异常情况。时间序列数据通常表现出非平稳性，这意味着数据的统计属性随时间变化。这些变化可能是由于各种因素引起的，如基本趋势的变化、数据分布的变化或罕见事件或异常的发生。变点检测可以帮助识别这些变化发生的时间，并提供有关变化根本原因的见解。本文的主要目的是尽我所能识别时间序列中的所有异常情况。

据我所知，时间序列当中存在的异常值如下几种情况：

Additive -（AO）加性异常值不会持续存在。顾名思义，加性异常值只影响异常值发生的时间的观测值。因此，加性异常值对未来的预测没有影响，可以理解为是一种孤立事件。
Innovational -（IO）这种类型的异常值不仅对 X(T) 造成干扰，还会影响时间 T 之后序列中所有的观测值。可以理解为干预性异常值，一种脉冲性冲击。
Level shift -（LS）它们的作用是在异常值发生的时间点开始提高或降低系列的均值。均值的变化是突然且永久性的。可以使用阶跃函数函数来建模，其幅度等于omega参数。
Transient -（TS）这种类型的异常值在时刻 T 发生时会带有某种初始影响，然后根据衰减因子的大小，在一段时间后呈指数衰减，最终归于常态。
趋势变动 - 这种是数据模式上发生的改变，可能没有突然出现的异动，但是整体结构发生了变动。

Additive

这是一种加性的异常，更像是一过性的变化，没有对周围数据产生影响，一般情况下，这种情况一般直接删除该段数据或者用周围的数据进行填充。

Innovational

这是一种类似冲击的异常，对数据产生波动也可能是趋势的影响。图中在黄线之后数据波动幅度变大，均值也些许提升。在业务场景中，更像是公司政策亦或者政府政策产生某种持续性的影响。

Level shift

这是一种类似跳跃的水平移位异常且是多次移位，时间序列被分为4个不同均值水平的部分。每个部分的开头对于前一个时间段来说都是异常值。这些异常值不像典型的异常点，而是当前值跳跃并维持一段时间的某个水平。这种情况使得历史数据对将来预测的参考意义降低，因为该部分看起来更像来自另一个序列而不是它自己。因此，模式学习变得困难。

Transient

这种变化非常类似双11，一种营销活动产生的短时间影响，图中在时间节点11开始突增，在17左右慢慢回落到常态。很显然，如果从历史（这段数据时间范围内）中找经验很难预测到这段变化。

GoDaddy异常检测

前面我们分析了时间序列异常值的类型，与可能存在的影响，接下来我们将使用Godaddy竞赛数据进行异常检测。

比赛的背景

比赛主办方GoDaddy是全球最大的企业家服务平台。他们的使命是通过为全球超过2000万客户和所有企业家提供所需的所有帮助和工具，帮助他们在网络上成长。这个竞赛的目标是预测一个给定地区每月的微型企业密度。需要使用美国县级数据训练出一个准确的模型。

我们将从以下几个方面，进行异常检测。

基于差分的LOF - 这个是自定义的检测方法，能够一定程度上检测到LS，IO，AO，甚至TC。缺点是无法判定具体异常的类型。
chow test - 能够进行趋势改变的检测，进行结构性变动的检测。缺点是需要自定义窗口进行检测，同时也无法检测出所有的结构性变动
基于ARIMA的检测 - 优点是效率高，相对精准，但是存在使用门槛，需要进行其他的结合进行具体异常类型的判定
其他

基于差分的LOF

以下检测方法是一种自定义的异常检测方法。它使用LOF检测差分序列中的异常值并将其定位。此方法可以检测各种异常，包括IO、AO、LS、TC等。但是，定位可能存在一些不准确性，需要与其他方法结合使用以获得更精确的结果。下面是我自定义函数CusOulierDet，输入是时间序列的list，输出是，所有的断点checkpoint以及根据断点切割出来的片段segment

def CusOulierDet(ts, threshold):
    """
    This function detects outliers in a time series and splits it into multiple segments         for further processing.
    :param ts: The time series data.
    :param threshold: The threshold for the outlier score. Default is -3.5.
    :return: A list of all segments and a list of the split points.
    """
    # Calculate the differences between consecutive values in the time series
    differences = [ts[i+1] - ts[i] for i in range(len(ts)-1)]
    # Convert the differences into a two-dimensional array
    X = [[val] for val in differences]

    # Create the Local Outlier Factor (LOF) model and set hyperparameters and algorithm parameters
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05, novelty=True)

    # Train the model and get the outlier scores for each data point
    lof.fit(X)
    scores = lof.negative_outlier_factor_

    # Identify the positions of the outliers
    anomalies = []
    for i, score in enumerate(scores):
        if score < threshold:
            anomalies.append(i+1)

    # If there are no outliers, return the original time series and an empty list of split points
    if not anomalies:
        return [ts], []

    # Split the original time series into segments based on the positions of the outliers and record the split points
    segments = []
    split_points = []
    start_idx = 0
    for anomaly in anomalies:
        segments.append(ts[start_idx:anomaly])
        split_points.append(anomaly)
        start_idx = anomaly

    # Add the final segment to the list of segments
    segments.append(ts[start_idx:])

    # Return the list of segments and the list of split points
    return segments, split_points

从上面结果来看，切割变点特别是LF，如序列cfip（县ID）17055，存在两次跳跃，13219存在一次跳跃。

chow test

Chow test是一种统计检验方法，用于检验一个线性回归模型中，是否存在结构性断点或者说是参数在断点处发生了变化。在Chow test中，将数据集划分为两个或多个子样本，然后估计每个子样本中的回归模型，并计算总体回归模型与子样本回归模型之间的差异。如果总体回归模型与子样本回归模型之间存在显著差异，则表明存在结构性断点。

Chow test通常用于时间序列分析、面板数据分析等领域，可以用于检验某个时间点或者某个事件引起了模型参数的突变。它的优点是易于实现，但缺点是需要对数据进行分割，而且在断点位置不清楚时，可能会产生误判。下面构建基础的chowtest函数，输入是时间x，序列值y，以及选择检测点checkpoint，输出时检测显著性P值。

def chow_test(y, x, breakpoint):
    # Add a column for the intercept
    x = sm.add_constant(x)

    # Fit a linear regression model using OLS
    model_full = sm.OLS(y, x)
    results_full = model_full.fit()

    # Split the data into two parts
    x_left, x_right = x[:breakpoint], x[breakpoint:]
    y_left, y_right = y[:breakpoint], y[breakpoint:]

    # Fit linear regression models for each part
    model_left = sm.OLS(y_left, x_left)
    model_right = sm.OLS(y_right, x_right)

    # Compute the Chow test statistic and p-value
    chow_num = (results_full.ssr - (model_left.fit().ssr + model_right.fit().ssr)) / 2
    chow_denom = (model_left.fit().ssr + model_right.fit().ssr) / (len(y_left) + len(y_right) - 4)
    chow_stat = chow_num / chow_denom
    p_value = 1 - stsf.cdf(chow_stat, 2, len(y_left) + len(y_right) - 4)
    return p_value

定义窗口进行chow test

side_winds = 7
sample_size = 50
for cfip in np.random.choice(train.cfips.unique(), size=sample_size, replace=False):
#     cfip = 28101
    Candidate_dict = {} 
    threadhold = 0.05
    last_candi = 0
    
    df = train.loc[train.cfips==cfip,'microbusiness_density'].values
    xts = np.arange(len(df))

    for breakpoint in xts[side_winds:-side_winds]:
#         print(f'breakpoint:{breakpoint}')
        p_value = chow_test(df[breakpoint-side_winds:breakpoint+side_winds],
                            xts[breakpoint-side_winds:breakpoint+side_winds], 
                            side_winds)
#         print(f'p_value:{p_value}')
    #     print(p_value )
        if p_value<threadhold:
            set_candi_condi = (last_candi in Candidate_dict) & (0<breakpoint- last_candi< side_winds)
#             print(f'last_candi:{last_candi}')
    #         print( set_candi_condi,(last_candi in Candidate_dict),(0<breakpoint- last_candi< side_winds) )
            if set_candi_condi:
                if Candidate_dict[last_candi]>p_value:
                    Candidate_dict[breakpoint] = p_value
                    del Candidate_dict[last_candi]
                    last_candi = breakpoint
            else:
                last_candi = breakpoint
                Candidate_dict[breakpoint] = p_value
    Candidate = sorted(Candidate_dict.items(), key=lambda x: x[1])[:3]
    Candidate = sorted([x[0] for x in Candidate if x[1]<0.001])
    print('\n\n Candidate:',Candidate)
    pltObj = MyPlots(x=xts, y=df,barOrLine='line',
                         title='Detecting ouliers as change of trends ',
                         SeriselableName='microbusiness_density',
                         x_label='month', y_label='microbusiness_density',
                         main_title=f'CHOW TEST OF CFIP:{cfip}')
    min_ =  df.min()
    max_ =  df.max()
    pltObj.vlines(Candidate , ymin=min_ ,ymax= max_, linewidth=5,color='#F4B400', alpha=0.5)
    pltObj.show()

结果显示，一定程度上的检测出了结构性变动的节点，如42009，趋势由增到减。同时也发现一些问题，chowtest对于序列中的跳跃异常比较敏感，如18133，将其阶跃而造成的变动，当成一种结构或者趋势的变化。因此很有必要多种手段进行结合。

基于ARIMA的检测

这里我们将ARIMA，设置为无差分情况下的模型，避免其削弱异常的检测。只考虑AR(p)与MA(q)的情况。通过残差对比，找到异动点，进而判定异常。

sample_size = 5
for cfip in np.random.choice(train.cfips.unique(), size=sample_size, replace=False):

    df = train.loc[train.cfips==cfip,'microbusiness_density'].values

    # Fit ARIMA model
    model = ARIMA(df, order=(1, 0, 1))
    model_fit = model.fit()

    # Calculate residuals
    residuals = pd.Series(model_fit.resid).abs()
    forecat = model_fit.forecasts[0]

    # Calculate mean and standard deviation of residuals
    mu = residuals[1:].mean()
    sigma = residuals[1:].std()

    # Detect temporary change
    threshold = mu + 3 * sigma  # Set threshold to 3 sigma
    temporary_change = residuals[residuals > threshold]

    pltObj = MyPlots(x=xts, y=df,barOrLine='line',
                         title='Detecting ouliers as change of trends ',
                         SeriselableName='microbusiness_density',
                         x_label='month', y_label='microbusiness_density',
                         main_title=f'ar_p:{ar_p} ma_q:{ma_q} ARIMA TEST OF CFIP:{cfip}',
                        serise1=forecat,
                        color1 ='#DB4437',
                         lableName1 = 'ARIMA Forecast' 
                    )


    min_ =  df.min()
    max_ =  df.max()
    pltObj.vlines(temporary_change.index , ymin=min_ ,ymax= max_, linewidth=5,color='#F4B400', alpha=0.5)
    pltObj.show()

在所有工具中，ARIMA结果应该是相对比较满意的情况，多数情况都能进行很好的发现，当然也不乏一些比如，比如18021这种结构性变化情况，ARIMA是不能发现的。

期待

限于时间和篇幅，我现在就先讲到这里。在ARIMA方法中，也许可以通过逐步窗口添加的并预测，然后分析残差，趋势等以及设计一些模拟函数可以准确的定位到异常的类型。最后，除了上述方法外，针对异常检测，您还可以考虑使用Ruptures和Prophet等软件包。当然，为了实现更准确的异常检测，可能需要使用多种方法，例如投票、堆叠或组合。此外，您可以看到这只是我项目进度的1/5。在将来的笔记本中，我将涵盖以下内容：

离线模型设计 - 如何构建有效的离线模型，特别是在这个比赛中；
构建有效特征 - 如何构建有用的特征，比如怎样的周期特征，时间特征；
离线模型设计 - 传统模型和一些测试，具体来说是ARIMA，波动率波形等一些测试
模式识别 - 在数据中识别模式，归类不同的模式分别建模以及利用模式构建特征工程；
善用异常结果 - 如何使用检测到的异常来构建特征。

WayneBin

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Godaddy（1/5）时间序列异常检测

变点和异常值检测是时间序列分析中的重要技术，因为它们可以帮助识别数据中的显著变化或异常情况。时间序列数据通常表现出非平稳性，这意味着数据的统计属性随时间变化。这些变化可能是由于各种因素引起的，如基本趋势的变化、数据分布的变化或罕见事件或异常的发生。变点检测可以帮助识别这些变化发生的时间，并提供有关变化根本原因的见解。本文的主要目的是尽我所能识别时间序列中的所有异常情况。据我所知，时间序列当中存在的异常值如下几种情况Additive-（AO）加性异常值不会持续存在。
复制链接

扫一扫