Introducing practical and robust anomaly detection in a time series

Tuesday, January 6, 2015 | By Arun Kejariwal (@arun_kejariwal), Software Engineer [16:49 UTC]   

Both last year and this year, we saw a spike in the number of photos uploaded to Twitter on Christmas Eve, Christmas and New Year’s Eve (in other words, an anomaly occurred in the corresponding time series). Today, we’re announcing AnomalyDetection, our open-source R package that automatically detects anomalies like these in big data in a practical and robust way.

Time series from Christmas Eve 2014

Time series from Christmas Eve 2013

Early detection of anomalies plays a key role in ensuring high-fidelity data is available to our own product teams and those of our data partners. This package helps us monitor spikes in user engagement on the platform surrounding holidays, major sporting events or during breaking news. Beyond surges in social engagement, exogenous factors – such as bots or spammers – may cause an anomaly in the number of favorites or followers. The package can be used to find such bots or spam, as well as detect anomalies in system metrics after a new software release. We’re open-sourcing AnomalyDetection because we’d like the public community to evolve the package and learn from it as we have.

Recently, we open-sourced BreakoutDetection, a complementary R package for automatic detection of one or more breakouts in time series. While anomalies are point-in-time anomalous data points, breakouts are characterized by a ramp up from one steady state to another.

Despite prior research in anomaly detection [1], existing techniques are not directly applicable to social network data because of its inherent seasonal and trend components. Also, as pointed out by Chandola et al. [2], anomalies are contextual in nature and hence, techniques developed for anomaly detection in one domain can rarely be used ‘as is’ in another domain.

Broadly, an anomaly can be characterized in the following ways:

  1. Global/Local: At Twitter, we observe distinct seasonal patterns in most of the time series we monitor in production. Furthermore, we monitor multiple modes in a given time period. The seasonal nature can be ascribed to a multitude of reasons such as different user behavior across different geographies. Additionally, over longer periods of time, we observe an underlying trend. This can be explained, in part, by organic growth. As the figure below shows, global anomalies typically extend above or below the expected seasonality and are therefore not masked by seasonality and the underlying trend. On the other hand, local anomalies, or anomalies which occur inside seasonal patterns, are masked and thus are much more difficult to detect in a robust fashion.
     Figure: positive/negative and global/local anomalies detected in real data
  2. Positive/Negative: An anomaly can be positive or negative. An example of a positive anomaly is a point-in-time increase in number of Tweets during the Super Bowl. An example of a negative anomaly is a point-in-time decrease in QPS (queries per second). Robust detection of positive anomalies serves a key role in efficient capacity planning. Detection of negative anomalies helps discover potential hardware and data collection issues.

How does the package work?
The primary algorithm, Seasonal Hybrid ESD (S-H-ESD), builds upon the Generalized ESD test [3] for detecting anomalies. S-H-ESD can be used to detect both global and local anomalies. This is achieved by employing time series decomposition and using robust statistical metrics, viz., the median together with ESD. In addition, for long time series, such as 6 months of minutely data, the algorithm employs piecewise approximation. This is rooted in the fact that trend extraction in the presence of anomalies is non-trivial for anomaly detection [4].
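To make the idea concrete, here is a minimal base-R sketch of a Generalized ESD test [3] in which the median and MAD stand in for the mean and standard deviation. It is an illustration under stated assumptions, not the package's implementation: the function name robust_esd and the toy data are hypothetical, and the seasonal decomposition step is left out entirely.

```r
# Hypothetical helper: Generalized ESD with robust statistics (median, MAD).
# Returns the indices of the points judged anomalous.
robust_esd <- function(x, max_anoms = 0.1, alpha = 0.05) {
  n <- length(x)
  k <- max(1, floor(max_anoms * n))   # upper bound on the number of anomalies
  idx <- seq_len(n)                   # indices still under consideration
  removed <- integer(k)               # candidate anomalies, in removal order
  is_signif <- logical(k)             # did step i exceed its critical value?

  for (i in seq_len(k)) {
    vals <- x[idx]
    dev <- abs(vals - median(vals))
    s <- mad(vals)                    # scaled MAD (comparable to sd for normal data)
    if (s == 0) break                 # degenerate case: no spread left to test against
    j <- which.max(dev)
    r_i <- dev[j] / s                 # test statistic for the most extreme point

    # Two-sided critical value of the Generalized ESD test at step i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda_i <- (n - i) * t_crit / sqrt((n - i - 1 + t_crit^2) * (n - i + 1))

    removed[i] <- idx[j]
    is_signif[i] <- r_i > lambda_i
    idx <- idx[-j]                    # drop the point and test the remainder
  }

  # The number of anomalies is the largest step whose statistic was significant
  n_anoms <- if (any(is_signif)) max(which(is_signif)) else 0
  sort(removed[seq_len(n_anoms)])
}

# Toy data (illustrative only): a repeating pattern with two injected spikes
x <- rep(c(10, 11, 9, 10, 12, 8), times = 10)
x[20] <- 50    # positive anomaly
x[41] <- -30   # negative anomaly
anoms <- robust_esd(x, max_anoms = 0.1)
print(anoms)
```

On this toy input the sketch reports positions 20 and 41. Because the median and MAD barely move when a few extreme points are present, the spikes do not inflate the scale estimate and mask one another, which is the motivation for the robust metrics described above.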

The figure below shows large global anomalies present in the raw data and the local (intra-day) anomalies that S-H-ESD exposes in the residual component via our statistically robust decomposition technique.
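The following base-R sketch shows why working on the residual exposes local anomalies. It assumes an STL-style decomposition via stats::stl, which the text does not specify; the synthetic series, the injected anomaly and the score threshold of 5 are all illustrative choices rather than anything taken from the package.

```r
# Synthetic hourly series: 14 days with daily seasonality and mild noise
set.seed(1)
period <- 24
n <- 14 * period
t <- seq_len(n)
x <- 100 + 50 * sin(2 * pi * t / period) + rnorm(n, sd = 2)

# Inject a local anomaly at a daily trough: the raw value stays well
# inside the overall range of the series, so a global threshold misses it.
anom_idx <- 186
x[anom_idx] <- x[anom_idx] + 25

# Estimate the seasonal component (assumed here: STL with a periodic season)
fit <- stl(ts(x, frequency = period), s.window = "periodic", robust = TRUE)
seasonal <- as.numeric(fit$time.series[, "seasonal"])

# Residual: remove the seasonality and the median of the series
resid <- x - seasonal - median(x)

# Robust score on the residual: distance from the median in MAD units
score <- abs(resid - median(resid)) / mad(resid)
flagged <- which(score > 5)

print(range(x[-anom_idx]))   # raw range of the remaining points
print(x[anom_idx])           # the anomaly lies inside that range
print(flagged)               # indices whose residual score exceeds the threshold
```

In the raw series the injected point is unremarkable, but once the seasonal component is removed it is the largest residual by a wide margin.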

Besides time series, the package can also be used to detect anomalies in a vector of numerical values. We have found this very useful, since the corresponding timestamps are often not available. The package provides rich visualization support. The user can specify the direction of anomalies, the window of interest (such as last day, last hour) and enable or disable piecewise approximation. Additionally, the x- and y-axes are annotated in a way to assist with visual data analysis.
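The piecewise approximation mentioned above can be pictured as replacing the series with a step-wise trend built from per-window medians. The sketch below is one plausible reading of that idea in base R, not the package's code: the helper name piecewise_median, the window size and the toy data are hypothetical.

```r
# Hypothetical helper: replace each fixed-size window by its median,
# giving a step-wise trend estimate that single spikes barely move.
piecewise_median <- function(x, window) {
  stopifnot(window >= 1)
  groups <- ceiling(seq_along(x) / window)   # window id for every point
  ave(x, groups, FUN = median)               # per-window median, same length as x
}

# Toy data (illustrative): a spike in the first window, then a level shift
x <- c(1, 2, 100, 3, 11, 12, 13, 14)
trend <- piecewise_median(x, window = 4)
print(trend)       # 2.5 2.5 2.5 2.5 12.5 12.5 12.5 12.5
print(x - trend)   # detrended values; the spike at position 3 stands out
```

The median keeps the spike of 100 from dragging the first window's level upward, while the second window follows the shift to a higher level. A mean-based trend would absorb part of the spike and hide it from the later test.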

Getting started
To begin, install the R package using the commands below on the R console:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

The function AnomalyDetectionTs discovers statistically meaningful anomalies in the input time series. Its documentation, which details the input arguments and the output, can be viewed with the command below.

help(AnomalyDetectionTs)

An example
We recommend starting with the example dataset that comes with the package. Execute the following commands:

data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot

This yields the following plot:

From the plot, we can tell that the input time series experiences both positive and negative anomalies. Furthermore, many of the anomalies in the time series are local anomalies within the bounds of the time series’ seasonality.

Therefore, these anomalies can’t be detected using traditional methods. The anomalies detected using the proposed technique are annotated on the plot. If the timestamps for the plot above were not available, anomaly detection could instead be carried out using the AnomalyDetectionVec function. Specifically, you can use the following command:

AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both', only_last=FALSE, plot=TRUE)

Often, anomaly detection is carried out on a periodic basis. For instance, you may be interested in determining whether there were any anomalies yesterday. To this end, we support an only_last argument that restricts the reported anomalies to those that occurred during the last day or last hour. The following command

res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', only_last="day", plot=TRUE)
res$plot

yields the following plot:

From the above plot, we observe that only the anomalies that occurred during the last day have been annotated. Additionally, the prior six days are included to expose the seasonal nature of the time series but are put in the background as the window of primary interest is the last day.

Anomaly detection for long-duration time series can be carried out by setting the longterm argument to TRUE. An example plot corresponding to this (for a different data set) is shown below:

Acknowledgements

Our thanks to James Tsiamis and Scott Wong for their assistance, and Owen Vallis (@OwenVallis) and Jordan Hochenbaum (@jnatanh) for this research.

References

[1] Charu C. Aggarwal. “Outlier analysis”. Springer, 2013.

[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly detection: A survey”. ACM Computing Surveys, 41(3):15:1–15:58, July 2009.

[3] Rosner, B., (May 1983), “Percentage Points for a Generalized ESD Many-Outlier Procedure”, Technometrics, 25(2), pp. 165-172.

[4] Vallis, O., Hochenbaum, J. and Kejariwal, A., (2014) “A Novel Technique for Long-Term Anomaly Detection in the Cloud”, 6th USENIX Workshop on Hot Topics in Cloud Computing, Philadelphia, PA.
