SAX (Symbolic Aggregate approXimation): a novel symbolic representation of time series

Introduction

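In short, SAX converts a time series into a symbolic (string) representation.

The algorithm was originally proposed by Lin et al. It extends the PAA-based approach and inherits the simplicity and low computational complexity of the original method, while providing satisfactory sensitivity and selectivity in range-query processing. In addition, the symbolic representation opens the door to the wealth of existing data structures and string-processing algorithms (hashing, regular expressions, pattern matching, suffix trees, and grammatical inference).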

The algorithm

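SAX transforms a time series X of length n into a string of arbitrary length. The algorithm consists of two steps:

(1) Transform the original time series into its PAA representation.

(2) Convert the PAA data into a string.

The use of PAA brings simple and efficient dimensionality reduction while providing the important lower-bounding property. Converting PAA coefficients to letters by table lookup is also computationally efficient, and Lin et al. proved the contractive property of the symbolic distance.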

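The PAA representation of a time series is discretized into SAX in such a way that the resulting symbols are equiprobable. The authors of the original algorithm carried out an extensive and rigorous analysis of the various time series data sets available to them, and the results showed that the values of z-normalized time series follow a normal distribution. Using this property, a lookup table is enough to determine the coordinates of the lines that slice the area under the Gaussian curve into equal parts.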

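The x coordinates of these lines are called breakpoints in the SAX context. The list of breakpoints divides the area under the N(0,1) curve into a regions of equal area, where a is the alphabet size. By assigning a letter of the alphabet to each interval, the vector of PAA coefficients C̄ is converted into the string Ĉ: each coefficient is replaced by the letter of the interval it falls into.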
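The breakpoint lookup and letter assignment described above can be sketched in a few lines of R. This is only an illustration: the helper names sax_breakpoints and paa_to_letters are made up for this example, and the breakpoints are taken as the interior quantiles of N(0,1) via qnorm rather than from a stored table.

```r
# Sketch only: helper names are illustrative, not from the original article.
sax_breakpoints <- function(a) {
  # the a - 1 interior quantiles of N(0, 1), i.e. the cut points that
  # split the area under the curve into a regions of equal area
  qnorm(seq_len(a - 1) / a)
}

paa_to_letters <- function(paa_values, a) {
  # findInterval() counts how many breakpoints are <= each value, so
  # adding 1 gives the 1-based index of the interval (and of its letter)
  letters[findInterval(paa_values, sax_breakpoints(a)) + 1]
}

round(sax_breakpoints(4), 2)
paste(paa_to_letters(c(-1.2, -0.3, 0.1, 0.9), 4), collapse = "")
```

For a four-letter alphabet the rounded breakpoints are -0.67, 0 and 0.67, and the four sample values fall into the four intervals in order, giving "abcd".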



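SAX introduces a new metric for measuring the distance between strings by extending the Euclidean and PAA distances. The function returns the minimal distance between the original time series of two words Q̂ and Ĉ.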
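For reference, this distance is usually written as follows. The formula is reconstructed here from the SAX paper by Lin et al., not copied from this page; n is the length of the original series, w the number of PAA segments (the word length), and dist() the per-symbol distance between two letters.

```latex
\mathrm{MINDIST}(\hat{Q}, \hat{C}) =
  \sqrt{\frac{n}{w}}\,
  \sqrt{\sum_{i=1}^{w} \operatorname{dist}(\hat{q}_i, \hat{c}_i)^{2}}
```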

As the table below shows, the dist function is implemented by table lookup, with a value computed for each cell(r, c):

Lookup table for a four-letter alphabet

      a      b      c      d
a     0      0      0.67   1.34
b     0      0      0      0.67
c     0.67   0      0      0
d     1.34   0.67   0      0
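As shown by Lin et al., this SAX distance lower-bounds the PAA distance, which in turn lower-bounds the Euclidean distance between the original series.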
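Written out, the lower-bounding chain takes roughly the following form. It is reconstructed from the SAX paper by Lin et al. rather than taken from this page; q_i and c_i are the points of the two original series of length n, the barred symbols are their w PAA coefficients, and the hatted symbols are the corresponding SAX letters.

```latex
\sqrt{\sum_{i=1}^{n} (q_i - c_i)^{2}}
  \;\ge\;
\sqrt{\frac{n}{w}}\,\sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^{2}}
  \;\ge\;
\sqrt{\frac{n}{w}}\,\sqrt{\sum_{i=1}^{w} \operatorname{dist}(\hat{q}_i, \hat{c}_i)^{2}}
```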

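Ding et al. examined the SAX lower bound in detail and found it to be superior in precision to spectral decomposition methods on bursty (non-periodic) data sets.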

SAX primer

1 Time series data

We will use the following two time series as an example (the Euclidean distance between ts1 and ts2 is 11.4):

> ts1=c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
> ts2=c(0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72)
> dist(rbind(ts1,ts2), method = "euclidean")
         ts1
ts2 11.42126

We will convert each of them into a string of length 9 whose letters come from an alphabet of size 4.

2 Z-normalization

Before converting the series to SAX strings, we z-normalize the data.

znorm <- function(ts){
  ts.mean <- mean(ts)
  ts.dev <- sd(ts)
  (ts - ts.mean)/ts.dev
}

ts1_znorm=znorm(ts1)
ts2_znorm=znorm(ts2)

PAA follows this standard procedure.

3 PAA transform

PAA

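paa <- function(ts, paa_size){
  len <- length(ts)
  if (len == paa_size) {
    # nothing to reduce: one point per segment
    ts
  }
  else {
    if (len %% paa_size == 0) {
      # length divides evenly: average each block of len/paa_size points
      colMeans(matrix(ts, nrow = len %/% paa_size, byrow = FALSE))
    }
    else {
      # uneven split: conceptually repeat each point paa_size times and
      # average consecutive runs of len values, so a point that straddles
      # a segment boundary contributes fractionally to both segments
      res <- rep.int(0, paa_size)
      for (i in 0:(len * paa_size - 1)) {
        idx <- i %/% len + 1       # PAA segment receiving this value
        pos <- i %/% paa_size + 1  # index of the source point in ts
        res[idx] <- res[idx] + ts[pos]
      }
      res / len
    }
  }
}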
paa_size=9
s1_paa = paa(ts1_znorm,paa_size)
s2_paa = paa(ts2_znorm,paa_size)
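To see what the uneven-split branch does, here is a compact, vectorized sketch of the same idea. The name paa_sketch is made up for this example; it is meant as an illustration of the fractional weighting, not as a replacement for the function above.

```r
# Sketch only: paa_sketch is an illustrative name.
paa_sketch <- function(ts, paa_size) {
  # repeat every point paa_size times, fill a matrix with length(ts) rows
  # column by column, and average each of the paa_size columns
  colMeans(matrix(rep(ts, each = paa_size), nrow = length(ts)))
}

# three points into two segments: each segment spans 1.5 points,
# so the first segment is (1 + 0.5 * 2) / 1.5 and the second (0.5 * 2 + 3) / 1.5
paa_sketch(c(1, 2, 3), 2)
```

With 15 points and 9 segments, each segment spans 15/9 points; the first PAA value is therefore (9 * x1 + 6 * x2) / 15 in terms of the first two z-normalized points x1 and x2.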

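4 Converting PAA values to letters

As in the table above, I used four letters (a, b, c, d). The cut lines for this alphabet are drawn as blue lines in the figure below.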

Using a 9-point PAA, SAX converts ts1 into the string abddccbaa.

Using a 9-point PAA, SAX converts ts2 into the string abbccddba.
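The whole conversion can be reproduced end to end with a short script. This is a sketch under a few assumptions: the helper names paa_sketch and to_sax are made up here, the PAA step uses a vectorized form of the fractional-weight averaging, and the breakpoints come from qnorm instead of a stored table.

```r
# Sketch only: paa_sketch and to_sax are illustrative names.
ts1 <- c(2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34)
ts2 <- c(0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72)

znorm <- function(ts) (ts - mean(ts)) / sd(ts)

paa_sketch <- function(ts, paa_size) {
  # fractional-weight PAA: repeat each point paa_size times and average
  # consecutive runs of length(ts) values
  colMeans(matrix(rep(ts, each = paa_size), nrow = length(ts)))
}

to_sax <- function(ts, paa_size, a) {
  bp <- qnorm(seq_len(a - 1) / a)                 # equiprobable breakpoints
  paa_values <- paa_sketch(znorm(ts), paa_size)   # z-normalize, then PAA
  # interval index of each PAA value, mapped to the matching letter
  paste(letters[findInterval(paa_values, bp) + 1], collapse = "")
}

to_sax(ts1, 9, 4)
to_sax(ts2, 9, 4)
```

Run on the two example series, this yields the same words as quoted above, abddccbaa and abbccddba.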

SAX distance, summing the lookup-table distances position by position: 0 + 0 + 0.67 + 0 + 0 + 0 + 0.67 + 0 + 0 = 1.34
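The per-position distances can be checked with a few lines of R. The sketch below assumes the cell rule from Lin et al. (0 for identical or adjacent letters, otherwise the gap between the separating breakpoints) and breakpoints rounded to two decimals, as in the lookup table. It also computes the MINDIST form from the SAX paper, which squares the per-symbol distances and scales by sqrt(n/w); that value is smaller than the plain sum shown above.

```r
# Sketch only: variable names are illustrative.
word1 <- "abddccbaa"
word2 <- "abbccddba"
n <- 15   # length of the original series
a <- 4    # alphabet size

bp <- round(qnorm(seq_len(a - 1) / a), 2)      # -0.67, 0, 0.67

# letter indices (a = 1, b = 2, ...) at each position of the two words
i1 <- match(strsplit(word1, "")[[1]], letters)
i2 <- match(strsplit(word2, "")[[1]], letters)
lo <- pmin(i1, i2)
hi <- pmax(i1, i2)

# per-symbol distance; the clamps only keep indices valid where the
# result is 0 anyway (ifelse evaluates both branches)
d <- ifelse(hi - lo <= 1, 0, bp[pmax(hi - 1, 1)] - bp[pmin(lo, a - 1)])

d                                         # non-zero only at positions 3 and 7
sum(d)                                    # plain sum of the table entries
sqrt(n / length(d)) * sqrt(sum(d^2))      # MINDIST form from Lin et al.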

In the figure, orange marks the distances between symbols that are counted (the symbols that are not adjacent in the table).


Original article: https://jmotif.github.io/sax-vsm_site/morea/algorithm/SAX.html
