Useful libraries for time series classification

(1) sktime

(2) pyts

Shapelet-based time series classification in practice: https://zhuanlan.zhihu.com/p/359666547

Introduction to the pyts library

https://zhuanlan.zhihu.com/p/272691705

While searching for shapelet code today I came across a library called pyts. Its GitHub repository is https://github.com/johannfaouzi/pyts. The repository README also points to the paper that introduces the library, "pyts: A Python Package for Time Series Classification", so I read through it. The main points follow.

Dependencies

The library depends on numpy, scipy, scikit-learn, joblib and numba. joblib is there for "running Python functions as pipeline jobs". As for numba: "Numba-compiled numerical algorithms in Python can reach the speeds of C or FORTRAN."

Assumptions on Input Data

For computational efficiency, most algorithms implemented in pyts can only deal with data sets of equal-length time series. Note, however, that the DTW algorithm and its variants in the library do support variable-length time series. pyts supports both univariate and multivariate time series data sets.
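To see why DTW is the exception to the equal-length rule, here is a minimal pure-Python sketch of classic DTW. It is my own illustration, not the pyts implementation; the function name `dtw_distance` and the squared point-wise cost are assumptions made for this example.

```python
import math

def dtw_distance(x, y):
    """Classic DTW with a squared point-wise cost; x and y may differ in length."""
    n, m = len(x), len(y)
    # acc[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m])

short = [1.0, 2.0, 3.0]
stretched = [1.0, 2.0, 2.0, 3.0]  # same pattern, one value repeated
print(dtw_distance(short, stretched))  # 0.0: the warping absorbs the repeated value
```

Because the alignment may map one point to several, nothing in the recurrence requires the two series to have the same number of timestamps.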

Comparison to Related Software

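The figure in this section of the paper (not reproduced here) lists the algorithms that pyts supports; more have been added on GitHub since.


The paper is short, so let's look at how to use the library. The only application of pyts I have seen in Chinese-language blogs is converting time series into images 【1】, so here I will write up how to call its shapelet implementation. Following the official documentation on shapelets 【2】, let's go straight to an example: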

from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

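The learned shapelets are available through clf.shapelets_. The documentation gives the shape of this attribute as: array shape = (n_tasks, n_shapelets). What is n_shapelets? It should be the number of shapelets. To pick out individual shapelets, the syntax is: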

shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])

This selects the 9th-from-last and the 12th-from-last of the learned shapelets. Here is a longer example:

import matplotlib.pyplot as plt
import numpy as np
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
from pyts.utils import windowed_view

# Load the data set and fit the classifier
X, _, y, _ = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X, y)

# Select two shapelets
shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])

# Derive the distances between the time series and the shapelets
shapelet_size = shapelets.shape[1]
X_window = windowed_view(X, window_size=shapelet_size, window_step=1)
X_dist = np.mean(
    (X_window[:, :, None] - shapelets[None, :]) ** 2, axis=3).min(axis=1)


plt.figure(figsize=(14, 4))

# Plot the two shapelets
plt.subplot(1, 2, 1)
plt.plot(shapelets[0])
plt.plot(shapelets[1])
plt.title('Two learned shapelets', fontsize=14)

# Plot the distances
plt.subplot(1, 2, 2)
for color, label in zip('br', (1, 2)):
    plt.scatter(X_dist[y == label, 0], X_dist[y == label, 1],
                c=color, label='Class {}'.format(label))
plt.title('Distances between the time series and both shapelets',
          fontsize=14)
plt.legend()
plt.show()
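The broadcasting in the X_dist line is easy to misread, so here is a shape-only walkthrough. It is a sketch on random stand-in arrays rather than the GunPoint data, and the sizes (5 series, 106 windows, shapelets of length 45) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_windows, shapelet_size = 5, 106, 45  # e.g. 150 - 45 + 1 = 106 windows

# Stand-ins for the windowed series and for the two selected shapelets
X_window = rng.normal(size=(n_samples, n_windows, shapelet_size))
shapelets = rng.normal(size=(2, shapelet_size))

# (5, 106, 1, 45) minus (1, 2, 45) broadcasts to (5, 106, 2, 45):
# every window of every series is compared with both shapelets.
diff = X_window[:, :, None] - shapelets[None, :]
print(diff.shape)

# Average over the time axis: one mean squared difference per window and shapelet
mse = np.mean(diff ** 2, axis=3)
print(mse.shape)

# Keep the best-matching window per series: one distance per series and shapelet
X_dist = mse.min(axis=1)
print(X_dist.shape)
```

The final array has one row per time series and one column per shapelet, which is why the scatter plot can use column 0 and column 1 as its two axes.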

Calling clf.fit(X, y) yields 90 shapelets in total: 30 of length 15, 30 of length 30 and 30 of length 45. Why 90, and why lengths of 15, 30 and 45? The answer is in 【3】. Start with the parameters used when instantiating LearningShapelets:

LearningShapelets(n_shapelets_per_size=0.2, min_shapelet_length=0.1, 
                  shapelet_scale=3, penalty='l2', 
                  tol=0.001, C=1000, learning_rate=1.0, 
                  max_iter=1000, multi_class='multinomial', 
                  alpha=-100, fit_intercept=True, 
                  intercept_scaling=1.0, 
                  class_weight=None, verbose=0, random_state=None, n_jobs=None)

Here is the explanation of the parameter n_shapelets_per_size:

int or float (default = 0.2) Number of shapelets per size. If float, it represents a fraction of the number of timestamps and the number of shapelets per size is equal to  ceil(n_shapelets_per_size * n_timestamps).

Each of our time series has length 150, and 150 × 0.2 = 30, so there are 30 shapelets for each size. Next, the parameter shapelet_scale:

The different scales for the lengths of the shapelets. The lengths of the shapelets are equal to min_shapelet_length * np.arange(1, shapelet_scale + 1). The total number of shapelets (and features) is equal to n_shapelets_per_size * shapelet_scale.

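Since shapelet_scale defaults to 3 and min_shapelet_length=0.1, and each time series has length 150, the minimum length is 150 × 0.1 = 15 and the maximum is 15 × 3 = 45. The shapelets therefore come in three lengths: 15, 30 and 45. Each length has 30 shapelets, which gives 90 in total.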
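The arithmetic above can be reproduced in a few lines. This is only a sketch of the formulas quoted from the documentation: it assumes that a float min_shapelet_length is rounded with ceil in the same way as n_shapelets_per_size, and the `layout` list assumes the shapelets are stored grouped by length in increasing order, which I have not verified against the pyts source.

```python
import math

n_timestamps = 150            # length of one GunPoint series
n_shapelets_per_size = 0.2    # default
min_shapelet_length = 0.1     # default
shapelet_scale = 3            # default

per_size = math.ceil(n_shapelets_per_size * n_timestamps)
min_length = math.ceil(min_shapelet_length * n_timestamps)  # assumed ceil rounding
lengths = [min_length * k for k in range(1, shapelet_scale + 1)]
total = per_size * shapelet_scale
print(per_size, lengths, total)  # 30 [15, 30, 45] 90

# Hypothetical storage order: all length-15 shapelets, then 30, then 45
layout = [length for length in lengths for _ in range(per_size)]
print(layout[-9], layout[-12])  # 45 45
```

Under that ordering assumption, indices -9 and -12 both fall in the length-45 group, which is consistent with np.asarray stacking the two selected shapelets into a regular 2-D array.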

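I am particularly interested in X_window and X_dist. X_window slices every time series into sliding windows of length window_size. X_dist then computes the mean squared difference between each window and a shapelet (a squared Euclidean distance divided by the window length) and keeps the minimum over all windows. So my idea is to use pyts for the first step, learning the shapelets, and to implement the sliding-window slicing and the distance computation myself.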

from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
import numpy as np
import math

def getDataFromSlidingWindow(windowSize, data):
    # Only a step of 1 is implemented for now
    dataList = []
    for i in range(data.shape[0] - windowSize + 1):
        # Slice out the window that starts at position i
        dataList.append(data[i:i + windowSize])
    return dataList

# Mean squared difference between one window and one shapelet
# (squared Euclidean distance divided by the window length)
def meanDistance(windowData, shapelets):
    total = 0  # avoid shadowing the built-in sum()
    for i in range(len(windowData)):
        total = total + (windowData[i] - shapelets[i]) * (windowData[i] - shapelets[i])
    # result = math.sqrt(float(total)) / len(windowData)
    result = float(total) / len(windowData)
    return result

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)

shapelets = np.asarray([clf.shapelets_[0, 0], clf.shapelets_[0, -1]])
print(shapelets.shape)
print(clf.shapelets_[0,0].shape)
print(clf.shapelets_[0,-1].shape)

print(X_train.shape)
for i in range(X_train.shape[0]):
    print(X_train[i])

windowSize = clf.shapelets_[0, 0].shape[0]  # keep windowSize equal to the shapelet length
step = 1  # stride of the sliding window (getDataFromSlidingWindow only supports 1)

# print(len(getDataFromSlidingWindow(windowSize, X_train[0, :])))
# print(len(getDataFromSlidingWindow(windowSize, X_train[0, :])[0]))
windowData = getDataFromSlidingWindow(windowSize, X_train[0, :])
allDistanceList = []
for item in windowData:
    # Distance between this window and the first learned shapelet
    allDistanceList.append(meanDistance(item, clf.shapelets_[0, 0]))
# Smallest distance between the first training series and the shapelet
print(min(allDistanceList))
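The loop-based result can be cross-checked against a vectorized version. The sketch below does this on synthetic data so that it runs without pyts; it assumes NumPy 1.20 or later for `sliding_window_view`, and both function names are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def min_mean_sq_distance_loop(series, shapelet):
    """Explicit loops: slide a window of step 1 and keep the smallest distance."""
    size = len(shapelet)
    best = float("inf")
    for start in range(len(series) - size + 1):
        total = 0.0
        for k in range(size):
            total += (series[start + k] - shapelet[k]) ** 2
        best = min(best, total / size)
    return best

def min_mean_sq_distance_vectorized(series, shapelet):
    """Same quantity computed with a strided window view and broadcasting."""
    windows = sliding_window_view(series, len(shapelet))  # (n_windows, size)
    return float(np.mean((windows - shapelet) ** 2, axis=1).min())

rng = np.random.default_rng(0)
series = rng.normal(size=150)   # synthetic stand-in for one time series
shapelet = rng.normal(size=15)  # synthetic stand-in for one learned shapelet

loop_value = min_mean_sq_distance_loop(series, shapelet)
vector_value = min_mean_sq_distance_vectorized(series, shapelet)
print(np.isclose(loop_value, vector_value))
```

If the two values agree on random inputs, the hand-written loops compute the same quantity as the broadcasting expression, only more slowly.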

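OK, I will stop here. The next steps are tied to the paper.

References

【1】Converting a one-dimensional time series into a two-dimensional image
【2】3. Extracting features from time series
【3】pyts official documentation: pyts.classification.LearningShapelets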
