Useful Libraries for Time Series Classification

Latest recommended article published on 2024-09-23 20:25:40

lizz2276

Article link: https://blog.csdn.net/lizz2276/article/details/118940466

(1) sktime

(2) pyts

Shapelet-based time series classification in practice: https://zhuanlan.zhihu.com/p/359666547

Introduction to the pyts library

https://zhuanlan.zhihu.com/p/272691705

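While searching for shapelet code today I came across a library called pyts. Its GitHub repository is https://github.com/johannfaouzi/pyts. The repository README also links to the paper that introduces the library, "pyts: A Python Package for Time Series Classification", so I read through it. Here are the main points: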

Dependencies

The library depends on numpy, scipy, scikit-learn, joblib, and numba. joblib is used for "running Python functions as pipeline jobs", and numba is there because "Numba-compiled numerical algorithms in Python can reach the speeds of C or FORTRAN."

Assumptions on Input Data

For computational efficiency, most algorithms implemented in pyts can only deal with data sets of equal-length time series. Note, however, that the DTW algorithm and its variants in the library do support variable-length time series. pyts supports both univariate and multivariate time series data sets.
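To see why DTW can tolerate series of different lengths, here is a minimal textbook sketch of the dynamic-programming recurrence in plain Python. It is not the pyts implementation; the function name `dtw_distance` and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_distance(x, y):
    """Classic DTW with an absolute-difference cost (illustrative only)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] holds the cost of the best alignment of x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three neighbouring alignments
            acc[i][j] = cost + min(acc[i - 1][j],      # repeat y[j-1]
                                   acc[i][j - 1],      # repeat x[i-1]
                                   acc[i - 1][j - 1])  # advance both
    return acc[n][m]


short = [1.0, 2.0, 3.0]
longer = [1.0, 2.0, 2.0, 3.0]  # same shape, one sample repeated
print(dtw_distance(short, longer))
```

Nothing in the recurrence requires `len(x) == len(y)`, which is why a warping-based distance can compare unequal-length series while window-based or pointwise methods generally cannot.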

Comparison to Related Software

The figure above shows the algorithms pyts supports (more have since been added on GitHub).

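The paper is short, so let's look at how to use the library. The only application of pyts I have seen on Chinese-language blogs is converting time series into images [1], so here I will show how to call its shapelets instead. Following the official shapelets documentation [2], let's go straight to an example: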

from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

The learned shapelets are available as clf.shapelets_. The documentation gives its shape as: array, shape = (n_tasks, n_shapelets). What is n_shapelets? Presumably the number of shapelets. To pick out individual shapelets, the syntax is:

shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])

That is, from the learned shapelets I take the 9th from the end and the 12th from the end. Here is a longer example:

import matplotlib.pyplot as plt
import numpy as np
from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
from pyts.utils import windowed_view

# Load the data set and fit the classifier
X, _, y, _ = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X, y)

# Select two shapelets
shapelets = np.asarray([clf.shapelets_[0, -9], clf.shapelets_[0, -12]])

# Derive the distances between the time series and the shapelets
shapelet_size = shapelets.shape[1]
X_window = windowed_view(X, window_size=shapelet_size, window_step=1)
X_dist = np.mean(
    (X_window[:, :, None] - shapelets[None, :]) ** 2, axis=3).min(axis=1)


plt.figure(figsize=(14, 4))

# Plot the two shapelets
plt.subplot(1, 2, 1)
plt.plot(shapelets[0])
plt.plot(shapelets[1])
plt.title('Two learned shapelets', fontsize=14)

# Plot the distances
plt.subplot(1, 2, 2)
for color, label in zip('br', (1, 2)):
    plt.scatter(X_dist[y == label, 0], X_dist[y == label, 1],
                c=color, label='Class {}'.format(label))
plt.title('Distances between the time series and both shapelets',
          fontsize=14)
plt.legend()
plt.show()
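The X_dist expression in the listing above packs several broadcasting steps into one line. The sketch below replays it on made-up toy data using NumPy only. The window array is built by hand as a stand-in for windowed_view, on the assumption that windowed_view returns an array of shape (n_samples, n_windows, window_size).

```python
import numpy as np

# Toy data (made-up values): 2 series of length 5, 2 shapelets of length 3
X = np.array([[0., 1., 2., 3., 4.],
              [4., 3., 2., 1., 0.]])
shapelets = np.array([[1., 2., 3.],
                      [3., 2., 1.]])

size = shapelets.shape[1]
n_windows = X.shape[1] - size + 1

# Hand-built sliding windows with step 1: shape (n_samples, n_windows, size)
X_window = np.stack([X[:, i:i + size] for i in range(n_windows)], axis=1)

# (n_samples, n_windows, 1, size) minus (1, n_shapelets, size)
# broadcasts to (n_samples, n_windows, n_shapelets, size)
diff = X_window[:, :, None] - shapelets[None, :]

# Mean squared difference per window, then the best window for each series
X_dist = np.mean(diff ** 2, axis=3).min(axis=1)

print(X_window.shape)
print(X_dist.shape)
print(X_dist)
```

Each row of X_dist describes one series and each column one shapelet, which is why the plot can use column 0 and column 1 as the two scatter axes.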

First, clf.fit(X, y) yields 90 shapelets in total: 30 of length 15, 30 of length 30, and 30 of length 45. Why 90, and why lengths 15, 30, and 45? For that we need [3]. Start with the parameters of the LearningShapelets constructor:

LearningShapelets(n_shapelets_per_size=0.2, min_shapelet_length=0.1, 
                  shapelet_scale=3, penalty='l2', 
                  tol=0.001, C=1000, learning_rate=1.0, 
                  max_iter=1000, multi_class='multinomial', 
                  alpha=-100, fit_intercept=True, 
                  intercept_scaling=1.0, 
                  class_weight=None, verbose=0, random_state=None, n_jobs=None)

Here is the description of the n_shapelets_per_size parameter:

int or float (default = 0.2) Number of shapelets per size. If float, it represents a fraction of the number of timestamps and the number of shapelets per size is equal to ceil(n_shapelets_per_size * n_timestamps).

Each of our time series has length 150, and 150 × 0.2 = 30, so there are 30 shapelets for each size. Now the shapelet_scale parameter:

The different scales for the lengths of the shapelets. The lengths of the shapelets are equal to min_shapelet_length * np.arange(1, shapelet_scale + 1). The total number of shapelets (and features) is equal to n_shapelets_per_size * shapelet_scale.

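Because shapelet_scale defaults to 3 and min_shapelet_length = 0.1, and each time series has length 150, we get 150 × 0.1 = 15 and 15 × 3 = 45, so the shapelets come in three lengths: 15, 30, and 45. Each length has 30 shapelets, which gives 90 in total.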
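The arithmetic can be checked with a few lines of plain Python. This is only a sketch of the formulas quoted from the documentation, not code taken from pyts. The helper name `shapelet_layout` is hypothetical, and applying ceil to a float min_shapelet_length is an assumption that mirrors how n_shapelets_per_size is described.

```python
import math


def shapelet_layout(n_timestamps, n_shapelets_per_size=0.2,
                    min_shapelet_length=0.1, shapelet_scale=3):
    """Return (shapelets per size, shapelet lengths, total shapelets)."""
    # A float is read as a fraction of the series length
    if isinstance(n_shapelets_per_size, float):
        per_size = math.ceil(n_shapelets_per_size * n_timestamps)
    else:
        per_size = n_shapelets_per_size
    # Assumption: min_shapelet_length is handled the same way
    if isinstance(min_shapelet_length, float):
        min_len = math.ceil(min_shapelet_length * n_timestamps)
    else:
        min_len = min_shapelet_length
    # Lengths are min_len * 1, min_len * 2, ..., min_len * shapelet_scale
    lengths = [min_len * k for k in range(1, shapelet_scale + 1)]
    return per_size, lengths, per_size * shapelet_scale


print(shapelet_layout(150))  # (30, [15, 30, 45], 90)
```

With the defaults and a series length of 150 this reproduces the 30 shapelets per size, the lengths 15, 30, and 45, and the total of 90 worked out above.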

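I was particularly interested in X_window and X_dist. X_window cuts every time series into sliding windows whose size equals the shapelet length. X_dist then computes the mean squared difference between each window and each shapelet and keeps the smallest value per series. So my idea was to let pyts learn the shapelets, and then implement the sliding-window split and the distance computation myself.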

from pyts.classification import LearningShapelets
from pyts.datasets import load_gunpoint
import numpy as np
import math

def getDataFromSlidingWindow(windowSize, data):
    # Split a 1-D series into overlapping windows; only step=1 is implemented
    dataList = []
    for i in range(data.shape[0] - windowSize + 1):
        dataList.append(data[i:i + windowSize])
    return dataList

# Mean squared difference between a window and a shapelet
# (squared Euclidean distance divided by the length; no square root is taken)
def meanDistance(windowData, shapelets):
    total = 0.0  # named total to avoid shadowing the built-in sum()
    for i in range(len(windowData)):
        total += (windowData[i] - shapelets[i]) ** 2
    return float(total) / len(windowData)

X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
clf = LearningShapelets(random_state=42, tol=0.01)
clf.fit(X_train, y_train)

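# The first and last shapelets may differ in length, so keep them in a list
# rather than stacking them into a regular 2-D array.
shapelets = [clf.shapelets_[0, 0], clf.shapelets_[0, -1]]
print(len(shapelets))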
print(clf.shapelets_[0,0].shape)
print(clf.shapelets_[0,-1].shape)

print(X_train.shape)
for i in range(X_train.shape[0]):
    print(X_train[i])

windowSize = clf.shapelets_[0, 0].shape[0]  # window size matches the shapelet length

# Slide a window (step 1) over the first training series and keep the
# smallest distance to the first learned shapelet
windowData = getDataFromSlidingWindow(windowSize, X_train[0, :])
allDistanceList = []
for item in windowData:
    allDistanceList.append(meanDistance(item, clf.shapelets_[0, 0]))
print(min(allDistanceList))
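The listing above only handles a window step of 1. As a possible extension, the sketch below folds the windowing and the distance into one function and accepts an arbitrary step. The name `min_mean_sq_distance` is hypothetical and the toy values are made up. It works on plain lists as well as 1-D NumPy arrays.

```python
def min_mean_sq_distance(series, shapelet, step=1):
    """Smallest mean squared difference between a shapelet and any window."""
    size = len(shapelet)
    if size > len(series):
        raise ValueError("shapelet is longer than the series")
    best = float("inf")
    # Visit window start positions 0, step, 2*step, ...
    for start in range(0, len(series) - size + 1, step):
        window = series[start:start + size]
        dist = sum((w - s) ** 2 for w, s in zip(window, shapelet)) / size
        best = min(best, dist)
    return best


series = [0.0, 1.0, 2.0, 3.0, 4.0]
shapelet = [1.0, 2.0, 3.0]
print(min_mean_sq_distance(series, shapelet, step=1))  # 0.0
print(min_mean_sq_distance(series, shapelet, step=2))  # 1.0
```

A larger step skips windows, so it can miss the best match: here step=2 never visits the window [1, 2, 3] and reports 1.0 instead of 0.0.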

OK, that's all for now. What comes next relates to the paper.