【时间序列 - 04】tsfresh：一种“提取时间序列特征”的包

最新推荐文章于 2022-07-24 15:51:57 发布

Houchaoqun_XMU

最新推荐文章于 2022-07-24 15:51:57 发布

阅读量1.7w

点赞数 3

分类专栏：【预测&amp;时间序列】【机器学习】【python】【日积月累】预测领域：时间序列、特征设计+回归文章标签：时间序列 tsfresh

本文链接：https://blog.csdn.net/Houchaoqun_XMU/article/details/82495267

版权

【日积月累】同时被 3 个专栏收录

26 篇文章 0 订阅

订阅专栏

【机器学习】

18 篇文章 4 订阅

订阅专栏

【python】

18 篇文章 0 订阅

订阅专栏

Install

假设你的PC已经装了python开发环境：

## 使用pip直接安装
pip install tsfresh

## 测试是否安装成功
from tsfresh import extract_features

如果还没装python环境，建议安装Anaconda，可参考：https://blog.csdn.net/houchaoqun_xmu/article/details/65629050
requirements.txt（使用pip安装时，会自动安装 requirements 所需的环境）

requests>=2.9.1
numpy>=1.10.4
pandas>=0.20.3
scipy>=0.17.0
statsmodels>=0.8.0  ## 基于 statsmodels 框架
patsy>=0.4.1
scikit-learn>=0.17.1
future>=0.16.0
six>=1.10.0
tqdm>=4.10.0
ipaddress>=1.0.18; python_version <= '2.7'
dask>=0.15.2
distributed>=1.18.3

基本步骤

reference：https://blog.csdn.net/takethevow/article/details/80171440

准备数据：需要处理的时间序列数据，女装项目就是时间与gmv的数据；
特征提取：extract_features
特征过滤：过滤掉没有意义的值（NaN），保留有意义的特征；降维；
特征提取和过滤同时进行：extract_relevant_features(timeseries, y, column_id='id', column_sort='time')

案例

源码中的案例

https://github.com/blue-yonder/tsfresh/tree/master/notebooks

available tasks

time series classification
compression
forecasting

Time Series Forecasting - jupyter notebook

tsfresh.utilities.dataframe_functions.make_forecasting_frame(x, kind, max_timeshift, rolling_direction)

x (np.array or pd.Series) – the singular time series；历史数据，
kind (str) – the kind of the time series；
max_timeshift (int) – If not None, shift only up to max_timeshift. If None, shift as often as possible；
rolling_direction (int) – The sign decides, if to roll backwards (if sign is positive) or forwards in "time"；
Returns：time series container df, target vector y；

说明：df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)make_forecasting_frame() 函数的滑动过程如上图所示，假如：len(class_df_all['y']) = 59，max_timeshift = 10。

(max_timeshift + 1)*(max_timeshift/2) + (len(y) - max_timeshift)*max_timeshift

当rolling_direction = 1，那么返回的 df_shift 将是一个545行的组合数据，过程如下：

id = 1：feature_matrix, time = 0

id = 2：feature_matrix, time = 0，1，

id = 3：feature_matrix, time = 0，1，2

... ...

id = 10：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ##

id = 11：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## 由于 max_timeshift =10，限制了最大长度为10

id = 12：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## 由于 max_timeshift =10，限制了最大长度为10

... ...

id = 58：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

所以：545 = (1+10)*10/2 + (59-10)*10

当 rolling_direction = -1 时，过程如下：

id = 1：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9 ## 由于 max_timeshift =10，限制了最大长度为10

id = 2：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

id = 3：feature_matrix, time = 0,1,2,3,4,5,6,7,8,9

... ...

id = 57：feature_matrix, time = 0,1

id = 58：feature_matrix, time = 0

683·380

extract_features，特征提取：根据上述滑动组合得到的 df_shift 数据，提取特征：X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute, show_warnings=False) ## 在 spyder 上无法work，而在 jupyter notebook 可以 work；
得到的特征：[59 rows x 794 columns] --> 794 维的特征，59行样本数

（794维特征，class ComprehensiveFCParameters）

extract_features 提取特征的对象：

1）a pandas.DataFrame containing the different time series;

2）a dictionary of pandas.DataFrame each containing one type of time series;

extract_relevant_features：过滤掉部分特征

思路问题

回归模型

输入：特征向量 - feature
输出：预测值（回归值）
问题：gmv是目标值，如果数据仅仅是（ds，gmv），是否不适用回归模型？
分析：回归模型的输入是特征，如果需要预测未来2个月的gmv值，那么需要知道未来2个月各自对应的特征向量 feature，并将 feature 作为模型的输入，得到对应的预测值。

Script - 20180717

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import make_forecasting_frame
from sklearn.ensemble import AdaBoostRegressor
from tsfresh.utilities.dataframe_functions import impute

import warnings
warnings.filterwarnings('ignore')

## load dateset

month_list = ["Jan","Feb","Mar","Apr","May","June",
              "July","Aug","Sept","Oct","Nov","Dec"]

all_leaf_class_name_dict = {cate_id: cate_name}

df = pd.read_csv('./cate_by_month_histroy.csv', header=0, encoding='gbk')
df.columns = ['ds', 'cate_id', 'cate_name', 'y']
class_df_all = df[df.cate_name.str.startswith(cate_name)].reset_index(drop=True)
class_df_all = class_df_all[['ds', 'y']]
class_df_all = class_df_all[:60]
# print(class_df_all.head())

## plot
fig = plt.figure(facecolor='white')
ax = fig.add_subplot(111)
ax.plot(class_df_all['ds'], class_df_all['y'])
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
fig.set_size_inches(18, 8)
plt.legend()

## make_forecasting_frame
df_shift, y = make_forecasting_frame(class_df_all['y'], kind="gmv", max_timeshift=24, rolling_direction=1)
# print(df_shift)
# print(y)

## extract_features
X = extract_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute, 
                     show_warnings=False)

## 回归模型
ada = AdaBoostRegressor()

y_pred = [0] * len(y)
# print(y_pred)
y_pred[0] = y.iloc[0]
# print(y_pred[0])

ada.fit(X.iloc[:], y[:])
y_pred = ada.predict(X.iloc[:])
print((X.iloc[:]).shape)
# for i in range(1, len(y)):
#     ada.fit(X.iloc[:i], y[:i])
#     # print(len(X.iloc[i, :]))
#     y_pred[i] = ada.predict(X.iloc[i, :])
    
y_pred = pd.Series(data=y_pred, index=y.index)

plt.figure(figsize=(15, 6))
plt.plot(y, label="true")
plt.plot(y_pred, label="predicted")
plt.legend()
plt.show()

问题汇总

ImportError: cannot import name 'is_list_like'：https://stackoverflow.com/questions/50394873/import-pandas-datareader-gives-importerror-cannot-import-name-is-list-like
extract_features：Anaconda-spyder 执行到 extract_features 命令时，跑不动（编译器问题？），如下图所示：
extract_features：使用 jupyter notebook 就能顺利跑动，如下图所示：