Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)

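[Purpose] Brief notes from reading the SCI-indexed journal article on the tsfresh library, in the hope of saving interested readers some reading and study time.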

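[Note] The Chinese translation in the original post was mostly machine translation and literal rendering; corrections are welcome.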

Abstract: Time series feature engineering is a time-consuming process because scientists and engineers have to consider the multifarious algorithms of signal processing and time series analysis for identifying and extracting meaningful features from time series. The Python package tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests) accelerates this process by combining 63 time series characterization methods, which by default compute a total of 794 time series features, with feature selection on the basis of automatically configured hypothesis tests. By identifying statistically significant time series characteristics in an early stage of the data science process, tsfresh closes feedback loops with domain experts and fosters the development of domain-specific features early on. The package implements standard APIs of time series and machine learning libraries (e.g. pandas and scikit-learn) and is designed for both exploratory analyses and straightforward integration into operational data science applications.

1. Introduction

Trends such as the Internet of Things (IoT) [1], Industry 4.0 [2], and precision medicine [3] are driven by the availability of cheap sensors and advancing connectivity, which among other things increases the availability of temporally annotated data. The resulting time series are the basis for machine learning applications like the classification of hard drives into risk classes concerning a specific defect [4], the analysis of the human heart beat [5], the optimization of production lines [6], the log analysis of server farms for detecting intruders [7], or the identification of patients with high infection risk [8]. Examples of regression tasks are the prediction of the remaining useful life of machinery [9] or the estimation of conditional event occurrence in complex event processing applications [10]. Other frequently occurring temporal data are event series from processes, which can be transformed to uniformly sampled time series via process evolution functions [11]. Time series feature extraction plays a major role during the early phases of data science projects in order to rapidly extract and explore different time series features and evaluate their statistical significance for predicting the target. The Python package tsfresh supports this process by providing automated time series feature extraction and selection on the basis of the FRESH algorithm [12].

2. Problems and background

A time series is a sequence of observations taken sequentially in time [13]. In order to use a set of time series $D = \{\chi_i\}_{i=1}^{N}$ as input for supervised or unsupervised machine learning algorithms, each time series $\chi_i$ needs to be mapped into a well-defined feature space with problem-specific dimensionality $M$ and feature vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,M})$. In principle, one might decide to map the time series of set $D$ into a design matrix of $N$ rows and $M$ columns by choosing $M$ data points from each time series $\chi_i$ as elements of feature vector $x_i$. However, from the perspective of pattern recognition [14], it is much more efficient and effective to characterize the time series with respect to the distribution of data points, correlation properties, stationarity, entropy, and non-linear time series analysis [15]. Therefore the feature vector $x_i$ can be constructed by applying time series characterization methods $f_j: \chi_i \to x_{i,j}$ to the respective time series $\chi_i$, which results in the feature vector $x_i = (f_1(\chi_i), f_2(\chi_i), \ldots, f_M(\chi_i))$. This feature vector can be extended by additional univariate attributes and feature vectors from other types of time series. If a machine learning problem had a total of $K$ different time series and $U$ univariate variables per sample $i$, the resulting design matrix would have $N$ rows and $(K \cdot M + U)$ columns. Here, $M$ denotes the number of features from time series characterization methods. The processed time series do not need to have the same number of data points.
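The mapping from time series to a design matrix can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four calculators, the sample layout, and the names `CALCULATORS`, `feature_vector`, and `design_matrix` are hypothetical and are not part of tsfresh.

```python
from statistics import mean, pstdev

# Hypothetical characterization methods f_j: each maps a whole series to one number.
CALCULATORS = {
    "mean": mean,
    "std": pstdev,
    "max": max,
    "length": len,
}


def feature_vector(series):
    """Apply every f_j to one time series, giving M features."""
    return [float(f(series)) for f in CALCULATORS.values()]


def design_matrix(samples):
    """Build one row per sample with K * M + U columns.

    Each sample is a pair (dict of kind -> series, list of U univariate attributes).
    """
    rows = []
    for kinds, attributes in samples:
        row = []
        for kind in sorted(kinds):  # fixed column order across samples
            row.extend(feature_vector(kinds[kind]))
        row.extend(attributes)
        rows.append(row)
    return rows


# N = 2 samples, K = 2 kinds of time series, U = 1 univariate attribute.
# The series deliberately have different lengths.
samples = [
    ({"force": [1.0, 2.0, 3.0], "torque": [0.5, 0.5]}, [7.0]),
    ({"force": [4.0, 4.0, 4.0, 4.0, 4.0], "torque": [1.0, 3.0, 2.0]}, [9.0]),
]
X = design_matrix(samples)
```

With M = 4 calculators, each of the two rows has 2 · 4 + 1 = 9 columns, independent of how many data points the individual series contain.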

For classification and regression tasks, the significance of extracted features is of high importance, because too many irrelevant features will impair the ability of the algorithm to generalize beyond the train set and result in overfitting [12]. Therefore, tsfresh provides a highly parallel feature selection algorithm on basis of statistical hypothesis tests, which by default are configured automatically depending on the type of supervised machine learning problem (classification/regression) and the feature type (categorical/continuous) [12].
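To illustrate how per-feature hypothesis tests can be turned into a selection, the sketch below assumes that one p-value per feature has already been computed and applies a Benjamini-Yekutieli style false discovery rate correction. The choice of this correction, the function name, and the `fdr_level` default are assumptions made for illustration; they are not details stated in the text above.

```python
def select_by_benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return the feature names judged relevant at the given FDR level.

    p_values: dict mapping feature name -> p-value of its hypothesis test.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Correction term c(m) = sum_{i=1..m} 1/i, valid under arbitrary dependence.
    harmonic = sum(1.0 / i for i in range(1, m + 1))
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value lies under its threshold.
        if p <= rank / (m * harmonic) * fdr_level:
            cutoff = rank
    return [name for name, _ in ranked[:cutoff]]


# Hypothetical p-values for four features.
p_values = {"a": 0.0001, "b": 0.004, "c": 0.03, "d": 0.6}
relevant = select_by_benjamini_yekutieli(p_values)
```

Because every feature is tested on its own, the p-values can be computed independently of one another, which is what makes this kind of selection easy to parallelize.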

3. Software framework

3.1 Software architecture

By widely deploying pandas.DataFrames, e.g. as input and output objects, and providing scikit-learn compatible transformer classes, tsfresh implements the application programming interfaces of the most popular Python machine learning and data analysis frameworks such as scikit-learn [16], NumPy [17], pandas [18], scipy [19], Keras [20] or TensorFlow [21]. This enables users to seamlessly integrate tsfresh into complex machine learning systems that rely on state-of-the-art Python data analysis packages.

The feature_extraction submodule (Fig. 1) contains both the collection of feature calculators and the logic to apply them efficiently to the time series data. The main public function of this submodule is extract_features. The number and parameters of all extracted features are controlled by a settings dictionary. It can be filled manually, instantiated using one of the predefined objects, or reconstructed from the column names of an existing feature matrix. The feature_selection submodule provides the function select_features, which implements the highly parallel feature selection algorithm [12]. Additionally, tsfresh contains several minor submodules: utilities provides helper functions used throughout the package; convenience contains the extract_relevant_features function, which combines extraction and selection with an additional imputing step in between; and transformers enables the usage of tsfresh as part of scikit-learn [16] pipelines. The test suite covers 97% of the code base.
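A settings dictionary of this kind can be pictured as a mapping from calculator name to its parameter sets. The sketch below is plain Python and does not call tsfresh; the layout (calculator name mapped to `None` or to a list of parameter dictionaries), the helper `expected_columns`, and the double-underscore naming are assumptions chosen to mirror how the library is commonly used.

```python
# Hypothetical settings: "maximum" takes no parameters,
# "autocorrelation" is requested for two different lags.
SETTINGS = {
    "maximum": None,
    "autocorrelation": [{"lag": 1}, {"lag": 2}],
}


def expected_columns(kind, settings):
    """List the feature columns that a settings dictionary would produce."""
    columns = []
    for calculator, parameter_sets in settings.items():
        if parameter_sets is None:
            columns.append(f"{kind}__{calculator}")
            continue
        for parameters in parameter_sets:
            suffix = "__".join(
                f"{name}_{value}" for name, value in sorted(parameters.items())
            )
            columns.append(f"{kind}__{calculator}__{suffix}")
    return columns


columns = expected_columns("force", SETTINGS)
```

Adding or removing entries in the dictionary changes both the number of computed features and their parameters, which is the control mechanism the paragraph above describes.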

The feature selection and the calculation of features in tsfresh are parallelized and unnecessary calculations are prevented by calculating groups of similar features and sharing auxiliary results. For example, if multiple features return the coefficients of a fitted autoregressive model (AR), the AR model is only fitted once and shared. Local parallelization is realized on basis of the Python module multiprocessing, which is used both for feature selection and feature extraction. Distributed computing on a cluster is supported on basis of Dask [22].
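The idea of computing a group of similar features while sharing auxiliary results can be sketched as follows. Instead of the AR model mentioned above, this illustration uses the autocorrelation at several lags, where the mean and variance are the shared intermediate results; the function name and the output keys are hypothetical.

```python
def autocorrelations(series, lags):
    """Compute the autocorrelation for several lags in one pass.

    The mean and variance are computed once and reused for every lag,
    rather than being recomputed by one calculator per lag.
    """
    n = len(series)
    mu = sum(series) / n
    variance = sum((x - mu) ** 2 for x in series) / n
    centered = [x - mu for x in series]  # shared auxiliary result
    features = {}
    for lag in lags:
        covariance = sum(
            centered[t] * centered[t + lag] for t in range(n - lag)
        ) / (n - lag)
        features[f"autocorrelation__lag_{lag}"] = covariance / variance
    return features


features = autocorrelations([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lags=[0, 1])
```

A function that returns several features at once trades a slightly more complex return type for avoiding repeated work, which matters when hundreds of features are computed per series.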

The parallelization in the extraction and selection process of the features enables significant runtime improvement (Fig. 2). However, the memory temporarily used by the feature calculators quickly adds up in a parallel regime. Hence, the memory consumption of the parallelized calculations can be high, which can make the usage of a high number of processes on machines with low memory infeasible.

3.2 Software functionalities

tsfresh provides 63 time series characterization methods, which compute a total of 794 time series features. A design matrix of univariate attributes can be extended by time series features from one or more associated time series. Alternatively, a design matrix can be generated from a set of time series, which might have a different number of data points and could comprise different types of time series. The resulting design matrix can be used either for unsupervised machine learning or for supervised machine learning, in which case statistically significant features can be selected with respect to the classification or regression problem at hand. For this purpose, hypothesis tests are automatically configured depending on the type of supervised machine learning problem (classification/regression) and the feature type (categorical/continuous) [12].
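The automatic configuration can be thought of as a lookup keyed by the type of the target and the type of the feature. The sketch below only returns the name of a test and runs none; the concrete tests in the table reflect my recollection of the FRESH paper [12] and should be treated as assumptions, as should the simple rule for deciding whether a column is binary.

```python
# Assumed mapping (target type, feature type) -> hypothesis test.
TEST_BY_TYPES = {
    ("binary", "binary"): "Fisher's exact test",
    ("binary", "real"): "Kolmogorov-Smirnov test",
    ("real", "binary"): "Kolmogorov-Smirnov test",
    ("real", "real"): "Kendall rank test",
}


def value_type(values):
    """Classify a column as binary (exactly two distinct values) or real."""
    return "binary" if len(set(values)) == 2 else "real"


def choose_test(target, feature):
    """Pick the hypothesis test for one feature column and the target."""
    return TEST_BY_TYPES[(value_type(target), value_type(feature))]


# Classification target with a continuous feature.
chosen = choose_test(target=[0, 1, 1, 0, 1], feature=[0.2, 1.7, 1.4, 0.1, 2.2])
```

Keeping the mapping in one table makes it clear that the user only supplies the data, and the test follows from the types.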

Rows of the design matrix correspond to samples identified by their id, while its columns correspond to the extracted features. Column names uniquely identify the corresponding features with respect to the following three aspects: (1) the kind of time series the feature is based on, (2) the name of the feature calculator that was used to extract the feature, and (3) key-value pairs of parameters configuring the respective feature calculator:

[kind]__[calculator]__[parameterA]_[valueA]__[parameterB]_[valueB]
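Such a column name can be taken apart again to recover the three aspects. The sketch below is a hand-written parser, not a tsfresh function. It assumes the double-underscore separators that the tsfresh library uses between kind, calculator, and each parameter pair, and it assumes that parameter values themselves contain no underscore. The example column name is hypothetical.

```python
def parse_feature_name(column):
    """Split a feature column name into (kind, calculator, parameters)."""
    kind, calculator, *parameter_parts = column.split("__")
    parameters = {}
    for part in parameter_parts:
        # The value is whatever follows the last single underscore;
        # parameter names such as "chunk_len" may contain underscores.
        name, _, value = part.rpartition("_")
        parameters[name] = value
    return kind, calculator, parameters


parsed = parse_feature_name('F_x__fft_coefficient__coeff_1__attr_"abs"')
```

Values are returned as strings, so a caller that wants to rebuild a settings dictionary from column names would still need to convert them to numbers or unquoted strings.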

For supervised machine learning tasks for which an instance of the sklearn-compatible transformer FeatureSelector has been fitted, the feature importance can be inspected. It is reported as (1 − p-value) with respect to the result of the automatically configured hypothesis tests.

4. Illustrative example and empirical results

For this example, we will inspect 15 force and torque sensors on a robot that allow the detection of failures in the robots based on the sensor time series. This example is extended in the Jupyter notebook robot_failure_example.ipynb.

The runtime analysis given in the appendix (Fig. A.1) is summarized in Fig. 3. It shows the histogram and boxplot of the logarithm of the average runtimes for the feature calculators of tsfresh for a time series of 1000 data points. The average has been obtained from a sample of 30 different time series for which all features had been computed three times. The time series were simulated beforehand from a sequence (its defining equation is not reproduced here) with $\eta_t \sim \mathcal{N}(0, 1)$ [25, p. 164] and $y$ being the target. The figure shows that all but 9 of the time series feature extraction methods have an average runtime below 25 ms.
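The measurement protocol described here (30 series of 1000 points, each feature computed three times, then averaged) can be sketched with the standard library. Since the simulation equation is not available, the sketch substitutes plain Gaussian noise for the simulated series; that substitution, the function name, and the two example calculators are assumptions.

```python
import random
import statistics
import time


def average_runtime_ms(calculator, n_series=30, n_points=1000, repetitions=3, seed=0):
    """Average runtime of one feature calculator in milliseconds."""
    rng = random.Random(seed)
    timings = []
    for _ in range(n_series):
        # Stand-in for the simulated series: independent N(0, 1) noise.
        series = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        for _ in range(repetitions):
            start = time.perf_counter()
            calculator(series)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


runtime_max = average_runtime_ms(max)
runtime_std = average_runtime_ms(statistics.pstdev)
```

The absolute numbers depend on the machine, so only the comparison between calculators is meaningful.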

On a selection of 31 datasets from the UCR time series classification archive as well as an industrial dataset from the production of steel billets by a German steel manufacturer, the FRESH algorithm was able to automatically extract relevant features for time series classification tasks [12]. The FRESH algorithm scales linearly with the number of used features, the number of devices/samples, and the number of different time series [12]. Its scaling with respect to the length of the time series or number of features depends on the individual feature calculators. Some features such as the maximal value scale linearly with the length of the time series, while other, more complex features have higher execution costs, such as the calculation of the sample entropy (Fig. A.1). So, adjusting the calculated features can change the total runtime of tsfresh drastically.

In line 5 we load the dataset. Then, in line 6, the features are extracted from the pandas.DataFrame df containing the time series. The resulting pandas.DataFrame X, the design matrix, comprises 3270 time series features. We have to replace non-finite values (e.g. NaN or inf) by imputing in line 7. Finally, the feature selection of tsfresh is used to filter out irrelevant features. The final design matrix X_filtered contains 623 time series features, which can now be used for training a classifier (e.g. a RandomForest from the scikit-learn package) to predict a robot execution failure.
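The listing that these line numbers refer to is not reproduced in this text. The steps it describes can be sketched as follows, assuming the robot execution failures example that ships with tsfresh. The wrapper function `run_robot_failure_example` is hypothetical, and the imports sit inside it so that the sketch can be loaded without tsfresh installed. Running it needs tsfresh and network access to download the dataset, and the feature counts depend on the tsfresh version.

```python
def run_robot_failure_example():
    """Extract, impute, and select features for the robot failure dataset."""
    from tsfresh import extract_features, select_features
    from tsfresh.examples.robot_execution_failures import (
        download_robot_execution_failures,
        load_robot_execution_failures,
    )
    from tsfresh.utilities.dataframe_functions import impute

    download_robot_execution_failures()

    # "Line 5" in the text: load the time series and the target.
    df, y = load_robot_execution_failures()

    # "Line 6": compute the design matrix, one row per sample id.
    X = extract_features(df, column_id="id", column_sort="time")

    # "Line 7": replace NaN and infinite values.
    X = impute(X)

    # Final step: keep only the statistically relevant features.
    X_filtered = select_features(X, y)
    return X, X_filtered
```

The returned X_filtered can be passed together with y to any scikit-learn classifier, for example a random forest.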

5. Conclusions

The Python-based machine learning library tsfresh is a fast and standardized machine learning library for automatic time series feature extraction and selection. It is the only Python-based machine learning library for this purpose. The only alternative is the Matlab-based package hctsa [26], which extracts more than 7700 time series features. Because tsfresh implements the application programming interface of scikit-learn, it can be easily integrated into complex machine learning pipelines.

The widespread adoption of the tsfresh package shows that there is a pressing need to automatically extract features, originating from e.g. financial, biological, or industrial applications. We expect that, due to the increasing availability of temporally annotated data, the interest in such tools will continue to grow.

[Source article]

Maximilian Christ, Nils Braun, Julius Neuffer, Andreas W. Kempa-Liehr,
Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package),
Neurocomputing,
Volume 307,
2018,
Pages 72-77,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.03.067.
(https://www.sciencedirect.com/science/article/pii/S0925231218304843)
 
