A detailed guide to tsfresh features in Python (work in progress)

豆乳_艾米

Article link: https://blog.csdn.net/yunini2/article/details/92805230

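tsfresh is an open-source Python package for extracting features from time series data. It can extract more than 64 kinds of features and can fairly be called the Swiss Army knife of time series feature extraction. I have been reading through it recently because of a project need. There is no Chinese documentation yet, and the meaning of some features is hard to work out, so I am posting the ones I have understood here; for the ones I have not understood I list only the title and will add notes once I do. => Thanks to the author of the original post; I have added some content on top of it.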

Original post: https://blog.csdn.net/xindoo/article/details/79177378

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html

tsfresh.feature_extraction.feature_calculators.abs_energy(x)
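The sum of squares of the time series (its absolute energy):

$E = \sum_{i=1}^{n} x_i^2$

Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the feature value
Return type: float
This function is of type: simple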

tsfresh.feature_extraction.feature_calculators.absolute_sum_of_changes(x)
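Returns the sum of the absolute values of the consecutive changes in the series x:

$\sum_{i=1}^{n-1} | x_{i+1} - x_{i} |$

Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the feature value
Return type: float
This function is of type: simple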
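
Both of the features above reduce to a single NumPy expression. The following is a minimal sketch, assuming NumPy is available; the `_sketch` function names are made up for illustration and are not part of tsfresh.

```python
import numpy as np


def abs_energy_sketch(x):
    """Sum of the squared values of the series."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x, x))


def absolute_sum_of_changes_sketch(x):
    """Sum of the absolute differences between consecutive values."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(np.diff(x))))


series = [1.0, 3.0, 2.0, 5.0]
print(abs_energy_sketch(series))               # 1 + 9 + 4 + 25 = 39.0
print(absolute_sum_of_changes_sketch(series))  # 2 + 1 + 3 = 6.0
```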

tsfresh.feature_extraction.feature_calculators.agg_autocorrelation(x, param)
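Computes the autocorrelation over a range of lags and aggregates the results with the function f_agg (for example the variance or the mean). To some extent this measures the periodicity of the data: l denotes the lag, and a large value at a particular l suggests that the series has a period of l.

$R(l) = \frac{1}{(n-l)\sigma^{2}} \sum_{t=1}^{n-l}(X_{t}-\mu)(X_{t+l}-\mu)$

where n is the length of the time series $X_{i}$, $\sigma^{2}$ is its variance, and μ is its mean.
Parameters: x (pandas.Series) – the time series to calculate the feature of
param (list) – contains dictionaries that name the aggregation function under the key "f_agg"
Returns: the feature value
Return type: float
This function is of type: simple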
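
To make the link between lag and periodicity concrete, here is a rough sketch. It assumes the usual estimator, in which the lagged products of the centered series are divided by (n - l) times the variance; the function names and the `maxlag` argument are illustrative choices, not tsfresh's own code.

```python
import numpy as np


def autocorrelation_sketch(x, lag):
    """Autocorrelation of x at one lag, normalized by (n - lag) * variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, var = x.mean(), x.var()
    products = (x[:n - lag] - mu) * (x[lag:] - mu)
    return float(np.sum(products) / ((n - lag) * var))


def agg_autocorrelation_sketch(x, f_agg=np.mean, maxlag=10):
    """Aggregate the autocorrelations at lags 1..maxlag with f_agg."""
    lags = range(1, min(maxlag, len(x) - 1) + 1)
    return float(f_agg([autocorrelation_sketch(x, lag) for lag in lags]))


t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 20)         # a sine wave with period 20
print(autocorrelation_sketch(periodic, 20))   # close to 1: one full period
print(autocorrelation_sketch(periodic, 10))   # close to -1: half a period
print(agg_autocorrelation_sketch(periodic, np.mean, maxlag=40))
```

The lag that matches the period gives a value near 1, which is the "large value at lag l" the description refers to.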

tsfresh.feature_extraction.feature_calculators.agg_linear_trend(x, param)
Aggregates the series over chunks (max, min, mean, median), then fits a linear regression to the aggregated values and returns the pvalue, rvalue (correlation coefficient), intercept, slope, and stderr (standard error of the fit).
Parameters: x (pandas.Series) – the time series to calculate the feature of
param (list) – contains dictionaries {"attr": x, "chunk_len": l, "f_agg": f} with x, f strings and l an int
Returns: the different feature values
Return type: pandas.Series
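
The chunk-then-regress idea can be sketched in a few lines. This is an illustration only: it uses `np.polyfit` and returns just the slope and intercept, leaving out pvalue, rvalue, and stderr; the function name is hypothetical.

```python
import numpy as np


def agg_linear_trend_sketch(x, chunk_len, f_agg=np.mean):
    """Aggregate x over consecutive chunks, then fit a line to the aggregates."""
    x = np.asarray(x, dtype=float)
    # The last chunk may be shorter than chunk_len.
    agg = np.array([f_agg(x[i:i + chunk_len]) for i in range(0, len(x), chunk_len)])
    # Least-squares line through the points (0, agg[0]), (1, agg[1]), ...
    slope, intercept = np.polyfit(np.arange(len(agg)), agg, deg=1)
    return {"slope": float(slope), "intercept": float(intercept)}


x = np.arange(12, dtype=float)   # 0, 1, ..., 11
# Chunk maxima are 2, 5, 8, 11, so the fitted line has slope 3 and intercept 2.
print(agg_linear_trend_sketch(x, chunk_len=3, f_agg=np.max))
```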

tsfresh.feature_extraction.feature_calculators.approximate_entropy(x, m, r)
Approximate entropy, which quantifies the regularity and the unpredictability of fluctuations in a time series

tsfresh.feature_extraction.feature_calculators.ar_coefficient(x, param)
Coefficients of an autoregressive AR(k) model fitted by maximum likelihood; the parameter k is the maximum lag of the process

tsfresh.feature_extraction.feature_calculators.augmented_dickey_fuller(x, param)

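The Augmented Dickey-Fuller (ADF) test is used in time series analysis to check whether a unit root is present in a sample of a variable; the feature returns the value of the test statistic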

tsfresh.feature_extraction.feature_calculators.autocorrelation(x, lag)
The autocorrelation coefficient at the given lag

tsfresh.feature_extraction.feature_calculators.binned_entropy(x, max_bins)
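Splits the value range of the series into max_bins equal-width bins, assigns each value to its bin, and then computes the entropy:

$- \sum_{k=0}^{\min(max\_bins, len(x))} p_{k} \log(p_{k}) \cdot \mathbf{1}_{(p_{k} > 0)}$

where $p_{k}$ is the fraction of the values that fall into the k-th bin. This feature measures how evenly the sample values are distributed.
Parameters: x (pandas.Series) – the time series to calculate the feature of
max_bins (int) – the number of bins
Returns: the feature value
Return type: float
This function is of type: simple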
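
As a quick illustration of the binning step, the sketch below uses `np.histogram` to build equal-width bins and then takes the entropy of the bin fractions with the natural logarithm. The function name is hypothetical and this is not tsfresh's implementation.

```python
import numpy as np


def binned_entropy_sketch(x, max_bins):
    """Entropy of the fractions of values falling into equal-width bins."""
    x = np.asarray(x, dtype=float)
    counts, _ = np.histogram(x, bins=max_bins)   # equal-width bins over [min, max]
    p = counts / len(x)
    p = p[p > 0]                                 # empty bins contribute nothing
    return float(-np.sum(p * np.log(p)))


evenly_spread = np.arange(100)                   # 10 values in each of 10 bins
concentrated = np.array([0.0] * 99 + [1.0])      # almost everything in one bin
print(binned_entropy_sketch(evenly_spread, 10))  # log(10), the maximum for 10 bins
print(binned_entropy_sketch(concentrated, 10))   # much smaller
```

An evenly spread series reaches the maximum entropy log(max_bins), while a series concentrated in one bin scores close to zero.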

tsfresh.feature_extraction.feature_calculators.c3(x, lag)

$\frac{1}{n-2lag} \sum_{i=1}^{n-2lag} x_{i + 2 \cdot lag} \cdot x_{i + lag} \cdot x_{i}$

which is

$\mathbb{E}[L^2(X) \cdot L(X) \cdot X]$

where L is the lag operator. This is a measure of the non-linearity of the time series.

tsfresh.feature_extraction.feature_calculators.change_quantiles(x, ql, qh, isabs, f_agg)
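First fixes a corridor given by the quantiles ql and qh of the distribution of x, then applies the aggregation function f_agg to the consecutive changes of the series (or to their absolute values) inside this corridor.

Parameters:
x (pandas.Series) – the time series
ql (float) – the lower quantile of the corridor
qh (float) – the upper quantile of the corridor
isabs (bool) – whether to use the absolute values of the changes
f_agg (str, name of a numpy function (e.g. mean, var, std, median)) – the numpy aggregation function to apply (mean, variance, standard deviation, median)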
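
The corridor logic is easier to see in code. The sketch below is one possible reading: a change is kept only when both of its endpoints lie between the ql and qh quantiles. The function name, the inclusive bounds, and the 0.0 returned for an empty selection are assumptions of this sketch.

```python
import numpy as np


def change_quantiles_sketch(x, ql, qh, isabs, f_agg=np.mean):
    """Aggregate the consecutive changes whose endpoints both lie in the corridor."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [ql, qh])
    inside = (x >= lo) & (x <= hi)
    keep = inside[:-1] & inside[1:]      # both x[i] and x[i+1] are in the corridor
    changes = np.diff(x)
    if isabs:
        changes = np.abs(changes)
    return float(f_agg(changes[keep])) if keep.any() else 0.0


x = [0.0, 1.0, 9.0, 2.0, 3.0, -8.0, 1.0, 2.0]
# The 10% and 90% quantiles are -2.4 and 4.8, so the outliers 9 and -8 are excluded
# and only the three changes of size 1 remain.
print(change_quantiles_sketch(x, 0.1, 0.9, isabs=True, f_agg=np.mean))   # 1.0
print(change_quantiles_sketch(x, 0.0, 1.0, isabs=True, f_agg=np.mean))   # all changes
```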

tsfresh.feature_extraction.feature_calculators.cid_ce(x, normalize)
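Estimates the complexity of a time series: a more complex series has more peaks and valleys. If normalize is set, the series is z-transformed first.
$\sqrt{ \sum_{i=1}^{n-1} ( x_{i} - x_{i+1})^2 }$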
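
A short sketch of this complexity estimate follows, assuming that `normalize` means a z-transform of the series before the differences are taken. The function name is hypothetical.

```python
import numpy as np


def cid_ce_sketch(x, normalize):
    """Square root of the sum of squared consecutive differences."""
    x = np.asarray(x, dtype=float)
    if normalize:
        s = x.std()
        if s == 0:
            return 0.0                   # a constant series has no complexity
        x = (x - x.mean()) / s
    d = np.diff(x)
    return float(np.sqrt(np.dot(d, d)))


smooth = [0, 1, 2, 3, 4]
jagged = [0, 4, 0, 4, 0]
print(cid_ce_sketch(smooth, normalize=False))   # sqrt(4)  = 2.0
print(cid_ce_sketch(jagged, normalize=False))   # sqrt(64) = 8.0
```

Both series cover the same range, but the jagged one scores higher because it keeps changing direction.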

tsfresh.feature_extraction.feature_calculators.count_above_mean(x)
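The number of values in x that are greater than the mean of x

tsfresh.feature_extraction.feature_calculators.count_below_mean(x)
The number of values in x that are smaller than the mean of x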

tsfresh.feature_extraction.feature_calculators.cwt_coefficients(x, param)
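Computes a continuous wavelet transform (cwt) with the Ricker wavelet, also known as the "Mexican hat" wavelet:
$\frac{2}{\sqrt{3a} \pi^{\frac{1}{4}}} \left(1 - \frac{x^2}{a^2}\right) \exp\left(-\frac{x^2}{2a^2}\right)$
where a is the width parameter of the wavelet. The transform is computed once for each distinct array of widths passed in param.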
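
The wavelet formula can be evaluated directly and convolved with a signal. The sketch below does this with plain NumPy; the function names and the truncation of the kernel at five widths are assumptions made for illustration, not how tsfresh computes the feature.

```python
import numpy as np


def ricker_sketch(x, a):
    """Ricker ("Mexican hat") wavelet of width a evaluated at x."""
    x = np.asarray(x, dtype=float)
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - x ** 2 / a ** 2) * np.exp(-x ** 2 / (2.0 * a ** 2))


def cwt_sketch(signal, widths):
    """One row of wavelet coefficients per width, same length as the signal."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for a in widths:
        # Truncate the kernel so it is never longer than the signal.
        half = min(5 * a, (len(signal) - 1) // 2)
        kernel = ricker_sketch(np.arange(-half, half + 1), a)
        rows.append(np.convolve(signal, kernel, mode="same"))
    return np.vstack(rows)


impulse = np.zeros(51)
impulse[25] = 1.0
coeffs = cwt_sketch(impulse, widths=[2, 5])
print(coeffs.shape)            # (2, 51)
print(ricker_sketch(2.0, 2))   # 0.0: the wavelet crosses zero at x = a
```

Convolving a unit impulse reproduces the wavelet itself, centered on the impulse, which is a handy check of the kernel.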

tsfresh.feature_extraction.feature_calculators.energy_ratio_by_chunks(x, param)
Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole series.

Takes as input parameters the number num_segments of segments to divide the series into and segment_focus which is the segment number (starting at zero) to return a feature on.

If the length of the time series is not a multiple of the number of segments, the remaining data points are distributed on the bins starting from the first. For example, if your time series consists of 8 entries, the first two bins will contain 3 and the last two values, e.g. [ 0., 1., 2.], [ 3., 4., 5.] and [ 6., 7.].

Note that the answer for num_segments = 1 is a trivial “1” but we handle this scenario in case somebody calls it. Sum of the ratios should be 1.0.

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
param – contains dictionaries {“num_segments”: N, “segment_focus”: i} with N, i both ints

Returns:
the feature values

Return type:
list of tuples (index, data)
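
The chunking rule described above (leftover points go to the first chunks) happens to match what `np.array_split` does, so a sketch is short. The function name is hypothetical and the code handles one (num_segments, segment_focus) pair at a time.

```python
import numpy as np


def energy_ratio_by_chunks_sketch(x, num_segments, segment_focus):
    """Sum of squares of one chunk divided by the sum of squares of the whole series."""
    x = np.asarray(x, dtype=float)
    chunks = np.array_split(x, num_segments)   # leftover points go to the first chunks
    return float(np.sum(chunks[segment_focus] ** 2) / np.sum(x ** 2))


x = np.arange(8, dtype=float)
print([len(c) for c in np.array_split(x, 3)])   # [3, 3, 2]
ratios = [energy_ratio_by_chunks_sketch(x, 3, i) for i in range(3)]
print(ratios)        # 5/140, 50/140, 85/140
print(sum(ratios))   # the ratios add up to 1 (up to rounding)
```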

tsfresh.feature_extraction.feature_calculators.fft_aggregated(x, param)
Returns the spectral centroid (mean), variance, skew, and kurtosis of the absolute Fourier transform spectrum.

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {“aggtype”: s} where s str and in [“centroid”, “variance”, “skew”, “kurtosis”]

Returns:
the different feature values

Return type:
pandas.Series

This function is of type: combiner

tsfresh.feature_extraction.feature_calculators.fft_coefficient(x, param)
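Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform using the fast Fourier transform (FFT) algorithm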

$A_k = \sum_{m=0}^{n-1} a_m \exp \left \{ -2 \pi i \frac{m k}{n} \right \}, \qquad k = 0, \ldots , n-1.$

The resulting coefficients are complex; this feature calculator can return the real part (attr=="real"), the imaginary part (attr=="imag"), the absolute value (attr=="abs"), and the angle in degrees (attr=="angle").

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {“coeff”: x, “attr”: s} with x int and x >= 0, s str and in [“real”, “imag”, “abs”, “angle”]

Returns:
the different feature values

Return type:
pandas.Series

This function is of type: combiner
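
To show what the four attr choices return, here is a small sketch built on `np.fft.rfft`. It handles a single (coeff, attr) pair rather than a list of dictionaries, and the function name is made up for illustration.

```python
import numpy as np


def fft_coefficient_sketch(x, coeff, attr):
    """One attribute of one Fourier coefficient of a real-valued series."""
    c = np.fft.rfft(np.asarray(x, dtype=float))[coeff]
    if attr == "real":
        return float(c.real)
    if attr == "imag":
        return float(c.imag)
    if attr == "abs":
        return float(np.abs(c))
    if attr == "angle":
        return float(np.angle(c, deg=True))
    raise ValueError(f"unknown attr: {attr}")


t = np.arange(32)
x = 1.0 + np.cos(2 * np.pi * 3 * t / 32)     # constant offset plus 3 cycles
print(fft_coefficient_sketch(x, 0, "real"))  # about 32: the sum of the series
print(fft_coefficient_sketch(x, 3, "abs"))   # about 16: the 3-cycle component
print(fft_coefficient_sketch(x, 5, "abs"))   # about 0: no 5-cycle component
```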

tsfresh.feature_extraction.feature_calculators.first_location_of_maximum(x)
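The first position at which the maximum of x occurs, expressed relative to the length of x

tsfresh.feature_extraction.feature_calculators.first_location_of_minimum(x)
The first position at which the minimum of x occurs, expressed relative to the length of x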

tsfresh.feature_extraction.feature_calculators.friedrich_coefficients(x, param)
Coefficients of the polynomial h(x) that has been fitted to the deterministic dynamics of the Langevin model

$\dot{x}(t) = h(x(t)) + \mathcal{N}(0,R)$

as described by Friedrich et al. (2000).

For short time series this method is highly dependent on the parameters.

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {"m": x, "r": y, "coeff": z} with x a positive integer, the order of the polynomial to fit for estimating the fixed points of the dynamics; y a positive float, the number of quantiles to use for averaging; and z a positive integer corresponding to the returned coefficient

Returns:
the different feature values

Return type:
pandas.Series

tsfresh.feature_extraction.feature_calculators.has_duplicate(x)
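Whether any value occurs more than once (bool)

tsfresh.feature_extraction.feature_calculators.has_duplicate_max(x)
Whether the maximum value occurs more than once (bool)

tsfresh.feature_extraction.feature_calculators.has_duplicate_min(x)
Whether the minimum value occurs more than once (bool)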

tsfresh.feature_extraction.feature_calculators.index_mass_quantile(x, param)

Calculates the relative index i such that q% of the mass of the time series x lies to the left of i
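
One way to read "mass" here is the cumulative sum of absolute values. The sketch below follows that reading with q given as a fraction between 0 and 1; the function name and the NaN returned for an all-zero series are assumptions of the sketch.

```python
import numpy as np


def index_mass_quantile_sketch(x, q):
    """Relative index at which the cumulative absolute mass first reaches q."""
    abs_x = np.abs(np.asarray(x, dtype=float))
    total = abs_x.sum()
    if total == 0:
        return float("nan")              # no mass to distribute
    cumulative = np.cumsum(abs_x) / total
    return float((np.argmax(cumulative >= q) + 1) / len(abs_x))


x = [1.0, 1.0, 1.0, 1.0, 6.0]            # cumulative mass: 0.1, 0.2, 0.3, 0.4, 1.0
print(index_mass_quantile_sketch(x, 0.25))   # 0.6: reached at the third point
print(index_mass_quantile_sketch(x, 0.5))    # 1.0: reached only at the last point
```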

tsfresh.feature_extraction.feature_calculators.kurtosis(x)

Returns the kurtosis of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G2)

tsfresh.feature_extraction.feature_calculators.large_standard_deviation(x, r)
Whether the standard deviation of x is greater than r times the range of x (maximum minus minimum)
$std(x) > r \cdot (max(x)-min(x))$

tsfresh.feature_extraction.feature_calculators.last_location_of_maximum(x)
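The last position at which the maximum of x occurs, expressed relative to the length of x

tsfresh.feature_extraction.feature_calculators.last_location_of_minimum(x)
The last position at which the minimum of x occurs, expressed relative to the length of x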
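
A small sketch of the location features for the maximum follows, assuming the positions are reported relative to the series length; the minimum variants would swap `argmax` for `argmin`. The function names are hypothetical.

```python
import numpy as np


def first_location_of_maximum_sketch(x):
    """Relative position of the first occurrence of the maximum."""
    x = np.asarray(x)
    return float(np.argmax(x) / len(x))


def last_location_of_maximum_sketch(x):
    """Relative position just past the last occurrence of the maximum."""
    x = np.asarray(x)
    return float(1.0 - np.argmax(x[::-1]) / len(x))


x = [1, 5, 2, 5, 0]                           # the maximum 5 occurs at indices 1 and 3
print(first_location_of_maximum_sketch(x))    # 1 / 5 = 0.2
print(last_location_of_maximum_sketch(x))     # 1 - 1 / 5 = 0.8
```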

tsfresh.feature_extraction.feature_calculators.length(x)

The length of x

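tsfresh.feature_extraction.feature_calculators.linear_trend(x, param)

Computes a linear least-squares regression of the values of the time series against the sequence from 0 to the length of the time series minus one. The feature assumes that the signal is uniformly sampled and does not use the time stamps to fit the model. The parameters control which attributes are returned; possible attributes are pvalue, rvalue, intercept, slope, stderr...

tsfresh.feature_extraction.feature_calculators.linear_trend_timewise(x, param)

The same regression as linear_trend, except that the model is fitted against the datetime index of the series instead of the positions 0 to length minus one.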

tsfresh.feature_extraction.feature_calculators.longest_strike_above_mean(x)
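The length of the longest consecutive subsequence of x that is greater than the mean of x

tsfresh.feature_extraction.feature_calculators.longest_strike_below_mean(x)
The length of the longest consecutive subsequence of x that is smaller than the mean of x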
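
The two "longest strike" features are run-length computations on a boolean mask. A minimal sketch with hypothetical function names:

```python
import numpy as np


def longest_run_sketch(mask):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in mask:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def longest_strike_above_mean_sketch(x):
    x = np.asarray(x, dtype=float)
    return longest_run_sketch(x > x.mean())


def longest_strike_below_mean_sketch(x):
    x = np.asarray(x, dtype=float)
    return longest_run_sketch(x < x.mean())


x = [1, 2, 8, 9, 7, 1, 9, 1]                  # mean is 4.75
print(longest_strike_above_mean_sketch(x))    # 3, the run 8, 9, 7
print(longest_strike_below_mean_sketch(x))    # 2, the run 1, 2
```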

tsfresh.feature_extraction.feature_calculators.max_langevin_fixed_point(x, r, m) (not yet understood)
Largest fixed point of the dynamics, $argmax_x \{h(x)=0\}$, estimated from the polynomial h(x) that has been fitted to the deterministic dynamics of the Langevin model
$\dot{x}(t) = h(x(t)) + R \mathcal{N}(0,1)$

as described by

Friedrich et al. (2000): Physics Letters A 271, p. 217-222, Extracting model equations from experimental data

For short time series this method is highly dependent on the parameters.

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
m (int) – order of the polynomial to fit for estimating the fixed points of the dynamics
r (float) – number of quantiles to use for averaging

Returns:
Largest fixed point of deterministic dynamics

Return type:
float

tsfresh.feature_extraction.feature_calculators.maximum(x)
The maximum value

tsfresh.feature_extraction.feature_calculators.mean(x)
The mean

tsfresh.feature_extraction.feature_calculators.mean_abs_change(x)
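The mean of the absolute values of the consecutive changes
$\frac{1}{n-1} \sum_{i=1,\ldots, n-1} | x_{i+1} - x_{i}|$

tsfresh.feature_extraction.feature_calculators.mean_change(x)
The mean of the consecutive changes
$\frac{1}{n-1} \sum_{i=1,\ldots, n-1} (x_{i+1} - x_{i}) = \frac{1}{n-1} (x_{n} - x_{1})$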
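
Both change features are means over the consecutive differences. The sketch below assumes the mean is taken over the n - 1 differences that `np.diff` produces; the function names are hypothetical.

```python
import numpy as np


def mean_abs_change_sketch(x):
    """Mean of the absolute consecutive differences."""
    return float(np.mean(np.abs(np.diff(np.asarray(x, dtype=float)))))


def mean_change_sketch(x):
    """Mean of the consecutive differences; the sum telescopes to x[-1] - x[0]."""
    return float(np.mean(np.diff(np.asarray(x, dtype=float))))


x = [1.0, 4.0, 2.0, 8.0]              # differences: 3, -2, 6
print(mean_abs_change_sketch(x))      # (3 + 2 + 6) / 3
print(mean_change_sketch(x))          # (8 - 1) / 3
```

Because the signed differences cancel, mean_change depends only on the first and last value, while mean_abs_change reacts to every swing in between.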

tsfresh.feature_extraction.feature_calculators.mean_second_derivative_central(x)

Returns the mean value of a central approximation of the second derivative

$\frac{1}{n-2} \sum_{i=1,\ldots, n-2} \frac{1}{2} (x_{i+2} - 2 \cdot x_{i+1} + x_i)$

tsfresh.feature_extraction.feature_calculators.median(x)
The median

tsfresh.feature_extraction.feature_calculators.minimum(x)
The minimum value

tsfresh.feature_extraction.feature_calculators.number_crossing_m(x, m)
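Counts the number of crossings of x over m. A crossing is defined as two sequential values where the first is lower than m and the next is greater, or vice versa. If m is set to zero, this gives the number of zero crossings.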

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
m (float) – the threshold for the crossing

Returns:
the value of this feature

Return type:
int
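
Counting crossings amounts to counting the places where "above m" flips between neighbouring values. A minimal sketch with a hypothetical function name:

```python
import numpy as np


def number_crossing_m_sketch(x, m):
    """Number of times consecutive values switch sides of the threshold m."""
    above = np.asarray(x, dtype=float) > m
    return int(np.sum(above[1:] != above[:-1]))


x = [-1, 2, 3, -2, 1, 0.5, -3]
print(number_crossing_m_sketch(x, 0))   # 4 zero crossings
```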

tsfresh.feature_extraction.feature_calculators.number_cwt_peaks(x, n)
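Searches for distinct peaks in x. To do so, x is smoothed by a Ricker wavelet for widths ranging from 1 to n. The feature returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).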

Parameters:
x (numpy.ndarray) – the time series to calculate the feature of
n (int) – maximum width to consider

Returns:
the value of this feature

Return type:
int

tsfresh.feature_extraction.feature_calculators.number_peaks(x, n)
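The number of peaks of support n in x, where a peak of support n is a value that is greater than its n neighbours to the left and its n neighbours to the right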
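
The role of n is clearest with a small example. The sketch below treats a peak of support n as a value strictly greater than its n neighbours on each side; the function name is hypothetical and the loop is written for readability rather than speed.

```python
import numpy as np


def number_peaks_sketch(x, n):
    """Count values strictly greater than their n neighbours on both sides."""
    x = np.asarray(x, dtype=float)
    count = 0
    for i in range(n, len(x) - n):
        left = x[i - n:i]
        right = x[i + 1:i + n + 1]
        if np.all(x[i] > left) and np.all(x[i] > right):
            count += 1
    return count


x = [3, 0, 0, 4, 0, 0, 13]
print(number_peaks_sketch(x, 1))   # 1: the value 4 beats its direct neighbours
print(number_peaks_sketch(x, 2))   # 1: it also beats two neighbours on each side
print(number_peaks_sketch(x, 3))   # 0: 13 is the third neighbour to the right
```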

tsfresh.feature_extraction.feature_calculators.partial_autocorrelation(x, param)
The value of the partial autocorrelation function at the given lag k

tsfresh.feature_extraction.feature_calculators.percentage_of_reoccurring_datapoints_to_all_datapoints(x)
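len(different values occurring more than once) / len(different values)
The number of distinct values that occur more than once divided by the number of distinct values (each repeated value is counted once). The percentage is thus normalized to the number of unique values, in contrast to percentage_of_reoccurring_values_to_all_values(x).

tsfresh.feature_extraction.feature_calculators.percentage_of_reoccurring_values_to_all_values(x)
The number of values that occur more than once divided by the total number of values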

tsfresh.feature_extraction.feature_calculators.quantile(x, q)
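Returns the q quantile of x, that is, the value that is greater than q% of the ordered values of x

tsfresh.feature_extraction.feature_calculators.range_count(x, min, max)
The number of values of x that lie in the interval [min, max)

tsfresh.feature_extraction.feature_calculators.ratio_beyond_r_sigma(x, r)
The ratio of values that lie more than r times the standard deviation away from the mean of x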
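
A brief sketch of this ratio, assuming the distance is measured from the mean and the population standard deviation (`np.std` with its default settings) is used; the function name is hypothetical.

```python
import numpy as np


def ratio_beyond_r_sigma_sketch(x, r):
    """Share of values farther than r standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(x - x.mean()) > r * x.std()) / len(x))


x = [0.0] * 9 + [10.0]                      # mean 1, standard deviation 3
print(ratio_beyond_r_sigma_sketch(x, 2))    # 0.1: only the value 10 is that far out
print(ratio_beyond_r_sigma_sketch(x, 0.2))  # 1.0: every value is at least 1 away
```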

tsfresh.feature_extraction.feature_calculators.ratio_value_number_to_time_series_length(x)
The number of unique values of x divided by the original length of x: len(set(x)) / len(x)

tsfresh.feature_extraction.feature_calculators.sample_entropy(x)
The sample entropy of x

tsfresh.feature_extraction.feature_calculators.set_property(key, value)

This method returns a decorator that sets the property key of the function to value

tsfresh.feature_extraction.feature_calculators.skewness(x)

Returns the skewness of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G1)

tsfresh.feature_extraction.feature_calculators.spkt_welch_density(x, param) (not yet understood)

This feature calculator estimates the cross power spectral density of the time series x at different frequencies. To do so, the time series is first shifted from the time domain to the frequency domain.

tsfresh.feature_extraction.feature_calculators.standard_deviation(x)
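The standard deviation

tsfresh.feature_extraction.feature_calculators.sum_of_reoccurring_data_points(x)
The sum of all data points that occur more than once in the series

tsfresh.feature_extraction.feature_calculators.sum_of_reoccurring_values(x)
The sum of all values that occur more than once in the series

tsfresh.feature_extraction.feature_calculators.sum_values(x)
The sum of all values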

tsfresh.feature_extraction.feature_calculators.symmetry_looking(x, param)
Whether the distribution of x looks symmetric, which is taken to be the case if
$| mean(X)-median(X)| < r * (max(X)-min(X))$
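
The inequality can be checked directly. The sketch below takes a single r rather than a param list, and the function name is made up for illustration.

```python
import numpy as np


def symmetry_looking_sketch(x, r):
    """True if |mean - median| is smaller than r times the range of x."""
    x = np.asarray(x, dtype=float)
    return bool(abs(x.mean() - np.median(x)) < r * (x.max() - x.min()))


symmetric = [1, 2, 3, 4, 5]      # mean == median == 3
skewed = [1, 1, 1, 1, 20]        # mean 4.8, median 1, range 19
print(symmetry_looking_sketch(symmetric, 0.05))   # True
print(symmetry_looking_sketch(skewed, 0.05))      # False: 3.8 is not below 0.95
print(symmetry_looking_sketch(skewed, 0.25))      # True: 3.8 is below 4.75
```

A larger r makes the test more tolerant, so the same skewed series passes once r is big enough.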

tsfresh.feature_extraction.feature_calculators.time_reversal_asymmetry_statistic(x, lag)

$\frac{1}{n-2lag} \sum_{i=1}^{n-2lag} x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} - x_{i + lag} \cdot x_{i}^2$

which is

$\mathbb{E}[L^2(X)^2 \cdot L(X) - L(X) \cdot X^2]$

where L is the lag operator.
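
The statistic can be sketched with three aligned slices of the series. The function name and the 0.0 returned when the series is too short for the lag are assumptions of this sketch.

```python
import numpy as np


def time_reversal_asymmetry_statistic_sketch(x, lag):
    """Mean of x[i+2lag]^2 * x[i+lag] - x[i+lag] * x[i]^2 over all valid i."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if 2 * lag >= n:
        return 0.0                       # not enough points for this lag
    x0 = x[:n - 2 * lag]                 # x_i
    x1 = x[lag:n - lag]                  # x_{i + lag}
    x2 = x[2 * lag:]                     # x_{i + 2 lag}
    return float(np.mean(x2 ** 2 * x1 - x1 * x0 ** 2))


rising = [1.0, 2.0, 3.0, 4.0]
print(time_reversal_asymmetry_statistic_sketch(rising, 1))         # 26.0
print(time_reversal_asymmetry_statistic_sketch(rising[::-1], 1))   # -26.0
```

Reversing the series flips the sign of the result, which is why the statistic is used to detect asymmetry under time reversal.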

tsfresh.feature_extraction.feature_calculators.value_count(x, value)
The number of occurrences of value in x

tsfresh.feature_extraction.feature_calculators.variance(x)
The variance

tsfresh.feature_extraction.feature_calculators.variance_larger_than_standard_deviation(x)
Whether the variance of x is greater than its standard deviation (bool)
