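ECG Task03
Feature Engineering
Competition page: https://tianchi.aliyun.com/competition/entrance/531883/introduction
Learning objectives
- Learn feature preprocessing methods for time series data
- Learn to use Tsfresh (TimeSeries Fresh), a tool for processing time series features
Contents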
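- Data preprocessing
  - Formatting the time series data
  - Adding the time-step feature time
- Feature engineering
  - Constructing time series features
  - Feature selection
  - Processing time series features with tsfresh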
Code example
1 Import packages and read the data
# Import packages
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")
print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)
2 Data preprocessing
# Reshape the heartbeat signals from wide to long (one row per sample) and add a time-step feature `time` for each signal
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
time heartbeat_signals
0 0 0.991230
0 1 0.943533
0 2 0.764677
0 3 0.618571
0 4 0.379632
... ... ...
99999 200 0.000000
99999 201 0.000000
99999 202 0.000000
99999 203 0.000000
99999 204 0.000000
20500000 rows × 2 columns
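The reshaping above can be tried on a toy frame first. The sketch below repeats the same split, stack, and rename steps as one method chain on two made-up signals of three samples each. The frame `toy`, its values, and the intermediate index name `row` are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Toy stand-in for train.csv: two signals of three samples each (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.2,0.0"],
})

long_df = (
    toy["heartbeat_signals"]
    .str.split(",", expand=True)    # one column per time step
    .stack()                        # (row, time step) MultiIndex -> long Series
    .astype(float)                  # strings -> floats
    .rename_axis(["row", "time"])   # name the two index levels
    .reset_index(name="heartbeat_signals")
    .set_index("row")               # keep the original row number as the index
)
long_df.index.name = None           # match the unnamed index used in the tutorial
print(long_df)
```

Each original row becomes as many rows as it has samples, and `time` counts the samples within a signal starting from 0.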
# Join the reshaped heartbeat signals back onto the training data, storing the label column separately
data_train_label = data_train["label"]
data_train = data_train.drop("label", axis=1)
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train = data_train.join(train_heartbeat_df)
data_train
id time heartbeat_signals
0 0 0 0.991230
0 0 1 0.943533
0 0 2 0.764677
0 0 3 0.618571
0 0 4 0.379632
... ... ... ...
99999 99999 200 0.0
99999 99999 201 0.0
99999 99999 202 0.0
99999 99999 203 0.0
99999 99999 204 0.0
20500000 rows × 3 columns
# Apply the same preprocessing to the test data
test_A_heartbeat_df = data_test_A["heartbeat_signals"].str.split(",", expand=True).stack()
test_A_heartbeat_df = test_A_heartbeat_df.reset_index()
test_A_heartbeat_df = test_A_heartbeat_df.set_index("level_0")
test_A_heartbeat_df.index.name = None
test_A_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
test_A_heartbeat_df["heartbeat_signals"] = test_A_heartbeat_df["heartbeat_signals"].astype(float)
test_A_heartbeat_df
time heartbeat_signals
0 0 0.991571
0 1 1.000000
0 2 0.631816
0 3 0.136230
0 4 0.041420
... ... ...
19999 200 0.000000
19999 201 0.000000
19999 202 0.000000
19999 203 0.000000
19999 204 0.000000
4100000 rows × 2 columns
data_test_A = data_test_A.drop("heartbeat_signals", axis=1)
data_test_A = data_test_A.join(test_A_heartbeat_df)
data_test_A
id time heartbeat_signals
0 100000 0 0.991571
0 100000 1 1.000000
0 100000 2 0.631816
0 100000 3 0.136230
0 100000 4 0.041420
... ... ... ...
19999 119999 200 0.000000
19999 119999 201 0.000000
19999 119999 202 0.000000
19999 119999 203 0.000000
19999 119999 204 0.000000
4100000 rows × 3 columns
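3 Processing time series features with tsfresh
tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for assessing feature importance and selecting features, so it is a good choice for feature extraction in both classification and regression problems on time series data. Official documentation: https://tsfresh.readthedocs.io/en/latest/index.html
# Feature extraction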
from tsfresh.feature_extraction import extract_features
train_features = extract_features(data_train, column_id='id', column_sort='time')
train_features
id sum_values abs_energy mean_abs_change mean_change ...
0 38.927945 18.216197 0.019894 -0.004859 ...
1 19.445634 7.705092 0.019952 -0.004762 ...
2 21.192974 9.140423 0.009863 -0.004902 ...
... ... ... ... ... ...
99997 40.897057 16.412857 0.019470 -0.004538 ...
99998 42.333303 14.281281 0.017032 -0.004902 ...
99999 53.290117 21.637471 0.021870 -0.004539 ...
100000 rows × 779 columns
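To make the first few columns of this output concrete, here is a minimal sketch that recomputes four of the simplest features with NumPy, following their documented definitions. The helper name `basic_features` and the four-sample signal are assumptions made for illustration. This is not tsfresh's own implementation.

```python
import numpy as np

def basic_features(x):
    """Recompute four simple tsfresh-style features for one 1-D signal."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)                              # consecutive differences
    return {
        "sum_values": x.sum(),                      # sum of all samples
        "abs_energy": np.sum(x ** 2),               # sum of squared samples
        "mean_abs_change": np.mean(np.abs(diffs)),  # average |x[t+1] - x[t]|
        "mean_change": np.mean(diffs),              # equals (x[-1] - x[0]) / (len(x) - 1)
    }

# Made-up four-sample signal, not taken from the competition data.
features = basic_features([1.0, 0.5, 0.25, 0.0])
print(features)
```

Because mean_change depends only on the first and last samples, a 205-sample signal that starts at 0.991230 and ends at 0 (the tail of signal 0 is not shown above, so the ending value is an assumption) would give -0.991230 / 204 ≈ -0.004859, which matches the value printed for id 0.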
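train_features contains several hundred common time series features computed from heartbeat_signals (see the official documentation for an explanation of each). Some of them may be NaN because the feature cannot be computed for the data at hand. The impute function handles these in place: it replaces each NaN with the column median and each infinite value with the column maximum or minimum:
from tsfresh.utilities.dataframe_functions import impute
# Replace NaN values in the extracted features (in place)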
impute(train_features)
heartbeat_signals__variance_larger_than_standard_deviation heartbeat_signals__has_duplicate_max heartbeat_signals__has_duplicate_min heartbeat_signals__has_duplicate heartbeat_signals__sum_values heartbeat_signals__abs_energy heartbeat_signals__mean_abs_change heartbeat_signals__mean_change heartbeat_signals__mean_second_derivative_central heartbeat_signals__median ... heartbeat_signals__permutation_entropy__dimension_5__tau_1 heartbeat_signals__permutation_entropy__dimension_6__tau_1 heartbeat_signals__permutation_entropy__dimension_7__tau_1 heartbeat_signals__query_similarity_count__query_None__threshold_0.0 heartbeat_signals__matrix_profile__feature_"min"__threshold_0.98 heartbeat_signals__matrix_profile__feature_"max"__threshold_0.98 heartbeat_signals__matrix_profile__feature_"mean"__threshold_0.98 heartbeat_signals__matrix_profile__feature_"median"__threshold_0.98 heartbeat_signals__matrix_profile__feature_"25"__threshold_0.98 heartbeat_signals__matrix_profile__feature_"75"__threshold_0.98
0 0.0 0.0 1.0 1.0 38.927945 18.216197 0.019894 -0.004859 0.000117 0.125531 ... 2.184420 2.500658 2.722686 NaN 6.445546 12.165525 10.246524 10.746992 8.388625 11.484910
1 0.0 0.0 1.0 1.0 19.445634 7.705092 0.019952 -0.004762 0.000105 0.030481 ... 2.710933 3.065802 3.224835 NaN 3.209140 12.649111 9.031069 9.437545 6.723180 12.094899
2 0.0 0.0 1.0 1.0 21.192974 9.140423 0.009863 -0.004902 0.000101 0.000000 ... 1.263370 1.406001 1.509478 NaN 3.054539 8.246211 7.370478 8.246211 5.966122 8.246211
3 0.0 0.0 1.0 1.0 42.113066 15.757623 0.018743 -0.004783 0.000103 0.241397 ... 2.986728 3.534354 3.854177 NaN 3.010557 9.797959 6.331360 6.406440 5.266743 7.091706
4 0.0 0.0 1.0 1.0 69.756786 51.229616 0.014514 0.000000 -0.000137 0.000000 ... 1.914511 2.165627 2.323993 NaN 9.181236 13.429784 9.959913 9.516290 9.286013 10.270925
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 0.0 0.0 1.0 1.0 63.323449 28.742238 0.023588 -0.004902 0.000794 0.388402 ... 2.873602 3.391830 3.679969 NaN 2.436377 9.591663 5.635231 6.366205 3.596982 7.033638
99996 0.0 0.0 1.0 1.0 69.657534 31.866323 0.017373 -0.004543 0.000051 0.421138 ... 3.085504 3.728881 4.095457 NaN 1.415410 7.483315 2.893592 2.684349 2.049241 3.334109
99997 0.0 0.0 1.0 1.0 40.897057 16.412857 0.019470 -0.004538 0.000834 0.213306 ... 2.601062 2.996962 3.293562 NaN 5.748652 12.165525 8.524637 7.983410 7.062217 10.081756
99998 0.0 0.0 1.0 1.0 42.333303 14.281281 0.017032 -0.004902 0.000013 0.264974 ... 3.236950 3.793512 4.018302 NaN 2.346822 8.246211 4.951374 4.727535 4.069786 5.615282
99999 0.0 0.0 1.0 1.0 53.290117 21.637471 0.021870 -0.004539 0.000023 0.320124 ... 2.949266 3.462549 3.688612 NaN 1.959139 9.380832 4.573691 3.908621 3.094614 5.916164
100000 rows × 787 columns
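The replacement rule that impute applies can be sketched in a few lines of pandas. The function below is a simplified re-implementation written for illustration, under the assumption that the documented rule is all that matters here (NaN to median, +inf to max, -inf to min, and zeros for a column with no finite value). The name `impute_like_tsfresh` and the toy frame are hypothetical. Unlike tsfresh's impute, it returns a copy instead of modifying its argument.

```python
import numpy as np
import pandas as pd

def impute_like_tsfresh(df):
    """Column-wise: NaN -> median, +inf -> max, -inf -> min of the finite values."""
    out = df.astype(float).copy()
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]   # drops NaN and +/-inf
        if finite.empty:
            out[col] = 0.0                     # no finite value at all: fill with zeros
            continue
        out[col] = (
            values.replace(np.inf, finite.max())
                  .replace(-np.inf, finite.min())
                  .fillna(finite.median())
        )
    return out

# Made-up frame: column "a" has one NaN and one +inf, column "b" is all NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [np.nan, np.nan, np.nan, np.nan],
})
clean = impute_like_tsfresh(toy)
print(clean)
```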
Next, select features according to their relevance to the response variable to decide which ones to keep.
from tsfresh import select_features
# Select features by their relevance to the training labels
train_features_filtered = select_features(train_features, data_train_label, ml_task='classification', multiclass=True, n_significant=4)
train_features_filtered
heartbeat_signals__sum_values heartbeat_signals__fft_coefficient__attr_"abs"__coeff_37 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_36 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29 ... heartbeat_signals__fft_coefficient__attr_"real"__coeff_7 heartbeat_signals__fft_coefficient__attr_"real"__coeff_64 heartbeat_signals__fft_coefficient__attr_"real"__coeff_65 heartbeat_signals__fft_coefficient__attr_"angle"__coeff_42 heartbeat_signals__fft_coefficient__attr_"real"__coeff_11 heartbeat_signals__fft_coefficient__attr_"real"__coeff_79 heartbeat_signals__fft_coefficient__attr_"real"__coeff_69 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_15 heartbeat_signals__fft_coefficient__attr_"real"__coeff_72 heartbeat_signals__fft_coefficient__attr_"imag"__coeff_20
0 38.927945 1.090709 0.848728 1.168685 0.982133 1.223496 1.236300 1.104172 1.497129 1.358095 ... 4.802874 0.632444 0.371408 -34.890840 7.586678 0.554999 0.243302 4.235934 0.599780 -2.762530
1 19.445634 1.280923 1.850706 1.460752 1.924501 1.925485 1.715938 2.079957 1.818636 2.490450 ... -0.315606 0.265035 0.533203 -64.457188 1.572106 0.210973 0.299822 4.631155 0.330685 0.052987
2 21.192974 1.619051 1.215343 1.787166 2.146987 1.686190 1.540137 2.291031 2.403422 1.765422 ... 0.881947 0.129980 0.233749 -73.590789 2.124529 0.303548 0.438839 2.787643 0.339396 -1.850936
3 42.113066 0.619634 2.366413 2.071539 1.000340 2.728281 1.391727 2.017176 2.610492 0.747448 ... 1.293405 0.117710 0.351229 -65.921705 -0.189462 0.294034 -0.051896 4.854732 0.014928 -2.891763
4 69.756786 0.348882 0.092119 0.653924 0.231422 1.080003 0.711244 1.357904 1.237998 1.346404 ... 0.882643 -0.163350 0.086834 -39.619402 -0.764864 0.126241 0.086609 2.225832 -0.162851 -1.340291
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 63.323449 1.186210 1.396236 0.417221 2.036034 1.659054 0.500584 1.693545 0.859932 1.963009 ... 2.505803 0.249828 1.163824 6.742953 1.745956 0.465622 0.661054 3.558706 0.312066 -1.574527
99996 69.657534 1.393960 0.989147 1.611333 1.793044 1.092325 0.507138 1.763940 2.677643 2.640827 ... -2.186706 0.571303 0.222418 -81.910362 1.110144 0.173993 0.416224 1.837650 0.494759 -1.862968
99997 40.897057 1.000355 0.706395 1.190514 0.674603 1.632769 0.229008 2.027802 0.302457 2.016243 ... 2.927318 0.845331 0.459831 71.062444 2.683385 1.027678 0.552139 3.211831 0.699251 -1.472546
99998 42.333303 1.354894 2.238589 1.237608 1.325212 2.785515 1.918571 0.814167 2.613950 2.083409 ... 2.082214 0.539493 0.545347 -67.639185 1.745814 0.209132 0.357763 3.847665 0.572154 -2.906817
99999 53.290117 1.739088 2.936555 0.154759 2.921164 2.183932 1.485150 2.685922 0.583443 3.101826 ... 1.569063 0.515355 0.229914 -20.466461 -1.612953 0.571380 0.366731 1.062002 0.583342 1.486188
100000 rows × 615 columns
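With multiclass=True, each feature is tested against each class in a one-vs-rest fashion, and n_significant sets how many classes a feature must be significant for in order to be kept. The sketch below illustrates only that final counting step on a hypothetical relevance table. The feature names, the Boolean values, and the assumption of exactly four classes are made up for illustration, and the statistical tests that would produce such a table are not reproduced.

```python
import numpy as np

# Hypothetical one-vs-rest relevance table: rows are features, columns are
# four assumed classes; True means "this feature passed the test for this class".
feature_names = ["f_a", "f_b", "f_c"]
relevant = np.array([
    [True,  True,  True,  True ],   # significant for all 4 classes
    [True,  False, True,  False],   # significant for 2 classes
    [False, False, False, False],   # never significant
])

def keep_features(relevant, names, n_significant):
    """Keep features that are significant for at least n_significant classes."""
    counts = relevant.sum(axis=1)   # number of classes each feature is relevant for
    return [name for name, c in zip(names, counts) if c >= n_significant]

strict = keep_features(relevant, feature_names, n_significant=4)
loose = keep_features(relevant, feature_names, n_significant=1)
print(strict)  # ['f_a']
print(loose)   # ['f_a', 'f_b']
```

A higher n_significant is stricter: with four classes, n_significant=4 keeps only features that help separate every class from the rest.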
Apply the same feature extraction and feature selection to the test data.
# Feature extraction
test_features = extract_features(data_test_A, column_id='id', column_sort='time')
# Replace NaN values in the extracted features (in place)
impute(test_features)
# Keep the same columns that were selected on the training data
features = list(train_features_filtered.columns)
test_features_filtered = test_features[features]
test_features_filtered
heartbeat_signals__sum_values heartbeat_signals__fft_coefficient__attr_"abs"__coeff_37 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_36 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29 ... heartbeat_signals__fft_coefficient__attr_"real"__coeff_7 heartbeat_signals__fft_coefficient__attr_"real"__coeff_64 heartbeat_signals__fft_coefficient__attr_"real"__coeff_65 heartbeat_signals__fft_coefficient__attr_"angle"__coeff_42 heartbeat_signals__fft_coefficient__attr_"real"__coeff_11 heartbeat_signals__fft_coefficient__attr_"real"__coeff_79 heartbeat_signals__fft_coefficient__attr_"real"__coeff_69 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_15 heartbeat_signals__fft_coefficient__attr_"real"__coeff_72 heartbeat_signals__fft_coefficient__attr_"imag"__coeff_20
0 19.229863 0.832151 2.509869 1.082112 2.517858 1.656104 2.257162 2.213421 1.815374 2.789240 ... 4.066211 0.404908 0.170329 -34.488864 4.400229 0.375732 0.317564 1.018486 0.222551 0.240511
1 84.298932 0.856174 0.616261 0.293339 0.191558 0.528684 1.010080 1.478182 1.713876 1.776822 ... -3.340020 -0.016725 0.057665 -47.745249 2.904283 0.115441 -0.097490 3.130902 -0.121626 0.419612
2 47.789921 1.165387 1.004378 0.951231 1.542114 0.946219 1.673430 1.445220 1.118439 1.964690 ... 2.659329 0.493426 0.788276 -22.386566 2.889959 0.263781 0.546949 3.747050 0.850317 -1.675587
3 47.069011 0.044897 3.392946 3.054217 0.726293 3.582653 2.414946 1.257669 3.188068 2.066035 ... -2.000133 -0.020724 0.274307 -87.585032 0.440068 -0.114356 0.050635 2.994499 0.352981 2.273138
4 24.899397 1.401020 0.536501 1.712592 1.044629 1.533405 1.330258 1.251771 1.441028 1.176947 ... 1.029007 0.531500 0.413351 -72.241533 4.551938 0.471690 0.381546 4.966050 0.525944 -2.521332
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19995 43.175130 0.211527 1.986940 0.393550 1.693620 1.139395 1.459990 1.734535 1.025180 1.911093 ... 1.690481 0.600167 0.640551 -93.919263 1.565466 0.464138 0.401174 2.667257 0.527446 -1.736704
19996 31.030782 2.483726 1.105440 1.979721 2.821799 0.475276 2.782573 2.827882 0.520034 3.177382 ... 1.564650 0.412253 0.551063 -38.791093 2.892860 0.595191 0.514195 3.688629 0.610594 -1.295856
19997 31.648623 0.546706 2.340499 1.362651 1.942634 2.043679 0.994065 2.248144 1.007128 2.084967 ... 3.357143 0.311689 0.480912 -44.206654 -0.030723 0.283291 0.504489 3.188981 0.552585 -1.870631
19998 19.305442 2.355288 1.051282 1.742370 2.164058 0.435583 2.649994 1.190594 2.328580 2.672429 ... 3.661958 0.501568 0.433165 -68.368133 -0.608498 0.417747 0.264187 4.375039 0.519174 -2.261992
19999 35.204569 0.492990 1.627089 1.106799 0.639821 1.350155 0.533904 1.332401 1.229578 0.343820 ... 1.781820 0.148821 0.228512 11.946596 -0.967423 0.412876 0.065268 1.389901 0.140379 -0.366795
20000 rows × 615 columns
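Summary and reflections
- After feature extraction and selection with tsfresh, model accuracy improved by roughly 10% to 20%, but too many features remain even after selection (615 here). Engineering experience suggests that too many features can interfere with model accuracy and lead the model to learn spurious relationships. The feature set can therefore be pruned further, based on experience and inspection of the dataset, keeping only the important features to improve accuracy.
- Feature extraction with tsfresh depends heavily on hardware compute. With ample compute, the full set of time series features can be enumerated. If hardware is limited, hand-pick the important features first to reduce the compute required.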
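One way to hand-pick features, sketched here under assumptions, is to pass a settings dictionary to extract_features through its default_fc_parameters argument so that only the listed calculators run. The particular calculators chosen below and the helper name `extract_selected` are illustrative, not a recommendation from the tutorial. tsfresh is imported inside the helper so that the dictionary can be inspected without it installed.

```python
# Hand-picked calculators: keys are tsfresh calculator names, values are either
# None (no parameters) or a list of parameter dictionaries.
selected_fc_parameters = {
    "sum_values": None,
    "abs_energy": None,
    "mean_abs_change": None,
    # Magnitudes of the first five non-constant FFT coefficients (arbitrary choice).
    "fft_coefficient": [{"coeff": k, "attr": "abs"} for k in range(1, 6)],
}

def extract_selected(long_df):
    """Run tsfresh on a long-format frame with only the calculators listed above."""
    from tsfresh import extract_features  # requires tsfresh to be installed
    return extract_features(
        long_df,
        column_id="id",
        column_sort="time",
        default_fc_parameters=selected_fc_parameters,
    )

# Count how many feature columns this configuration would request per signal.
n_requested = sum(1 if p is None else len(p) for p in selected_fc_parameters.values())
print(n_requested)
```

Calling extract_selected(data_train) would then compute eight features per signal instead of several hundred. tsfresh also ships predefined reduced settings such as MinimalFCParameters and EfficientFCParameters in tsfresh.feature_extraction.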