[Data Mining] Heartbeat Signal Classification Prediction: My_Task3 Feature Engineering

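3.1 Learning objectives

  • Learn feature preprocessing methods for time series data
  • Learn to use tsfresh (TimeSeries Fresh), a feature-processing tool for time series

3.2 Overview

Data preprocessing

  • Reformatting the time series data
  • Adding a time-step feature, time

Feature engineering

  • Constructing time series features
  • Feature selection
  • Using tsfresh

3.3 Code examples

3.3.1 Importing packages and reading the data

tsfresh is a feature engineering tool for time series held in relational-database style tables. It can automatically extract more than 100 features from a time series.
The package contains many feature extraction methods and a robust feature selection algorithm, as well as methods for evaluating
the explanatory power and importance of those features for regression or classification tasks.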
https://zhuanlan.zhihu.com/p/93310900

# Package imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features,select_features
from tsfresh.utilities.dataframe_functions import impute
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)

3.3.2 数据预处理

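  • Reshape the ECG signal column from one row per sample into one row per time step, and add a time-step feature, time, to every signal
  • This uses reset_index() and set_index()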
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()
train_heartbeat_df
0      0      0.9912297987616655
       1      0.9435330436439665
       2      0.7646772997256593
       3      0.6185708990212999
       4      0.3796321642826237
                     ...        
99999  200                   0.0
       201                   0.0
       202                   0.0
       203                   0.0
       204                   0.0
Length: 20500000, dtype: object
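
To make the split-and-stack step easier to follow, here is a minimal sketch on a made-up two-row frame (the toy values and the names toy, wide and long are illustrative, not part of the competition data). It assumes only pandas.

```python
import pandas as pd

# Toy stand-in for train.csv: two signals with three time steps each (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.0,0.0"],
})

# expand=True yields one column per time step.
wide = toy["heartbeat_signals"].str.split(",", expand=True)

# stack() moves those columns into a second index level,
# so each (sample, time step) pair becomes its own row.
long = wide.stack()

print(wide.shape)
print(long.index.tolist())
print(long.tolist())  # the values are still strings at this point
```

The first index level is the original row number and the second is the position inside the signal, which is why the real output above has 100000 × 205 = 20500000 rows and dtype object.
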
  • Reset the index, which also turns the Series into a DataFrame
train_heartbeat_df = train_heartbeat_df.reset_index()  
train_heartbeat_df
          level_0  level_1                   0
0               0        0  0.9912297987616655
1               0        1  0.9435330436439665
2               0        2  0.7646772997256593
3               0        3  0.6185708990212999
4               0        4  0.3796321642826237
...           ...      ...                 ...
20499995    99999      200                 0.0
20499996    99999      201                 0.0
20499997    99999      202                 0.0
20499998    99999      203                 0.0
20499999    99999      204                 0.0

20500000 rows × 3 columns

  • Set level_0 as the index
train_heartbeat_df =  train_heartbeat_df.set_index("level_0")
train_heartbeat_df
         level_1                   0
level_0
0              0  0.9912297987616655
0              1  0.9435330436439665
0              2  0.7646772997256593
0              3  0.6185708990212999
0              4  0.3796321642826237
...          ...                 ...
99999        200                 0.0
99999        201                 0.0
99999        202                 0.0
99999        203                 0.0
99999        204                 0.0

20500000 rows × 2 columns

  • Clear the index name; the index values stay, only the level_0 label is dropped
train_heartbeat_df.index.name = None
train_heartbeat_df
       level_1                   0
0            0  0.9912297987616655
0            1  0.9435330436439665
0            2  0.7646772997256593
0            3  0.6185708990212999
0            4  0.3796321642826237
...        ...                 ...
99999      200                 0.0
99999      201                 0.0
99999      202                 0.0
99999      203                 0.0
99999      204                 0.0

20500000 rows × 2 columns

  • Rename the columns with rename(); inplace=True modifies the DataFrame directly instead of returning a new one
train_heartbeat_df.rename(columns={"level_1":"time",0:"heartbeat_signals"},inplace=True)
train_heartbeat_df
       time   heartbeat_signals
0         0  0.9912297987616655
0         1  0.9435330436439665
0         2  0.7646772997256593
0         3  0.6185708990212999
0         4  0.3796321642826237
...     ...                 ...
99999   200                 0.0
99999   201                 0.0
99999   202                 0.0
99999   203                 0.0
99999   204                 0.0

20500000 rows × 2 columns

train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
       time  heartbeat_signals
0         0           0.991230
0         1           0.943533
0         2           0.764677
0         3           0.618571
0         4           0.379632
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

20500000 rows × 2 columns

  • Add the processed ECG signals back into the training data, and store the label column of the training data separately
data_train_label = data_train["label"]
data_train_label
0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64
  • Drop the label column from data_train
data_train = data_train.drop('label',axis=1)
data_train
          id                                  heartbeat_signals
0          0  0.9912297987616655,0.9435330436439665,0.764677...
1          1  0.9714822034884503,0.9289687459588268,0.572932...
2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...
3          3  0.9757952826275774,0.9340884687738161,0.659636...
4          4  0.0,0.055816398940721094,0.26129357194994196,0...
...      ...                                                ...
99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...
99996  99996  0.9268571578157265,0.9063471198026871,0.636993...
99997  99997  0.9258351628306013,0.5873839035878395,0.633226...
99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...
99999  99999  0.9259994004527861,0.916476635326053,0.4042900...

100000 rows × 2 columns

data_train = data_train.drop("heartbeat_signals", axis=1)
data_train
          id
0          0
1          1
2          2
3          3
4          4
...      ...
99995  99995
99996  99996
99997  99997
99998  99998
99999  99999

100000 rows × 1 columns

data_train = data_train.join(train_heartbeat_df)
data_train
          id  time  heartbeat_signals
0          0     0           0.991230
0          0     1           0.943533
0          0     2           0.764677
0          0     3           0.618571
0          0     4           0.379632
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

20500000 rows × 3 columns

data_train[data_train["id"]==1]
    id  time  heartbeat_signals
1    1     0           0.971482
1    1     1           0.928969
1    1     2           0.572933
1    1     3           0.178457
1    1     4           0.122962
..  ..   ...                ...
1    1   200           0.000000
1    1   201           0.000000
1    1   202           0.000000
1    1   203           0.000000
1    1   204           0.000000

205 rows × 3 columns

As the output shows, each sample's ECG signal consists of 205 time steps, one row per step.
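
The preprocessing steps in this section can also be written as one chained expression. The sketch below is only one possible arrangement, run on made-up toy data; the helper name to_long is hypothetical and not part of pandas or tsfresh. It finishes with the same sanity check as above: every id should have the same number of time steps.

```python
import pandas as pd

def to_long(df, signal_col="heartbeat_signals"):
    """Turn one comma-separated signal per row into one row per (id, time)."""
    long = (
        df[signal_col]
        .str.split(",", expand=True)          # one column per time step
        .stack()                              # one row per (sample, time step)
        .astype(float)                        # strings -> floats
        .rename(signal_col)                   # name the value column
        .reset_index(level=1)                 # time-step level becomes "level_1"
        .rename(columns={"level_1": "time"})
    )
    # Joining on the index repeats each id once per time step.
    return df[["id"]].join(long)

# Toy data with made-up values: two samples, three time steps each.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "1.0,0.0,0.0"],
})
long_df = to_long(toy)
steps_per_id = long_df.groupby("id").size()

print(long_df)
print(steps_per_id.tolist())
```

On the real training data the same check should report 205 steps for every id.
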

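3.3.3 Time series feature processing with tsfresh

1. Feature extraction
**tsfresh (TimeSeries Fresh)** is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for evaluating feature importance and for feature selection, so tsfresh is a good choice for feature extraction in both classification and regression problems on time series data. Official documentation: Introduction — tsfresh 0.17.1.dev24+g860c4e1 documentation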
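
To give a feel for what the extracted columns contain, here is a hand-written sketch of four of the simplest features on a made-up signal. The formulas follow my reading of the tsfresh feature definitions (sum_values, abs_energy, mean_abs_change, mean_change); the functions below are illustrative re-implementations, not tsfresh's own code.

```python
def sum_values(x):
    # Sum of all signal values.
    return sum(x)

def abs_energy(x):
    # Sum of squared values.
    return sum(v * v for v in x)

def mean_abs_change(x):
    # Mean absolute difference between consecutive values.
    return sum(abs(b - a) for a, b in zip(x, x[1:])) / (len(x) - 1)

def mean_change(x):
    # Mean of consecutive differences, which telescopes to (last - first) / (n - 1).
    return (x[-1] - x[0]) / (len(x) - 1)

signal = [1.0, 0.5, 0.0, 0.5]  # made-up toy signal
features = {
    "sum_values": sum_values(signal),
    "abs_energy": abs_energy(signal),
    "mean_abs_change": mean_abs_change(signal),
    "mean_change": mean_change(signal),
}
print(features)
```

A signal of 205 steps that starts near 1.0 and ends at 0.0 gives a mean_change close to -1/204 ≈ -0.0049, which is in line with the heartbeat_signals__mean_change values in the feature table in this section.
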

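# # Feature extraction (slow on the full data set, so it is commented out here)
# train_features = extract_features(data_train,column_id = 'id',column_sort='time')
# train_features
  • Load features that were computed earlier (stored as a pkl file) and use them directly, so the slow extraction does not have to be rerun every time
import pickle
# The with block closes the file even if loading fails
with open("./HeartbeatClassification/train_features_file.pkl","rb") as feature_file:
    train_features = pickle.load(feature_file)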

train_features
(Preview transposed for readability: one line per feature column, one column per sample index.)

                                                                     0         1         2         3         4  ...     99995     99996     99997     99998     99999
heartbeat_signals__variance_larger_than_standard_deviation         0.0       0.0       0.0       0.0       0.0  ...       0.0       0.0       0.0       0.0       0.0
heartbeat_signals__has_duplicate_max                               0.0       0.0       0.0       0.0       0.0  ...       0.0       0.0       0.0       0.0       0.0
heartbeat_signals__has_duplicate_min                               1.0       1.0       1.0       1.0       1.0  ...       1.0       1.0       1.0       1.0       1.0
heartbeat_signals__has_duplicate                                   1.0       1.0       1.0       1.0       1.0  ...       1.0       1.0       1.0       1.0       1.0
heartbeat_signals__sum_values                                38.927945 19.445634 21.192974 42.113066 69.756786  ... 63.323449 69.657534 40.897057 42.333303 53.290117
heartbeat_signals__abs_energy                                18.216197  7.705092  9.140423 15.757623 51.229616  ... 28.742238 31.866323 16.412857 14.281281 21.637471
heartbeat_signals__mean_abs_change                            0.019894  0.019952  0.009863  0.018743  0.014514  ...  0.023588  0.017373  0.019470  0.017032  0.021870
heartbeat_signals__mean_change                               -0.004859 -0.004762 -0.004902 -0.004783  0.000000  ... -0.004902 -0.004543 -0.004538 -0.004902 -0.004539
heartbeat_signals__mean_second_derivative_central             0.000117  0.000105  0.000101  0.000103 -0.000137  ...  0.000794  0.000051  0.000834  0.000013  0.000023
heartbeat_signals__median                                     0.125531  0.030481  0.000000  0.241397  0.000000  ...  0.388402  0.421138  0.213306  0.264974  0.320124
...                                                                ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
heartbeat_signals__fourier_entropy__bins_2                    0.095763  0.248333  0.054659  0.054659  0.054659  ...  0.054659  0.095763  0.164224  0.095763  0.095763
heartbeat_signals__fourier_entropy__bins_3                    0.109222  0.409767  0.054659  0.109222  0.109222  ...  0.054659  0.095763  0.186062  0.109222  0.150231
heartbeat_signals__fourier_entropy__bins_5                    0.109222  0.567944  0.150231  0.186062  0.109222  ...  0.109222  0.109222  0.299588  0.163690  0.204601
heartbeat_signals__fourier_entropy__bins_10                   0.356175  0.913016  0.204601  0.258874  0.163690  ...  0.109222  0.163690  0.353661  0.218060  0.463604
heartbeat_signals__fourier_entropy__bins_100                  0.940492  1.791964  0.542013  1.426345  0.517722  ...  1.405361  0.749555  0.995174  1.321241  1.768224
heartbeat_signals__permutation_entropy__dimension_3__tau_1    1.180828  1.360828  0.712221  1.389686  1.045339  ...  1.326208  1.408284  1.305626  1.460980  1.344607
heartbeat_signals__permutation_entropy__dimension_4__tau_1    1.734917  2.118249  1.031064  2.206088  1.543338  ...  2.137411  2.244166  2.005282  2.387132  2.186286
heartbeat_signals__permutation_entropy__dimension_5__tau_1    2.184420  2.710933  1.263370  2.986728  1.914511  ...  2.873602  3.085504  2.601062  3.236950  2.949266
heartbeat_signals__permutation_entropy__dimension_6__tau_1    2.500658  3.065802  1.406001  3.534354  2.165627  ...  3.391830  3.728881  2.996962  3.793512  3.462549
heartbeat_signals__permutation_entropy__dimension_7__tau_1    2.722686  3.224835  1.509478  3.854177  2.323993  ...  3.679969  4.095457  3.293562  4.018302  3.688612

100000 rows × 779 columns
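
The load-from-pkl pattern above can be wrapped in a small "compute once, then reuse" helper. This is a minimal sketch using only the standard library; load_or_compute and slow_features are hypothetical names, and the dictionary stands in for the real feature DataFrame.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and save it first if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_features():
    # Stand-in for an expensive extraction step; records how often it really runs.
    calls.append(1)
    return {"sum_values": [2.0, 1.0]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "features.pkl")
    first = load_or_compute(cache, slow_features)   # computes and writes the file
    second = load_or_compute(cache, slow_features)  # reads the file, no recomputation

print(len(calls))
```

In the notebook, the compute argument would be a function that calls extract_features on the long-format training data. Only unpickle files you created yourself, since pickle can execute code while loading.
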

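2. Feature selection
train_features contains 779 common time series features of heartbeat_signals (the official documentation explains each of them). Some of these features may be NaN because the feature cannot be computed on the given data. tsfresh's impute() handles this in place: it replaces NaN with the column median, and +inf / -inf with the column maximum / minimum:
# Replace NaN (and inf) values in the extracted features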
impute(train_features)
(Output: the displayed preview is identical to the train_features table shown above.)

100000 rows × 779 columns
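
To illustrate the kind of replacement impute() performs, here is a small pandas sketch on made-up values. impute_like is a hypothetical stand-in written for this example; unlike tsfresh's impute(), which modifies the DataFrame in place, it returns a copy, and it should not be read as tsfresh's actual implementation.

```python
import numpy as np
import pandas as pd

def impute_like(df):
    """Column-wise: NaN -> median, +inf -> max, -inf -> min of the finite values.

    Columns without any finite value are filled with zeros.
    """
    out = df.copy()
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0
            continue
        out[col] = (
            values.replace(np.inf, finite.max())
                  .replace(-np.inf, finite.min())
                  .fillna(finite.median())
        )
    return out

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],       # NaN becomes the median, 2.0
    "b": [np.inf, 0.0, 4.0],       # +inf becomes the finite max, 4.0
    "c": [-np.inf, 2.0, 5.0],      # -inf becomes the finite min, 2.0
    "d": [np.nan, np.nan, np.nan]  # nothing finite, so zeros
})
clean = impute_like(toy)
print(clean)
```

After this step no NaN or infinite values remain, which the feature selection that follows requires.
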

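Next, features are selected according to the relevance between each feature and the response variable. This takes two steps:

  • First, the relevance of each feature to the response variable is tested individually
  • Then the Benjamini-Yekutieli procedure [1] is applied to the resulting p-values to decide which features are kept.
    Some common feature selection methods:
    [Figure: common feature selection methods]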
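
The second step can be sketched in a few lines of plain Python. This is a textbook version of the Benjamini-Yekutieli step-up rule, not tsfresh's code; the function name, the made-up p-values and alpha=0.05 are all illustrative assumptions.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected (feature kept)."""
    m = len(p_values)
    # Correction factor for arbitrary dependence between tests: c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    keep = [False] * m
    for idx in order[:k_max]:
        keep[idx] = True
    return keep

# Made-up p-values for four features.
p_values = [0.5, 0.011, 0.007, 0.9]
keep = benjamini_yekutieli(p_values)
print(keep)
```

With four tests, c(4) = 25/12, so the thresholds for ranks 1 to 4 are 0.006, 0.012, 0.018 and 0.024. The smallest p-value (0.007) misses the rank-1 threshold, but the second smallest (0.011) passes the rank-2 threshold, so both are kept: a step-up procedure keeps everything up to the largest rank that passes.
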
# Select features by their relevance to the data labels
train_features_filtered = select_features(train_features,data_train_label)

train_features_filtered
(Preview transposed for readability: one line per feature column, one column per sample index.)

                                                                    0         1         2         3         4  ...     99995     99996     99997     99998     99999
heartbeat_signals__sum_values                               38.927945 19.445634 21.192974 42.113066 69.756786  ... 63.323449 69.657534 40.897057 42.333303 53.290117
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35     1.168685  1.460752  1.787166  2.071539  0.653924  ...  0.417221  1.611333  1.190514  1.237608  0.154759
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34     0.982133  1.924501  2.146987  1.000340  0.231422  ...  2.036034  1.793044  0.674603  1.325212  2.921164
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33     1.223496  1.925485  1.686190  2.728281  1.080003  ...  1.659054  1.092325  1.632769  2.785515  2.183932
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32     1.236300  1.715938  1.540137  1.391727  0.711244  ...  0.500584  0.507138  0.229008  1.918571  1.485150
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31     1.104172  2.079957  2.291031  2.017176  1.357904  ...  1.693545  1.763940  2.027802  0.814167  2.685922
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30     1.497129  1.818636  2.403422  2.610492  1.237998  ...  0.859932  2.677643  0.302457  2.613950  0.583443
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29     1.358095  2.490450  1.765422  0.747448  1.346404  ...  1.963009  2.640827  2.016243  2.083409  3.101826
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_28     1.704225  1.673244  1.993213  2.900299  1.645870  ...  1.524831  1.128049  0.352602  1.330934  1.264842
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_27     1.745158  2.821067  2.756081  1.294779  0.941866  ...  1.344715  0.856280  1.836034  2.801509  2.877000
...                                                               ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_84     0.531883  0.563590  0.712487  0.601499  0.015292  ...  0.779955  0.539489  0.282597  0.594252  0.463697
heartbeat_signals__fft_coefficient__attr_"imag"__coeff_97   -0.047438 -0.109579 -0.074042 -0.184248  0.070505  ...  0.005525  0.114670 -0.474629 -0.162106  0.289364
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_90     0.554370  0.697446  0.321703  0.564669  0.065835  ...  0.486013  0.579498  0.460647  0.694276  0.285321
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_94     0.307586  0.398073  0.390386  0.623353  0.051780  ...  0.273372  0.417226  0.478341  0.681025  0.422103
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_92     0.564596  0.640969  0.716929  0.466980  0.092940  ...  0.705386  0.270110  0.527891  0.357196  0.692009
heartbeat_signals__fft_coefficient__attr_"real"__coeff_97    0.562960  0.270192  0.316524  0.651774  0.103773  ...  0.602898  0.556596  0.904111  0.498088  0.276236
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_75     0.591859  0.224925  0.422077  0.308915  0.179405  ...  0.447929  0.703258  0.728529  0.433297  0.245780
heartbeat_signals__fft_coefficient__attr_"real"__coeff_88    0.504124  0.645082  0.722742  0.550097 -0.089611  ...  0.474844  0.462312  0.178410  0.406154  0.269519
heartbeat_signals__fft_coefficient__attr_"real"__coeff_92    0.528450  0.635135  0.680590  0.466904  0.091841  ...  0.564266  0.269719  0.500813  0.324771  0.681719
heartbeat_signals__fft_coefficient__attr_"real"__coeff_83    0.473568  0.297325  0.383754  0.494024  0.056867  ...  0.133969  0.539236  0.773985  0.340727 -0.053993

100000 rows × 700 columns

Feature engineering summary:
[Figure: feature engineering summary]
