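Data Analysis Study Notes (3)

ECG Task03

Feature Engineering

Competition page: https://tianchi.aliyun.com/competition/entrance/531883/introduction

Learning objectives

  • Learn feature preprocessing methods for time series data
  • Learn to use Tsfresh (TimeSeries Fresh), a feature processing tool for time series

Contents

  • Data preprocessing
    • Reformatting the time series data
    • Adding the time-step feature time
  • Feature engineering
    • Constructing time series features
    • Feature selection
    • Processing time series features with tsfresh

Code examples

1 Import packages and load the data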

# Package imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
# Load the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)

2 Data preprocessing

# Reshape the ECG signals from wide to long format (one row per sample) and add a time-step column, time, for each signal
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
       time  heartbeat_signals
0         0           0.991230
0         1           0.943533
0         2           0.764677
0         3           0.618571
0         4           0.379632
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

20500000 rows × 2 columns
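The reshaping step above is easier to follow on a tiny input. The sketch below applies the same split/stack/reset_index chain to a made-up two-row frame; the name toy and its values are illustrative assumptions, not competition data.

```python
import pandas as pd

# Toy stand-in for train.csv: two signals of three samples each (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "0.8,0.4,0.0"],
})

# Split each string into one column per sample, then stack the columns into rows.
long_df = toy["heartbeat_signals"].str.split(",", expand=True).stack()

# After reset_index, level_0 holds the original row number and level_1 the sample position.
long_df = long_df.reset_index().set_index("level_0")
long_df.index.name = None
long_df = long_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"})
long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)

print(long_df)
```

Each original row turns into one row per sample, and the index keeps the original row number so the result can later be joined back to the id column.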
# Join the reshaped ECG signals back onto the training data and store the label column separately
data_train_label = data_train["label"]
data_train = data_train.drop("label", axis=1)
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train = data_train.join(train_heartbeat_df)
data_train
          id  time  heartbeat_signals
0          0     0           0.991230
0          0     1           0.943533
0          0     2           0.764677
0          0     3           0.618571
0          0     4           0.379632
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

20500000 rows × 3 columns
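The join above relies on the index: DataFrame.join matches rows by index label and defaults to a left join, so each id row is repeated once per sample of its signal. A minimal sketch with made-up values (the names ids and samples are illustrative):

```python
import pandas as pd

# One row per signal, indexed 0 and 1 (made-up ids).
ids = pd.DataFrame({"id": [100, 101]})

# Long-format samples whose index repeats the row number of the signal they belong to.
samples = pd.DataFrame(
    {"time": [0, 1, 0, 1], "heartbeat_signals": [0.9, 0.5, 0.8, 0.4]},
    index=[0, 0, 1, 1],
)

# Index-on-index left join: every id row is duplicated for each matching sample row.
joined = ids.join(samples)
print(joined)
```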
# Apply the same preprocessing to the test data
test_A_heartbeat_df = data_test_A["heartbeat_signals"].str.split(",", expand=True).stack()
test_A_heartbeat_df = test_A_heartbeat_df.reset_index()
test_A_heartbeat_df = test_A_heartbeat_df.set_index("level_0")
test_A_heartbeat_df.index.name = None
test_A_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
test_A_heartbeat_df["heartbeat_signals"] = test_A_heartbeat_df["heartbeat_signals"].astype(float)
test_A_heartbeat_df
       time  heartbeat_signals
0         0           0.991571
0         1           1.000000
0         2           0.631816
0         3           0.136230
0         4           0.041420
...     ...                ...
19999   200           0.000000
19999   201           0.000000
19999   202           0.000000
19999   203           0.000000
19999   204           0.000000

4100000 rows × 2 columns
data_test_A = data_test_A.drop("heartbeat_signals", axis=1)
data_test_A = data_test_A.join(test_A_heartbeat_df)
data_test_A

           id  time  heartbeat_signals
0      100000     0           0.991571
0      100000     1           1.000000
0      100000     2           0.631816
0      100000     3           0.136230
0      100000     4           0.041420
...       ...   ...                ...
19999  119999   200           0.000000
19999  119999   201           0.000000
19999  119999   202           0.000000
19999  119999   203           0.000000
19999  119999   204           0.000000

4100000 rows × 3 columns

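3 Processing time series features with tsfresh

tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for assessing feature importance and selecting features, so it is a good choice for feature extraction in both classification and regression problems on time series. Official documentation: https://tsfresh.readthedocs.io/en/latest/index.html

# Feature extraction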
from tsfresh.feature_extraction import extract_features
train_features = extract_features(data_train, column_id='id', column_sort='time')
train_features
id     sum_values  abs_energy  mean_abs_change  mean_change  ...
0       38.927945   18.216197         0.019894    -0.004859  ...
1       19.445634    7.705092         0.019952    -0.004762  ...
2       21.192974    9.140423         0.009863    -0.004902  ...
...           ...         ...              ...          ...  ...
99997   40.897057   16.412857         0.019470    -0.004538  ...
99998   42.333303   14.281281         0.017032    -0.004902  ...
99999   53.290117   21.637471         0.021870    -0.004539  ...

100000 rows × 779 columns
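To make the first few columns less opaque, here is a minimal NumPy sketch of what they measure. The formulas follow my reading of the tsfresh feature definitions and are not taken from the library's code; basic_features is a hypothetical helper.

```python
import numpy as np

def basic_features(x):
    """Hand-computed versions of four of the features listed above (a sketch)."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)  # differences between consecutive samples
    return {
        "sum_values": x.sum(),
        "abs_energy": np.sum(x ** 2),                   # sum of squared values
        "mean_abs_change": np.mean(np.abs(diffs)),      # average absolute step
        "mean_change": (x[-1] - x[0]) / (len(x) - 1),   # average signed step
    }

signal = [1.0, 0.5, 0.0, 0.5]  # made-up signal
feats = basic_features(signal)
print(feats)
```

The mean_change formula is consistent with the table: signal 0 starts at 0.991230, ends at 0 and has 205 samples, and (0 - 0.991230) / 204 ≈ -0.004859.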

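train_features contains several hundred common time series features of heartbeat_signals (see the official documentation for an explanation of each). Some of them may be NaN because the feature cannot be computed on the current data. The following call fills in those NaN values:

from tsfresh.utilities.dataframe_functions import impute
# Replace NaN values in the extracted features (impute modifies train_features in place)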
impute(train_features)
	heartbeat_signals__variance_larger_than_standard_deviation	heartbeat_signals__has_duplicate_max	heartbeat_signals__has_duplicate_min	heartbeat_signals__has_duplicate	heartbeat_signals__sum_values	heartbeat_signals__abs_energy	heartbeat_signals__mean_abs_change	heartbeat_signals__mean_change	heartbeat_signals__mean_second_derivative_central	heartbeat_signals__median	...	heartbeat_signals__permutation_entropy__dimension_5__tau_1	heartbeat_signals__permutation_entropy__dimension_6__tau_1	heartbeat_signals__permutation_entropy__dimension_7__tau_1	heartbeat_signals__query_similarity_count__query_None__threshold_0.0	heartbeat_signals__matrix_profile__feature_"min"__threshold_0.98	heartbeat_signals__matrix_profile__feature_"max"__threshold_0.98	heartbeat_signals__matrix_profile__feature_"mean"__threshold_0.98	heartbeat_signals__matrix_profile__feature_"median"__threshold_0.98	heartbeat_signals__matrix_profile__feature_"25"__threshold_0.98	heartbeat_signals__matrix_profile__feature_"75"__threshold_0.98
0	0.0	0.0	1.0	1.0	38.927945	18.216197	0.019894	-0.004859	0.000117	0.125531	...	2.184420	2.500658	2.722686	NaN	6.445546	12.165525	10.246524	10.746992	8.388625	11.484910
1	0.0	0.0	1.0	1.0	19.445634	7.705092	0.019952	-0.004762	0.000105	0.030481	...	2.710933	3.065802	3.224835	NaN	3.209140	12.649111	9.031069	9.437545	6.723180	12.094899
2	0.0	0.0	1.0	1.0	21.192974	9.140423	0.009863	-0.004902	0.000101	0.000000	...	1.263370	1.406001	1.509478	NaN	3.054539	8.246211	7.370478	8.246211	5.966122	8.246211
3	0.0	0.0	1.0	1.0	42.113066	15.757623	0.018743	-0.004783	0.000103	0.241397	...	2.986728	3.534354	3.854177	NaN	3.010557	9.797959	6.331360	6.406440	5.266743	7.091706
4	0.0	0.0	1.0	1.0	69.756786	51.229616	0.014514	0.000000	-0.000137	0.000000	...	1.914511	2.165627	2.323993	NaN	9.181236	13.429784	9.959913	9.516290	9.286013	10.270925
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	0.0	0.0	1.0	1.0	63.323449	28.742238	0.023588	-0.004902	0.000794	0.388402	...	2.873602	3.391830	3.679969	NaN	2.436377	9.591663	5.635231	6.366205	3.596982	7.033638
99996	0.0	0.0	1.0	1.0	69.657534	31.866323	0.017373	-0.004543	0.000051	0.421138	...	3.085504	3.728881	4.095457	NaN	1.415410	7.483315	2.893592	2.684349	2.049241	3.334109
99997	0.0	0.0	1.0	1.0	40.897057	16.412857	0.019470	-0.004538	0.000834	0.213306	...	2.601062	2.996962	3.293562	NaN	5.748652	12.165525	8.524637	7.983410	7.062217	10.081756
99998	0.0	0.0	1.0	1.0	42.333303	14.281281	0.017032	-0.004902	0.000013	0.264974	...	3.236950	3.793512	4.018302	NaN	2.346822	8.246211	4.951374	4.727535	4.069786	5.615282
99999	0.0	0.0	1.0	1.0	53.290117	21.637471	0.021870	-0.004539	0.000023	0.320124	...	2.949266	3.462549	3.688612	NaN	1.959139	9.380832	4.573691	3.908621	3.094614	5.916164
100000 rows × 787 columns
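As I read the tsfresh documentation, impute works column by column: NaN becomes the column median, +inf and -inf become the column maximum and minimum, and a column with no finite values is filled with zeros. The sketch below reproduces that behavior with plain pandas on a made-up frame. impute_sketch is a hypothetical helper, not tsfresh's implementation, and unlike impute it returns a copy rather than modifying its input.

```python
import numpy as np
import pandas as pd

def impute_sketch(df):
    """Column-wise replacement of non-finite values (illustrative re-implementation)."""
    out = df.copy()
    for col in out.columns:
        values = out[col].astype(float)
        finite = values[np.isfinite(values)]
        if finite.empty:
            # No usable value in this column: fall back to zeros.
            out[col] = 0.0
            continue
        values = values.replace(np.inf, finite.max()).replace(-np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out

raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [np.nan, np.nan, np.nan, np.nan],
})
cleaned = impute_sketch(raw)
print(cleaned)
```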

Next, select features according to their relevance to the response variable to decide which ones to keep.

from tsfresh import select_features
# Select features by their relevance to the data labels
train_features_filtered = select_features(train_features, data_train_label, ml_task='classification', multiclass=True, n_significant=4)
train_features_filtered
	heartbeat_signals__sum_values	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_37	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_36	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29	...	heartbeat_signals__fft_coefficient__attr_"real"__coeff_7	heartbeat_signals__fft_coefficient__attr_"real"__coeff_64	heartbeat_signals__fft_coefficient__attr_"real"__coeff_65	heartbeat_signals__fft_coefficient__attr_"angle"__coeff_42	heartbeat_signals__fft_coefficient__attr_"real"__coeff_11	heartbeat_signals__fft_coefficient__attr_"real"__coeff_79	heartbeat_signals__fft_coefficient__attr_"real"__coeff_69	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_15	heartbeat_signals__fft_coefficient__attr_"real"__coeff_72	heartbeat_signals__fft_coefficient__attr_"imag"__coeff_20
0	38.927945	1.090709	0.848728	1.168685	0.982133	1.223496	1.236300	1.104172	1.497129	1.358095	...	4.802874	0.632444	0.371408	-34.890840	7.586678	0.554999	0.243302	4.235934	0.599780	-2.762530
1	19.445634	1.280923	1.850706	1.460752	1.924501	1.925485	1.715938	2.079957	1.818636	2.490450	...	-0.315606	0.265035	0.533203	-64.457188	1.572106	0.210973	0.299822	4.631155	0.330685	0.052987
2	21.192974	1.619051	1.215343	1.787166	2.146987	1.686190	1.540137	2.291031	2.403422	1.765422	...	0.881947	0.129980	0.233749	-73.590789	2.124529	0.303548	0.438839	2.787643	0.339396	-1.850936
3	42.113066	0.619634	2.366413	2.071539	1.000340	2.728281	1.391727	2.017176	2.610492	0.747448	...	1.293405	0.117710	0.351229	-65.921705	-0.189462	0.294034	-0.051896	4.854732	0.014928	-2.891763
4	69.756786	0.348882	0.092119	0.653924	0.231422	1.080003	0.711244	1.357904	1.237998	1.346404	...	0.882643	-0.163350	0.086834	-39.619402	-0.764864	0.126241	0.086609	2.225832	-0.162851	-1.340291
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	63.323449	1.186210	1.396236	0.417221	2.036034	1.659054	0.500584	1.693545	0.859932	1.963009	...	2.505803	0.249828	1.163824	6.742953	1.745956	0.465622	0.661054	3.558706	0.312066	-1.574527
99996	69.657534	1.393960	0.989147	1.611333	1.793044	1.092325	0.507138	1.763940	2.677643	2.640827	...	-2.186706	0.571303	0.222418	-81.910362	1.110144	0.173993	0.416224	1.837650	0.494759	-1.862968
99997	40.897057	1.000355	0.706395	1.190514	0.674603	1.632769	0.229008	2.027802	0.302457	2.016243	...	2.927318	0.845331	0.459831	71.062444	2.683385	1.027678	0.552139	3.211831	0.699251	-1.472546
99998	42.333303	1.354894	2.238589	1.237608	1.325212	2.785515	1.918571	0.814167	2.613950	2.083409	...	2.082214	0.539493	0.545347	-67.639185	1.745814	0.209132	0.357763	3.847665	0.572154	-2.906817
99999	53.290117	1.739088	2.936555	0.154759	2.921164	2.183932	1.485150	2.685922	0.583443	3.101826	...	1.569063	0.515355	0.229914	-20.466461	-1.612953	0.571380	0.366731	1.062002	0.583342	1.486188
100000 rows × 615 columns

Apply the same feature extraction and feature selection to the test data.

# Feature extraction
test_features = extract_features(data_test_A, column_id='id', column_sort='time')
# Replace NaN values in the extracted features
impute(test_features)
# Keep the same features that were selected on the training data
features = list(train_features_filtered.columns)
test_features_filtered = test_features[features]
test_features_filtered
	heartbeat_signals__sum_values	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_37	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_36	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29	...	heartbeat_signals__fft_coefficient__attr_"real"__coeff_7	heartbeat_signals__fft_coefficient__attr_"real"__coeff_64	heartbeat_signals__fft_coefficient__attr_"real"__coeff_65	heartbeat_signals__fft_coefficient__attr_"angle"__coeff_42	heartbeat_signals__fft_coefficient__attr_"real"__coeff_11	heartbeat_signals__fft_coefficient__attr_"real"__coeff_79	heartbeat_signals__fft_coefficient__attr_"real"__coeff_69	heartbeat_signals__fft_coefficient__attr_"abs"__coeff_15	heartbeat_signals__fft_coefficient__attr_"real"__coeff_72	heartbeat_signals__fft_coefficient__attr_"imag"__coeff_20
0	19.229863	0.832151	2.509869	1.082112	2.517858	1.656104	2.257162	2.213421	1.815374	2.789240	...	4.066211	0.404908	0.170329	-34.488864	4.400229	0.375732	0.317564	1.018486	0.222551	0.240511
1	84.298932	0.856174	0.616261	0.293339	0.191558	0.528684	1.010080	1.478182	1.713876	1.776822	...	-3.340020	-0.016725	0.057665	-47.745249	2.904283	0.115441	-0.097490	3.130902	-0.121626	0.419612
2	47.789921	1.165387	1.004378	0.951231	1.542114	0.946219	1.673430	1.445220	1.118439	1.964690	...	2.659329	0.493426	0.788276	-22.386566	2.889959	0.263781	0.546949	3.747050	0.850317	-1.675587
3	47.069011	0.044897	3.392946	3.054217	0.726293	3.582653	2.414946	1.257669	3.188068	2.066035	...	-2.000133	-0.020724	0.274307	-87.585032	0.440068	-0.114356	0.050635	2.994499	0.352981	2.273138
4	24.899397	1.401020	0.536501	1.712592	1.044629	1.533405	1.330258	1.251771	1.441028	1.176947	...	1.029007	0.531500	0.413351	-72.241533	4.551938	0.471690	0.381546	4.966050	0.525944	-2.521332
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
19995	43.175130	0.211527	1.986940	0.393550	1.693620	1.139395	1.459990	1.734535	1.025180	1.911093	...	1.690481	0.600167	0.640551	-93.919263	1.565466	0.464138	0.401174	2.667257	0.527446	-1.736704
19996	31.030782	2.483726	1.105440	1.979721	2.821799	0.475276	2.782573	2.827882	0.520034	3.177382	...	1.564650	0.412253	0.551063	-38.791093	2.892860	0.595191	0.514195	3.688629	0.610594	-1.295856
19997	31.648623	0.546706	2.340499	1.362651	1.942634	2.043679	0.994065	2.248144	1.007128	2.084967	...	3.357143	0.311689	0.480912	-44.206654	-0.030723	0.283291	0.504489	3.188981	0.552585	-1.870631
19998	19.305442	2.355288	1.051282	1.742370	2.164058	0.435583	2.649994	1.190594	2.328580	2.672429	...	3.661958	0.501568	0.433165	-68.368133	-0.608498	0.417747	0.264187	4.375039	0.519174	-2.261992
19999	35.204569	0.492990	1.627089	1.106799	0.639821	1.350155	0.533904	1.332401	1.229578	0.343820	...	1.781820	0.148821	0.228512	11.946596	-0.967423	0.412876	0.065268	1.389901	0.140379	-0.366795
20000 rows × 615 columns
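The test set has no labels, so it cannot go through select_features itself. It simply reuses the column list chosen on the training set. A minimal sketch with made-up feature names, including a check that every selected column exists in the test features:

```python
import pandas as pd

# Made-up stand-ins for train_features_filtered and test_features.
train_filtered = pd.DataFrame({"f_sum": [1.0, 2.0], "f_energy": [0.5, 0.7]})
test_all = pd.DataFrame({"f_noise": [9.0], "f_energy": [0.6], "f_sum": [1.5]})

selected = list(train_filtered.columns)

# Fail early if extraction on the test set did not produce a selected feature.
missing = [c for c in selected if c not in test_all.columns]
if missing:
    raise KeyError(f"test features lack columns: {missing}")

# Indexing with the training column list also fixes the column order.
test_filtered = test_all[selected]
print(test_filtered)
```

Keeping the same columns in the same order matters because a model trained on train_filtered expects its inputs laid out identically.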

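Summary and reflections

  • After feature extraction and selection with tsfresh, model accuracy improved by roughly 10% to 20%, but too many features remain even after selection. Practical experience shows that too many features can hurt model accuracy by leading the model to learn spurious feature relationships. The feature set can therefore be pruned further, based on experience and inspection of the dataset, keeping only the important features to improve accuracy.
  • Feature extraction with tsfresh depends heavily on hardware compute. With ample compute, the full set of time series features can be enumerated. If hardware is limited, manually pick the important features first to reduce the compute required.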
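One low-compute way to act on the second point is to compute a handful of hand-picked statistics per signal with a pandas groupby instead of running the full extraction. The sketch below uses a made-up long-format frame, and the choice of statistics is an illustrative assumption, not a recommendation from the competition. tsfresh's extract_features also accepts a default_fc_parameters argument that restricts which features are computed; the sketch stays with plain pandas.

```python
import pandas as pd

# Made-up long-format data in the same layout as data_train after preprocessing.
long_df = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1],
    "time": [0, 1, 2, 0, 1, 2],
    "heartbeat_signals": [0.9, 0.5, 0.1, 0.8, 0.4, 0.0],
})

# A few cheap per-signal statistics, named in the tsfresh column style.
manual_features = (
    long_df.sort_values(["id", "time"])
    .groupby("id")["heartbeat_signals"]
    .agg(["sum", "mean", "std", "min", "max"])
    .add_prefix("heartbeat_signals__")
)
print(manual_features)
```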