[Tianchi] Heartbeat Signal Classification: Time-Series Features, Part 2

1 Preface

After adding time-series features, the loss dropped noticeably compared with the earlier baseline,
but the time-series features turned out to be less important than expected.
Possible reasons:
1) The time-series features generated by tsfresh are not necessarily the features we actually need.
2) The model may be overfitting, leading to an unsatisfactory score.

2 Imports and Data Loading

# Imports
import pandas as pd
import numpy as np
import tsfresh as tsf
import pickle
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
import lightgbm as lgb

from sklearn.model_selection import StratifiedKFold,KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)

3 Data Preprocessing

# Pivot each heartbeat signal from one wide row to long format (one row per time step), adding a time-step column `time`
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

train_heartbeat_df
       time  heartbeat_signals
0         0           0.991230
0         1           0.943533
0         2           0.764677
0         3           0.618571
0         4           0.379632
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

20500000 rows × 2 columns
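As a quick sanity check (a sketch, not part of the original pipeline), each of the 100000 training signals should now contribute exactly 205 time steps (20500000 / 100000 = 205):

# Every original row id should appear exactly 205 times (time steps 0..204)
steps_per_id = train_heartbeat_df.groupby(level=0).size()
print(steps_per_id.unique())  # expected: [205]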

4 Time-Series Feature Engineering with tsfresh

4.1 Feature Extraction

Tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time-series data. It also ships with methods for assessing feature relevance and selecting features, so for both classification and regression problems on time-series data, tsfresh is a solid choice for feature extraction.

  • Official documentation: Introduction — tsfresh 0.17.1.dev24+g860c4e1 documentation
# Join the long-format heartbeat signals back onto the training data, and store the label column separately
data_train_label = data_train["label"]
data_train = data_train.drop("label", axis=1)
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train = data_train.join(train_heartbeat_df)

data_train
          id  time  heartbeat_signals
0          0     0           0.991230
0          0     1           0.943533
0          0     2           0.764677
0          0     3           0.618571
0          0     4           0.379632
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

20500000 rows × 3 columns

data_train[data_train["id"]==1]
   id  time  heartbeat_signals
1   1     0           0.971482
1   1     1           0.928969
1   1     2           0.572933
1   1     3           0.178457
1   1     4           0.122962
..  ..   ...                ...
1   1   200           0.000000
1   1   201           0.000000
1   1   202           0.000000
1   1   203           0.000000
1   1   204           0.000000

205 rows × 3 columns
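Extracting the full tsfresh feature set over 20.5 million rows is expensive. For a quick first pass you can restrict the feature calculators; a minimal sketch using tsfresh's built-in MinimalFCParameters (not what this post runs below):

from tsfresh.feature_extraction import MinimalFCParameters

# Compute only a small, cheap feature subset (length, mean, std, min, max, ...)
quick_features = extract_features(data_train, column_id='id', column_sort='time',
                                  default_fc_parameters=MinimalFCParameters())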

4.2 Extracting the Time-Series Features

  • train_features contains 787 common time-series features computed from heartbeat_signals (the official documentation explains each of them).
  • Some of these features can be NaN (produced when the data does not support computing that particular feature); they can be cleaned up with tsfresh's impute, as sketched after the extraction code below:
from tsfresh import extract_features

# Feature extraction
train_features = extract_features(data_train, column_id='id', column_sort='time')

# Save to disk
train_features.to_pickle('./all_data.pkl')
# Reload. Note: for the slicing below to work, all_data.pkl must hold the
# features of train and test concatenated (100000 + 20000 rows); the
# extraction above shows the train part only.
all_features = pd.read_pickle('./all_data.pkl')
display(train_features.shape)
train_features = all_features[:100000]
test_features = all_features[100000:]
data_train_label = pd.read_csv("train.csv")['label']
(100000, 787)
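The NaN cleanup promised above is a one-liner with tsfresh's impute (already imported at the top); a sketch of how it would be applied here, it fills NaN with the column median and ±inf with the column max/min, in place:

# Replace NaN / ±inf cells in place so downstream models get finite values
impute(train_features)
impute(test_features)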

4.3 Feature Selection

Next, features are selected according to their relevance to the response variable, in two steps:

  • first, the relevance of each individual feature to the response variable is evaluated,
  • then the Benjamini-Yekutieli procedure decides which features are kept.
from tsfresh import select_features

# Select features by their relevance to the training labels
train_features_filtered = select_features(train_features, data_train_label)

train_features_filtered
       heartbeat_signals__sum_values  ...  heartbeat_signals__fft_coefficient__attr_"real"__coeff_83
0                          38.927945  ...                                                   0.473568
1                          19.445634  ...                                                   0.297325
2                          21.192974  ...                                                   0.383754
3                          42.113066  ...                                                   0.494024
4                          69.756786  ...                                                   0.056867
...                              ...  ...                                                        ...
99995                      63.323449  ...                                                   0.133969
99996                      69.657534  ...                                                   0.539236
99997                      40.897057  ...                                                   0.773985
99998                      42.333303  ...                                                   0.340727
99999                      53.290117  ...                                                  -0.053993
100000 rows × 707 columns
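For reference, the strictness of the Benjamini-Yekutieli filter can be tuned via select_features' fdr_level argument (the expected false discovery rate, default 0.05); lowering it keeps fewer, more significant features. A sketch:

# Stricter selection: expect at most 1% false discoveries
train_features_strict = select_features(train_features, data_train_label,
                                        fdr_level=0.01)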

test_features[list(train_features_filtered.columns)]
        heartbeat_signals__sum_values  ...  heartbeat_signals__fft_coefficient__attr_"real"__coeff_83
100000                      19.229863  ...                                                   0.355992
100001                      84.298932  ...                                                   0.077530
100002                      47.789921  ...                                                   0.454957
100003                      47.069011  ...                                                   0.662320
100004                      24.899397  ...                                                   0.511133
...                               ...  ...                                                        ...
119995                      43.175130  ...                                                   0.268471
119996                      31.030782  ...                                                   0.536087
119997                      31.648623  ...                                                   0.370047
119998                      19.305442  ...                                                   0.258394
119999                      35.204569  ...                                                   0.540855
20000 rows × 707 columns

5 Model Training

x_train = train_features_filtered
y_train = data_train_label.astype(int)
x_test = test_features[list(train_features_filtered.columns)]
# Rename the columns to plain integers: LightGBM rejects feature names
# containing JSON special characters such as the double quotes above
new_col = list(np.arange(0, x_train.shape[1]))
x_train.columns = new_col
new_col = list(np.arange(0, x_test.shape[1]))
x_test.columns = new_col
# Evaluation function: the competition metric, i.e. the summed absolute
# error between predicted probabilities and one-hot encoded labels
def abs_sum(y_pre, y_tru):
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = np.sum(np.abs(y_pre - y_tru))
    return loss
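A toy check of the metric with made-up numbers: a perfect first row contributes 0, while the second contributes |0.7 - 1| + |0.3 - 0| = 0.6:

y_pred = [[1.0, 0.0], [0.7, 0.3]]
y_true = [[1.0, 0.0], [1.0, 0.0]]
print(abs_sum(y_pred, y_true))  # ≈ 0.6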
# Train the model with K-fold cross-validation
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    # shuffle: whether to shuffle before splitting (default False)
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)  # 5-fold cross-validation
    # accumulator for the test-set probability matrix (4 classes), starts at zero
    test = np.zeros((test_x.shape[0], 4))

    # per-fold validation scores
    cv_scores = []
    
    # With the default sparse=True the encoder returns a sparse matrix, which
    # usually has to be densified with .toarray() before use; with
    # sparse=False it returns a plain array that can be used directly.
    onehot_encoder = OneHotEncoder(sparse=False)  # for one-hot encoding the labels
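    # Hypothetical quick check of the sparse flag (not executed here):
    #   OneHotEncoder(sparse=False).fit_transform(np.array([[0], [2], [1]]))
    #   -> array([[1., 0., 0.],
    #             [0., 0., 1.],
    #             [0., 1., 0.]])
    # Note: scikit-learn >= 1.2 renames this keyword to sparse_output.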
    
    # enumerate yields (index, item) pairs
    """
    e.g.:
    list1 = ["this", "is", "a", "test"]
    for index, item in enumerate(list1):
        print(index, item)
    >>>
    0 this
    1 is
    2 a
    3 test
    """
    ## 5-fold CV loop: i is the fold index; train_index / valid_index are the row indices of the training and validation folds
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        
        # clf_name == "lgb" means the lightgbm module was passed in as clf
        if clf_name == "lgb":
            # training and validation Datasets
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            
            # LightGBM parameters
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 28,
                'n_jobs': 24,  # setting both nthread and n_jobs triggers the warning in the logs below; n_jobs wins
                'verbose': -1,
            }
            
            # Train: pass in the training and validation sets
            # num_boost_round: maximum number of boosting iterations
            # verbose_eval: log the validation metric every 100 iterations
            # early_stopping_rounds: stop if no improvement for 200 rounds
            model = clf.train(params, 
                      train_set=train_matrix, 
                      valid_sets=valid_matrix, 
                      num_boost_round=2000, 
                      verbose_eval=100, 
                      early_stopping_rounds=200)
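            # Note: LightGBM >= 4.0 removed verbose_eval and
            # early_stopping_rounds from train(); on newer versions pass
            # callbacks=[clf.log_evaluation(100), clf.early_stopping(200)]
            # instead.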
            
            # predict on the validation fold and on the test set
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration) 
            
        # reshape the validation labels into a column vector, then one-hot encode
        val_y = np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        
        # print this fold's test-set probabilities and accumulate them
        print(test_pred)
        test += test_pred
        # score the validation fold with the competition metric
        score = abs_sum(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    #对于5折后的所有概率的累计和,然后除以多少个模型
    test=test/kf.n_splits

    return test
# Model with LightGBM, a fast framework based on the GBDT algorithm
def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test
lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0404083
[200]	valid_0's multi_logloss: 0.0420551
[300]	valid_0's multi_logloss: 0.0489518
Early stopping, best iteration is:
[123]	valid_0's multi_logloss: 0.0399632
Predicted probability matrix:
[[9.99698070e-01 2.70039109e-04 2.69411380e-05 4.94999289e-06]
 [1.21746804e-05 5.85496993e-05 9.99927738e-01 1.53724157e-06]
 [9.83607048e-07 9.53088446e-06 5.78380704e-06 9.99983702e-01]
 ...
 [1.28588929e-01 1.05085229e-04 8.71214661e-01 9.13245122e-05]
 [9.99898884e-01 9.44640033e-05 5.24433804e-06 1.40752937e-06]
 [9.96917117e-01 8.77305702e-04 9.33462129e-04 1.27211560e-03]]
[624.0616253339873]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0409499
[200]	valid_0's multi_logloss: 0.0430094
[300]	valid_0's multi_logloss: 0.0509631
Early stopping, best iteration is:
[132]	valid_0's multi_logloss: 0.0403365
Predicted probability matrix:
[[9.99766094e-01 2.27649008e-04 5.10883800e-06 1.14815289e-06]
 [4.90135572e-06 1.76149950e-05 9.99976985e-01 4.98221302e-07]
 [5.02979604e-07 1.75218549e-06 3.95701033e-06 9.99993788e-01]
 ...
 [1.88770643e-01 1.68366757e-04 8.11037284e-01 2.37062540e-05]
 [9.99929675e-01 5.78811648e-05 1.17246894e-05 7.18797217e-07]
 [9.78389668e-01 8.11758773e-03 9.64454422e-03 3.84820021e-03]]
[624.0616253339873, 570.6563156765974]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0347992
[200]	valid_0's multi_logloss: 0.0363279
[300]	valid_0's multi_logloss: 0.0424064
Early stopping, best iteration is:
[127]	valid_0's multi_logloss: 0.0340505
Predicted probability matrix:
[[9.99609883e-01 3.70753146e-04 1.32602819e-05 6.10350667e-06]
 [1.22141545e-05 4.55585638e-05 9.99941288e-01 9.39131609e-07]
 [6.81621049e-07 3.79067446e-06 5.15956743e-06 9.99990368e-01]
 ...
 [6.00128813e-02 1.98460513e-04 9.39774930e-01 1.37284585e-05]
 [9.99889796e-01 1.01020579e-04 7.44737754e-06 1.73593239e-06]
 [9.93567672e-01 2.09674541e-03 9.72441138e-04 3.36314145e-03]]
[624.0616253339873, 570.6563156765974, 529.4810745605361]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0426303
[200]	valid_0's multi_logloss: 0.0466983
[300]	valid_0's multi_logloss: 0.0544747
Early stopping, best iteration is:
[106]	valid_0's multi_logloss: 0.0425068
Predicted probability matrix:
[[9.99676694e-01 2.73223539e-04 4.25582706e-05 7.52417657e-06]
 [1.26638715e-05 1.05055412e-04 9.99880095e-01 2.18565228e-06]
 [4.06183581e-06 1.50831540e-05 2.03243762e-05 9.99960531e-01]
 ...
 [1.60476892e-01 2.13563287e-04 8.39219569e-01 8.99752495e-05]
 [9.99714443e-01 2.55687992e-04 2.45784124e-05 5.29012122e-06]
 [9.72822921e-01 7.10275449e-03 9.99610715e-03 1.00782169e-02]]
[624.0616253339873, 570.6563156765974, 529.4810745605361, 652.3274745655527]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0374434
[200]	valid_0's multi_logloss: 0.0388786
[300]	valid_0's multi_logloss: 0.0451416
Early stopping, best iteration is:
[122]	valid_0's multi_logloss: 0.0366685
Predicted probability matrix:
[[9.99744769e-01 2.35103603e-04 1.49822504e-05 5.14491703e-06]
 [1.17520698e-05 1.20944642e-04 9.99865223e-01 2.08049130e-06]
 [1.51365352e-06 3.32215936e-06 3.60549178e-06 9.99991559e-01]
 ...
 [3.74454974e-02 7.66420172e-05 9.62455775e-01 2.20853869e-05]
 [9.99837260e-01 1.52861288e-04 7.38539621e-06 2.49381247e-06]
 [9.67159636e-01 2.84279467e-03 1.31817376e-02 1.68158322e-02]]
[624.0616253339873, 570.6563156765974, 529.4810745605361, 652.3274745655527, 558.9042176962932]
lgb_score_list: [624.0616253339873, 570.6563156765974, 529.4810745605361, 652.3274745655527, 558.9042176962932]
lgb_score_mean: 587.0861415665934
lgb_score_std: 44.73504596598341
# Build the submission file
temp = pd.DataFrame(lgb_test)
result = pd.read_csv('sample_submit.csv')
result['label_0'] = temp[0]
result['label_1'] = temp[1]
result['label_2'] = temp[2]
result['label_3'] = temp[3]
result.to_csv('./时间序列1.0.csv', index=False)
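As a final sanity check (a sketch, assuming the sample_submit.csv layout above), the averaged fold probabilities in each row should still sum to 1:

# Each row of averaged class probabilities should sum to ~1
assert np.allclose(temp.sum(axis=1), 1.0, atol=1e-6)
print(result.head())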