[Tianchi] Zero-Based Data Mining: Heartbeat Signal Classification Prediction, Feature Engineering (Score Boost 2: 386 points, rank 33)

This post covers time-series feature engineering with the tsfresh library: data preprocessing, feature extraction, and feature selection. Applying these methods to the heartbeat signal classification task of the Alibaba Cloud Tianchi competition, it combines the resulting features with a LightGBM model to improve prediction performance, and it also documents the memory and multiprocessing problems hit along the way, together with their fixes.

Understanding the Problem

  • Learn preprocessing methods for time-series data
  • Learn how to use the time-series feature tool Tsfresh (TimeSeries Fresh)

Competition page: https://tianchi.aliyun.com/competition/entrance/531883/introduction

Contents

  • Data preprocessing
    • Reshaping the time-series data into the long format tsfresh expects (see the sketch below)
    • Adding a time-step feature, time
  • Feature engineering
    • Constructing time-series features
    • Feature selection
    • Processing time-series features with tsfresh
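For orientation, tsfresh's extract_features expects the series in long format: one row per (id, time, value) triple. A minimal sketch of the target shape, with toy values rather than real dataset rows:

import pandas as pd

# Toy long-format frame: each row is one time step of one signal
df_long = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1],    # which heartbeat sample
    "time": [0, 1, 2, 0, 1, 2],    # time step within the sample
    "heartbeat_signals": [0.98, 0.92, 0.74, 0.01, 0.00, 0.07],
})
print(df_long)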

Code

To get to results faster, the training set was cut from 100,000 rows to 1,000, which ran in 47 s. One small hiccup: scipy had to be upgraded to the latest release, otherwise all sorts of errors appeared. It was originally scipy-1.4.1; running pip install -U scipy brought it to 1.5.4 and the errors went away.

# Imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
import multiprocessing
if __name__ == '__main__':
    multiprocessing.freeze_support()
    # Load data
    data_train = pd.read_csv("./datasets/train.csv")
    data_test_A = pd.read_csv("./datasets/testA.csv")

    print(data_train.shape)
    data_train = data_train.loc[99000:, :]  # keep only the last 1000 rows to speed things up
    print(data_train.shape)
    print(data_test_A.shape)

    print(data_train.head())

    print(data_test_A.head())

    # Pivot each heartbeat signal from one row per sample to one row per
    # time step, adding a time-step feature "time" to every signal
    train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
    train_heartbeat_df = train_heartbeat_df.reset_index()
    train_heartbeat_df = train_heartbeat_df.set_index("level_0")
    train_heartbeat_df.index.name = None
    train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
    train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

    print(train_heartbeat_df)

    # Join the reshaped signals back onto the training data and store
    # the label column separately
    data_train_label = data_train["label"]
    data_train = data_train.drop("label", axis=1)
    data_train = data_train.drop("heartbeat_signals", axis=1)
    data_train = data_train.join(train_heartbeat_df)

    print(data_train)

    print(data_train[data_train["id"]==1])

    from tsfresh import extract_features

    # Feature extraction
    train_features = extract_features(data_train, column_id='id', column_sort='time')
    print(train_features)

    from tsfresh.utilities.dataframe_functions import impute

    # Replace NaN/inf values in the extracted features (impute works in place)
    impute(train_features)


    from tsfresh import select_features

    # Select features by their statistical relevance to the label
    train_features_filtered = select_features(train_features, data_train_label)

    print(train_features_filtered)

Output:

(100000, 3)
(1000, 3)
(20000, 2)
          id                                  heartbeat_signals  label
99000  99000  0.9756354658280755,0.9219974760554872,0.735584...    0.0
99001  99001  0.013815892693101902,0.0,0.0671713603494603,0....    2.0
99002  99002  0.0,0.23649261108230346,0.38197547195844656,0....    0.0
99003  99003  0.9806634976267818,0.7258250285618127,0.509013...    0.0
99004  99004  0.9166493737668732,0.8767511968692409,0.820486...    2.0
       id                                  heartbeat_signals
0  100000  0.9915713654170097,1.0,0.6318163407681274,0.13...
1  100001  0.6075533139615096,0.5417083883163654,0.340694...
2  100002  0.9752726292239277,0.6710965234906665,0.686758...
3  100003  0.9956348033996116,0.9170249621481004,0.521096...
4  100004  1.0,0.8879490481178918,0.745564725322326,0.531...
       time  heartbeat_signals
99000     0           0.975635
99000     1           0.921997
99000     2           0.735584
99000     3           0.462566
99000     4           0.252643
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

[205000 rows x 2 columns]
          id  time  heartbeat_signals
99000  99000     0           0.975635
99000  99000     1           0.921997
99000  99000     2           0.735584
99000  99000     3           0.462566
99000  99000     4           0.252643
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

[205000 rows x 3 columns]
Empty DataFrame
Columns: [id, time, heartbeat_signals]
Index: []
Feature Extraction: 100%|██████████| 40/40 [00:47<00:00,  1.20s/it]
C:\Anaconda3-5.2.0-64\lib\site-packages\tsfresh\utilities\dataframe_functions.py:164: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.values' or 'np.asarray(..)' instead.
  data = df.get_values()
variable  heartbeat_signals__abs_energy  ...  heartbeat_signals__variance_larger_than_standard_deviation
id                                       ...                                                            
99000                         10.750091  ...                                                0.0         
99001                         51.773551  ...                                                0.0         
99002                         74.970303  ...                                                0.0         
99003                          8.504231  ...                                                0.0         
99004                         16.079925  ...                                                0.0         
...                                 ...  ...                                                ...         
99995                         28.742238  ...                                                0.0         
99996                         31.866323  ...                                                0.0         
99997                         16.412857  ...                                                0.0         
99998                         14.281281  ...                                                0.0         
99999                         21.637471  ...                                                0.0         

[1000 rows x 794 columns]
WARNING:tsfresh.feature_selection.relevance:Infered regression as machine learning task
variable  heartbeat_signals__percentage_of_reoccurring_values_to_all_values  ...  heartbeat_signals__fft_coefficient__coeff_31__attr_"angle"
id                                                                           ...                                                            
99000                                              0.848780                  ...                                         -54.920196         
99001                                              0.443902                  ...                                         131.401170         
99002                                              0.741463                  ...                                         -18.500814         
99003                                              0.780488                  ...                                         -68.498682         
99004                                              0.419512                  ...                                        -100.719774         
...                                                     ...                  ...                                                ...         
99995                                              0.692683                  ...                                         -40.819373         
99996                                              0.697561                  ...                                          -7.698228         
99997                                              0.404878                  ...                                         -40.873693         
99998                                              0.292683                  ...                                         -18.446829         
99999                                              0.780488                  ...                                         -40.764763         

[1000 rows x 485 columns]
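As an aside, the str.split(...).stack() reshaping above can also be written with pd.melt; a minimal sketch, assuming data_train still holds the raw id and heartbeat_signals columns:

import pandas as pd

# Alternative to the stack()-based reshape: expand, then melt to long format
signals = data_train["heartbeat_signals"].str.split(",", expand=True).astype(float)
signals["id"] = data_train["id"]
long_df = signals.melt(id_vars="id", var_name="time", value_name="heartbeat_signals")
long_df = long_df.sort_values(["id", "time"]).reset_index(drop=True)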

The feature-engineering code merged with the training and test code appears later in this post.

 

A side note: the tsfresh version at this point was 0.11.2:

>>> import tsfresh
>>> tsfresh.__version__
'0.11.2'

It repeatedly failed with out-of-memory errors and exited before the run could finish:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
File "C:\Anaconda3-5.2.0-64\lib\site-packages\tqdm\std.py", line 1166, in __iter__
    for obj in iterable:
  File "C:\Anaconda3-5.2.0-64\lib\multiprocessing\pool.py", line 735, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000022D1B1BD9B0>'. Reason: 'PicklingError("Can't pickle <class 'MemoryError'>: it's not the same object as builtins.MemoryError",)'

So either memory was insufficient or something was off in the multiprocessing handling.

After upgrading tsfresh, everything ran normally:

pip install -U tsfresh

The upgraded version is tsfresh-0.18.0.

Even then, a Windows 10 machine with 32 GB of RAM still hit memory errors. In the end the job ran on Ubuntu 18.04, finishing in a little over 20 minutes.
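If you need to stay on Windows, extract_features exposes knobs that can lower peak memory: fewer worker processes, a smaller chunksize, or the cheaper EfficientFCParameters preset instead of the full ComprehensiveFCParameters. A hedged sketch; the parameter values here are guesses to tune for your machine:

from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

train_features = extract_features(
    data_train,
    column_id="id",
    column_sort="time",
    default_fc_parameters=EfficientFCParameters(),  # skips the most expensive calculators
    n_jobs=2,       # fewer parallel workers -> lower peak memory
    chunksize=10,   # smaller chunks handed to each worker
)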

The latest lightgbm still raised the error below, and pinning lightgbm to 3.0.0 did not fix it either. The likely cause is that the feature names generated by tsfresh contain double quotes and other special characters; substituting them with the method from https://www.pythonheidong.com/blog/article/628512/233c40db39f5a2494a81/ lets the run go through:

    model = clf.train(params,
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/engine.py", line 228, in train
    booster = Booster(params=params, train_set=train_set)
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 2229, in __init__
    train_set.construct()
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1468, in construct
    self._lazy_init(self.data, label=self.label,
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1298, in _lazy_init
    return self.set_feature_name(feature_name)
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1780, in set_feature_name
    _safe_call(_LIB.LGBM_DatasetSetFeatureNames(
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 110, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Do not support special JSON characters in feature name.

The linked answer says, roughly: this message usually comes up with an LGBMClassifier() (LightGBM) model. After loading your data with pandas, add the following near the top and the problem goes away:

import re
df = df.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
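One caveat with that regex fix: stripping special characters can make two distinct tsfresh feature names collide. A quick guard worth adding after the rename (my own addition, not from the linked answer):

import re

x_train = x_train.rename(columns=lambda c: re.sub('[^A-Za-z0-9_]+', '', c))
# Two different tsfresh names can sanitize to the same string; fail fast if so
assert x_train.columns.is_unique, "sanitized feature names collide; disambiguate before training"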

Still, I upgraded to the latest lightgbm (Successfully installed lightgbm-3.2.0). The final code is below:

## 1. Import third-party packages
import pandas as pd
import numpy as np

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler


from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
import multiprocessing

import re

def gen_tsfresh_features():

    # Load data
    data_train = pd.read_csv("./datasets/train.csv")

    # print(data_train.shape)
    # data_train = data_train.loc[99000:, :]
    # print(data_train.shape)

    # print(data_train.head())

    # print(data_test_A.shape)
    # data_test_A = data_test_A.loc[19000:, :]
    # print(data_test_A.shape)

    # print(data_test_A.head())

    # Process the training data

    # Pivot each heartbeat signal to long format, adding a time-step feature "time" per signal
    train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
    train_heartbeat_df = train_heartbeat_df.reset_index()
    train_heartbeat_df = train_heartbeat_df.set_index("level_0")
    train_heartbeat_df.index.name = None
    train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

    # print(train_heartbeat_df)

    # Join the reshaped signals back onto the training data and store the label column separately
    data_train_label = data_train["label"]
    data_train = data_train.drop("label", axis=1)
    data_train = data_train.drop("heartbeat_signals", axis=1)
    data_train = data_train.join(train_heartbeat_df)

    # print(data_train)

    # print(data_train[data_train["id"] == 1])

    from tsfresh import extract_features

    print(data_train.info())
    print(data_train.tail())
    # Reduce memory usage
    data_train = reduce_mem_usage(data_train)
    data_train.heartbeat_signals = data_train.heartbeat_signals.astype(np.float32)  # some calculators in extract_features do not support float16
    print('data_train done Memory usage of dataframe is {:.2f} MB'.format(data_train.memory_usage().sum() / 1024 ** 2))
    print(data_train.info())
    print(data_train.tail())

    # Feature extraction
    from tsfresh.feature_extraction import ComprehensiveFCParameters
    settings = ComprehensiveFCParameters()
    # from tsfresh.feature_extraction import MinimalFCParameters
    # settings = MinimalFCParameters()
    from tsfresh.feature_extraction import extract_features
    train_features = extract_features(data_train, default_fc_parameters=settings, column_id='id', column_sort='time')

    # Feature extraction (default settings)
    # train_features = extract_features(data_train, column_id='id', column_sort='time')
    # print(train_features)

    from tsfresh.utilities.dataframe_functions import impute

    # Replace NaN/inf values in the extracted features (impute works in place)
    impute(train_features)
    # print(f"train_features.columns:{train_features.columns} {len(train_features.columns)}")

    from tsfresh import select_features

    # Select features by their statistical relevance to the label
    train_features_filtered = select_features(train_features, data_train_label)

    # print(train_features_filtered)
    # print(f"train_features_filtered.columns:{train_features_filtered.columns} {len(train_features_filtered.columns)}")

    # Process the test data

    data_test_A = pd.read_csv("./datasets/testA.csv")

    # Pivot each heartbeat signal to long format, adding a time-step feature "time" per signal
    test_heartbeat_df = data_test_A["heartbeat_signals"].str.split(",", expand=True).stack()
    test_heartbeat_df = test_heartbeat_df.reset_index()
    test_heartbeat_df = test_heartbeat_df.set_index("level_0")
    test_heartbeat_df.index.name = None
    test_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    test_heartbeat_df["heartbeat_signals"] = test_heartbeat_df["heartbeat_signals"].astype(float)

    # print(test_heartbeat_df)

    # Join the reshaped signals back onto the test data (no label column here)
    data_test_A = data_test_A.drop("heartbeat_signals", axis=1)
    data_test_A = data_test_A.join(test_heartbeat_df)

    # print(data_test_A)

    # print(data_test_A[data_test_A["id"] == 1])

    from tsfresh import extract_features

    # Reduce memory usage
    data_test_A = reduce_mem_usage(data_test_A)
    data_test_A.heartbeat_signals = data_test_A.heartbeat_signals.astype(np.float32)  # some calculators in extract_features do not support float16
    print('data_test_A done Memory usage of dataframe is {:.2f} MB'.format(data_test_A.memory_usage().sum() / 1024 ** 2))
    print(data_test_A.info())
    print(data_test_A.tail())

    # Feature extraction
    from tsfresh.feature_extraction import ComprehensiveFCParameters
    settings = ComprehensiveFCParameters()
    # from tsfresh.feature_extraction import MinimalFCParameters
    # settings = MinimalFCParameters()
    from tsfresh.feature_extraction import extract_features
    test_features = extract_features(data_test_A, default_fc_parameters=settings, column_id='id', column_sort='time')

    # Feature extraction (default settings)
    # test_features = extract_features(data_test_A, column_id='id', column_sort='time')
    # print(test_features)

    from tsfresh.utilities.dataframe_functions import impute

    # Replace NaN/inf values in the extracted features (impute works in place)
    impute(test_features)
    # Align the test feature columns with the columns selected on the training data
    # print(f"test_features.columns:{test_features.columns} {len(test_features.columns)}")
    test_features_filtered = test_features[train_features_filtered.columns]
    # print(f"test_features_filtered.columns:{test_features_filtered.columns} {len(test_features_filtered.columns)}")

    return train_features_filtered, data_train_label, test_features_filtered

def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

def lightgbm_train_test(train, label, test):
    # Simple preprocessing
    train = reduce_mem_usage(train)
    test = reduce_mem_usage(test)

    ## 4. Prepare training/test data

    x_train = train
    x_train.reset_index(drop=True, inplace=True)

    y_train = label
    y_train.reset_index(drop=True, inplace=True)

    x_test = test
    x_test.reset_index(drop=True, inplace=True)

    # print("x_train.columns:", x_train.columns)
    # print("x_test.columns:", x_test.columns)
    x_train = x_train.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
    x_test = x_test.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
    # print("x_train.columns:", x_train.columns)
    # print("x_test.columns:", x_test.columns)

    print(x_train.shape, x_test.shape, y_train.shape)

    ## 5. Model training

    def abs_sum(y_pre, y_tru):
        y_pre = np.array(y_pre)
        y_tru = np.array(y_tru)
        loss = sum(sum(abs(y_pre - y_tru)))
        return loss

    def cv_model(clf, train_x, train_y, test_x, clf_name):
        folds = 5
        seed = 2021
        kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
        test = np.zeros((test_x.shape[0], 4))

        cv_scores = []
        onehot_encoder = OneHotEncoder(sparse=False)
        for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
            print('************************************ {} ************************************'.format(str(i + 1)))
            trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
            if clf_name == "lgb":
                train_matrix = clf.Dataset(trn_x, label=trn_y)
                valid_matrix = clf.Dataset(val_x, label=val_y)

                params = {
                    'boosting_type': 'gbdt',
                    'objective': 'multiclass',
                    'num_class': 4,
                    'num_leaves': 2 ** 5,
                    'feature_fraction': 0.8,
                    'bagging_fraction': 0.8,
                    'bagging_freq': 4,
                    'learning_rate': 0.1,
                    'seed': seed,
                    'n_jobs': 40,
                    'verbose': -1,
                }

                model = clf.train(params,
                                  train_set=train_matrix,
                                  valid_sets=valid_matrix,
                                  num_boost_round=2000,
                                  verbose_eval=100,
                                  early_stopping_rounds=200)
                val_pred = model.predict(val_x, num_iteration=model.best_iteration)
                test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            print("val_y:", val_y.shape)
            val_y = np.array(val_y).reshape(-1, 1)
            val_y = onehot_encoder.fit_transform(val_y)
            print("val_y:", val_y.shape)
            print('Predicted probability matrix:')
            print(test_pred)
            test += test_pred
            score = abs_sum(val_y, val_pred)
            cv_scores.append(score)
            print(cv_scores)
        print("%s_scotrainre_list:" % clf_name, cv_scores)
        print("%s_score_mean:" % clf_name, np.mean(cv_scores))
        print("%s_score_std:" % clf_name, np.std(cv_scores))
        test = test / kf.n_splits

        return test

    def lgb_model(x_train, y_train, x_test):
        lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
        return lgb_test

    lgb_test = lgb_model(x_train, y_train, x_test)

    ## 6. Prediction output

    temp = pd.DataFrame(lgb_test)
    result = pd.read_csv('./datasets/sample_submit.csv')
    result['label_0'] = temp[0]
    result['label_1'] = temp[1]
    result['label_2'] = temp[2]
    result['label_3'] = temp[3]
    result.to_csv('lightgbm_tsfresh_submit.csv', index=False)

if __name__ == '__main__':

    multiprocessing.freeze_support()

    ## 2. Load data
    train, label, test = gen_tsfresh_features()

    ## 3. Preprocess, train, and predict
    lightgbm_train_test(train, label, test)

Results:

With the full 100,000 training rows, tsfresh's feature extraction (using the full ComprehensiveFCParameters set) took 13 min 12 s; the 20,000 test rows took 2 min 40 s.
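Since extraction dominates the runtime, caching the filtered features to disk once and reloading them on later runs saves those 13 minutes; a minimal sketch (the cache file name is my own choice):

import os
import pandas as pd

TRAIN_CACHE = "train_features_filtered.csv"  # hypothetical cache path

if os.path.exists(TRAIN_CACHE):
    train_features_filtered = pd.read_csv(TRAIN_CACHE, index_col=0)
else:
    train_features_filtered, data_train_label, test_features_filtered = gen_tsfresh_features()
    train_features_filtered.to_csv(TRAIN_CACHE)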

Log output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Dtype  
---  ------             -----  
 0   id                 int64  
 1   time               int64  
 2   heartbeat_signals  float64
dtypes: float64(1), int64(2)
memory usage: 625.6 MB
None
          id  time  heartbeat_signals
99999  99999   200                0.0
99999  99999   201                0.0
99999  99999   202                0.0
99999  99999   203                0.0
99999  99999   204                0.0
Memory usage of dataframe is 625.61 MB
Memory usage after optimization is: 312.81 MB
Decreased by 50.0%
data_train done Memory usage of dataframe is 351.91 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column             Dtype  
---  ------             -----  
 0   id                 int32  
 1   time               int16  
 2   heartbeat_signals  float32
dtypes: float32(1), int16(1), int32(1)
memory usage: 351.9 MB
None
          id  time  heartbeat_signals
99999  99999   200                0.0
99999  99999   201                0.0
99999  99999   202                0.0
99999  99999   203                0.0
99999  99999   204                0.0
Memory usage of dataframe is 125.12 MB
Memory usage after optimization is: 62.56 MB
Decreased by 50.0%
data_test_A done Memory usage of dataframe is 70.38 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4100000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column             Dtype  
---  ------             -----  
 0   id                 int32  
 1   time               int16  
 2   heartbeat_signals  float32
dtypes: float32(1), int16(1), int32(1)
memory usage: 70.4 MB
None
           id  time  heartbeat_signals
19999  119999   200                0.0
19999  119999   201                0.0
19999  119999   202                0.0
19999  119999   203                0.0
19999  119999   204                0.0
Memory usage of dataframe is 540.16 MB
Memory usage after optimization is: 135.61 MB
Decreased by 74.9%
Memory usage of dataframe is 108.03 MB
Memory usage after optimization is: 27.12 MB
Decreased by 74.9%
(100000, 707) (20000, 707) (100000,)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0390895
[200]	valid_0's multi_logloss: 0.0403544
[300]	valid_0's multi_logloss: 0.0460664
Early stopping, best iteration is:
[120]	valid_0's multi_logloss: 0.0384817
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99872421e-01 1.14566158e-04 1.04119738e-05 2.60074910e-06]
 [2.99781740e-05 3.57574813e-05 9.99933330e-01 9.34458391e-07]
 [1.33123890e-06 5.29917316e-06 6.80470320e-06 9.99986565e-01]
 ...
 [9.68671325e-02 1.04000140e-04 9.03002578e-01 2.62898604e-05]
 [9.99461999e-01 5.06142516e-04 2.46321321e-05 7.22680627e-06]
 [9.87411802e-01 2.59772061e-03 3.32525719e-03 6.66522064e-03]]
[616.9668653555406]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0411013
[200]	valid_0's multi_logloss: 0.0435782
[300]	valid_0's multi_logloss: 0.0498305
Early stopping, best iteration is:
[128]	valid_0's multi_logloss: 0.0406417
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99333579e-01 6.43609664e-04 1.83672346e-05 4.44378583e-06]
 [2.14190087e-05 4.23261287e-05 9.99935527e-01 7.28015920e-07]
 [7.10568621e-07 1.02579607e-06 8.49856913e-06 9.99989765e-01]
 ...
 [6.40215168e-02 1.32011105e-04 9.35838236e-01 8.23633287e-06]
 [9.99812305e-01 1.79436904e-04 7.10027554e-06 1.15781173e-06]
 [9.63323467e-01 1.84819306e-03 2.49731242e-02 9.85521529e-03]]
[616.9668653555406, 584.5661600687715]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0345854
[200]	valid_0's multi_logloss: 0.0359173
[300]	valid_0's multi_logloss: 0.0414517
Early stopping, best iteration is:
[129]	valid_0's multi_logloss: 0.0334529
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99670151e-01 3.10434420e-04 1.38141763e-05 5.60089113e-06]
 [3.29422885e-05 5.07408138e-05 9.99914696e-01 1.62129796e-06]
 [6.90331518e-07 5.30156240e-06 6.01034043e-06 9.99987998e-01]
 ...
 [4.69086868e-02 1.15471922e-04 9.52964291e-01 1.15503859e-05]
 [9.99859882e-01 1.31978953e-04 6.24375386e-06 1.89479306e-06]
 [9.66072018e-01 5.37069335e-03 6.71278736e-03 2.18445009e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0421553
[200]	valid_0's multi_logloss: 0.0456984
[300]	valid_0's multi_logloss: 0.0531679
Early stopping, best iteration is:
[101]	valid_0's multi_logloss: 0.0420942
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99438677e-01 5.05283413e-04 3.98975698e-05 1.61422673e-05]
 [3.86655428e-05 1.33491932e-04 9.99823524e-01 4.31866830e-06]
 [4.95004900e-06 1.28166538e-05 1.10356746e-05 9.99971198e-01]
 ...
 [1.94444222e-01 1.86454698e-04 8.05304622e-01 6.47005102e-05]
 [9.99590997e-01 3.25147150e-04 7.27911324e-05 1.10646745e-05]
 [9.51256879e-01 3.22789672e-03 1.31294692e-02 3.23857556e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0369026
[200]	valid_0's multi_logloss: 0.0378718
[300]	valid_0's multi_logloss: 0.0440953
Early stopping, best iteration is:
[135]	valid_0's multi_logloss: 0.0357339
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99783458e-01 1.99305801e-04 1.34426163e-05 3.79360372e-06]
 [4.08836356e-05 9.34827615e-05 9.99863910e-01 1.72378711e-06]
 [1.44260947e-06 8.34397690e-06 5.16060345e-06 9.99985053e-01]
 ...
 [1.64699038e-01 2.15540464e-04 8.35032806e-01 5.26147275e-05]
 [9.99830946e-01 1.60164144e-04 7.30342961e-06 1.58611278e-06]
 [9.73556586e-01 4.86287743e-03 1.00772090e-02 1.15033276e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347, 528.0451056618585]
lgb_scotrainre_list: [616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347, 528.0451056618585]
lgb_score_mean: 584.4829973877744
lgb_score_std: 54.662614773384554

The feature engineering improved the score by 5 points over the earlier baseline. A very large number of features was generated, and the selection step may have discarded some useful information, so the gain is modest.
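If the selection step seems to throw away too much, select_features exposes an fdr_level parameter: the expected fraction of irrelevant features among those kept, per the Benjamini-Yekutieli procedure cited in the references. Raising it retains more borderline features; a hedged sketch:

from tsfresh import select_features

# The default fdr_level is 0.05; a looser level keeps more borderline features
train_features_filtered = select_features(train_features, data_train_label, fdr_level=0.20)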

Training on the full feature set and then, for each sample, setting the highest-probability class to 1 and the others to 0 scored 386 and lifted the ranking to 33rd place. Source code: 零基础入门数据挖掘-心跳信号分类预测-386分-33名 代码.rar
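The hard-label postprocessing described above (highest-probability class to 1, the rest to 0) is a couple of lines over the averaged fold predictions; a minimal sketch:

import numpy as np

# lgb_test: (n_samples, 4) probability matrix averaged over the folds
hard_labels = np.eye(4)[np.argmax(lgb_test, axis=1)]  # one-hot of the argmax class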

References


https://tianchi.aliyun.com/competition/entrance/531883/introduction

https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification

https://github.com/datawhalechina/team-learning-data-mining/blob/master/HeartbeatClassification/Task3%20%E7%89%B9%E5%BE%81%E5%B7%A5%E7%A8%8B.md

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.

https://blog.csdn.net/duxiaodong1122?spm=1011.2124.3001.5343&type=blog

 
