Heartbeat Signal Classification Prediction (Task 1)

Understanding the Competition Problem

In June 2016, the General Office of the State Council issued the Guiding Opinions on Promoting and Regulating the Application and Development of Healthcare Big Data, which states that the application and development of healthcare big data will bring profound changes to healthcare models and help improve the efficiency and quality of healthcare services.

The competition is set against electrocardiogram (ECG) data: given ECG sensor readings, contestants must predict the heartbeat signal class, where the classes correspond to normal cases and to cases affected by various arrhythmias and by myocardial infarction. This is a multiclass classification problem. The competition is meant to introduce applications of healthcare big data and to help newcomers practice and improve on their own.

The data comes from the ECG records of a certain platform, totaling more than 200,000 records. The main field is a single column of heartbeat signal sequences; every sample is sampled at the same frequency and all sequences have equal length. To keep the competition fair, 100,000 records are drawn as the training set, 20,000 as test set A, and 20,000 as test set B, and the heartbeat class (label) information is anonymized.

Competition page: https://tianchi.aliyun.com/competition/entrance/531883/introduction

Importing Third-Party Packages

import os
import gc
import math

import pandas as pd
import numpy as np

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

Understanding LightGBM, XGBoost, and CatBoost

LightGBM (Light Gradient Boosting Machine): an open-source machine learning framework from Microsoft; it offers faster training and higher efficiency, lower memory use, high accuracy, and support for parallel learning and large-scale data.

CatBoost (Categorical Features + Gradient Boosting): an open-source framework from the Russian search giant Yandex; it delivers excellent performance, strong robustness and generality, and is easy to use and practical.

XGBoost (eXtreme Gradient Boosting): fast to compute and applicable to both classification and regression; often billed as a "must-have weapon for winning competitions".

Reference: 大战三回合:XGBoost、LightGBM和Catboost一决高低 ("Three rounds of battle: XGBoost, LightGBM and CatBoost face off")
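
To get a feel for the three libraries, here is a minimal sketch (not part of the competition pipeline; toy random data, and the constructor arguments shown are common defaults rather than tuned values) of their shared scikit-learn-style classifier interface:

import numpy as np
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

X = np.random.rand(200, 10)        # toy features
y = np.random.randint(0, 4, 200)   # toy 4-class labels

for model in [LGBMClassifier(n_estimators=50),
              XGBClassifier(n_estimators=50, eval_metric='mlogloss'),
              CatBoostClassifier(iterations=50, verbose=0)]:
    model.fit(X, y)                # same fit/predict_proba API across all three
    print(type(model).__name__, model.predict_proba(X[:1]))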

Reading the Data

train = pd.read_csv('train.csv')
test = pd.read_csv('testA.csv')
train.head()
test.head()

id heartbeat_signals label
0 0 0.9912297987616655,0.9435330436439665,0.764677… 0.0
1 1 0.9714822034884503,0.9289687459588268,0.572932… 0.0
2 2 1.0,0.9591487564065292,0.7013782792997189,0.23… 2.0
3 3 0.9757952826275774,0.9340884687738161,0.659636… 0.0
4 4 0.0,0.055816398940721094,0.26129357194994196,0… 2.0

id heartbeat_signals
0 100000 0.9915713654170097,1.0,0.6318163407681274,0.13…
1 100001 0.6075533139615096,0.5417083883163654,0.340694…
2 100002 0.9752726292239277,0.6710965234906665,0.686758…
3 100003 0.9956348033996116,0.9170249621481004,0.521096…
4 100004 1.0,0.8879490481178918,0.745564725322326,0.531…

Data Preprocessing

def reduce_mem_usage(df):
    """Downcast each numeric column to the smallest dtype that can hold its
    value range, and convert object columns to category, to cut memory use."""
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers [c_min, c_max].
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Pick the narrowest float type; float16/float32 trade precision for space.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

df.memory_usage()

Set the parameter memory_usage='deep' to get an accurate memory figure that also counts the contents of object columns.
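
For example:

train.memory_usage(deep=True).sum() / 1024**2   # size in MB, including object-column contents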

The reduce_mem_usage() function

reduces memory overhead through type conversion, casting numeric columns to cheaper types. The conversion is not entirely lossless: float16, for instance, keeps only about three significant decimal digits, but for normalized signal values in [0, 1] that loss is acceptable here.
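
A quick illustration of that trade-off, using the first raw signal value visible in train.head() above:

import numpy as np
x = np.float64(0.9912297987616655)   # raw value from train.csv
print(np.float16(x))                 # prints 0.9912; the stored float16 value is 0.9912109375, which pandas shows as 0.991211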

Reference: 用pandas处理大数据——节省90%内存消耗的小贴士 ("Tips for handling big data with pandas: cutting memory consumption by 90%")

Simple Preprocessing

train_list = []
for items in train.values:
    train_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])

train = pd.DataFrame(np.array(train_list))
train.columns = ['id'] + ['s_' + str(i) for i in range(len(train_list[0]) - 2)] + ['label']
train = reduce_mem_usage(train)

test_list = []
for items in test.values:
    test_list.append([items[0]] + [float(i) for i in items[1].split(',')])

test = pd.DataFrame(np.array(test_list))
test.columns = ['id'] + ['s_' + str(i) for i in range(len(test_list[0]) - 1)]
test = reduce_mem_usage(test)

Memory usage of dataframe is 157.93 MB
Memory usage after optimization is: 39.67 MB
Decreased by 74.9%
Memory usage of dataframe is 31.43 MB
Memory usage after optimization is: 7.90 MB
Decreased by 74.9%

train.head()

   id       s_0       s_1       s_2       s_3       s_4       s_5       s_6       s_7       s_8  …  s_196  s_197  s_198  s_199  s_200  s_201  s_202  s_203  s_204  label
0  0.0  0.991211  0.943359  0.764648  0.618652  0.379639  0.190796  0.040222  0.026001  0.031708  …    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
1  1.0  0.971680  0.929199  0.572754  0.178467  0.122986  0.132324  0.094421  0.089600  0.030487  …    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
2  2.0  1.000000  0.958984  0.701172  0.231812  0.000000  0.080688  0.128418  0.187500  0.280762  …    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.0
3  3.0  0.975586  0.934082  0.659668  0.249878  0.237061  0.281494  0.249878  0.249878  0.241455  …    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
4  4.0  0.000000  0.055817  0.261230  0.359863  0.433105  0.453613  0.499023  0.542969  0.616699  …    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.0

5 rows × 207 columns

test.head()

         id       s_0       s_1       s_2       s_3       s_4       s_5       s_6       s_7       s_8  …     s_195     s_196     s_197     s_198     s_199     s_200     s_201     s_202     s_203    s_204
0  100000.0  0.991699  1.000000  0.631836  0.136230  0.041412  0.102722  0.120850  0.123413  0.107910  …  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
1  100001.0  0.607422  0.541504  0.340576  0.000000  0.090698  0.164917  0.195068  0.168823  0.198853  …  0.389893  0.386963  0.367188  0.364014  0.360596  0.357178  0.350586  0.350586  0.350586  0.36377
2  100002.0  0.975098  0.670898  0.686523  0.708496  0.718750  0.716797  0.720703  0.701660  0.596680  …  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
3  100003.0  0.995605  0.916992  0.520996  0.000000  0.221802  0.404053  0.490479  0.527344  0.518066  …  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
4  100004.0  1.000000  0.888184  0.745605  0.531738  0.380371  0.224609  0.091125  0.057648  0.003914  …  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000

5 rows × 206 columns

This step splits each comma-separated heartbeat_signals string into individual numeric columns (s_0 through s_204) and keeps the id and label columns, completing the data preparation; the reduce_mem_usage() function defined above then compresses each DataFrame to about a quarter of its original size. An alternative formulation is sketched below.
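
An equivalent, more pandas-idiomatic variant of the expansion step (a sketch against the raw train DataFrame as first loaded, i.e. before the loop above; train_alt is a hypothetical name):

signals = train['heartbeat_signals'].str.split(',', expand=True).astype('float64')
signals.columns = ['s_' + str(i) for i in range(signals.shape[1])]
train_alt = pd.concat([train[['id']], signals, train[['label']]], axis=1)
train_alt = reduce_mem_usage(train_alt)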

Preparing the Training/Test Data

x_train = train.drop(['id', 'label'], axis=1)
y_train = train['label']
x_test = test.drop(['id'], axis=1)

Model Training

def abs_sum(y_pre, y_tru):
    # Competition metric: total absolute error between the predicted
    # class-probability matrix and the one-hot ground-truth matrix.
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = np.sum(np.abs(y_pre - y_tru))
    return loss
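
A quick sanity check of the metric on a toy 2-sample, 4-class case (values invented for illustration):

y_true = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])      # one-hot ground truth
y_prob = np.array([[0.9, 0.05, 0.03, 0.02],
                   [0.1, 0.1, 0.7, 0.1]])            # predicted probabilities
print(abs_sum(y_prob, y_true))                       # 0.2 + 0.6 ≈ 0.8
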
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    # Accumulator for the test-set probability predictions across all folds.
    test = np.zeros((test_x.shape[0], 4))

    cv_scores = []
    # sparse=False returns a dense array (the parameter is named
    # sparse_output in newer scikit-learn releases).
    onehot_encoder = OneHotEncoder(sparse=False)
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                # 'nthread' and 'n_jobs' are both aliases of num_threads; setting
                # both triggers the LightGBM warning seen in the log below.
                'nthread': 28,
                'n_jobs': 24,
                'verbose': -1,
            }

            model = clf.train(params,
                              train_set=train_matrix,
                              valid_sets=[valid_matrix],
                              num_boost_round=2000,
                              verbose_eval=100,
                              early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        # One-hot encode the validation labels so they can be compared with the
        # predicted probability matrix (assumes every class appears in each fold).
        val_y = np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        print(test_pred)
        test += test_pred
        score = abs_sum(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # Average the accumulated test predictions over the folds.
    test = test / kf.n_splits

    return test

def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test

lgb_test = lgb_model(x_train, y_train, x_test)

************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[100]  valid_0's multi_logloss: 0.0525735
[200]  valid_0's multi_logloss: 0.0422444
[300]  valid_0's multi_logloss: 0.0407076
[400]  valid_0's multi_logloss: 0.0420398
Early stopping, best iteration is:
[289]  valid_0's multi_logloss: 0.0405457
Predicted probability matrix:
[[9.99969791e-01 2.85197261e-05 1.00341946e-06 6.85357631e-07]
 [7.93287264e-05 7.69060914e-04 9.99151590e-01 2.00810971e-08]
 [5.75356884e-07 5.04051497e-08 3.15322414e-07 9.99999059e-01]
 …
 [6.79267940e-02 4.30206297e-04 9.31640185e-01 2.81516302e-06]
 [9.99960477e-01 3.94098074e-05 8.34030725e-08 2.94638661e-08]
 [9.88705846e-01 2.14081630e-03 6.67418381e-03 2.47915423e-03]]
[607.0736049372186]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]  valid_0's multi_logloss: 0.0566626
[200]  valid_0's multi_logloss: 0.0450852

lgb_test
temp = pd.DataFrame(lgb_test)
temp

result = pd.read_csv('sample_submit.csv')
result['label_0'] = temp[0]
result['label_1'] = temp[1]
result['label_2'] = temp[2]
result['label_3'] = temp[3]
result.to_csv('submit.csv', index=False)
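
As a quick sanity check (a sketch, not required for submission), each row of the fold-averaged probability matrix should sum to roughly 1:

print(temp.sum(axis=1).head())   # each row ≈ 1.0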

KFold splits the training data into five train/validation folds; using the lightgbm library with the hyperparameters above, a model is trained per fold and the test predictions are averaged to give the final class probabilities (the anonymized labels are represented as 0-3).
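
Since StratifiedKFold is already imported above, a natural variant (a sketch only; the fold loop body would be identical to cv_model) is to preserve the per-class label proportions in every fold, which is often preferred when class frequencies are uneven:

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
for fold, (trn_idx, val_idx) in enumerate(skf.split(x_train, y_train)):
    pass   # same per-fold training and test-prediction averaging as in cv_model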
