Understanding the Problem
- Learn feature-preprocessing methods for time series data
- Learn how to use Tsfresh (TimeSeries Fresh), a time series feature-processing tool
Competition page: https://tianchi.aliyun.com/competition/entrance/531883/introduction
Contents
- Data preprocessing
- Reshaping the time series data
- Adding a time-step feature time
- Feature engineering
- Constructing time series features
- Feature selection
- Time series feature processing with tsfresh
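The reshape listed above (wide rows of comma-separated signals into the long (id, time, value) format that tsfresh expects) can be sketched on a toy frame; the values below are made up, only the column names mirror the competition data:

```python
import pandas as pd

# Toy version of the competition data: one comma-separated signal string per row
data = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.7,0.4", "0.1,0.3,0.8"],
    "label": [0.0, 2.0],
})

# split + stack turns each signal string into one row per time step;
# the second index level becomes the time-step feature `time`
long_df = data["heartbeat_signals"].str.split(",", expand=True).stack()
long_df = long_df.reset_index()
long_df = long_df.set_index("level_0")
long_df.index.name = None
long_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)

# join back onto id to get the (id, time, value) long format tsfresh expects
long_df = data.drop(["heartbeat_signals", "label"], axis=1).join(long_df)
print(long_df)
# 6 rows: ids 0 and 1, time steps 0..2 each
```

The join works because stack() keeps the original row label as the first index level, so each sample id matches all of its own time steps.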
Code
To see results faster, the training set was cut from 100,000 samples down to 1,000, which runs in 47 s. One small hiccup along the way: scipy had to be upgraded to the latest version, otherwise all sorts of errors appeared. It was originally scipy-1.4.1; after pip install -U scipy it became 1.5.4 and everything worked.
# Package imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
import multiprocessing

if __name__ == '__main__':
    multiprocessing.freeze_support()

    # Read the data
    data_train = pd.read_csv("./datasets/train.csv")
    data_test_A = pd.read_csv("./datasets/testA.csv")
    print(data_train.shape)
    data_train = data_train.loc[99000:, :]
    print(data_train.shape)
    print(data_test_A.shape)
    print(data_train.head())
    print(data_test_A.head())

    # Unpivot the heartbeat signals from wide to long format,
    # adding a time-step feature `time` for each signal value
    train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
    train_heartbeat_df = train_heartbeat_df.reset_index()
    train_heartbeat_df = train_heartbeat_df.set_index("level_0")
    train_heartbeat_df.index.name = None
    train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
    print(train_heartbeat_df)

    # Join the reshaped signals back onto the training data; store the label column separately
    data_train_label = data_train["label"]
    data_train = data_train.drop("label", axis=1)
    data_train = data_train.drop("heartbeat_signals", axis=1)
    data_train = data_train.join(train_heartbeat_df)
    print(data_train)
    print(data_train[data_train["id"] == 1])

    from tsfresh import extract_features
    # Feature extraction
    train_features = extract_features(data_train, column_id='id', column_sort='time')
    print(train_features)

    from tsfresh.utilities.dataframe_functions import impute
    # Remove NaN values from the extracted features
    impute(train_features)

    from tsfresh import select_features
    # Select features by their relevance to the label
    train_features_filtered = select_features(train_features, data_train_label)
    print(train_features_filtered)
Output:
(100000, 3)
(1000, 3)
(20000, 2)
id heartbeat_signals label
99000 99000 0.9756354658280755,0.9219974760554872,0.735584... 0.0
99001 99001 0.013815892693101902,0.0,0.0671713603494603,0.... 2.0
99002 99002 0.0,0.23649261108230346,0.38197547195844656,0.... 0.0
99003 99003 0.9806634976267818,0.7258250285618127,0.509013... 0.0
99004 99004 0.9166493737668732,0.8767511968692409,0.820486... 2.0
id heartbeat_signals
0 100000 0.9915713654170097,1.0,0.6318163407681274,0.13...
1 100001 0.6075533139615096,0.5417083883163654,0.340694...
2 100002 0.9752726292239277,0.6710965234906665,0.686758...
3 100003 0.9956348033996116,0.9170249621481004,0.521096...
4 100004 1.0,0.8879490481178918,0.745564725322326,0.531...
time heartbeat_signals
99000 0 0.975635
99000 1 0.921997
99000 2 0.735584
99000 3 0.462566
99000 4 0.252643
... ... ...
99999 200 0.000000
99999 201 0.000000
99999 202 0.000000
99999 203 0.000000
99999 204 0.000000
[205000 rows x 2 columns]
id time heartbeat_signals
99000 99000 0 0.975635
99000 99000 1 0.921997
99000 99000 2 0.735584
99000 99000 3 0.462566
99000 99000 4 0.252643
... ... ... ...
99999 99999 200 0.000000
99999 99999 201 0.000000
99999 99999 202 0.000000
99999 99999 203 0.000000
99999 99999 204 0.000000
[205000 rows x 3 columns]
Empty DataFrame
Columns: [id, time, heartbeat_signals]
Index: []
Feature Extraction: 100%|██████████| 40/40 [00:47<00:00, 1.20s/it]
C:\Anaconda3-5.2.0-64\lib\site-packages\tsfresh\utilities\dataframe_functions.py:164: FutureWarning: The 'get_values' method is deprecated and will be removed in a future version. Use '.values' or 'np.asarray(..)' instead.
data = df.get_values()
variable heartbeat_signals__abs_energy ... heartbeat_signals__variance_larger_than_standard_deviation
id ...
99000 10.750091 ... 0.0
99001 51.773551 ... 0.0
99002 74.970303 ... 0.0
99003 8.504231 ... 0.0
99004 16.079925 ... 0.0
... ... ... ...
99995 28.742238 ... 0.0
99996 31.866323 ... 0.0
99997 16.412857 ... 0.0
99998 14.281281 ... 0.0
99999 21.637471 ... 0.0
[1000 rows x 794 columns]
WARNING:tsfresh.feature_selection.relevance:Infered regression as machine learning task
variable heartbeat_signals__percentage_of_reoccurring_values_to_all_values ... heartbeat_signals__fft_coefficient__coeff_31__attr_"angle"
id ...
99000 0.848780 ... -54.920196
99001 0.443902 ... 131.401170
99002 0.741463 ... -18.500814
99003 0.780488 ... -68.498682
99004 0.419512 ... -100.719774
... ... ... ...
99995 0.692683 ... -40.819373
99996 0.697561 ... -7.698228
99997 0.404878 ... -40.873693
99998 0.292683 ... -18.446829
99999 0.780488 ... -40.764763
[1000 rows x 485 columns]
The feature-engineering code merged with the training and test code appears later in this post.
Side note: the tsfresh version at this point was 0.11.2
>>> import tsfresh
>>> tsfresh.__version__
'0.11.2'
The run repeatedly failed with out-of-memory errors and exited before the program could finish:
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
  File "C:\Anaconda3-5.2.0-64\lib\site-packages\tqdm\std.py", line 1166, in __iter__
    for obj in iterable:
  File "C:\Anaconda3-5.2.0-64\lib\multiprocessing\pool.py", line 735, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000022D1B1BD9B0>'. Reason: 'PicklingError("Can't pickle <class 'MemoryError'>: it's not the same object as builtins.MemoryError",)'
So either memory was insufficient or something was off in the multiprocessing handling.
Upgrading tsfresh fixed everything:
pip install -U tsfresh
The upgraded version is tsfresh-0.18.0.
On a Windows 10 machine with 32 GB of RAM it still hit memory errors; in the end the job ran on Ubuntu 18.04, finishing in a bit over 20 minutes.
The latest lightgbm still raised the error below, and pinning it to 3.0.0 did not help either. The likely cause is that the feature names tsfresh generates contain double quotes and other special characters; replacing them as described at https://www.pythonheidong.com/blog/article/628512/233c40db39f5a2494a81/ lets the run go through.
That page notes this message usually appears with the LGBMClassifier() model (i.e. LightGBM): after loading your data with pandas, just add the following lines at the start and the feature-name problem goes away:
import re
df = df.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
    model = clf.train(params,
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/engine.py", line 228, in train
    booster = Booster(params=params, train_set=train_set)
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 2229, in __init__
    train_set.construct()
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1468, in construct
    self._lazy_init(self.data, label=self.label,
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1298, in _lazy_init
    return self.set_feature_name(feature_name)
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 1780, in set_feature_name
    _safe_call(_LIB.LGBM_DatasetSetFeatureNames(
  File "/root/miniconda3/lib/python3.9/site-packages/lightgbm/basic.py", line 110, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Do not support special JSON characters in feature name.
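To see why that rename fixes the error, apply the same regex to a couple of tsfresh-style column names (the two names below are representative examples written in the tsfresh naming scheme, not output from a real run):

```python
import re
import pandas as pd

# Representative tsfresh feature names containing quotes and dots,
# which LightGBM rejects as "special JSON characters"
df = pd.DataFrame(columns=[
    'heartbeat_signals__fft_coefficient__coeff_31__attr_"angle"',
    'heartbeat_signals__quantile__q_0.1',
])

# Drop every character that is not a letter, digit, or underscore
df = df.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
print(list(df.columns))
# ['heartbeat_signals__fft_coefficient__coeff_31__attr_angle',
#  'heartbeat_signals__quantile__q_01']
```

Note that the regex can in principle collapse two distinct names into one (e.g. q_0.1 and q_01), so it is worth checking that the sanitized columns are still unique.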
In the end I upgraded lightgbm to the latest version (Successfully installed lightgbm-3.2.0). The final code is as follows:
## 1. Import third-party packages
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
import multiprocessing
import re
def gen_tsfresh_features():
    # Read the data
    data_train = pd.read_csv("./datasets/train.csv")
    # print(data_train.shape)
    # data_train = data_train.loc[99000:, :]
    # print(data_train.shape)
    # print(data_train.head())
    # print(data_test_A.shape)
    # data_test_A = data_test_A.loc[19000:, :]
    # print(data_test_A.shape)
    # print(data_test_A.head())

    # Process the training data
    # Unpivot the heartbeat signals to long format and add a time-step feature `time`
    train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
    train_heartbeat_df = train_heartbeat_df.reset_index()
    train_heartbeat_df = train_heartbeat_df.set_index("level_0")
    train_heartbeat_df.index.name = None
    train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
    # print(train_heartbeat_df)

    # Join the reshaped signals back onto the training data; store the label column separately
    data_train_label = data_train["label"]
    data_train = data_train.drop("label", axis=1)
    data_train = data_train.drop("heartbeat_signals", axis=1)
    data_train = data_train.join(train_heartbeat_df)
    # print(data_train)
    # print(data_train[data_train["id"] == 1])

    from tsfresh import extract_features
    print(data_train.info())
    print(data_train.tail())

    # Reduce memory usage
    data_train = reduce_mem_usage(data_train)
    data_train.heartbeat_signals = data_train.heartbeat_signals.astype(np.float32)  # some functions in extract_features do not support float16
    print('data_train done Memory usage of dataframe is {:.2f} MB'.format(data_train.memory_usage().sum() / 1024 ** 2))
    print(data_train.info())
    print(data_train.tail())

    # Feature extraction
    from tsfresh.feature_extraction import ComprehensiveFCParameters
    settings = ComprehensiveFCParameters()
    # from tsfresh.feature_extraction import MinimalFCParameters
    # settings = MinimalFCParameters()
    from tsfresh.feature_extraction import extract_features
    train_features = extract_features(data_train, default_fc_parameters=settings, column_id='id', column_sort='time')
    # Feature extraction with default settings:
    # train_features = extract_features(data_train, column_id='id', column_sort='time')
    # print(train_features)

    from tsfresh.utilities.dataframe_functions import impute
    # Remove NaN values from the extracted features
    impute(train_features)
    # print(f"train_features.columns:{train_features.columns} {len(train_features.columns)}")

    from tsfresh import select_features
    # Select features by their relevance to the label
    train_features_filtered = select_features(train_features, data_train_label)
    # print(train_features_filtered)
    # print(f"train_features_filtered.columns:{train_features_filtered.columns} {len(train_features_filtered.columns)}")

    # Process the test data
    data_test_A = pd.read_csv("./datasets/testA.csv")
    # Unpivot the heartbeat signals to long format and add a time-step feature `time`
    test_heartbeat_df = data_test_A["heartbeat_signals"].str.split(",", expand=True).stack()
    test_heartbeat_df = test_heartbeat_df.reset_index()
    test_heartbeat_df = test_heartbeat_df.set_index("level_0")
    test_heartbeat_df.index.name = None
    test_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
    test_heartbeat_df["heartbeat_signals"] = test_heartbeat_df["heartbeat_signals"].astype(float)
    # print(test_heartbeat_df)

    # Join the reshaped signals back onto the test data
    data_test_A = data_test_A.drop("heartbeat_signals", axis=1)
    data_test_A = data_test_A.join(test_heartbeat_df)
    # print(data_test_A)
    # print(data_test_A[data_test_A["id"] == 1])

    from tsfresh import extract_features
    # Reduce memory usage
    data_test_A = reduce_mem_usage(data_test_A)
    data_test_A.heartbeat_signals = data_test_A.heartbeat_signals.astype(np.float32)  # some functions in extract_features do not support float16
    print('data_test_A done Memory usage of dataframe is {:.2f} MB'.format(data_test_A.memory_usage().sum() / 1024 ** 2))
    print(data_test_A.info())
    print(data_test_A.tail())

    # Feature extraction
    from tsfresh.feature_extraction import ComprehensiveFCParameters
    settings = ComprehensiveFCParameters()
    # from tsfresh.feature_extraction import MinimalFCParameters
    # settings = MinimalFCParameters()
    from tsfresh.feature_extraction import extract_features
    test_features = extract_features(data_test_A, default_fc_parameters=settings, column_id='id', column_sort='time')
    # Feature extraction with default settings:
    # test_features = extract_features(data_test_A, column_id='id', column_sort='time')
    # print(test_features)

    from tsfresh.utilities.dataframe_functions import impute
    # Remove NaN values from the extracted features
    impute(test_features)

    # Align the test feature columns with the columns selected on the training data
    # print(f"test_features.columns:{test_features.columns} {len(test_features.columns)}")
    test_features_filtered = test_features[train_features_filtered.columns]
    # print(f"test_features_filtered.columns:{test_features_filtered.columns} {len(test_features_filtered.columns)}")

    return train_features_filtered, data_train_label, test_features_filtered
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
def lightgbm_train_test(train, label, test):
    # Simple preprocessing
    train = reduce_mem_usage(train)
    test = reduce_mem_usage(test)

    ## 4. Prepare training/test data
    x_train = train
    x_train.reset_index(drop=True, inplace=True)
    y_train = label
    y_train.reset_index(drop=True, inplace=True)
    x_test = test
    x_test.reset_index(drop=True, inplace=True)
    # print("x_train.columns:", x_train.columns)
    # print("x_test.columns:", x_test.columns)
    # Strip special characters from feature names; LightGBM rejects JSON special characters
    x_train = x_train.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
    x_test = x_test.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
    # print("x_train.columns:", x_train.columns)
    # print("x_test.columns:", x_test.columns)
    print(x_train.shape, x_test.shape, y_train.shape)

    ## 5. Model training
    def abs_sum(y_pre, y_tru):
        y_pre = np.array(y_pre)
        y_tru = np.array(y_tru)
        loss = sum(sum(abs(y_pre - y_tru)))
        return loss

    def cv_model(clf, train_x, train_y, test_x, clf_name):
        folds = 5
        seed = 2021
        kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
        test = np.zeros((test_x.shape[0], 4))
        cv_scores = []
        onehot_encoder = OneHotEncoder(sparse=False)
        for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
            print('************************************ {} ************************************'.format(str(i + 1)))
            trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
            if clf_name == "lgb":
                train_matrix = clf.Dataset(trn_x, label=trn_y)
                valid_matrix = clf.Dataset(val_x, label=val_y)
                params = {
                    'boosting_type': 'gbdt',
                    'objective': 'multiclass',
                    'num_class': 4,
                    'num_leaves': 2 ** 5,
                    'feature_fraction': 0.8,
                    'bagging_fraction': 0.8,
                    'bagging_freq': 4,
                    'learning_rate': 0.1,
                    'seed': seed,
                    'n_jobs': 40,
                    'verbose': -1,
                }
                model = clf.train(params,
                                  train_set=train_matrix,
                                  valid_sets=valid_matrix,
                                  num_boost_round=2000,
                                  verbose_eval=100,
                                  early_stopping_rounds=200)
                val_pred = model.predict(val_x, num_iteration=model.best_iteration)
                test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            print("val_y:", val_y.shape)
            val_y = np.array(val_y).reshape(-1, 1)
            val_y = onehot_encoder.fit_transform(val_y)
            print("val_y:", val_y.shape)
            print('Predicted probability matrix:')
            print(test_pred)
            test += test_pred
            score = abs_sum(val_y, val_pred)
            cv_scores.append(score)
            print(cv_scores)
        print("%s_scotrainre_list:" % clf_name, cv_scores)
        print("%s_score_mean:" % clf_name, np.mean(cv_scores))
        print("%s_score_std:" % clf_name, np.std(cv_scores))
        test = test / kf.n_splits
        return test

    def lgb_model(x_train, y_train, x_test):
        lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
        return lgb_test

    lgb_test = lgb_model(x_train, y_train, x_test)

    ## 6. Prediction results
    temp = pd.DataFrame(lgb_test)
    result = pd.read_csv('./datasets/sample_submit.csv')
    result['label_0'] = temp[0]
    result['label_1'] = temp[1]
    result['label_2'] = temp[2]
    result['label_3'] = temp[3]
    result.to_csv('lightgbm_tsfresh_submit.csv', index=False)
if __name__ == '__main__':
    multiprocessing.freeze_support()

    ## 2. Read the data
    train, label, test = gen_tsfresh_features()

    ## 3. Preprocess the data and train
    lightgbm_train_test(train, label, test)
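As an aside, the memory savings in reduce_mem_usage come from plain dtype downcasting. A minimal stand-alone illustration on toy data (not the competition set):

```python
import numpy as np
import pandas as pd

# Toy frame with pandas' default int64/float64 dtypes
df = pd.DataFrame({
    "id": np.arange(1000, dtype=np.int64),
    "signal": np.random.rand(1000).astype(np.float64),
})
before = df.memory_usage().sum()

# id fits comfortably in int16 (max value 999); the signal keeps float32
# precision, since some tsfresh calculators do not accept float16
df["id"] = df["id"].astype(np.int16)
df["signal"] = df["signal"].astype(np.float32)
after = df.memory_usage().sum()

print(before, after)  # roughly 16 kB down to 6 kB for the two columns
```

The same idea scaled to the 20,500,000-row long-format frame is what takes it from 625.61 MB to 351.91 MB in the log below.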
The results are as follows:
With 100,000 training samples, the full tsfresh feature extraction took 13 min 12 s; for the 20,000 test samples it took 2 min 40 s.
Log output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 id int64
1 time int64
2 heartbeat_signals float64
dtypes: float64(1), int64(2)
memory usage: 625.6 MB
None
id time heartbeat_signals
99999 99999 200 0.0
99999 99999 201 0.0
99999 99999 202 0.0
99999 99999 203 0.0
99999 99999 204 0.0
Memory usage of dataframe is 625.61 MB
Memory usage after optimization is: 312.81 MB
Decreased by 50.0%
data_train done Memory usage of dataframe is 351.91 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500000 entries, 0 to 99999
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 id int32
1 time int16
2 heartbeat_signals float32
dtypes: float32(1), int16(1), int32(1)
memory usage: 351.9 MB
None
id time heartbeat_signals
99999 99999 200 0.0
99999 99999 201 0.0
99999 99999 202 0.0
99999 99999 203 0.0
99999 99999 204 0.0
Memory usage of dataframe is 125.12 MB
Memory usage after optimization is: 62.56 MB
Decreased by 50.0%
data_test_A done Memory usage of dataframe is 70.38 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4100000 entries, 0 to 19999
Data columns (total 3 columns):
# Column Dtype
--- ------ -----
0 id int32
1 time int16
2 heartbeat_signals float32
dtypes: float32(1), int16(1), int32(1)
memory usage: 70.4 MB
None
id time heartbeat_signals
19999 119999 200 0.0
19999 119999 201 0.0
19999 119999 202 0.0
19999 119999 203 0.0
19999 119999 204 0.0
Memory usage of dataframe is 540.16 MB
Memory usage after optimization is: 135.61 MB
Decreased by 74.9%
Memory usage of dataframe is 108.03 MB
Memory usage after optimization is: 27.12 MB
Decreased by 74.9%
(100000, 707) (20000, 707) (100000,)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[100] valid_0's multi_logloss: 0.0390895
[200] valid_0's multi_logloss: 0.0403544
[300] valid_0's multi_logloss: 0.0460664
Early stopping, best iteration is:
[120] valid_0's multi_logloss: 0.0384817
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99872421e-01 1.14566158e-04 1.04119738e-05 2.60074910e-06]
[2.99781740e-05 3.57574813e-05 9.99933330e-01 9.34458391e-07]
[1.33123890e-06 5.29917316e-06 6.80470320e-06 9.99986565e-01]
...
[9.68671325e-02 1.04000140e-04 9.03002578e-01 2.62898604e-05]
[9.99461999e-01 5.06142516e-04 2.46321321e-05 7.22680627e-06]
[9.87411802e-01 2.59772061e-03 3.32525719e-03 6.66522064e-03]]
[616.9668653555406]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[100] valid_0's multi_logloss: 0.0411013
[200] valid_0's multi_logloss: 0.0435782
[300] valid_0's multi_logloss: 0.0498305
Early stopping, best iteration is:
[128] valid_0's multi_logloss: 0.0406417
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99333579e-01 6.43609664e-04 1.83672346e-05 4.44378583e-06]
[2.14190087e-05 4.23261287e-05 9.99935527e-01 7.28015920e-07]
[7.10568621e-07 1.02579607e-06 8.49856913e-06 9.99989765e-01]
...
[6.40215168e-02 1.32011105e-04 9.35838236e-01 8.23633287e-06]
[9.99812305e-01 1.79436904e-04 7.10027554e-06 1.15781173e-06]
[9.63323467e-01 1.84819306e-03 2.49731242e-02 9.85521529e-03]]
[616.9668653555406, 584.5661600687715]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[100] valid_0's multi_logloss: 0.0345854
[200] valid_0's multi_logloss: 0.0359173
[300] valid_0's multi_logloss: 0.0414517
Early stopping, best iteration is:
[129] valid_0's multi_logloss: 0.0334529
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99670151e-01 3.10434420e-04 1.38141763e-05 5.60089113e-06]
[3.29422885e-05 5.07408138e-05 9.99914696e-01 1.62129796e-06]
[6.90331518e-07 5.30156240e-06 6.01034043e-06 9.99987998e-01]
...
[4.69086868e-02 1.15471922e-04 9.52964291e-01 1.15503859e-05]
[9.99859882e-01 1.31978953e-04 6.24375386e-06 1.89479306e-06]
[9.66072018e-01 5.37069335e-03 6.71278736e-03 2.18445009e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[100] valid_0's multi_logloss: 0.0421553
[200] valid_0's multi_logloss: 0.0456984
[300] valid_0's multi_logloss: 0.0531679
Early stopping, best iteration is:
[101] valid_0's multi_logloss: 0.0420942
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99438677e-01 5.05283413e-04 3.98975698e-05 1.61422673e-05]
[3.86655428e-05 1.33491932e-04 9.99823524e-01 4.31866830e-06]
[4.95004900e-06 1.28166538e-05 1.10356746e-05 9.99971198e-01]
...
[1.94444222e-01 1.86454698e-04 8.05304622e-01 6.47005102e-05]
[9.99590997e-01 3.25147150e-04 7.27911324e-05 1.10646745e-05]
[9.51256879e-01 3.22789672e-03 1.31294692e-02 3.23857556e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[100] valid_0's multi_logloss: 0.0369026
[200] valid_0's multi_logloss: 0.0378718
[300] valid_0's multi_logloss: 0.0440953
Early stopping, best iteration is:
[135] valid_0's multi_logloss: 0.0357339
val_y: (20000,)
val_y: (20000, 4)
Predicted probability matrix:
[[9.99783458e-01 1.99305801e-04 1.34426163e-05 3.79360372e-06]
[4.08836356e-05 9.34827615e-05 9.99863910e-01 1.72378711e-06]
[1.44260947e-06 8.34397690e-06 5.16060345e-06 9.99985053e-01]
...
[1.64699038e-01 2.15540464e-04 8.35032806e-01 5.26147275e-05]
[9.99830946e-01 1.60164144e-04 7.30342961e-06 1.58611278e-06]
[9.73556586e-01 4.86287743e-03 1.00772090e-02 1.15033276e-02]]
[616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347, 528.0451056618585]
lgb_scotrainre_list: [616.9668653555406, 584.5661600687715, 524.2566165286664, 668.5802393240347, 528.0451056618585]
lgb_score_mean: 584.4829973877744
lgb_score_std: 54.662614773384554
Feature engineering gained 5 points over the previous baseline. A large number of features were generated, and the selection step may have dropped some information, so the improvement is modest.
Training on all features and then setting each sample's highest-probability class to 1 and the rest to 0 scored 386 and moved the entry up to rank 33. The source code is here: 零基础入门数据挖掘-心跳信号分类预测-386分-33名 代码.rar
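The post-processing described above (hard-assigning each sample to its most probable class) can be sketched as follows; probs stands in for the averaged probability matrix that the CV loop produces:

```python
import numpy as np

# Hypothetical averaged probability matrix, one row per test sample
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.20, 0.60, 0.10],
])

# Set the highest-probability class of each row to 1, all others to 0
hard = np.zeros_like(probs)
hard[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
print(hard)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]]
```

Under the competition's abs-sum metric this helps whenever the most probable class is usually correct: a confident correct prediction then contributes zero error instead of a small residual spread over four columns.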
References
https://tianchi.aliyun.com/competition/entrance/531883/introduction
https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165–1188
https://blog.csdn.net/duxiaodong1122?spm=1011.2124.3001.5343&type=blog