【Kaggle】Regression with a Mohs Hardness Dataset: a walkthrough of the Playground Series S3 E25 "EDA and simple model" notebook


Competition page: Regression with a Mohs Hardness Dataset | Kaggle
Original notebook: 🌟PS-S3-E25🌟 | 📊EDA | Model [EN/ES] | Kaggle

1 Intro

This notebook aims to provide a comprehensive exploratory data analysis (EDA) and a simple collection of models (not optimized), giving a rough idea of how to pick the best model for a given dataset, with the ultimate goal of making an informed decision.
Through EDA we gain a deeper understanding of the data's structure, the relationships between values, the missing values, and the outliers that could affect the model we want to build and select for prediction or recommendation.
By performing EDA we can spot potential pitfalls early, then make the decisions and apply the follow-up processing needed to improve the model's performance and accuracy.

2 Data information

The dataset for this competition was generated by a deep learning model trained on the Prediction of Mohs Hardness with Machine Learning dataset. The feature distributions are close to, but not exactly the same as, the original. You are free to use the original dataset as part of this competition, both to explore the differences and to see whether incorporating it into training improves model performance.

3 Library import

import os 
import sys
import math
import time
import random
import warnings
import numpy as np 
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import tensorflow as tf
import missingno as msno
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import matplotlib.colors as mcolors
import tensorflow_probability as tfp
import tensorflow_decision_forests as tfdf

from sklearn.base import clone
from lightgbm import LGBMRegressor
from sklearn.decomposition import PCA
from catboost import CatBoostRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from scipy.spatial.distance import squareform
from sklego.linear_model import LADRegression
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor, XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.ensemble import HistGradientBoostingRegressor, VotingRegressor
from sklearn.metrics import median_absolute_error, roc_auc_score, roc_curve
# Put theme of notebook 
from colorama import Fore, Style

# Colors
red = Fore.RED + Style.BRIGHT
mgta = Fore.MAGENTA + Style.BRIGHT
yllw = Fore.YELLOW + Style.BRIGHT
cyn = Fore.CYAN + Style.BRIGHT
blue = Fore.BLUE + Style.BRIGHT

# Reset
res = Style.RESET_ALL
plt.style.use({"figure.facecolor": "#282a36"})

This sets up the notebook's theme: it uses the colorama library to define a few text colors and matplotlib's style mechanism to set the background color. A usage sketch follows this list.

  1. Imports Fore and Style from colorama, which control text color and style.
  2. Defines five colors: red, magenta (mgta), yellow (yllw), cyan (cyn), and blue, all set to bright (Style.BRIGHT).
  3. Defines a reset style (res) that restores color and style to the defaults.
  4. Uses plt.style.use() to set the figure background color to "#282a36", a dark gray.
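For instance, these variables simply wrap text in ANSI escape codes, so they can be dropped into any print call (this is how the cvs function later in the notebook colors its score report):

print(f"{red}Validation Score:{res} {cyn}0.42{res}")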
# Colors
YELLOW = "#F7C53E"

CYAN_G = "#0CF7AF"
CYAB_DARK = "#11AB7C"

PURPLE = "#D826F8"
PURPLE_DARJ = "#9309AB"
PURPLE_L = "#b683d6"

BLUE = "#0C97FA"
RED = "#FA1D19"
ORANGE = "#FA9F19"
GREEN = "#0CFA58"
LIGTH_BLUE = "#01FADC"
S_BLUE = "#81c9e6"
DARK_BLUE = "#394be6"
# Palettes
PALETTE_2 = [CYAN_G, PURPLE]
PALETTE_3 = [YELLOW, CYAN_G, PURPLE]
PALETTE_4 = [YELLOW, ORANGE, PURPLE, LIGTH_BLUE]
PALETTE_5 = [PURPLE_DARJ, PURPLE_L, PURPLE, BLUE, LIGTH_BLUE]
PALETTE_6 = [BLUE, RED, ORANGE, GREEN, LIGTH_BLUE, PURPLE]

# Vaporwave palette by Francesc Oliveras
PALETTE_7 = [PURPLE_DARJ, PURPLE_L, PURPLE, BLUE, LIGTH_BLUE, DARK_BLUE, S_BLUE]
PALETTE_7_C = [PURPLE_DARJ, BLUE, PURPLE, LIGTH_BLUE, PURPLE_L, S_BLUE, DARK_BLUE]


sns.palplot(sns.color_palette(PALETTE_7))

# Set Style
sns.set_style("whitegrid")
sns.set_palette(PALETTE_7_C)
sns.despine(left=True, bottom=True)

cmap = mcolors.LinearSegmentedColormap.from_list("", PALETTE_2)
cmap_2 = mcolors.LinearSegmentedColormap.from_list("", [S_BLUE, PURPLE_DARJ])

font_family = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=10), width=1000, height=500))

warnings.filterwarnings('ignore')

[Figure: sns.palplot preview of PALETTE_7]

This configures the color theme and style for the Seaborn (sns) visualizations.

  1. Defines a series of color variables (yellow, cyan, purple, and so on) for later visualizations.
  2. Defines several palettes, each a list of color values, for setting the color scheme of Seaborn charts.
  3. Uses sns.palplot() to draw a preview of a palette, which shows its color distribution at a glance.
  4. Uses sns.set_style() to set the Seaborn style to "whitegrid", a white background with grid lines.
  5. Uses sns.set_palette() to make PALETTE_7_C (a mix of purples and blues) the default palette.
  6. Uses sns.despine() to remove the chart spines.
  7. Creates two linear segmented colormaps (LinearSegmentedColormap), cmap and cmap_2, from the given color lists; these map data values onto colors.
  8. Defines a dict font_family used as a Plotly layout template, setting the chart font family, font size, and figure dimensions.
  9. Uses warnings.filterwarnings('ignore') to suppress warning messages.

4 Constant

PS_PATH = "/kaggle/input/playground-series-s3e25"
ORIGINAL_PATH = "/kaggle/input/prediction-of-mohs-hardness-with-machine-learning/jm79zfps6b-1"

PS_TRAIN_FILENAME = "train.csv"
PS_TEST_FILENAME = "test.csv"
OR_ARTIFICIAL_CRYSTAL = "Artificial_Crystals_Dataset.csv"
OR_MINERAL_DATASET = "Mineral_Dataset_Supplementary_Info.csv"
SAMPLE_SUBMISSION_FILENAME = "sample_submission.csv"

TRAIN_PS_PATH = os.path.join(PS_PATH, PS_TRAIN_FILENAME)
TEST_PS_PATH = os.path.join(PS_PATH, PS_TEST_FILENAME)
CRYSTAL_ART_PATH = os.path.join(ORIGINAL_PATH, OR_ARTIFICIAL_CRYSTAL)
MINERAL_PATH = os.path.join(ORIGINAL_PATH, OR_MINERAL_DATASET)
SUBMISSION_PATH = os.path.join(PS_PATH, SAMPLE_SUBMISSION_FILENAME)

os.path.join combines multiple path components into a single path, automatically using the correct separator for the operating system.
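A quick illustration with the constants above:

import os

os.path.join("/kaggle/input/playground-series-s3e25", "train.csv")
# -> '/kaggle/input/playground-series-s3e25/train.csv' (POSIX separator on Kaggle's Linux hosts)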

TARGET = "Hardness"
ID = "id"
ALLELECTRONS_AVERAGE = 'allelectrons_Average'

EARLY_STOPPING_ROUNDS = 50
VERBOSE_EVAL = 100
N_ROUND = 1000
TRAIN_LBL = "train"
TEST_LBL = "test"

INCLUDE_ORIGINAL = True
SEED = 500
FOLDS = 6 # 5
N_SPLITS = 6 # 5
TEST_SIZE = 0.18
TIMEOUT = 10 * 3600
ON = "on"
LR = 0.00001
SUBSAMPLE = 0.95

# nn Constants
ACTIVATION = "relu"
ESLN = 0.00001
LYS = 16
TEST_SIZE = 0.1  # overrides the 0.18 defined above; this final value is used for the NN split
EPOCHS = 400

MONITOR = "val_loss"
MODE = "min"
callbacks_list = [
    tf.keras.callbacks.EarlyStopping(monitor=MONITOR, patience=50, verbose=2, mode=MODE ,restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor=MONITOR, factor=0.8, patience=3, min_lr=LR),
    tf.keras.callbacks.TerminateOnNaN()
] 

This creates a list of Keras callbacks; a sketch of passing the list to training follows.

  1. tf.keras.callbacks.EarlyStopping: stops training early when the monitored metric (monitor) has not improved for 50 consecutive epochs; restore_best_weights=True restores the weights from the epoch with the best monitored value.
  2. tf.keras.callbacks.ReduceLROnPlateau: adjusts the learning rate during training. When the monitored metric has not improved for 3 consecutive epochs, the learning rate is multiplied by 0.8; min_lr sets a floor on it.
  3. tf.keras.callbacks.TerminateOnNaN: terminates training if the loss becomes NaN, which usually indicates a numerical error such as division by zero.
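A minimal sketch of wiring the list into Keras training (the notebook does this for real in section 9; model, X_train and y_train here stand in for any compiled Keras model and its training data):

history = model.fit(X_train, y_train,
                    epochs=EPOCHS,
                    validation_split=0.1,  # produces the val_loss that MONITOR watches
                    callbacks=callbacks_list)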

5 Functions

def show_corr_heatmap(df, title):
    
    corr = df.corr()
    mask = np.zeros_like(corr)
    mask[np.triu_indices_from(mask)] = True

    plt.figure(figsize = (15, 10))
    plt.title(title)
    if df.shape[1] < 25:
        sns.heatmap(corr, annot=True, linewidths=.5, fmt=".2f", square=True, mask=mask, cmap=cmap_2)
    else:
        sns.heatmap(corr, annot=False, linewidths=.5, square=True, mask=mask, cmap=cmap_2)

    plt.show()

The function first computes the DataFrame's correlation matrix corr, then builds a mask of the same shape with the upper triangle set to True, so that only the lower triangle of correlations is drawn.
Based on the number of columns (df.shape[1]), it decides whether to annotate the heatmap: with fewer than 25 columns it calls sns.heatmap() with the correlation values written in each cell; otherwise it draws the heatmap without annotations.
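To see what the mask does, here is the 3x3 case: np.triu_indices_from selects the upper triangle including the diagonal, and sns.heatmap hides every cell where the mask is truthy, leaving only the lower triangle visible.

import numpy as np

mask = np.zeros((3, 3))
mask[np.triu_indices_from(mask)] = True
print(mask)
# [[1. 1. 1.]
#  [0. 1. 1.]
#  [0. 0. 1.]]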

def data_description(df):
    print("Data description")
    print(f"Total number of records {df.shape[0]}")
    print(f'number of features {df.shape[1]}\n\n')
    columns = df.columns
    data_type = []
    
    # Get the datatype of features
    for col in df.columns:
        data_type.append(df[col].dtype)
        
    n_uni = df.nunique()
    # Number of NaN values
    n_miss = df.isna().sum()
    
    names = list(zip(columns, data_type, n_uni, n_miss))
    variable_desc = pd.DataFrame(names, columns=["Name","Type","Unique levels","Missing"])
    print(variable_desc)

Prints a description of a DataFrame:

  1. The total number of records (rows) and of features (columns).
  2. The data type of each feature.
  3. The number of unique values per feature.
  4. The number of missing values per feature.

A roughly equivalent pandas-native version is sketched after this list.
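The same table can be assembled from pandas built-ins, since dtypes, nunique() and isna().sum() all share the column index; a sketch:

def data_description_alt(df):
    # dtypes, unique counts and missing counts align automatically on the column index
    return pd.DataFrame({
        "Type": df.dtypes,
        "Unique levels": df.nunique(),
        "Missing": df.isna().sum(),
    })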
def plot_cont(col, ax, color=PALETTE_7[0]):
    # comb_df, labels, histplot_hyperparams and plot_cont_dot are defined
    # elsewhere in the original notebook
    sns.histplot(data=comb_df, x=col,
                hue="set", ax=ax, hue_order=labels,
                common_norm=False, **histplot_hyperparams)

    ax_2 = ax.twinx()
    ax_2 = plot_cont_dot(
        comb_df.query(f"set == '{TRAIN_LBL}'"),
        col, TARGET, ax_2,
        color=color
    )

    ax_2 = plot_cont_dot(
        comb_df, col,
        TARGET, ax_2,
        color=color
    )

Plots the histogram of a continuous variable (split by train/test set) and overlays dot plots of the target against that variable on a twin y-axis.

def pie_plot(df: pd.DataFrame, hover_temp: str = "Status: ",
            feature=TARGET, palette=[LIGTH_BLUE,"#221e8f"], color=[BLUE ,PURPLE_DARJ],
            title_="Target distribution"):
    target = df[[feature]].value_counts(normalize=True).sort_index().round(decimals=3)*100
    fig = go.Figure()
    
    fig.add_trace(go.Pie(labels=target.index, values=target, hole=.4,
                        sort=False, showlegend=True, marker=dict(colors=color, line=dict(color=palette,width=2)),
                        hovertemplate = "%{label} " + hover_temp + ": %{value:.2f}%<extra></extra>"))
    
    fig.update_layout(template=font_family, title=title_, 
                  legend=dict(traceorder="reversed",y=1.05,x=0),
                  uniformtext_minsize=15, uniformtext_mode="hide",height=600)
    fig.show()

A pie-chart helper: it takes a pandas DataFrame and plots the distribution of the given feature (the target by default) as a Plotly donut chart.

def cat_distribution(cat_features, df, title = "Distribution of categorical\nfeatures in train dataset\n\n\n"):
    fig, ax = plt.subplots(4, 2, figsize = (16, 20), dpi = 300)
    #ax = ax.flatten()

    for i, column in enumerate(cat_features):

        ax[i][0].pie(
            df[column].value_counts(), 
            shadow = True, 
            explode = [.1 for i in range(df[column].nunique())], 
            autopct = '%1.f%%',
            textprops = {'size' : 14, 'color' : 'white'}
        )

        sns.countplot(data = df, y = column, ax = ax[i][1], palette = PALETTE_7_C, order = df[column].value_counts().index)
        ax[i][1].yaxis.label.set_size(20)
        plt.yticks(fontsize = 12)
        ax[i][1].set_xlabel('Count in Train', fontsize = 15)
        ax[i][1].set_ylabel(f'{column}', fontsize = 15)
        plt.xticks(fontsize = 12)

    fig.suptitle(title, fontsize = 25, fontweight = "bold")
    plt.tight_layout()

Plots the distribution of the categorical features in the training dataset, one row per feature with a pie chart on the left and a count plot on the right.

def dist_tree(data, label = ""):
    corr = data.corr(method = "spearman")
    d_lk = linkage(squareform(1 - abs(corr)), "complete")
    
    plt.figure(figsize = (8, 6), dpi = 250)
    dendro = dendrogram(d_lk, labels=data.columns, leaf_rotation=75)
    plt.title(f"Feature Distance in {label}", weight = "bold", size = 23)
    plt.show()

Computes the correlations between the dataset's features and draws a dendrogram of them; a small demo of the squareform step follows this list. The steps:

  1. Uses corr(method="spearman") to compute the Spearman correlation between the columns of data.
  2. Turns the correlations into distances via 1 - abs(corr), so that strongly correlated features end up close together.
  3. Uses squareform() to convert that square distance matrix into the condensed vector form expected by linkage().
  4. Uses linkage() with the "complete" method to perform complete-linkage hierarchical clustering.
  5. Uses dendrogram() to draw the tree, with labels=data.columns so the column names label the leaves, and leaf_rotation=75 to rotate the leaf labels.
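For intuition about step 3: linkage() expects a condensed distance vector (the strict upper triangle of a symmetric, zero-diagonal distance matrix), and squareform() converts the square matrix into that form. A tiny demo:

import numpy as np
from scipy.spatial.distance import squareform

# symmetric distance matrix for three features, zeros on the diagonal
D = np.array([[0.0, 0.2, 0.5],
              [0.2, 0.0, 0.3],
              [0.5, 0.3, 0.0]])
print(squareform(D))  # [0.2 0.5 0.3] -- the condensed form linkage() consumes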
def loss_fn(y_true, y_pred):
    return tfp.stats.percentile(tf.abs(y_true - y_pred), q=50)

Uses the percentile function from TensorFlow Probability (tfp) to compute the 50th percentile of the absolute error between the predictions (y_pred) and the true values (y_true), i.e. the median absolute error, which matches the competition metric.

def metric_fn(y_true, y_pred):
    return tfp.stats.percentile(tf.abs(y_true - y_pred), q=100) - tfp.stats.percentile(tf.abs(y_true - y_pred), q=0)

Computes the absolute errors, then returns the difference between their 100th and 0th percentiles, i.e. the spread (max minus min) of the absolute errors. NumPy equivalents of both functions are sketched below.
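For intuition, both functions reduce to one-liners on NumPy arrays; a sketch with toy values:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.8, 2.1])
err = np.abs(y_true - y_pred)     # [0.5, 0.2, 0.9]
print(np.median(err))             # 0.5 -> what loss_fn computes
print(err.max() - err.min())      # 0.7 -> what metric_fn computes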

# Seen in https://www.kaggle.com/code/tonyyunyang99/luck-is-all-you-need thx <3
def nn_model(): 
    input_l = tf.keras.Input(shape=(len(features), ))
    layer_1 = tf.keras.layers.BatchNormalization(epsilon=ESLN)(input_l)
    layer_2 = tf.keras.layers.Dense(LYS, activation=ACTIVATION)(layer_1)
    layer_3 = tf.keras.layers.Dense(LYS*2, activation=ACTIVATION)(layer_2)
    output_l = tf.keras.layers.Dense(1)(layer_3)   
    
    model = tf.keras.Model(inputs=input_l, outputs=output_l)
    
    model.compile(optimizer=tf.keras.optimizers.Adam(0.013, beta_1=0.5),
              loss=loss_fn,
              metrics=metric_fn)
    
    return model
  1. First, define the input layer input_l whose shape is the number of features (len(features)).
  2. Add a batch normalization layer layer_1 with epsilon set to ESLN.
  3. Add a dense layer layer_2 with LYS neurons and ACTIVATION as the activation function.
  4. Add another dense layer layer_3 with LYS*2 neurons and the same activation.
  5. Finally, add the output layer output_l with a single neuron.
  6. Build the model from the input and output layers.
  7. Compile it with the Adam optimizer (learning rate 0.013, beta_1=0.5), loss_fn as the loss, and metric_fn as the metric. A quick build-and-inspect sketch follows this list.
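As a sanity check the network can be built and inspected on its own; features here is a hypothetical two-column list just for the sketch (the notebook defines the real one in section 7):

features = ["allelectrons_Total", "density_Total"]  # hypothetical, for illustration only
model = nn_model()
model.summary()  # BatchNormalization -> Dense(16) -> Dense(32) -> Dense(1)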
train_ps_df = pd.read_csv(TRAIN_PS_PATH, index_col="id")
test_ps_df = pd.read_csv(TEST_PS_PATH, index_col="id")
crystal_df = pd.read_csv(CRYSTAL_ART_PATH)
mineral_df = pd.read_csv(MINERAL_PATH, index_col=0)
submission_df = pd.read_csv(SUBMISSION_PATH)
def cvs(estimator, cv = KFold(shuffle = True, random_state = SEED), m_name = "", concat_org = True, train_or_df = mineral_df):
    
    X = train_df.copy()
    y = X.pop(TARGET)
    
    prediction_vals = np.zeros((len(train_df)))
    scrs, validation_srcs = [], []
    
    for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        
        model = clone(estimator)
        
        X_train = X.iloc[train_idx]
        y_train = y.iloc[train_idx]
        X_val = X.iloc[val_idx]
        y_val = y.iloc[val_idx]
        
        if concat_org:
            X_train = pd.concat([X_train, train_or_df.drop(TARGET, axis = 1)]).reset_index(drop = True)
            y_train = pd.concat([y_train, train_or_df.Hardness]).reset_index(drop = True)
        
        model.fit(X_train, y_train)   
        train_preds = model.predict(X_train)
        
        pred_vals = model.predict(X_val)   
        prediction_vals[val_idx] += pred_vals
        
        train_src = median_absolute_error(y_train, train_preds)
        validation_src = median_absolute_error(y_val, pred_vals)
        
        scrs.append(train_src)
        validation_srcs.append(validation_src)
    
    print(f"\n\n{blue}{m_name.upper()}{res}\n" + f"{mgta}={res}" * 40 + f"\n{cyn}Validation Score:{res} {red}{np.mean(validation_srcs):.5f} ± {np.std(validation_srcs):.5f}{res} \n{cyn}Train Score:{res} {red}{np.mean(scrs):.5f} ± {np.std(scrs):.5f}{res}")
    
    return np.mean(validation_srcs), prediction_vals

6 EDA and data modification

data_description(train_ps_df)
data_description(test_ps_df)
data_description(crystal_df)
data_description(mineral_df)

[Output: data_description summaries of the four dataframes]

train_df = pd.concat(objs=[train_ps_df,mineral_df]).reset_index(drop=True)
train_df.shape

[Output: shape of the combined train dataframe]

show_corr_heatmap(train_df, "Combined dataframe correlation map")
show_corr_heatmap(test_ps_df, "Original test dataframe correlation map")

[Figure: correlation heatmaps for the combined train and the test dataframes]

num_features = list(test_ps_df)
fig, ax = plt.subplots(int((len(num_features)/2)+1), 2, figsize = (12, 25), dpi = 250)
ax = ax.flatten()

for i, column in enumerate(num_features):
        
    sns.kdeplot(train_df[column], ax=ax[i], color=PALETTE_7_C[0])
    sns.kdeplot(mineral_df[column], ax=ax[i], color=PALETTE_7_C[2], warn_singular = False)
    sns.kdeplot(test_ps_df[column], ax=ax[i], color=PALETTE_7_C[1])
    
    ax[i].set_title(f"Distribution of {column} column", size = 12)
    ax[i].set_xlabel(None)
    
fig.suptitle("Distribution of Featurea\nX Dataset\n", fontsize = 22, fontweight = "bold")
fig.legend(["Combined", "Test original", "Test PS"])
plt.tight_layout()

[Figure: KDE of each feature in the combined train, original, and PS test datasets]

dist_tree(train_df, "combined train dataframe")
dist_tree(test_ps_df, "test dataframe")

[Figure: feature-distance dendrograms for the combined train and test dataframes]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))

sns.scatterplot(data=train_df, x=ALLELECTRONS_AVERAGE, y='R_cov_element_Average', hue=TARGET, ax=axes[0, 0])
sns.scatterplot(data=train_df, x=ALLELECTRONS_AVERAGE, y='R_vdw_element_Average', hue=TARGET, ax=axes[0, 1])
sns.scatterplot(data=train_df, x=ALLELECTRONS_AVERAGE, y='atomicweight_Average' , hue=TARGET, ax=axes[1, 0])
sns.scatterplot(data=train_df, x=ALLELECTRONS_AVERAGE, y='density_Average', hue=TARGET, ax=axes[1, 1])

plt.tight_layout()
plt.show()

[Figure: allelectrons_Average against R_cov_element_Average, R_vdw_element_Average, atomicweight_Average and density_Average, colored by Hardness]

7 NN data preparation

features = ['allelectrons_Total', 'density_Total', ALLELECTRONS_AVERAGE,
            'val_e_Average', 'atomicweight_Average', 'ionenergy_Average',
            'el_neg_chi_Average', 'R_vdw_element_Average', 'R_cov_element_Average',
            'zaratio_Average', 'density_Average', TARGET]
X = train_df[features].drop(columns=TARGET)
y = train_df.Hardness

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=TEST_SIZE, random_state=SEED)

8 Model

X = train_df.reset_index(drop = True)
y = X.pop(TARGET).reset_index(drop = True)
score_list, oof_list = pd.DataFrame(), pd.DataFrame()


models = [
    ("CB_REG", CatBoostRegressor(
                            iterations=300,
                            depth=9,
                            verbose = 0,
                            subsample=SUBSAMPLE,
                            reg_lambda=0.075,
                            objective = "MAE",
                            random_state = SEED,
                            learning_rate=LR,
                            min_child_samples=96,
                            colsample_bylevel=0.55,
    )),
    ("XGB_REG", XGBRegressor(
                        gamma=0.022,
                        max_depth=9,
                        subsample=SUBSAMPLE,
                        reg_alpha=0.003,
                        reg_lambda=0.001,
                        n_estimators=169, 
                        random_state = SEED, 
                        min_child_weight=10, 
                        learning_rate=LR,
                        colsample_bytree=0.95,
                        objective = "reg:absoluteerror"
    )),
    ("GB_REG", GradientBoostingRegressor(
                                    alpha=0.45,
                                    max_depth=8, 
                                    subsample=SUBSAMPLE,
                                    n_estimators=271, 
                                    min_samples_leaf=9, 
                                    learning_rate=LR,
                                    random_state = SEED,
                                    min_samples_split=15,
                                    loss = "absolute_error"
    )),
    ("HGB_REG", HistGradientBoostingRegressor(
                                        max_iter=251,
                                        max_depth=10, 
                                        max_leaf_nodes=776,
                                        learning_rate=LR,
                                        random_state = SEED,
                                        min_samples_leaf=16, 
                                        l2_regularization=1.2,
                                        loss = "absolute_error",
    )),   
]

for (label, model) in models:
    score_list[label], oof_list[label] = cvs(model, m_name = label, train_or_df = mineral_df)

[Output: CV scores of the four models]

weights = LADRegression().fit(oof_list, train_df.Hardness).coef_
pd.DataFrame(weights, index = oof_list.columns, columns = ["Weight"])

This fits a LAD regression (least absolute deviations, i.e. L1 regression) on the out-of-fold predictions against train_df.Hardness via LADRegression().fit(), reads the fitted coefficients from the coef_ attribute, and uses them as the per-model blending weights, displayed as a DataFrame. A sketch of previewing the blend follows the output below.
[Output: blending weight per model]
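Since VotingRegressor applies these weights as a normalized weighted average, the blended out-of-fold score can be previewed directly from oof_list before refitting anything; a sketch reusing the objects above:

blend = oof_list.to_numpy() @ weights / weights.sum()
print(median_absolute_error(train_df.Hardness, blend))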

9 Prediction

v_reg = VotingRegressor(models, weights = weights)
nu = cvs(v_reg)

[Output: CV score of the weighted VotingRegressor]

f_model = clone(v_reg)
f_model.fit(X, y)

[Output: the fitted VotingRegressor]

# Voting-ensemble predictions (overwritten below by the NN pipeline)
submission_df[TARGET] = f_model.predict(test_ps_df)

# Fit a plain LGBM and append its predictions as an extra feature for the NN
pre_LGBM_model = LGBMRegressor()
pre_LGBM_model.fit(X, y)
X_NN = X.copy()
X_NN["Hardness_p"] = pre_LGBM_model.predict(X)

model = nn_model()
history = model.fit(X_NN.astype("float32"), y.astype("float32"),
                    epochs=EPOCHS,
                    class_weight=pre_LGBM_model.class_weight,  # None for a regressor
                    callbacks=callbacks_list,
                    validation_split=0.1)

# Build the test set the same way and write the submission
test_df = pd.read_csv(TEST_PS_PATH)
test_df["Hardness_p"] = pre_LGBM_model.predict(test_df.astype("float32").drop(columns=ID))
test_df[TARGET] = model.predict(test_df.astype("float32").drop(columns=ID))
submission_df = test_df[[ID, TARGET]]
submission_df.to_csv("submission.csv", index=False)

10 Personal summary

First an LGBM is trained on the training set, its predictions are merged into the dataset as a new feature, and the result is fed into the neural network for a second round of training.
After all that, for all the talk of exploratory data analysis there is hardly any preprocessing, and my submission did not score the 0.29 the author claims but 0.36. Still, the NN model he builds is worth learning from.
To be fair, the plots are gorgeous!!! The setup code at the start is worth saving for direct reuse later.
All in all, not a waste of time; I did get something out of it... still, what a trap.
