XGBoost Regression in Practice: A Kaggle Example


Competition URL: https://www.kaggle.com/competitions/allstate-claims-severity

Task description: this is a regression problem. The target variable is loss, the features include a large number of categorical as well as numeric features, and the evaluation metric is mean absolute error (MAE).

The following code uses xgboost for prediction.

%matplotlib inline
import math, time, random, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno  # for visualizing missing values in the dataset
import seaborn as sns
plt.style.use('seaborn-whitegrid')

import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv("/kaggle/input/allstate-claims-severity/train.csv")
test = pd.read_csv("/kaggle/input/allstate-claims-severity/test.csv")

# Print all rows and columns. Don't hide any.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(train.shape)
print(test.shape)
(188318, 132)
(125546, 131)
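Since the task mixes categorical and numeric features, it helps to separate the two column groups early. In this competition the feature columns are named `cat1`..`cat116` and `cont1`..`cont14`, so a simple prefix check is enough. A minimal sketch on a tiny stand-in frame (the real `train` has 116 + 14 such columns):

```python
import pandas as pd

# Tiny stand-in frame mimicking the competition's column layout
# (cat*/cont* prefixes); the real data has cat1..cat116 and cont1..cont14.
df = pd.DataFrame({"id": [1], "cat1": ["A"], "cat2": ["B"],
                   "cont1": [0.5], "cont2": [0.1], "loss": [2213.18]})

cat_cols = [c for c in df.columns if c.startswith("cat")]
cont_cols = [c for c in df.columns if c.startswith("cont")]
print(cat_cols, cont_cols)  # ['cat1', 'cat2'] ['cont1', 'cont2']
```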

First, observe the distribution of the label, so we can decide whether to apply a log transform.

# Save the test id for later submission,
# then drop id from train and test: it is unique per row and carries no information.
test_id = test["id"]
test.drop("id", axis = 1, inplace = True)
train.drop("id", axis = 1, inplace = True)
print(train.loss.describe()) # the target variable is called loss
print("")
print("The loss/target skewness : ",train.loss.skew())
count    188318.000000
mean       3037.337686
std        2904.086186
min           0.670000
25%        1204.460000
50%        2115.570000
75%        3864.045000
max      121012.250000
Name: loss, dtype: float64

The loss/target skewness :  3.7949583775378604
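As a quick sanity check of why `log1p` is a reasonable fix here, the effect can be reproduced on synthetic right-skewed data (a lognormal sample standing in for `loss`, not the competition data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed sample on a scale similar to `loss`
sample = pd.Series(rng.lognormal(mean=8, sigma=0.8, size=100_000))

print(sample.skew())            # strongly positive: heavy right skew
print(np.log1p(sample).skew())  # near 0: roughly symmetric after log1p
```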
# The skewness of loss is about 3.8, which is quite high, so it should be corrected.
# I have chosen to use log1p.

# Before skew correction 
sns.violinplot(data=train, y="loss") # violin plot: https://www.cnblogs.com/metafullstack/p/17658735.html
plt.show()

train["loss"] = np.log1p(train["loss"])