Competition URL: https://www.kaggle.com/competitions/allstate-claims-severity
Task: this is a regression problem whose target variable is loss. The features include a large number of categorical features as well as numerical ones, and the evaluation metric is mean absolute error (MAE). The code below uses xgboost for prediction.
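For reference, MAE is simply the mean of the absolute residuals. A minimal sketch using scikit-learn; the numbers are made up purely for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1200.0, 2100.0, 3800.0])  # illustrative loss values
y_pred = np.array([1300.0, 2000.0, 3500.0])  # illustrative predictions
print(mean_absolute_error(y_true, y_pred))   # equals np.mean(np.abs(y_true - y_pred))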
%matplotlib inline
import math, time, random, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno  # for visualizing missing values in the dataset
import seaborn as sns
plt.style.use('seaborn-whitegrid')
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, metrics, linear_model
from sklearn.linear_model import LinearRegression
import xgboost as xgb  # used for the regression model below
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv("/kaggle/input/allstate-claims-severity/train.csv")
test = pd.read_csv("/kaggle/input/allstate-claims-severity/test.csv")
# Print all rows and columns; don't hide any
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(train.shape)
print(test.shape)
(188318, 132)
(125546, 131)
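Before looking at the target, it helps to see how the columns split into categorical and numerical features. A minimal sketch, relying on the dataset's naming convention of cat-prefixed categorical columns and cont-prefixed continuous ones (variable names here are illustrative):

cat_cols = [c for c in train.columns if c.startswith("cat")]
cont_cols = [c for c in train.columns if c.startswith("cont")]
# In this dataset there should be 116 categorical and 14 continuous feature columns
print(len(cat_cols), len(cont_cols))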
Inspect the distribution of the target so we can decide whether to apply a log transform.
# Save the test id for the later submission,
# then drop the id column from train and test: id is unique for every row and carries no information.
test_id = test["id"]
test.drop("id", axis = 1, inplace = True)
train.drop("id", axis = 1, inplace = True)
print(train.loss.describe())  # the target variable is named loss
print("")
print("The loss/target skewness : ",train.loss.skew())
count 188318.000000
mean 3037.337686
std 2904.086186
min 0.670000
25% 1204.460000
50% 2115.570000
75% 3864.045000
max 121012.250000
Name: loss, dtype: float64
The loss/target skewness : 3.7949583775378604
# The skewness of loss is about 3.8, which is high, so it should be corrected.
# I have chosen to use log1p.
# Before skew correction
sns.violinplot(data=train, y="loss")  # violin plot: https://www.cnblogs.com/metafullstack/p/17658735.html
plt.show()
train["loss"] = np.log1p(train["loss"])
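After the transform, the skewness should drop close to zero and the violin plot becomes much more symmetric. The sketch below shows one way the log-transformed target could feed an xgboost regressor evaluated with MAE; the label encoding and parameters are illustrative placeholders, not the notebook's final pipeline.

# Check the effect of the transform
print("Skewness after log1p:", train.loss.skew())
sns.violinplot(data=train, y="loss")
plt.show()

# Illustrative modeling sketch (simple label encoding; untuned placeholder parameters)
X = train.drop("loss", axis=1)
for c in X.select_dtypes(include="object").columns:
    X[c] = X[c].astype("category").cat.codes

X_tr, X_val, y_tr, y_val = train_test_split(X, train["loss"], test_size=0.2, random_state=42)
model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
model.fit(X_tr, y_tr)

# Predictions are on the log scale, so invert with expm1 before computing MAE
val_mae = metrics.mean_absolute_error(np.expm1(y_val), np.expm1(model.predict(X_val)))
print("Validation MAE:", val_mae)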