2021 CCF BDCI
CCF is back again this year. Every year seasoned competitors take the crown and dark-horse newcomers break through, and for newcomers a baseline is a great starting point for getting into data competitions faster. (Veterans, feel free to skip this!)
UEBA-based Analysis of Anomalous User Web Behavior
Structured data is relatively easy to get started with. Since the loan-default dataset had problems, I picked this other structured-data task and wrote a baseline for it. I only submitted once and scored 0.8994 on the leaderboard. That is nowhere near the top scores, but it is a decent reference for getting started, and there is still plenty of room for improvement!
Competition link: https://www.datafountain.cn/competitions/520
Data list
Without further ado, here is the code:
import math

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
df_train = pd.read_csv('2021CCF用户train_data.csv', encoding='gb2312')
df_test = pd.read_csv('2021CCF用户A_test_data.csv', encoding='gb2312')
data = pd.concat([df_train, df_test], axis=0)
# Label-encode every categorical column except the target, time and id.
for col in data.columns:
    if col not in ['ret', 'time', 'id']:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
# Basic date features.
data['time'] = pd.to_datetime(data['time'], format='%Y-%m-%d')
data['month'] = data['time'].dt.month
data['day'] = data['time'].dt.day
data['weekday'] = data['time'].dt.weekday
# Rows with a label are train, rows without are test.
train = data[data['ret'].notnull()]
test = data[data['ret'].isnull()]
feature = [x for x in train.columns if x not in ['ret', 'time', 'id']]
#lgb
clf = lgb.LGBMRegressor(
    learning_rate=0.05,
    n_estimators=50230,
    # num_leaves=31,
    max_depth=7,
    subsample=0.8,
    # colsample_bytree=0.8,
    metric='rmse'
)
train_x = train[feature]
target = train['ret']
test_x = test[feature]
oof1 = np.zeros(len(train))
answers = []
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=2000)
for fold_n, (train_index, valid_index) in enumerate(folds.split(train_x)):
    X_train, X_valid = train_x.iloc[train_index], train_x.iloc[valid_index]
    # Use positional indexing for the target too, since KFold yields positions.
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
            verbose=100, early_stopping_rounds=200)
    y_pre = clf.predict(X_valid)
    oof1[valid_index] = y_pre  # out-of-fold predictions for the local score
    y_pred_valid = clf.predict(test_x)
    answers.append(y_pred_valid)
lgb_pre = sum(answers) / n_fold  # average the per-fold test predictions
print('score-----------', 1 / (math.sin(math.atan(np.sqrt(mean_squared_error(oof1, target)))) + 1))
sub = df_test[['id']].copy()  # .copy() avoids a SettingWithCopyWarning
sub['ret'] = lgb_pre
sub.to_csv('2021CCF用户submit.csv', index=False)
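The scoring expression in the print statement above is a bit hard to read inline. As a sketch (the function name here is my own, not from the competition), it is equivalent to:

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error


def competition_score(y_true, y_pred):
    """Local score: 1 / (sin(atan(RMSE)) + 1).

    A perfect prediction (RMSE = 0) gives 1.0, and the score
    decreases toward 1/2 as the RMSE grows without bound.
    """
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return 1.0 / (math.sin(math.atan(rmse)) + 1.0)


print(competition_score([1.0, 2.0], [1.0, 2.0]))  # RMSE = 0 -> 1.0
```

Wrapping it this way also makes it easy to check a few sanity cases before trusting the local CV score.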
The baseline uses the simplest possible preprocessing: it mechanically encodes the data without considering relationships between fields, without digging into the business meaning of each column, and without any data visualization. All of these are directions for further optimization, and there are far too many to list one by one. I hope this baseline helps beginners get started. (Veterans, of course, can skip it.)
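As one concrete example of such an optimization, count (frequency) encoding often gives tree models more signal than a bare LabelEncoder, because the encoded value reflects how common each category is rather than an arbitrary integer. A minimal sketch on toy data (the `account` column is made up for illustration, not from the competition dataset):

```python
import pandas as pd

# Toy frame standing in for one categorical column of the competition data.
df = pd.DataFrame({'account': ['a', 'b', 'a', 'c', 'a', 'b']})

# Count encoding: replace each category with its frequency in the data.
counts = df['account'].value_counts()
df['account_count'] = df['account'].map(counts)

print(df['account_count'].tolist())  # [3, 2, 3, 1, 3, 2]
```

In the baseline this would be one extra feature per categorical column, built on the concatenated `data` frame so train and test share the same counts.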
Closing remarks
My knowledge is limited, so if anything here is misunderstood or plainly wrong, please point it out!