XGBoost custom objective and evaluation functions
XGBoost supports user-defined objective functions and evaluation functions. The official demo is as follows:
import numpy as np
import xgboost as xgb

# user-defined objective function: given predictions, return the gradient
# and second-order gradient (hessian). This is the logistic log-loss.
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# user-defined evaluation function, returning a pair (metric_name, result)
# NOTE: when you use a customized loss function, the default prediction value is the margin,
# which may make built-in evaluation metrics malfunction.
# For example, with logistic loss the predictions are scores before the logistic
# transformation, while the built-in evaluation error assumes input after it.
# Keep this in mind when customizing; you may also need a customized evaluation function.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # return a pair (metric_name, result); the metric name must not contain a colon (:) or a space
    # since preds are margins (before the logistic transformation), the cutoff is at 0
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with the customized objective; we can also do step-by-step training
# (see the implementation of train in xgboost's training module)
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
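To make the demo runnable end to end, here is a minimal sketch; the synthetic data, the parameter values, and the train/test split are illustrative assumptions, not part of the official demo:

import numpy as np
import xgboost as xgb

# illustrative synthetic binary-classification data (an assumption, not from the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dtest = xgb.DMatrix(X[800:], label=y[800:])
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
param = {'max_depth': 3, 'eta': 0.1}  # illustrative values; no 'objective' key needed
num_round = 20

bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
# as noted above, predictions are raw margins; apply the sigmoid for probabilities
probs = 1.0 / (1.0 + np.exp(-bst.predict(dtest)))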
Approximating MAE with a differentiable custom objective
As the official comments note, a user-defined objective must supply first- and second-order derivatives. MAE is not differentiable at zero, and its second derivative is zero everywhere else, so it cannot be used directly as a custom objective; instead, we need a smooth objective function that approximates MAE. Below are several MAE-approximating functions collected from Stack Overflow.
[Figure: comparison of candidate loss functions against MAE, from the Stack Overflow answer]
As the comparison shows, XGBoost's built-in MSE objective approximates MAE poorly, so we can use one of the other functions instead. Note that when using these custom objective functions, we cannot use XGBoost's GPU acceleration. Python implementations of the custom objective functions follow (source: Stack Overflow).
def huber_approx_obj(preds, dtrain):
    """Pseudo-Huber loss, a smooth approximation of MAE."""
    d = preds - dtrain.get_label()  # residual
    h = 1  # h is delta in the plot
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess

def fair_obj(preds, dtrain):
    """Fair loss: y = c * abs(x) - c**2 * np.log(abs(x)/c + 1)"""
    x = preds - dtrain.get_label()  # residual
    c = 1
    den = abs(x) + c
    grad = c * x / den
    hess = c * c / den ** 2
    return grad, hess

def log_cosh_obj(preds, dtrain):
    """Log-cosh loss: y = np.log(np.cosh(x))"""
    x = preds - dtrain.get_label()  # residual
    grad = np.tanh(x)
    hess = 1 / np.cosh(x) ** 2
    return grad, hess
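As a hedged usage sketch (the synthetic data, parameter values, and the mae_error helper below are my illustrative assumptions, not from the Stack Overflow answer), any of these functions can be passed to xgb.train via obj, typically together with a custom MAE evaluation function, since the objective itself is only an approximation:

import numpy as np
import xgboost as xgb

def mae_error(preds, dtrain):
    # report the true MAE, even though the objective only approximates it
    labels = dtrain.get_label()
    return 'mae', float(np.mean(np.abs(preds - labels)))

# illustrative synthetic regression data with heavy-tailed noise, where MAE shines
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.laplace(size=1000)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

param = {'max_depth': 4, 'eta': 0.1, 'base_score': 0.0}  # illustrative values
bst = xgb.train(param, dtrain, 100, [(dtrain, 'train'), (dvalid, 'valid')],
                obj=huber_approx_obj, feval=mae_error)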
The following custom approximate-MAE gradient comes from a Kaggle discussion:
from numba import jit

@jit
def grad(preds, dtrain):
    # gradient of an MAE-like objective: a scaled sign of the residual, with a
    # constant hessian so that XGBoost takes bounded leaf steps
    # (note: xgboost's DMatrix is not a numba-supported type, so @jit may need
    # to fall back to object mode here)
    labels = dtrain.get_label()
    n = preds.shape[0]
    grad = np.empty(n)
    hess = 500 * np.ones(n)
    for i in range(n):
        diff = preds[i] - labels[i]
        if diff > 0:
            grad[i] = 200
        elif diff < 0:
            grad[i] = -200
        else:
            grad[i] = 0
    return grad, hess
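For reference, the loop above just assigns a scaled sign of the residual, which is exactly the subgradient of MAE. A vectorized restatement without numba (my own sketch, not from the Kaggle thread) would be:

def mae_sign_obj(preds, dtrain):
    # scaled sign(residual) = MAE subgradient; the constant "hessian" caps the step size
    diff = preds - dtrain.get_label()
    grad = 200 * np.sign(diff)
    hess = 500 * np.ones_like(diff)
    return grad, hess

Since each leaf weight is roughly -sum(grad) / (sum(hess) + lambda), the ratio 200/500 = 0.4 here acts as a fixed step size for every boosting round.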