# The Problem

• Here `y_true` holds the true labels and `y_pred` the predicted labels; passing hard labels where `log_loss` expects per-class probabilities triggers the error below

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 3]
y_pred = [1, 2, 1]
log_loss(y_true, y_pred)
```

```
ValueError: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 3]
```
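As the error message itself suggests, one way out is to give `log_loss` per-class probability rows together with an explicit `labels` list. The probability values below are made up purely for illustration:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 3]
# one probability row per sample, over the 4 possible classes
y_pred_proba = [[0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1],
                [0.1, 0.7, 0.1, 0.1]]

# labels tells sklearn that class 2 exists even though y_true never contains it
log_loss(y_true, y_pred_proba, labels=[0, 1, 2, 3])  # ≈ 2.3026, i.e. -log(0.1)
```

Each sample here assigns probability 0.1 to its true class, so the mean loss is simply `-log(0.1)`.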


```python
from sklearn.metrics import log_loss
from sklearn.preprocessing import OneHotEncoder

# each of the three positions is encoded over the 4 possible classes, so the
# (1, 3) input becomes a single (1, 12) row — sklearn then sees one "sample"
# with 12 columns (older sklearn spelled these n_values=4, sparse=False)
one_hot = OneHotEncoder(categories=[[0, 1, 2, 3]] * 3, sparse_output=False)

y_true = one_hot.fit_transform([[0, 1, 3]])
y_pred = one_hot.fit_transform([[1, 2, 1]])
log_loss(y_true, y_pred)
```


# How logloss Is Computed

$logloss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{i,j}\log(p_{i,j})$

• N: the number of samples
• M: the number of classes; in the multiclass example above, M is 4
• $y_{i,j}$: 1 if sample i belongs to class j, 0 otherwise
• $p_{i,j}$: the predicted probability that sample i belongs to class j
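The formula can be written out in plain Python. This is a minimal sketch of the textbook definition (no clipping); the function name `log_loss_manual` is my own:

```python
import math

def log_loss_manual(y_true, probs):
    """Textbook multiclass logloss: -1/N * sum_i sum_j y_ij * log(p_ij)."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # y_ij is 1 only at the true class j, so the inner sum over j
        # collapses to the log-probability assigned to that class
        total += math.log(row[label])
    return -total / len(y_true)

# one perfect prediction, one uniform guess over two classes
log_loss_manual([0, 1], [[1.0, 0.0], [0.5, 0.5]])  # ≈ 0.3466 = -log(0.5)/2
```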

• `y_true = [0,1,3]`
• `y_pred = [1,2,1]`

After one-hot encoding `y_pred` and clipping every 0 up to 1e-15, `p` becomes:

```
p = array([[1.00000000e-15, 1.00000000e+00, 1.00000000e-15, 1.00000000e-15,
            1.00000000e-15, 1.00000000e-15, 1.00000000e+00, 1.00000000e-15,
            1.00000000e-15, 1.00000000e+00, 1.00000000e-15, 1.00000000e-15]])
```


Because the one-hot arrays form a single row of 12 entries, sklearn's row-normalization step

```python
y_pred /= y_pred.sum(axis=1)[:, np.newaxis]
```

divides the whole row by its sum, which here is ≈ 3 (= N), so the computation effectively becomes

$logloss = -\sum_{i=1}^{N}\sum_{j=1}^{M}y_{i,j}\log(\frac{1}{N}p_{i,j})$


```
# dividing the p above by its row sum (3) gives
p = array([[3.33333333e-16, 3.33333333e-01, 3.33333333e-16, 3.33333333e-16,
            3.33333333e-16, 3.33333333e-16, 3.33333333e-01, 3.33333333e-16,
            3.33333333e-16, 3.33333333e-01, 3.33333333e-16, 3.33333333e-16]])
```

`y` is the corresponding one-hot encoding of `y_true`:

```
y = array([[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])
```

and the loss is then computed as

```python
loss = -(y * np.log(p)).sum(axis=1)
```
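Putting the walkthrough together, the whole computation can be reproduced end to end. This mirrors sklearn's historical clip-and-renormalize behaviour rather than calling sklearn itself:

```python
import numpy as np

eps = 1e-15

# one-hot encodings of y_true = [0, 1, 3] and y_pred = [1, 2, 1],
# flattened into a single (1, 12) row as in the example above
y = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=float)
p = np.array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]], dtype=float)

p = np.clip(p, eps, 1 - eps)           # every 0 becomes 1e-15
p /= p.sum(axis=1)[:, np.newaxis]      # the row sums to ~3, so divide by 3
loss = -(y * np.log(p)).sum(axis=1)
print(loss)  # ≈ [106.91]
```

All three true classes end up with probability `1e-15 / 3`, so the loss is `3 * -log(3.33e-16)`.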


# Summary

• To avoid taking the log of 0, sklearn clips predicted probabilities of 0 up to 1e-15
• Because of this clipping and the row renormalization, sklearn's logloss can differ slightly from the textbook logloss formula