Step 1:
Split the loss into its components and print each value:
total_loss, xy_loss, wh_loss, obj_loss, noobj_loss, class_loss = loss_fn(label, output)
tf.print(['loss:xy=', xy_loss, ',wh=', wh_loss, ',obj=', obj_loss, ',noobj=', noobj_loss, ',cls=', class_loss])
# Note: `is_nan(...) is True` always evaluates to False, because is_nan returns a
# tensor, not the Python bool True. In eager mode, reduce to a bool instead:
if tf.reduce_any(tf.math.is_nan(total_loss)):
    tf.print('ERROR: loss is nan')
    sys.exit(1)
pred_loss.append(total_loss)
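As a variant of the manual check above, TensorFlow's tf.debugging.check_numerics raises an error as soon as a tensor contains nan or inf. A minimal sketch (the loss values here are made up for illustration, standing in for total_loss):

```python
import tensorflow as tf

# A made-up loss tensor containing a nan, standing in for total_loss
total_loss = tf.constant([0.35, float('nan'), 3.08])

nan_detected = False
try:
    # check_numerics raises InvalidArgumentError if the tensor holds nan or inf
    tf.debugging.check_numerics(total_loss, message='loss check')
except tf.errors.InvalidArgumentError:
    nan_detected = True
    print('ERROR: loss is nan')
```

This avoids scattering manual is_nan checks through the training loop.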
The actual printed values:
['loss:xy=', [0], ',wh=', [-nan(ind)], ',obj=', [0], ',noobj=', [0], ',cls=', [0]]
['loss:xy=', [0], ',wh=', [-nan(ind)], ',obj=', [0], ',noobj=', [0], ',cls=', [0]]
['loss:xy=', [0.355872124], ',wh=', [-nan(ind)], ',obj=', [0.392578483], ',noobj=', [0], ',cls=', [3.08788514]]
So the nan in the total loss is coming from the wh loss.
Step 2:
Inspect how the wh loss is computed:
true_wh = true_wh / anchors
true_wh = tf.math.log(true_wh)
true_wh = tf.where(tf.math.is_inf(true_wh), tf.zeros_like(true_wh), true_wh)
wh_loss = obj_mask * box_loss_scale * tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
The wh loss definition shows the problem is tied to the log transform: after tf.math.log, true_wh can contain inf (log of 0) or nan (log of a negative value), and the existing tf.where only masks the inf case. The loss input therefore needs extra masking or clipping.
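A quick check of tf.math.log confirms both failure modes (the input values here are illustrative):

```python
import tensorflow as tf

# log(0) yields -inf; log of a negative value yields nan
x = tf.constant([0.0, -1.0, 2.0])
y = tf.math.log(x)

print(tf.math.is_inf(y).numpy())  # [ True False False]
print(tf.math.is_nan(y).numpy())  # [False  True False]
```

Only the first pattern (inf) is caught by the original tf.where line.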
true_wh = true_wh / anchors
true_wh = tf.math.log(true_wh)
true_wh = tf.where(tf.math.is_inf(true_wh), tf.zeros_like(true_wh), true_wh)
true_wh = tf.where(tf.math.is_nan(true_wh), tf.zeros_like(true_wh), true_wh)  # also mask nan, not just inf
Alternatively:
true_wh = true_wh / anchors
# floor at a small positive value so tf.math.log never sees 0 or a negative input
true_wh = tf.clip_by_value(true_wh, 1e-8, tf.reduce_max(true_wh))
true_wh = tf.math.log(true_wh)
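Applying the clip variant to a wh tensor that contains zeros shows the log output stays finite (the box sizes and anchors below are made-up values for illustration):

```python
import tensorflow as tf

# Made-up normalized box sizes; the zeros are what previously broke the log
true_wh = tf.constant([[0.0, 0.0], [0.5, 0.25]])
anchors = tf.constant([1.0, 1.0])

true_wh = true_wh / anchors
# Floor at 1e-8 so tf.math.log never sees 0 (or a negative value)
true_wh = tf.clip_by_value(true_wh, 1e-8, tf.reduce_max(true_wh))
true_wh = tf.math.log(true_wh)

print(tf.reduce_any(tf.math.is_nan(true_wh)).numpy())  # False
print(tf.reduce_any(tf.math.is_inf(true_wh)).numpy())  # False
```

The zero entries become log(1e-8) ≈ -18.4 instead of -inf, so the downstream square and sum stay finite.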
After the clip fix, the printed loss values no longer contain nan:
['loss:xy=', [0], ',wh=', [0], ',obj=', [0], ',noobj=', [0], ',cls=', [0]]
['loss:xy=', [0.64036262], ',wh=', [0.346808195], ',obj=', [0.562800765], ',noobj=', [0], ',cls=', [3.18114781]]
['loss:xy=', [0], ',wh=', [0], ',obj=', [0], ',noobj=', [0], ',cls=', [0]]
I0701 11:05:04.097478 21808 train.py:147] 1_train_0, 20.449954986572266, [0.0, 8.679803, 0.0]