[Distributed TensorFlow (0.11.0) issue, unresolved] Segmentation fault (core dumped)

While running distributed training with TensorFlow, I hit a `Segmentation fault (core dumped)` error. I tested three different models (alexnet, alexnet_v2, inception_v1), and every one of them crashes with this error while the main function is executing. The dataset (960831 images) and the training procedure are identical across the tests, yet the crash happens once execution reaches a certain step. The root cause has not been found yet.

There are three tests whose main functions are essentially identical; only the model differs, yet all of them fail with Segmentation fault (core dumped).


In my previous issue post the test used a dummy dataset and only ran the forward pass, with no parameter updates or optimization, so I wrote a new script that trains on a real dataset.


Training set:

960831 images (224*224), already converted into 97 tfrecords files, listed below:

[root@dl1 train]# ls
train_224_0.tfrecords   train_224_32.tfrecords  train_224_55.tfrecords  train_224_78.tfrecords
train_224_10.tfrecords  train_224_33.tfrecords  train_224_56.tfrecords  train_224_79.tfrecords
train_224_11.tfrecords  train_224_34.tfrecords  train_224_57.tfrecords  train_224_7.tfrecords
train_224_12.tfrecords  train_224_35.tfrecords  train_224_58.tfrecords  train_224_80.tfrecords
train_224_13.tfrecords  train_224_36.tfrecords  train_224_59.tfrecords  train_224_81.tfrecords
train_224_14.tfrecords  train_224_37.tfrecords  train_224_5.tfrecords   train_224_82.tfrecords
train_224_15.tfrecords  train_224_38.tfrecords  train_224_60.tfrecords  train_224_83.tfrecords
train_224_16.tfrecords  train_224_39.tfrecords  train_224_61.tfrecords  train_224_84.tfrecords
train_224_17.tfrecords  train_224_3.tfrecords   train_224_62.tfrecords  train_224_85.tfrecords
train_224_18.tfrecords  train_224_40.tfrecords  train_224_63.tfrecords  train_224_86.tfrecords
train_224_19.tfrecords  train_224_41.tfrecords  train_224_64.tfrecords  train_224_87.tfrecords
train_224_1.tfrecords   train_224_42.tfrecords  train_224_65.tfrecords  train_224_88.tfrecords
train_224_20.tfrecords  train_224_43.tfrecords  train_224_66.tfrecords  train_224_89.tfrecords
train_224_21.tfrecords  train_224_44.tfrecords  train_224_67.tfrecords  train_224_8.tfrecords
train_224_22.tfrecords  train_224_45.tfrecords  train_224_68.tfrecords  train_224_90.tfrecords
train_224_23.tfrecords  train_224_46.tfrecords  train_224_69.tfrecords  train_224_91.tfrecords
train_224_24.tfrecords  train_224_47.tfrecords  train_224_6.tfrecords   train_224_92.tfrecords
train_224_25.tfrecords  train_224_48.tfrecords  train_224_70.tfrecords  train_224_93.tfrecords
train_224_26.tfrecords  train_224_49.tfrecords  train_224_71.tfrecords  train_224_94.tfrecords
train_224_27.tfrecords  train_224_4.tfrecords   train_224_72.tfrecords  train_224_95.tfrecords
train_224_28.tfrecords  train_224_50.tfrecords  train_224_73.tfrecords  train_224_96.tfrecords
train_224_29.tfrecords  train_224_51.tfrecords  train_224_74.tfrecords  train_224_9.tfrecords
train_224_2.tfrecords   train_224_52.tfrecords  train_224_75.tfrecords  train_224_image_mean.npy
train_224_30.tfrecords  train_224_53.tfrecords  train_224_76.tfrecords
train_224_31.tfrecords  train_224_54.tfrecords  train_224_77.tfrecords
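
The main function below elides the input pipeline as `images, labels = ...`. For reference, a minimal TF 0.x-style reader for these shards might look like the following sketch; the helper name read_batch and the feature keys ('image_raw', 'label') are assumptions and have to match whatever the conversion script actually wrote, and mean subtraction with train_224_image_mean.npy is left out.

import glob
import os

import tensorflow as tf

def read_batch(tfrecord_dir, batch_size=32):
    # Queue up all 97 shards and shuffle between them.
    files = sorted(glob.glob(os.path.join(tfrecord_dir, 'train_224_*.tfrecords')))
    filename_queue = tf.train.string_input_producer(files, shuffle=True)
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    # Assumed feature keys; adjust to the schema the conversion script wrote.
    features = tf.parse_single_example(serialized, features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    image = tf.reshape(image, [224, 224, 3])
    image = tf.cast(image, tf.float32)  # mean subtraction (train_224_image_mean.npy) omitted
    label = tf.cast(features['label'], tf.int32)
    # shuffle_batch registers queue runners that the Supervisor starts later.
    images, labels = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size,
        capacity=2000, min_after_dequeue=1000, num_threads=4)
    return images, labels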


Main function:
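# Only main() is shown. The imports (time, datetime, tensorflow as tf), the
# tf.app.flags definitions referenced through FLAGS, and the learning_rate value
# are presumably defined earlier in the script; a sketch of the flag definitions
# is given after the code.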
def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
  server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

  issync = FLAGS.issync
  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    images, labels = ...
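    # replica_device_setter pins the variables to the ps tasks and the remaining
    # ops to this worker, so all workers share the same model parameters.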
    with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
      global_step = tf.Variable(0, name='global_step', trainable=False)
      # Change this line to plug in a different model
      logits, parameters = inference(images)
      logits = tf.contrib.layers.flatten(logits)

      cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits, name='xentropy')

      loss_value = tf.reduce_mean(cross_entropy, name='xentropy_mean')       
      optimizer = tf.train.GradientDescentOptimizer(learning_rate)       
      grads_and_vars = optimizer.compute_gradients(loss_value)

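      # issync == 1: SyncReplicasOptimizer accumulates gradients from all workers
      # and applies one aggregated update; otherwise each worker updates the
      # parameters asynchronously.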
      if issync == 1:
        # Synchronous mode
        rep_op = tf.train.SyncReplicasOptimizer(optimizer,
                                                replicas_to_aggregate=len(worker_hosts),
                                                replica_id=FLAGS.task_index,
                                                total_num_replicas=len(worker_hosts),
                                                use_locking=True)
        train_op = rep_op.apply_gradients(grads_and_vars, global_step=global_step)
        init_token_op = rep_op.get_init_tokens_op()
        chief_queue_runner = rep_op.get_chief_queue_runner()

      else:
        # Asynchronous mode
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

      init_op = tf.initialize_all_variables()

      saver = tf.train.Saver()
      tf.summary.scalar('cost', loss_value)
      summary_op = tf.summary.merge_all()


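    # The Supervisor handles variable initialization, checkpointing and queue
    # runners; only the chief (task_index 0) runs init_op and saves checkpoints,
    # the other workers wait for it.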
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                        logdir="./alexnet_checkpoint",
                        init_op=init_op,
                        summary_op=None,
                        saver=saver,
                        global_step=global_step,
                        save_model_secs=60)

    with sv.prepare_or_wait_for_session(server.target) as sess:
      # Sync
      if FLAGS.task_index == 0 and issync == 1:
        sv.start_queue_runners(sess, [chief_queue_runner])
        sess.run(init_token_op)
      step = 0
      while not sv.should_stop():
        try:
          start_time = time.time()     
          _, loss_v, step = sess.run([train_op, loss_value, global_step])
          if step > 1000:
            break
          duration = time.time() - start_time
          if step >= 10:
            if not step % 10:             
              # The post is truncated at this print; the completion below and the
              # except/stop lines are a guessed reconstruction of the usual pattern.
              print('%s: step %d, duration = %.3f' % (datetime.now(), step, duration))
        except tf.errors.OutOfRangeError:
          break
      sv.stop()
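
For context, here is a hedged sketch of the flag definitions that main() relies on; only the flag names come from the code above, while the defaults, the script name in the example launch commands, and the host:port values are made up for illustration.

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string('ps_hosts', 'dl1:2222', 'Comma-separated host:port pairs for ps tasks')
flags.DEFINE_string('worker_hosts', 'dl2:2222,dl3:2222', 'Comma-separated host:port pairs for workers')
flags.DEFINE_string('job_name', 'worker', "Either 'ps' or 'worker'")
flags.DEFINE_integer('task_index', 0, 'Index of this task within its job')
flags.DEFINE_integer('issync', 0, '1 = synchronous updates, 0 = asynchronous')
FLAGS = flags.FLAGS

if __name__ == '__main__':
    # Example launch, one process per cluster entry (hypothetical script name,
    # hosts and ports):
    #   python train_dist.py --job_name=ps     --task_index=0 ...
    #   python train_dist.py --job_name=worker --task_index=0 --issync=1 ...
    #   python train_dist.py --job_name=worker --task_index=1 --issync=1 ...
    tf.app.run()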