Problems encountered while working with rl_abs

Problem 1

Running train_abstractor.py produced the following error:
nohup: ignoring input
start training with the following hyper-parameters:
{'net': 'base_abstractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'n_hidden': 256, 'bidirectional': True, 'n_layer': 1}, 'traing_params': {'optimizer': ('adam', {'lr': 0.001}), 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}
Start training
/root/anaconda3/envs/jjenv_pytorch/lib/python3.6/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train_abstractor.py", line 220, in <module>
    main(args)
  File "train_abstractor.py", line 166, in main
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 211, in train
    log_dict = self._pipeline.train_step()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 107, in train_step
    log_dict.update(self._grad_fn())
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 20, in f
    grad_norm = grad_norm.item()
AttributeError: 'float' object has no attribute 'item'
Downgrading PyTorch from 0.4.1 to 0.4.0 resolved the problem.
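As an alternative to downgrading, here is a minimal sketch of a version-agnostic guard (my own workaround, not the repo's code): in PyTorch 0.4.0 clip_grad_norm_ returns a tensor, while in 0.4.1 it returns a plain Python float, which is why the unconditional grad_norm.item() in training.py breaks. Checking the type first works on both versions:

import torch
from torch.nn.utils import clip_grad_norm_

def clipped_grad_norm(parameters, clip_grad):
    # clip gradients in place and report the total norm;
    # a tensor on PyTorch 0.4.0, a plain float on 0.4.1+
    grad_norm = clip_grad_norm_(parameters, clip_grad)
    if isinstance(grad_norm, torch.Tensor):
        grad_norm = grad_norm.item()
    return grad_norm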


Problem 2

While running train_full_rl.py, the following error occurred:
train step: 6993, reward: 0.4023
train step: 6994, reward: 0.4021
train step: 6995, reward: 0.4025
train step: 6996, reward: 0.4023
train step: 6997, reward: 0.4024
train step: 6998, reward: 0.4025
train step: 6999, reward: 0.4024
train step: 7000, reward: 0.4024
start running validation...
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 182, in train
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 216, in train
    stop = self.checkpoint()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 185, in checkpoint
    val_metric = self.validate()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 171, in validate
    val_log = self._pipeline.validate()
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 178, in validate
    return a2c_validate(self._net, self._abstractor, self._val_batcher)
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 26, in a2c_validate
    for art_batch, abs_batch in loader:
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 451, in __iter__
    return _DataLoaderIter(self)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 239, in __init__
    w.start()
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
Re-running the script after this crash then failed with:

nohup: ignoring input
Warning: METEOR is not configured
loading checkpoint ckpt-2.530862-102000...
loading checkpoint ckpt-2.498729-24000...
loading checkpoint ckpt-2.530862-102000...
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 148, in train
    os.makedirs(join(abs_dir, 'ckpt'))
  File "/root/anaconda3/envs/jjenv_pytorch040/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/data/rl_abs_other/fast_abs_rl/full_rl_model/abstractor/ckpt'

After some googling, the consensus was that this is caused by spawning too many worker processes. The traceback points at the DataLoader starting its workers, so I changed the worker count in train_full_rl.py from 4 to 1, and the script ran fine on the next attempt.
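For reference, a minimal self-contained sketch of the kind of change involved (the repo's actual loader setup differs): each DataLoader worker is a fork()ed copy of the parent process, so when the parent is already large, num_workers=4 can make os.fork() fail with Errno 12; fewer workers mean fewer copies.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100).unsqueeze(1))  # placeholder data

# each worker process is fork()ed from the parent, so reducing the
# worker count reduces memory pressure; num_workers=0 would load
# everything in the main process instead
loader = DataLoader(dataset, batch_size=32, num_workers=1)  # was 4

for (batch,) in loader:
    pass

The FileExistsError in the second traceback is just a side effect of re-running after the crash: the previous attempt had already created full_rl_model/abstractor/ckpt. Deleting the leftover output directory before re-running (or passing exist_ok=True to that os.makedirs call) avoids it.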


Problem 3

A later run of train_full_rl.py failed with a CUDA out-of-memory error:

nohup: ignoring input
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Warning: METEOR is not configured
loading checkpoint ckpt-2.530862-102000...
loading checkpoint ckpt-2.498729-24000...
loading checkpoint ckpt-2.530862-102000...
start training with the following hyper-parameters:
{'net': 'rnn-ext_abs_rl', 'net_args': {'abstractor': {'net': 'base_abstractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'n_hidden': 256, 'bidirectional': True, 'n_layer': 1}, 'traing_params': {'optimizer': ['adam', {'lr': 0.001}], 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}, 'extractor': {'net': 'ml_rnn_extractor', 'net_args': {'vocab_size': 30004, 'emb_dim': 128, 'conv_hidden': 100, 'lstm_hidden': 256, 'lstm_layer': 1, 'bidirectional': True}, 'traing_params': {'optimizer': ['adam', {'lr': 0.001}], 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5}}}, 'train_params': {'optimizer': ('adam', {'lr': 0.0001}), 'clip_grad_norm': 2.0, 'batch_size': 32, 'lr_decay': 0.5, 'gamma': 0.95, 'reward': 'rouge-l', 'stop_coeff': 1.0, 'stop_reward': 'rouge-1'}}
Start training
train step: 1, reward: 0.1918
Traceback (most recent call last):
  File "train_full_rl.py", line 228, in <module>
    train(args)
  File "train_full_rl.py", line 182, in train
    trainer.train()
  File "/data/rl_abs_other/fast_abs_rl/training.py", line 211, in train
    log_dict = self._pipeline.train_step()
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 173, in train_step
    self._stop_reward_fn, self._stop_coeff
  File "/data/rl_abs_other/fast_abs_rl/rl.py", line 64, in a2c_train_step
    summaries = abstractor(ext_sents)
  File "/data/rl_abs_other/fast_abs_rl/decoding.py", line 94, in __call__
    decs, attns = self._net.batch_decode(*dec_args)
  File "/data/rl_abs_other/fast_abs_rl/model/copy_summ.py", line 63, in batch_decode
    attention, init_dec_states = self.encode(article, art_lens)
  File "/data/rl_abs_other/fast_abs_rl/model/summ.py", line 81, in encode
    init_enc_states, self._embedding
  File "/data/rl_abs_other/fast_abs_rl/model/rnn.py", line 41, in lstm_encoder
    lstm_out = reorder_sequence(lstm_out, reorder_ind, lstm.batch_first)
  File "/data/rl_abs_other/fast_abs_rl/model/util.py", line 58, in reorder_sequence
    sorted_ = sequence_emb.index_select(index=order, dim=batch_dim)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58

This looks like the GPU running out of memory. But the previous run had worked fine, so it seemed unlikely that GPU memory would suddenly become insufficient. I rebooted the cloud GPU instance and reran the script; the problem disappeared and the program produced the correct results.
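If rebooting is inconvenient, one thing worth checking first (my guess for why a previously working run suddenly ran out of memory) is whether stale processes from an earlier crashed run are still holding GPU memory. A small sketch:

import subprocess

# show per-process GPU memory usage; killing leftover python processes
# from a crashed run usually frees the memory without a reboot
print(subprocess.check_output(['nvidia-smi']).decode())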


Reposted from: https://www.cnblogs.com/www-caiyin-com/p/10217093.html
