Pitfalls hit during training:
1. trainroot / valroot not found.
Rename them to match the attribute names on opt: trainRoot and valRoot.
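The mismatch is easy to see with a minimal argparse sketch (the option names mirror train.py, but the paths here are placeholders):

```python
import argparse

# train.py registers --trainRoot / --valRoot in camelCase, so the parsed
# Namespace only exposes opt.trainRoot; reading opt.trainroot raises
# AttributeError.
parser = argparse.ArgumentParser()
parser.add_argument('--trainRoot', required=True)
parser.add_argument('--valRoot', required=True)
opt = parser.parse_args(['--trainRoot', '/data/train', '--valRoot', '/data/val'])

print(opt.trainRoot)               # /data/train
print(hasattr(opt, 'trainroot'))   # False
```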
2. During training (112 train images, 112 val images):
[0/10000][495/10000] Loss: 25.622013
[0/10000][496/10000] Loss: 17.766382
[0/10000][497/10000] Loss: 19.240152
[0/10000][498/10000] Loss: 17.980661
[0/10000][499/10000] Loss: 14.364298
[0/10000][500/10000] Loss: 15.110470
Start val
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    val(crnn, test_dataset, criterion)
  File "train.py", line 158, in val
    preds = preds.transpose(1, 0).contiguous().view(-1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
python train.py --trainRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_capcha_ten_thousand/ --valRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --ngpu 1 --random_sample --cuda --displayInterval 1 --saveInterval 1000 --batchSize 1 --nepoch 10000 --lr 0.0001 --workers 1 --expr_dir expr_captcha --n_test_disp 50
Namespace(adadelta=False, adam=False, alphabet='12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY', batchSize=1, beta1=0.5, cuda=True, displayInterval=1, expr_dir='expr_captcha', imgH=32, imgW=100, keep_ratio=False, lr=0.0001, manualSeed=1234, n_test_disp=50, nepoch=10000, ngpu=1, nh=256, pretrained='', random_sample=True, saveInterval=1000, trainRoot='/export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_capcha_ten_thousand/', valInterval=500, valRoot='/export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/', workers=1)
I don't yet know exactly what causes this; my guess is that n_test_disp is too large,
because the error goes away with the command below (I really need to read the source code properly):
python train.py --trainRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --valRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --ngpu 1 --random_sample --cuda --displayInterval 10 --saveInterval 100 --batchSize 10 --nepoch 10000 --lr 0.0001 --workers 1 --expr_dir expr_captcha --n_test_disp 10 --valInterval 10
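For reference, the RuntimeError above fires whenever preds has ended up 1-D, so transpose(1, 0) has no dim 1 to swap. A minimal reproduction plus a defensive guard (the tensor values are made up; whether batchSize=1 is what squeezes the batch dim away in this PyTorch version is my assumption, not verified):

```python
import torch

# Suppose preds came out as shape [T] instead of [T, b]:
preds = torch.tensor([3, 7, 0, 0, 5])

try:
    preds.transpose(1, 0)  # no dim 1 on a 1-D tensor
except (RuntimeError, IndexError) as e:
    print('reproduced:', e)

# Defensive fix: ensure preds is [T, b] before the transpose.
if preds.dim() == 1:
    preds = preds.unsqueeze(1)  # [T] -> [T, 1]
flat = preds.transpose(1, 0).contiguous().view(-1)
print(flat.tolist())  # [3, 7, 0, 0, 5]
```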
test.py is adapted from demo.py.
Pitfalls hit during testing:
1. When loading the model, the parameter names didn't match: every key in my checkpoint carried an extra `module.` prefix, because the model was saved from a DataParallel wrapper.
My test code:
import torch
from torch.autograd import Variable
from glob import glob
from PIL import Image
from collections import OrderedDict
import utils
import dataset
import models.crnn as crnn

model_path = './expr/netCRNN_0_1000.pth'
img_dir = './dataset/test/output_catptcha_20181106_cut/'
alphabet = '12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY'
nclass = len(alphabet) + 1  # +1 for the CTC blank
nc = 1                      # single (grayscale) input channel
nh = 256
imgH = 32

model = crnn.CRNN(imgH, nc, nclass, nh)
if torch.cuda.is_available():
    model = model.cuda()

# The naive load fails because the checkpoint was saved from a
# DataParallel-wrapped model:
# model.load_state_dict(torch.load(model_path))
# Instead, build a new OrderedDict with the `module.` prefix stripped.
state_dict = torch.load(model_path)
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`
    new_state_dict[name] = v
model.load_state_dict(new_state_dict)
model.eval()

converter = utils.strLabelConverter(alphabet)
transformer = dataset.resizeNormalize((100, 32))

for img_path in glob(img_dir + '*.jpg'):
    image = Image.open(img_path).convert('L')
    image = transformer(image)
    if torch.cuda.is_available():
        image = image.cuda()
    image = Variable(image.view(1, *image.size()))

    preds = model(image)
    _, preds = preds.max(2)
    preds = preds.transpose(1, 0).contiguous().view(-1)

    preds_size = Variable(torch.IntTensor([preds.size(0)]))
    raw_pred = converter.decode(preds.data, preds_size.data, raw=True)
    sim_pred = converter.decode(preds.data, preds_size.data, raw=False)
    print(img_path)
    print('%-20s => %-20s' % (raw_pred, sim_pred))
With 1,000,000 training samples and 100,000 validation samples, validation accuracy is close to 100%, yet accuracy on the test data supplied by the business team is poor.
I then trained on the 100 test images the business team provided and validated on that same set; the results were acceptable.
This shows the feature distribution of my synthetic data differs from the business team's data.
I am now converting all images to grayscale for both training and testing to see whether that helps.
//TODO
Validation logs never print:
batchSize and --valInterval are set too large. For example, with 100 training images and a batch size of 64 there are only two batches per epoch, so validation never triggers. You must keep valInterval < (number of training samples / batchSize).
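The arithmetic can be checked directly (assuming, as in my reading of train.py, that val() is gated by `i % valInterval == 0` on the in-epoch batch counter; the sample counts are the hypothetical ones above):

```python
import math

n_samples, batch_size = 100, 64
batches_per_epoch = math.ceil(n_samples / batch_size)
print(batches_per_epoch)  # 2

# With valInterval=500 the guard `i % valInterval == 0` never fires,
# because i only ever reaches 2 inside an epoch.
val_interval = 500
triggers = [i for i in range(1, batches_per_epoch + 1) if i % val_interval == 0]
print(triggers)  # []
```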
The training data has a problem; judging from the KeyError below, some label contains a space, which is not in the alphabet:
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "train.py", line 177, in trainBatch
    t, l = converter.encode(cpu_texts)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/utils.py", line 51, in encode
    text, _ = self.encode(text)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/utils.py", line 45, in encode
    for char in text
KeyError: ' '
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f569f92c150>> ignored
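A minimal sketch of why encode() blows up, plus one possible cleanup pass before building the lmdb dataset (the dict construction mimics strLabelConverter, and the bad label is invented):

```python
# strLabelConverter builds a char -> index dict from the alphabet, so any
# label character outside the alphabet (here, a space) fails the lookup.
alphabet = '12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY'
char_to_idx = {ch: i + 1 for i, ch in enumerate(alphabet)}  # 0 is the CTC blank

label = 'ab cd'  # a bad label containing a space
try:
    encoded = [char_to_idx[ch] for ch in label]
except KeyError as e:
    print('bad char in label:', repr(e.args[0]))  # bad char in label: ' '

# Drop characters the alphabet cannot represent before writing the lmdb:
cleaned = ''.join(ch for ch in label if ch in char_to_idx)
print(cleaned)  # abcd
```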
This error shows up partway through every run:
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "train.py", line 173, in trainBatch
    data = train_iter.next()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 280, in __next__
    idx, batch = self._get_batch()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
    return self.data_queue.get()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
    new_handle = recv_handle(conn)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
    return _multiprocessing.recvfd(conn.fileno())
OSError: [Errno 4] Interrupted system call
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f1ebbb37f50>> ignored
Solution: https://github.com/pytorch/pytorch/issues/4220
I never found the exact trigger, but after applying the fix from that issue the error stopped.
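For context, Errno 4 is EINTR: a signal arrived mid-syscall. Python 3.5+ retries interrupted syscalls automatically (PEP 475), but on Python 2.7 they surface as OSError/IOError. A generic retry wrapper looks like this (a sketch of the EINTR pattern, not the actual patch from the linked issue):

```python
import errno

def retry_on_eintr(fn, *args):
    """Call fn(*args), retrying if the call is interrupted by a signal."""
    while True:
        try:
            return fn(*args)
        except (OSError, IOError) as e:
            if e.errno != errno.EINTR:
                raise  # real error, not an interruption

# Trivial demo call (never actually interrupted here):
print(retry_on_eintr(len, [1, 2, 3]))  # 3
```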
A model trained on the GPU, loaded for inference on the CPU:
Traceback (most recent call last):
  File "test_chinese.py", line 57, in <module>
    preds = model(image)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/models/crnn.py", line 70, in forward
    conv = self.cnn(input)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
A fix follows, but I haven't actually tried it yet.
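A minimal sketch of the two usual remedies (a toy Conv2d stands in for CRNN, shapes are made up): keep input and weights on the same device, and remap with map_location='cpu' when loading a GPU-saved checkpoint on a CPU-only machine.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 4, 3)              # weights live on the CPU
image = torch.randn(1, 1, 32, 100)

if torch.cuda.is_available():
    image = image.cuda()                # cuda input + cpu weights -> the RuntimeError above
    image = image.cpu()                 # fix: put input and weights on the same device
                                        # (or call model.cuda() instead)

out = model(image)                      # same device: works
print(out.shape)  # torch.Size([1, 4, 30, 98])

# For loading a GPU-saved checkpoint on a CPU-only machine, remap at load time:
# state_dict = torch.load(model_path, map_location='cpu')
```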