Nabu is a speech recognition system that implements LAS (Listen, Attend and Spell): https://github.com/vrenkens/nabu
LAS is an attention-based encoder-decoder speech recognition model. Unlike traditional models, LAS produces a variable-length output sequence (the decoder is very similar to an RNN language model). For the details of LAS see the paper https://arxiv.org/abs/1508.01211. For how the attention mechanism works see https://arxiv.org/abs/1409.0473 (attention was first applied widely in machine translation). To dig deeper into attention, see Google's https://arxiv.org/abs/1706.03762.
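To make the attention step concrete, here is a minimal numpy sketch of Bahdanau-style attention (the mechanism of https://arxiv.org/abs/1409.0473); all shapes and names are illustrative, not Nabu's actual code:

import numpy as np

def attention_context(encoder_states, decoder_state, W_enc, W_dec, v):
    '''encoder_states: [T, H]; decoder_state: [D] -> context vector [H].'''
    # score every encoder frame against the current decoder state
    scores = np.dot(np.tanh(np.dot(encoder_states, W_enc)
                            + np.dot(decoder_state, W_dec)), v)  # [T]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over time -> [T]
    return np.dot(weights, encoder_states) # weighted sum of frames -> [H]

T, H, D, A = 50, 256, 128, 128             # toy dimensions
rng = np.random.RandomState(0)
ctx = attention_context(rng.randn(T, H), rng.randn(D),
                        rng.randn(H, A), rng.randn(D, A), rng.randn(A))
print(ctx.shape)                           # (256,)

At every output step the decoder recomputes these weights, so it can "attend" to a different part of the audio for each symbol it spells.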
Nabu contains the following files/folders: config data docs doxygen.cfg images LICENCE.md main.md nabu py_filter README.md run. Everything is written in Python, on top of TensorFlow.
Here the pipeline is walked through with the TIMIT dataset:
Data preparation:
Create a database.conf, the database configuration. It contains one section per data source, specifying where on the filesystem data is read and written, and which processor should be used to process it.
In the data preparation stage all data is prepared for training and testing (feature computation, target normalization, etc.).
nabu/scripts/prepare_data.py:
Usage: run data --recipe=/path/to/recipe --expdir=/path/to/expdir --computing=<computing>
- recipe: points to the directory containing the recipe to prepare data for.
- expdir: absolute path to a directory you can write to; all files such as configs and logs are stored there.
- computing: the computing mode (e.g. standard).
/config/recipes/DNN/WSJ/database.cfg
bash -x run data --recipe=./config/recipes/DNN/WSJ/ --expdir=./test --computing='standard'
1. Data preparation stage
The input audio is converted into feature sequences according to the configuration we set.
Notes on prepare_data.py:
bash -x run data --recipe=./config/recipes/DNN/WSJ/ --expdir=./test --computing='standard'
Step 1: the config files are read in. This calls many of the Python scripts under nabu/processing/processors, among them audio_processor.py, feature_computers/feature_computer_factory.py, base.py, processor_factory.py and fbank.py (fbank.py converts the wav data (sig, rate) into an fbank feature sequence; sig is the 1-D array of wav samples, rate is the sample rate, 16000).
The config files read in this step are config/…/database.cfg, feature_processor.cfg and feature_computers/defaults/fbank.cfg. These config files are all written by hand; the values have to be set for each situation.
Path: nabu/scripts/prepare_data.py
database.conf has to be renamed to database.cfg.
for name in parsed_cfg.sections(): # iterates over the section names in database.cfg; for WSJ these are trainfbank, test92fbank, test93fbank, devfbank, trainalignments and devalignments
for item in conf: # conf is a dict; in database.cfg the [section] name is the key and the rest are values, so item runs over type, datafiles, dir, processor_config
(Observation: shorter sentences make training work better.)
fbank vs. mfcc feature sequences:
- fbank: 40 dims (not 39+1, but 40 mel filters), computed with a discrete Fourier transform and a log (both via numpy calls); the final inverse-transform (DCT) step of mfcc is not applied
- mfcc: 39 dims
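A hedged sketch of that fbank pipeline (framing, power spectrum, mel filterbank, log; the real implementation is base.logfbank in feature_computers/base.py, this is only the idea in plain numpy):

import numpy as np

def logfbank_sketch(sig, rate, nfilt=40, winlen=0.025, winstep=0.01, nfft=512):
    # cut the signal into overlapping frames
    flen, fstep = int(winlen * rate), int(winstep * rate)
    n = 1 + max(0, (len(sig) - flen) // fstep)
    frames = sig[np.arange(flen) + fstep * np.arange(n)[:, None]]
    pspec = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft      # power spectrum
    # mel-spaced triangular filters (nfilt of them, hence 40 dims)
    hz2mel = lambda hz: 2595 * np.log10(1 + hz / 700.0)
    mel2hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = mel2hz(np.linspace(hz2mel(0), hz2mel(rate / 2.0), nfilt + 2))
    bins = np.floor((nfft + 1) * pts / rate).astype(int)
    fb = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        fb[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        fb[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    # log filterbank energies; mfcc would additionally apply a DCT after this
    return np.log(np.dot(pspec, fb.T) + 1e-10)                 # [n, nfilt]

With include_energy and the ddelta dynamics from fbank.cfg, 40+1 dims times 3 gives the 123-dim features seen later (41*3).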
- python nabu/scripts/prepare_data.py --recipe=./config/recipes/DNN/WSJ/ --expdir=./test --computing=standard
processing trainfbank
Traceback (most recent call last):
File "nabu/scripts/prepare_data.py", line 89, in <module>
main(FLAGS.expdir, FLAGS.recipe, FLAGS.computing)
File "nabu/scripts/prepare_data.py", line 73, in main
data.main(os.path.join(expdir, name))
File "/data/yelong/nabu/nabu/scripts/data.py", line 56, in main
processed = processor(dataline)
File "/data/yelong/nabu/nabu/processing/processors/audio_processor.py", line 44, in __call__
rate, utt = _read_wav(dataline)
File "/data/yelong/nabu/nabu/processing/processors/audio_processor.py", line 105, in _read_wav
(rate, utterance) = wav.read(wavfile)
File "/home/yelong/anaconda2/lib/python2.7/site-packages/scipy/io/wavfile.py", line 236, in read
file_size, is_big_endian = _read_riff_chunk(fid)
File "/home/yelong/anaconda2/lib/python2.7/site-packages/scipy/io/wavfile.py", line 168, in _read_riff_chunk
"understood.".format(repr(str1)))
ValueError: File format '\xea\xff\xeb\xff'… not understood.
[yelong@gpu36 nabu]$
Linux Python debugging tool: pdb
2. For Python debugging on Linux I have always just printed intermediate variables (commenting code out below the point of interest); an early exit(0) also works.
2018.12.5
audio_processor.py does import scipy.io.wavfile as wav; later, rate, utt = _read_wav(dataline) runs if os.path.exists(wavfile): (rate, utterance) = wav.read(wavfile).
This only handles the wav audio format; flac and pcm both fail… So either convert everything to wav first, or read the audio data some other way instead of wav.read.
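scipy.io.wavfile.read only understands files with a RIFF/WAV header, which is why the headerless PCM above blows up in _read_riff_chunk. A hedged sketch of one workaround for raw 16 kHz / 16-bit mono PCM (read the samples directly; flac would still need a converter such as sox first):

import numpy as np

def read_pcm(pcmfile, rate=16000):
    # raw little-endian 16-bit samples, no header to parse
    utterance = np.fromfile(pcmfile, dtype=np.int16)
    return rate, utterance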
feature_computer.py:
self.conf = dict(conf.items('feature')) # self.conf = {'feature': 'fbank'}
apply_defaults(self.conf, default) # (calls default_conf.py) merges fbank.cfg into conf (fbank.cfg is editable and ships with initial values, at nabu/nabu/processing/processors/feature_computers/defaults/fbank.cfg), so self.conf = {'winstep': '0.01', 'lowfreq': '0', 'dynamic': 'ddelta', 'feature': 'fbank', 'preemph': '0.97', 'nfilt': '40', 'winlen': '0.025', 'include_energy': 'True', 'nfft': '512'}
nabu/scripts/data.py:
Reads the samples out of each wav and stores them as individual data files. The resulting files are:
- data (a folder holding one feature-sequence file per wav)
- dim (123 dims)
- max_length (1104) (the longest utterance, in frames?)
- pointers.scp (an scp mapping audio file names (including .wav) to the corresponding data paths, e.g.
011c0201.wav /data/yelong/nabu/datatest/features/data/file0
)
- sequence_length_histogram.npy (numpy-format data) (inspect it in a Python shell: >>> c = np.load("sequence_length_histogram.npy"); >>> c gives array([0, 0, 0, …, 0, 0, 1], dtype=int32)) (almost all zeros with a single 1?? 1105 entries) — revisit data.py later
(in this test run the output path is /datatest/features)
Errors and fixes:
- python nabu/scripts/prepare_data.py --recipe=./config/recipes/DNN/WSJ/ --expdir=./test --computing=standard
processing trainfbank
Traceback (most recent call last):
File "nabu/scripts/prepare_data.py", line 89, in <module>
main(FLAGS.expdir, FLAGS.recipe, FLAGS.computing)
File "nabu/scripts/prepare_data.py", line 73, in main
data.main(os.path.join(expdir, name))
File "/data/yelong/nabu/nabu/scripts/data.py", line 56, in main
processed = processor(dataline)
File "/data/yelong/nabu/nabu/processing/processors/audio_processor.py", line 47, in __call__
features = self.comp(utt, rate)
File "/data/yelong/nabu/nabu/processing/processors/feature_computers/feature_computer.py", line 43, in __call__
feat = self.comp_feat(sig, rate)
File "/data/yelong/nabu/nabu/processing/processors/feature_computers/fbank.py", line 28, in comp_feat
feat, energy = base.logfbank(sig, rate, self.conf)
File "/data/yelong/nabu/nabu/processing/processors/feature_computers/base.py", line 132, in logfbank
feat, energy = fbank(signal, samplerate, conf)
File "/data/yelong/nabu/nabu/processing/processors/feature_computers/base.py", line 92, in fbank
highfreq = int(conf['highfreq'])
KeyError: 'highfreq'
base.py needs highfreq, but the stock fbank.cfg does not define it, so add highfreq = 5000 to fbank.cfg.
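An alternative (a sketch, not what Nabu actually does) would be to make base.py fall back to the Nyquist frequency when the option is missing; conf is the plain dict built from the cfg file, so dict.get works:

highfreq = int(conf.get('highfreq', samplerate // 2))  # default to Nyquist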
Still to read: base.py
WSJ's database.cfg has an alignment_processor and I don't know how to set its path, since datafiles = /path/to/pdfs. So do the pdfs have to be generated first before this works? (Looking at alignment_processor.py, it turns out to be for Kaldi alignments.)
A run that works:
bash -x run data --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=./test --computing='standard'
- python nabu/scripts/prepare_data.py --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=./test --computing=standard
processing trainfbank
processing traintext
Traceback (most recent call last):
File "nabu/scripts/prepare_data.py", line 89, in <module>
main(FLAGS.expdir, FLAGS.recipe, FLAGS.computing)
File "nabu/scripts/prepare_data.py", line 73, in main
data.main(os.path.join(expdir, name))
File "/data/yelong/nabu/nabu/scripts/data.py", line 32, in main
proc_cfg.get('processor', 'processor'))(proc_cfg)
File "/data/yelong/nabu/nabu/processing/processors/text_processor.py", line 29, in __init__
if conf.get('processor', 'nonesymbol') != 'None':
File "/home/yelong/anaconda2/lib/python2.7/ConfigParser.py", line 618, in get
raise NoOptionError(option, section)
ConfigParser.NoOptionError: No option 'nonesymbol' in section: 'processor'
The above concerns database.cfg's [trainfbank] section.
Loading the text_processor config
For the [traintext] section of DBLSTM/TIMIT/database.cfg, nabu/processing/processors/text_processor.py contains this block:
"""
if conf.get('processor', 'nonesymbol') != 'None':
self.nonesymbol = conf.get('processor', 'nonesymbol')
else:
self.nonesymbol = ''
"""
# This raises at run time (as in the traceback above).
# Comment out the code above and write directly:
self.nonesymbol = ''
# (only for the case where text_processor.cfg has no nonesymbol option)
Alternatively, write nonesymbol=None in text_processor.cfg directly.
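A tidier variant of the same fix (a sketch against the Python 2 ConfigParser API, not Nabu's own code) asks whether the option exists before reading it:

if conf.has_option('processor', 'nonesymbol') and \
        conf.get('processor', 'nonesymbol') != 'None':
    self.nonesymbol = conf.get('processor', 'nonesymbol')
else:
    self.nonesymbol = ''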
After running bash -x run data --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=./test --computing='standard'
we get (under /datatest/normalized/train):
- alphabet: sil aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p r s sh t th uh uw v w y z (39 units)
- data: a folder containing the text for each wav, in binary format…
- dim: 39
- max_length: the length of the longest text
- nonesymbol
- pointers.scp, e.g.: 011c0201 sil /data/yelong/nabu/datatest/normalized/train/data/file0 (sil is the first phone of the corresponding text)
- sequence_length_histogram.npy: array([0, 0, 0, 0, 9, 0, 0, 1], dtype=int32), len(c)=8
Added to the run script:
if [ $command = 'data' ]; then
rm -rf datatest/normalized/
rm -rf datatest/features/
fi
2. Training
2018.12.6
In the training stage the model is trained to minimize a loss function. During training the model can be evaluated, so the learning rate can be adjusted when necessary. Training uses several config files from the recipe:
Python files involved: prepare_train.py, local_cluster.py, train.py, create_cluster.py
Runs that work:
bash -x run train --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=/data/yelong/nabu/test2 --mode='single_machine' --computing='standard'
or:
bash -x run train --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=/data/yelong/nabu/test1
prepare_train.py
computing_cfg = dict(parsed_computing_cfg.items('computing')) # {'gpus': '0 1', 'numps': '1', 'numworkers': '2'}
- label
- the ber algorithm, in tensor2tensor
Think about what the inputs and outputs are.
What are the inputs and outputs in train?
bash -x run train --recipe=./config/recipes/DBLSTM/TIMIT/ --expdir=/data/yelong/nabu/test2 --mode='non_distributed' --computing='standard'
trainer.py
conf = dict(conf.items('trainer')) # {'trainer': 'standard', 'loss': 'CTC', 'trainlabels': '1', 'text': 'traintext', 'valid_frequency': '500', 'batch_size': '8', 'num_tries': '5', 'num_epochs': '30', 'learning_rate_decay': '1', 'targets': 'text', 'features': 'trainfbank'}
num_replicas = len(cluster.as_dict()['worker']) # 2
input_dataconfs:[[{'datafiles': '/data/yelong/nabu/wav_text/train/wav.scp', 'type': 'audio_feature', 'dir': '/data/yelong/nabu/datatest1/features/train', 'processor_config': 'config/recipes/DBLSTM/TIMIT/feature_processor.cfg'}]]
target_dataconfs:[[{'datafiles': '/data/yelong/nabu/wav_text/train/wsj_train.txt', 'type': 'string', 'dir': '/data/yelong/nabu/datatest1/normalized/train', 'processor_config': 'config/recipes/DBLSTM/TIMIT/text_processor.cfg'}]]
Line 881 errored.
nabu/neuralnetworks/trainers/trainer.py: line 329
Generate wav.scp:
ls | grep ".WAV" | while read line;do echo $line '/data/yelong/nabu/wav_text_timit/train/timit_train/'$line >>./wav.scp; done
Copy the wav audio:
ls | grep ".WAV" | while read line;do cp $line /data/yelong/nabu/wav_text_timit/train/wav_train/$line;done
Generate text.TXT:
ls | grep ".TXT" | while read line; do cat ${line} >> /data/yelong/nabu/wav_text_timit/train/timit_train.TXT; done
Copy the text files:
ls | grep ".TXT" | while read line;do cp $line /data/yelong/nabu/wav_text_timit/train/text/$line;done
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./exp_timit --computing='standard'
2018.12.7
Copied the tools under /speech/open_data/kaldi/ into tool in my own directory.
Uppercase-to-lowercase conversion:
cat wsj_train.txt | sed 's/[A-Z]/\l&/g' > test.txt
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test5 --computing='standard'
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=/data/yelong/nabu/test6 --mode='single_machine' --computing='standard'
2018.12.8
011c020f.wav-0 was not found in all sets of data, ignoring this example: the error is probably because the earlier step generated file0, file1, … while this refers to 011c020f.wav-0.
Jump to the path under the cursor in vim: gf
nabu/processing/tfwriters/array_writer.py:
shape_feature : bytes_list { value: "\257\002\000\000\000\000\000\000{\000\000\000\000\000\000\000" }
tfwriter.py:
def write(self, data, name): # data: a 2-D array, 1064*123
It carries the @abstractmethod marker, i.e. it is an abstract method, and the call still dispatches to it!!…
Changed filename = os.path.join(self.write_dir, 'file%d' % self.filenum)
# (this is where file0 came from) to filename = os.path.join(self.write_dir, name),
but that exposes a bug in an earlier step:
Split one big txt file into a separate txt per line:
cat wsj_train_text | while read line;do awk -F' ' '{print $0 > $1".txt"}';done
[traintext]
datafiles = # a .txt file, not a folder
- modeling units
pointers.scp under normalized:
011c0202 the /data/yelong/nabu/datatest1/normalized/train/data/file0
011c0203 long /data/yelong/nabu/datatest1/normalized/train/data/file1
011c0204 i /data/yelong/nabu/datatest1/normalized/train/data/file2
011c0205 it /data/yelong/nabu/datatest1/normalized/train/data/file3
011c0206 i /data/yelong/nabu/datatest1/normalized/train/data/file4
011c0207 i /data/yelong/nabu/datatest1/normalized/train/data/file5
011c0208 a /data/yelong/nabu/datatest1/normalized/train/data/file6
011c0209 however /data/yelong/nabu/datatest1/normalized/train/data/file7
011c020a instead /data/yelong/nabu/datatest1/normalized/train/data/file8
input_pipeline.py:
changed (n, f) = line.strip().split('\t')
to:
n = line.strip().split('\t')[0]
f = line.strip().split('\t')[-1]
because a line of the normalized pointers.scp looks like 011c0202 the /data/yelong/nabu/datatest1/normalized/train/data/file0
- embedding…?
2018.12.9: first run Kaldi's TIMIT recipe to get wav.scp and text, then hand those to nabu's config files.
Note on text: it must not be just an audio id plus a string of English words, since that cannot be mapped to labels; the label modeling units here come from the TIMIT dataset.
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test --computing='standard'
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test3 --mode='single_machine' --computing='standard'
2018.12.10
2018-12-10 15:07:46.824751: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
self.gen.next()
File "/home/yelong/.local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 467, in raise_exception_on_not_ok_status
c_api.TF_GetCode(status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [sil s cl p eh sh el cl t ae s cl k f ao r s ix z r eh s cl k y uw hh aa s cl ix vcl jh ix z f m cl k ih vcl n ae cl p er s sil]
Why is fbank 123-dimensional?
target_list_as_strings, status, None)
File "/home/yelong/anaconda2/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/yelong/.local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 467, in raise_exception_on_not_ok_status
c_api.TF_GetCode(status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [sil s cl p eh sh el cl t ae s cl k f ao r s ix z r eh s cl k y uw hh aa s cl ix vcl jh ix z f m cl k ih vcl n ae cl p er s sil]
[[Node: validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/assert_equal/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:worker/replica:0/task:0/cpu:0"](validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/assert_equal/All_G71, validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/ParseSingleExample/Squeeze_data)]]
Changing the machine configuration:
vim config/computing/standard/single_machine.cfg
#the number of parameter servers (minimum 1)
numps = 1
#the number of workers (minimum 1)
numworkers = 4
#the IDS of the worker GPUs, set to non existing IDs to use CPU
gpus = 0 1 2 3 # changed to four GPUs (we have four)
Check with env
export CUDA_VISIBLE_DEVICES=0,1,2,3
At this point the graph still has not been built…
2018-12-10 18:05:46.034928: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 029059bad0bf6918 with config:
/home/yelong/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test3 --mode='single_machine' --computing='standard'
trainer.py (line 919):
def _create_graph(self):
    '''
    create the trainer computational graph

    Returns:
        - a dictionary of graph outputs
    '''
cluster = tf.train.ClusterSpec(self.server.server_def.cluster) # it hangs on this line
Invalid argument: assertion failed: [sil s cl p eh sh el cl t ae s cl k f ao r s ix z r eh s cl k y uw hh aa s cl ix vcl jh ix z f m cl k ih vcl n ae cl p er s sil]
The first thing to solve is not why it keeps printing 2018-12-10 14:45:46.796101: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
(the session stuck waiting), but the dense-tensor-of-unknown-shape
problem.
sed 's/ /\n/g' test.txt | grep -wvf a.txt - | cat
2018.12.11
Debugging TF: variables can only be inspected through sess.run, which makes debugging painful.
Once the network is written, use tf.Print to check that the parameters are right.
sil aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p r s sh t th uh uw v w y z
sil ax s uw m f ao r ix vcl z ae m cl p uh l ax s ix cl ch uw ey sh en w eh er f aa r m hh eh z ax cl p ae cl k iy ng sh eh vcl d sil ae n vcl d f iy l vcl s sil
timit=/data/yelong/wav_source/data/lisa/data/timit/raw/TIMIT
local/timit_data_prep.sh $timit
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_data_39 --computing='standard'
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_39 --mode='single_machine' --computing='standard'
Find out how the labels are passed in; add logging and watch the id conversion.
aa ae ah ao aw ax ax er ay b vcl ch d vcl dh dx eh el m en ng epi er ey f g vcl sil hh hh ih ix iy jh k cl l m n ng n ow oy p sil cl r s sh t cl th uh uw uw v w y z zh:
[sil l eh cl t s cl t ey cl k ax m hh ow m sil]
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_data_48 --computing='standard'
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_48 --mode='single_machine' --computing='standard'
2018.12.12
Set nonesymbol = sil in text_processor.cfg, because a line self.alphabet + [self.nonesymbol] was added in processing/processors/textfile_processor.py and the earlier value ' ' might interfere; testing this now. Update: changed it back to ''.
Yesterday I spent the whole day on the dense-tensor problem, reading the code line by line without understanding it and without seeing the inputs and outputs. Single-stepping with breakpoints was hopeless: the GPU was busy, everything lagged, and I couldn't even type to inspect variables. Gu pointed out there is no need to read line by line, just log the inputs and outputs. Later Taotao reminded me that the real question is what the alphabet gets converted into, so the focus is the alphabet: grep for where alphabet is used, and check what the variable becomes at each site.
See how the alphabet is converted to ids, and its inputs and outputs.
Invalid argument: assertion failed: [sil s cl p eh sh el cl t ae s cl k f ao r s ix z r eh s cl k y uw hh aa s cl ix vcl jh ix z f m cl k ih vcl n ae cl p er s sil] [[Node: validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/assert_equal/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:worker/replica:0/task:0/cpu:0"]
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/yelong/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
-----------------------------------neuralnetworks/decoders/beam_search_decoder.py---------------alphabet
['sil', 'aa', 'ae', 'ah', 'aw', 'ay', 'b', 'ch', 'd', 'dh', 'dx', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z']
-----------------------------------neuralnetworks/decoders/beam_search_decoder.py---------------alphabet
['sil', 'aa', 'ae', 'ah', 'aw', 'ay', 'b', 'ch', 'd', 'dh', 'dx', 'eh', 'er', 'ey', 'f', 'g', 'hh', 'ih', 'iy', 'jh', 'k', 'l', 'm', 'n', 'ng', 'ow', 'oy', 'p', 'r', 's', 'sh', 't', 'th', 'uh', 'uw', 'v', 'w', 'y', 'z']
sil aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p r s sh t th uh uw v w y z
aa ae ah ao aw ax ax er ay b vcl ch d vcl dh dx eh el m en ng epi er ey f g vcl sil hh hh ih ix iy jh k cl l m n ng n ow oy p sil cl r s sh t cl th uh uw uw v w y z z
aa ae ah ao aw ax ay b ch cl d dh dx eh el en epi er ey f g hh ih ix iy jh k l m n ng ow oy p r s sh sil t th uh uw v vcl w y z zh
Changed output_dims = 39 in model.cfg to output_dims = 48; not sure it's right.
Changed the alphabet in recognizer.cfg to the 48-unit one.
Earlier nonesymbol was set to sil; one line of code does alphabet = [nonesymbol] + alphabet, so sil got added again even though the alphabet already contains it. Setting nonesymbol back to '' seems to fix it.
Fixes
It now runs through, and the Converting sparse IndexedSlices to a dense Tensor of unknown shape problem no longer shows up. The changes:
0. First run Kaldi's timit_data_prep.sh under timit. Inside it, cat ${x}.trans | $local/timit_norm_trans.pl -i - -m $conf/phones.60-48-39.map -to 48 | sort > $x.text || exit 1; produces the 48 phones, i.e. the normalization step is what maps text to phones. (TIMIT comes with a dictionary: doc/TIMITDIC.TXT in the dataset.) (Alternatively, change the 48 to 39 here and the steps below become unnecessary.)
- changed nonesymbol in text_processor.cfg to ''
- changed output_dims = 39 in model.cfg to output_dims = 48
- changed alphabet in all config files from alphabet = sil aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p r s sh t th uh uw v w y z to alphabet = aa ae ah ao aw ax ay b ch cl d dh dx eh el en epi er ey f g hh ih ix iy jh k l m n ng ow oy p r s sh sil t th uh uw v vcl w y z zh, i.e. from the 39-phone set to the 48-phone set; these are the phones that get turned into label ids.
Reportedly, when reading TensorFlow code there is no need to study the insides of with…as blocks; just look at which functions are called and what their inputs and outputs are.
Functions used by train:
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_48.2 --mode='single_machine' --computing='standard'
nohup ./run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_48.2 --mode='single_machine' --computing='standard' > myout.out 2>&1 &
3. Testing
nohup ./run test --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_48.2 --computing='standard' > myout.out 2>&1 &
Output: test_train_48.2/test/result loss = 0.259889
Modified prepare_test.py:
The symlink was problematic, so some code is simply commented out here and the model is copied to test/model directly (inside test_train_48.2/test: cp -r ../model ./).
#create the testing dir
#if os.path.isdir(os.path.join(expdir, 'test')):
#    shutil.rmtree(os.path.join(expdir, 'test'))
#os.makedirs(os.path.join(expdir, 'test'))

#copy the config files
shutil.copyfile(database_cfg_file,
                os.path.join(expdir, 'test', 'database.conf'))
shutil.copyfile(evaluator_cfg_file,
                os.path.join(expdir, 'test', 'test_evaluator.cfg'))

#create a link to the model
#os.symlink(os.path.join(expdir, 'model'),
#           os.path.join(expdir, 'test', 'model'))
tf.trainable_variables()
[<tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(251, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(251, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/memory_layer/kernel:0’ shape=(256, 128) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel:0’ shape=(433, 512) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/kernel:0’ shape=(256, 512) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/bahdanau_attention/query_layer/kernel:0’ shape=(128, 128) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/bahdanau_attention/attention_v:0’ shape=(128,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/dense/kernel:0’ shape=(384, 49) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/dense/bias:0’ shape=(49,) dtype=float32_ref>]
for param in tf.trainable_variables():
Listener/features/layer0/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0' shape=(512,) dtype=float32_ref>
Listener/features/layer1/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0' shape=(640, 512) dtype=float32_ref>
4. Decoding
nohup ./run decode --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_48.2 --computing='standard' > myout4.out 2>&1 &
Same changes as for test: create a decode folder and copy the model into it.
Computing WER:
Built decoded and reference files.
decoded comes from: /data/yelong/nabu1/nabu/test_train_48.2/decode/decoded
reference comes from: /data/yelong/kaldi-master/egs/timit/s5/data/local/data/test.text
decoded contents: (192 utterances; each file's first line is taken, the best beam-search hypothesis) (script to build decoded: find ./ -not -name "*.npy" |sort -u | while read line; do awk 'NR==1{print}' $line >> decoded;done
)
-0.721640 sil dh ih zh er ax l iy v iy z ax cl p r aa cl k s ix m ey dx ix vcl d w eh dx iy w ix s f iy l iy ng w ix th ih n ih m s eh l f sil
-1.229635 sil y iy s ao vcl ax m ao l w iy z vcl d ix vcl g eh dh er vcl b ow z iy er z sil
-0.516062 sil b eh cl t s eh cl ch cl k ey s ix z w er er ix n dh ax cl p ae s ae n y uw zh uw el sil
reference contents: (192 utterances, from test.text) (in the decoded directory: find ./ -not -name "*.npy" | sort -u | sed -n '2, 100p' | awk -F'/' '{print $2}' | while read line ;do grep -r $line /data/yelong/wav_source/librispeech/wavscp/testid5k.txt >> reference ;done)
FDHC0_SI1559 sil v ih zh uw ax l iy dh iy z ix cl p r aa cl k s ix m ey dx ix vcl w ah dx iy w ix s f iy l iy n w ih cl th ih n ih m s eh l f sil
FDHC0_SI2189 sil y uw s ao dh ah m ao w iy z cl t ix vcl g eh dh er vcl dh ow z y ih er z sil
FDHC0_SI929 sil b ah cl t s ah cl ch cl k ey s ih z epi w er ix n dh ax cl p ae s cl t ah n y uw zh uw el sil
nabu has a WER program, but it is only used after the Kaldi steps, so I rewrote it. WER is based on minimum edit distance (a Viterbi-style dynamic program), so the sequences being aligned must not be too long or the computation explodes. My first attempt concatenated all the texts into one enormous sentence and computed WER on that: far too expensive, wrong approach. The fix is to compare line by line, accumulate the edit-distance counts, and compute the WER at the end.
The rewritten WER program:
import numpy as np

def main(reference, decoded):
    '''main function

    args:
        reference: the file containing the reference utterances
        decoded: the directory containing the decoded utterances
    '''
    substitutions = 0
    insertions = 0
    deletions = 0
    numwords = 0
    with open(reference) as fid, open(decoded) as did:
        for r1, r2 in zip(fid, did):
            reftext = r1.strip().split()[1:]
            output = r2.strip().split()[1:]
            s, i, d = wer(reftext, output)
            substitutions += s
            insertions += i
            deletions += d
            numwords += len(reftext)

    substitutions /= numwords
    deletions /= numwords
    insertions /= numwords
    error = substitutions + deletions + insertions

    print(
        'word error rate: %s\n\tsubstitutions: %s\n\tinsertions: %s\n\t'
        'deletions: %s' % (error, substitutions, insertions, deletions))

def wer(reference, decoded):
    '''
    compute the word error rate

    args:
        reference: a list of the reference words
        decoded: a list of the decoded words

    returns
        - number of substitutions
        - number of insertions
        - number of deletions
    '''
    errors = np.zeros([len(reference) + 1, len(decoded) + 1, 3])
    errors[0, :, 1] = np.arange(len(decoded) + 1)
    errors[:, 0, 2] = np.arange(len(reference) + 1)
    substitution = np.array([1, 0, 0])
    insertion = np.array([0, 1, 0])
    deletion = np.array([0, 0, 1])
    for r, ref in enumerate(reference):
        for d, dec in enumerate(decoded):
            errors[r+1, d+1] = min((
                errors[r, d] + (ref != dec)*substitution,
                errors[r+1, d] + insertion,
                errors[r, d+1] + deletion), key=np.sum)

    return tuple(errors[-1, -1])

if __name__ == '__main__':
    main('reference', 'decoded')
Result: WER 30.08% (word error rate: 0.30076230076230076, substitutions: 0.21025641025641026, insertions: 0.02993762993762994, deletions: 0.06056826056826057)
LibriSpeech dataset
Swap TIMIT for LibriSpeech and redo the above
- BPE algorithm (using the sentencepiece repo from GitHub as-is)
- word-piece model
- understand the beam search algorithm
Study beam search
nohup ./run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_data_test --computing='standard' > myout_data.out 2>&1 &
Parameter search
Tune the network structure by varying the number of layers and units in model.cfg; try each setting and keep the model that works best.
Create a sweep.txt, e.g.:
4layers_1024units
model.cfg encoder num_layers 4
model.cfg encoder num_units 1024
4layers_2048units
model.cfg encoder num_layers 4
model.cfg encoder num_units 2048
5layers_1024units
model.cfg encoder num_layers 5
model.cfg encoder num_units 1024
5layers_2048units
model.cfg encoder num_layers 5
model.cfg encoder num_units 2048
Run: run sweep --command=<command> --sweep=/path/to/sweepfile --expdir=/path/to/exdir <command option>
2018.12.15
Checked how Kaldi's TIMIT maps the input text onto modeling units. Answer: the TIMIT dataset already ships .phn files saying which frames are which phone; the modeling units come with the dataset.
subword-nmt learn-bpe -s 100000 < lib.txt > code.txt
subword-nmt apply-bpe -c code.txt < test.txt > out.txt
sed -r 's/(@@ )|(@@ ?$)//g'
Delete the first column: sed -e 's/[^ ]* //' file
2018.12.16
Tokenization here means taking a word as the label: [word]
SentencePiece treats the input text as a sequence of Unicode characters; whitespace is handled as an ordinary symbol too.
2018.12.17
Copy the first 100 files:
ls | head -100 | xargs -i cp -r {} /data/yelong/bpe_test/wav/train
Show a few (middle) lines of a file:
sed -n '5,10p' filename shows only lines 5 through 10.
pcm to wav:
sox -t raw -c 1 -e signed-integer -b 16 -r 16000 test.pcm test.wav
ls | grep ".pcm" | awk -F'.' '{print $1}' | while read line ;do sox -t raw -c 1 -e signed-integer -b 16 -r 16000 ${line}.pcm ./wav/${line}.wav;done
Generate wav.scp (in /data/yelong/bpe_test/wav/dev/wav):
ls | while read line;do echo $line '/data/yelong/bpe_test/wav/test/wav/'$line >> ../wav.scp; done
bash -x run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_data --computing='standard'
nohup ./run data --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_data --computing='standard' > data.out 2>&1 &
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train --mode='single_machine' --computing='standard'
nohup ./run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train --mode='single_machine' --computing='standard' > train.out 2>&1 &
- First convert pcm to wav: sox -t raw -c 1 -e signed-integer -b 16 -r 16000 test.pcm test.wav
- Generate wav.scp and text.txt
bash -x run train --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train_temp --mode='single_machine' --computing='standard'
sed 's/,//g'
Kept getting xx.wav-0 was not found in all sets of data, ignoring this example all day. My wav.scp was wrong: I wrote 1001-134707-0022.wav /data/yelong/bpe_test/wav/train/wav/1001-134707-0022.wav
with a spurious .wav in the utterance name; it should be 1001-134707-0022 /data/yelong/bpe_test/wav/train/wav/1001-134707-0022.wav
Delete the .wav in the first column:
sed 's/.wav//g' wav.scp | awk '{print $1" "$2".wav"}'>wav2.scp (wrong! this also deletes the wav inside the path)
Weekly meeting:
Is training stable?
Additions to speech LAS?
Isn't CMVN supposed to be mandatory?
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe_test.model")
with open('/data/yelong/bpe_test/wav/dev/text.txt', 'a') as fid, open('/data/yelong/bpe_test/wav/dev/dev.txt') as did:
    for line in did:
        a = line.strip().split()[1:]  # eg. "TWO COME MUSE MIGRATE"
        aa = ' '.join([t for t in a])
        listid = sp.EncodeAsIds(aa)
        strid = ' '.join([str(t) for t in listid])
        b = line.strip().split()[:1]
        b = ''.join([t for t in b])
        fid.write(b + ' ' + strid + '\n')
2018.12.18
During training:
OOM when allocating tensor with shape[32,320032,128]
This is GPU memory exhaustion: reduce batch_size in recognizer.cfg, test_evaluator.cfg, trainer.cfg and validation_evaluator.cfg. For now trainer.cfg is set to 64 and the validation/test ones to 4.
- Where does the 320032 come from…?
nvidia-smi: while TensorFlow runs it reserves all GPU memory by default (GPU Memory Usage shows everything taken, i.e. the addresses are claimed up front), but that does not mean it is actually used; Volatile GPU-Util on the right shows real utilization (reserved is not the same as used).
The number of iterations depends on batch_size.
- how is data fed; placeholder?
Where is the Session created?
How many training iterations are appropriate?
- understand CTC
- walk through the remaining nabu and kaldi steps once
- learn TensorBoard
train : test : dev ratio: 8:1:1
a 20k-dimensional output per sentence?
Need to solve UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. Otherwise a huge chunk of memory gets allocated up front and little is left afterwards; training still works despite the warning, but it should be fixed.
It happens because some parameter is not passed in: when building the sparse matrix, say a 1×N one, TensorFlow has to convert it to an M×M matrix, and since M is unknown it allocates something huge. Search for what triggers the warning and pass that parameter in by hand.
The constructed matrix is independent of the training-set size; it depends only on batch_size, the feature dimension dim, the utterance length (frame count), etc. batch_size is how many utterances are processed at a time, batch after batch (e.g. 1280 training utterances with batch_size = 128 means 10 loops); each constructed matrix (tensor) has a fixed size (though not in LAS, where every utterance has a variable length…). See the padding sketch below.
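A small numpy sketch of that layout (illustrative, not Nabu's code): variable-length utterances are zero-padded into one [batch_size, max_len, dim] tensor, with the true lengths kept on the side.

import numpy as np

def pad_batch(features):
    '''features: list of [len_i, dim] arrays -> ([B, max_len, dim], [B]).'''
    dim = features[0].shape[1]
    lengths = np.array([f.shape[0] for f in features])
    batch = np.zeros((len(features), lengths.max(), dim), dtype=np.float32)
    for i, f in enumerate(features):
        batch[i, :f.shape[0]] = f   # shorter utterances stay zero-padded
    return batch, lengths

utts = [np.ones((n, 123)) for n in (200, 1104, 650)]
batch, lens = pad_batch(utts)
print(batch.shape, lens)            # (3, 1104, 123) [ 200 1104  650]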
Testing:
bash -x run test --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train --computing='standard'
(no idea why it fails under nohup in the background but works with bash -x directly…? makes no sense)
Decoding:
bash -x run decode --recipe=./config/recipes/LAS/TIMIT/ --expdir=./test_train --computing='standard' --mode='single_machine'
This step runs very fast.
The three ways TensorFlow reads data:
see https://blog.csdn.net/gsww404/article/details/78083169
- Preloaded data
- Feeding: Python produces the data and feeds it to the backend (feed_dict; the slowest, fine for small training sets, not for large ones); see the sketch after this list
- Reading from file: read directly from files (queues)
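A minimal TF1 sketch of the Feeding option (toy graph, made-up shapes): build a placeholder and hand numpy arrays to sess.run via feed_dict.

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 123])   # [batch, feature_dim]
y = tf.reduce_mean(x)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: np.random.randn(8, 123)}))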
train: 1 - 224993 (281241*0.8) (224993 utterances)
test: 224994 - 253117 (224993 + 28124 = 253117) (28124 utterances)
dev: 253118 - 281241 (253117 + 28124 = 281241) (28124 utterances)
ls | head -224993 | xargs -i cp -r {} /data/yelong/wav_source/librispeech/train
ls | sed -n '224994,253117p'| xargs -i cp -r {} /data/yelong/wav_source/librispeech/test
ls | sed -n '253118,281241p'| xargs -i cp -r {} /data/yelong/wav_source/librispeech/dev
In validation_evaluator.cfg, the decoding part sets "the maximum number of output steps", max_steps = 100. How to read this: the output vector really does have at most 100 steps (variable length below that), not 20000.
It seems harmless; it is the maximum number of decoding steps, used by tf.contrib.seq2seq.dynamic_decode.
recognizer.cfg: roughly the same fields as test_evaluator.cfg; the recognizer config used for decoding contains fields similar to the evaluator config.
ls | sed -n '189656,253117p'| xargs -i cp -r {} /data/yelong/wav_source/librispeech/train
ls | sed -n '224994,253117p'| xargs -i mv {} /data/yelong/wav_source/librispeech/test
Generate wav.scp:
ls | awk -F'.' '{print $1" /data/yelong/wav_source/librispeech/train/"$1".wav"}' > /data/yelong/wav_source/librispeech/wavscp/train_wav.scp
Generate text:
sed -n '253118,281241p' lib_sort.txt > /data/yelong/wav_source/librispeech/wavscp/dev.txt
ls | awk -F'.' '{print $1" /data/yelong/wav_source/librispeech/test/"$1".wav"}' > /data/yelong/wav_source/librispeech/wavscp/test_wav.scp
ls | awk -F'.' '{print $1}' | while read line ;do sox -t raw -c 1 -e signed-integer -b 16 -r 16000 ${line}.pcm /data/yelong/wav_source/librispeech/test1/${line}.wav;done > /data/yelong/wav_source/librispeech/out.out 2>&1 &
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=/data/yelong/bpe_test/lib.txt --model_prefix=/data/yelong/bpe_test/bpe --vocab_size=20000 --model_type=bpe')
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
with open('/data/yelong/wav_source/librispeech/wavscp/testid.txt', 'a') as fid, open('/data/yelong/wav_source/librispeech/wavscp/test.txt') as did:
    for line in did:
        a = line.strip().split()[1:]  # eg. "TWO COME MUSE MIGRATE"
        aa = ' '.join([t for t in a])
        listid = sp.EncodeAsIds(aa)
        strid = ' '.join([str(t) for t in listid])
        b = line.strip().split()[:1]
        b = ''.join([t for t in b])
        fid.write(b + ' ' + strid + '\n')
2018.12.19
The feature dim is 123: 41*3
What is an op in tensorflow?
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,320032,128]
[[Node: validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/concat_6 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:worker/replica:0/task:0/device:GPU:0"](validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/Reshape_7, validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/Switch_5:1, validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/concat_9/axis)]]
[[Node: validate/validation/evaluate/evaluate_decoder/div_S331 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/device:GPU:0", send_device_incarnation=5832999558069970679, tensor_name="edge_1561_validate/validation/evaluate/evaluate_decoder/div", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/cpu:0"]]]
step 0/118610
step 0/3558300
step 0/582500
Changed num_epochs = 300 in trainer.cfg to 100.
2018-12-19 11:50:10.750165: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session a976d0ee459a760a with config:
/home/yelong/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
(the warning above is printed four times, once per worker)
Explanation of output_dims = 20000 in model.cfg:
It is not that every sample maps to a 20000-dim output sequence. Rather, for each input x_i there is an output y_j, and y_j is a multi-class softmax decision with 20000 possible outcomes; the most probable one is taken as y_j, and the loss is computed against it.
Understanding modeling units:
Say the vocabulary is [3,4], only two modeling units. For an output sequence (not the whole input sequence; a label here is per frame) the label might be [3,3,3,4] while training outputs [3,4,4,4] or [4,3,3,3] or something else; that gives an error to update on.
So the vocabulary could just as well be written [1,2] with label [1,1,1,2]; there is no difference.
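A numpy sketch of what "a 20000-way softmax per output step" means (toy numbers; the label id is hypothetical): one step's logits are scored against one integer label, not a 20000-dim target vector.

import numpy as np

logits = np.random.randn(20000)       # decoder output for a single step
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over the whole vocabulary
label = 4271                          # hypothetical target unit id for this step
loss = -np.log(probs[label])          # cross-entropy contribution of this step
print(probs.argmax(), loss)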
WORKER 0: validating model
2018-12-19 11:19:44.240584: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ******************************************************************************************xxxxxxxxxx
2018-12-19 11:19:44.240606: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,320032,128]
2018-12-19 11:19:44.244473: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,320032,128]
[[Node: validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/concat_6 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:worker/replica:0/task:0/device:GPU:0"](validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/Reshape_7, validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/Switch_5:1, validate/validation/evaluate/evaluate_decoder/beam_search_decoder/Speller_1/decoder/while/beam_search/expand_beam/concat_9/axis)]]
[[Node: validate/validation/evaluate/evaluate_decoder/dense_sequence_to_sparse_1/get_indices/Where_G797 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/device:GPU:0", send_device_incarnation=5832999558069970679, tensor_name="edge_1552_validate/validation/evaluate/evaluate_decoder/dense_sequence_to_sparse_1/get_indices/Where", tensor_type=DT_INT64, _device="/job:worker/replica:0/task:0/cpu:0"]]]
2018.12.20
- understand trainer.py
- solve the sparse-to-dense warning
- why is the feature dim 123
TensorFlow ops: the operations that perform tensor computations.
target: the desired output y
data_queue_elements: 224993 and 28124; all utterances are read in here
Debugging method (python -m pdb):
python -m pdb nabu/scripts/train.py --clusterfile=./test_train/cluster/cluster --job_name=worker --task_index=0 --sh_command=None --expdir=./test_train
(Pdb) b 24 # set a breakpoint at line 24
(Pdb) c # continue until the breakpoint (very handy: instead of pressing n through a whole loop, set a breakpoint after the loop and run straight to it)
data_queue_elements, _ = input_pipeline.get_filenames(input_dataconfs + target_dataconfs) # reads in all the utterances
The parameter server (ps):
Jobs are split into parameter servers (ps) and compute nodes (workers). Workers execute the computation, doing the work and updating the parameters; ps nodes store the parameters and hand out the work, e.g. deciding what each worker does.
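A hedged TF1 sketch of that split (the addresses are made up): a ClusterSpec names every job, and each process then starts a Server for its own role.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],                        # stores the variables
    'worker': ['localhost:2223', 'localhost:2224'],  # run the computation
})
# e.g. the process acting as the first worker would do:
server = tf.train.Server(cluster, job_name='worker', task_index=0)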
input_LOG: {'features': <tf.Tensor 'validate/validation/evaluate/input_pipeline/batch:0' shape=(4, ?, 123) dtype=float32>} input_seq_length : {'features': <tf.Tensor 'validate/validation/evaluate/input_pipeline/batch:1' shape=(4,) dtype=int32>}
How the pieces under nabu/neuralnetworks/components relate:
trainer.py → model.py → listener.py/ed_encoder.py → layer.py → ops.py
model.py (line 76):
logits, logit_seq_length,
logits
{'text': <tf.Tensor 'train/Speller/decoder/transpose:0' shape=(?, ?, 20001) dtype=float32>}
(Pdb) logit_seq_length
{'text': <tf.Tensor 'train/Speller/decoder/while/Exit_12:0' shape=(?,) dtype=int32>}
outputs['num_steps']
118610
outputs['update_op'] = self._update(
187 loss=outputs['loss'],
188 -> learning_rate=outputs['learning_rate'],
189 cluster=cluster)
which jumps into:
561 #compute the gradients
562 grads_and_vars = optimizer.compute_gradients(
563 loss=loss,
564 var_list=trainable)
which in turn jumps into /home/yelong/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py:
410 grads = gradients.gradients(
411 loss, var_refs, grad_ys=grad_loss,
412 gate_gradients=(gate_gradients == Optimizer.GATE_OP),
413 aggregation_method=aggregation_method,
414 colocate_gradients_with_ops=colocate_gradients_with_ops)
and emits:
/home/yelong/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
loss’: <tf.Tensor ‘train/add:0’ shape=() dtype=float32>
所以loss是0,
trainable
[<tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(251, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(251, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer0/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer1/BLSTM/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/fw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/kernel:0’ shape=(640, 512) dtype=float32_ref>, <tf.Variable ‘Listener/features/layer2/bidirectional_rnn/bw/layer_norm_basic_lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/memory_layer/kernel:0’ shape=(256, 128) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel:0’ shape=(20385, 512) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/kernel:0’ shape=(256, 512) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/bias:0’ shape=(512,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/bahdanau_attention/query_layer/kernel:0’ shape=(128, 128) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/attention_wrapper/bahdanau_attention/attention_v:0’ shape=(128,) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/dense/kernel:0’ shape=(384, 20001) dtype=float32_ref>, <tf.Variable ‘Speller/decoder/dense/bias:0’ shape=(20001,) dtype=float32_ref>]
This problem cannot be solved for now; it stays a warning.
WORKER 0: validating model; the computation here is enormous and never produces a result. TensorFlow is configured synchronously at this point, so workers 1, 2 and 3 wait for worker 0 to finish a round before everyone continues together.
1. Should training be made asynchronous?
2. Could the data be processed without loading everything: get one loss, store it, and after a batch of results do the backward pass in one go?… didn't fully understand this.
Get fluent with TensorFlow first.
Training is slow:
- generating the visualization summaries is expensive.
Can the summaries be commented out? Answer: yes; with all of them commented out it seems a bit faster, roughly 5 s per step, and worker 0 is now being used.
outputs_LOG: outputs['loss'] = Tensor("train/average_cross_entropy_loss/Sum_1:0", shape=(), dtype=float32, device=/job:worker/task:2) , self.conf['loss'] = average_cross_entropy
2018.12.21
logits_LOG:targets = {'text': <tf.Tensor 'train/get_batch/fifo_queue_1_Dequeue:0' shape=(?, ?) dtype=int32>} logits = {'text': <tf.Tensor 'train/Speller/decoder/transpose:0' shape=(?, ?, 20001) dtype=float32>} , logit_seq_legth = {'text': <tf.Tensor 'train/Speller/decoder/while/Exit_12:0' shape=(?,) dtype=int32>} , target_seq_legth = {'text': <tf.Tensor 'train/get_batch/fifo_queue_3_Dequeue:0' shape=(?,) dtype=int32>}
outputs_LOG: outputs['loss'] = Tensor("train/average_cross_entropy_loss/Sum_1:0", shape=(), dtype=float32, device=/job:worker/task:1) , self.conf['loss'] = average_cross_entropy
logits_LOG:targets = {'text': <tf.Tensor 'train/get_batch/fifo_queue_1_Dequeue:0' shape=(?, ?) dtype=int32>} logits = {'text': <tf.Tensor 'train/Speller/decoder/transpose:0' shape=(?, ?, 20001) dtype=float32>} , logit_seq_legth = {'text': <tf.Tensor 'train/Speller/decoder/while/Exit_12:0' shape=(?,) dtype=int32>} , target_seq_legth = {'text': <tf.Tensor 'train/get_batch/fifo_queue_3_Dequeue:0' shape=(?,) dtype=int32>}
logits_LOG:targets = {'text': <tf.Tensor 'train/get_batch/fifo_queue_1_Dequeue:0' shape=(?, ?) dtype=int32>} logits = {'text': <tf.Tensor 'train/Speller/decoder/transpose:0' shape=(?, ?, 20001) dtype=float32>} , logit_seq_legth = {'text': <tf.Tensor 'train/Speller/decoder/while/Exit_12:0' shape=(?,) dtype=int32>} , target_seq_legth = {'text': <tf.Tensor 'train/get_batch/fifo_queue_3_Dequeue:0' shape=(?,) dtype=int32>}
outputs_LOG: outputs['loss'] = Tensor("train/average_cross_entropy_loss/Sum_1:0", shape=(), dtype=float32, device=/job:worker/task:3) , self.conf['loss'] = average_cross_entropy
outputs_LOG: outputs['loss'] = Tensor("train/average_cross_entropy_loss/Sum_1:0", shape=(), dtype=float32, device=/job:worker/task:2) , self.conf['loss'] = average_cross_entropy
logits_LOG:targets = {'text': <tf.Tensor 'train/get_batch/fifo_queue_1_Dequeue:0' shape=(?, ?) dtype=int32>} logits = {'text': <tf.Tensor 'train/Speller/decoder/transpose:0' shape=(?, ?, 20001) dtype=float32>} , logit_seq_legth = {'text': <tf.Tensor 'train/Speller/decoder/while/Exit_12:0' shape=(?,) dtype=int32>} , target_seq_legth = {'text': <tf.Tensor 'train/get_batch/fifo_queue_3_Dequeue:0' shape=(?,) dtype=int32>}
outputs_LOG: outputs['loss'] = Tensor("train/average_cross_entropy_loss/Sum_1:0", shape=(), dtype=float32, device=/job:worker/task:0) , self.conf['loss'] = average_cross_entropy
WORKER 0: validating model
WORKER 0: validation loss: 0.000000
- why is the validation loss 0….
- GPU utilization is low
Why it keeps waiting for task 0: worker 0 terminates and is not restarted, so the others cannot start either. Fixes that feel short-term:
- comment out the termination condition in validation;
- after validation, restart worker 0.
Low GPU utilization:
see https://blog.csdn.net/qing101hua/article/details/78978791
Tried changing all float32 to float64.
That caused problems; changed it back.
Commenting out all the summaries did not really speed things up either.
2018.12.22
Looking at train.out, a loss did show up later.
- time steps: the number of frames
input: [batch_size * time steps * feature_dims] # e.g. [32 * 10 frames * 123]
Zhenyou helped change nabu/neuralnetworks/trainers/trainer.py:
158 outputs['inputs'] = inputs
159 outputs['input_seq_length'] = input_seq_length
160 outputs['targets'] = targets
161 outputs['target_seq_length'] = target_seq_length
775 for _ in range(local_steps):
776 #update the model
777 _, loss, lr, global_step, memory, limit, summary, inputs, input_len, targets, target_len = \
778 sess.run(
779 fetches=[outputs['update_op'],
780 outputs['loss'],
781 outputs['learning_rate'],
782 outputs['global_step'],
783 outputs['memory_usage'],
784 outputs['memory_limit'],
785 outputs['training_summaries'],
786 outputs['inputs'],
787 outputs['input_seq_length'],
788 outputs['targets'],
789 outputs['target_seq_length']
790 ])
792 # TODO
793 print('inputs:', inputs)
794
795 print('inputs_len:', input_len)
796
797 print(';;;;', np.array(inputs.values()).shape)
798
799 print('targets:', targets)
800
801 print(np.array(targets.values()).shape)
802
803 print('target_len:', target_len)
804 exit()
Zhenyou found the root cause:
675 for i in range(outputs['valbatches']):
676 _, summary = sess.run(fetches=[
677 outputs['update_loss'],
678 outputs['eval_summaries']])
679 if summary is not None:
680 summary_writer.add_summary(summary, i)
681 summary, global_step = sess.run(fetches=[
682 outputs['val_loss_summary'],
683 outputs['global_step']
684 ])
Here worker 0 is computing the validation loss; outputs['valbatches'] is huge (7k+), the computation is heavy, so it blocks here and never gets released.
The train : dev : test split should be 98 : 1 : 1 (not 8:1:1!!)
Here I want 99.8% : 0.1% : 0.1%:
train:1 - 280678 (281241 * 0.998 = 280678)
dev:280679 -280959 (281241* 0.001 = 282)
test:280960 - 281241(281241* 0.001 = 281)
280678 - 224993 = 55685
280678 - 253117 = 27561
cat testid5k.txt >> trainid5k_99.8.txt
head -27561 devid5k.txt >> trainid5k_99.8.txt
sed -n '27562,27843p' devid5k.txt >> testid5k_0.1.txt
sed -n '27844,28124p' devid5k.txt >> devid5k_0.1.txt
WORKER 2: step 32865/118610 loss: 2.728879, learning rate: 0.000528
time elapsed: 7.048758 sec
peak memory usage: 1210/11992 MB
ls |head -27561 | xargs -i mv {} ../train
ls |sed -n '27562,27843p' | xargs -i mv {} ../train
2018.12.23
What is an embedding for??
ivector: the speaker system's output is a vector characterizing the speaker (unlike speech recognition, whose output is a probability)
bucket: on the GPU, CUDA computation is matrix computation. The tensor built is batch_size × length × dim, where length is the longest utterance in the batch: e.g. for 5 texts of lengths 2, 12, 20, 25, 13, length is 25, so the tensor would be e.g. 32 × 25 × 5000, and shorter items are padded with zeros, wasted computation for the short ones. So sort by length first: long with long, short with short. In TensorFlow this is exactly what buckets do: set the number of buckets, say 10, and the data is split into 10 groups by length.
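A plain-Python sketch of the bucketing idea (illustrative only): sort indices by length, then cut them into num_buckets groups so each batch pads to a similar maximum length.

def make_buckets(lengths, num_buckets=10):
    # indices sorted by utterance length, short to long
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    size = (len(order) + num_buckets - 1) // num_buckets
    return [order[i:i + size] for i in range(0, len(order), size)]

# the five example lengths above, in two buckets:
print(make_buckets([2, 12, 20, 25, 13], num_buckets=2))  # [[0, 1, 4], [2, 3]]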
2018.12.24
Studying Listen-Attend-and-Spell-Pytorch, github: https://github.com/Alexander-H-Liu/Listen-Attend-and-Spell-Pytorch. It needs Python 3, so I didn't run it, just read the code.
util/librispeech_preprocess.py handles text by splitting it into characters (a-z plus space, 27 characters).
The code:
First sort by length, long to short. This step is actually necessary; I skipped it because of the mapping to the earlier ids, but it really is necessary.
# Sort data by signal length (long to short)
audio_len = [len(x) for x in X]
tr_file_list = [tr_file_list[idx] for idx in reversed(np.argsort(audio_len))]
tr_text = [tr_text[idx] for idx in reversed(np.argsort(audio_len))]
After that, librispeech contains only uppercase letters, the apostrophe and the space, nothing else, so I built the char_map directly.
char_map = {} # {'P': 17, 'Z': 26, 'L': 13, 'T': 21, 'K': 12, 'Y': 25, 'I': 10, 'G': 8, 'V': 23, ' ': 28, 'F': 7, 'H': 9, 'R': 19, 'J': 11, 'Q': 18, 'E': 6, '<eos>': 1, 'O': 16, 'M': 14, 'N': 15, '<sos>': 0, 'A': 2, 'D': 5, 'U': 22, 'B': 3, 'C': 4, 'W': 24, "'": 27, 'S': 20}
### the part below was not done
char_map['<sos>'] = 0
char_map['<eos>'] = 1
char_idx = 2
# map char to index
for text in f_list:
    for char in text:
        if char not in char_map:
            char_map[char] = char_idx
            char_idx += 1
rev_char_map = {v: k for k, v in char_map.items()}  # {0: '<sos>', 1: '<eos>', 2: 'A', 3: 'B', 4: 'C', 5: 'D', 6: 'E', 7: 'F', 8: 'G', 9: 'H', 10: 'I', 11: 'J', 12: 'K', 13: 'L', 14: 'M', 15: 'N', 16: 'O', 17: 'P', 18: 'Q', 19: 'R', 20: 'S', 21: 'T', 22: 'U', 23: 'V', 24: 'W', 25: 'Y', 26: 'Z', 27: "'", 28: ' '}
### the part above was not done
tmp_list = []
for text in f_list:
    tmp = []
    for char in text:
        tmp.append(char_map[char])
    tmp_list.append(tmp)
print(tmp_list)  # <class 'list'>: [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 25, 26, 27, 28]]
That program converts the characters to numbers; I didn't, since it seemed unnecessary.
The space has to stay in the alphabet as a label. Instead of converting chars to numbers, I changed the split(' ') in the code to split(',') — wrong! that doesn't work, because the text is read in with spaces as separators.
alphabet = "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,S,Y,Z, ,'"
The path is under nabu_char.
The above doesn't work; the characters do have to be converted to numbers (I also tried replacing the space with some other character, e.g. a lowercase letter or punctuation, but then normalize/max_length comes out as 1). Numbers it is, because everything inside splits on spaces.
TensorFlow multithreading: num_thread
tf.train.batch
Training stability:
"Training is stable" means: during the step iterations you get, say, a model every 10k steps, 10 models by 100k steps (before training finishes); run all of them on the test set, and if the test loss and WER barely differ between them, the model has stabilized.
–
Weekly meeting: prior??
embedding layer
nabu5k decoding, WER:
Parameters, train.cfg:
[trainer]
#name of the trainer that should be used
trainer = standard
#the loss function to be minimized
loss = average_cross_entropy
#the amount of training labels that need to be added to the output
trainlabels = 1
#link the input names defined in the classifier config to sections defined in
#the database config
features = trainfbank
#a space seperated list of target names used by the trainer
targets = text
#a mapping between the target names and database sections
text = traintext
#number of passes over the entire database
#num_epochs = 300
num_epochs = 10
#exponential weight decay parameter
learning_rate_decay = 0.1
#size of the minibatch (#utterances)
batch_size = 128
###VALIDATION PART###
#frequency of evaluating the validation set.
valid_frequency = 5000
#the number of times validation performance can be worse before terminating training, set to None to disable early stopping
num_tries = None
validation_evaluator.cfg:
[evaluator]
#name of the evaluator that should be used
evaluator = decoder_evaluator
#the number of utterances that are processed simultaniously
#batch_size = 32
batch_size = 32
#link the input names defined in the classifier config to sections defined in
#the database config
features = devfbank
#a space seperated list of target names used by the evaluator
targets = text
#a mapping between the target names and database sections
text = devtext
[decoder]
#name of the decoder that should be used
decoder = beam_search_decoder
#the maximum number of output steps
max_steps = 100
#the beam width
beam_width = 5
#if you want to visualize the alignments set to True
visualize_alignments = True
#the alphabet used by the decoder
model.cfg:
[io]
#a space seperated list of input names
inputs = features
#a space seperated list of output names
outputs = text
#a space seperated list of model output dimensions (exluding eos)
#output_dims = 39
output_dims = 5000
[encoder]
#type of encoder
encoder = listener
#the standard deviation of the Gaussian input noise added during training
#input_noise = 0.6
input_noise = 0.1
#number of pyramidal layers a non-pyramidal layer is added at the end
num_layers = 2
#number of units in each layer
#num_units = 128
num_units = 128
#number of timesteps to concatenate in each pyramidal layer
pyramid_steps = 2
#dropout rate
#dropout = 0.5
dropout = 0.7
[decoder]
#type of decoder
decoder = speller
#number of layers
num_layers = 2
#number of units
#num_units = 128
num_units = 128
#the dropout rate in the rnn
dropout = 0.5
nabu5k/test_train_new/decode/:
word error rate: 0.232240437158
substitutions: 0.158469945355
insertions: 0.00945775535939
deletions: 0.0643127364439
2018.12.27
Represent the space with _, then insert a space between every pair of characters:
sed 's/ /_/g' a.txt | sed 's/./& /g'
- awk '{print $1}' dev_char_.txt > tmp1.txt
- awk '{print $2}' dev_char_.txt | sed 's/./& /g' > tmp2.txt
- paste -d tmp1.txt tmp2.txt > dev_char_space.txt (but that delimiter seems to come out as \t)
(space as the delimiter: paste -d: tmp1.txt tmp2.txt | sed 's/:/ /g' > dev_char_space.txt)
O R _ W A S _ S U E _ S I M P L Y _ S O _ P E R V E R S E _ T H A T _ S H E _ W I L F U L L Y _ G A V E _ H E R S E L F _ A N D _ H I M _ P A I N _ F O R _ T H E _ O D D _ A N D _ M O U R N F U L _ L U X U R Y _ O F _ P R A C T I S I N G _ L O N G _ S U F F E R I N G _ I N _ H E R _ O W N _ P E R S O N _ A N D _ O F _ B E I N G _ T O U C H E D _ W I T H _ T E N D E R _ P I T Y _ F O R _ H I M _ A T _ H A V I N G _ M A D E _ H I M _ P R A C T I S E _ I T _ H E _ C O U L D _ P E R C E I V E _ T H A T _ H E R _ F A C E _ W A S _ N E R V O U S L Y _ S E T
[[Node: validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/assert_equal/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:worker/replica:0/task:0/cpu:0"](validate/validation/evaluate/input_pipeline/read_data/reader_1/StringReaderEOS/assert_equal/All_G71, validate/validation/evaluate/input_pipeline/read_data/reader_ 1/StringReaderEOS/ParseSingleExample/Squeeze_data)]]
T W O U L D N ' T _ B E _ S A F E _ T O _ G O _ W I T H O U T _ T H E M _ S A I D _ J A S P E R _ S H A K I N G _ H I S _ H E A D _ U N L E S S _ W E _ H A D _ N A I L S _ D R I V E N _ I N _ O U R _ S H O E S _ I ' D _ M U C H _ R A T H E R _ H A V E _ T H E _ N A I L S _ C R I E D _ P O L L Y _ O H _ M U C H _ R A T H E R _ J A S P E R _ W E L L
As a quick check, test with a tiny dataset:
head -11 train_char_space.txt | sed "s/'/2/g" | sed 's/_/1/g' > train_test.txt
train_test.txt:
1001-134707-0008 D E M A N D S 1 Y O U 1 T H R E E 1 R E S P O N S I V E 1 T O 1 O U R 1 S U M M O N S 1 O R 1 R A T H E R 1 T O 1 H E R 1 L O N G 1 N U R S 2 D 1 I N C L I N A T I O N
How is the number of steps computed?
train: 10 → steps is 100
processor_config = config/recipes/LAS/TIMIT/feature_processor.cfg
Kept erroring with tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed.
Turned out the alphabet was incomplete; some text had no match in the alphabet.
Final confirmed alphabet (with _ standing in for the space):
alphabet = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ' _
WORKER 2: step 19/71400 loss: 2.999545, learning rate: 0.000999
time elapsed: 21.783085 sec
peak memory usage: 3698/11992 MB
The step count is unrelated to the validation frequency and to the batch_size of the validation/test sets; it depends on the training batch_size.
Changing the trainer batch_size from 128 to 256 changes the number of steps from 71400 to 35580 (the log shows e.g. step 52/35580).
How to dump intermediate models?
Idea, in trainer.py (checking the fetched global step, since "every 10000 steps" is the intent):
if global_step % 10000 == 0:  # every 10000 steps
    #store the model file
    modelfile = os.path.join(self.expdir, 'model', 'model.pkl')
    with open(modelfile, 'wb') as fid:
        pickle.dump(self.model, fid)
network.ckpt
2019.1.1
What determines the number of training steps?
With chars (28 symbols) the loss is 0.73 and the WER 73% (deletions make up most of it); presumably it has not converged.
- How can training be resumed on this model for N more epochs? (look up loading a TensorFlow model and continuing training; see the sketch below)
Iterations:
With the trainer batch_size at 256, each step takes ~20 s and 35,580 steps are needed; at 128, a step takes only ~7 s but 71,400 steps are needed.
Each step processes one batch of data points.
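A minimal sketch of resuming from a checkpoint in TF1 (the directory and the dummy variable are placeholders, not nabu's actual trainer code):

import tensorflow as tf

w = tf.get_variable('w', shape=[2], initializer=tf.zeros_initializer())
saver = tf.train.Saver()
with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint('./test/logdir')
    if ckpt:
        saver.restore(sess, ckpt)  # continue from the stored weights
    else:
        sess.run(tf.global_variables_initializer())
    # ...run N more epochs of the training loop from here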
2019.1.3
Training on chars (28 symbols), the loss just will not come down: it reached 1.7 a long while ago, and after tens of thousands more steps it is still 1.7. (How to visualize this? Use TensorBoard.)
2019.1.6
Ask 桃姐/连志哥 how the LAS WER of 10% was obtained.
2019.1.7
Increase the step count: in trainer.py: outputs['num_steps'] = outputs['num_steps']*int(2)
This doubles it; a larger factor works just as well.
Doubled, training nabu5k (steps: 116,500) reaches loss 0.18:
word error rate: 0.18406570146473625
substitutions: 0.1307255947845229
insertions: 0.009059351452036237
deletions: 0.04428075522817712
Now training nabu_char, with the step count likewise doubled to 142,800.
Found that the config set #the maximum number of output steps, max_steps = 100, which caps the decoded output at 100 symbols. That is a problem, since some of our sentences are already longer than that, so it had to change: nabu_char's max_steps is now 525.
The initial learning rate is set in nabu/neuralnetworks/trainers/defaults.
??? (this looks like the Scheduled Sampling probability from the overview further down; a sketch follows the config lines):
#the probability that the network will sample from the output during training
sample_prob = 0.1
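If that reading is right, then with probability sample_prob the decoder is fed its own previous prediction instead of the ground-truth symbol during training. A sketch of one decoder step under that assumption (logits_step and ground_truth_step are hypothetical tensors, not nabu's variables):

import tensorflow as tf

sample_prob = 0.1
logits_step = tf.zeros([16, 29])                    # hypothetical decoder logits
ground_truth_step = tf.zeros([16], dtype=tf.int32)  # hypothetical target symbols

# with probability sample_prob, feed back the model's own prediction
use_sample = tf.random_uniform([]) < sample_prob
next_input = tf.cond(
    use_sample,
    lambda: tf.argmax(logits_step, axis=-1, output_type=tf.int32),
    lambda: ground_truth_step)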
steps:289360
Check how the LAS LibriSpeech char paper configures the network:
《Attention-based sequence-to-sequence model for speech recognition development of state-of-the-art system on LibriSpeech and its application to non-native English》
3 pyramid BLSTM layers with 1024 hidden units in each direction. The decoder is a 2-layer LSTM with 512 hidden units per layer.
??? Why is there no mlp in nabu's layer.py?
2019.1.8
Can the weight parameters (w, b) be inspected to guide optimization?
?? Question: now that this model exists, how do we locate its problems? I don't know where to start looking.
2019.1.9
model.pkl stores the network structure, so saving it once is enough.
I kept puzzling over how to save the model: nabu uses tf.train.MonitoredTrainingSession, so a tf.train.Saver has to be wrapped in a hook and attached when the session object is built.
First I tried saver_1 = tf.train.Saver(self.variables, sharded=True) followed directly by saver_1.save(sess, "model1/network.ckpt"); that fails even when passed as chief_only_hooks=[saver_1], because a plain Saver is not a hook, so it just raises an error.
Next, mirroring the validation hook (validation_hook = hooks.ValidationSaveHook()), I tried creating a new saver, but it raised an error (though something did get saved), complaining that no _saver object had been created.
Then I asked 振有哥 and am now trying his approach:
saver = tf.train.Saver(max_to_keep=300)
saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir=os.path.join(self.expdir, 'logdir'),
    save_steps=10000,
    saver=saver)
then add chief_only_hooks=[save_hook, validation_hook, saver_hook]
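Wired together it looks roughly like this (a sketch with a stand-in train op, not the real trainer):

import os
import tensorflow as tf

expdir = './test_train'
global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for the real update op

saver = tf.train.Saver(max_to_keep=300)
saver_hook = tf.train.CheckpointSaverHook(
    checkpoint_dir=os.path.join(expdir, 'logdir'),
    save_steps=10000,  # one checkpoint every 10k global steps
    saver=saver)
stop_hook = tf.train.StopAtStepHook(last_step=100000)

with tf.train.MonitoredTrainingSession(
        hooks=[stop_hook], chief_only_hooks=[saver_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)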
The optimizer in there is Adam; should it be swapped for plain stochastic gradient descent (GradientDescentOptimizer)?
At around step 48,000 the loss, which had been stuck around 1.5 to 1.6, suddenly shot up to several tens, then came back down again.
sequence_length_histogram.npy stores the array indexed up to the maximum length (1 × max_length; see the fuller explanation further down).
- Space the checkpoints out more: in tf.train.MonitoredTrainingSession(), change the default save_checkpoint_secs=600 (10 min) to save_checkpoint_secs=1800 (30 min).
Problem: each training step takes a long time, ~10 s, and I don't know what to change; data loading goes through TFRecords, so that should be fine?
连志师兄 says: first pin down which stage takes so long: data loading, queue blocking (print the queue's size and check whether it sits at capacity), or a slow sess.run. Time each of them; a sketch follows.
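A sketch of that instrumentation, meant to be pasted into the training loop (the handle data_queue and the outputs keys are assumptions about the local code):

import time

# time one training step; if this dominates, the graph itself is the problem
t0 = time.time()
_ = sess.run(outputs['update_op'])
print('sess.run: %.3f s' % (time.time() - t0))

# if the input queue handle is reachable, a size near 0 means the pipeline
# starves the trainer, a size at capacity means it blocks
# print('queue size:', sess.run(data_queue.size()))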
2019.1.10
('num_steps', 14468)
('seq_length', [<tf.Tensor 'train/get_batch/input_pipeline/bucket_by_sequence_length/bucket/dequeue_top:3' shape=(?,) dtype=int32>, <tf.Tensor 'train/get_batch/input_pipeline/bucket_by_sequence_length/bucket/dequeue_top:5' shape=(?,) dtype=int32>])
('max_length', 2973)
('len(data)', 2)
num_steps is the number of steps in each epoch: 14,468. outputs['num_steps'] = num_steps*int(self.conf['num_epochs']), so with num_epochs = 10 the total is 144,680.
seq_length is the sequence lengths as a list of [batch_size] tensors; max_length = 2973 is the frame count of the longest utterance, so the input tensor is 64 × 2973 × 123.
feature dimension: 123
Data loading is not the slow part; it takes only a few milliseconds.
101730
Study TensorBoard.
Read the literature and tune, trying to bring the WER down:
《An Overview of End2End for ASR.pdf》:
Training tricks for attention models
At least the following tricks have been shown to help when training attention-based models:
1. Use larger modeling units (sub-word, word, etc.); they are more stable and help with language modeling.
2. Introduce Scheduled Sampling [22] to address the mismatch between training and inference in attention models.
3. Multi-Head Attention: MHA produces several distinct attention distributions, attending to different aspects of the context vector during decoding.
4. Use label smoothing [23] to keep the model from becoming over-confident in its own predictions.
5. Integrate an external language model trained on a much larger text corpus: an attention model is limited by the size of its speech corpus, and without a language model it still cannot match conventional systems.
6. Train discriminatively with minimum word error rate [24].
7. Besides the train/inference mismatch there is also a loss mismatch: training usually uses CE while evaluation uses WER; [25, 26] bring in the Policy Gradient method from reinforcement learning to optimize WER directly, end to end.
2019.1.11
Using letters may have been part of the problem; switching to digits to try (the chained seds below map A→1 ... Z→26, _→27, '→28; a Python equivalent is sketched after the command).
sed 's/A/1/g' test_char_space.txt |sed 's/B/2/g' - | sed 's/C/3/g' - |sed 's/D/4/g' - |sed 's/E/5/g' - |sed 's/F/6/g' - |sed 's/G/7/g' - |sed 's/H/8/g' - |sed 's/I/9/g' - |sed 's/J/10/g' - |sed 's/K/11/g' - |sed 's/L/12/g' - |sed 's/M/13/g' - |sed 's/N/14/g' - |sed 's/O/15/g' - |sed 's/P/16/g' - |sed 's/Q/17/g' - |sed 's/R/18/g' - |sed 's/S/19/g' - |sed 's/T/20/g' - |sed 's/U/21/g' - |sed 's/V/22/g' - |sed 's/W/23/g' - |sed 's/X/24/g' - |sed 's/Y/25/g' - |sed 's/Z/26/g' - | sed 's/_/27/g' - |sed "s/'/28/g" > test_char_space_number.txt
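The same mapping in Python, easier to audit than 28 chained seds (a sketch; it assumes each line starts with an utterance id followed by space-separated characters):

# A->1 ... Z->26, _->27, '->28, matching the sed chain above
alphabet = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ') + ['_', "'"]
char2id = {c: str(i + 1) for i, c in enumerate(alphabet)}

with open('test_char_space.txt') as fin, \
        open('test_char_space_number.txt', 'w') as fout:
    for line in fin:
        toks = line.split()
        mapped = [toks[0]] + [char2id.get(t, t) for t in toks[1:]]
        fout.write(' '.join(mapped) + '\n')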
桃桃姐 says that a decode like: -4.339998 A N D _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E ... [the same THEY WERE THAT phrase repeating out to the length limit; the full output is quoted again further down]
is usually a label problem: check whether the labels in the text and the labels after conversion correspond one-to-one.
TensorBoard: the train graph has 5805 nodes; the listener accounts for 40 tensors and the speller for 23.
- Understand TensorBoard more deeply
- Learn tfdbg
- Learn timeline
- Check the label conversion
2019.1.12
Inspect the labels:
In train.py, sess.run some tensors, namely the logits and targets:
import numpy as np  # needed for the shape printouts below

_, loss, lr, global_step, inputs, input_len, targets, target_len = \
    sess.run(
        fetches=[outputs['update_op'],
                 outputs['loss'],
                 outputs['learning_rate'],
                 outputs['global_step'],
                 outputs['inputs'],
                 outputs['input_seq_length'],
                 outputs['targets'],
                 outputs['target_seq_length']])
print('inputs:', inputs)
print('inputs_len:', input_len)
print('np.array(inputs.values()).shape', np.array(inputs.values()).shape)
print('targets:', targets)
print('np.array(targets.values()).shape', np.array(targets.values()).shape)
print('target_len:', target_len)
Output:
inputs is a dict whose key is 'features' and whose value is a numpy matrix;
inputs_len is a dict of the per-utterance lengths (in time steps):
('inputs_len:', {'features': array([1283, 1276, 1286, 1332, 1318, 1319, 1307, 1289, 1285, 1314, 1305,
1291, 1296, 1299, 1300, 1276, 1288, 1286, 1296, 1307], dtype=int32)})
('np.array(inputs.values()).shape', (1, 20, 1332, 123)) # 1 × 20 utterances × max time steps (the longest of these 20 has 1332 frames; shorter ones are zero-padded up to it to form a dense matrix) × 123 feature dims
('targets:', {'text': array([[19, 7, 4, ..., 0, 0, 0],
[22, 7, 8, ..., 0, 0, 0],
[ 8, 13, 18, ..., 13, 6, 28],
...,
[ 0, 19, 26, ..., 0, 0, 0],
[ 8, 26, 7, ..., 0, 0, 0],
[ 8, 19, 26, ..., 0, 0, 0]], dtype=int32)})
('np.array(targets.values()).shape)', (1, 20, 237))
('target_len:', {'text': array([199, 179, 237, 223, 180, 159, 199, 175, 186, 225, 172, 224, 169,
184, 205, 155, 177, 141, 207, 200], dtype=int32)})
At another step:
('np.array(logits.values()).shape)', (1, 22, 215, 29)), where 29 is the alphabet size: 28 symbols plus 1 extra symbol (presumably the end-of-sequence marker) = 29
('np.array(targets.values()).shape)', (1, 22, 215))
Loss computation:
targets: e.g. targets is a 36×61 matrix; after the real labels in each row an extra symbol equal to the alphabet size is appended (here 5000), and the rest is zero-padded:
[ 84 458 147 59 4996 4981 127 571 4991 1215 148 393 2288 853
2079 6 654 55 658 33 93 2967 85 6 4814 37 80 328
387 1459 824 4991 2197 26 253 3938 68 2457 25 6 361 1671
23 2410 30 37 91 3545 1938 3516 5000 0 0 0 0 0
0 0 0 0 0]
('np.array(logits.values()).shape)', (1, 17, 79, 5001))
('logit_seq_length', {'text': array([77, 67, 49, 64, 65, 59, 62, 55, 79, 56, 59, 44, 49, 29, 57, 51, 69], dtype=int32)})
('np.array(targets.values()).shape)', (1, 17, 79))
('target_len:', {'text': array([77, 67, 49, 64, 65, 59, 62, 55, 79, 56, 59, 44, 49, 29, 57, 51, 69], dtype=int32)})
logit_seq_length and target_seq_length are identical.
import tensorflow as tf
import numpy as np

target = np.load("target.npy")  # shape (1, 17, 79)
logit = np.load("logits.npy")   # shape (1, 17, 79, 5001)

logit_seq_length = {'text': np.array(
    [77, 67, 49, 64, 65, 59, 62, 55, 79, 56, 59, 44, 49, 29, 57, 51, 69],
    dtype='int32')}
target_seq_length = {'text': np.array(
    [77, 67, 49, 64, 65, 59, 62, 55, 79, 56, 59, 44, 49, 29, 57, 51, 69],
    dtype='int32')}

targets = {'text': target}
logits = {'text': logit}
t = 'text'

# per-position cross entropy, shape (1, 17, 79)
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits[t],
    labels=tf.cast(targets[t], tf.int32))

# zero out the positions beyond each sequence's length
mask = tf.sequence_mask(logit_seq_length[t], tf.shape(targets[t])[-1])
loss = tf.where(tf.expand_dims(mask, 0), loss, tf.zeros_like(loss))

# sum over time, then average the length-normalized per-sequence losses
losses = {t: tf.reduce_sum(loss, -1)}
losses = {t: tf.reduce_mean(losses[t] /
                            tf.cast(target_seq_length[t], tf.float32))}
total_loss = tf.reduce_sum(list(losses.values()))

with tf.Session() as sess:
    print(sess.run(total_loss))
So what is used inside is an average cross entropy (cross entropy normalized by the target length), not the plain sum; the optimizer is Adam.
Changed capacity=int(conf['batch_size'])*2 to *3; the times on workers 0 and 1 did not drop noticeably, so queue blocking is apparently not the cause.
outputs['increment_step'].run(session=sess) ???
outputs['valbatches']: 8
Look at the number of units in the intermediate layers.
The step increments by 1 at a time:
outputs['increment_step'] = outputs['global_step'].assign_add(1).op
outputs['increment_step'].run(session=sess)
Could the data be read with multiple threads?? (a fuller lifecycle sketch follows the snippet):
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
Moreover, the current queue reads the data one item at a time.
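A sketch of the full queue-runner lifecycle in case we switch to it (train_op is a placeholder; nabu does not do this yet):

import tensorflow as tf

train_op = tf.no_op()  # placeholder for the real training step
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(train_op)
    except tf.errors.OutOfRangeError:
        pass  # the input producers ran out of data
    finally:
        coord.request_stop()
        coord.join(threads)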
sequence_length_histogram.npy:
A 1-D numpy array whose size is the maximum length plus one (length 0 gets a slot too): with max_length = 524 the array has 525 entries, and entry i is the number of utterances of length i.
For example, below there are 0 utterances of length 0, 34 of length 1, ..., and 1 of length 524.
[ 0 34 86 135 288 542 824 1190 1477 1643 1818 1792 1939 1871
1859 1907 1849 1860 1823 1971 1970 1996 1919 1866 1917 2030 2036 2119
2326 2407 2519 2651 2891 3207 3416 3760 4138 4337 4856 5288 5497 6042
6514 6611 7023 7123 7152 7426 7383 7295 7171 6978 6517 6254 5790 5497
5042 4514 4009 3591 3168 2715 2341 1984 1702 1425 1136 908 767 586
532 403 287 248 187 156 108 81 64 43 41 28 19 12
5 13 15 6 2 7 3 4 1 1 0 1 1 1
1 0 0 0 1 0 0 0 0 0 0 1 0 1
0 0 1 0 0 0 0 0 1]
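Such a histogram can be rebuilt from the raw lengths with np.bincount (a sketch; target_lengths is a hypothetical list of per-utterance label lengths):

import numpy as np

target_lengths = [1, 3, 3, 2, 1]  # hypothetical lengths
hist = np.bincount(np.array(target_lengths), minlength=max(target_lengths) + 1)
print(hist)  # [0 2 1 2]: none of length 0, two of length 1, one of 2, two of 3
np.save('sequence_length_histogram.npy', hist)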
nabu/processing/tfreaders/audio_feature_reader.py
nabu/processing/processors/text_processor.py
nabu/processing/processors/audio_processor.py
audio_feature
string_eos
with open(os.path.join(datadir, 'sequence_length_histogram.npy'), 'w') as fid:
    np.save(fid, self.sequence_length_histogram)
os.path.join(self.expdir, 'logdir1')
import sys
import os
import shutil
expdir='./test_train_48.2'
#if os.path.isdir(os.path.join(expdir, 'test')):
# shutil.rmtree(os.path.join(expdir, 'test'))
#os.makedirs(os.path.join(expdir, 'test'))
#os.makedirs(os.path.join(expdir, 'test','model'))
#os.symlink(os.path.join(expdir, 'model'), os.path.join(expdir, 'test', 'model'))
src = './src_temp'
dst = './dst_temp'
os.symlink(src, dst)
谷神 says this output appears because something in the decode step is wrong:
A N D _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T H A T _ T H E Y _ W E R E _ T
I had misunderstood earlier: max_steps must not be set to the targets' maximum length, because that caps the logits too. Now max_steps (the logit length) is set to 1000 for training (against a maximum target length of 524; for nabu5k, whose maximum target length is 120, 200 would do). But the validation set has to be considered as well: set it too large and you get OOM when allocating tensor with shape[16,25010,2048] (a single float32 tensor of that shape is already about 3.3 GB).
Given that nabu_char decodes the repeating sequence T H E Y _ W E R E _ T H A T _ T H E Y, can we infer that the first few inputs carry most of the weight, i.e. are strongly correlated? Does it mean that after attention these few inputs get very large weights? Try training once without attention?
Is the attention-weight term weighted too heavily??? Work through the derivation and check whether nabu's attention code implements the formulas correctly.
People online say that for aishell1/2 it is better to use the official test sets. LibriSpeech likewise ships separate dev and test splits, but what I copied over was one combined file and I split dev and test myself; does that matter much? I also trained without separating clean from other.
2019.1.17
In nabu the queue is single-threaded; it uses neither tf.train.Coordinator nor tf.train.QueueRunner. Try switching to multithreading.
2019.1.18
The relationship between the queue and the input pipeline (queue vs input_pipeline): data_queue = tf.train.string_input_producer
2019.1.20
Following 流利说's setup, changed fbank to 40 dims and used the clean sets for validation and test.
Now at step 177,007 the loss is stuck at 5 with learning rate 0.001990; the config has learning_rate_decay = 0.9 and initial_learning_rate = 2e-3.
[io]
#a space separated list of input names
inputs = features
#a space separated list of output names
outputs = text
#a space separated list of model output dimensions (excluding eos)
output_dims = 5000
[encoder]
#type of encoder
encoder = listener
#the standard deviation of the Gaussian input noise added during training
input_noise = 0.6
#number of pyramidal layers, a non-pyramidal layer is added at the end
num_layers = 3
#number of units in each layer
num_units = 1024
#number of timesteps to concatenate in each pyramidal layer
pyramid_steps = 2
#dropout rate
dropout = 0.5
[decoder]
#type of decoder
decoder = speller
#number of layers
num_layers = 2
#number of units
num_units = 512
#the dropout rate in the rnn
dropout = 0.5
Ask about cut_sequence_length.
The final choice there was 2640-dim features (splicing 5 frames on each side) because 1024 hidden units were used; supposedly the input feature dimension should be no smaller than the number of hidden units??? I used only 121-dim features with 1024 units, which now looks wrong.
2019.1.22
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shared queue 'data_queue' has capacity 256 but requested capacity was 384
[[Node: train/get_batch/data_queue = FIFOQueueV2capacity=384, component_types=[DT_STRING], container="", shapes=[[]], shared_name="data_queue", _device="/job:ps/replica:0/task:0/cpu:0"]]
(presumably the parameter server still holds the old shared queue built with capacity 2*batch_size = 256 while the worker now requests 3*batch_size = 384)
loss:0.2
word error rate: 0.2194564389128778
substitutions: 0.1552789772246211
insertions: 0.011599356532046398
deletions: 0.052578105156210315
2019.1.24
The earlier WER computation was wrong: in wer.py, len() was applied to a one-element array, so it returned 1; found this by testing the trivial case reference 'abc', decoded 'acc'.
After adding the two lines reftext = ''.join(reftext) and output = ''.join(output), the computed WER is correct, though not much better: it is now 18%.
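To sanity-check the fix, a self-contained character-error-rate sketch (standard Levenshtein distance, not nabu's wer.py) reproducing the reference 'abc' / decoded 'acc' test:

import numpy as np

def edit_distance(ref, hyp):
    """Levenshtein distance between two strings."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1]

reftext = ''.join(['a', 'b', 'c'])  # the ''.join fix: list -> string
output = ''.join(['a', 'c', 'c'])
print(float(edit_distance(reftext, output)) / len(reftext))  # 0.333...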