Preface
I recently needed a text-to-speech network model for a project, and it turns out a 2020 paper does exactly this. The authors also open-sourced the code, so let's dig into it.
paper
Flowtron: an Autoregressive Flow-based Network for Text-to-Mel-spectrogram Synthesis
github
https://github.com/NVIDIA/flowtron
Environment Setup
Setup
Note that PyTorch needs to be installed first (we're all doing AI here, surely nobody is missing PyTorch); a quick check follows the commands below.
git clone https://github.com/NVIDIA/flowtron.git
cd flowtron
git submodule update --init; cd tacotron2; git submodule update --init
pip install -r requirements.txt
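Before going any further, a quick sanity check: inference.py moves both models onto the GPU, so PyTorch must import and CUDA must be visible.

import torch

print(torch.__version__)          # the installed PyTorch build
print(torch.cuda.is_available())  # inference.py calls .cuda(), so expect True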
Exceptions:
(1) Downloading the repo as an HTTPS archive and unzipping it (instead of git cloning) makes the submodule commands fail with a "git not configured" error.
(2) The scipy version is too high:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
paddlepaddle-gpu 1.7.2.post107 requires scipy<=1.3.1; python_version >= "3.5", but you have scipy 1.5.2 which is incompatible.
First uninstall scipy:
D:\flowtron\flowtron>python -m pip uninstall scipy
Found existing installation: scipy 1.5.2
Uninstalling scipy-1.5.2:
Would remove:
g:\python\lib\site-packages\scipy-1.5.2.dist-info\*
g:\python\lib\site-packages\scipy\*
Proceed (Y/n)? y
Successfully uninstalled scipy-1.5.2
Then install scipy 1.3.1:
D:\flowtron\flowtron>python -m pip install scipy==1.3.1
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting scipy==1.3.1
Downloading https://mirrors.aliyun.com/pypi/packages/50/eb/defa40367863304e1ef01c6572584c411446a5f29bdd9dc90f91509e9144/scipy-1.3.1-cp37-cp37m-win_amd64.whl (30.3 MB)
|████████████████████████████████| 30.3 MB 57 kB/s
Requirement already satisfied: numpy>=1.13.3 in g:\python\lib\site-packages (from scipy==1.3.1) (1.19.2)
Installing collected packages: scipy
Successfully installed scipy-1.3.1
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'G:\Python\python.exe -m pip install --upgrade pip' command.
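To confirm the downgrade took effect:

import scipy

print(scipy.__version__)  # expect 1.3.1 after the steps above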
Code
Let's look at the argument parsing first:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str,
                        help='JSON file for configuration')
    parser.add_argument('-p', '--params', nargs='+', default=[])
    parser.add_argument('-f', '--flowtron_path',
                        help='Path to flowtron state dict', type=str)
    parser.add_argument('-w', '--waveglow_path',
                        help='Path to waveglow state dict', type=str)
    parser.add_argument('-t', '--text', help='Text to synthesize', type=str)
    parser.add_argument('-i', '--id', help='Speaker id', type=int)
    parser.add_argument('-n', '--n_frames', help='Number of frames',
                        default=400, type=int)
    parser.add_argument('-o', "--output_dir", default="results/")
    parser.add_argument("-s", "--sigma", default=0.5, type=float)
    parser.add_argument("-g", "--gate", default=0.5, type=float)
    parser.add_argument("--seed", default=1234, type=int)
    args = parser.parse_args()

    # Parse configs. Globals nicer in this case
    with open(args.config) as f:
        data = f.read()

    global config
    config = json.loads(data)
    update_params(config, args.params)

    data_config = config["data_config"]
    global model_config
    model_config = config["model_config"]

    # Make directory if it doesn't exist
    if not os.path.isdir(args.output_dir):
        os.makedirs(args.output_dir)
        os.chmod(args.output_dir, 0o775)

    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = False
    infer(args.flowtron_path, args.waveglow_path, args.output_dir, args.text,
          args.id, args.n_frames, args.sigma, args.gate, args.seed)
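update_params is defined in the repo and is what makes the -p flag work: it applies command-line overrides onto the nested config dict. A minimal sketch of the idea, under the assumption of dotted key=value overrides (not the repo's exact implementation; the override string below is a hypothetical example):

import ast

def update_params_sketch(config, params):
    # Each override looks like "data_config.sampling_rate=22050": a dotted
    # key path into the nested config dict, then a value.
    for param in params:
        key, value = param.split("=", 1)
        try:
            value = ast.literal_eval(value)  # "2" -> 2, "0.5" -> 0.5
        except (ValueError, SyntaxError):
            pass                             # keep plain strings as-is
        node = config
        *path, last = key.split(".")
        for part in path:
            node = node[part]                # descend into sub-dicts
        node[last] = value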
Now let's look at infer.
Source:
def infer(flowtron_path, waveglow_path, output_dir, text, speaker_id, n_frames,
          sigma, gate_threshold, seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    # load waveglow (the vocoder); run it in half precision except for the
    # invertible 1x1 convolutions, which stay in fp32
    waveglow = torch.load(waveglow_path)['model'].cuda().eval()
    waveglow.cuda().half()
    for k in waveglow.convinv:
        k.float()
    waveglow.eval()

    # load flowtron
    model = Flowtron(**model_config).cuda()
    state_dict = torch.load(flowtron_path, map_location='cpu')['state_dict']
    model.load_state_dict(state_dict)
    model.eval()
    print("Loaded checkpoint '{}')".format(flowtron_path))

    # build a Data object only to reuse its text/speaker preprocessing
    ignore_keys = ['training_files', 'validation_files']
    trainset = Data(
        data_config['training_files'],
        **dict((k, v) for k, v in data_config.items() if k not in ignore_keys))

    speaker_vecs = trainset.get_speaker_id(speaker_id).cuda()
    text = trainset.get_text(text).cuda()
    speaker_vecs = speaker_vecs[None]
    text = text[None]

    with torch.no_grad():
        # sample the latent z ~ N(0, sigma^2) with shape (1, n_mel=80, n_frames)
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
        mels, attentions = model.infer(
            residual, speaker_vecs, text, gate_threshold=gate_threshold)

    # save the mel-spectrogram and one attention map per flow step as images
    for k in range(len(attentions)):
        attention = torch.cat(attentions[k]).cpu().numpy()
        fig, axes = plt.subplots(1, 2, figsize=(16, 4))
        axes[0].imshow(mels[0].cpu().numpy(), origin='bottom', aspect='auto')
        axes[1].imshow(attention[:, 0].transpose(), origin='bottom', aspect='auto')
        fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
        plt.close("all")

    with torch.no_grad():
        # vocode the mel-spectrogram into a waveform with WaveGlow
        audio = waveglow.infer(mels.half(), sigma=0.8).float()

    audio = audio.cpu().numpy()[0]
    # normalize audio for now
    audio = audio / np.abs(audio).max()
    print(audio.shape)
    write(os.path.join(output_dir, 'sid{}_sigma{}.wav'.format(speaker_id, sigma)),
          data_config['sampling_rate'], audio)
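Two knobs matter here: sigma scales the standard deviation of the Gaussian residual (smaller values give flatter, more deterministic speech; larger values give more variation), and gate_threshold decides when the autoregressive decoding stops. A minimal sketch of the gate idea, purely illustrative (step_fn is a hypothetical per-frame decoder, not Flowtron's actual internals):

import torch

def gated_decode_sketch(step_fn, residual, gate_threshold=0.5):
    # step_fn(z_t) -> (mel_frame, gate_logit) is a hypothetical per-frame
    # decoder; decoding stops once the gate says "end of speech".
    frames = []
    for t in range(residual.size(2)):          # at most n_frames steps
        mel_frame, gate_logit = step_fn(residual[:, :, t])
        frames.append(mel_frame)
        if torch.sigmoid(gate_logit).item() > gate_threshold:
            break                              # model predicts the utterance ended
    return torch.stack(frames, dim=2)          # (1, 80, T_generated)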
Application
example
First download the pre-trained models. flowtron_ljs.pt from Google Drive:
https://drive.google.com/file/d/1Cjd6dK_eFz6DE0PKXKgKxrzTUqzzUDW-/view
and the waveglow_256channels_v4.pt model file. WaveGlow's GitHub repo:
https://github.com/NVIDIA/WaveGlow
The latest v5 model on Google Drive:
https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view
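As a quick sanity check that the download is intact, the Flowtron checkpoint can be opened on the CPU; it stores a plain state_dict (as infer above shows), so no repo code is needed. The models/ path is simply where I saved the files:

import torch

ckpt = torch.load('models/flowtron_ljs.pt', map_location='cpu')
print(ckpt.keys())              # should include 'state_dict'
print(len(ckpt['state_dict']))  # number of parameter tensors in the model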
python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_v4.pt -t "It is well known that deep generative models have a rich latent space!" -i 0
Swapping in the v5 WaveGlow model instead:
python inference.py -c config.json -f models/flowtron_ljs.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a rich latent space!" -i 0
Exception:
ImportError: cannot import name 'betabinom' from 'scipy.stats'
A scipy version problem again: scipy.stats.betabinom was only added in scipy 1.4.0, so the dependency-conflict error earlier was actually misleading (I'm keeping that error in the post for comparison). Upgrade scipy:
python -m pip install scipy --upgrade
Next error:
ModuleNotFoundError: No module named 'numba.decorators'
A numba version problem: uninstall it and install version 0.48 (the uninstall may not even be necessary):
pip uninstall numba
pip install numba==0.48
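The numba.decorators module was removed in numba 0.50, and the import usually comes from an older librosa, which is why pinning 0.48 works (my understanding, not verified against every version). A quick check that the import chain is healthy again:

import numba
import librosa  # the package whose older versions import numba.decorators

print(numba.__version__)    # expect 0.48.0
print(librosa.__version__)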
Next error:
ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'
Looking at the code, change origin='bottom' to origin='lower' in infer (newer Matplotlib only accepts 'upper' and 'lower'):
for k in range(len(attentions)):
    attention = torch.cat(attentions[k]).cpu().numpy()
    fig, axes = plt.subplots(1, 2, figsize=(16, 4))
    axes[0].imshow(mels[0].cpu().numpy(), origin='lower', aspect='auto')  # was origin='bottom'
    axes[1].imshow(attention[:, 0].transpose(), origin='lower', aspect='auto')  # was origin='bottom'
    fig.savefig(os.path.join(output_dir, 'sid{}_sigma{}_attnlayer{}.png'.format(speaker_id, sigma, k)))
    plt.close("all")
Results
The results folder now contains the generated audio file. To my ear it sounds quite good; it synthesizes the sentence "It is well known that deep generative models have a rich latent space!" (can CSDN embed audio files? There's no way to demo it directly here.)
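Since I can't embed the audio, here is at least a way to inspect it. With the defaults above (speaker id 0, sigma 0.5), the output lands at results/sid0_sigma0.5.wav:

from scipy.io import wavfile

rate, audio = wavfile.read('results/sid0_sigma0.5.wav')
print(rate)                      # data_config['sampling_rate'], 22050 for LJSpeech
print(len(audio) / rate, 'sec')  # duration of the synthesized utterance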
A Brief Look at the Theory
To be filled in later…
Summary
Running open-source code out of the box usually throws errors, most of them dependency version problems; a quick Google / Stack Overflow search is generally all it takes.
Afterword
A binary-exploitation player gradually drifting into data science and AI security… I didn't see that coming either.
Still, AI, like IoT, is where things are headed, so the only option is to go with the flow. Now that I'm here, I'll knuckle down and do the work without overthinking it (I'll get back to binary when I have time).