Extracting audio features with a pretrained VGGish network in PyTorch

AI大杂烩

Article link: https://blog.csdn.net/yanchujian88/article/details/127428999

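Introduction

I recently started working on multimodal tasks and needed to extract audio features from videos. From what I found, most work uses VGGish, but the official release is in TensorFlow, and I could not find a single article that walks through audio feature extraction from start to finish, so here is a short write-up.
The PyTorch pretrained network used here comes from: https://github.com/tcvrick/audioset-vggish-tensorflow-to-pytorch
That repository has already converted the TensorFlow VGGish pretrained network to PyTorch, so you can download it yourself.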

Code
With the pretrained network above, we can extract audio features. First, the code below batch-extracts the audio tracks from the videos:

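import os
import subprocess

video_path = 'videos'  # directory containing the input videos
audio_path = 'audio'   # directory where the extracted .wav files are written
os.makedirs(audio_path, exist_ok=True)

for item in os.listdir(video_path):
    v_path = os.path.join(video_path, item)
    # Strip the extension without assuming it is exactly three characters long
    name = os.path.splitext(item)[0]
    d_name = os.path.join(audio_path, name + '.wav')
    # -vn drops the video stream; the audio is requested as 2 channels at 44.1 kHz
    # Passing an argument list (no shell) keeps paths with spaces intact
    command = ['ffmpeg', '-i', v_path, '-ab', '160k', '-ac', '2',
               '-ar', '44100', '-vn', d_name]
    subprocess.call(command)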

Note that ffmpeg must be installed separately.
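Because the loop above depends on an external ffmpeg binary, it can help to check for it before processing a whole folder. The following is a minimal sketch, not part of the original post: the helper names `ffmpeg_available` and `build_ffmpeg_command` are hypothetical, and the flags are simply the ones used in the command above.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def build_ffmpeg_command(video_file, wav_file):
    """Build the argument list for extracting the audio track of one video.

    Hypothetical helper: it only assembles the same flags as the loop above.
    """
    return ["ffmpeg", "-i", video_file, "-ab", "160k", "-ac", "2",
            "-ar", "44100", "-vn", wav_file]


if __name__ == "__main__":
    if not ffmpeg_available():
        print("ffmpeg was not found on PATH; install it before extracting audio.")
    else:
        print(" ".join(build_ffmpeg_command("videos/demo.mp4", "audio/demo.wav")))
```

The file names `videos/demo.mp4` and `audio/demo.wav` are placeholders for illustration only.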
The code for extracting the audio features is as follows:

from torchvggish.vggish import VGGish
from torchvggish.audioset import vggish_input
import torch
import os
import numpy as np

# NOTE: 'cuda:1' selects the second GPU; use 'cuda:0' on a single-GPU machine
device = 'cuda:1' if torch.cuda.is_available() else 'cpu'
model = VGGish()
# Load the converted weights onto the chosen device
state_dict = torch.load('pretrained/pytorch_vggish.pth', map_location=device)
model.load_state_dict(state_dict)
model.eval().to(device)

audio_path = 'audio'
feature_path = 'audio_feature'
os.makedirs(feature_path, exist_ok=True)  # np.save does not create directories

for item in os.listdir(audio_path):
    print(item)
    input_wavfile = os.path.join(audio_path, item)
    # Log-mel examples with shape (num_examples, 96, 64)
    wav_preprocess = vggish_input.wavfile_to_examples(input_wavfile)
    # Add a channel dimension: (num_examples, 1, 96, 64)
    wav_preprocess = torch.from_numpy(wav_preprocess).unsqueeze(dim=1)
    input_wav = wav_preprocess.float().to(device)
    audio_feature_name = os.path.splitext(item)[0]
    with torch.no_grad():
        output = model(input_wav)
    # squeeze(0) only has an effect when the clip yields a single example
    output = output.squeeze(0).cpu().numpy()
    np.save(os.path.join(feature_path, audio_feature_name + '.npy'), output)
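To sanity-check the saved `.npy` files, it can be useful to know roughly what shape to expect. VGGish produces one 128-dimensional embedding per 0.96 s example. The sketch below estimates the number of examples from a clip's duration. It assumes the default framing parameters of the upstream VGGish code (16 kHz audio, 25 ms STFT window, 10 ms hop, 96 non-overlapping frames per example) are unchanged in the converted repository; the function name `expected_feature_shape` is hypothetical, and resampling may shift the sample count slightly, so treat the result as an estimate.

```python
# Constants mirror the upstream VGGish defaults (assumed unchanged here).
SAMPLE_RATE = 16000     # audio is resampled to 16 kHz before feature extraction
STFT_WINDOW = 400       # 25 ms window, in samples
STFT_HOP = 160          # 10 ms hop, in samples
EXAMPLE_FRAMES = 96     # 96 spectrogram frames = 0.96 s per example
EMBEDDING_SIZE = 128    # size of each VGGish embedding


def expected_feature_shape(duration_seconds):
    """Estimate the (num_examples, 128) shape of the features for one clip."""
    num_samples = int(duration_seconds * SAMPLE_RATE)
    if num_samples < STFT_WINDOW:
        return (0, EMBEDDING_SIZE)
    # Number of STFT frames that fit in the waveform
    num_frames = 1 + (num_samples - STFT_WINDOW) // STFT_HOP
    if num_frames < EXAMPLE_FRAMES:
        return (0, EMBEDDING_SIZE)
    # Non-overlapping groups of 96 frames
    num_examples = 1 + (num_frames - EXAMPLE_FRAMES) // EXAMPLE_FRAMES
    return (num_examples, EMBEDDING_SIZE)


print(expected_feature_shape(10.0))  # (10, 128)
```

Note that when a clip yields exactly one example, the `squeeze(0)` call in the extraction code saves an array of shape `(128,)` rather than `(1, 128)`, so downstream code should handle both cases.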