This article builds a complete Chinese lip-reading (visual speech recognition) system on the LRW-1000 (CAS-VSR-W1k) dataset, covering dataset preparation, model definition, training, and evaluation. All related code files are included below.
LRW-1000, also known as CAS-VSR-W1k, is a Chinese lip-reading dataset and currently the largest publicly available one for this task. The data was not easy to collect, but it can be used directly, and besides lip reading it is also suitable for training digital humans, Wav2Lip, and similar applications.
The code in this article is provided for reference only.
Environment Setup
Make sure the following software and libraries are installed:
- Python 3.8 or later
- PyTorch 1.9 or later
- torchvision 0.10 or later
- OpenCV
- numpy
- pandas
- matplotlib
- librosa (for audio processing)
- moviepy (for video processing)
You can install the required Python libraries with:
pip install torch torchvision opencv-python numpy pandas matplotlib librosa moviepy
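If you want to confirm that the environment is set up correctly before going further, a quick version check such as the following can help (a minimal sketch; the versions printed on your machine will differ):
import torch
import torchvision
import cv2
import numpy
import pandas
import librosa
import moviepy

# Print the versions of the main dependencies and whether a CUDA GPU is visible.
print('PyTorch     :', torch.__version__)
print('torchvision :', torchvision.__version__)
print('OpenCV      :', cv2.__version__)
print('numpy       :', numpy.__version__)
print('pandas      :', pandas.__version__)
print('librosa     :', librosa.__version__)
print('moviepy     :', moviepy.__version__)
print('CUDA available:', torch.cuda.is_available())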
Dataset Preparation
Assume the dataset has already been downloaded and organized into training, validation, and test splits. The expected directory structure is:
datasets/
└── lrw_1000/
├── videos/
│ ├── train/
│ ├── val/
│ └── test/
└── transcripts/
├── train.txt
├── val.txt
└── test.txt
Dataset files
- The videos/ directory contains all the video files.
- The transcripts/ directory contains the transcript file for each split.
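The annotation format of the released dataset may differ, but the loaders in this article simply assume that each line of a split file holds a relative video path and a label separated by the first space, and that the label is an integer class index. A hypothetical example of a few lines from train.txt:
train/000001.mp4 0
train/000002.mp4 17
train/000003.mp4 998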
Model Selection
We will fine-tune a pretrained model. Commonly used lip-reading models include AV-Sync and Vid2Speech; to keep the pipeline simple, this article uses AV-Sync.
Download the pretrained model
First, download the pretrained AV-Sync weights, which can be obtained from the AV-Sync GitHub repository.
Install AV-Sync
Clone the AV-Sync repository and set up the environment:
git clone https://github.com/microsoft/av-sync.git
cd av-sync
pip install -r requirements.txt
Configuration File
Create a configuration file that defines the training parameters and other settings. An example config.yaml:
[<title="Configuration File for Lip Reading">]
# Configuration file for lip reading task using AV-Sync
# Data paths
data_root: '../datasets/lrw_1000/'
# Model settings
model_name: 'av_sync'
pretrained_weights: './pretrained_weights/av_sync.pth'
# Training settings
batch_size: 16
num_workers: 4
learning_rate: 0.001
num_epochs: 50
log_interval: 10
# Dataset settings
train_split: 'train.txt'
val_split: 'val.txt'
test_split: 'test.txt'
# Other settings
device: 'cuda' # or 'cpu'
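Before launching training, it can be worth verifying that the paths referenced in config.yaml actually exist. The snippet below is a small sketch that loads the file with PyYAML and checks the data directories and split files:
import os
import yaml

# Load the YAML configuration and check that the referenced paths exist.
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

data_root = config['data_root']
for sub in ('videos', 'transcripts'):
    path = os.path.join(data_root, sub)
    print(path, 'exists' if os.path.isdir(path) else 'MISSING')

for key in ('train_split', 'val_split', 'test_split'):
    path = os.path.join(data_root, 'transcripts', config[key])
    print(path, 'exists' if os.path.isfile(path) else 'MISSING')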
Data Loading and Preprocessing
Next, write a data loading and preprocessing script. An example dataset.py:
[<title="Dataset Loader and Preprocessor">]
import os
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset
import librosa
from moviepy.editor import VideoFileClip
class LRWDataset(Dataset):
def __init__(self, data_root, split_file, transform=None):
self.data_root = data_root
self.split_file = split_file
self.transform = transform
self.samples = self._load_samples()
def _load_samples(self):
samples = []
with open(os.path.join(self.data_root, 'transcripts', self.split_file), 'r') as f:
for line in f:
video_path, transcript = line.strip().split(' ', 1)
samples.append((video_path, transcript))
return samples
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
video_path, transcript = self.samples[idx]
video_frames = self._extract_video_frames(video_path)
audio_features = self._extract_audio_features(video_path)
if self.transform:
video_frames = self.transform(video_frames)
return video_frames, audio_features, transcript
def _extract_video_frames(self, video_path):
frames = []
cap = cv2.VideoCapture(os.path.join(self.data_root, 'videos', video_path))
while True:
ret, frame = cap.read()
if not ret:
break
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
frames.append(frame)
cap.release()
return np.array(frames)
def _extract_audio_features(self, video_path):
video_clip = VideoFileClip(os.path.join(self.data_root, 'videos', video_path))
audio_signal = video_clip.audio.to_soundarray(fps=16000)
mel_spectrogram = librosa.feature.melspectrogram(y=audio_signal.mean(axis=1), sr=16000, n_mels=80)
log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
return log_mel_spectrogram.T
# Example usage
if __name__ == '__main__':
dataset = LRWDataset(data_root='../datasets/lrw_1000/', split_file='train.txt')
sample = dataset[0]
print(sample[0].shape, sample[1].shape, sample[2])
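For a quick sanity check, the dataset can be wrapped in a standard DataLoader and the batch shapes inspected. The snippet below is a minimal sketch assuming the directory layout described above:
from torch.utils.data import DataLoader
from dataset import LRWDataset

# Batch a few samples and confirm their shapes before training.
dataset = LRWDataset(data_root='../datasets/lrw_1000/', split_file='val.txt')
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

video_frames, audio_features, transcripts = next(iter(loader))
print(video_frames.shape)    # expected: (4, 224, 224, 3), permuted to (4, 3, 224, 224) in train.py
print(audio_features.shape)  # expected: (4, 100, 80), transposed to (4, 80, 100) in train.py
print(transcripts[:2])       # transcripts are returned as strings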
Model Definition
We fine-tune the AV-Sync model, which here combines a ResNet-18 visual branch with a small 1D-CNN audio branch and fuses the two for classification. An example model.py:
[<title="Model Definition for Lip Reading">]
import torch
import torch.nn as nn
import torchvision.models as models
class AVSync(nn.Module):
def __init__(self, num_classes=1000, pretrained=True):
super(AVSync, self).__init__()
self.visual_model = models.resnet18(pretrained=pretrained)
self.visual_model.fc = nn.Linear(self.visual_model.fc.in_features, 256)
self.audio_model = nn.Sequential(
nn.Conv1d(80, 128, kernel_size=3, stride=1, padding=1),
nn.ReLU(),
nn.MaxPool1d(kernel_size=2, stride=2),
nn.Conv1d(128, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(),
nn.MaxPool1d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Linear(256 * 25, 256),
nn.ReLU()
)
self.fusion_layer = nn.Sequential(
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, num_classes)
)
def forward(self, visual_input, audio_input):
visual_output = self.visual_model(visual_input)
audio_output = self.audio_model(audio_input)
fusion_output = torch.cat((visual_output, audio_output), dim=1)
output = self.fusion_layer(fusion_output)
return output
# Example usage
if __name__ == '__main__':
model = AVSync(num_classes=1000)
visual_input = torch.randn(16, 3, 224, 224) # Batch size of 16, 3 channels, 224x224 images
audio_input = torch.randn(16, 80, 100) # Batch size of 16, 80 mel bins, 100 frames
output = model(visual_input, audio_input)
print(output.shape)
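When fine-tuning, it can help to freeze most of the pretrained ResNet-18 backbone at first and only train the newly added layers. The snippet below is one possible way to do this; it is an illustration on top of the model above, not part of the original AV-Sync code:
from model import AVSync

model = AVSync(num_classes=1000, pretrained=True)

# Freeze the pretrained ResNet-18 backbone; keep the new 256-dim projection (fc),
# the audio branch, and the fusion head trainable.
for name, param in model.visual_model.named_parameters():
    if not name.startswith('fc'):
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), 'trainable parameters')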
Training Script
Write a training script to train the model. An example train.py:
[<title="Training Script for Lip Reading">]
import os
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
from dataset import LRWDataset
from model import AVSync
import yaml
from tqdm import tqdm
def load_config(config_path):
with open(config_path, 'r') as f:
config = yaml.safe_load(f)
return config
def main():
config = load_config('config.yaml')
device = torch.device(config['device'])
# Create datasets and dataloaders
train_dataset = LRWDataset(data_root=config['data_root'], split_file=config['train_split'])
val_dataset = LRWDataset(data_root=config['data_root'], split_file=config['val_split'])
train_loader = DataLoader(train_dataset, batch_size=config['batch_size'], shuffle=True, num_workers=config['num_workers'])
val_loader = DataLoader(val_dataset, batch_size=config['batch_size'], shuffle=False, num_workers=config['num_workers'])
# Create model
model = AVSync(num_classes=1000, pretrained=True)
model.to(device)
# Load pretrained weights
if config['pretrained_weights']:
state_dict = torch.load(config['pretrained_weights'])
model.load_state_dict(state_dict, strict=False)
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config['learning_rate'])
# Training loop
for epoch in range(config['num_epochs']):
model.train()
running_loss = 0.0
for i, (video_frames, audio_features, transcripts) in enumerate(tqdm(train_loader)):
video_frames = video_frames.permute(0, 3, 1, 2).to(device) # Reshape to (batch_size, C, H, W)
audio_features = audio_features.transpose(1, 2).to(device) # Reshape to (batch_size, C, L)
optimizer.zero_grad()
outputs = model(video_frames, audio_features)
targets = torch.tensor([int(transcript) for transcript in transcripts], dtype=torch.long).to(device)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
running_loss += loss.item()
if i % config['log_interval'] == 0:
print(f'Epoch [{epoch+1}/{config["num_epochs"]}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss/config["log_interval"]:.4f}')
running_loss = 0.0
# Validation
model.eval()
correct = 0
total = 0
with torch.no_grad():
for video_frames, audio_features, transcripts in tqdm(val_loader):
video_frames = video_frames.permute(0, 3, 1, 2).to(device) # Reshape to (batch_size, C, H, W)
audio_features = audio_features.transpose(1, 2).to(device) # Reshape to (batch_size, C, L)
outputs = model(video_frames, audio_features)
_, predicted = torch.max(outputs.data, 1)
targets = torch.tensor([int(transcript) for transcript in transcripts], dtype=torch.long).to(device)
total += targets.size(0)
correct += (predicted == targets).sum().item()
accuracy = 100 * correct / total
print(f'Validation Accuracy: {accuracy:.2f}%')
# Save the trained model
torch.save(model.state_dict(), 'runs/train/exp/best.pt')
if __name__ == '__main__':
main()
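The script above saves the final weights as best.pt regardless of how the last epoch performed. If you would rather keep the checkpoint with the best validation accuracy, a small helper such as the following could be called at the end of each epoch (a sketch; save_if_best is not part of the original scripts):
import os
import torch

def save_if_best(model, accuracy, best_accuracy, path='runs/train/exp/best.pt'):
    """Save the model state dict only when validation accuracy improves.

    Returns the (possibly updated) best accuracy so the caller can keep tracking it.
    """
    if accuracy > best_accuracy:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        torch.save(model.state_dict(), path)
        print(f'New best model saved to {path} (val accuracy {accuracy:.2f}%)')
        return accuracy
    return best_accuracy

# Usage inside the epoch loop of train.py (after computing `accuracy`):
#   best_accuracy = save_if_best(model, accuracy, best_accuracy)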
Evaluation
After training, the following script (eval.py) evaluates the model on the test split:
[<title="Evaluation Script for Lip Reading">]
import os
import torch
from torch.utils.data import DataLoader
from dataset import LRWDataset
from model import AVSync
import yaml
from tqdm import tqdm
def load_config(config_path):
with open(config_path, 'r') as f:
config = yaml.safe_load(f)
return config
def evaluate_model(model, dataloader, device):
model.eval()
correct = 0
total = 0
with torch.no_grad():
for video_frames, audio_features, transcripts in tqdm(dataloader):
video_frames = video_frames.permute(0, 3, 1, 2).to(device) # Reshape to (batch_size, C, H, W)
audio_features = audio_features.transpose(1, 2).to(device) # Reshape to (batch_size, C, L)
outputs = model(video_frames, audio_features)
_, predicted = torch.max(outputs.data, 1)
targets = torch.tensor([int(transcript) for transcript in transcripts], dtype=torch.long).to(device)
total += targets.size(0)
correct += (predicted == targets).sum().item()
accuracy = 100 * correct / total
print(f'Test Accuracy: {accuracy:.2f}%')
def main():
config = load_config('config.yaml')
device = torch.device(config['device'])
# Create dataset and dataloader
test_dataset = LRWDataset(data_root=config['data_root'], split_file=config['test_split'])
test_loader = DataLoader(test_dataset, batch_size=config['batch_size'], shuffle=False, num_workers=config['num_workers'])
# Create model
model = AVSync(num_classes=1000, pretrained=False)
model.to(device)
# Load trained weights
model.load_state_dict(torch.load('runs/train/exp/best.pt'))
# Evaluate model
evaluate_model(model, test_loader, device)
if __name__ == '__main__':
main()
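LRW-style word classification results are often reported as top-1 and top-5 accuracy. The helper below is a minimal sketch of how the evaluation loop could be extended to track both; topk_correct is an illustrative helper, not part of the scripts above:
import torch

def topk_correct(outputs, targets, ks=(1, 5)):
    """Count how many targets appear in the top-k predictions, for each k."""
    max_k = max(ks)
    _, pred = outputs.topk(max_k, dim=1)       # (B, max_k) predicted class indices
    hits = pred.eq(targets.unsqueeze(1))       # (B, max_k) boolean match matrix
    return {k: hits[:, :k].any(dim=1).sum().item() for k in ks}

# Example with random scores for a batch of 4 samples and 1000 classes.
outputs = torch.randn(4, 1000)
targets = torch.randint(0, 1000, (4,))
print(topk_correct(outputs, targets))  # e.g. {1: 0, 5: 1}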
Usage Notes
1. Configure paths:
   - Make sure the datasets/lrw_1000/ directory structure is correct.
   - Make sure the paths and settings in config.yaml are correct.
2. Run the scripts:
   - Run the training script and then the evaluation script from the terminal (python train.py, then python eval.py).
3. Notes:
   - Adjust the hyperparameters and training settings as needed.
   - Model performance can be tuned by modifying the model and training parameters in config.yaml.
Example
Suppose your dataset follows the directory structure shown in "Dataset Preparation" above, and that the .txt files under transcripts/ list, for each video, its path and the corresponding class index. After running the scripts, you can inspect the training logs and the final model weights file.
Summary
Following the steps above, we have built a complete lip-reading system covering dataset preparation, model definition, training, and evaluation. The related code files are:
- Configuration file (config.yaml)
- Data loading and preprocessing script (dataset.py)
- Model definition script (model.py)
- Training script (train.py)
- Evaluation script (eval.py)