Music Generation with an RNN (LSTM) in PyTorch (Basics)

CaptainHarryChen

Last modified: 2024-09-17 15:43:42


First published: 2022-02-05 21:51:07

Article link: https://blog.csdn.net/can919/article/details/122793127


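Warning: this article is badly out of date. RNN models are rarely used nowadays, and current architectures are far better than the one described here. Readers who need something for practical use should not apply the contents of this article directly.

Last time I followed the official TensorFlow tutorial and wrote this in TensorFlow, but I could not get used to TensorFlow's style, so I rewrote it in PyTorch and got familiar with the basic PyTorch code flow along the way.

The main purpose of this article is to get familiar with the usual structure of PyTorch machine-learning code. The music generation model itself does a number of unreasonable things, so the results are not very good.

First, some content pasted from the previous blog post.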

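Installation

I used the latest PyTorch version at the time of writing, 1.10.2.
The pretty_midi library is used to read MIDI files; only MIDI files without tempo, meter and similar information are handled.
Other libraries include numpy and the progress-bar library tqdm.

The dataset is maestro-v2.0.0, the one used in the official TensorFlow tutorial.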

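Basic idea

Read the note list of a single instrument from a MIDI file and record each note's pitch, step (the time from the start of the previous note to the start of this note) and duration (the length of the note). All times are in seconds, and all three values are stored as floats.
PS: This is clearly not very sensible. It does not record beat information, which matters a lot in music, and it treats a discrete quantity like pitch as continuous. But that is not the point here.

Take sequence_length consecutive notes from the dataset and feed them into an LSTM module (theory omitted). The output is passed through three separate fully connected layers to predict the pitch, step and duration of the next note.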

Reading MIDI files

A brief introduction to pretty_midi

pretty_midi is simple and easy to use.
Reading a MIDI file:

pm = pretty_midi.PrettyMIDI(midi_file)

In the PrettyMIDI object pm, pm.instruments is the list of instruments.

instrument = pm.instruments[0]

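instrument is the first instrument, of type pretty_midi.Instrument, and instrument.notes is that instrument's note list.
The note type pretty_midi.Note has four attributes:

note.start 		# start time
note.end 		# end time
note.pitch		# pitch
note.velocity 	# velocity (how hard the note is played)

To create a MIDI file, just call pm.write(midi_file).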

Processing the MIDI data

Read the note list from a pretty_midi.Instrument, convert it into the three features we need (pitch, step, duration), and return it as a numpy array.

def GetNoteSequence(instrument: Instrument) -> np.ndarray:
    sorted_notes = sorted(instrument.notes, key=lambda x: x.start)
    assert len(sorted_notes) > 0
    notes = []
    prev_start = sorted_notes[0].start
    for note in sorted_notes:
        notes.append([note.pitch, note.start -
                     prev_start, note.end-note.start])
        prev_start = note.start
    return np.array(notes)
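
To make the feature definition concrete, here is a small sketch that applies the same logic to three hand-written notes. `ToyNote` and `notes_to_features` are hypothetical names introduced only for this illustration; `ToyNote` stands in for `pretty_midi.Note` so that the snippet runs without a MIDI file.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for pretty_midi.Note (only the fields used here).
ToyNote = namedtuple("ToyNote", ["start", "end", "pitch"])


def notes_to_features(notes):
    """Return rows of [pitch, step, duration], ordered by start time."""
    sorted_notes = sorted(notes, key=lambda n: n.start)
    rows = []
    prev_start = sorted_notes[0].start
    for n in sorted_notes:
        # step: gap between this note's start and the previous note's start
        rows.append([n.pitch, n.start - prev_start, n.end - n.start])
        prev_start = n.start
    return np.array(rows, dtype=np.float64)


# The notes are deliberately out of order to show the effect of sorting.
toy = [ToyNote(0.5, 1.0, 64), ToyNote(0.0, 0.5, 60), ToyNote(1.5, 2.0, 67)]
features = notes_to_features(toy)
print(features)
```

The first note always gets a step of 0, because it is measured against its own start time.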

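Preparing the training data

The core of data handling in PyTorch is torch.utils.data.Dataset and torch.utils.data.DataLoader.
We need a custom Dataset to handle our MIDI data.
First subclass torch.utils.data.Dataset, then implement the three methods __init__, __getitem__ and __len__, which respectively initialize the dataset, return the i-th sample, and return the total number of samples.

In the initializer, glob first builds the file list (glob supports wildcards). Then each file is opened with pretty_midi, its note sequence is extracted, and the pitch is normalized (this is what the official TensorFlow tutorial does).
The note sequences are stored in the class.
(np.append with axis=0 returns a new array with new_notes concatenated after the existing rows.)

When reading the i-th sample, the sequence of length seq_len starting at the i-th note is the input, and the note right after the end of that sequence is the label.

Finally, getendseq is added mainly so that at test time we can get the last sequence, which has no label.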

class SequenceMIDI(Dataset):
    def __init__(self, files, seq_len, max_file_num=None):
        notes = None
        filenames = glob.glob(files)
        print(f"Find {len(filenames)} files.")
        if max_file_num is None:
            max_file_num = len(filenames)
        print(f"Reading {max_file_num} files...")
        for f in tqdm(filenames[:max_file_num]):  # tqdm provides the progress bar
            pm = PrettyMIDI(f)
            instrument = pm.instruments[0]
            new_notes = GetNoteSequence(instrument)
            new_notes /= [128.0, 1.0, 1.0]
            if notes is not None:
                notes = np.append(notes, new_notes, axis=0)
            else:
                notes = new_notes

        self.seq_len = seq_len
        self.notes = np.array(notes, dtype=np.float32)

    def __len__(self):
        return len(self.notes)-self.seq_len

    def __getitem__(self, idx) -> (np.ndarray, dict):
        label_note = self.notes[idx+self.seq_len]
        label = {
            'pitch': (label_note[0]*128).astype(np.int64), 'step': label_note[1], 'duration': label_note[2]}
        return self.notes[idx:idx+self.seq_len], label

    def getendseq(self) -> np.ndarray:
        return self.notes[-self.seq_len:]
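
The indexing scheme can be checked on a toy array. The sketch below is a NumPy-only illustration with made-up data (10 fake notes, window length 4); `get_sample` is a hypothetical helper that mirrors what `__getitem__` does, without the label dictionary.

```python
import numpy as np

# 10 fake notes with 3 features each; row i holds [3i, 3i+1, 3i+2].
notes = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
seq_len = 4

# Same formula as __len__: every sample needs seq_len inputs plus one label.
num_samples = len(notes) - seq_len


def get_sample(idx):
    window = notes[idx:idx + seq_len]   # input: seq_len consecutive notes
    label = notes[idx + seq_len]        # target: the note right after the window
    return window, label


# The last valid index uses the final note as its label.
window, label = get_sample(num_samples - 1)
print(num_samples, window.shape, label)
```

Subtracting seq_len in `__len__` is what keeps `idx + seq_len` inside the array for the last sample.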

Loading the data

Use the Dataset we just built to construct a DataLoader.
We can then iterate over the DataLoader like a list to get batches.

# Parameter settings
batch_size = 64
sequence_lenth = 25
max_file_num = 1200
epochs = 200
learning_rate = 0.005

loss_weight = [0.1, 20.0, 1.0]

save_model_name = "model.pth"
# Train on the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

trainning_data = SequenceMIDI(
    "maestro-v2.0.0\\*\\*.midi", sequence_lenth, max_file_num=max_file_num)
print(f"Read {len(trainning_data)} sequences.")
loader = DataLoader(trainning_data, batch_size=batch_size)

for X, y in loader:
    print(f"X: {X.shape} {X.dtype}")
    print(f"y: {y}")
    break
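
Because each label is a dictionary, it helps to see what the DataLoader does with it. The sketch below assumes the default collate function and uses a hypothetical `ToyDataset` with made-up values; it shows that the dictionary is batched key by key.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset: 8 samples, each a (5, 3) window plus a label dict."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        x = np.full((5, 3), idx, dtype=np.float32)
        y = {"pitch": np.int64(idx), "step": np.float32(0.1 * idx)}
        return x, y


toy_loader = DataLoader(ToyDataset(), batch_size=4)
X, y = next(iter(toy_loader))

print(X.shape)           # batch of windows: (batch, seq_len, features)
print(y["pitch"].shape)  # one tensor per key, with a leading batch dimension
print(y["step"].shape)
```

Note that scalar labels come out with shape (batch,), not (batch, 1).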

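Building the model

Subclass torch.nn.Module to build our own model.
One LSTM processes the input notes, then three separate fully connected layers compute pitch, step and duration. A Sigmoid after the pitch layer squashes its values into the range 0 to 1.
With batch_first=True, the LSTM expects input of shape $(N, L, H_{in})$, that is, batch size, sequence length and input feature size.
The pitch output has 128 entries, the weight of each possible pitch.
step and duration are each a single scalar.

The built-in layers in torch.nn are generally constructed as (input size, output size).

In forward, note that the LSTM returns two values. Its return format is $output, (h_n, c_n)$; the trailing tuple holds the hidden and cell states, which we do not need.
output is a sequence and we only need its last element, so everything afterwards uses x[:, -1].
(See the theory of LSTMs for details.)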

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.lstm = torch.nn.LSTM(3, 128, num_layers=1, batch_first=True)
        self.pitch_linear = torch.nn.Linear(128, 128)
        self.pitch_sigmoid = torch.nn.Sigmoid()
        self.step_linear = torch.nn.Linear(128, 1)
        self.duration_linear = torch.nn.Linear(128, 1)

    def forward(self, x):
        x, _ = self.lstm(x)
        pitch = self.pitch_sigmoid(self.pitch_linear(x[:, -1]))
        step = self.step_linear(x[:, -1])
        duration = self.duration_linear(x[:, -1])
        return {'pitch': pitch, 'step': step, 'duration': duration}
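
The shapes involved can be verified directly. This is a minimal sketch on random input (batch size 64 and sequence length 25 are taken from the parameters above). It also checks that, for a single-layer, unidirectional LSTM, the last time step of the output equals the final hidden state.

```python
import torch

torch.manual_seed(0)

lstm = torch.nn.LSTM(input_size=3, hidden_size=128, num_layers=1, batch_first=True)
x = torch.randn(64, 25, 3)         # (N, L, H_in)

output, (h_n, c_n) = lstm(x)
last = output[:, -1]               # keep only the last time step: (N, H_out)

print(output.shape)                # (N, L, H_out)
print(last.shape)
print(h_n.shape)                   # (num_layers, N, H_out)
print(torch.allclose(last, h_n[-1], atol=1e-6))
```

Indexing with `[:, -1]` selects along the sequence dimension, whereas `[:-1]` would drop the last sample of the batch instead.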

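Loss function

For convenience I also wrote the loss function as a module.
This loss computes the loss of each of the three features and returns their weighted sum.
pitch uses cross-entropy (common for classifiers), while the two scalars use mean squared error plus a pressure term that pushes them to be positive (times are positive, after all).

Note that torch's built-in CrossEntropyLoss can compute the cross-entropy either between class weights and class indices, or between two sets of class weights.
In other words, both of the following are supported (target on the first line, prediction on the second; class indices are zero-based). Here we use the first form.

[       0       ,        1       ,      2      ]
[[0.05, 0.95, 0], [0.1, 0.8, 0.1],[0.3,0.7,0.0]]

[[1.00, 0.00, 0], [0.0, 1.0, 0.0],[0.0,0.0,1.0]]
[[0.05, 0.95, 0], [0.1, 0.8, 0.1],[0.3,0.7,0.0]]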
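
The two target formats can be compared with a short sketch. It uses made-up scores and assumes PyTorch 1.10 or later, where probability targets are accepted. Keep in mind that CrossEntropyLoss treats its first argument as unnormalized scores and applies log-softmax internally.

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()

# Made-up raw scores for 3 samples and 3 classes.
logits = torch.tensor([[0.2, 2.0, -1.0],
                       [1.5, 0.1, 0.3],
                       [0.0, 0.4, 3.0]])

# Form 1: one class index per sample.
target_idx = torch.tensor([1, 0, 2])
# Form 2: one probability vector per sample (here the equivalent one-hot rows).
target_prob = torch.nn.functional.one_hot(target_idx, num_classes=3).float()

loss_from_idx = loss_fn(logits, target_idx)
loss_from_prob = loss_fn(logits, target_prob)
print(loss_from_idx.item(), loss_from_prob.item())
```

With one-hot probability targets, both calls give the same value; the index form is simply the more compact way to express it.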

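def mse_with_positive_pressure(pred, y):
    # pred has shape (N, 1) while y has shape (N,); align them so that
    # y - pred does not broadcast to an (N, N) matrix
    pred = pred.reshape(y.shape)
    mse = (y-pred) ** 2
    # penalize negative predictions, since times cannot be negative
    positive_pressure = 10*torch.clamp(-pred, min=0)
    return torch.mean(mse+positive_pressure)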
   
   
class MyLoss(torch.nn.Module):
    def __init__(self, weight):
        super(MyLoss, self).__init__()
        self.weight = torch.Tensor(weight)
        self.pitch_loss=torch.nn.CrossEntropyLoss()
        self.step_loss=mse_with_positive_pressure
        self.duration_loss=mse_with_positive_pressure

    def forward(self, pred, y):
        a = self.pitch_loss(pred['pitch'], y['pitch'])
        b = self.step_loss(pred['step'], y['step'])
        c = self.duration_loss(pred['duration'], y['duration'])
        return a*self.weight[0]+b*self.weight[1]+c*self.weight[2]
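
One shape detail is worth checking in a loss like this. A Linear(128, 1) output has shape (N, 1), whereas a batch of scalar labels has shape (N,). PyTorch follows NumPy-style broadcasting, so subtracting the two yields an (N, N) matrix rather than N element-wise differences. The sketch below uses NumPy with made-up values to show the effect.

```python
import numpy as np

pred = np.array([[0.1], [0.2], [0.3]])   # shape (N, 1), like a Linear(128, 1) output
y = np.array([0.1, 0.2, 0.3])            # shape (N,), like a batch of scalar labels

# Broadcasting pairs every label with every prediction.
diff_broadcast = y - pred
# Reshaping first gives the intended element-wise difference.
diff_aligned = y - pred.reshape(y.shape)

print(diff_broadcast.shape)
print(diff_aligned.shape)
```

Here the predictions match the labels exactly, yet the broadcast version still contains non-zero entries, which would inflate a mean squared error.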

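On setting the loss weights
Run the model once beforehand to get the individual loss values, then set the weights by hand so that the weighted losses are of similar size.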
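
One possible way to turn that rule of thumb into numbers is sketched below. `balance_weights` is a hypothetical helper, and the loss values are invented for illustration; they are not measurements from this model.

```python
def balance_weights(sample_losses):
    """Return weights so that weight * loss is the same for every component."""
    # Use the mean of the raw losses as the common target value.
    target = sum(sample_losses.values()) / len(sample_losses)
    return {name: target / value for name, value in sample_losses.items()}


# Hypothetical per-component losses from a trial run (not measured results).
sample_losses = {"pitch": 4.0, "step": 0.02, "duration": 0.4}
weights = balance_weights(sample_losses)

for name, w in weights.items():
    print(name, round(w, 3), round(w * sample_losses[name], 3))
```

Exact balance is rarely the goal; the point is only to stop one component from dominating the gradient.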

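Training the model

First move the model to the GPU.
Set up the loss function and the optimizer.
Loop over the epochs, calling model.train() to put the model into training mode.
Within each epoch, iterate over the loader to get batches.
Move the data to the GPU, feed it to the model, and compute the loss.
Clear the gradients stored by the optimizer, backpropagate the loss, then call optimizer.step() to update the model parameters.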

model = MyModel().to(device)
loss_fn = MyLoss(loss_weight).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
print(model)
print(loss_fn)

print("Start trainning...")
size = len(loader.dataset)
for t in range(epochs):
    model.train()
    avg_loss = 0.0
    print(f"Epoch {t+1}\n-----------------")
    for batch,(X, y) in enumerate(tqdm(loader)):
        X= X.to(device)
        for feat in y.keys():
            y[feat]=y[feat].to(device)
        pred = model(X)
        loss = loss_fn(pred, y)
        avg_loss = avg_loss + loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    avg_loss /= len(loader)
    print(f"average loss = {avg_loss}")
    if (t+1) % 10 == 0:
        torch.save(model.state_dict(), "model%d.pth" % (t+1))
print("Done!")

torch.save(model.state_dict(), save_model_name)
print(f"Saved PyTorch Model State to {save_model_name}")

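Generating music

Predicting the next note

First call model.eval() to put the model into evaluation mode, and use torch.no_grad() so that PyTorch does not track gradients, which saves memory.
The input note sequence needs an extra dimension for the batch, because the model expects batched input.
Use torch.Tensor.unsqueeze() to add the dimension ([0,1,2] --> [[0,1,2]]).
Then feed the input into the model to get the predictions.
From the 128 pitch weights in the prediction, a pitch is drawn at random in proportion to its weight; I hand-wrote a weighted random function for this.
The pitch, duration and step outputs all carry a batch dimension, so np.squeeze is used to remove it ([[2]] --> [2]).
Finally, step and duration are clamped with max against 0 to avoid negative times.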

def WeightedRandom(weight, k=100000) -> int:
    sum = int(0)
    for w in weight:
        sum += int(k*w)
    x = random.randint(1, sum)
    sum = 0
    for id, w in enumerate(weight):
        sum += int(k*w)
        if sum >= x:
            return id
    return
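
For comparison, the standard library can do the same kind of weighted draw. The sketch below uses `random.choices` instead of the hand-written loop. `weighted_choice` is a hypothetical name, the weights are made up, and a tensor of weights would first need to be converted with `.tolist()`.

```python
import random


def weighted_choice(weights, rng=random):
    """Pick an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Draw many samples with a seeded generator and count how often each index wins.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[weighted_choice([0.1, 0.7, 0.2], rng)] += 1

print(counts)
```

The counts should roughly follow the 1 : 7 : 2 ratio of the weights.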


def PredictNextNote(model: MyModel, input: np.ndarray):
    model.eval()
    with torch.no_grad():
        input = torch.tensor(input, dtype=torch.float32).unsqueeze(0)
        pred = model(input)
        pitch = WeightedRandom(np.squeeze(pred['pitch'], axis=0))
        step = np.maximum(np.squeeze(pred['step'], axis=0), 0)
        duration = np.maximum(np.squeeze(pred['duration'], axis=0), 0)
    return pitch, float(step), float(duration)

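Generating a sequence

First we need a starting input sequence as inspiration.
Initialize a Dataset from sample_file_name and use its last sequence as the input.
Then, each time a note is predicted, drop the first note of the input sequence and append the generated note to its end.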
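
The rolling-window update can be sketched on its own with NumPy. `roll_window` is a hypothetical helper. It assumes the sampled pitch is an integer in 0..127 and rescales it by 128, so that the new row is on the same scale as the normalized training inputs.

```python
import numpy as np


def roll_window(window, pitch, step, duration, pitch_scale=128.0):
    """Drop the oldest note and append the newly generated one."""
    new_row = np.array([[pitch / pitch_scale, step, duration]], dtype=window.dtype)
    return np.concatenate([window[1:], new_row], axis=0)


# Start from an all-zero window of 25 notes and push one generated note into it.
window = np.zeros((25, 3), dtype=np.float32)
window = roll_window(window, pitch=64, step=0.25, duration=0.5)

print(window.shape)
print(window[-1])
```

The window length stays constant, so the model always sees the same number of notes.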

sample_file_name = "sample.mid"
output_file_name = "output10.mid"
save_model_name = "model10.pth"
predict_length = 128
sequence_lenth = 25

model = MyModel()
model.load_state_dict(torch.load(save_model_name))

sample_data = SequenceMIDI(sample_file_name, sequence_lenth)

cur = sample_data.getendseq()
res = []
prev_start = 0
for i in tqdm(range(predict_length)):
    pitch, step, duration = PredictNextNote(model, cur)
    res.append([pitch, step, duration])
    cur = cur[1:]
    # feed the pitch back on the same 0..1 scale that the training inputs use
    cur = np.append(cur, [[pitch/128.0, step, duration]], axis=0)
    prev_start += step

pm_output = PrettyMIDI()
pm_output.instruments.append(
    CreateMIDIInstrumennt(res, "Acoustic Grand Piano"))
pm_output.write(output_file_name)
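
The helper that turns `res` into an instrument is not shown above. Whatever it looks like, it has to convert each relative step back into absolute start and end times. The sketch below covers only that conversion in plain Python; `to_absolute_times` is a hypothetical name, not the article's helper.

```python
def to_absolute_times(notes):
    """Convert (pitch, step, duration) rows to (pitch, start, end) tuples."""
    result = []
    prev_start = 0.0
    for pitch, step, duration in notes:
        start = prev_start + step          # step is relative to the previous start
        result.append((int(pitch), start, start + duration))
        prev_start = start
    return result


timed = to_absolute_times([[60, 0.0, 0.5], [64, 0.5, 0.5], [67, 1.0, 0.25]])
print(timed)
```

Each tuple could then be turned into a `pretty_midi.Note(velocity=..., pitch=pitch, start=start, end=end)` and appended to the `notes` list of a `pretty_midi.Instrument`. The velocity has to be chosen by hand, since the model does not predict it.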