Generating Piano Music with Dilated Convolutional Neural Networks

Introduction

Fully convolutional neural networks consisting of dilated 1D convolutions are straightforward to construct, easy to train, and can generate realistic piano music, such as the following:

Example performance generated by a fully convolutional network trained on 100 hours of classical music.

Motivation

A considerable amount of research has been devoted to training deep neural networks that can compose piano music. For example, Musenet, developed by OpenAI, has trained large-scale transformer models capable of composing realistic piano pieces that are many minutes in length. The model used by Musenet adopts many of the technologies, such as attention layers, that were originally developed for NLP tasks. See this previous TDS post for more details on applying attention-based models to music generation.

Although NLP-based methods are a fantastic fit for machine-based music generation (after all, music is like a language), the transformer model architecture is somewhat involved, and proper data preparation and training can require great care and experience. This steep learning curve motivates my exploration of simpler approaches to training deep neural networks that can compose piano music. In particular, I’ll focus on fully convolutional neural networks based on dilated convolutions, which require only a handful of lines of code to define, take minimal data preparation, and are easy to train.

Historical Context

In 2016, DeepMind researchers introduced the WaveNet model architecture,¹ which yielded state-of-the-art performance in speech synthesis. Their research demonstrated that stacked 1D convolutional layers with exponentially growing dilation rates can process sequences of raw audio waveforms extremely efficiently, leading to generative models that can synthesize convincing audio from a variety of sources, including piano music.

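To make the "straightforward to construct" claim concrete, here is a minimal sketch of such a stack of causal 1D convolutions with exponentially growing dilation rates, written with the Keras API. The layer counts, filter sizes, and other hyperparameters below are illustrative assumptions, not the exact WaveNet or PianoNet configuration:

```python
# A minimal sketch of stacked causal 1D convolutions with exponentially
# growing dilation rates (illustrative, not the exact WaveNet/PianoNet model).
import tensorflow as tf
from tensorflow.keras import layers

def build_dilated_stack(num_keys=88, num_layers=8, filters=64):
    # Input: a variable-length sequence of binary key-state vectors.
    inputs = tf.keras.Input(shape=(None, num_keys))
    x = inputs
    for i in range(num_layers):
        # The dilation rate doubles every layer (1, 2, 4, ...), so the
        # receptive field grows exponentially with depth.
        x = layers.Conv1D(filters, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    # A 1x1 convolution maps each time step to per-key press probabilities.
    outputs = layers.Conv1D(num_keys, kernel_size=1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_dilated_stack()
model.compile(optimizer="adam", loss="binary_crossentropy")
```

With eight such layers and kernel size 2, the receptive field already spans 256 time steps, which is why these stacks process long sequences so efficiently.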

In this post, I build upon DeepMind’s research, with an explicit focus on generating piano music. Instead of feeding the model raw audio from recorded music, I feed it sequences of piano notes encoded in Musical Instrument Digital Interface (MIDI) files. This facilitates data collection, drastically reduces computational load, and allows the model to focus entirely on the musical aspects of the data. This efficient data encoding and ease of data collection enable rapid exploration of how well fully convolutional networks can understand piano music.

How Well Can These Models ‘Play the Piano’?

To give a sense of how realistic these models can sound, let’s play an imitation game. Which excerpt below was composed by a human, and which by a model?

Piano Composition A: Human or model?
Piano Composition B: Human or model?

Maybe you anticipated this trick, but both compositions were produced by the model described in this post. The model generating the above two pieces took only four days to train on a single NVIDIA Tesla T4 with 100 hours of classical piano music in the training set.

I hope the quality of these two performances provides you with motivation to read on and explore how to build your own models for generating piano music. The code described in this project can be found at PianoNet’s Github, and more example compositions can be found at PianoNet’s SoundCloud.

Now, let’s dive into the details of how to train a model to produce piano music like the above examples.

Approach

When beginning any machine learning project, it’s good practice to clearly define the task we’re trying to accomplish, the experience from which our model will learn, and the performance measure(s) we’ll use to determine if our model is improving at the task.

Task

Our overarching goal is to produce a model that efficiently approximates the data generating distribution, P(X). This distribution is a function that maps any sequence of piano notes, X, to a real number ranging between 0 and 1. In essence, P(X) assigns larger values to sequences that are more likely to have been created by skilled human composers. For example, if X¹ is a composition consisting of 200 hours of randomly selected notes, and X² is a Mozart sonata, then P(X¹) < P(X²). Further, P(X¹) will be very near zero.

In practice, the distribution P(X) can never be exactly determined, as this would require gathering all the human composers that could ever exist into a room and making them write piano music for all of eternity. However, the same incompleteness exists for simpler data generating distributions. For example, exactly determining the distribution of human heights requires all possible humans to exist and be measured, but this doesn’t stop us from defining and approximating such a distribution. In this sense, the P(X) defined above is a useful mathematical abstraction that encompasses all possible factors determining how piano music is generated.

If we estimate P(X) well enough with a model, we can use that model to stochastically sample new, realistic compositions that have never been heard before. This definition is still a little abstract, so let’s apply some sensible approximations to make the estimation of P(X) more tractable.

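As a concrete illustration of this sampling idea, the sketch below draws a new composition one time step at a time from a hypothetical trained model. The model interface is an assumption for illustration: given a history of binary key states, it is assumed to output per-key press probabilities for the next step.

```python
# A minimal sketch of stochastic sampling from a trained next-step model.
# `model` is an assumed interface: it maps a (1, time, num_keys) history of
# binary key states to per-key press probabilities at each time step.
import numpy as np

def sample_composition(model, seed_states, num_steps=1000):
    states = list(seed_states)  # each entry: a (num_keys,) binary array
    for _ in range(num_steps):
        history = np.asarray(states)[None, ...]           # (1, time, num_keys)
        probs = model.predict(history, verbose=0)[0, -1]  # next-step probabilities
        # Draw each key's on/off state from its Bernoulli probability.
        next_state = (np.random.rand(probs.shape[0]) < probs).astype(np.uint8)
        states.append(next_state)
    return np.asarray(states)
```

Because each step is drawn stochastically rather than greedily, repeated runs from the same seed yield different, novel compositions.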

Data Encoding: We need to encode piano music in a way that a computer can understand. To do this, we will represent a piano composition as a variable-length time series of binary states, each state tracking whether or not a given note on the keyboard is being pressed down by a finger during a time step:

Figure 1: The data-encoding process, showing how piano music can be represented as a 1D series of binary key states, with a state of one representing a key being pressed during a particular time step.
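As a sketch of this encoding step, the snippet below converts a MIDI file into the binary key-state matrix of Figure 1 using the pretty_midi library. The choice of library, the helper name, and the example filename are my own assumptions; PianoNet’s actual preprocessing may differ.

```python
# Illustrative MIDI -> binary key-state encoding (PianoNet's own
# preprocessing may differ).
import numpy as np
import pretty_midi

def midi_to_key_states(path, time_step=0.02):
    """Return a (num_steps, 128) binary array: 1 = key held during that step."""
    midi = pretty_midi.PrettyMIDI(path)
    fs = 1.0 / time_step  # 0.02-second steps -> 50 states per second
    # get_piano_roll returns per-key velocities sampled at fs; thresholding
    # to 0/1 discards dynamics, exactly the information loss described below.
    roll = midi.get_piano_roll(fs=fs)       # shape: (128 keys, num_steps)
    return (roll.T > 0).astype(np.uint8)    # shape: (num_steps, 128 keys)

key_states = midi_to_key_states("example.mid")
```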

The data processing inequality tells us that information can only ever be lost when we process information,² and our chosen method of encoding piano music is no exception. There are two ways information loss can happen in this case. First, resolution in the time direction must be limited by making the time steps finite. Luckily, a rather large step size of 0.02 seconds still leads to a negligible reduction in music quality. Second, we do not represent the velocity with which keys are pressed, and thus, musical dynamics are lost.

Despite significant approximations, this encoding method captures much of the underlying information in a concise and machine-friendly form. This is because a piano is effectively a big mechanical finite state machine. Efficiently encoding the music of other, more nuanced instruments, such as a guitar or a flute, would likely be much harder.

Now that we have an encoding scheme, we can represent the data generating distribution in a more concrete form:

P(X) = P(x_1, x_2, …, x_N), where each binary variable x_i equals 1 if its key is pressed during its time step and 0 otherwise.
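A standard way to make this high-dimensional distribution tractable, used by WaveNet-style autoregressive models, is to factor it with the chain rule of probability, predicting each key state from all of the states that precede it:

P(X) = P(x_1) · P(x_2 | x_1) · P(x_3 | x_1, x_2) · … · P(x_N | x_1, …, x_{N-1})

Each conditional factor is exactly the kind of next-state press probability that a causal dilated convolutional network can be trained to output, and sampling the factors one at a time is what generates new compositions.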