Summary: Most HRTF-based spatialization approaches rely on public data sets, which do not contain HRTFs for every direction and may be only sparsely sampled, so judgments of a sound's direction can be somewhat off. Head movement also affects HRTF computation. Headphones play an important role here: they project sound into the user's auditory space more effectively, making HRTF-based rendering more reliable. HRTFs cannot model distance, however, which must be handled by other means, such as loudness, initial time delay, and motion parallax.
The previous article discussed how humans localize the source of a sound in three-dimensional space. Now let's turn the question around: can we apply that knowledge to make listeners believe a sound is coming from a specific location in space?
Fortunately, the answer is yes; otherwise this article would be very short. A crucial part of VR audio is spatialization: making a sound appear to come from a specific point in three-dimensional space. Spatialization gives users the sense of being in a real 3D environment and helps strengthen immersion.
As with localization, spatialization has two key components: direction and distance.
1. Directional Spatialization with Head-Related Transfer Functions (HRTFs)
Capturing HRTFs
The most accurate method of HRTF capture is to take an individual, put a couple of microphones in their ears (right outside the ear canal), place them in an anechoic chamber (i.e., an echoless environment), play sounds in the chamber from every direction we care about, and record those sounds from the mics. We can then compare the original sound with the captured sound and compute the HRTF that takes you from one to the other.
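As a concrete sketch of that comparison step, the code below estimates a transfer function by dividing the spectrum of the recorded in-ear signal by the spectrum of the test signal, with a small regularization term. This is a simplified illustration, not any particular lab's capture pipeline; the function names and parameters are invented for this example.

```python
import numpy as np

def estimate_hrtf(original, recorded, n_fft=512, eps=1e-8):
    """Estimate a transfer function from a test signal and its in-ear recording.

    Dividing the recorded spectrum by the original spectrum yields the
    transfer function for that source direction. The small `eps` term
    regularizes the division where the test signal has little energy.
    """
    O = np.fft.rfft(original, n_fft)
    R = np.fft.rfft(recorded, n_fft)
    return R * np.conj(O) / (np.abs(O) ** 2 + eps)

def hrtf_to_hrir(H, n_fft=512):
    """The head-related impulse response (HRIR) is the inverse FFT of the HRTF."""
    return np.fft.irfft(H, n_fft)
```

In practice the test signal would be a sine sweep or noise burst with energy at all frequencies, precisely so that this division is well conditioned.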
We have to do this for both ears, and we have to capture sounds from a sufficient number of discrete directions to build a usable sample set.
But wait: we have only captured HRTFs for a specific person. If our brains are conditioned to interpret the HRTFs of our own bodies, why would that work? Don't we have to go to a lab and capture a personalized HRTF set?
In a perfect world, yes, we'd all have custom HRTFs measured that match our own body and ear geometry precisely, but in reality this isn't practical. While our HRTFs are personal, they are similar enough to each other that a generic reference set is adequate for most situations, especially when combined with head tracking.
Most HRTF-based spatialization implementations use one of a few publicly available data sets, captured either from a range of human test subjects or from a synthetic head model such as the KEMAR.
Most HRTF databases do not have HRTFs for all directions. For example, there is often a large gap representing the area beneath the subject's head, since it is difficult, if not impossible, to place a speaker one meter directly below an individual's head. Some HRTF databases are sparsely sampled, including HRTFs only every 5 or 15 degrees.
Most implementations either snap to the nearest acquired HRTF (which exhibits audible discontinuities) or use some method of HRTF interpolation. This is an ongoing area of research, but for desktop VR applications it is often adequate to find and use a sufficiently dense data set.
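The simplest interpolation approach can be sketched as a linear blend of the two nearest measured responses along the azimuth ring. This is only an illustration (naive time-domain blending can smear interaural time differences, which is part of why interpolation remains a research area); the names and data layout are assumptions for this example.

```python
import numpy as np

def interpolate_hrir(hrirs, step_deg, azimuth_deg):
    """Linearly blend the two measured HRIRs nearest to `azimuth_deg`.

    `hrirs` has shape (num_directions, taps), with row i measured at
    azimuth i * step_deg. Snapping to the nearest row instead of
    blending would produce the audible discontinuities noted above.
    """
    az = azimuth_deg % 360.0
    lo = int(az // step_deg) % len(hrirs)
    hi = (lo + 1) % len(hrirs)          # wrap around from 360° back to 0°
    frac = (az % step_deg) / step_deg   # blend weight toward `hi`
    return (1.0 - frac) * hrirs[lo] + frac * hrirs[hi]
```

More careful schemes separate each HRIR into a pure delay plus a minimum-phase filter and interpolate those components independently.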
Applying HRTFs
In practice, you will rarely, if ever, have to implement an HRTF system yourself. Our discussion glosses over a lot of the implementation details (e.g., how we store an HRTF, how we use it when processing a sound). For our purposes, what matters is the high-level concept: we are simply filtering an audio signal to make it sound like it's coming from a specific direction.
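In its simplest time-domain form, that filtering is just convolving the mono source with a measured left/right impulse-response pair. A minimal sketch, assuming equal-length HRIRs (real engines use block-based FFT convolution for efficiency):

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Filter a mono signal through a left/right HRIR pair.

    Over headphones, the resulting stereo signal sounds as if it
    arrives from the direction the HRIRs were measured at.
    Assumes both HRIRs have the same length.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)   # shape: (samples, 2)
```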
Head Tracking
2. Distance Modeling
HRTFs help us identify a sound's direction, but they do not model our localization of distance. Humans use several factors to infer the distance to a sound source. These can be simulated with varying degrees of accuracy and cost in software:
- Loudness, our most reliable cue, is trivial to model with simple attenuation based on the distance between the source and the listener.
- Initial time delay is significantly harder to model, as it requires computing the early reflections for a given set of geometry, along with that geometry's characteristics. This is both computationally expensive and awkward to implement architecturally (specifically, sending world geometry to a lower-level API is often complex). Even so, several packages have made attempts at this, ranging from simple "shoebox models" to elaborate full-scene geometric modeling.
- Direct vs. reverberant sound (or, in audio production, the "wet/dry mix") is a natural byproduct of any system that attempts to accurately model reflections and late reverberation. Unfortunately, such systems tend to be very expensive computationally. With ad hoc models based on artificial reverberators, the mix setting can be adjusted in software, but these are strictly empirical models.
- Motion parallax we get "for free," because it is a byproduct of the velocity of a sound source.
- High-frequency attenuation due to air absorption is a minor effect, but it is reasonably easy to model by applying a simple low-pass filter and adjusting its cutoff frequency and slope. In practice, HF attenuation is not very significant in comparison to the other distance cues.
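The cheaper cues above can be tied together in a single per-source distance model. The sketch below is purely illustrative: every constant (the reference distance, cutoff curve, and wet-mix slope) is an assumed, uncalibrated placeholder, not a value from any real audio engine.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def distance_cues(distance_m, ref_distance_m=1.0):
    """Toy distance model combining the cheaper cues (illustrative constants).

    Returns (gain, delay_s, lowpass_cutoff_hz, wet_mix):
      - gain: inverse-distance loudness attenuation, clamped at the
        reference distance so very close sources do not blow up
      - delay_s: propagation delay; when distance changes over time,
        resampling against this delay yields motion parallax (Doppler)
        essentially for free
      - lowpass_cutoff_hz: crude air-absorption model, with the cutoff
        dropping as the source recedes
      - wet_mix: empirical reverberant fraction that grows with distance
    """
    d = max(distance_m, ref_distance_m)
    gain = ref_distance_m / d                        # loudness: 1/d falloff
    delay_s = distance_m / SPEED_OF_SOUND            # initial time delay
    lowpass_cutoff_hz = 20000.0 / (1.0 + 0.05 * d)   # HF attenuation
    wet_mix = min(0.9, 0.1 + 0.02 * d)               # direct vs. reverberant
    return gain, delay_s, lowpass_cutoff_hz, wet_mix
```

A renderer would apply the gain and low-pass filter to the direct path, feed `wet_mix` to a reverberator, and drive a fractional delay line with `delay_s`.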