Summary: Most HRTF-based spatialization approaches rely on public data sets, which do not contain HRTFs for every direction and may be only sparsely sampled, so judgments of a sound's direction can be somewhat off. Head movement also affects HRTF computation. Headphones play an important role here: they project sound into the user's auditory space more effectively, making HRTF-based rendering more reliable. HRTFs cannot model distance, however, which must be handled by other means, such as loudness, initial time delay, and motion parallax.
The previous article discussed how humans localize the source of a sound in three-dimensional space. Now let's turn the question around: can we apply that knowledge to make listeners believe a sound is coming from a specific location in space?
Fortunately, the answer is yes; otherwise this article would be very short. A crucial part of VR audio is spatialization: making a sound appear to come from a specific point in three-dimensional space. Spatialization gives users the sense of being in a real 3D environment and helps strengthen immersion.
As with localization, spatialization has two key components: direction and distance.
1. Directional Spatialization with Head-Related Transfer Functions (HRTFs)
Capturing HRTFs
The most accurate method of HRTF capture is to take an individual, put a couple of microphones in their ears (right outside the ear canal), place them in an anechoic chamber (i.e., an echoless environment), play sounds in the chamber from every direction we care about, and record those sounds from the mics. We can then compare the original sound with the captured sound and compute the HRTF that takes you from one to the other.
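As a concrete sketch of that comparison step, the code below estimates a transfer function by dividing the spectrum of the recorded in-ear signal by the spectrum of the test signal, with a small regularization term. This is a simplified illustration, not any particular lab's capture pipeline; the function names and parameters are invented for this example.

```python
import numpy as np

def estimate_hrtf(original, recorded, n_fft=512, eps=1e-8):
    """Estimate a transfer function from a test signal and its in-ear recording.

    Dividing the recorded spectrum by the original spectrum yields the
    transfer function for that source direction. The small `eps` term
    regularizes the division where the test signal has little energy.
    """
    O = np.fft.rfft(original, n_fft)
    R = np.fft.rfft(recorded, n_fft)
    return R * np.conj(O) / (np.abs(O) ** 2 + eps)

def hrtf_to_hrir(H, n_fft=512):
    """The head-related impulse response (HRIR) is the inverse FFT of the HRTF."""
    return np.fft.irfft(H, n_fft)
```

In practice the test signal would be a sine sweep or noise burst with energy at all frequencies, precisely so that this division is well conditioned.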
We have to do this for both ears, and we have to capture sounds from a sufficient number of discrete directions to build a usable sample set.
But wait: we have only captured HRTFs for a specific person. If our brains are conditioned to interpret the HRTFs of our own bodies, why would that work? Don't we have to go to a lab and capture a personalized HRTF set?
In a perfect world, yes, we'd all have custom HRTFs measured that match our own body and ear geometry precisely, but in reality this isn't practical. While our HRTFs are personal, they are similar enough to each other that a generic reference set is adequate for most situations, especially when combined with head tracking.
Most HRTF-based spatialization implementations use one of a few publicly available data sets, captured either from a range of human test subjects or from a synthetic head model such as the KEMAR.
Most HRTF databases do not have HRTFs for all directions. For example, there is often a large gap representing the area beneath the subject's head, since it is difficult, if not impossible, to place a speaker one meter directly below an individual's head. Some HRTF databases are sparsely sampled, including HRTFs only every 5 or 15 degrees.
Most implementations either snap to the nearest acquired HRTF (which exhibits audible discontinuities) or use some method of HRTF interpolation. This is an ongoing area of research, but for desktop VR applications it is often adequate to find and use a sufficiently dense data set.
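The simplest interpolation approach can be sketched as a linear blend of the two nearest measured responses along the azimuth ring. This is only an illustration (naive time-domain blending can smear interaural time differences, which is part of why interpolation remains a research area); the names and data layout are assumptions for this example.

```python
import numpy as np

def interpolate_hrir(hrirs, step_deg, azimuth_deg):
    """Linearly blend the two measured HRIRs nearest to `azimuth_deg`.

    `hrirs` has shape (num_directions, taps), with row i measured at
    azimuth i * step_deg. Snapping to the nearest row instead of
    blending would produce the audible discontinuities noted above.
    """
    az = azimuth_deg % 360.0
    lo = int(az // step_deg) % len(hrirs)
    hi = (lo + 1) % len(hrirs)          # wrap around from 360° back to 0°
    frac = (az % step_deg) / step_deg   # blend weight toward `hi`
    return (1.0 - frac) * hrirs[lo] + frac * hrirs[hi]
```

More careful schemes separate each HRIR into a pure delay plus a minimum-phase filter and interpolate those components independently.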
Applying HRTFs
In practice, you will rarely, if ever, have to implement an HRTF system yourself. Our discussion glosses over a lot of the implementation details (e.g., how we store an HRTF, how we use it when processing a sound). For our purposes, what matters is the high-level concept: we are simply filtering an audio signal to make it sound like it's coming from a specific direction.
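In its simplest time-domain form, that filtering is just convolving the mono source with a measured left/right impulse-response pair. A minimal sketch, assuming equal-length HRIRs (real engines use block-based FFT convolution for efficiency):

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Filter a mono signal through a left/right HRIR pair.

    Over headphones, the resulting stereo signal sounds as if it
    arrives from the direction the HRIRs were measured at.
    Assumes both HRIRs have the same length.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)   # shape: (samples, 2)
```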
Head Tracking
2. Distance Modeling
HRTFs help us identify a sound's direction, but they do not model our localization of distance. Humans use several factors to infer the distance to a sound source. These can be simulated with varying degrees of accuracy and cost in software:
- Loudness, our most reliable cue, is trivial to model with simple attenuation based on the distance between the source and the listener.
- Initial time delay is significantly harder to model, as it requires computing the early reflections for a given set of geometry, along with that geometry's characteristics. This is both computationally expensive and awkward to implement architecturally (specifically, sending world geometry to a lower-level API is often complex). Even so, several packages have made attempts at this, ranging from simple "shoebox models" to elaborate full-scene geometric modeling.
- Direct vs. reverberant sound (or, in audio production, the "wet/dry mix") is a natural byproduct of any system that attempts to accurately model reflections and late reverberation. Unfortunately, such systems tend to be very expensive computationally. With ad hoc models based on artificial reverberators, the mix setting can be adjusted in software, but these are strictly empirical models.
- Motion parallax we get "for free," because it is a byproduct of the velocity of a sound source.
- High-frequency attenuation due to air absorption is a minor effect, but it is reasonably easy to model by applying a simple low-pass filter and adjusting its cutoff frequency and slope. In practice, HF attenuation is not very significant in comparison to the other distance cues.
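The cheaper cues above can be tied together in a single per-source distance model. The sketch below is purely illustrative: every constant (the reference distance, cutoff curve, and wet-mix slope) is an assumed, uncalibrated placeholder, not a value from any real audio engine.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def distance_cues(distance_m, ref_distance_m=1.0):
    """Toy distance model combining the cheaper cues (illustrative constants).

    Returns (gain, delay_s, lowpass_cutoff_hz, wet_mix):
      - gain: inverse-distance loudness attenuation, clamped at the
        reference distance so very close sources do not blow up
      - delay_s: propagation delay; when distance changes over time,
        resampling against this delay yields motion parallax (Doppler)
        essentially for free
      - lowpass_cutoff_hz: crude air-absorption model, with the cutoff
        dropping as the source recedes
      - wet_mix: empirical reverberant fraction that grows with distance
    """
    d = max(distance_m, ref_distance_m)
    gain = ref_distance_m / d                        # loudness: 1/d falloff
    delay_s = distance_m / SPEED_OF_SOUND            # initial time delay
    lowpass_cutoff_hz = 20000.0 / (1.0 + 0.05 * d)   # HF attenuation
    wet_mix = min(0.9, 0.1 + 0.02 * d)               # direct vs. reverberant
    return gain, delay_s, lowpass_cutoff_hz, wet_mix
```

A renderer would apply the gain and low-pass filter to the direct path, feed `wet_mix` to a reverberator, and drive a fractional delay line with `delay_s`.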