In the real world, when you hear two sounds at once, what you’re hearing is the combination (in the “+” sense) of the two noises. If you put five hundred drummers in the same room and, avoiding the obvious drummer jokes for now, told them all to play, you’d get drummer 1 + drummer 2 + … + drummer 500 (and bleeding ears).
With digital audio though, the volume doesn’t go up to oh-god-please-make-them-stop – it’s limited to a small dynamic range.
So, digital mixing actually requires a little thought in order to avoid overflowing these bounds and clipping. I recently came across this when writing some mixing routines for my upcoming app Loopy 2, and found a very useful discussion on mixing digital audio by software developer and author Viktor Toth.
The basic concept is to mix in such a way that we stay within the dynamic range of the target audio format, while representing the dynamics of the mixed signals as faithfully as possible. Note that a simple average of the samples (as in, (sample 1 + sample 2) / 2) won’t accomplish this: if sample 1 is silent while sample 2 is happily jamming away, sample 2 will be halved in volume.
Instead, we want to meet three goals, assuming signed audio samples (the standard format for Remote IO/audio units on the iPhone/iPad), which range from negative values, through zero (silence), up to positive values:
- If both samples are positive, we mix them so that the output value is somewhere between the maximum value of the two samples, and the maximum possible value
- If both samples are negative, we mix them so that the output value is somewhere between the minimum value of the two samples, and the minimum possible value
- If one sample is positive, and one is negative, we want them to cancel out somewhat
If we’re talking about signed samples, MIN…0…MAX, this does the trick:

- Both positive: output = A + B − (A × B) / MAX
- Both negative: output = A + B − (A × B) / MIN
- Otherwise: output = A + B

This lets the volume level for both samples remain the same, while fitting within the available range.
Here’s how it’s done on iOS:
SInt16 *bufferA, *bufferB;    // source buffers (assumed allocated and filled)
SInt16 *outputBuffer;         // destination buffer (assumed allocated)
NSInteger bufferLength;       // number of samples in each buffer

for ( NSInteger i=0; i<bufferLength; i++ ) {
    if ( bufferA[i] < 0 && bufferB[i] < 0 ) {
        // If both samples are negative, mixed signal must have an amplitude between
        // the lesser of A and B, and the minimum permissible negative amplitude
        outputBuffer[i] = (bufferA[i] + bufferB[i]) - ((bufferA[i] * bufferB[i]) / INT16_MIN);
    } else if ( bufferA[i] > 0 && bufferB[i] > 0 ) {
        // If both samples are positive, mixed signal must have an amplitude between
        // the greater of A and B, and the maximum permissible positive amplitude
        outputBuffer[i] = (bufferA[i] + bufferB[i]) - ((bufferA[i] * bufferB[i]) / INT16_MAX);
    } else {
        // If samples are on opposite sides of the 0-crossing, mixed signal should
        // reflect that the samples cancel each other out somewhat
        outputBuffer[i] = bufferA[i] + bufferB[i];
    }
}
Update: A reader recently demonstrated that this technique can introduce some unpleasant distortion with certain kinds of input — as the algorithm is nonlinear, some distortion is inevitable (see the sharp points on the waveform where the condition switches over). For the kind of audio I’m mixing, the results seem to be perfectly adequate, but this may not be generally true.
Update 2: Here’s an inline function I put together for neatness:
inline SInt16 TPMixSamples(SInt16 a, SInt16 b) {
    return
        // If both samples are negative, mixed signal must have an amplitude between
        // the lesser of A and B, and the minimum permissible negative amplitude
        a < 0 && b < 0 ?
            ((int)a + (int)b) - (((int)a * (int)b) / INT16_MIN) :
        // If both samples are positive, mixed signal must have an amplitude between
        // the greater of A and B, and the maximum permissible positive amplitude
        ( a > 0 && b > 0 ?
            ((int)a + (int)b) - (((int)a * (int)b) / INT16_MAX)
        // If samples are on opposite sides of the 0-crossing, mixed signal should
        // reflect that the samples cancel each other out somewhat
            : a + b );
}
Not everyone agrees with this approach, though. One reader argues that it’s wrong:
This is so terribly wrong. Please don’t mislead newbies into thinking that this is the correct way to mix two channels. The correct way is to simply sum/average them together, as you dismissed early in the article.
Summing/averaging is exactly what every professional analog or digital mixing console does, because it’s exactly what happens in the air and in our ears and in our brains. Yes, it can change the crest factor of the signal, but that’s ok because digital audio is designed to have lots of headroom for the peaks above the normal signal level that you listen at. You’re not generating audio at 0 dBFS are you? Surely you know better than that. :D
If you want to participate in the Loudness War and harshly reduce the dynamic range of your mix until everything is at 11 all the time, use a locally-linear limiter, not this nonlinear distortion stuff.