I sincerely hope this new year is better than the last one, but then, I guess it has to be as it could scarcely be worse.
There have been a few discussions regarding OpenMCU on the lists recently, so I thought it might be a good time to retell the story of how that particular program came into existence. Hopefully this will help people understand the raison d'être behind this particular piece of software and may assist them in better evaluating whether it suits their needs.
Back in the heady days before OpenH323 was the international success it is now (yeah right), I had plenty of time to think about, and to write code for, all sorts of interesting ideas. During that time, I remember waking up in the middle of the night with the idea of how to implement audio mixing as required by an MCU (Multi Conference Unit) fully formed in my head. The following weekend I implemented that idea in an orgy of coding that went on for two days, and OpenMCU was born.
The audio mixing algorithm for OpenMCU is very simple and is based on two concepts. The first key idea is that all of the audio paths need to be converted to/from PCM so they can be mixed using simple algebraic operations. While I know it is possible to mix audio data like G.723.1 or G.729 without first converting them to PCM (I've seen some really kinky stuff done with MPEG video streams whereby titles and watermarks are added to video streams in real-time by direct manipulation of the DCT coefficients, so audio mixing in the compressed domain has got to be a walk in the park), that is way beyond the scope of what I intended to implement at the time.
The second key idea is that the incoming audio for each connection is copied into a separate queue for each of the other connections in a conference as it arrives, and the outgoing audio for each connection is created from the algebraic sum of the incoming queues for that connection. That sounds complex, but it is really simple: in a conference with n connections, each connection has n-1 queues that contain a copy of the incoming audio from each of the other connections. When an incoming packet of audio arrives on connection x, it is copied into the front of the correct queue for every connection 1 through connection n, except connection x itself. When it is time to send a packet of audio on a connection x, it is created by mixing together the correct amount of the data off the back of each of the queues for that connection.
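The queue-per-connection scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the actual OpenMCU code (which is C++): it assumes 16-bit signed PCM samples, and the `Conference` class, its method names, and the fixed frame handling are all inventions for the example.

```python
from collections import deque

class Conference:
    """Toy model of the n-1 queues-per-connection mixing scheme."""

    def __init__(self, n):
        # queues[x][y] is a FIFO of audio packets arriving from
        # connection y that are destined for connection x (y != x).
        self.queues = [{y: deque() for y in range(n) if y != x}
                       for x in range(n)]

    def on_incoming(self, x, samples):
        # Fan out: copy the packet from connection x into every
        # other connection's queue for x.
        for dest, qs in enumerate(self.queues):
            if dest != x:
                qs[x].append(list(samples))

    def next_outgoing(self, x, frame_len):
        # Mix one outgoing frame for connection x as the algebraic
        # sum over the queues for that connection.
        out = [0] * frame_len
        for q in self.queues[x].values():
            if q:
                packet = q.popleft()
                for i in range(min(frame_len, len(packet))):
                    out[i] += packet[i]
        # Clamp to the 16-bit PCM range so the sum cannot wrap.
        return [max(-32768, min(32767, s)) for s in out]
```

Note how the local-echo property mentioned below falls out for free: connection x has no queue for itself, so its own signal never appears in its output.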
The intent was to emulate how the analog mixers in telephone conference systems work. I did some tests and it worked well - multiple audio streams could be present at the same time (I tested by mixing a pre-recorded message and audio from a CD) and each stream could be heard distinctly and separately - just like a telephone conference call.
It also turned out that this approach has several additional nice properties. The connections associated with a conference are unsynchronised, which removes a lot of complexity. As long as the audio data arrives at the correct rate then everything just works. It also removed the problem of local echo, as the incoming audio for each connection does not contain a copy of its own outgoing signal. And finally, it allowed the mixing algorithm to be changed at will because each channel has a complete copy of all of the source signals. This was a nice feature as it meant I could experiment with lots of different mixing methods without rewriting great chunks of code.
When I combined all of this into an H.323-based infrastructure, I was completely focussed on the implementation of the audio mixing algorithm. Issues of efficiency in the use of mutexes or connections were not even considered - it was a "straight line between two points" implementation. Looking at the code afterwards, it's pretty obvious it could be done much better by keeping pointers to the connection objects in each connection rather than doing a lookup of the connection token for each and every audio packet.
It became quickly apparent that the approach had several fundamental problems. The base noise level was raised with each additional connection, as multiple background noise signals were added together, creating a loud "hiss" with more than about 4 connections. I experimented with scaling the channel amplitude in proportion to the number of channels, but all that did was decrease the volume of each channel, resulting in an inaudible blur with more than about 5 connections! In the end, the use of silence suppression fixed most of the noise problem.
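Silence suppression helps here because frames from idle connections are dropped before they ever reach the sum, so their background noise cannot accumulate. A minimal sketch, assuming 16-bit PCM frames and an arbitrary energy threshold of my own choosing (real silence detectors are considerably more sophisticated):

```python
def is_silent(frame, threshold=50):
    # Crude energy gate: mean absolute amplitude below the
    # threshold is treated as silence. The threshold value is
    # an invented example, not anything from OpenMCU.
    return sum(abs(s) for s in frame) / len(frame) < threshold

def mix(frames):
    # Sum only the non-silent frames, clamped to 16-bit PCM.
    active = [f for f in frames if not is_silent(f)]
    if not active:
        return [0] * len(frames[0])
    out = [sum(samples) for samples in zip(*active)]
    return [max(-32768, min(32767, s)) for s in out]
```

With four idle connections and one speaker, the output is just the speaker's signal rather than the speaker plus four layers of hiss.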
The issue of algorithmic complexity and storage at large numbers of connections was fairly apparent, as the algorithm being used is O(n^2). Using a single mixer for the entire conference would result in an O(n) algorithm, but would also re-introduce local echo. I think the best solution for this is to use a number of partial sums, which would make the code more complicated, but would probably give an algorithmic complexity somewhere between these two. I never got around to trying this.
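For illustration, one well-known variant of the single-mixer approach computes the conference-wide sum once per frame and then subtracts each connection's own contribution, which keeps the per-frame cost linear while still avoiding local echo. This is not what OpenMCU does, and it is a different trade-off from the partial-sums idea above (it needs all contributions for the frame before any output can be produced); it is sketched here purely as a point of comparison:

```python
def mix_linear(frames):
    # frames[i] is the current 16-bit PCM frame from connection i.
    # One pass builds the shared sum (O(n)); each output is then
    # the shared sum minus that connection's own signal.
    total = [sum(samples) for samples in zip(*frames)]
    outputs = []
    for frame in frames:
        out = [t - s for t, s in zip(total, frame)]
        outputs.append([max(-32768, min(32767, s)) for s in out])
    return outputs
```
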
Another serious problem is that the code operates completely without the use of hardware timers. This means that audio quality will drop dramatically when the CPU reaches saturation, as the timing of the outgoing audio is lost.
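To make the timing problem concrete, the usual fix is to pace each outgoing frame against a monotonic clock rather than relying on the CPU simply keeping up. A hypothetical sketch (the `pace_frames` function and its parameters are inventions for this example, not OpenMCU code):

```python
import time

def pace_frames(n_frames, frame_duration, emit):
    # Each frame k is due at start + k * frame_duration. Sleeping
    # until the deadline keeps the output rate steady even if some
    # frames take longer to prepare than others; if the process
    # falls behind, the drift is at least measurable here rather
    # than silently corrupting the output timing.
    start = time.monotonic()
    for k in range(n_frames):
        deadline = start + k * frame_duration
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        emit(k)
```
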
In any case, once I had the code working and I had played with some mixing algorithms I published the code as it stood under the name OpenMCU and promptly forgot about it.
Some time afterwards, Derek Smithies added video mixing and changed the audio mixing to provide "directed" audio. This algorithm gives preference to the loudest incoming audio signal, which becomes the only audio heard by all connections, and the code switches the video to come from the selected audio source as well, to give an indication of who is speaking. That seemed to work OK for Derek, but I have to be honest and say that I never really liked that idea. I always wanted something that sounded like a telephone conference call, and that means the ability for multiple people to talk all at once, right over the top of each other if need be, rather than "loudest gets all".
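The contrast between the two approaches is easy to state in code. A hypothetical sketch of the "directed" selection, using mean absolute amplitude as a stand-in for loudness (the actual loudness measure in Derek's code may well differ):

```python
def loudest(frames):
    # "Directed" audio: forward only the single loudest frame to
    # everyone, instead of summing all frames together.
    return max(frames, key=lambda f: sum(abs(s) for s in f) / len(f))
```

Everyone then hears `loudest(frames)`, whereas the original mixer sums every non-silent frame, so overlapping speakers remain audible.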
That's really all there is to it. OpenMCU was never really intended to be a "product" - it was just an idea that was put into code, and there is a lot of scope for improvement. The code could be made much more efficient and robust - there have been many people who have reported instability on SMP machines, so I suspect there are a few race conditions outstanding. It would also be good to add support for the standard H.323 MCU commands that allow retrieval of the members of each room, lists of available rooms and full attendee management.
I've no plans to do any of this myself in the near future, but if someone needs an MCU and is willing to pay then please contact me.