OpenMCU: The True Story

Latest recommended article published on 2025-03-22 23:13:02

BillShow

Article link: https://blog.csdn.net/BillShow/article/details/115429

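This article recounts how the OpenMCU program came to be. Its audio mixing algorithm rests on two key ideas: converting every audio channel to PCM, and copying incoming audio into per-connection queues whose algebraic sum forms the outgoing audio. The approach has some advantages, but it also suffers from a rising base noise level, high algorithmic complexity and memory cost, and the lack of hardware timers. Video mixing was added later, and the code still has room for improvement.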

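(This is my first time translating an article. In a few places I am not entirely sure what the author meant; I have included the original text and given the translation I believe is correct. Corrections are welcome.)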

I sincerely hope this new year turns out better than the last one. I think it should, since it could hardly get any worse.

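There have been some discussions about OpenMCU on the mailing lists recently (the OpenH323 mailing lists, translator's note), so I think now is a good time to tell the story of how this program came about. I hope it helps people understand why the program exists, so that they can better judge whether it suits their needs.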

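Back in the "heady" days before OpenH323 became the worldwide success it is now (oh, sure), I had plenty of time to think about all kinds of interesting ideas and to write code to try them out. I remember waking up in the middle of the night during that period with a complete picture in my head of how to implement the audio mixing that an MCU (Multi Conference Unit) needs. Over the two days of the following weekend I coded it up in a near frenzy, and OpenMCU was born.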

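The OpenMCU audio mixing algorithm is very simple and rests on two ideas. The first key idea is that every audio channel must be converted to PCM so that the channels can be mixed with simple algebraic operations. I know that some audio formats such as G.723.1 or G.729 can be mixed without first converting them to PCM (I have seen some clever people add titles and watermarks to MPEG video streams in real time by manipulating the DCT coefficients directly, so mixing audio in the compressed domain ought to be a walk in the park), but that was well beyond what I intended to implement at the time.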
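
To make "simple algebraic operations" concrete, here is a minimal sketch of mixing PCM by sample-wise addition. It assumes 16-bit signed linear PCM and equal-length frames; the function name `mix_pcm` and the clipping step are illustrative choices, not taken from the OpenMCU source.

```python
# Assumption: samples are 16-bit signed linear PCM held in plain Python lists.
INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm(frames):
    """Sum equal-length PCM frames sample by sample, clipping to the int16 range."""
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)  # the "algebraic sum" of one sample position
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))  # avoid wrap-around
    return mixed
```

Clipping is only one way to handle overflow; the source does not say how OpenMCU dealt with sums that exceed the sample range.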

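The second key idea is that, within a conference, audio arriving on a connection is copied into a separate queue held for it by each of the other connections, and the outgoing audio for a connection is the algebraic sum of all of that connection's incoming queues. It sounds complicated but is actually very simple. In a conference with n connections, each connection has n-1 queues holding the incoming audio of the other connections. When audio arrives on connection x, it is copied to the front of the matching queue of every connection from 1 to n except x. When it is time to send audio on a connection, the packet is built by mixing the right amount of data taken from the back of each of that connection's queues.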
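
The queue layout described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the OpenMCU code: the class name `Conference`, its methods, and the use of one sample per queue entry are all hypothetical, and queues are modelled as FIFOs (append on one side, pop from the other), which matches "copy to the front, read off the back".

```python
from collections import deque


class Conference:
    """Toy model: every connection owns one FIFO per *other* connection."""

    def __init__(self):
        # listener id -> {speaker id -> deque of PCM samples}
        self.queues = {}

    def add_connection(self, conn_id):
        if conn_id in self.queues:
            raise ValueError("connection already present")
        # The new listener gets a queue for every existing speaker...
        self.queues[conn_id] = {other: deque() for other in self.queues}
        # ...and every existing listener gets a queue for the new speaker.
        for listener, per_speaker in self.queues.items():
            if listener != conn_id:
                per_speaker[conn_id] = deque()

    def write(self, speaker, samples):
        """Copy incoming audio from `speaker` to every connection except itself."""
        for listener, per_speaker in self.queues.items():
            if listener != speaker:
                per_speaker[speaker].extend(samples)

    def read(self, listener, count):
        """Build `count` outgoing samples for `listener` by summing its queues."""
        out = [0] * count
        for queue in self.queues[listener].values():
            for i in range(min(count, len(queue))):
                out[i] += queue.popleft()
        # Clip to the 16-bit range (an assumption about the sample format).
        return [max(-32768, min(32767, s)) for s in out]
```

With three connections `a`, `b` and `c`, audio written by `a` and `b` comes out summed when `c` reads, while `a` only ever hears `b` and `c`. Each listener drains its queues at its own pace, which is why the connections do not need to be synchronised, and the n·(n-1) queues are where the O(n^2) storage comes from.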

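The intent was to emulate how the analog mixers in telephone conference systems work. I ran some tests and the results were good. Several audio streams could be present at once (I tested by mixing a pre-recorded message with audio from a CD) and each stream could be heard clearly and separately, just like in a telephone conference call.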

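The approach also turned out to have several extra benefits. The connections in a conference are unsynchronised, which removes a lot of complexity: as long as audio arrives at the correct rate, everything just works. It also removes the local echo problem, because what each connection hears never contains a copy of its own signal. Finally, it lets the mixing algorithm be changed at will, because each channel holds a complete copy of all the source signals. That is a nice property: it meant I could try out many different mixing methods without rewriting large chunks of code.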

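When I built all of this into an H.323-based framework, I concentrated entirely on implementing the audio mixing algorithm and gave no thought at all to how efficiently mutexes and connections were used. It was a "straight line between two points" implementation. Looking at the code later, it is obvious that it could be done much better by keeping pointers to the connection objects in each connection, rather than looking up the connection token for each and every audio packet.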

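It soon became clear that the approach also has some fundamental problems. The base noise level rises with every connection added, because the background noise of each channel adds up and produces a loud "hiss" once there are more than about 4 connections. I tried scaling the amplitude of each channel in proportion to the number of channels, but all that did was lower the volume of every channel, giving an inaudible blur with more than about 5 connections. In the end, using silence suppression fixed most of the noise problem.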
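
As a rough illustration of why silence suppression helps, the sketch below leaves out of the mix any frame whose level falls under a threshold, so idle channels stop contributing background noise. It is a simple energy gate applied at the mixer, which is an assumption: the source does not say where or how silence suppression was done, and the names `rms`, `mix_with_silence_suppression` and the threshold of 200 are purely illustrative.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def mix_with_silence_suppression(frames, threshold=200.0):
    """Sum only the frames whose RMS level reaches `threshold`.

    Frames below the threshold are treated as silence and dropped, so their
    background noise never reaches the output.
    """
    if not frames:
        return []
    active = [frame for frame in frames if rms(frame) >= threshold]
    if not active:
        return [0] * len(frames[0])
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*active)]
```

With one talker and three idle channels carrying low-level noise, the output is just the talker's signal, whereas a plain sum would carry three channels' worth of added noise. Real voice activity detectors are considerably more careful than a fixed threshold.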

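The algorithmic complexity and memory cost with large numbers of connections are also obvious problems, since the algorithm is O(n^2). Using a single mixer for the whole conference would give an O(n) algorithm, but would also re-introduce local echo. I think the best solution is to use a number of partial sums, which would make the code more complicated but would probably give a complexity somewhere between the two. I never found the time to try it.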
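
For comparison, one well-known way to get a single conference-wide sum without local echo is often called "mix-minus": compute the total once, then subtract each listener's own contribution. This is not necessarily what the author meant by partial sums, and it is not from the OpenMCU source; the function name `mix_minus` is hypothetical. It also assumes that all connections supply a frame for the same time slot, which gives up the unsynchronised property described earlier.

```python
def mix_minus(frames):
    """Return, for each connection, the sum of everyone else's samples.

    `frames` maps a connection id to an equal-length list of PCM samples for
    one time slot. The total is computed once, so the work per slot grows
    linearly with the number of connections instead of quadratically.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    total = [0] * length
    for samples in frames.values():
        for i, sample in enumerate(samples):
            total[i] += sample
    # Subtract the listener's own signal first, then clip to 16 bits.
    return {
        conn: [max(-32768, min(32767, t - s)) for t, s in zip(total, samples)]
        for conn, samples in frames.items()
    }
```

The total is kept unclipped until after the subtraction; clipping it earlier would leave a residue of the listener's own voice in the result.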

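Another serious problem is that the code makes no use of hardware timers at all. This means audio quality drops sharply as the CPU approaches saturation, because the timing of the outgoing audio is lost.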
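
A hardware timer cannot be shown in a few lines of portable code, but the timing issue can be illustrated in software. The sketch below paces outgoing frames against absolute deadlines taken from a monotonic clock, so that time spent sending one frame does not push every later frame back. This is a substitute technique, not a hardware timer and not what OpenMCU does; `frame_deadlines`, `paced_send` and the 20 ms frame length are assumptions made for the example.

```python
import time


def frame_deadlines(start, frame_ms, count):
    """Absolute send times in seconds for `count` frames spaced `frame_ms` apart."""
    return [start + i * frame_ms / 1000.0 for i in range(count)]


def paced_send(send, frames, frame_ms=20, clock=time.monotonic, sleep=time.sleep):
    """Send each frame at its absolute deadline.

    Sleeping until a deadline, rather than sleeping a fixed interval after
    each send, keeps processing time from accumulating as drift.
    """
    start = clock()
    deadlines = frame_deadlines(start, frame_ms, len(frames))
    for deadline, frame in zip(deadlines, frames):
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
        send(frame)
```

This only removes cumulative drift. If the CPU is saturated the process can still be scheduled late, which is the failure the paragraph above describes.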

In any case, once I had the code working and had tried a few mixing algorithms, I published it under the name OpenMCU and promptly forgot about it.

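Some time later, Derek Smithies added video mixing and changed the audio mixing to provide "directed" audio. That algorithm gives priority to the loudest incoming audio signal, which becomes the only audio heard on all connections, and the video is switched to follow the selected audio source to show who is speaking. That seemed to work fine for Derek, but to be honest I never liked the idea. I always wanted the program to sound like a telephone conference call, which means several people can talk at once, right over the top of each other if need be, rather than "loudest gets all".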
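
The "loudest gets all" selection can be sketched as follows. This is a guess at the general shape of such an algorithm, not Derek Smithies' implementation: the function names and the use of mean-square energy as the loudness measure are assumptions.

```python
def loudest_source(frames):
    """Return the id of the connection with the highest mean-square energy.

    `frames` maps a connection id to a list of PCM samples for one time slot.
    Returns None when every frame is empty or completely silent.
    """
    best_id, best_energy = None, 0.0
    for conn_id, samples in frames.items():
        if not samples:
            continue
        energy = sum(s * s for s in samples) / len(samples)
        if energy > best_energy:
            best_id, best_energy = conn_id, energy
    return best_id


def directed_audio(frames):
    """Pick the loudest source and return (source id, audio sent to everyone)."""
    source = loudest_source(frames)
    if source is None:
        return None, []
    return source, list(frames[source])
```

The returned source id is what a video switcher would follow to show who is speaking. Compared with summing all channels, everything except the single selected signal is discarded, which is the trade-off the author objects to.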

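That is the whole true story. OpenMCU was never designed to be a "product"; it was just an idea put into code, and there is a lot of room for improvement. The code could be made much more efficient and more robust. Many people have reported that it is unstable on SMP machines, so I suspect a few race conditions remain. It would also be good to add support for the standard H.323 MCU commands that allow retrieving the members of each room, listing the available rooms, and managing attendees.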

I have no plans to do any of this myself in the near future, but if you need an MCU and are willing to pay, please contact me.

The original text follows:

I sincerely hope this new year is better than the last one, but then, I guess it has to be as it could scarcely be worse.

There have been a few discussions regarding OpenMCU on the lists recently, so I thought it might be a good time to retell the story of how that particular program came into existence. Hopefully this will help people understand the raison d'etre behind this particular piece of software and may assist them in better evaluating whether it suits their needs.

Back in the heady days before OpenH323 was the international success it is now (yeah right), I had plenty of time to think about, and to write code for, all sorts of interesting ideas. During that time, I remember waking up in the middle of the night with the idea of how to implement audio mixing as required by an MCU (Multi Conference Unit) fully formed in my head. The following weekend I implemented that idea in an orgy of coding that went on for two days, and OpenMCU was born.

The audio mixing algorithm for OpenMCU is very simple and is based on two concepts. The first key idea is that all of the audio paths need to be converted to/from PCM so they can be mixed using simple algebraic operations. While I know it is possible to mix audio data like G.723.1 or G.729 without first converting them to PCM (I've seen some really kinky stuff done with MPEG video streams whereby titles and watermarks are added to video streams in real-time by direct manipulation of the DCT coefficients, so audio mixing in the compressed domain has got to be a walk in the park), that is way beyond the scope of what I intended to implement at the time.

The second key idea is that the incoming audio for each connection is copied into a separate queue for each of the other connections in a conference as it arrives, and the outgoing audio for each connection is created from the algebraic sum of the incoming queues for that connection. That sounds complex, but it is really simple: in a conference with n connections, each connection has n-1 queues that contain a copy of the incoming audio from each of the other connections. When an incoming packet of audio arrives on connection x, it is copied into the front of the correct queue for every connection 1 through connection n, but not connection x. When it is time to send a packet of audio on a connection x, it is created by mixing together the correct amount of the data off the back of each of the queues for that connection.

The intent was to emulate how the analog mixers in telephone conference systems work. I did some tests and it worked well - multiple audio streams could be present at the same time (I tested by mixing a pre-recorded message and audio from a CD) and each stream could be heard distinctly separate - just like a telephone conference call.

It also turned out that this approach has several additional nice properties. The connections associated with a conference are unsynchronised which removes a lot of complexity. As long as the audio data arrives at the correct rate then everything just works. It also removed the problem of local echo, as the incoming audio for each connection does not contain a copy of its own outgoing signal. And finally, it allowed the mixing algorithm to be changed at will because each channel has a complete copy of all of the source signals. This was a nice feature as it meant I could experiment with lots of different mixing methods without rewriting great chunks of code.

When I combined all of this into an H.323-based infrastructure, I was completely focussed on the implementation of the audio mixing algorithm. Issues of efficiency in the use of mutexes or connections were not even considered - it was a "straight line between two points" implementation. Looking at the code afterwards, it's pretty obvious it could be done much better by keeping pointers to the connection objects in each connection rather than doing a lookup of the connection token for each and every audio packet.

It became quickly apparent that the approach had several fundamental problems. The base noise level was raised with each additional connection, as multiple background noise signals were added together thus creating a loud "hiss" with more than about 4 connections. I experimented with scaling the channel amplitude in proportion to the number of channels, but all that did was decrease the volume of each channel resulting in an inaudible blur with more than about 5 connections! In the end, the use of silence suppression fixed most of the noise problem.

The issue of algorithmic complexity and storage at large numbers of connections was fairly apparent, as the algorithm being used is O(n^2). Using a single mixer for the entire conference would result in an O(n) algorithm, but would also re-introduce local echo. I think the best solution for this is to use a number of partial sums which would make the code more complicated, but would probably give an algorithmic complexity somewhere between these two. I never got around to trying this.

Another serious problem is that the code operates completely without use of hardware timers. This means that audio quality will drop dramatically when the CPU reaches saturation, as the timing of the outgoing audio is lost.

In any case, once I had the code working and I had played with some mixing algorithms I published the code as it stood under the name OpenMCU and promptly forgot about it.

Some time afterwards, Derek Smithies added video mixing and changed the audio mixing to provide "directed" audio. This algorithm gives preference to the loudest incoming audio signal, which becomes the only audio heard by all connections, and the code also switches the video to come from the selected audio source as well to give an indication of who is speaking. That seemed to work OK for Derek, but I have to be honest and say that I never really liked that idea. I always wanted something that sounded like a telephone conference call, and that means the ability for multiple people to talk all at once, right over the top of each other if need be, rather than "loudest gets all".

That's really all there is to it. OpenMCU was never really intended to be a "product" - it was just an idea that was put into code and there is a lot of scope for improvement. The code could be made much more efficient and robust - there have been many people who have reported instability on SMP machines, so I suspect there are a few race conditions outstanding. It would also be good to add support for the standard H.323 MCU commands that allow retrieval of members of each room, lists of available rooms and to do full attendee management.

I've no plans to do any of this myself in the near future, but if someone needs an MCU and is willing to pay then please contact me.