Overview
Audio/video synchronization means playing audio and video at a coordinated, relatively smooth pace, so that neither the audio nor the video runs ahead of the other. In WebRTC, synchronization happens before decoding: it is achieved by separately adjusting when audio and video start to decode.
1. Basic Approach
The RTP timestamps of the latest received audio and video packets are converted to NTP timestamps. Combined with the current audio and video delays, these feed two computations: the relative delay and the target delay (the algorithms are analyzed in sections 4 and 5). The results are the target delays target_audio_delay_ms and target_video_delay_ms, which determine how long to wait before starting to decode the audio and video data. This synchronization runs periodically on the ModuleProcessThread; the relevant source is in video/rtp_streams_synchronizer.cc and video/stream_synchronization.cc.
2. Timestamps in RTP Packets
2.1 Audio Timestamps
The unit of an audio timestamp is the reciprocal of the sample rate. For example, at a sample rate of 48000 there are 48000 samples per second, so one sample lasts 1000 / 48000 = 1/48 ms. An audio RTP packet typically carries 20 ms of audio, i.e. 20 / (1/48) = 960 samples, so the first RTP timestamp is 0, the second 960, the third 1920, and so on. When WebRTC writes the audio RTP timestamp it adds a start offset, so the timestamp in the packet is T = t + Tstart, where t is 0, 960, 1920, ..., and Tstart is a random value generated when the rtpRtcp module starts; it is initialized once when the session is established and never changes afterwards.
2.2 Video Timestamps
The timestamp of a video RTP packet is T = Tcapture + Tstart, where Tcapture is the NTP capture time of the frame in milliseconds, truncated to the low 32 bits and multiplied by 90 (the 90 kHz video clock), and Tstart has the same meaning as for audio.
// From OnFrame in video/video_stream_encoder.cc
// Convert NTP time, in ms, to RTP timestamp.
const int kMsToRtpTimestamp = 90;
incoming_frame.set_timestamp(
    kMsToRtpTimestamp * static_cast<uint32_t>(incoming_frame.ntp_time_ms()));
3. Converting RTP Timestamps to NTP Timestamps
Audio and video RTP timestamps use different bases, so the receiver cannot compare them directly; for comparison they must be converted to NTP timestamps. The conversion is linear: Tn = k*Tr + b. The parameters k and b are updated whenever an SR message arrives from the sender, using the latest 20 SR measurements:
// Source: system_wrappers/source/rtp_to_ntp_estimator.cc
void RtpToNtpEstimator::UpdateParameters() {
  if (measurements_.size() < 2)
    return;
  std::vector<double> x;
  std::vector<double> y;
  x.reserve(measurements_.size());
  y.reserve(measurements_.size());
  for (auto it = measurements_.begin(); it != measurements_.end(); ++it) {
    x.push_back(it->unwrapped_rtp_timestamp);
    y.push_back(it->ntp_time.ToMs());
  }
  double slope, offset;
  if (!LinearRegression(x, y, &slope, &offset)) {
    return;
  }
  params_.emplace(1 / slope, offset);
}
// Given x[] and y[] writes out such k and b that line y=k*x+b approximates
// given points in the best way (Least Squares Method).
bool LinearRegression(rtc::ArrayView<const double> x,
                      rtc::ArrayView<const double> y,
                      double* k,
                      double* b) {
  size_t n = x.size();
  if (n < 2)
    return false;
  if (y.size() != n)
    return false;
  double avg_x = 0;
  double avg_y = 0;
  for (size_t i = 0; i < n; ++i) {
    avg_x += x[i];
    avg_y += y[i];
  }
  avg_x /= n;
  avg_y /= n;
  double variance_x = 0;
  double covariance_xy = 0;
  for (size_t i = 0; i < n; ++i) {
    double normalized_x = x[i] - avg_x;
    double normalized_y = y[i] - avg_y;
    variance_x += normalized_x * normalized_x;
    covariance_xy += normalized_x * normalized_y;
  }
  if (std::fabs(variance_x) < 1e-8)
    return false;
  *k = static_cast<double>(covariance_xy / variance_x);
  *b = static_cast<double>(avg_y - (*k) * avg_x);
  return true;
}
The sender periodically sends RTCP SR messages, each of which contains a paired RTP timestamp and NTP timestamp.
4. Computing the Relative Audio/Video Delay
bool StreamSynchronization::ComputeRelativeDelay(
    const Measurements& audio_measurement,
    const Measurements& video_measurement,
    int* relative_delay_ms) {
  int64_t audio_last_capture_time_ms;
  // Members of Measurements:
  // latest_receive_time_ms is the local time when the latest RTP packet arrived;
  // latest_timestamp is the RTP timestamp of that packet.
  // Both are updated in OnRtpPacket in video/rtp_video_stream_receiver.cc.
  if (!audio_measurement.rtp_to_ntp.Estimate(audio_measurement.latest_timestamp,
                                             &audio_last_capture_time_ms)) {
    return false;
  }
  int64_t video_last_capture_time_ms;
  if (!video_measurement.rtp_to_ntp.Estimate(video_measurement.latest_timestamp,
                                             &video_last_capture_time_ms)) {
    return false;
  }
  if (video_last_capture_time_ms < 0) {
    return false;
  }
  // Positive diff means that video_measurement is behind audio_measurement.
  // video_measurement.latest_receive_time_ms - audio_measurement.latest_receive_time_ms
  // is the gap between the receive times of the latest video and audio packets;
  // video_last_capture_time_ms - audio_last_capture_time_ms is the gap between
  // their capture times at the sender. relative_delay_ms is therefore how much
  // that gap changed between capture and reception, a change introduced by
  // network transmission and data processing. It is used later to compute the
  // audio/video target delays.
  *relative_delay_ms =
      video_measurement.latest_receive_time_ms -
      audio_measurement.latest_receive_time_ms -
      (video_last_capture_time_ms - audio_last_capture_time_ms);
  // Fail if the change exceeds 10 seconds (kMaxDeltaDelayMs).
  if (*relative_delay_ms > kMaxDeltaDelayMs ||
      *relative_delay_ms < -kMaxDeltaDelayMs) {
    return false;
  }
  return true;
}
ComputeRelativeDelay converts the latest received audio and video RTP timestamps into NTP timestamps, audio_last_capture_time_ms and video_last_capture_time_ms respectively; see the comments in the code above for the details.
5. Computing the Target Delay
bool StreamSynchronization::ComputeDelays(int relative_delay_ms,
                                          int current_audio_delay_ms,
                                          int* total_audio_delay_target_ms,
                                          int* total_video_delay_target_ms) {
  // relative_delay_ms comes from ComputeRelativeDelay.
  int current_video_delay_ms = *total_video_delay_target_ms;
  RTC_LOG(LS_VERBOSE) << "Audio delay: " << current_audio_delay_ms
                      << " current diff: " << relative_delay_ms
                      << " for stream " << audio_stream_id_;
  // Calculate the difference between the lowest possible video delay and the
  // current audio delay.
  // relative_delay_ms must be added in as well.
  int current_diff_ms =
      current_video_delay_ms - current_audio_delay_ms + relative_delay_ms;
  // kFilterLength is 4 and kMinDeltaMs is 30: average the delay difference
  // over four rounds; if the average is below 30 ms, the target delays need
  // no adjustment and we return immediately.
  avg_diff_ms_ =
      ((kFilterLength - 1) * avg_diff_ms_ + current_diff_ms) / kFilterLength;
  if (abs(avg_diff_ms_) < kMinDeltaMs) {
    // Don't adjust if the diff is within our margin.
    return false;
  }
  // Make sure we don't move too fast.
  int diff_ms = avg_diff_ms_ / 2;
  diff_ms = std::min(diff_ms, kMaxChangeMs);
  diff_ms = std::max(diff_ms, -kMaxChangeMs);
  // Reset the average after a move to prevent overshooting reaction.
  avg_diff_ms_ = 0;
  // base_target_delay_ms_ starts at 0; it can be changed through
  // SetTargetBufferingDelay, but a search of the source tree finds no caller.
  if (diff_ms > 0) {
    // The minimum video delay is longer than the current audio delay.
    // We need to decrease extra video delay, or add extra audio delay.
    if (video_delay_.extra_ms > base_target_delay_ms_) {
      // We have extra delay added to ViE. Reduce this delay before adding
      // extra delay to VoE.
      // Audio is ahead of video: reduce the video delay.
      video_delay_.extra_ms -= diff_ms;
      audio_delay_.extra_ms = base_target_delay_ms_;
    } else {  // video_delay_.extra_ms <= base_target_delay_ms_
      // We have no extra video delay to remove, increase the audio delay.
      // Audio is ahead of video: increase the audio delay.
      audio_delay_.extra_ms += diff_ms;
      video_delay_.extra_ms = base_target_delay_ms_;
    }
  } else {  // if (diff_ms > 0)
    // The video delay is lower than the current audio delay.
    // We need to decrease extra audio delay, or add extra video delay.
    if (audio_delay_.extra_ms > base_target_delay_ms_) {
      // We have extra delay in VoiceEngine.
      // Start with decreasing the voice delay.
      // Note: diff_ms is negative; add the negative difference.
      // Video is ahead of audio: reduce the audio delay.
      audio_delay_.extra_ms += diff_ms;
      video_delay_.extra_ms = base_target_delay_ms_;
    } else {  // audio_delay_.extra_ms <= base_target_delay_ms_
      // We have no extra delay in VoiceEngine, increase the video delay.
      // Note: diff_ms is negative; subtract the negative difference.
      // Video is ahead of audio: increase the video delay.
      video_delay_.extra_ms -= diff_ms;  // X - (-Y) = X + Y.
      audio_delay_.extra_ms = base_target_delay_ms_;
    }
  }
  // Make sure that video is never below our target.
  video_delay_.extra_ms =
      std::max(video_delay_.extra_ms, base_target_delay_ms_);
  int new_video_delay_ms;
  if (video_delay_.extra_ms > base_target_delay_ms_) {
    new_video_delay_ms = video_delay_.extra_ms;
  } else {
    // No change to the extra video delay. We are changing audio and we only
    // allow to change one at the time.
    new_video_delay_ms = video_delay_.last_ms;
  }
  // Make sure that we don't go below the extra video delay.
  new_video_delay_ms = std::max(new_video_delay_ms, video_delay_.extra_ms);
  // Verify we don't go above the maximum allowed video delay.
  new_video_delay_ms =
      std::min(new_video_delay_ms, base_target_delay_ms_ + kMaxDeltaDelayMs);
  int new_audio_delay_ms;
  if (audio_delay_.extra_ms > base_target_delay_ms_) {
    new_audio_delay_ms = audio_delay_.extra_ms;
  } else {
    // No change to the audio delay. We are changing video and we only allow to
    // change one at the time.
    new_audio_delay_ms = audio_delay_.last_ms;
  }
  // Make sure that we don't go below the extra audio delay.
  new_audio_delay_ms = std::max(new_audio_delay_ms, audio_delay_.extra_ms);
  // Verify we don't go above the maximum allowed audio delay.
  new_audio_delay_ms =
      std::min(new_audio_delay_ms, base_target_delay_ms_ + kMaxDeltaDelayMs);
  video_delay_.last_ms = new_video_delay_ms;
  audio_delay_.last_ms = new_audio_delay_ms;
  RTC_LOG(LS_VERBOSE) << "Sync video delay " << new_video_delay_ms
                      << " for video stream " << video_stream_id_
                      << " and audio delay " << audio_delay_.extra_ms
                      << " for audio stream " << audio_stream_id_;
  *total_video_delay_target_ms = new_video_delay_ms;
  *total_audio_delay_target_ms = new_audio_delay_ms;
  return true;
}
Google's code is pleasant to read; the comments are clear enough that they explain the whole control flow on their own. One note on the total_video_delay_target_ms parameter: the value passed in becomes current_video_delay_ms, and it originates from the code below. As it shows, the target delay accounts for the jitter delay, the decode time, and the render time; the decode time is the 95th percentile of the times taken to decode video frames over the last 10 seconds, and render_delay_ms_ defaults to 10 ms:
int VCMTiming::TargetDelayInternal() const {
  return std::max(min_playout_delay_ms_,
                  jitter_delay_ms_ + RequiredDecodeTimeMs() + render_delay_ms_);
}
6. Decoding According to the Target Delay
The audio and video delays computed by ComputeDelays determine how long to wait before the next decode starts. Taking video as an example, total_video_delay_target_ms is set on the timing_ member of the VideoReceiveStream class; timing_ is also used by frame_buffer2.cc and video_receiver2.cc. See StartWaitForNextFrameOnQueue in frame_buffer2.cc, which starts video decoding according to this delay. Quite a few classes and variables are involved, and describing all the relationships in words is hard; reading the code remains the most accurate reference.
7. Summary
WebRTC's audio/video synchronization takes the RTP timestamps of the latest received audio and video packets, converts both to NTP timestamps, and from them computes the relative delay relative_delay_ms. It then combines the current audio delay current_audio_delay_ms and the current video delay current_video_delay_ms with relative_delay_ms: the delay difference is current_diff_ms = current_video_delay_ms - current_audio_delay_ms + relative_delay_ms (averaged over four rounds so that playback stays smooth). If the averaged difference is within 30 ms, no adjustment is needed; if current_diff_ms is positive, audio is ahead and video behind, so the audio delay is increased or the video delay decreased; if it is negative, audio is behind and video ahead, so the audio delay is decreased or the video delay increased. The values of current_audio_delay_ms and current_video_delay_ms themselves are determined by weighing the network jitter delay, the decode delay, the render delay, and the computation above.