WebRTC SDP Explained in Detail (Repost)

Latest recommended article published on 2024-10-09 17:46:59

zytry

Original link: https://blog.csdn.net/chinabinlang/article/details/79151589

http://blog.csdn.net/onlycoder_net/article/details/76702432

v=0

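// SDP version number; always 0, as specified by RFC 4566

o=- 7017624586836067756 2 IN IP4 127.0.0.1

// RFC 4566: o=<username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address>

// When there is no username, "-" is used instead. 7017624586836067756 is the ID of the whole session and 2 is the session version.

// If the session is modified along the way (for example, a codec change) and the SDP is regenerated, sess-id stays the same and sess-version is incremented by 1.

s=-

// Session name; "-" is used when there is none

t=0 0

// The two values are the session start and stop times; both are 0 here, meaning the session is unbounded

a=group:BUNDLE audio video data

// The media sections that share a single transport. Without this line, audio, video and data would each be sent over their own UDP port

a=msid-semantic: WMS h1aZ20mbQB0GSsq0YxLfJmiYWE9CBfGch97C

// WMS stands for WebRTC Media Stream. This line declares that the client supports sending several streams at once, and each stream can contain several tracks.

// When it is present, the a=ssrc lines further down usually carry msid, mslabel and similar attributes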

m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 0 8 106 105 13 126

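// m=audio says the session contains audio. 9 is nominally the audio port, but WebRTC does not actually use it (the real transport addresses come from ICE). A port of 0 means the audio

// stream is not transmitted. UDP/TLS/RTP/SAVPF is the transport protocol for the audio: RTP packets are carried over UDP, with keys negotiated through DTLS.

// SAVPF is the secure (SRTP) audio/video profile with RTCP-based feedback. The trailing 111 103 104 9 0 8 106 105 13 126 are the audio payload types supported in this session; the lines below describe them in detail

c=IN IP4 0.0.0.0

// The IP address used to send or receive audio. WebRTC transports media over ICE, so this address is not used

a=rtcp:9 IN IP4 0.0.0.0

// The address and port for RTCP; not used by WebRTC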

a=ice-ufrag:khLS

a=ice-pwd:cxLzteJaJBou3DspNaPsJhlQ

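// The two lines above are the credentials used for authentication during ICE negotiation

a=fingerprint:sha-256 FA:14:42:3B:C7:97:1B:E8:AE:0C2:71:03:05:05:16:8F:B9:C7:98:E9:60:43:4B:5B:2C:28:EE:5C:8F3:17

// The line above is the certificate fingerprint needed for authentication during the DTLS handshake

a=setup:actpass

// The line above means this client can act as either the DTLS client or the DTLS server; see RFC 4145 and RFC 4572

a=mid:audio

// The media identifier referenced by the BUNDLE line earlier

a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level

// The line above says the audio level will be added to the RTP header; see RFC 6464

a=sendrecv

// The line above declares two-way media; the other directions are recvonly, sendonly and inactive

a=rtcp-mux

// The line above says RTP and RTCP packets share the same port

// The following lines elaborate on the payload types in the m=audio line: payload type number, clock rate, channel count and so on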

a=rtpmap:111 opus/48000/2

a=rtcp-fb:111 transport-cc

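// The line above says Opus supports RTCP-driven congestion control (transport-wide congestion control); see https://tools.ietf.org/html/draft-holmer-rmcat-transport-wide-cc-extensions-01

a=fmtp:111 minptime=10;useinbandfec=1

// Optional parameters for Opus: minptime is the minimum packetization time of 10 ms, and useinbandfec=1 enables Opus's built-in FEC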

a=rtpmap:103 ISAC/16000

a=rtpmap:104 ISAC/32000

a=rtpmap:9 G722/8000

a=rtpmap:0 PCMU/8000

a=rtpmap:8 PCMA/8000

a=rtpmap:106 CN/32000

a=rtpmap:105 CN/16000

a=rtpmap:13 CN/8000

a=rtpmap:126 telephone-event/8000

a=ssrc:18509423 cname:sTjtznXLCNH7nbRw

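// cname identifies a media source. The ssrc may change when a collision occurs, but the cname does not. It also appears in RTCP SDES packets

// and is used for audio/video synchronization

a=ssrc:18509423 msid:h1aZ20mbQB0GSsq0YxLfJmiYWE9CBfGch97C 15598a91-caf9-4fff-a28f-3082310b2b7a

// The line above ties the ssrc to the WebRTC MediaStream and AudioTrack: the first value after msid is the stream ID and the second is the track ID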

a=ssrc:18509423 mslabel:h1aZ20mbQB0GSsq0YxLfJmiYWE9CBfGch97C

a=ssrc:18509423 label:15598a91-caf9-4fff-a28f-3082310b2b7a

m=video 9 UDP/TLS/RTP/SAVPF 100 101 107 116 117 96 97 99 98

// Same meaning as the m=audio line above

c=IN IP4 0.0.0.0

a=rtcp:9 IN IP4 0.0.0.0

a=ice-ufrag:khLS

a=ice-pwd:cxLzteJaJBou3DspNaPsJhlQ

a=fingerprint:sha-256 FA:14:42:3B:C7:97:1B:E8:AE:0C2:71:03:05:05:16:8F:B9:C7:98:E9:60:43:4B:5B:2C:28:EE:5C:8F3:17

a=setup:actpass

a=mid:data

a=sctpmap:5000 webrtc-datachannel 1024

A reposted article:

A Brief Analysis of the SDP Protocol

SDP—Session Description Protocol

The Session Description Protocol, defined by RFC 2327 [1], was developed by the IETF MMUSIC working group. It is more of a description syntax than a protocol in that it does not provide a full-range media negotiation capability. The original purpose of SDP was to describe multicast sessions set up over the Internet's multicast backbone, the MBONE. The first application of SDP was by the experimental Session Announcement Protocol (SAP) [2] used to post and retrieve announcements of MBONE sessions. SAP messages carry a SDP message body, and was the template for SIP's use of SDP. Even though it was designed for multicast, SDP has been applied to the more general problem of describing general multimedia sessions established using SIP.

As seen in the examples of Chapter 3, SDP contains the following information about the media session:

IP Address (IPv4 address or host name);
Port number (used by UDP or TCP for transport);
Media type (audio, video, interactive whiteboard, and so forth);
Media encoding scheme (PCM A-Law, MPEG II video, and so forth).

In addition, SDP contains information about the following:

Subject of the session;
Start and stop times;
Contact information about the session.

Like SIP, SDP uses text coding. An SDP message is composed of a series of lines, called fields, whose names are abbreviated by a single lower-case letter, and are in a required order to simplify parsing. The set of mandatory SDP fields is shown in Table 2.1. The complete set is shown in Table 7.1.

Table 7.1: SDP Field List in Their Required Order
Field	Name	Mandatory/Optional
v=	Protocol version number	m
o=	Owner/creator and session identifier	m
s=	Session name	m
i=	Session information	o
u=	Uniform Resource Identifier	o
e=	Email address	o
p=	Phone number	o
c=	Connection information	m
b=	Bandwidth information	o
t=	Time session starts and stops	m
r=	Repeat times	o
z=	Time zone corrections	o
k=	Encryption key	o
a=	Attribute lines	o
m=	Media information	o
a=	Media attributes	o

SDP was not designed to be easily extensible, and parsing rules are strict. The only way to extend or add new capabilities to SDP is to define a new attribute type. However, unknown attribute types can be silently ignored. A SDP parser must not ignore an unknown field, a missing mandatory field, or an out-of-sequence line. An example SDP message containing many of the optional fields is shown here:

v=0
o=johnston 2890844526 2890844526 IN IP4 43.32.1.5
s=SIP Tutorial
i=This broadcast will cover this new IETF protocol
u=http://www.digitalari.com/sip
e=Alan Johnston alan@mci.com
p=+1-314-555-3333 (Daytime Only)
c=IN IP4 225.45.3.56/236
b=CT:144
t=2877631875 2879633673
m=audio 49172 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 23422 RTP/AVP 31
a=rtpmap:31 H261/90000

The general form of a SDP message is:

     x=parameter1 parameter2 ... parameterN

The line begins with a single lower-case letter x. There are never any spaces between the letter and the =, and there is exactly one space between each parameter. Each field has a defined number of parameters. Each line ends with a CRLF. The individual fields will now be discussed in detail.
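As a rough illustration of this line format, the sketch below splits an SDP message into (field letter, value) pairs. The function name `parse_sdp_lines` is made up for this example, and the checks are deliberately minimal: it does not validate field order or per-field parameter counts.

```python
from typing import List, Tuple


def parse_sdp_lines(sdp: str) -> List[Tuple[str, str]]:
    """Split an SDP message into (field letter, value) pairs."""
    fields = []
    for raw in sdp.splitlines():  # splitlines() also handles CRLF endings
        if not raw:
            continue
        # A field is one lower-case letter followed directly by "=".
        if len(raw) < 2 or raw[1] != "=" or not raw[0].islower():
            raise ValueError(f"not an SDP line: {raw!r}")
        fields.append((raw[0], raw[2:]))
    return fields


example = (
    "v=0\r\n"
    "o=johnston 2890844526 2890844526 IN IP4 43.32.1.5\r\n"
    "s=SIP Tutorial\r\n"
    "t=2877631875 2879633673\r\n"
    "m=audio 49172 RTP/AVP 0\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
)
fields = parse_sdp_lines(example)
```

Keeping the value as an unsplit string is a deliberate choice here, since some fields (such as s=) may contain spaces of their own.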

7.1.1 Protocol Version

The v= field contains the SDP version number. Because the current version of SDP is 0, a valid SDP message will always begin with v=0.

7.1.2 Origin

The o= field contains information about the originator of the session and session identifiers. This field is used to uniquely identify the session. The field contains:

o=username session-id version network-type address-type address

The username parameter contains the originator's login or host or - if none. The session-id parameter is a Network Time Protocol (NTP) [3] timestamp or a random number used to ensure uniqueness. The version is a numeric field that is increased for each change to the session, also recommended to be a NTP timestamp. The network-type is always IN for Internet. The address-type parameter is either IP4 or IP6 for IPv4 or IPv6 address either in dotted decimal form or a fully qualified host name.
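A minimal sketch of reading the six origin parameters, assuming they are separated by single spaces as described above. The `Origin` type and `parse_origin` function are illustrative names, not part of any library.

```python
from typing import NamedTuple


class Origin(NamedTuple):
    username: str
    session_id: str
    version: str
    network_type: str
    address_type: str
    address: str


def parse_origin(line: str) -> Origin:
    """Parse an o= line into its six parameters."""
    if not line.startswith("o="):
        raise ValueError("not an o= line")
    parts = line[2:].split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 parameters, got {len(parts)}")
    return Origin(*parts)


origin = parse_origin("o=johnston 2890844526 2890844526 IN IP4 43.32.1.5")
```

The session-id and version are kept as strings because they can be larger than a 32-bit integer and only need to be compared, not computed with.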

7.1.3 Session Name and Information

The s= field contains a name for the session. It can contain any non-zero number of characters. The optional i=field contains information about the session. It can contain any number of characters.

7.1.4 URI

The optional u= field contains a uniform resource identifier (URI) with more information about the session.

7.1.5 E-Mail Address and Phone Number

The optional e= field contains an e-mail address of the host of the session. If a display name is used, the e-mail address is enclosed in <>. The optional p= field contains a phone number. The phone number should be given in globalized format, beginning with a +, then the country code, a space or −, then the local number. Either spaces or − are permitted as spacers in SDP. A comment may be present in ().

7.1.6 Connection Data

The c= field contains information about the media connection. The field contains:

     c=network-type address-type connection-address

The network-type parameter is defined as IN for the Internet. The address type is defined as IP4 for IPv4 addresses, IP6 for IPv6 addresses. The connection-address is the IP address that will be sending the media packets, which could be either multicast or unicast. If multicast, the connection-address field contains:

connection-address=base-multicast-address/ttl/number-of-addresses

where ttl is the time-to-live value, and number-of-addresses indicates how many contiguous multicast addresses are included starting with the base-multicast-address.
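The slash-separated form can be unpacked as in the sketch below. The helper name `parse_connection_address` is hypothetical, and the sketch assumes an IPv4 address where the second element, when present, is the TTL.

```python
from typing import Optional, Tuple


def parse_connection_address(value: str) -> Tuple[str, Optional[int], int]:
    """Return (base address, ttl or None, number of addresses)."""
    parts = value.split("/")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected connection-address: {value!r}")
    base = parts[0]
    ttl = int(parts[1]) if len(parts) > 1 else None
    # When the count is omitted, a single address is meant.
    count = int(parts[2]) if len(parts) > 2 else 1
    return base, ttl, count


multicast = parse_connection_address("225.45.3.56/236")
unicast = parse_connection_address("100.101.102.103")
```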

7.1.7 Bandwidth

The optional b= field contains information about the bandwidth required. It is of the form:

     b=modifier:bandwidth-value

The modifier is either CT for conference total or AS for application specific. CT is used for multicast session to specify the total bandwidth that can be used by all participants in the session. AS is used to specify the bandwidth of a single site. The bandwidth-value parameter is the specified number of kilobits per second.

7.1.8 Time, Repeat Times, and Time Zones

The t= field contains the start time and stop time of the session.

     t=start-time stop-time

The times are specified using NTP timestamps. For a scheduled session, a stop-time of zero indicates that the session goes on indefinitely. A start-time and stop-time of zero for a scheduled session indicates that it is permanent. The optional r= field contains information about the repeat times that can be specified in either in NTP or in days (d), hours (h), or minutes (m). The optional z= field contains information about the time zone offsets. This field is used if a reoccurring session spans a change from daylight-savings to standard time, or vice versa.
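Since the t= values count seconds from the NTP epoch (1 January 1900) rather than the Unix epoch, converting them takes a fixed offset. The sketch below shows one way to do that; `sdp_time_to_datetime` is an illustrative name.

```python
from datetime import datetime, timezone
from typing import Optional

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_OFFSET = 2208988800


def sdp_time_to_datetime(value: int) -> Optional[datetime]:
    """Convert a t= start or stop time to UTC; 0 means 'no bound'."""
    if value == 0:
        return None
    return datetime.fromtimestamp(value - NTP_TO_UNIX_OFFSET, tz=timezone.utc)


start = sdp_time_to_datetime(2877631875)  # start time from the example above
stop = sdp_time_to_datetime(0)            # as in "t=0 0"
```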

7.1.9 Encryption Keys

The optional k= field contains the encryption key to be used for the media session. The field contains:

     k=method:encryption-key

The method parameter can be clear, base64, uri, or prompt. If the method is prompt, the key will not be carried in SDP; instead, the user will be prompted as they join the encrypted session. Otherwise, the key is sent in the encryption-key parameter.

7.1.10 Media Announcements

The optional m= field contains information about the type of media session. The field contains:

     m=media port transport format-list

The media parameter is either audio, video, application, data, telephone-event, or control. The port parameter contains the port number. The transport parameter contains the transport protocol, which is either RTP/AVP or udp. (RTP/AVP stands for Real-time Transport Protocol [4] / audio video profiles [5], which is described in Section 7.3.) The format-list contains more information about the media. Usually, it contains media payload types defined in RTP audio video profiles. More than one media payload type can be listed, allowing multiple alternative codecs for the media session. For example, the following media field lists four codecs:

     m=audio 49430 RTP/AVP 0 6 8 99

One of these four codecs can be used for the audio media session. If the intention is to establish three audio channels, three separate media fields would be used. For non-RTP media, Internet media types should be listed in the format-list. For example,

     m=application 52341 udp wb

could be used to specify the application/wb media type.
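A small sketch of splitting an m= line into the four parts named above. `parse_media_line` is a made-up helper; it keeps the format list as strings because non-RTP media (such as wb) use names rather than payload type numbers.

```python
def parse_media_line(line: str) -> dict:
    """Parse 'm=<media> <port> <transport> <format-list>'."""
    if not line.startswith("m="):
        raise ValueError("not an m= line")
    media, port_text, transport, *formats = line[2:].split()
    # The port may be written as "<port>/<number of ports>"; keep the first part.
    port = int(port_text.partition("/")[0])
    return {"media": media, "port": port, "transport": transport, "formats": formats}


audio = parse_media_line("m=audio 49430 RTP/AVP 0 6 8 99")
whiteboard = parse_media_line("m=application 52341 udp wb")
```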

7.1.11 Attributes

The optional a= field contains attributes of the preceding media session. This field can be used to extend SDP to provide more information about the media. If not fully understood by a SDP user, the attribute field can be ignored. There can be one or more attribute fields for each media payload type listed in the media field. For the RTP/AVP example in Section 7.1.10, the following four attribute fields could follow the media field:

a=rtpmap:0 PCMU/8000
a=rtpmap:6 DVI4/16000
a=rtpmap:8 PCMA/8000
a=rtpmap:99 iLBC

Other attributes are shown in Table 7.2. Full details of the use of these attributes are in the standard document [1]. The details of the iLBC (Internet Low Bit Rate) Codec are in [6].

Table 7.2: SDP Attribute Values
Attribute	Name
a=rtpmap:	RTP/AVP list
a=cat:	Category of the session
a=keywds:	Keywords of session
a=tool:	Name of tool used to create SDP
a=ptime:	Length of time in milliseconds for each packet
a=recvonly	Receive only mode
a=sendrecv	Send and receive mode
a=sendonly	Send only mode
a=orient:	Orientation for whiteboard sessions
a=type:	Type of conference
a=charset:	Character set used for subject and information fields
a=sdplang:	Language for the session description
a=lang:	Default language for the session
a=framerate:	Maximum video frame rate in frames per second
a=quality:	Suggests quality of encoding
a=fmtp:	Format transport
a=mid:	Media identification grouping
a=direction:	Direction for symmetric media
a=rtcp:	Explicit RTCP port (and address)
a=inactive	Inactive mode

7.1.12 Use of SDP in SIP

The use of SDP with SIP is given in the SDP Offer Answer RFC 3264 [7]. The default message body type in SIP is application/sdp. The calling party lists the media capabilities that they are willing to receive in SDP in either an INVITE or in an ACK. The called party lists their media capabilities in the 200 OK response to the INVITE. More generally, offers or answers may be in INVITEs, PRACKs, or UPDATEs or in reliably sent 18x or 200 responses to these methods.

Because SDP was developed with scheduled multicast sessions in mind, many of the fields have little or no meaning in the context of dynamic sessions established using SIP. In order to maintain compatibility with the SDP protocol, however, all required fields are included. A typical SIP use of SDP includes the version, origin, subject, time, connection, and one or more media and attribute fields as shown in Table 2.1. The origin, subject, and time fields are not used by SIP but are included for compatibility. In the SDP standard, the subject field is a required field and must contain at least one character, suggested to be s=− if there is no subject. The time field is usually set to t=0 0.

SIP uses the connection, media, and attribute fields to set up sessions between user agents. Because the type of media session and codec to be used are part of the connection negotiation, SIP can use SDP to specify multiple alternative media types and to selectively accept or decline those media types. When multiple media codecs are listed, the caller and called party's media fields must be aligned—that is, there must be the same number, and they must be listed in the same order. The offer answer specification, RFC 3264 [7], recommends that an attribute containing a=rtpmap: be used for each media field [7]. A media stream is declined by setting the port number to zero for the corresponding media field in the SDP response. In the following example, the caller Tesla wants to set up an audio and video call with two possible audio codecs and a video codec in the SDP carried in the initial INVITE:

v=0
o=Tesla 2890844526 2890844526 IN IP4 lab.high-voltage.org
s=-
c=IN IP4 100.101.102.103
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
m=video 49172 RTP/AVP 32
a=rtpmap:32 MPV/90000

The codecs are referenced by the RTP/AVP profile numbers 0, 8, and 32. The called party Marconi answers the call, chooses the second codec for the first media field and declines the second media field, only wanting a PCM A-Law audio session.

v=0
o=Marconi 2890844526 2890844526 IN IP4 tower.radio.org
s=-
c=IN IP4 200.201.202.203
t=0 0
m=audio 60000 RTP/AVP 8
a=rtpmap:8 PCMA/8000
m=video 0 RTP/AVP 32

If this audio-only call is not acceptable, then Tesla would send an ACK then a BYE to cancel the call. Otherwise, the audio session would be established and RTP packets exchanged. As this example illustrates, unless the number and order of media fields is maintained, the calling party would not know for certain which media sessions were being accepted and declined by the called party.
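The pairing of media fields by position can be sketched as follows, using the Tesla and Marconi descriptions from the example. The function names are illustrative, and the sketch only looks at the m= lines: an answer port of zero is read as a declined stream.

```python
from typing import List, Tuple


def media_lines(sdp: str) -> List[str]:
    return [line for line in sdp.splitlines() if line.startswith("m=")]


def accepted_media(offer: str, answer: str) -> List[Tuple[str, bool]]:
    """Pair offer and answer m= lines by index; port 0 means declined."""
    offered, answered = media_lines(offer), media_lines(answer)
    if len(offered) != len(answered):
        raise ValueError("offer and answer must have the same number of m= lines")
    result = []
    for offer_line, answer_line in zip(offered, answered):
        media = offer_line[2:].split()[0]
        answer_port = int(answer_line[2:].split()[1])
        result.append((media, answer_port != 0))
    return result


offer = "\r\n".join([
    "v=0", "o=Tesla 2890844526 2890844526 IN IP4 lab.high-voltage.org", "s=-",
    "c=IN IP4 100.101.102.103", "t=0 0",
    "m=audio 49170 RTP/AVP 0 8", "a=rtpmap:0 PCMU/8000", "a=rtpmap:8 PCMA/8000",
    "m=video 49172 RTP/AVP 32", "a=rtpmap:32 MPV/90000",
])
answer = "\r\n".join([
    "v=0", "o=Marconi 2890844526 2890844526 IN IP4 tower.radio.org", "s=-",
    "c=IN IP4 200.201.202.203", "t=0 0",
    "m=audio 60000 RTP/AVP 8", "a=rtpmap:8 PCMA/8000",
    "m=video 0 RTP/AVP 32",
])
outcome = accepted_media(offer, answer)
```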

One party in a call can temporarily place the other on hold (i.e., suspending the media packet sending). This is done by sending an INVITE with identical SDP to that of the original INVITE but with a=sendonly attribute present. The call is made active again by sending another INVITE with the a=sendrecv attribute present. (Note that older RFC 2543 compliant UAs may initiate hold using c=0.0.0.0.) For further examples of SDP use with SIP, see the SDP Offer Answer Examples document [8].

from：https://blog.csdn.net/voipmaker/article/details/6111629

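1. The a field

1.1 The crypto attribute

a=crypto:<tag> <crypto-suite> <key-params> [<session-params>]

a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:d0RmdmcmVCspeEc3QGZiNWpVLFJhQX1cfHAwJSoj|2^20|1:32

tag: used to select one crypto attribute during offer/answer

crypto-suite: identifier for the encryption and authentication algorithms

key-params: method:info. Only one method, "inline", is currently defined; it means the info part carries the key itself

session-params: optional

Reference: https://tools.ietf.org/html/rfc4568#section-4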
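As a sketch of how the crypto line above breaks down, the helper below separates the tag, suite and key parameters. `parse_crypto` is an illustrative name; the sketch assumes a single inline key whose info part is the base64 key and salt, optionally followed by "|"-separated lifetime and MKI fields.

```python
def parse_crypto(line: str) -> dict:
    """Split an a=crypto line into tag, suite, key parameters and session parameters."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not an a=crypto line")
    tag, suite, key_params, *session_params = line[len(prefix):].split()
    method, _, info = key_params.partition(":")
    # info looks like "<base64 key and salt>[|lifetime][|MKI:length]"
    key_salt, *key_options = info.split("|")
    return {
        "tag": int(tag),
        "suite": suite,
        "method": method,
        "key_salt": key_salt,
        "key_options": key_options,
        "session_params": session_params,
    }


crypto = parse_crypto(
    "a=crypto:1 AES_CM_128_HMAC_SHA1_80 "
    "inline:d0RmdmcmVCspeEc3QGZiNWpVLFJhQX1cfHAwJSoj|2^20|1:32"
)
```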

1.2 The ssrc attribute

a=ssrc:<ssrc-id> <attribute>:<value>

a=ssrc:2 cname:stream_1_cname

a=ssrc:2 label:video_track_id_1

The attribute can be: cname (uniquely identifies a client; each client has exactly one cname)

msid

mslabel

label

fmtp

Reference: https://tools.ietf.org/html/rfc5576#section-4

Note: for the label attribute, see https://www.packetizer.com/rfc/rfc4574/

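1.3 The ssrc-group attribute

a=ssrc-group:<semantics> <ssrc-id> ...

a=ssrc-group:FEC 2 3

semantics: FID (flow identification), FEC (forward error correction) or SIM (used for simulcast).

FID: only one codec can be in use at any given time; note that the flows of one FID group should not use the same port/IP. A typical use of FID is implementing retransmission.

ssrc-id: one or more values, listing all the SSRCs in the group

Reference: https://tools.ietf.org/html/rfc5576#section-4

Note: the RTX document is https://tools.ietf.org/html/rfc4588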

1.4 The rtpmap attribute

a=rtpmap:<payload type> <encoding name>/<clock rate> [/<encoding parameters>]

a=rtpmap:120 VP8/90000

payload type: the RTP payload type number

encoding name: the codec name

encoding parameters: for audio, this may be the number of channels

(Note: there are two FEC payload types, ulpfec and flexfec. The reference documents are:

ulpfec: https://tools.ietf.org/html/rfc5109

flexfec: https://tools.ietf.org/html/draft-ietf-payload-flexible-fec-scheme-05)

Reference: https://tools.ietf.org/html/rfc4566
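The rtpmap syntax can be taken apart as in this sketch. `parse_rtpmap` is a hypothetical helper, and it returns the optional encoding parameters as a plain string without interpreting them.

```python
def parse_rtpmap(line: str) -> dict:
    """Parse 'a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]'."""
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError("not an a=rtpmap line")
    payload_type, encoding = line[len(prefix):].split(None, 1)
    name, clock_rate, *params = encoding.strip().split("/")
    return {
        "payload_type": int(payload_type),
        "name": name,
        "clock_rate": int(clock_rate),
        "encoding_params": params[0] if params else None,
    }


vp8 = parse_rtpmap("a=rtpmap:120 VP8/90000")
opus = parse_rtpmap("a=rtpmap:111 opus/48000/2")
```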

1.5 The MediaContentDirection attribute

a=sendrecv
a=recvonly
a=sendonly
a=inactive

Reference: https://tools.ietf.org/html/rfc4566

1.6 The ice-ufrag and ice-pwd attributes

a=ice-ufrag:<ufrag>
a=ice-pwd:<pwd>

a=ice-ufrag:ufrag_video

a=ice-pwd:pwd_video

The username fragment and password used for ICE hole punching

Reference: https://tools.ietf.org/html/rfc5245#section-15.4

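1.7 The candidate attribute

a=candidate:<foundation> <component-id> <transport> <priority> <connection-address> <port> typ <cand-type> [raddr <rel-addr>] [rport <rel-port>]

a=candidate:a0+B/4 1 udp 2130706432 74.125.224.39 3457 typ relay generation 2

foundation: tells whether two candidates have the same type, the same base address and the same STUN server

component-id: starts at 1 and increments. RTP must be 1 and RTCP must be 2

priority: the candidate's priority. ICE uses it to order candidate pairs for connectivity checks, and higher values are preferred

cand-type: one of "host", "srflx", "prflx" or "relay". srflx is server reflexive, prflx is peer reflexive, and relay is a relayed candidate; they correspond to four ways of connecting.

rel-addr: the related address. For srflx and prflx candidates it is the base address of the candidate; for relay candidates it is the mapped address returned by the TURN server

rel-port: the port that goes with rel-addr

Reference: https://tools.ietf.org/html/rfc5245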
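A sketch of reading the candidate example above. `parse_candidate` is an illustrative name; the sketch assumes that everything after the candidate type comes as name/value pairs (raddr, rport, generation and so on).

```python
def parse_candidate(line: str) -> dict:
    """Parse an a=candidate line into its fixed fields plus trailing name/value pairs."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError("not an a=candidate line")
    tokens = line[len(prefix):].split()
    if len(tokens) < 8 or tokens[6] != "typ":
        raise ValueError("malformed candidate")
    foundation, component, transport, priority, address, port, _, cand_type = tokens[:8]
    # Remaining tokens are treated as alternating names and values.
    extras = dict(zip(tokens[8::2], tokens[9::2]))
    return {
        "foundation": foundation,
        "component_id": int(component),
        "transport": transport,
        "priority": int(priority),
        "address": address,
        "port": int(port),
        "type": cand_type,
        "extras": extras,
    }


candidate = parse_candidate(
    "a=candidate:a0+B/4 1 udp 2130706432 74.125.224.39 3457 typ relay generation 2"
)
```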

1.8 The rtcp attribute

a=rtcp:<port> <nettype> <addrtype> <connection-address>

a=rtcp:2347 IN IP4 74.125.127.126

The port and address used for RTCP

Reference: https://tools.ietf.org/id/draft-ietf-mmusic-sdp4nat-00.txt

1.9 The msid-semantic attribute

a=msid-semantic:<semantic> <msid>

a=msid-semantic: WMS local_stream_1

WMS stands for WebRTC Media Streams

local_stream_1 is the msid (the msid is presumably what gets matched against the ssrc lines)

Reference: https://tools.ietf.org/html/draft-alvestrand-rtcweb-msid-02#section-3

1.10 The msid attribute

a=msid:<msid>

a=msid: local_stream_1

The value of the "msid" attribute consists of an identifier and an optional "appdata" field.

This new attribute allows endpoints to associate RTP streams that are described in different media descriptions with the same MediaStreams

and to carry an identifier for each MediaStreamTrack in its "appdata" field

Reference: https://tools.ietf.org/html/draft-ietf-mmusic-msid-16#page-10

Note: in WebRTC, the second argument of the SdpSerialize function must be set to true for this attribute to be emitted; calling the JSEP toString function directly does not produce it

1.11 The group attribute

a=group:<semantics> <semantics-extension>

a=group:BUNDLE

"a=group" lines are used to group together several "m" lines that are identified by their "mid" attribute

There MAY be several "a=group" lines in a session description. The "a=group" lines of a session description can use the same or different semantics

References:

https://tools.ietf.org/html/rfc5888

https://tools.ietf.org/html/draft-ietf-mmusic-sdp-bundle-negotiation-39

1.12 The bundle-only attribute

a=bundle-only

a=bundle-only

Used together with the group attribute. It indicates that different media share the same port

1.13 The rtcp-fb attribute

a=rtcp-fb:<payload> <param>

a=rtcp-fb:96 ccm fir

Reference: https://tools.ietf.org/html/rfc4585

1.14 The rtcp-rsize attribute

a=rtcp-rsize

a=rtcp-rsize

Reference: https://tools.ietf.org/html/rfc5506

1.15 The fingerprint attribute

a=fingerprint:<hash-func> <fingerprint>

a=fingerprint:SHA-1 4A:AD:B9:B1:3F:82:18:3B:54:02:12:DF:3E:5D:49:6B:19:E5:7C:AB

Reference: https://tools.ietf.org/html/rfc4572#page-7

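1.16 The extmap attribute

a=extmap:<id> <uri>

a=extmap:8 http://www.webrtc.org/experiments/rtp-hdrext/video-timing

An RTP header extension. It has three properties:

1. It can be asymmetric (a direction such as recvonly or sendonly can be given)

2. Mutually exclusive alternatives can be offered (the answer picks one of the RTP extensions that the offer lists under the same id; such ids must be in the range 4096-4351)

3. A session can declare multiple header extensions

Reference: https://tools.ietf.org/html/rfc5285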

1.17 The fmtp attribute

a=fmtp:<payload> <param>

a=fmtp:97 apt=96

Gives the payload type of a codec together with its format-specific parameters

Reference: https://tools.ietf.org/html/rfc4566
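For the common case where the fmtp parameters are a semicolon-separated list of key=value pairs, as in both fmtp examples in this article, a parser could look like the sketch below. `parse_fmtp` is a made-up name, and formats whose parameters are not key=value pairs would need their own handling.

```python
from typing import Dict, Tuple


def parse_fmtp(line: str) -> Tuple[int, Dict[str, str]]:
    """Parse 'a=fmtp:<payload> <key>=<value>;<key>=<value>...'."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    payload_type, _, params = line[len(prefix):].partition(" ")
    parsed = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return int(payload_type), parsed


rtx = parse_fmtp("a=fmtp:97 apt=96")
opus_params = parse_fmtp("a=fmtp:111 minptime=10;useinbandfec=1")
```

In the first example, apt=96 says that payload type 97 is associated with payload type 96.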

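1.18 The mid attribute

a=mid:<media name>

a=mid:audio

The name of the media, used to look up a specific media section

1.19 The setup attribute

a=setup:<role>

a=setup:active

The role in connection setup: whether this side connects actively, passively and so on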

2 The v field

v=0

Reference: https://tools.ietf.org/html/rfc4566

3 The o field

o=<username> <session id> <version> <network type> <address type> <address>

o=- 18446744069414584320 18446462598732840960 IN IP4 127.0.0.1

Reference: https://tools.ietf.org/html/rfc4566

4 The s field

s=<session name>

Reference: https://tools.ietf.org/html/rfc4566

5 The m field

m=<media> <port> <transport> <format list>

m=audio 2345 RTP/SAVPF 111 103 104

Reference: https://tools.ietf.org/html/rfc4566

6 The b field

Bandwidth

Reference: https://tools.ietf.org/html/rfc4566

offer/answer:

For offer/answer, see:

https://tools.ietf.org/html/rfc3264#page-8

Notes:

1. The answer MUST contain exactly the same number of "m=" lines as the offer

2. If the answerer has no media formats in common for a particular offered stream, the answerer MUST reject that media stream by setting the port to zero.

3. Rejection in the answer: to reject a media stream, set the port of that media to 0. One case needs care, though: with a=bundle-only, together with an earlier a=group:BUNDLE line, several media streams share one port, and such a media section may have its port set to 0

Codec comparison in CreateAnswer

1. For both audio and video, the codec names are compared; if the payload type is 95 or below, the ids are compared as well (payload types up to 95 are static)

2. For audio, clockrate, bitrate and channels are compared: each must either be equal or be 0 on one side.

3. For video, if the codec is H264, the profile-level-id values are compared
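The audio rules above can be written down roughly as follows. This is a sketch of the rules as described in this article, not the actual WebRTC source: `AudioCodec` and `audio_codecs_match` are hypothetical names, and the id comparison is assumed to apply only when both payload types are in the static range.

```python
from dataclasses import dataclass


@dataclass
class AudioCodec:
    payload_type: int
    name: str
    clock_rate: int = 0
    bitrate: int = 0
    channels: int = 0


def _equal_or_unset(a: int, b: int) -> bool:
    """Values match when they are equal or when either side is 0 (unset)."""
    return a == b or a == 0 or b == 0


def audio_codecs_match(a: AudioCodec, b: AudioCodec) -> bool:
    if a.name.lower() != b.name.lower():
        return False
    # Assumption: ids are only compared when both are static (<= 95).
    if a.payload_type <= 95 and b.payload_type <= 95 and a.payload_type != b.payload_type:
        return False
    return (
        _equal_or_unset(a.clock_rate, b.clock_rate)
        and _equal_or_unset(a.bitrate, b.bitrate)
        and _equal_or_unset(a.channels, b.channels)
    )


same_opus = audio_codecs_match(
    AudioCodec(111, "opus", 48000, 0, 2), AudioCodec(96, "OPUS", 48000, 0, 2)
)
static_mismatch = audio_codecs_match(
    AudioCodec(0, "PCMU", 8000), AudioCodec(8, "PCMU", 8000)
)
```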

from: https://blog.csdn.net/myiloveuuu/article/details/78998183

SDP structure

from：https://blog.piasy.com/2018/10/14/WebRTC-API-Overview/index.html

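Basic structure of SDP

Let's first get the basic structure of SDP straight.

Broadly, a WebRTC SDP breaks down into several parts:

session metadata: v=, o=, s=, t=
network description: c=, a=candidate
stream description: m=, a=rtpmap, a=fmtp, a=sendrecv …
security description: a=crypto, a=ice-ufrag, a=ice-pwd, a=fingerprint
QoS, grouping description: a=rtcp-fb, a=group, a=rtcp-mux

The block that starts with m= is called an m section, and the m= line itself is the m line; the many a lines inside it describe the attributes of that media. Each kind of media data is called a media, and each media has its own m section in the SDP.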

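A WebRTC SDP has three types:

offer: the initiator's own description of the call;
answer: the response given by the other party after receiving the offer;
pranswer: a provisional answer, not final; it may later be updated by another pranswer or by an answer;

Plan B vs. Unified Plan

No discussion of SDP is complete without its two "plans", the two SDP formats for describing multiple media streams. Examples of multiple streams: screen capture plus camera, or several cameras (viewpoints).

In Plan B, all media streams of the same kind share a single m line and are told apart by msid. In Unified Plan, every media stream gets its own m line, so two video streams mean two video m lines.

The WebRTC standard adopted Unified Plan, and the WebRTC code already supports it, so we only look at the Unified Plan API.

References: Plan B, Unified Plan, Unified Plan vs Plan B.

In the WebRTC source, Plan B corresponds to the PC's Stream/Sender/Receiver API, and Unified Plan to the Track/Transceiver API.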

Capturer/Source/Track/Sink/Transceiver

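Next, let's go through a few key concepts in the media exchange pipeline:

Capturer: responsible for capturing data. Only video has this layer of abstraction. It has several implementations: camera capture (Android even has two, Camera1 and Camera2), screen capture, video file capture and so on;
Source: the data source;
Track: the carrier for exchanging media data. The sender sends its local Track to the remote receiver. Every Track has its own track id, and related Tracks share the same stream id;
Sink: the consumer of Track data. Only video has this wrapper. Local preview on the sender and rendering of remote video on the receiver are both handled by a Sink;
Transceiver: responsible for sending and receiving media data;

Taking video as an example: data is captured by the sender's Capturer, handed to the Source and then to the local Track, where it splits two ways. One path goes to a local Sink for preview, the other goes through the Transceiver to the receiver. On the receiving side, the Track hands the data to a Sink for rendering.

The APP layer is fully responsible for creating and destroying the Capturer; it only needs to be connected to a Source. Creating a Source requires calling the PC Factory API, and so does creating a Track, which also takes the Source as a parameter. Sinks are likewise created and destroyed by the APP layer and only need to be added to a Track. Creating a Transceiver requires calling the PC API.

With that in place, let's look at the PC Factory and PC interfaces.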

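The PeerConnectionFactory interface

CreatePeerConnectionFactory

With the default build options, rtc_use_builtin_sw_codecs = false, so USE_BUILTIN_SW_CODECS is not defined and CreatePeerConnectionFactory has a single overload, which takes three threads, an adm, audio/video encoder/decoder factories, an AudioMixer and an AudioProcessing.

CreatePeerConnection

Creates a PC object. It takes an RTCConfiguration and a PeerConnectionDependencies: the former holds the various settings, the latter holds the customizable interface implementations, such as PortAllocator, AsyncResolverFactory, RTCCertificateGeneratorInterface and SSLCertificateVerifier.

Android/iOS support for dependencies has not caught up yet. Users of such advanced features are not afraid of writing their own wrappers in the native layer, but that means reinventing the wheels that already exist in WebRTC's Java/ObjC code.

CreateAudio/VideoSource/Track

These are the interfaces mentioned earlier for creating Audio/Video Sources and Tracks.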

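The PeerConnection interface

Preparation:

AddTrack: adds a track to send;
AddTransceiver: adds a transceiver;
CreateDataChannel: adds a DataChannel;
RemoveTrack: removes a track;
GetTransceivers: returns all transceivers;

Establishing the P2P connection:

CreateOffer: creates an offer;
CreateAnswer: creates an answer;
SetLocalDescription: sets the local SDP;
SetRemoteDescription: sets the remote SDP;
AddIceCandidate: adds an ICE candidate;
RemoveIceCandidates: removes ICE candidates;

Note: the RTCOfferAnswerOptions passed to CreateOffer/CreateAnswer has offer_to_receive_X fields that exist for compatibility with Plan B semantics. Once they are set, the SDP contains audio/video m lines even if AddTrack was never called. With Unified Plan these two fields should not be set; call AddTrack/AddTransceiver/CreateDataChannel beforehand instead to state whether audio/video/data is needed.

Other interfaces:

GetStats: returns statistics;
SetBitrate: sets the total send bitrate of this PC, including the start, minimum and maximum bitrate;
SetBitrateAllocationStrategy: sets a custom bitrate allocation strategy, which can be used to implement per-track bitrate allocation;

Note: SetBitrateAllocationStrategy is not exposed on either Android or iOS. Android exposes SetBitrate but iOS does not, although the encoder output bitrate can be limited through RTCRtpSender setParameters.

The PeerConnectionObserver callback interface:

OnSignalingChange: creating/setting an SDP triggers a signaling state change. The common transitions are stable -> have-local-offer -> stable and stable -> have-remote-offer -> stable; see SPEC 4.3 State Definitions for details;
OnRenegotiationNeeded: called when renegotiation (re-establishing the P2P connection) is needed, for example on an ICE restart;
OnIceGatheringChange: called when the ICE candidate gathering state changes;
OnIceConnectionChange: called when the ICE connection state changes;
OnIceCandidate: called when a local ICE candidate has been gathered;
OnIceCandidatesRemoved: called when local ICE candidates have been removed;
OnTrack: called after SetRemoteDescription if the SDP indicates that a receiving transceiver will be created;
OnRemoveTrack: called once it is certain that a track will no longer receive media data. The track is not removed, but the recv direction of the transceiver is dropped;

Next, let's take a closer look at the transceiver.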

RtpTransceiver

An m section in the SDP has an a=mid: line that defines the id of that media, called the mid. Take the following offer and answer, for example:

# offer
...
a=group:BUNDLE 0 1 2
...
m=video 9 UDP/TLS/RTP/SAVPF 100 96 97 98 99 101 127 124 125
...
a=mid:0
...
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 104 9 102 0 8 106 105 13 110 112 113 126
...
a=mid:1
...
m=application 9 DTLS/SCTP 5000
...
a=mid:2
...
# answer
...
a=group:BUNDLE 0 1 2
...
m=video 9 UDP/TLS/RTP/SAVPF 100 96 97 98 99 101 127 124
...
a=mid:0
...
m=audio 9 UDP/TLS/RTP/SAVPF 103 104 9 102 0 8 106 105 13 110 112 113 126
...
a=mid:1
...
m=application 9 DTLS/SCTP 5000
...
a=mid:2
...

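There are three media here, video, audio and application, with mids 0, 1 and 2 respectively. application is the media type of the DataChannel.

Notice that the same media has the same mid in the offer and in the answer. In other words, for a given endpoint, the media data of one kind that it sends and receives shares one mid.

In the WebRTC standard, a transceiver is exactly this combination of a sender and a receiver that share a mid. It has fields such as media type, mid, direction, sender and receiver. direction takes one of these values: kSendRecv, kSendOnly, kRecvOnly, kInactive.

With AddTrack we add a local track, that is, a stream to be sent. The first AddTrack creates a transceiver whose direction defaults to kSendRecv. Although the offer_to_receive_X fields of RTCOfferAnswerOptions can control whether to receive at CreateOffer time, they are legacy fields and should be avoided where possible. So how do we control a transceiver's direction? With the AddTransceiver interface.

To create a kSendOnly transceiver, pass in the track and set direction to kSendOnly in RtpTransceiverInit, or pass only the media type and the init struct and call AddTrack later. To create a kRecvOnly transceiver, pass only the media type and the init struct and never call AddTrack.

When does a transceiver get associated with an m section in the SDP? When the offerer creates the offer, it builds m sections from its existing transceivers and records the index of the m section that corresponds to each transceiver, so that SetLocalDescription can assign each transceiver the right mid. When the answerer calls SetRemoteDescription (with the offer from the offerer), for each m section in the offer that has a recv direction it looks for an existing transceiver by media type. If one is found, the two are associated; otherwise a kRecvOnly transceiver is created (the offer can then only be kSendOnly, since media that is neither sent nor received does not appear in the SDP, so the response to this offer can only be kRecvOnly).

To sum up, for both the offerer and the answerer: media that needs to be sent should have a transceiver with a send direction added in advance, while media that is only received needs no transceiver added in advance (and one added in advance would not be used).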

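Appendix 1: Some SDP details

The m line names the transport protocol, for example UDP/TLS/RTP/SAVPF. The final SAVPF part can take several values: AVP, SAVP, AVPF, SAVPF
- AVP means AV profile
- S means secure
- F means feedback
rtpmap describes codecs, but some entries, such as rtx, are listed like codecs without actually being codecs;
fmtp supplies additional codec parameters (format parameters)
- max-fr: maximum framerate
- profile-level-id: the H.264 profile level id
rtx describes the retransmission strategy; it is declared with rtpmap and its parameters are given with fmtp
- apt: associated payload type, identifying the stream it describes;
- rtx-time: how long RTP packets are kept in the buffer;
rtcp-fb: RTCP feedback mechanisms
- the offer lists feedback mechanisms; the answer should remove the ones it does not understand or support, but must not modify them;
- ack rpsi/app
- nack pli/sli/rpsi/app
- rpsi: reference picture selection indication
- app: application-layer feedback mechanism
- pli: picture loss indication. It says the receiver has lost some data of a picture; the sender may respond with an I frame (similar to FIR), but has to take congestion control into account
- sli: slice loss indication
- ccm fir: codec control message, full intra refresh
fec is similar to rtx: it is also declared with rtpmap and its parameters are given with fmtp;

This article is an original work by Piasy, published at https://blog.piasy.com. Please read the original to support the author: https://blog.piasy.com/2019/01/01/WebRTC-RTP-Mux-Demux/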

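In the earlier article on adding video rotation correction to janus-pp-rec I briefly introduced a bit of the RTP protocol, focusing on the RTP header extension for video orientation. This time we dig deeper into RTP and look at how H.264 video data is packetized and depacketized.

RTP revisited

We start with the RFCs related to RTP and H.264. What follows is a summary of two RFCs: RTP: A Transport Protocol for Real-Time Applications, and RTP Payload Format for H.264 Video.

RTP packet structure

The header has a fixed 12-byte part, plus optional csrc and ext data (the janus-pp-rec video rotation article covers these in more detail):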

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

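The payload data follows. The RTP header itself carries no length field, so the payload length comes from the underlying transport (for example, the UDP datagram length). The payload format is defined separately by each profile, and the payload type value is agreed through SDP negotiation.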
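To make the fixed 12-byte header concrete, here is a minimal parsing sketch based on the RFC 3550 layout. `parse_rtp_header` is an illustrative name; the sketch reads the fixed part and the CSRC list only, so when the X bit is set the header extension still sits at the start of what it reports as the remainder.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header and CSRC list (header extension not handled)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    header_length = 12 + 4 * csrc_count
    if len(packet) < header_length:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack(f"!{csrc_count}I", packet[12:header_length]))
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
        "header_length": header_length,
    }


# A hand-built packet: V=2, marker set, PT=96, seq=1, timestamp=1000.
packet = (
    bytes([0x80, 0xE0, 0x00, 0x01])
    + (1000).to_bytes(4, "big")
    + (0x11223344).to_bytes(4, "big")
    + b"payload"
)
header = parse_rtp_header(packet)
payload = packet[header["header_length"]:]
```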

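Next, let's look at the H.264 payload format.

H.264 payload

The first byte of the H.264 payload has the same format as the NAL header; its type values are defined as follows: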

Table 1. Summary of NAL unit types and the corresponding packet types

NAL Unit  Packet    Packet Type Name                Section
Type      Type
-------------------------------------------------------------
0         reserved                                  -
1-23      NAL unit  Single NAL unit packet          5.6
24        STAP-A    Single-time aggregation packet  5.7.1
25        STAP-B    Single-time aggregation packet  5.7.1
26        MTAP16    Multi-time aggregation packet   5.7.2
27        MTAP24    Multi-time aggregation packet   5.7.2
28        FU-A      Fragmentation unit              5.8
29        FU-B      Fragmentation unit              5.8
30-31     reserved                                  -

There are three packetization modes for H.264 payload data: Single NAL unit mode (0), Non-interleaved mode (1) and Interleaved mode (2). The types each one supports are shown in the table below:

Table 3. Summary of allowed NAL unit types for each packetization mode (yes = allowed, no = disallowed, ig = ignore)

Payload  Packet    Single NAL  Non-Interleaved  Interleaved
Type     Type      Unit Mode   Mode             Mode
-------------------------------------------------------------
0        reserved  ig          ig               ig
1-23     NAL unit  yes         yes              no
24       STAP-A    no          yes              no
25       STAP-B    no          no               yes
26       MTAP16    no          no               yes
27       MTAP24    no          no               yes
28       FU-A      no          yes              yes
29       FU-B      no          no               yes
30-31    reserved  ig          ig               ig

Note: when encoding H.264, WebRTC iOS uses Non-interleaved mode for both baseline and high profile, and so does WebRTC Android.

So WebRTC actually uses only three packet types: NAL unit, STAP-A and FU-A. Let's look at these three next.
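Telling these packet types apart only requires the low five bits of the first payload byte, as the following sketch shows. `classify_h264_payload` is a made-up helper name, and the mapping follows Table 1.

```python
AGGREGATION_AND_FRAGMENT_TYPES = {
    24: "STAP-A",
    25: "STAP-B",
    26: "MTAP16",
    27: "MTAP24",
    28: "FU-A",
    29: "FU-B",
}


def classify_h264_payload(payload: bytes) -> str:
    """Name the packet type from the first byte of an H.264 RTP payload."""
    if not payload:
        raise ValueError("empty payload")
    nal_type = payload[0] & 0x1F  # low 5 bits of the F|NRI|Type byte
    if 1 <= nal_type <= 23:
        return "NAL unit"
    return AGGREGATION_AND_FRAGMENT_TYPES.get(nal_type, "reserved")


idr_slice = classify_h264_payload(bytes([0x65]))  # type 5
stap_a = classify_h264_payload(bytes([0x78]))     # type 24
fu_a = classify_h264_payload(bytes([0x7C]))       # type 28
```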

NAL unit

如果 type 为 [1, 23]，则该 RTP 包只包含一个 NALU：

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|               Bytes 2..n of a single NAL unit                 |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2. RTP payload format for single NAL unit packet

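Aggregation Packets

To cope with the large MTU difference between wired and wireless networks, the H.264 RTP payload format defines several aggregation schemes:

STAP-A: all aggregated NALUs share the same timestamp; no DON (decoding order number);
STAP-B: all aggregated NALUs share the same timestamp; with DON;
MTAP16: aggregated NALUs have different timestamps; the timestamp offset is stored in 16 bits;
MTAP24: aggregated NALUs have different timestamps; the timestamp offset is stored in 24 bits;
For an aggregation packet, the RTP timestamp is the earliest of the timestamps of all aggregated NALUs.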

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|             one or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3. RTP payload format for aggregation packets

STAP-A example:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 1 Data                           |
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 2 Data                           |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 7. An example of an RTP packet including an STAP-A
containing two single-time aggregation units
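
To make the STAP-A layout concrete, here is a minimal parsing sketch. It assumes the RTP header and any RTP padding have already been stripped, so the buffer starts at the STAP-A NAL header; NaluSpan and ParseStapA are names invented for this example, not WebRTC functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct NaluSpan {
  size_t offset;  // offset of the NALU header byte within the RTP payload
  size_t length;  // NALU length in bytes, header byte included
};

// Splits a STAP-A payload into its NALUs. Each aggregation unit is a
// 16-bit big-endian size followed by that many NALU bytes.
// Returns false if the payload is not STAP-A or is malformed.
inline bool ParseStapA(const uint8_t* payload, size_t size,
                       std::vector<NaluSpan>* out) {
  assert(payload != nullptr && out != nullptr);
  out->clear();
  if (size < 1 || (payload[0] & 0x1F) != 24) return false;  // not STAP-A
  size_t pos = 1;  // skip the STAP-A NAL header
  while (pos < size) {
    if (size - pos < 2) return false;  // truncated size field
    const size_t nalu_size =
        (static_cast<size_t>(payload[pos]) << 8) | payload[pos + 1];
    pos += 2;
    if (nalu_size == 0 || nalu_size > size - pos) return false;
    out->push_back({pos, nalu_size});
    pos += nalu_size;
  }
  return !out->empty();
}
```

Only offsets and lengths are recorded, so the caller can hand each NALU to the decoder without copying the payload.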

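Fragmentation Units (FUs)

Fragmenting at the application layer instead of relying on lower-layer fragmentation has two benefits:

It supports NALUs larger than 64 KB (the maximum size of an IPv4 packet); high-definition video can contain very large NALUs;
It makes it possible to apply FEC (forward error correction).

The fragments of one NALU must be sent in order with consecutive, ascending RTP sequence numbers, and no other RTP packet may be sent between the first and the last fragment. Only a single NALU can be fragmented: STAP and MTAP packets cannot be fragmented, and FUs cannot be nested. FU-A carries no DON; FU-B does.

The FU-A format is: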

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 14. RTP payload format for FU-A

The FU header format is:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

S: start bit; set to 1 on the first fragment of a NALU;
E: end bit; set to 1 on the last fragment of a NALU;
R: reserved; must be 0;
Type: same values and meaning as the type field of the NALU header.
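
The FU indicator and FU header can be illustrated with a small fragmentation sketch. It assumes a fixed maximum payload size for every fragment and a NALU that is too large for one packet; FragmentFuA is a hypothetical helper written for this article, not the WebRTC implementation. The FU indicator keeps the F and NRI bits of the original NAL header and sets the type to 28, while the original type moves into the FU header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Splits one NALU (header byte included) into FU-A payloads. Each payload is
// at most max_payload bytes, counting the FU indicator and FU header bytes.
inline std::vector<std::vector<uint8_t>> FragmentFuA(
    const std::vector<uint8_t>& nalu, size_t max_payload) {
  assert(max_payload >= 3);
  const size_t chunk = max_payload - 2;  // room left for NALU bytes
  // A NALU that fits in one packet should not be fragmented at all.
  assert(nalu.size() >= 2 && nalu.size() - 1 > chunk);

  const uint8_t nal_header = nalu[0];
  const uint8_t fu_indicator = (nal_header & 0xE0) | 28;  // F, NRI, type=FU-A
  const uint8_t original_type = nal_header & 0x1F;

  std::vector<std::vector<uint8_t>> packets;
  // Start at 1: the original NAL header byte is not copied into any fragment.
  for (size_t pos = 1; pos < nalu.size(); pos += chunk) {
    const size_t len = std::min(chunk, nalu.size() - pos);
    uint8_t fu_header = original_type;
    if (pos == 1) fu_header |= 0x80;                  // S bit
    if (pos + len == nalu.size()) fu_header |= 0x40;  // E bit
    std::vector<uint8_t> pkt = {fu_indicator, fu_header};
    pkt.insert(pkt.end(), nalu.begin() + static_cast<std::ptrdiff_t>(pos),
               nalu.begin() + static_cast<std::ptrdiff_t>(pos + len));
    packets.push_back(std::move(pkt));
  }
  return packets;
}
```

A receiver reverses this by rebuilding the NAL header as (FU indicator & 0xE0) | (FU header & 0x1F) and concatenating the fragment payloads from the S fragment to the E fragment.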

WebRTC H.264 packetization

With the theory covered, let's see how WebRTC implements it. The logic that wraps video data into RTP packets lives in RTPSenderVideo::SendVideo.

`RTPSenderVideo::SendVideo`

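Packetizing boils down to working out how many packets a frame needs and how much payload goes into each one. For that we need the maximum payload of a packet (packet size minus header size) for each kind of packet.

First, compute the maximum capacity of a packet, i.e. the room available for the RTP header plus payload, excluding the overhead of FEC and retransmission: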

// Maximum size of packet including rtp headers.
// Extra space left in case packet will be resent using fec or rtx.
int packet_capacity = rtp_sender_->MaxRtpPacketSize() - fec_packet_overhead -
                      (rtp_sender_->RtxStatus() ? kRtxHeaderSize : 0);

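rtp_sender_->MaxRtpPacketSize defaults to 1460, but it is set to 1200 when video is being sent. (Why 1200? That is still unclear to me.)

Next, four packet templates are prepared:

single_packet: for NAL unit and STAP-A packets;
first_packet: for the first FU-A packet;
middle_packet: for the middle FU-A packets;
last_packet: for the last FU-A packet.

The preparation consists of:

RTPSender::AllocatePacket sets the ssrc and csrcs fields, reserves space for the AbsoluteSendTime, TransmissionOffset and TransportSequenceNumber extensions, and sets the PlayoutDelayLimits and RtpMid extensions when needed;
RTPSenderVideo::SendVideo sets the payload_type, rtp_timestamp and capture_time_ms fields;
AddRtpHeaderExtensions sets the VideoOrientation, VideoContentTypeExtension, VideoTimingExtension and RtpGenericFrameDescriptorExtension extensions when needed;
first_packet, middle_packet and last_packet are all copies of single_packet, so the code only calls AddRtpHeaderExtensions to set their extensions.

These templates serve two purposes: they can be reused directly when building packets later, and they tell us exactly how much space the RTP header needs. As the code comment puts it: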

Simplest way to estimate how much extensions would occupy is to set them.

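Once the header size of each packet kind is known, we know how much payload each packet can hold (these values fill the fields of RtpPacketizer::PayloadSizeLimits):

max_payload_len: the maximum payload room, i.e. the packet capacity minus the middle-packet header size;
single_packet_reduction_len: when the frame fits in a single packet, the payload room is max_payload_len reduced by the difference between the single-packet and middle-packet header sizes; the result equals the packet capacity minus the single-packet header size;
first_packet_reduction_len: when the frame spans several packets, the first packet's payload room is max_payload_len reduced by the difference between the first-packet and middle-packet header sizes;
last_packet_reduction_len: likewise for the last packet, reduced by the difference between the last-packet and middle-packet header sizes.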
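
The arithmetic behind these four fields can be sketched as follows. The struct below only mirrors the field names mentioned above, and ComputeLimits with its header-size parameters is a hypothetical helper for illustration; the real WebRTC code reads the header sizes from the packet templates.

```cpp
#include <cassert>

// Illustrative mirror of the four fields described in the text.
struct PayloadSizeLimits {
  int max_payload_len;
  int first_packet_reduction_len;
  int last_packet_reduction_len;
  int single_packet_reduction_len;
};

// All sizes are in bytes. The *_hdr arguments are the RTP header sizes
// (extensions included) of the corresponding packet templates.
inline PayloadSizeLimits ComputeLimits(int packet_capacity, int single_hdr,
                                       int first_hdr, int middle_hdr,
                                       int last_hdr) {
  assert(packet_capacity > middle_hdr);
  PayloadSizeLimits limits;
  // Middle packets have the baseline header, so they define the maximum.
  limits.max_payload_len = packet_capacity - middle_hdr;
  // The other kinds lose whatever extra header bytes they carry.
  limits.first_packet_reduction_len = first_hdr - middle_hdr;
  limits.last_packet_reduction_len = last_hdr - middle_hdr;
  limits.single_packet_reduction_len = single_hdr - middle_hdr;
  return limits;
}
```

With made-up numbers, a capacity of 1200 bytes and header sizes of 40 (single), 36 (first), 20 (middle) and 28 (last) give max_payload_len = 1180 and reductions of 20, 16 and 8, so a single packet can carry 1180 - 20 = 1160 bytes, which is exactly 1200 - 40.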

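With the templates and limits ready, an RtpPacketizer is created. Its NumPackets method reports how many packets the frame needs, and NextPacket is then called to build each one. That is not the end: RTPSender::AssignSequenceNumber must be called to assign the sequence number, and if VideoTimingExtension is in use, packetization_finish_time_ms must be set as well. Finally, the packet goes through FEC processing or is sent directly with RTPSenderVideo::SendVideoPacket.

When the codec is H.264, the RtpPacketizer implementation is RtpPacketizerH264, so let's look at the H.264 packetization logic next.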

`RtpPacketizerH264`

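On construction, RtpPacketizerH264 builds input_fragments_, an array of RtpPacketizerH264::Fragment, from the contents of RTPFragmentationHeader. Each Fragment holds a pointer to the first byte of a NALU's payload and the NALU's length.

RTPFragmentationHeader simply describes every NALU in the frame: the offset of its payload within the buffer and the payload length. This information is produced after the encoder outputs data, by scanning the whole buffer for NALU start codes (00 00 01 or 00 00 00 01) and recording each NALU's offset and length. On Android this is done in VideoEncoderWrapper::ParseFragmentationHeader in sdk/android/src/jni/videoencoderwrapper.cc; on iOS it is done in H264CMSampleBufferToAnnexBBuffer in sdk/objc/components/video_codec/nalu_rewriter.cc.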
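
The start-code scan can be sketched as below. This is a simplified stand-in for the platform code named above, not a copy of it: NaluIndex and FindNalus are names invented here, and a zero byte directly before a 3-byte start code is treated as part of a 4-byte start code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct NaluIndex {
  size_t payload_offset;  // first byte after the start code
  size_t payload_size;    // bytes up to the next start code or buffer end
};

// Scans an Annex B buffer for 00 00 01 / 00 00 00 01 start codes and records
// the offset and size of every NALU found.
inline std::vector<NaluIndex> FindNalus(const uint8_t* buf, size_t size) {
  assert(buf != nullptr || size == 0);
  std::vector<NaluIndex> out;
  size_t i = 0;
  while (i + 3 <= size) {
    if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
      // A 4-byte start code is a 3-byte one preceded by a zero byte.
      const size_t start_code_begin = (i > 0 && buf[i - 1] == 0) ? i - 1 : i;
      if (!out.empty()) {
        // The previous NALU ends where this start code begins.
        out.back().payload_size = start_code_begin - out.back().payload_offset;
      }
      out.push_back({i + 3, 0});
      i += 3;
    } else {
      ++i;
    }
  }
  // The last NALU runs to the end of the buffer.
  if (!out.empty()) out.back().payload_size = size - out.back().payload_offset;
  return out;
}
```

For a buffer holding a 4-byte start code, two NALU bytes, a 3-byte start code and three NALU bytes, the function reports two NALUs at offsets 4 and 9.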

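The H.264 specification allows a picture to be split into several NALUs. In the data I observed from an iPhone 6 encoder, however, every non-key frame contained a single NALU, while every key frame contained two NALUs preceded by an SPS and a PPS, i.e. four NALUs in total.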

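With input_fragments_ in place, GeneratePackets iterates over it and packetizes each Fragment according to packetization_mode:

For SingleNalUnit, one PacketUnit is generated for the Fragment (which is just one NALU);
For NonInterleaved (the mode the WebRTC Native SDK actually uses), it checks whether the Fragment fits in a single packet, first computing how much data a single packet can hold: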
```
int single_packet_capacity = limits_.max_payload_len;
if (input_fragments_.size() == 1)
  single_packet_capacity -= limits_.single_packet_reduction_len;
else if (i == 0)
  single_packet_capacity -= limits_.first_packet_reduction_len;
else if (i + 1 == input_fragments_.size())
  single_packet_capacity -= limits_.last_packet_reduction_len;
```
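The logic is simple: single_packet_capacity is max_payload_len minus whichever reduction applies;
If fragment_len > single_packet_capacity, the Fragment does not fit in one packet and must be fragmented by calling PacketizeFuA; otherwise it fits in one packet and can be aggregated by calling PacketizeStapA;
PacketizeFuA decides how to split one Fragment across several packets and generates a PacketUnit for each. The splitting logic is in SplitAboutEqually, which handles quite a few edge cases; the basic idea is to use as few packets as possible and keep their sizes as close as possible. The PacketUnits it generates all have aggregated set to false;
PacketizeStapA works out how many Fragments fit in one packet. It also generates one PacketUnit per Fragment but increments num_packets_left_ only once. The PacketUnits it generates are marked as aggregated, and only the last one that goes into the packet has last_fragment set to true.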
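
The "as few packets as possible, sizes as close as possible" idea can be sketched in a few lines. This is a deliberately simplified version that assumes every packet has the same capacity; it ignores the first/last/single reductions and the edge cases that the real SplitAboutEqually handles, and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the payload size of each packet for a payload of payload_len bytes
// when every packet can hold at most max_payload_len bytes.
inline std::vector<size_t> SplitAboutEquallySimple(size_t payload_len,
                                                   size_t max_payload_len) {
  assert(payload_len > 0 && max_payload_len > 0);
  // Fewest packets that can hold the payload (ceiling division).
  const size_t num_packets =
      (payload_len + max_payload_len - 1) / max_payload_len;
  const size_t base = payload_len / num_packets;
  // This many packets must carry one extra byte to cover the remainder.
  const size_t larger = payload_len % num_packets;
  std::vector<size_t> sizes(num_packets, base);
  for (size_t i = num_packets - larger; i < num_packets; ++i) ++sizes[i];
  return sizes;
}
```

Splitting 10 bytes with a 4-byte limit gives three packets of 3, 3 and 4 bytes rather than 4, 4 and 2, so no packet is much smaller than the others.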

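When GeneratePackets finishes, num_packets_left_ holds the number of RTP packets this frame needs, and the PacketUnit array is ready.

RTPSenderVideo::SendVideo then calls NextPacket num_packets_left_ times to actually assemble each RTP packet. Here is what NextPacket does:

It inspects the first PacketUnit:
If both first_fragment and last_fragment are true, the payload is copied in directly;
This can be the SingleNalUnit case, or a NonInterleaved Fragment that went through PacketizeStapA on its own: in NonInterleaved mode a Fragment that fits in one packet is handled by PacketizeStapA, and if only one PacketUnit was generated, both its first_fragment and last_fragment are true;
Otherwise, if aggregated is true, NextAggregatePacket is called to build a STAP-A packet;
- One detail that took me a while to work out: this function consumes PacketUnits in a loop whose exit condition is !packet->aggregated or packet->last_fragment. Among the PacketUnits that belong in one packet, only the last one has last_fragment set to true (this is done in PacketizeStapA), so the loop exits correctly;
If aggregated is false, NextFragmentPacket is called to build an FU-A packet.

That completes the H.264 RTP packetization logic. Time for a sigh of relief :)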

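WebRTC H.264 depacketization

Having covered packetization, let's look at depacketization. It is somewhat more complex, mainly because packets may arrive out of order (a retransmitted packet can also be viewed as arriving out of order).

Depacketization has two main steps: first parse the RTP header and payload; then parse the payload and reverse the packetization according to the packet type to recover a complete frame. The first step is done by RtpPacket::ParseBuffer, called from Call::DeliverRtp. The second is more involved because it has to deal with reordering; its entry point is RtpVideoStreamReceiver::ReceivePacket.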

`RtpPacket::ParseBuffer`

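ParseBuffer has three tasks:

Parse the fields of the standard RTP header, including payload_type_, sequence_number_, timestamp_ and ssrc_;
Parse the metadata of the RTP header extensions, i.e. their offsets and lengths;
Determine the offset and length of the RTP payload, which is a simple subtraction once the second task is done.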
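
The three tasks can be sketched with a small standalone parser. This is not RtpPacket::ParseBuffer itself; ParsedRtp and ParseRtp are hypothetical names, and the individual extension elements inside the extension block are not decoded, only the block's offset and size are recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct ParsedRtp {
  uint8_t payload_type;
  bool marker;
  uint16_t sequence_number;
  uint32_t timestamp;
  uint32_t ssrc;
  size_t extension_offset;  // start of extension data, 0 if there is none
  size_t extension_size;    // extension data size in bytes
  size_t payload_offset;
  size_t payload_size;      // RTP padding excluded
};

inline uint32_t ReadU32(const uint8_t* p) {
  return (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
         (uint32_t{p[2]} << 8) | uint32_t{p[3]};
}

inline bool ParseRtp(const uint8_t* buf, size_t size, ParsedRtp* out) {
  assert(buf != nullptr && out != nullptr);
  constexpr size_t kFixedHeaderSize = 12;
  if (size < kFixedHeaderSize || (buf[0] >> 6) != 2) return false;
  const bool has_padding = (buf[0] & 0x20) != 0;
  const bool has_extension = (buf[0] & 0x10) != 0;
  const size_t csrc_count = buf[0] & 0x0F;
  // Task 1: standard header fields.
  out->marker = (buf[1] & 0x80) != 0;
  out->payload_type = buf[1] & 0x7F;
  out->sequence_number = static_cast<uint16_t>((buf[2] << 8) | buf[3]);
  out->timestamp = ReadU32(buf + 4);
  out->ssrc = ReadU32(buf + 8);
  size_t offset = kFixedHeaderSize + 4 * csrc_count;
  if (offset > size) return false;
  // Task 2: extension metadata (4-byte header, length in 32-bit words).
  out->extension_offset = 0;
  out->extension_size = 0;
  if (has_extension) {
    if (size - offset < 4) return false;
    const size_t words = (size_t{buf[offset + 2]} << 8) | buf[offset + 3];
    offset += 4;
    if (words * 4 > size - offset) return false;
    out->extension_offset = offset;
    out->extension_size = words * 4;
    offset += words * 4;
  }
  // Task 3: the payload is whatever remains, minus padding if present.
  size_t padding = 0;
  if (has_padding) {
    if (offset == size) return false;
    padding = buf[size - 1];  // last byte holds the padding length
    if (padding == 0 || padding > size - offset) return false;
  }
  out->payload_offset = offset;
  out->payload_size = size - offset - padding;
  return true;
}
```

Note that the payload size is computed from the total packet size; nothing in the header states it directly.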

`RtpVideoStreamReceiver::ReceivePacket`

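First, an RtpDepacketizer is created according to the payload type to parse the payload. For H.264 this is RtpDepacketizerH264::Parse, whose main job is to locate the actual data and its size:

Check the type field (the low five bits) of the first payload byte to determine the packet type (NAL unit, STAP-A or FU-A);
FU-A is parsed in ParseFuaNalu;
NAL unit and STAP-A are parsed in ProcessStapAOrSingleNalu;
The parsing code itself is not covered here.

Next, the actual data of the RTP header extensions is parsed, including VideoOrientation.

Finally, a VCMPacket is constructed and placed into the packet buffer by calling PacketBuffer::InsertPacket.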

`PacketBuffer::InsertPacket`

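PacketBuffer encapsulates the handling of out-of-order RTP packets. The general idea is:

For each received packet, check its sequence number:
- A packet that has definitely been received before is dropped;
- Otherwise the packet is stored in the data_buffer_ array, and some attributes of this sequence number are recorded in the sequence_buffer_ array;
- The index used in both arrays is the sequence number modulo the array size, so packets are stored in sequence-number order;
Call FindFrames to find complete frames among the packets received so far;
Hand each complete frame to RtpFrameReferenceFinder::ManageFrame, which makes sure the frame is decodable before passing it on to the decoding stage.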

`PacketBuffer::FindFrames`

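Every received packet triggers FindFrames, which scans forward (towards higher sequence numbers) starting from the sequence number of the packet just received:

A packet is examined only if it meets one of two conditions:
- It is a "frame start" packet;
- The packet with the preceding sequence number is continuous. Continuous means that it is a frame start packet, or that every sequence number before it (back to a frame start) has been received;
- For example, suppose 1 is a frame start packet. It is examined when it arrives. When 2 arrives it is examined too, because 1 is continuous. If 4 arrives next (assume it is not a frame start), it is not examined; when 3 then arrives, 3 and 4 are examined in turn;
What we look for first is a "frame end" packet, i.e. one with the packet->is_last_packet_in_frame flag;
Once a frame end packet is found, search backwards for the "frame start" packet;
- VP8/VP9 identify the frame start by the frame_begin flag (i.e. packet->is_first_packet_in_frame), while H.264 identifies it by a change in timestamp;
- Normally, since only frame start packets and continuous packets are examined, once a frame end packet is found a frame start packet is guaranteed to exist before it;
With the frame end and frame start packets found, a complete frame can be constructed. Only metadata is recorded here; the frame data is not copied.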
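
The continuity rule and the example with packets 1, 2, 4, 3 can be reproduced with a toy buffer. This is only a sketch under simplifying assumptions: frame boundaries come from explicit begin/end flags (the VP8/VP9 style, not the H.264 timestamp rule), the buffer size is a power of two, and old entries are never cleared. TinyPacketBuffer is an invented class, not WebRTC's PacketBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class TinyPacketBuffer {
 public:
  using Frame = std::pair<uint16_t, uint16_t>;  // (first seq, last seq)

  explicit TinyPacketBuffer(size_t size) : slots_(size) {
    assert(size > 0 && (size & (size - 1)) == 0);  // power of two
  }

  // Stores a packet and returns the frames that became complete.
  std::vector<Frame> Insert(uint16_t seq, bool frame_begin, bool frame_end) {
    std::vector<Frame> frames;
    if (Holds(seq)) return frames;  // duplicate, drop it
    At(seq) = Slot{true, seq, frame_begin, frame_end, false, false};
    // Scan forward from the new packet while packets stay continuous.
    for (size_t n = 0; n < slots_.size(); ++n) {
      const uint16_t cur = static_cast<uint16_t>(seq + n);
      if (!Holds(cur)) break;
      Slot& slot = At(cur);
      const uint16_t prev = static_cast<uint16_t>(cur - 1);
      if (!slot.frame_begin && !(Holds(prev) && At(prev).continuous)) break;
      slot.continuous = true;
      if (!slot.frame_end || slot.frame_created) continue;
      // Found a frame end: walk backwards to the matching frame start.
      uint16_t first = cur;
      for (size_t steps = 0; steps < slots_.size() && Holds(first); ++steps) {
        if (At(first).frame_begin) {
          slot.frame_created = true;
          frames.emplace_back(first, cur);
          break;
        }
        --first;
      }
    }
    return frames;
  }

 private:
  struct Slot {
    bool used;
    uint16_t seq;
    bool frame_begin;
    bool frame_end;
    bool continuous;
    bool frame_created;
  };
  // The slot index is the sequence number modulo the buffer size.
  Slot& At(uint16_t seq) { return slots_[seq % slots_.size()]; }
  bool Holds(uint16_t seq) { return At(seq).used && At(seq).seq == seq; }
  std::vector<Slot> slots_;
};
```

Inserting 1 (frame start), 2, then 4 (frame end) produces nothing, because 4 is not continuous yet; inserting 3 makes both 3 and 4 continuous and yields the frame spanning 1 to 4.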

`RtpFrameReferenceFinder::ManageFrame`

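The frames recovered from the payloads are complete, but not necessarily decodable. H.264 has forward references (a P frame needs the preceding I frame to decode) and backward references (a B frame needs the preceding I/P frame and the following P frame), so a frame can be passed on only after all of its reference frames have been received.

PacketBuffer handles out-of-order RTP packets and outputs complete frames, but it does not guarantee that the frames come out in order, so RtpFrameReferenceFinder is still needed to handle out-of-order frames.

The details of RtpFrameReferenceFinder are not covered here; interested readers can study the code themselves.

That completes the H.264 RTP depacketization logic. Time for another sigh of relief :)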

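WebRTC data structures for RTP packetization and depacketization

To wrap up, here are the WebRTC data structures involved:

RtpPacket: the RTP packet data structure, defining the standard header fields, header extensions, data buffer and so on;
RtpPacketToSend: used by the sender when packetizing; derives from RtpPacket and adds helpers for setting header extensions;
RtpPacketReceived: used by the receiver when depacketizing; also derives from RtpPacket and adds helpers for reading header extensions.

One last topic: the algorithm for comparing sequence numbers.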

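Sequence number comparison

Sequence numbers can wrap around, so they cannot be compared directly. There is an RFC dedicated to this comparison: Serial Number Arithmetic (RFC 1982).

The RFC first defines what a sequence number is: an n-bit unsigned number whose lowest value is 0 and highest value is 2^n-1. Sequence numbers have no overall minimum or maximum in the ordering sense, and each one needs at least n bits of storage.

It then defines addition: for a valid sequence number s in [0, 2^n-1], adding m yields (s+m) % (2^n), where addition and modulo have their usual meaning.

Finally it defines comparison (for rigor the RFC introduces two additional ordinary integers; for simplicity we leave them out here):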

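Equality: sequence numbers s and s+m (m an ordinary non-negative integer) are equal only when m is 0. Because of wraparound, two stored values alone cannot tell s apart from s plus a multiple of 2^n; in practice, though, we rarely test equality and mostly compare order;
Less than: s1 is less than s2 if and only if (s1 < s2 && s2 - s1 < 2^(n-1)) || (s1 > s2 && s1 - s2 > 2^(n-1)). In other words, s1 is numerically smaller by less than half the range, or numerically larger by more than half the range. For example, with n=3: 2-1 < 4, so 1 is less than 2; 7-2 > 4, so 7 is less than 2;
Greater than: s1 is greater than s2 if and only if (s1 < s2 && s2 - s1 > 2^(n-1)) || (s1 > s2 && s1 - s2 < 2^(n-1)). In other words, s1 is numerically smaller by more than half the range, or numerically larger by less than half the range. For example, with n=3: 7-2 > 4, so 2 is greater than 7; 2-1 < 4, so 2 is greater than 1.

An attentive reader may ask: which of 7 and 3 is greater? Their order cannot be determined, because they differ by exactly half the range (4), much like the equality ambiguity above. The RFC deliberately leaves the ordering of such pairs undefined, since there is no good way to define it.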
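
The two rules translate almost literally into code. The sketch below takes the bit width n as a parameter so that the n=3 examples can be checked directly; SerialLess and SerialGreater are names chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// "Less than" for n-bit serial numbers, written directly from the rule above.
// Both values must already be in [0, 2^n - 1].
inline bool SerialLess(uint32_t s1, uint32_t s2, unsigned n) {
  assert(n >= 1 && n <= 31);
  const uint32_t half = uint32_t{1} << (n - 1);  // 2^(n-1)
  assert(s1 < (half << 1) && s2 < (half << 1));
  return (s1 < s2 && s2 - s1 < half) || (s1 > s2 && s1 - s2 > half);
}

// "Greater than" is the same rule with the operands swapped.
inline bool SerialGreater(uint32_t s1, uint32_t s2, unsigned n) {
  return SerialLess(s2, s1, n);
}
```

With n=3, SerialLess(1, 2, 3) and SerialLess(7, 2, 3) are both true, while for the pair 7 and 3 neither SerialLess nor SerialGreater holds in either direction, which is the undefined case mentioned above.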

WebRTC's implementation is mainly in rtc_base/numerics/sequence_number_util.h and rtc_base/numerics/mod_ops.h:

#include <limits>
#include <type_traits>

// Forward distance from a to b, relying on unsigned wraparound.
template <typename T, T M>
inline typename std::enable_if<(M == 0), T>::type ForwardDiff(T a, T b) {
  static_assert(std::is_unsigned<T>::value,
                "Type must be an unsigned integer.");
  return b - a;
}

template <typename T, T M>
inline typename std::enable_if<(M == 0), bool>::type AheadOrAt(T a, T b) {
  static_assert(std::is_unsigned<T>::value,
                "Type must be an unsigned integer.");
  const T maxDist = std::numeric_limits<T>::max() / 2 + T(1);
  if (a - b == maxDist)
    return b < a;
  // Template arguments are spelled out so that this excerpt compiles on its own.
  return ForwardDiff<T, M>(b, a) < maxDist;
}

template <typename T, T M = 0>
inline bool AheadOf(T a, T b) {
  static_assert(std::is_unsigned<T>::value,
                "Type must be an unsigned integer.");
  return a != b && AheadOrAt<T, M>(a, b);
}

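First, the sequence number type must be unsigned;
WebRTC then defines the "forward distance": how much must be added to the second number to reach the first, taking unsigned wraparound into account;
It also defines the "maximum distance", which can be understood as the largest possible absolute difference between two numbers, i.e. half of the value range (rounded up);
Finally, a is ahead of b under these conditions: if the forward distance between a and b equals the maximum distance, a is ahead of b when a is numerically greater than b; otherwise, a is ahead of b when the forward distance is less than the maximum distance (and a != b).

In effect, unsigned subtraction with wraparound merges the two OR-ed cases from the RFC into one, and the case the RFC leaves undefined is resolved by comparing the raw values.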
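
For 16-bit RTP sequence numbers the same idea fits in a few lines. The following is a standalone sketch of the technique, not the WebRTC template; IsAheadOf16 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// True if a is ahead of (newer than) b for 16-bit sequence numbers.
inline bool IsAheadOf16(uint16_t a, uint16_t b) {
  const uint16_t kMaxDist = 0x8000;  // half of the 16-bit range
  // Forward distance from b to a; the cast makes it wrap modulo 2^16.
  const uint16_t diff = static_cast<uint16_t>(a - b);
  // The pair left undefined by the RFC is resolved by plain value comparison.
  if (diff == kMaxDist) return b < a;
  return diff != 0 && diff < kMaxDist;
}
```

Sequence number 0 is ahead of 65535 because the forward distance is 1 after wraparound, whereas 65535 is not ahead of 0 because the forward distance is 65535, which exceeds half the range.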

Reposted from: https://blog.csdn.net/chinabinlang/article/details/79151589