1. Overview
1.1. The SIP Protocol and Its Development
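SIP (Session Initiation Protocol) is an application-layer signaling control protocol used to create, modify, and release sessions with one or more participants. These sessions can be Internet multimedia conferences, IP telephone calls, or multimedia distribution. Participants in a session can communicate via multicast, a mesh of unicast relations, or a combination of the two.
SIP was proposed by the IETF (Internet Engineering Task Force) in 1999 as a signaling protocol for real-time communication applications on IP-based networks, particularly networks structured like the Internet. A session here means an exchange of data between users. In SIP-based applications a session can carry many kinds of data: plain text, digitized audio or video, or data for applications such as games, which makes the protocol highly flexible.
As an IETF standard, SIP borrows heavily from other widely deployed Internet protocols such as HTTP (Hypertext Transfer Protocol) and SMTP (Simple Mail Transfer Protocol). Like those protocols, SIP uses a text-based encoding, which is one of its most distinctive features compared with other existing standards in the video communication field.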
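SIP has evolved along with the Internet. So far it has passed through the following stages:
- 1996: the concept of SIP first appeared. At that time SIP mainly targeted text applications on the Internet, such as e-mail and text chat.
- March 1999: the IETF Multiparty Multimedia Session Control (MMUSIC) working group published RFC 2543 for discussion by vendors and institutions.
- September 1999: the SIP working group split off from MMUSIC as an independent working group, and in July 2000 it published a SIP draft.
- 2000: many vendors released SIP-based products.
- April 2001: 3GPP announced that SIP would be the core protocol of the multimedia domain in its Release 5.
- June 2002: the IETF SIP working group published RFC 3261 to replace RFC 2543.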
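Because of the limitations of the network environment and of multimedia technology at the time, SIP as first proposed targeted only text applications. As technology advanced, and through cooperation with sibling IETF working groups such as the IP Telephony (iptel) working group and the Telephony Routing over IP (TRIP) working group, support for multimedia communication in SIP was greatly strengthened.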
1.2. Protocol Suite
IETF SIP Protocol Suite
1.3. SIP Implementation Structure
1.4. RFC
RFC | Description |
RFC 1889 | RTP: Real-time Transport Protocol |
RFC 1890 | RTP profile for audio and video conferences with minimal control |
RFC 2327 | SDP: Session Description Protocol |
RFC 2617 | HTTP Authentication: Basic and Digest Access Authentication |
RFC 2806 | URLs for Telephone Calls |
RFC 2833 | RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals. Supports sending and receiving in-band DTMF tones over RTP, and out-of-band DTMF signaling based on RFC 2833, which carries DTMF data in the RTP stream |
RFC 2848 | PINT - IP Access to Telephone Call Services |
RFC 2876 | MIME, used to support ISUP and QSIG |
RFC 2915 | The NAPTR DNS Resource Record |
RFC 2976 | SIP INFO Method |
RFC 3050 | |
RFC 3204 | MIME media types for ISUP and QSIG Objects |
RFC 3261 | SIP:Session Initiation Protocol |
RFC 3262 | PRACK: Reliability of Provisional Responses in SIP |
RFC 3263 | Locating SIP Servers (rules for locating SIP proxy servers) |
RFC 3264 | An Offer/Answer Model with the Session Description Protocol |
RFC 3265 | SIP-Specific Event Notification |
RFC 3266 | SDP IPv6 |
RFC 3267 | RTP AMR payload format and file storage |
RFC 3310 | HTTP digest authentication |
RFC 3311 | SIP UPDATE Method |
RFC 3312 | Integration of Resource Management and SIP |
RFC 3313 | Private SIP extensions for media authorization |
RFC 3320 | SIGComp Support |
RFC 3321 | SIGComp extension headers |
RFC 3323 | A Privacy Mechanism for SIP |
RFC 3325 | Private extensions for NAI(Network Asserted Identity)within Trusted Networks |
RFC 3326 | The Reason Header Field for SIP |
RFC 3327 | SIP Extension Header Field for registering non adjacent contacts |
RFC 3329 | Security Mechanism Agreement for SIP |
RFC 3372 | SIP for Telephones (SIP-T): Context and Architectures |
RFC 3388 | Grouping of Media Lines in SDP |
RFC 3420 | SIP Frag |
RFC 3428 | SIP Extension for Instant Messaging (MESSAGE method). |
RFC 3455 | P-Header extension to SIP for 3GPP |
RFC 3485 | SigComp SIP & SDP static dictionary |
RFC 3486 | Compressing SIP |
RFC 3489 | Simple Traversal of User Datagram Protocol (UDP) through network address translators (NATs) |
RFC 3515 | REFER Method |
RFC 3524 | Mapping of Media Streams to Resource Reservation Flows |
RFC 3581 | An Extension to SIP for Symmetric Response Routing |
RFC 3608 | SIP Extension Header Field for Service Route Discovery during Registration |
RFC 3665 | SIP: Basic Call Flow Examples |
RFC 3666 | SIP: PSTN Call Flows |
RFC 3680 | SIP Event package for Registration |
ITU-T T.38 Annex D and RFC 3362 | for support of T.38 in SIP. |
draft-ietf-sip-session-timer-13 (January 2004) | SIP Extension for Session Timer |
draft-ietf-sip-replaces-05 (February 2004) | The SIP "Replaces" Header |
draft-ietf-sipping-cc-transfer-05 | Call Transfer |
2. SDP
2.1. Overview
Information that SIP and SAP convey using SDP:
5W | Description | Corresponding descriptors |
Who | Convener of the session + contact information | o= c=* |
What | Subject, informal | s= i=* |
When | Date and time | z=* t= r=* |
Where | Multicast addresses, port number | m= |
Which | Capability requirements, bandwidth | m= a=* b=* |
2.2. Types
2.2.1. Type Table
Summary table
Detailed table
Optional items are marked with '*'; '[]' denotes a space.
Type | Format | Description |
v= (protocol version) | v=0 | v=0 |
o= (owner/creator and session identifier) | o=<username>[]<session id>[]<version>[]<network type>[]<address type>[]<address> | <username> is the user's login on the originating host, or "-" if the originating host does not support the concept of user IDs. <session id> allocation is up to the creating tool, but it has been suggested that a Network Time Protocol (NTP) timestamp be used to ensure uniqueness. For <version>, an NTP timestamp is recommended (but not mandatory). <network type> is a text string giving the type of network; initially "IN" is defined to mean "Internet". Example: o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4 |
s= (session name) | s=<session name> | Session subject, mainly used for multicast rather than unicast. For unicast sessions a single space (0x20) or a dash (-) is recommended. Example: s=SDP Seminar |
i=* (session information) | i=<session description> | Session information. Example: i=Discussion of Mbone Engineering Issues |
u=* (URI of description) | u=<URI> | Example: u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps |
e=* (email address) | e=<email address> | Examples: e=mjh@isi.edu (Mark Handley) or e=Mark Handley <mjh@isi.edu> |
p=* (phone number) | p=<phone number> | Format: preceded by a "+" and the international country code. There must be a space or a hyphen ("-") between the country code and the rest of the phone number. Spaces and hyphens may be used to split up a phone field to aid readability if desired. Examples: p=+44-171-380-7777 or p=+1 617 253 6011 |
c=* (connection information) | c=<network type>[]<address type>[]<connection address> or c=<network type>[]<address type>[]<base multicast address>/<ttl>/<number of addresses> | Examples: c=IN IP4 224.2.1.1 and c=IN IP4 224.2.1.1/127. Example: c=IN IP4 224.2.1.1/127/3 states that addresses 224.2.1.1, 224.2.1.2 and 224.2.1.3 are to be used at a TTL of 127. This is semantically identical to including multiple "c=" lines in a media description: c=IN IP4 224.2.1.1/127 c=IN IP4 224.2.1.2/127 c=IN IP4 224.2.1.3/127 |
b=* (bandwidth information) | b=<modifier>:<bandwidth-value> | The receive bandwidth the offerer expects to use; a value of 0 means no media should be sent. <bandwidth-value> is in kilobits per second. <modifier>: CT (Conference Total) or AS (Application-Specific Maximum). Note that CT gives a total bandwidth figure for all the media at all sites. AS gives a bandwidth figure for a single media at a single site, although there may be many sites sending simultaneously. Example: b=CT:128 |
z=* (time zone adjustments) | z=<adjustment time>[]<offset>[]<adjustment time>[]<offset> .... | Time zone adjustment. Example: z=2882844526 -1h 2898848070 0 specifies that at time 2882844526 the time base by which the session's repeat times are calculated is shifted back by 1 hour, and that at time 2898848070 the session's original time base is restored. Adjustments are always relative to the specified start time; they are not cumulative. |
k=* (encryption key) | k=<method> or k=<method>:<encryption key> | Methods: k=clear:<encryption key> k=base64:<encoded encryption key> k=uri:<URI to obtain key> k=prompt Examples: k=uri:http://www.lut.fi/~svaittin/multimedia/key k=base64:A4wdo2GTiv2T8pRGMqQgG3+3UZD1UodEC4weTCZrRs0 k=h!)8gAe>=?#fQzo4jeI.:](:-)97kV |
t= (time the session is active) | t=<start time>[]<stop time> | Session active time. The media streams of a unicast session are usually set up and torn down by external signaling such as SIP; in that case "t=" should be "0 0". |
r=* (zero or more repeat times) | r=<repeat interval>[]<active duration>[]<list of offsets from start time> | Example: if a session is active at 10am on Monday and 11am on Tuesday for one hour each week for three months, then the <start time> in the corresponding "t=" field would be the NTP representation of 10am on the first Monday, the <repeat interval> would be 1 week, the <active duration> would be 1 hour, and the <offsets> would be zero and 25 hours. The corresponding "t=" field stop time would be the NTP representation of the end of the last session three months later. By default all fields are in seconds, so the "r=" and "t=" fields might be: t=3034423619 3042462419 (3 months) r=604800 3600 0 90000 (one week, one hour, 0, 25 hours) |
m= (media name and transport address) | m=<media>[]<port>[]<transport>[]<fmt list> or m=<media>[]<port>/<number of ports>[]<transport>[]<fmt list> | media: audio, video, application, data, control. Transport: RTP/AVP, the IETF's Realtime Transport Protocol using the Audio/Video profile carried over UDP. Example: m=video 49170/2 RTP/AVP 31 specifies that ports 49170 and 49171 form one RTP/RTCP pair and 49172 and 49173 form the second RTP/RTCP pair. RTP/AVP is the transport protocol and 31 is the format. Example: m=application 32416 udp wb. The LBL whiteboard application might be registered as MIME content-type application/wb with encoding considerations specifying that it operates over UDP, with no appropriate file format. |
a=* (zero or more session attribute lines); media attributes (for other attributes, see the next section) | a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>] | <encoding parameters> may specify the number of audio channels. This parameter may be omitted if the number of channels is one, provided no additional parameters are needed. For video streams, no encoding parameters are currently specified. Example: m=audio 49230 RTP/AVP 98 a=rtpmap:98 L16/11025/2 |
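The shorthand forms in the "c=" and "m=" rows of the table can be expanded mechanically. The following is a minimal Python sketch, not part of any SDP library: the function names `expand_connection` and `expand_media_ports` are hypothetical, only IPv4 is handled, and the even-RTP/odd-RTCP pairing is assumed as described in the "m=" row.

```python
import ipaddress


def expand_connection(c_value):
    """Expand the value of a "c=" line into a list of (address, ttl) pairs.

    c_value is the text after "c=", for example "IN IP4 224.2.1.1/127/3".
    The TTL is None when the line does not carry one.
    """
    _net_type, _addr_type, conn = c_value.split()
    parts = conn.split("/")
    base = ipaddress.ip_address(parts[0])
    ttl = int(parts[1]) if len(parts) > 1 else None
    count = int(parts[2]) if len(parts) > 2 else 1
    # Consecutive addresses start at the base multicast address.
    return [(str(base + i), ttl) for i in range(count)]


def expand_media_ports(m_value):
    """Expand the value of an "m=" line into (rtp_port, rtcp_port) pairs.

    m_value is the text after "m=", for example "video 49170/2 RTP/AVP 31".
    Assumes an RTP transport where RTCP uses the port after the RTP port.
    """
    _media, port_field, _transport, *_formats = m_value.split()
    first, _, number = port_field.partition("/")
    count = int(number) if number else 1
    return [(int(first) + 2 * i, int(first) + 2 * i + 1) for i in range(count)]


print(expand_connection("IN IP4 224.2.1.1/127/3"))
print(expand_media_ports("video 49170/2 RTP/AVP 31"))
```

For the two table examples this yields the three addresses 224.2.1.1 to 224.2.1.3 at TTL 127, and the port pairs (49170, 49171) and (49172, 49173).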
2.2.2. Attributes
a=keywds:<keywords>
Like the cat attribute, this is to assist identifying wanted
sessions at the receiver. This allows a receiver to select
interesting session based on keywords describing the purpose of
the session. It is a session-level attribute. It is a charset
dependent attribute, meaning that its value should be interpreted
in the charset specified for the session description if one is
specified, or by default in ISO 10646/UTF-8.
a=cat:<category>
This attribute gives the dot-separated hierarchical category of
the session. This is to enable a receiver to filter unwanted
sessions by category. It would probably have been a compulsory
separate field, except for its experimental nature at this time.
It is a session-level attribute, and is not dependent on charset.
The following are examples of the
SDP notation for category attributes
a=cat:News.CNN.LiveUpdate
a=cat:Sports.Soccer
a=tool:<name and version of tool>
This gives the name and version number of the tool used to create
the session description. It is a session-level attribute, and is
not dependent on charset.
a=ptime:<packet time>
The packet interval that the offerer wishes to receive; it must be greater than 0.
This gives the length of time in milliseconds represented by the
media in a packet. This is probably only meaningful for audio data.
It should not be necessary to know ptime to decode RTP or vat audio,
and it is intended as a recommendation for the encoding/packetisation of audio.
It is a media attribute, and is not dependent on charset.
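As a rough numeric illustration of what a ptime value implies, the sketch below converts a packet time into samples and payload bytes per packet. It assumes a fixed-rate codec with a whole number of bytes per sample; the one-byte-per-sample default corresponds to PCMU (G.711) at 8000 Hz. The function names are hypothetical.

```python
def samples_per_packet(ptime_ms, clock_rate_hz):
    """Number of audio samples covered by one packet of ptime_ms milliseconds."""
    return clock_rate_hz * ptime_ms // 1000


def payload_bytes(ptime_ms, clock_rate_hz, bytes_per_sample=1, channels=1):
    """Payload size of one packet for a fixed-rate codec (assumption: no
    header overhead is counted and the sample size is constant)."""
    return samples_per_packet(ptime_ms, clock_rate_hz) * bytes_per_sample * channels


def packets_per_second(ptime_ms):
    """Packet rate implied by a given ptime."""
    return 1000 / ptime_ms


# a=ptime:40 with PCMU/8000: 320 samples, 320 payload bytes, 25 packets/s
print(samples_per_packet(40, 8000), payload_bytes(40, 8000), packets_per_second(40))
```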
t=3154305600 3154489200
r=1d 3h 0
a=tool:sdr v2.4a7
a=type:meeting
m=audio 17760 RTP/AVP 0
c=IN IP4 239.255.43.180/15
a=ptime:40
m=video 62978 RTP/AVP 31
c=IN IP4 239.255.152.148/15
m=whiteboard 46510 udp wb
c=IN IP4 239.255.246.30/15
m=text 51400 udp nt
c=IN IP4 239.255.58.76/15
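The "t=" and "r=" lines of the description above can be turned into concrete session times. The sketch below assumes the compact time units defined in RFC 2327 (d, h, m, s) and the standard offset of 2208988800 seconds between the NTP epoch (1900) and the Unix epoch (1970); the function names are hypothetical and "z=" adjustments are ignored.

```python
from datetime import datetime, timezone

NTP_UNIX_OFFSET = 2208988800  # seconds from 1900-01-01 to 1970-01-01
UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def sdp_seconds(field):
    """Convert an SDP time field such as "1d", "3h" or "604800" to seconds."""
    if field[-1] in UNITS:
        return int(field[:-1]) * UNITS[field[-1]]
    return int(field)


def ntp_to_utc(ntp_seconds):
    """Convert an NTP timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_UNIX_OFFSET, tz=timezone.utc)


def occurrences(t_value, r_value):
    """Return (start, end) UTC datetimes for every repeat of a bounded session.

    t_value is the text after "t=", r_value the text after "r=".
    """
    start, stop = (int(x) for x in t_value.split())
    interval, duration, *offsets = (sdp_seconds(x) for x in r_value.split())
    if interval <= 0:
        raise ValueError("repeat interval must be positive")
    result = []
    base = start
    while base < stop:
        for offset in offsets:
            begin = base + offset
            if begin < stop:
                result.append((ntp_to_utc(begin), ntp_to_utc(begin + duration)))
        base += interval
    return result


for begin, end in occurrences("3154305600 3154489200", "1d 3h 0"):
    print(begin.isoformat(), "to", end.isoformat())
```

With the values from the description above, the session runs for three hours on each of three consecutive days.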
a=recvonly
This specifies that the tools should be started in receive-only
mode where applicable. It can be either a session or media
attribute, and is not dependent on charset.
a=sendrecv
This specifies that the tools should be started in send and
receive mode. This is necessary for interactive conferences with
tools such as wb which defaults to receive only mode. It can be
either a session or media attribute, and is not dependent on
charset.
a=sendonly
This specifies that the tools should be started in send-only
mode. An example may be where a different unicast address is to
be used for a traffic destination than for a traffic source. In
such a case, two media descriptions may be used, one sendonly and
one recvonly. It can be either a session or media attribute, but
would normally only be used as a media attribute, and is not
dependent on charset.
a=orient:<whiteboard orientation>
Normally this is only used in a whiteboard media specification.
It specifies the orientation of the whiteboard on the screen.
It is a media attribute. Permitted values are ‘portrait’,
‘landscape’ and ‘seascape’ (upside down landscape). It is not
dependent on charset.
a=charset:<character set>
This specifies the character set to be used to display the
session name and information data. By default, the ISO-10646
character set in UTF-8 encoding is used. If a more compact
representation is required, other character sets may be used such
as ISO-8859-1 for Northern European languages. In particular,
the ISO 8859-1 is specified with the following SDP attribute:
a=charset:ISO-8859-1
This is a session-level attribute; if this attribute is present,
it must be before the first media field. The charset specified
MUST be one of those registered with IANA, such as ISO-8859-1.
The character set identifier is a US-ASCII string and MUST be
compared against the IANA identifiers using a case-insensitive
comparison. If the identifier is not recognised or not
supported, all strings that are affected by it SHOULD be regarded
as byte strings.
Note that a character set specified MUST still prohibit the use
of bytes 0x00 (Nul), 0x0A (LF) and 0x0d (CR). Character sets
requiring the use of these characters MUST define a quoting
mechanism that prevents these bytes appearing within text fields.
a=type:<conference type>
This specifies the type of the conference. Suggested values are
‘broadcast’, ‘meeting’, ‘moderated’, ‘test’ and ‘H332’. ‘recvonly’ should be the default for ‘type:broadcast’ sessions, ‘type:meeting’ should imply ‘sendrecv’ and ‘type:moderated’
should indicate the use of a floor control tool and that the media tools are started so as to "mute" new sites joining the conference. Specifying the attribute type:H332 indicates that this loosely coupled session is part of a H.332 session as defined in the ITU
H.332 specification [10]. Media tools should be started ‘recvonly’. Specifying the attribute type:test is suggested as a hint that, unless explicitly requested otherwise, receivers can safely avoid displaying this session description to users. The type attribute is a session-level attribute, and is not dependent on charset.
a=quality:<quality>
This gives a suggestion for the quality of the encoding as an
integer value.
The intention of the quality attribute for video is to specify a
non-default trade-off between frame-rate and still-image quality.
For video, the value in the range 0 to 10, with the following
suggested meaning:
10 - the best still-image quality the compression scheme can
give.
5 - the default behaviour given no quality suggestion.
0 - the worst still-image quality the codec designer thinks is
still usable.
It is a media attribute, and is not dependent on charset.
2.2.3. Examples
Example 1: unicast
v=0
o=UserA 2890844526 2890844526 IN IP4 client.here.com
s=
c=IN IP4 100.101.102.103
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
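A description like Example 1 can be split into its session-level part and its media sections with very little code. The following is a minimal sketch rather than a complete SDP parser: it performs no field validation, and the names `parse_sdp` and `rtpmap` are hypothetical.

```python
def parse_sdp(text):
    """Split an SDP description into session-level and media-level lines.

    Returns (session, media), where session is a list of (type, value)
    pairs and media is a list of such lists, one per "m=" section.
    """
    session = []
    media = []
    current = session
    for line in text.splitlines():
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep != "=" or len(key) != 1:
            raise ValueError(f"not an SDP line: {line!r}")
        if key == "m":
            # Every "m=" line opens a new media section.
            current = []
            media.append(current)
        current.append((key, value))
    return session, media


def rtpmap(media_lines):
    """Collect a=rtpmap entries of one media section, keyed by payload type.

    Each value is (encoding name, clock rate, encoding parameters or None).
    """
    table = {}
    for key, value in media_lines:
        if key == "a" and value.startswith("rtpmap:"):
            payload_type, encoding = value[len("rtpmap:"):].split(None, 1)
            name, rate, *params = encoding.split("/")
            table[int(payload_type)] = (name, int(rate), params[0] if params else None)
    return table


EXAMPLE_1 = "\n".join([
    "v=0",
    "o=UserA 2890844526 2890844526 IN IP4 client.here.com",
    "s=",
    "c=IN IP4 100.101.102.103",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
])

session, media = parse_sdp(EXAMPLE_1)
print(session)
print(rtpmap(media[0]))
```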
Example 2: multicast
v=0
o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.cs.ucl.ac.uk/staff/M.Handley/sdp.03.ps
e=mjh@isi.edu (Mark Handley)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 31
m=application 32416 udp wb
a=orient:portrait
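In Example 2 the "c=" line and the a=recvonly attribute appear once at session level and apply to all three media sections. The sketch below resolves the effective connection and direction for each "m=" line. It assumes that a media-level value overrides the session-level one and that sendrecv is the default when no direction attribute is present; the function name is hypothetical.

```python
DIRECTIONS = {"sendrecv", "sendonly", "recvonly", "inactive"}


def effective_settings(sdp_text):
    """Return one dict per media section with its effective c= and direction."""
    session = {"c": None, "direction": "sendrecv"}  # assumed default direction
    media = []
    current = session
    for line in sdp_text.splitlines():
        key, _, value = line.partition("=")
        if key == "m":
            # A new media section starts out with the session-level values.
            current = {"m": value, "c": session["c"],
                       "direction": session["direction"]}
            media.append(current)
        elif key == "c":
            current["c"] = value
        elif key == "a" and value in DIRECTIONS:
            current["direction"] = value
    return media


EXAMPLE_2 = "\n".join([
    "v=0",
    "o=mhandley 2890844526 2890842807 IN IP4 126.16.64.4",
    "s=SDP Seminar",
    "c=IN IP4 224.2.17.12/127",
    "t=2873397496 2873404696",
    "a=recvonly",
    "m=audio 49170 RTP/AVP 0",
    "m=video 51372 RTP/AVP 31",
    "m=application 32416 udp wb",
    "a=orient:portrait",
])

for section in effective_settings(EXAMPLE_2):
    print(section["m"], "|", section["c"], "|", section["direction"])
```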
Example 3
media port:12016 => rtp port 12016
Example 4
2.3. Offer/Answer Exchanges
2.3.1. Basic Exchange
Assume that the caller, Alice, has included the following description
in her offer. It includes a bidirectional audio stream and two
bidirectional video streams, using H.261 (payload type 31) and MPEG
(payload type 32). The offered SDP is:
v=0
o=alice 2890844526 2890844526 IN IP4 host.anywhere.com
s=
c=IN IP4 host.anywhere.com
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 51372 RTP/AVP 31
a=rtpmap:31 H261/90000
m=video 53000 RTP/AVP 32
a=rtpmap:32 MPV/90000
The callee, Bob, does not want to receive or send the first video
stream, so he returns the SDP below as the answer:
v=0
o=bob 2890844730 2890844730 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
m=audio 49920 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 0 RTP/AVP 31
m=video 53000 RTP/AVP 32
a=rtpmap:32 MPV/90000
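The answer above keeps one "m=" line for every "m=" line of the offer, in the same order, and marks the declined video stream with port 0. A minimal sketch of that check follows; the function names are hypothetical and only the "m=" lines are inspected.

```python
def media_lines(sdp_text):
    """Return (media, port, transport, formats) for every "m=" line."""
    result = []
    for line in sdp_text.splitlines():
        if line.startswith("m="):
            media, port, transport, *formats = line[2:].split()
            result.append((media, int(port.split("/")[0]), transport, formats))
    return result


def stream_status(offer, answer):
    """Classify each offered stream as 'accepted' or 'rejected' by the answer.

    Raises ValueError when the answer does not mirror the offer's m= lines.
    """
    offered = media_lines(offer)
    answered = media_lines(answer)
    if len(offered) != len(answered):
        raise ValueError("answer must contain one m= line per offered m= line")
    status = []
    for (o_media, _, o_transport, _), (a_media, a_port, a_transport, _) in zip(offered, answered):
        if (o_media, o_transport) != (a_media, a_transport):
            raise ValueError("media type or transport differs from the offer")
        # Port 0 in the answer means the stream was declined.
        status.append("rejected" if a_port == 0 else "accepted")
    return status


offer = "m=audio 49170 RTP/AVP 0\nm=video 51372 RTP/AVP 31\nm=video 53000 RTP/AVP 32"
answer = "m=audio 49920 RTP/AVP 0\nm=video 0 RTP/AVP 31\nm=video 53000 RTP/AVP 32"
print(stream_status(offer, answer))  # ['accepted', 'rejected', 'accepted']
```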
At some point later, Bob decides to change the port where he will
receive the audio stream (from 49920 to 65422), and at the same time,
add an additional audio stream as receive only, using the RTP payload
format for events [9]. Bob offers the following SDP in the offer:
v=0
o=bob 2890844730 2890844731 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
m=audio 65422 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 0 RTP/AVP 31
m=video 53000 RTP/AVP 32
a=rtpmap:32 MPV/90000
m=audio 51434 RTP/AVP 110
a=rtpmap:110 telephone-events/8000
a=recvonly
Alice accepts the additional media stream, and so generates the
following answer:
v=0
o=alice 2890844526 2890844527 IN IP4 host.anywhere.com
s=
c=IN IP4 host.anywhere.com
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 0 RTP/AVP 31
a=rtpmap:31 H261/90000
m=video 53000 RTP/AVP 32
a=rtpmap:32 MPV/90000
m=audio 53122 RTP/AVP 110
a=rtpmap:110 telephone-events/8000
a=sendonly
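Two details of this second exchange can be checked mechanically: the version field of each "o=" line goes up by one when a party modifies its description, and the recvonly stream offered by Bob is answered as sendonly by Alice. The sketch below encodes both rules as understood from the offer/answer model in RFC 3264; the names are hypothetical.

```python
# Directions an answerer may use for each offered direction.
ANSWER_DIRECTIONS = {
    "sendrecv": {"sendrecv", "sendonly", "recvonly", "inactive"},
    "sendonly": {"recvonly", "inactive"},
    "recvonly": {"sendonly", "inactive"},
    "inactive": {"inactive"},
}


def direction_ok(offered, answered):
    """True if the answered direction is compatible with the offered one."""
    return answered in ANSWER_DIRECTIONS[offered]


def is_next_version(previous_o, new_o):
    """True if new_o keeps the session id of previous_o and has version + 1.

    Both arguments are the text after "o=".
    """
    prev = previous_o.split()
    new = new_o.split()
    same_session = prev[1] == new[1]
    return same_session and int(new[2]) == int(prev[2]) + 1


print(direction_ok("recvonly", "sendonly"))
print(is_next_version("bob 2890844730 2890844730 IN IP4 host.example.com",
                      "bob 2890844730 2890844731 IN IP4 host.example.com"))
```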
2.3.2. One of N Codec Selection
A common occurrence in embedded phones is that the Digital Signal
Processor (DSP) used for compression can support multiple codecs at a
time, but once that codec is selected, it cannot be readily changed
on the fly. This example shows how a session can be set up using an
initial offer/answer exchange, followed immediately by a second one
to lock down the set of codecs.
The initial offer from Alice to Bob indicates a single audio stream
with the three audio codecs that are available in the DSP. The
stream is marked as inactive, since media cannot be received until a
codec is locked down:
v=0
o=alice 2890844526 2890844526 IN IP4 host.anywhere.com
s=
c=IN IP4 host.anywhere.com
t=0 0
m=audio 62986 RTP/AVP 0 4 18
a=rtpmap:0 PCMU/8000
a=rtpmap:4 G723/8000
a=rtpmap:18 G729/8000
a=inactive
Bob can support dynamic switching between PCMU and G.723. So, he
sends the following answer:
v=0
o=bob 2890844730 2890844731 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
m=audio 54344 RTP/AVP 0 4
a=rtpmap:0 PCMU/8000
a=rtpmap:4 G723/8000
a=inactive
Alice can then select any one of these two codecs. So, she sends an
updated offer with a sendrecv stream:
v=0
o=alice 2890844526 2890844527 IN IP4 host.anywhere.com
s=
c=IN IP4 host.anywhere.com
t=0 0
m=audio 62986 RTP/AVP 4
a=rtpmap:4 G723/8000
a=sendrecv
Bob accepts the single codec:
v=0
o=bob 2890844730 2890844732 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
m=audio 54344 RTP/AVP 4
a=rtpmap:4 G723/8000
a=sendrecv
If the answerer (Bob), was only capable of supporting one-of-N
codecs, Bob would select one of the codecs from the offer, and place
that in his answer. In this case, Alice would do a re-INVITE to
activate that stream with that codec.
As an alternative to using "a=inactive" in the first exchange, Alice
can list all codecs, and as soon as she receives media from Bob,
generate an updated offer locking down the codec to the one just
received. Of course, if Bob only supports one-of-N codecs, there
would only be one codec in his answer, and in this case, there is no
need for a re-INVITE to lock down to a single codec.
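The codec negotiation in this section amounts to intersecting the offered payload types with the ones the answerer supports. A minimal sketch, assuming payload types are compared as the strings from the "m=" format list and that the answer keeps the offerer's order (a simplification); `answer_formats` and `lock_down` are hypothetical names.

```python
def answer_formats(offered, supported, one_of_n=False):
    """Payload types the answerer puts in its m= line.

    offered:   format list from the offer, in the offerer's order
    supported: set of payload types the answerer can handle
    one_of_n:  True if the answerer cannot switch codecs on the fly
    """
    common = [pt for pt in offered if pt in supported]
    return common[:1] if one_of_n else common


def lock_down(answered, preferred):
    """Pick the single codec for the follow-up offer.

    Uses the first preferred payload type present in the answer.
    """
    for pt in preferred:
        if pt in answered:
            return [pt]
    return []


offered = ["0", "4", "18"]                     # PCMU, G723, G729 from Alice
answered = answer_formats(offered, {"0", "4"})  # Bob supports PCMU and G723
print(answered)                                 # ['0', '4']
print(lock_down(answered, ["4", "0"]))          # ['4']
```

When the answerer is itself limited to one-of-N, calling `answer_formats(offered, supported, one_of_n=True)` returns a single payload type, so no second exchange is needed to narrow the codec set.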