1. AF and PF
As macros, the two sets of constants have identical numeric values; they differ only in name:
AF: Address Family
PF: Protocol Family
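A historical note: 4.x BSD used PF_xxxx for the protocol family passed to socket() and AF_xxxx for the family stored in address structures, while POSIX and later standards use AF_xxxx everywhere.
Note: in theory, creating a socket selects a protocol, so PF_xxxx belongs there, and filling in an address calls for AF_xxxx. But man socket uses AF throughout, so it is simplest to use AF consistently.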
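The claim that the two spellings share values can be checked directly. The following is a minimal sketch; the function names `af_equals_pf` and `print_af_pf` are made up for illustration, and the result reflects whatever headers the code is compiled against.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Returns 1 when each AF_ constant equals its PF_ twin on this system. */
int af_equals_pf(void)
{
    return AF_UNIX == PF_UNIX && AF_INET == PF_INET && AF_INET6 == PF_INET6;
}

/* Prints the numeric values so the equality can be inspected by eye. */
void print_af_pf(void)
{
    printf("AF_INET=%d PF_INET=%d\n", AF_INET, PF_INET);
    printf("AF_UNIX=%d PF_UNIX=%d\n", AF_UNIX, PF_UNIX);
}
```

Calling `print_af_pf()` from `main` shows the actual numbers, which are platform-specific.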
2. socket
int socket(int domain, int type, int protocol);
socket() creates an endpoint for communication and returns a descriptor.
The domain argument specifies a communication domain; this selects the protocol family which will be used for communication. These families are defined in <sys/socket.h>. The currently understood formats include:
Name                Purpose                           Man page
AF_UNIX, AF_LOCAL   Local communication               unix(7)
AF_INET             IPv4 Internet protocols           ip(7)
AF_INET6            IPv6 Internet protocols           ipv6(7)
AF_IPX              IPX - Novell protocols
AF_NETLINK          Kernel user interface device      netlink(7)
AF_X25              ITU-T X.25 / ISO-8208 protocol    x25(7)
AF_AX25             Amateur radio AX.25 protocol
AF_ATMPVC           Access to raw ATM PVCs
AF_APPLETALK        Appletalk                         ddp(7)
AF_PACKET           Low level packet interface        packet(7)
The socket has the indicated type, which specifies the communication semantics. Currently defined types are:
SOCK_STREAM Provides sequenced, reliable, two-way, connection-based byte streams. An out-of-band data transmission mechanism may be supported.
SOCK_DGRAM Supports datagrams (connectionless, unreliable messages of a fixed maximum length).
SOCK_SEQPACKET Provides a sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
SOCK_RAW Provides raw network protocol access.
SOCK_RDM Provides a reliable datagram layer that does not guarantee ordering.
SOCK_PACKET Obsolete and should not be used in new programs; see packet(7).
Some socket types may not be implemented by all protocol families; for example, SOCK_SEQPACKET is not implemented for AF_INET.
Since Linux 2.6.27, the type argument serves a second purpose: in addition to specifying a socket type, it may include the bitwise OR of any of the following values, to modify the behavior of socket():
SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SOCK_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful. In other words, when a process exec()s a new program, its open file descriptors normally stay open, except for those with FD_CLOEXEC set.
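As an illustration of the two type flags, here is a small Linux-specific sketch that creates a socket with both flags and then reads them back with fcntl(2). The helper names `open_nonblocking_cloexec_socket` and `has_nonblock_and_cloexec` are hypothetical.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a TCP socket that is non-blocking and close-on-exec from birth.
 * Returns the descriptor, or -1 with errno set. */
int open_nonblocking_cloexec_socket(void)
{
    return socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
}

/* Returns 1 if both flags are set on fd, 0 if not, -1 on error. */
int has_nonblock_and_cloexec(int fd)
{
    int status_flags = fcntl(fd, F_GETFL);   /* file status flags: O_NONBLOCK */
    int fd_flags = fcntl(fd, F_GETFD);       /* descriptor flags: FD_CLOEXEC */
    if (status_flags == -1 || fd_flags == -1)
        return -1;
    return (status_flags & O_NONBLOCK) && (fd_flags & FD_CLOEXEC);
}
```

Without the flags, the same result would take two extra fcntl() calls after socket(), with a window in between where another thread could fork and exec.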
The protocol specifies a particular protocol to be used with the socket. Normally only a single protocol exists to support a particular socket type within a given protocol family, in which case protocol can be specified as 0. However, it is possible that many protocols may exist, in which case a particular protocol must be specified in this manner. The protocol number to use is specific to the "communication domain" in which communication is to take place; see protocols(5). See getprotoent(3) on how to map protocol name strings to protocol numbers.
Sockets of type SOCK_STREAM are full-duplex byte streams, similar to pipes (note: pipes themselves are not full-duplex). They do not preserve record boundaries. A stream socket must be in a connected state before any data may be sent or received on it. A connection to another socket is created with a connect(2) call. Once connected, data may be transferred using read(2) and write(2) calls or some variant of the send(2) and recv(2) calls. When a session has been completed a close(2) may be performed. Out-of-band data may also be transmitted as described in send(2) and received as described in recv(2).
Q: What is the difference between read/write and send/recv?
A: http://blog.csdn.net/deng529828/article/details/6245254
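The statement that stream sockets do not preserve record boundaries can be seen with a connected pair from socketpair(2). This is a sketch under the assumption that an AF_UNIX stream pair is an acceptable stand-in for a TCP connection; the function name `stream_has_no_boundaries` is made up.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sends two separate messages over a connected stream socket pair, then
 * reads them back as one undifferentiated byte stream.
 * Stores the bytes in out (NUL-terminated) and returns how many were read,
 * or -1 on error. out_len must be at least 5. */
ssize_t stream_has_no_boundaries(char *out, size_t out_len)
{
    int sv[2];
    if (out_len < 5 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Two write() calls: the "records" are "ab" and "cd". */
    if (write(sv[0], "ab", 2) != 2 || write(sv[0], "cd", 2) != 2) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Loop until all 4 bytes arrive; a single read() may return fewer. */
    size_t got = 0;
    while (got < 4) {
        ssize_t n = read(sv[1], out + got, 4 - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }
    out[got] = '\0';
    close(sv[0]);
    close(sv[1]);
    return (ssize_t)got;
}
```

The receiver has no way to tell where "ab" ended and "cd" began, so any framing has to be added by the application.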
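3. raw socket
With a raw socket, new IPv4-based protocols can be implemented in user space. Packets sent and received on a raw socket do not include the link-layer header, i.e. the hardware addresses.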
raw_socket = socket(AF_INET, SOCK_RAW, int protocol);
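Q: When creating a raw socket, why should domain be AF_INET rather than AF_PACKET?
A: As man packet explains, AF_PACKET sits lower, "at device driver level". AF_PACKET is therefore meant for work that involves the link layer, such as network sniffing, while a raw socket is meant for work at the network layer and above, such as rewriting the source IP address. With AF_PACKET, the type argument also controls whether the link-layer header is included (SOCK_RAW keeps it, SOCK_DGRAM strips it); see man packet for details.
Another important difference between packet sockets and raw sockets is that the former do not reassemble IP fragments while the latter do: "Note that packet sockets don't reassemble IP fragments, unlike raw sockets." (man raw)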
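The most important raw socket option is IP_HDRINCL (IP header include). It determines whether the IP header is generated automatically when the raw socket sends a packet. It is disabled by default. When disabled, the kernel builds the IP header for you, which is convenient but leaves no way to forge header fields such as the source IP; when enabled, the application must build the IP header itself. The option only affects sending: when receiving on a raw socket, the IP header is always included in the packet.
If protocol is IPPROTO_RAW (255), IP_HDRINCL is enabled implicitly, and the raw socket can only send packets, not receive them.
man raw contains the sentence: "In Linux 2.2, all IP header fields and options can be set using IP socket options". But going through every option in man ip shows that fields such as the source and destination IP cannot be set that way. man ip also says: "When this flag (IP_HDRINCL) is enabled the values set by IP_OPTIONS, IP_TTL and IP_TOS are ignored". So once IP_HDRINCL is set, nearly all IP header fields have to be filled in by hand.
"Nearly all" because even with IP_HDRINCL enabled, the raw socket still fills in some fields for you: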
+---------------------------------------------------+
|IP Header fields modified on sending by IP_HDRINCL |
+----------------------+----------------------------+
|IP Checksum |Always filled in. |
+----------------------+----------------------------+
|Source Address |Filled in when zero. |
+----------------------+----------------------------+
|Packet Id |Filled in when zero. |
+----------------------+----------------------------+
|Total Length |Always filled in. |
+----------------------+----------------------------+
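There is no need to compute the checksum or the total length yourself, which is considerate of raw sockets. But it also means the checksum and total length cannot be forged, which is a drawback when testing code that parses the protocol.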
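To make the division of labour concrete, here is a hedged sketch of filling in an IPv4 header for use with IP_HDRINCL. It uses struct iphdr from glibc's <netinet/ip.h> (the counterpart of the <linux/ip.h> definition), and the helper names `build_ip_header` and `open_hdrincl_socket` are hypothetical. Opening the raw socket needs CAP_NET_RAW, so only the header-building part works for an unprivileged user.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fills in a minimal 20-byte IPv4 header. src and dst are dotted-quad
 * strings. Returns 0 on success, -1 if an address does not parse. */
int build_ip_header(struct iphdr *ip, const char *src, const char *dst,
                    uint8_t protocol)
{
    memset(ip, 0, sizeof *ip);
    ip->version = 4;
    ip->ihl = 5;               /* header length in 32-bit words, no options */
    ip->ttl = 64;
    ip->protocol = protocol;
    /* id, tot_len and check stay 0: per the table, the kernel fills them in. */
    if (inet_pton(AF_INET, src, &ip->saddr) != 1 ||
        inet_pton(AF_INET, dst, &ip->daddr) != 1)
        return -1;
    return 0;
}

/* Opens a raw socket with IP_HDRINCL enabled. Returns the descriptor,
 * or -1 with errno set (EPERM when the caller lacks CAP_NET_RAW). */
int open_hdrincl_socket(int protocol)
{
    int fd = socket(AF_INET, SOCK_RAW, protocol);
    if (fd == -1)
        return -1;
    int on = 1;
    if (setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &on, sizeof on) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A sender would place the transport payload right after this header in one buffer and pass the whole buffer to sendto(2).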
Every packet or error that matches protocol is delivered to the raw socket first, and only then to the other protocol handlers:
"When a packet is received, it is passed to any raw sockets which have been bound to its protocol before it is passed to other protocol handlers (e.g., kernel protocol modules)."
"Raw sockets may tap all IP protocols in Linux, even protocols like ICMP or TCP which have a protocol module in the kernel. In this case, the packets are passed to both the kernel module and the raw socket(s). This should not be relied upon in portable programs, many other BSD socket implementation have limitations here."
The IP and TCP header structures are already defined in <linux/ip.h> and <linux/tcp.h>, so there is no need to define them yourself when writing raw socket code.
#include <linux/ip.h>
#include <linux/tcp.h>
4. socket address structures
Each socket domain has its own format for socket addresses, with a domain-specific address structure. Each of these structures begins with an integer "family" field (typed as sa_family_t) that indicates the type of the address structure. This allows the various system calls (e.g., connect(2), bind(2), accept(2), getsockname(2), getpeername(2)), which are generic to all socket domains, to determine the domain of a particular socket address.
To allow any type of socket address to be passed to interfaces in the sockets API, the type struct sockaddr is defined. The purpose of this type is purely to allow casting of domain-specific socket address types to a "generic" type, so as to avoid compiler warnings about type mismatches in calls to the sockets API.
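struct sockaddr {
    sa_family_t sa_family;
    char        sa_data[14];
};
In practice, for IPv4 the equivalent structure below is commonly used instead:
struct sockaddr_in {
    sa_family_t    sin_family;   /* AF_INET */
    in_port_t      sin_port;     /* port, network byte order */
    struct in_addr sin_addr;     /* IPv4 address, network byte order */
    unsigned char  sin_zero[8];  /* padding up to sizeof(struct sockaddr) */
};
Then, when calling connect or bind, cast its address to struct sockaddr *.
Put plainly, struct sockaddr is awkward to use: a field like sa_data[14] is inconvenient to fill in, which is why struct sockaddr_in exists.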
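Having mentioned struct sockaddr_in, struct sockaddr_un deserves a mention too:
struct sockaddr_un {
    sa_family_t sun_family;
    char        sun_path[108];
};
This structure is used for UNIX domain sockets. sun_family can only be AF_UNIX (AF_LOCAL is a synonym). On Linux sun_path is not required to carry a terminating null byte as long as the length passed alongside it covers exactly the bytes in use, although null-terminating pathnames is the safer habit.
The length of a struct sockaddr_un is usually computed like this:
size = offsetof(struct sockaddr_un, sun_path) + strlen(un.sun_path);
The offsetof macro is defined in stddef.h. A classic definition is:
#define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER)
It works by treating address 0 as a pointer to TYPE; the address of MEMBER is then its byte offset within TYPE. Modern compilers typically implement it with a builtin such as __builtin_offsetof instead.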
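The length computation can be wrapped in a small helper. This is a sketch only; `make_unix_addr` is a made-up name, and it chooses to keep the path null-terminated even though the computed length does not count the terminator.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Fills addr for the given pathname and returns the length to pass to
 * bind()/connect(), or 0 if the path does not fit in sun_path. */
socklen_t make_unix_addr(struct sockaddr_un *addr, const char *path)
{
    size_t len = strlen(path);
    if (len >= sizeof addr->sun_path)
        return 0;
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path, path, len);  /* struct was zeroed: still NUL-terminated */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + len);
}
```

The returned value is what goes into the addrlen argument, for example `bind(fd, (struct sockaddr *)&un, len)`.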
5. bind
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
When a socket is created with socket(2), it exists in a name space (address family) but has no address assigned to it. bind() assigns the address specified by addr to the socket referred to by the file descriptor sockfd. addrlen specifies the size, in bytes, of the address structure pointed to by addr. Traditionally, this operation is called "assigning a name to a socket".
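When binding, unless there is a specific need, INADDR_ANY can be used to bind to all local addresses. Because INADDR_ANY is 0.0.0.0, zeroing the struct sockaddr_in and leaving the address field untouched has the same effect. Both sin_port and sin_addr must be in network byte order.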
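A minimal sketch of such a bind follows. The helpers `bind_any` and `bound_port` are hypothetical names; passing port 0 is an added convenience that lets the kernel pick a free port.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Binds a new TCP socket to INADDR_ANY on the given port (host byte order;
 * 0 lets the kernel choose). Returns the descriptor or -1 with errno set. */
int bind_any(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);             /* also zeroes sin_addr: 0.0.0.0 */
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);               /* network byte order */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* explicit, though already 0 */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns the locally bound port in host byte order, or -1 on error. */
int bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return -1;
    return ntohs(addr.sin_port);
}
```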
6. connect
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
The connect() system call connects the socket referred to by the file descriptor sockfd to the address specified by addr. The addrlen argument specifies the size of addr. The format of the address in addr is determined by the address space of the socket sockfd; see socket(2) for further details.
If the socket sockfd is of type SOCK_DGRAM then addr is the address to which datagrams are sent by default, and the only address from which datagrams are received. If the socket is of type SOCK_STREAM or SOCK_SEQPACKET, this call attempts to make a connection to the socket that is bound to the address specified by addr.
Generally, connection-based protocol sockets may successfully connect() only once; connectionless protocol sockets may use connect() multiple times to change their association. Connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC (supported on Linux since kernel 2.2).
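The connectionless case can be sketched as follows: connect() on a UDP socket only records a default peer, and connecting to an AF_UNSPEC address removes it again. The helper names are hypothetical, and the behaviour shown is the Linux one described above.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd currently has a default peer, 0 if not, -1 on other errors. */
int has_peer(int fd)
{
    struct sockaddr_storage peer;
    socklen_t len = sizeof peer;
    if (getpeername(fd, (struct sockaddr *)&peer, &len) == 0)
        return 1;
    return errno == ENOTCONN ? 0 : -1;
}

/* Sets the default peer of a UDP socket to 127.0.0.1:port. No packet is sent. */
int udp_set_peer(int fd, uint16_t port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    return connect(fd, (struct sockaddr *)&addr, sizeof addr);
}

/* Dissolves the association by connecting to an AF_UNSPEC address. */
int udp_clear_peer(int fd)
{
    struct sockaddr addr;
    memset(&addr, 0, sizeof addr);
    addr.sa_family = AF_UNSPEC;
    return connect(fd, &addr, sizeof addr);
}
```

While a peer is set, send(2) can be used without an address, and datagrams from other sources are not delivered to the socket.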
7. listen
int listen(int sockfd, int backlog);
listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).
The sockfd argument is a file descriptor that refers to a socket of type SOCK_STREAM or SOCK_SEQPACKET.
The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
The behavior of the backlog argument on TCP sockets changed with Linux 2.2.
Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is no logical maximum length and this setting is ignored.
In short: backlog counts connections whose three-way handshake has completed but which have not yet been accept()ed, while the SYN queue is sized separately.
How to enable syncookies: http://lijichao.blog.51cto.com/67487/308509
See tcp(7) for more information.
If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128 (4096 since Linux 5.4). In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
In practice, setting backlog to 128 is enough.
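The following sketch reads the somaxconn cap and puts a socket into the listening state. Both function names are made up, and binding to the loopback address on a kernel-chosen port is an assumption made to keep the example harmless.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Reads the system-wide cap on backlog; returns -1 if the file is unavailable. */
long read_somaxconn(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Creates a TCP socket bound to 127.0.0.1 on a kernel-chosen port and
 * marks it passive. Returns the descriptor or -1 with errno set. */
int listen_on_loopback(int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* sin_port 0: kernel picks */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, backlog) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A backlog larger than the value returned by `read_somaxconn()` is accepted by listen() but silently capped.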
8. accept
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket. The newly created socket is not in the listening state. The original socket sockfd is unaffected by this call.
The argument sockfd is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connections after a listen(2).
The argument addr is a pointer to a sockaddr structure. This structure is filled in with the address of the peer socket, as known to the communications layer. The exact format of the address returned in addr is determined by the socket's address family (see socket(2) and the respective protocol man pages). When addr is NULL, nothing is filled in; in this case, addrlen is not used, and should also be NULL.
The addrlen argument is a value-result argument: the caller must initialize it to contain the size (in bytes) of the structure pointed to by addr; on return it will contain the actual size of the peer address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK.
In order to be notified of incoming connections on a socket, you can use select(2) or poll(2). A readable event will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details.
On Linux, the new socket returned by accept() does not inherit file status flags such as O_NONBLOCK and O_ASYNC from the listening socket. This behavior differs from the canonical BSD sockets implementation. Portable programs should not rely on inheritance or noninheritance of file status flags and always explicitly set all required flags on the socket returned from accept().
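The value-result addrlen argument and the advice about file status flags can be combined in one helper. This is a sketch assuming an IPv4 listening socket; `accept_one` is a hypothetical name.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Accepts one connection from listen_fd, stores the peer's port (host byte
 * order) in *peer_port, and explicitly makes the new socket non-blocking.
 * Returns the connected descriptor or -1 on error. */
int accept_one(int listen_fd, uint16_t *peer_port)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof peer;             /* in: size of our buffer */

    int fd = accept(listen_fd, (struct sockaddr *)&peer, &len);
    if (fd == -1)
        return -1;

    /* out: len is the real address size; larger than the buffer means truncated. */
    if (len > sizeof peer) {
        close(fd);
        return -1;
    }
    *peer_port = ntohs(peer.sin_port);

    /* Do not rely on inheritance from listen_fd: set the flag ourselves. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On Linux, accept4(2) with SOCK_NONBLOCK would achieve the same in one call, at the cost of portability.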
9. select
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
int pselect(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timespec *timeout, const sigset_t *sigmask);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking.
Three independent sets of file descriptors are watched. readfds is watched to see if data is available for reading from any of its file descriptors. After select() has returned, readfds will be cleared of all file descriptors except for those that are immediately available for reading. The same applies to writefds and exceptfds, which are watched for writability and for exceptional conditions respectively.
On exit, the sets are modified in place to indicate which file descriptors actually changed status. Each of the three file descriptor sets may be specified as NULL if no file descriptors are to be watched for the corresponding class of events. So, after select() returns, all file descriptors in all sets should be checked to see if they are ready.
The functions read(2), recv(2), write(2), and send(2) as well as the select() call can return -1 with errno set to EINTR, or with errno set to EAGAIN (EWOULDBLOCK). These results must be properly managed.
So, inside the event loop, the code must be written like this:
    /* here nfds holds the highest-numbered descriptor, hence the + 1 */
    r = select(nfds + 1, &rd, &wr, &er, NULL);
    if (r == -1 && errno == EINTR)
        continue;
    if (r == -1) {
        perror("select()");
        exit(EXIT_FAILURE);
    }
If the functions read(2), recv(2), write(2), and send(2) fail with errors other than those listed above, or one of the input functions returns 0, indicating end of file, then you should not pass that descriptor to select() again.
Four macros are provided to manipulate the sets. FD_ZERO() clears a set. FD_SET() and FD_CLR() respectively add and remove a given file descriptor from a set. FD_ISSET() tests to see if a file descriptor is part of the set; this is useful after select() returns.
nfds is the highest-numbered file descriptor in any of the three sets, plus 1.
The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.) If timeout is NULL (no timeout), select() can block indefinitely.
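Putting the macros, nfds, the timeout, and the EINTR handling together gives a helper like the one below. It is a sketch: `wait_readable` is a made-up name, fd must be below FD_SETSIZE, and for simplicity a restart after EINTR waits for the full timeout again.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Waits up to timeout_ms for fd to become readable, restarting on EINTR.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);           /* sets are modified in place: rebuild each time */
        FD_SET(fd, &rd);

        struct timeval tv;      /* may also be modified by select(): rebuild too */
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int r = select(fd + 1, &rd, NULL, NULL, &tv);  /* nfds = highest fd + 1 */
        if (r == -1 && errno == EINTR)
            continue;           /* interrupted by a signal: try again */
        if (r == -1)
            return -1;
        return (r > 0 && FD_ISSET(fd, &rd)) ? 1 : 0;
    }
}
```

Readable here also covers end of file, so a caller still has to check whether the following read() returns 0.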
sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the "select" function, and then restores the original signal mask.
The operation of select() and pselect() is identical, other than these three differences:
(i) select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
(ii) select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
(iii) select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.
Other than the difference in the precision of the timeout argument, the following pselect() call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds, timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
Exercises
1. Write a TCP server that supports simultaneous connections
2. Write a TCP server using epoll
3. Write a UDP server, focusing on performance
4. UDP client + connect: send a datagram to this client from another server
5. Raw socket: write a SYN flooder
6. Packet socket: write a sniffer
7. Domain socket