TCP Protocol Series (2): Starting from man 7 tcp

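This article starts from the TCP man page. What follows is a translation of man 7 tcp. It draws on translations that others have published online, and I am grateful for their work. Readers are welcome to point out any errors below.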

The end of the article lists all /proc options, all socket options, and all RFCs that appear in man 7 tcp.


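(2017.06.11, thoughts on finishing the translation: I pushed forward a little every day, and this may be the first time I have spent so many days reading a single article. Until now I have learned new things fast-food style, which was neither continuous nor systematic. With the translation finished, I finally understand more of the details of TCP. I plan to keep writing articles about the TCP protocol in the near future.)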


(2017.06.03 started)

1. NAME

tcp - Transmission Control Protocol (TCP)

2. SYNOPSIS

       #include <sys/socket.h>
       #include <netinet/in.h>
       #include <netinet/tcp.h>
       tcp_socket = socket(AF_INET, SOCK_STREAM, 0);

3. DESCRIPTION

This is an implementation of the TCP protocol defined in RFC 793, RFC 1122 and RFC 2001 with the NewReno and SACK extensions. It provides a reliable, stream-oriented, full-duplex connection between two sockets on top of ip(7), for both v4 and v6 versions. TCP guarantees that the data arrives in order and retransmits lost packets. It generates and checks a per-packet checksum to catch transmission errors. TCP does not preserve record boundaries.
In short: Linux TCP follows RFC 793, RFC 1122 and RFC 2001 and adds NewReno and SACK. It gives two sockets a reliable, stream-oriented, full-duplex connection over ip(7), for IPv4 as well as IPv6. Data is delivered in order, lost packets are retransmitted, and every packet carries a checksum that is verified on receipt. TCP is a byte stream, so message (record) boundaries are not preserved.
  
A newly created TCP socket has no remote or local address and is not fully specified. To create an outgoing TCP connection use connect(2) to establish a connection to another TCP socket. To receive new incoming connections, first bind(2) the socket to a local address and port and then call listen(2) to put the socket into the listening state. After that a new socket for each incoming connection can be accepted using accept(2). A socket which has had accept(2) or connect(2) successfully called on it is fully specified and may transmit data. Data cannot be transmitted on listening or not yet connected sockets.

In short: a freshly created TCP socket has neither a local nor a remote address, so it is not yet fully specified. For an outgoing connection, call connect(2). For incoming connections, bind(2) the socket to a local address and port, call listen(2) to put it into the listening state, and then call accept(2), which returns a new socket for each incoming connection. Only a socket returned by accept(2), or one on which connect(2) succeeded, is fully specified and can carry data. A listening or unconnected socket cannot.
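
The passive-open sequence described above (socket, bind, listen) can be sketched as follows. This is a minimal illustration, not part of the man page; the helper name make_listener and its parameters are made up for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, bind it to (addr, port) and
 * put it into the listening state. addr and port are in host byte order;
 * port 0 lets the kernel pick a free port. Returns the fd, or -1 on error. */
int make_listener(uint32_t addr, uint16_t port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(addr);
    sa.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;  /* hand this to accept(2) to get per-connection sockets */
}
```

A caller would typically loop on accept(2) with the returned descriptor; data is exchanged only on the sockets that accept(2) returns, never on the listening socket itself.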


Linux supports RFC 1323 TCP high performance extensions. These include Protection Against Wrapped Sequence Numbers (PAWS), Window Scaling and Timestamps. Window scaling allows the use of large (> 64K) TCP windows in order to support links with high latency or bandwidth. To make use of them, the send and receive buffer sizes must be increased. They can be set globally with the /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem files, or on individual sockets by using the SO_SNDBUF and SO_RCVBUF socket options with the setsockopt(2) call.

In short: Linux supports the RFC 1323 high-performance extensions, namely protection against wrapped sequence numbers (PAWS), window scaling and timestamps. Window scaling allows TCP windows larger than 64K, which high-latency or high-bandwidth links need. To benefit from it, the send and receive buffers must be enlarged, either globally through /proc/sys/net/ipv4/tcp_wmem and /proc/sys/net/ipv4/tcp_rmem, or per socket with the SO_SNDBUF and SO_RCVBUF options of setsockopt(2).


The maximum sizes for socket buffers declared via the SO_SNDBUF and SO_RCVBUF mechanisms are limited by the values in the /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max files. Note that TCP actually allocates twice the size of the buffer requested in the setsockopt(2) call, and so a succeeding getsockopt(2) call will not return the same size of buffer as requested in the setsockopt(2) call. TCP uses the extra space for administrative purposes and internal kernel structures, and the /proc file values reflect the larger sizes compared to the actual TCP windows. On individual connections, the socket buffer size must be set prior to the listen(2) or connect(2) calls in order to have it take effect. See socket(7) for more information.

In short: buffer sizes requested with SO_SNDBUF and SO_RCVBUF are capped by /proc/sys/net/core/wmem_max and /proc/sys/net/core/rmem_max. TCP allocates twice the size passed to setsockopt(2), so a later getsockopt(2) does not return the value that was requested. The extra space is used for bookkeeping and internal kernel structures, and the /proc values reflect these larger sizes rather than the actual TCP windows. For an individual connection, the buffer size must be set before listen(2) or connect(2) to take effect. See socket(7) for details.
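
The doubling described above is easy to observe. The sketch below sets SO_RCVBUF on a fresh, unconnected socket and reads the value back; the function name effective_rcvbuf is invented for this example, and the exact value returned depends on the rmem_max cap of the system it runs on.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: request a receive buffer of `requested` bytes on a
 * new TCP socket and return the size the kernel reports back, or -1 on
 * error. The option is set before any connect(2)/listen(2) call. */
int effective_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int actual = -1;
    socklen_t len = sizeof actual;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof requested) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        actual = -1;

    close(fd);
    return actual;
}
```

On a Linux system where the request is below rmem_max, the returned value is expected to be larger than the requested one, in line with the text.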


TCP supports urgent data. Urgent data is used to signal the receiver that some important message is part of the data stream and that it should be processed as soon as possible. To send urgent data specify the MSG_OOB option to send(2). When urgent data is received, the kernel sends a SIGURG signal to the process or process group that has been set as the socket "owner" using the SIOCSPGRP or FIOSETOWN ioctls (or the POSIX.1-2001-specified fcntl(2) F_SETOWN operation). When the SO_OOBINLINE socket option is enabled, urgent data is put into the normal data stream (a program can test for its location using the SIOCATMARK ioctl described below), otherwise it can be received only when the MSG_OOB flag is set for recv(2) or recvmsg(2).
In short: TCP supports urgent data, which tells the receiver that the stream contains something important to be handled as soon as possible. The sender passes MSG_OOB to send(2). On arrival, the kernel sends SIGURG to the process or process group registered as the socket owner through the SIOCSPGRP or FIOSETOWN ioctls (or the POSIX.1-2001 fcntl(2) F_SETOWN operation). With SO_OOBINLINE enabled, urgent data stays in the normal data stream and its position can be tested with the SIOCATMARK ioctl. Otherwise it can only be read by recv(2) or recvmsg(2) with the MSG_OOB flag set.

(2017.06.04 continued)

Linux 2.4 introduced a number of changes for improved throughput and scaling, as well as enhanced functionality. Some of these features include support for zero-copy sendfile(2), Explicit Congestion Notification, new management of TIME_WAIT sockets, keep-alive socket options and support for Duplicate SACK extensions.
In short: Linux 2.4 brought many changes that improve throughput and scalability and add functionality, among them zero-copy sendfile(2), Explicit Congestion Notification, a new way of managing TIME_WAIT sockets, keep-alive socket options, and the Duplicate SACK extensions.


3.1 Address formats

TCP is built on top of IP (see ip(7)).  The address formats defined by ip(7) apply to TCP.  TCP supports point-to-point communication only; broadcasting and multicasting are  not  supported.
In short: TCP sits on top of IP (see ip(7)) and uses the address formats defined there. It is strictly point-to-point; neither broadcast nor multicast is supported.


3.2 /proc interfaces

System-wide  TCP parameter  settings  can  be  accessed by files in the directory /proc/sys/net/ipv4/. In addition, most IP /proc interfaces also apply to TCP; see ip(7).  Variables described as Boolean take an integer value, with a nonzero value ("true") meaning that the corresponding option is enabled, and a zero value ("false") meaning that the option  is  disabled.
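In short: system-wide TCP parameters are read and written through files in /proc/sys/net/ipv4/. Most of the IP /proc interfaces also apply to TCP; see ip(7). A variable described as Boolean holds an integer: any nonzero value means true (option enabled), and zero means false (option disabled).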
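
As a minimal sketch of how a program might read one of these files and apply the Boolean convention just described, consider the two helpers below. Their names (read_sysctl_int, sysctl_enabled) are made up for this example; only standard C I/O is used.

```c
#include <stdio.h>

/* Hypothetical helper: read a single integer from a /proc/sys file.
 * Returns 0 and stores the value in *out on success, -1 if the file is
 * missing or does not start with an integer. */
int read_sysctl_int(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Interpret a Boolean sysctl: nonzero means enabled.
 * Returns 1 (enabled), 0 (disabled) or -1 (could not be read). */
int sysctl_enabled(const char *path)
{
    long v;
    if (read_sysctl_int(path, &v) < 0)
        return -1;
    return v != 0;
}
```

For example, sysctl_enabled("/proc/sys/net/ipv4/tcp_sack") would report whether selective acknowledgements are switched on, provided the file exists on the running kernel.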

tcp_abc (Integer; default: 0; since Linux 2.6.15)
Control the Appropriate Byte Count (ABC), defined in RFC 3465.  ABC is a way of increasing the congestion window (cwnd) more slowly in response to partial acknowledgments.  Possible values are:
0 increase cwnd once per acknowledgment (no ABC)
1 increase cwnd once per acknowledgment of full sized segment
2 allow increase cwnd by two if acknowledgment is of two segments to compensate for delayed acknowledgments.
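In short: this controls Appropriate Byte Counting (ABC), defined in RFC 3465. ABC makes the congestion window (cwnd) grow more slowly when acknowledgments cover only part of a segment. Possible values:
0: increase cwnd once per acknowledgment (no ABC).
1: increase cwnd once per acknowledgment of a full-sized segment.
2: allow cwnd to increase by two when an acknowledgment covers two segments, to compensate for delayed acknowledgments.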

tcp_abort_on_overflow (Boolean; default: disabled; since Linux 2.4)
Enable resetting connections if the listening service is too slow and unable to keep up and accept them. It means that if overflow occurred due to a burst, the connection  will recover. Enable this  option only if you are really sure that the listening daemon cannot be tuned to accept connections faster.  Enabling this option can harm the clients of your server.
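In short: when enabled, connections are reset if the listening service is too slow to keep up and accept them. With the option disabled (the default), a connection can still recover if the overflow was caused by a short burst. Enable it only if you are certain that the listening daemon cannot be tuned to accept connections faster, because enabling it can harm the clients of your server.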

tcp_adv_win_scale (integer; default: 2; since Linux 2.4)
Count buffering overhead as bytes/2^tcp_adv_win_scale, if tcp_adv_win_scale is greater than 0; or bytes-bytes/2^(-tcp_adv_win_scale), if tcp_adv_win_scale is less than or  equal to zero.
The  socket receive buffer space is shared between the application and kernel.  TCP maintains part of the buffer as the TCP window, this is the size of the receive window advertised to the other end.  The rest of the space is used as the "application" buffer, used to isolate the network from scheduling and application latencies.  The tcp_adv_win_scale default value of 2 implies that the space used for the application buffer is one fourth that of the total.
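In short: the buffering overhead is counted as bytes/2^tcp_adv_win_scale when tcp_adv_win_scale is greater than 0, and as bytes-bytes/2^(-tcp_adv_win_scale) when it is less than or equal to 0.
The socket receive buffer is shared between the application and the kernel. TCP keeps part of it as the TCP window, which is the receive window advertised to the peer. The rest is the "application" buffer, which shields the network from scheduling and application latencies. The default of 2 means that the application buffer takes one quarter of the total.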
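
The two-branch rule above can be worked through in a few lines of C. This is only a sketch of the arithmetic quoted in the text, with invented function names; it is not the kernel's code.

```c
/* Buffering overhead for a receive buffer of `bytes`, following the
 * tcp_adv_win_scale rule quoted above. */
long buffer_overhead(long bytes, int scale)
{
    if (scale > 0)
        return bytes >> scale;          /* bytes / 2^scale */
    return bytes - (bytes >> -scale);   /* bytes - bytes / 2^(-scale) */
}

/* What is left after the overhead can be advertised as the TCP window. */
long advertised_window(long bytes, int scale)
{
    return bytes - buffer_overhead(bytes, scale);
}
```

With the default scale of 2 and a buffer of 87380 bytes, the overhead is 21845 bytes (one quarter) and 65535 bytes remain for the window.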

tcp_allowed_congestion_control (String; default: see text; since Linux 2.4.20)
Show/set the  congestion control algorithm choices available to unprivileged processes (see the description of the TCP_CONGESTION socket option).  The list is a subset of those listed in tcp_available_congestion_control.  The default value for this list is "reno" plus the default setting of tcp_congestion_control.
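In short: shows or sets the congestion-control algorithms that unprivileged processes may choose (see the TCP_CONGESTION socket option). The list is a subset of tcp_available_congestion_control. By default it contains "reno" plus whatever tcp_congestion_control is set to.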

tcp_available_congestion_control (String; read-only; since Linux 2.4.20)
Show a list of the congestion-control algorithms that are registered.  This list is a limiting set for the list in tcp_allowed_congestion_control.  More congestion-control algorithms may be available as modules, but not loaded.
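In short: lists the congestion-control algorithms that are currently registered. It is the upper bound for the list in tcp_allowed_congestion_control. Further algorithms may exist as kernel modules that have not been loaded.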

tcp_app_win (integer; default: 31; since Linux 2.4)
This variable defines how many bytes of the TCP window are reserved for buffering overhead.
A maximum of (window/2^tcp_app_win, mss) bytes in the window are reserved for the application buffer.  A value of 0 implies that no amount is reserved.
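In short: this variable defines how many bytes of the TCP window are reserved for buffering overhead. The larger of window/2^tcp_app_win and mss is reserved in the window for the application buffer. A value of 0 means that nothing is reserved.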
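
The reservation rule can be illustrated with a small function. This is a sketch that reads "maximum of (window/2^tcp_app_win, mss)" as the larger of the two values; the function name app_reserve is invented for this example.

```c
/* Bytes of the window reserved for the application buffer:
 * max(window / 2^app_win, mss), or nothing when app_win is 0. */
unsigned long app_reserve(unsigned long window, unsigned app_win,
                          unsigned long mss)
{
    if (app_win == 0)
        return 0;
    /* Guard against shifting by the full width of the type. */
    unsigned long share =
        (app_win < 8 * sizeof window) ? window >> app_win : 0;
    return share > mss ? share : mss;
}
```

With the default tcp_app_win of 31, window/2^31 is zero for any realistic window, so the reservation is effectively one MSS.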

tcp_base_mss (Integer; default: 512; since Linux 2.6.17)
The  initial  value of search_low to be used by the packetization layer Path MTU discovery (MTU probing). If MTU probing is enabled, this is the initial MSS used by the connection.
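In short: the initial value of search_low used by packetization-layer Path MTU discovery (MTU probing). When MTU probing is enabled, this is the initial MSS of a connection.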

(2017.06.06 continued)
tcp_bic (Boolean; default: disabled; Linux 2.4.27/2.6.6 to 2.6.13)
Enable BIC TCP congestion control algorithm.  BIC-TCP is a sender-side only change that ensures a linear RTT fairness under large windows while  offering both  scalability  and bounded  TCP-friendliness.  The protocol combines two schemes called additive increase and binary search increase.  When the congestion window is large, additive increase with a large increment ensures linear RTT fairness as well as good scalability. Under small congestion windows, binary search increase provides TCP friendliness.
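In short: enables the BIC TCP congestion-control algorithm. BIC-TCP changes only the sender. It ensures linear RTT fairness under large windows while offering scalability and bounded TCP-friendliness. It combines two schemes, additive increase and binary search increase. When the congestion window is large, additive increase with a large increment gives linear RTT fairness and good scalability. When the window is small, binary search increase provides TCP-friendliness.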

tcp_bic_low_window (integer; default: 14; Linux 2.4.27/2.6.6 to 2.6.13)
Set the threshold window (in packets) where BIC TCP starts to adjust the congestion window.  Below this threshold BIC TCP behaves the same as the default TCP Reno.
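In short: sets the window threshold (in packets) at which BIC TCP starts to adjust the congestion window. Below the threshold BIC TCP behaves like the default TCP Reno.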

tcp_bic_fast_convergence (Boolean; default: enabled; Linux 2.4.27/2.6.6 to 2.6.13)
Force BIC TCP to more quickly respond to changes in congestion window.  Allows two flows sharing the same connection to converge more rapidly.
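In short: forces BIC TCP to respond more quickly to changes in the congestion window, so that two flows sharing the same connection converge faster.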

tcp_congestion_control (String; default: see text; since Linux 2.4.13)
Set the default congestion-control algorithm to be used for new connections.  The algorithm "reno" is always available, but additional choices may be available depending on kernel configuration.  The default value for this file is set as part of kernel configuration.
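In short: sets the default congestion-control algorithm for new connections. "reno" is always available, and further choices depend on the kernel configuration. The default value of this file is fixed as part of the kernel configuration.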
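
A program can check which algorithm a particular socket actually uses through the TCP_CONGESTION socket option mentioned under tcp_allowed_congestion_control. The sketch below assumes a Linux system whose <netinet/tcp.h> defines TCP_CONGESTION; the helper name is invented for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: copy the name of the congestion-control algorithm
 * used by `fd` into buf as a NUL-terminated string.
 * Returns 0 on success, -1 on error. */
int get_congestion_control(int fd, char *buf, socklen_t buflen)
{
    if (buflen == 0)
        return -1;
    socklen_t len = buflen - 1;   /* keep one byte for the terminator */
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) < 0)
        return -1;
    buf[len] = '\0';
    return 0;
}
```

On a new socket the name returned would normally match the system default in tcp_congestion_control.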

tcp_dma_copybreak (integer; default: 4096; since Linux 2.6.24)
Lower  limit,  in bytes, of  the  size of socket reads that will be offloaded to a DMA copy engine, if one is present in the system and the kernel was configured with the CONFIG_NET_DMA option.
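In short: the lower limit, in bytes, on the size of socket reads that are offloaded to a DMA copy engine, if the system has one and the kernel was built with the CONFIG_NET_DMA option.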

tcp_dsack (Boolean; default: enabled; since Linux 2.4)
Enable RFC 2883 TCP Duplicate SACK support.
In short: enables RFC 2883 TCP Duplicate SACK support.

(2017.06.07 continued)
tcp_ecn (Boolean; default: disabled; since Linux 2.4)
Enable RFC 2884 Explicit Congestion Notification. When enabled, connectivity to some destinations could be affected due to older, misbehaving routers  along  the  path causing connections to be dropped.
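In short: enables RFC 2884 Explicit Congestion Notification. When it is enabled, connectivity to some destinations may suffer, because older, misbehaving routers along the path can cause connections to be dropped.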

tcp_fack (Boolean; default: enabled; since Linux 2.2)
Enable TCP Forward Acknowledgement support.
In short: enables TCP Forward Acknowledgement (FACK) support.

tcp_fin_timeout (integer; default: 60; since Linux 2.2)
This specifies how many seconds to wait for a final FIN packet before the socket is forcibly closed.  This is strictly a violation of the TCP specification, but required to prevent denial-of-service attacks.  In Linux 2.2, the default value was 180.
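In short: how many seconds to wait for the final FIN packet before the socket is forcibly closed. Strictly speaking this violates the TCP specification, but it is needed to prevent denial-of-service (DoS) attacks. In Linux 2.2 the default was 180 seconds.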

tcp_frto (integer; default: 0; since Linux 2.4.21/2.6)
Enable F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts (RTOs).  It is particularly beneficial in wireless environments where packet loss is typically  due to random radio interference rather than intermediate router congestion. See RFC 4138 for more details.

This file can have one of the following values:
0 Disabled.
1 The basic version F-RTO algorithm is enabled.
2 Enable SACK-enhanced F-RTO if flow uses SACK. The basic version can be used also when SACK is in use though in that case scenario(s) exists where F-RTO interacts badly with the packet counting of the SACK-enabled TCP flow.

Before Linux 2.6.22, this parameter was a Boolean value, supporting just values 0 and 1 above.
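In short: enables F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts (RTOs). It helps most in wireless environments, where packet loss is usually caused by random radio interference rather than by congestion at an intermediate router. See RFC 4138 for details.
The file can hold one of the following values:
0: disabled.
1: the basic version of F-RTO is enabled.
2: SACK-enhanced F-RTO is enabled if the flow uses SACK. The basic version can also be used with SACK, but there are scenarios in which F-RTO then interacts badly with the packet counting of the SACK-enabled flow.
Before Linux 2.6.22 this parameter was a Boolean and supported only the values 0 and 1.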

tcp_frto_response (integer; default: 0; since Linux 2.6.22)
When F-RTO has detected that a TCP retransmission timeout was spurious (i.e, the timeout would have been avoided had TCP set a longer retransmission timeout),  TCP  has several options concerning what to do next.  Possible values are:

0 Rate halving based; a smooth and conservative response, results in halved congestion window (cwnd) and slow-start threshold (ssthresh) after one RTT.
1 Very conservative response; not recommended because even though being valid, it interacts poorly with the rest of Linux TCP; halves cwnd and ssthresh immediately.
2 Aggressive  response; undoes congestion-control measures that are now known to be unnecessary (ignoring the possibility of a lost retransmission that would require TCP to be more cautious); cwnd and ssthresh are restored to the values prior to timeout.
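In short: once F-RTO has detected that a retransmission timeout was spurious (that is, it would not have happened with a longer retransmission timeout), TCP can react in several ways. Possible values:
0: rate-halving based. A smooth and conservative response that halves the congestion window (cwnd) and the slow-start threshold (ssthresh) after one RTT.
1: very conservative response. Not recommended: although valid, it interacts poorly with the rest of Linux TCP. It halves cwnd and ssthresh immediately.
2: aggressive response. It undoes congestion-control measures now known to be unnecessary (ignoring the possibility of a lost retransmission, which would call for more caution). cwnd and ssthresh are restored to their values before the timeout.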

(2017.06.08 continued)
tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)
The number of seconds between TCP keep-alive probes.
In short: the interval between TCP keep-alive probes, in seconds.

tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
The maximum number of TCP keep-alive probes to send before giving up and killing the connection if no response is obtained from the other end.
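In short: the maximum number of keep-alive probes sent without any response from the peer before TCP gives up and kills the connection. It can be read as the number of failed probes that is tolerated.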

tcp_keepalive_time (integer; default: 7200; since Linux 2.2)
The number of seconds a connection needs to be idle before TCP begins sending out keep-alive probes.  Keep-alives are sent only when the SO_KEEPALIVE socket option  is  enabled.
The default value is 7200 seconds (2 hours).  An idle connection is terminated after approximately an additional 11 minutes (9 probes an interval of 75 seconds apart) when keep-alive is enabled.
Note that underlying connection tracking mechanisms and application timeouts may be much shorter.
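In short: the number of seconds a connection must be idle before TCP starts sending keep-alive probes. Probes are sent only when the SO_KEEPALIVE socket option is enabled.
The default is 7200 seconds (2 hours). With keep-alive enabled, an idle connection is terminated roughly 11 minutes later (9 probes, 75 seconds apart).
Note: underlying connection-tracking mechanisms and application timeouts may be much shorter.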
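
The timing above is plain arithmetic, and the three system-wide values can also be overridden per socket. The sketch below assumes the Linux socket options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT from <netinet/tcp.h>; both function names are invented for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Seconds from the last activity until an unresponsive peer is given up on:
 * the idle time plus one interval per probe. */
long keepalive_deadline(long idle, long intvl, long probes)
{
    return idle + intvl * probes;
}

/* Hypothetical helper: enable keep-alive on one socket and override the
 * three system-wide defaults for it. Returns 0 on success, -1 on error. */
int enable_keepalive(int fd, int idle, int intvl, int probes)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

With the defaults, keepalive_deadline(7200, 75, 9) gives 7875 seconds: two hours of idle time plus 675 seconds (about 11 minutes) of probing.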

tcp_low_latency (Boolean; default: disabled; since Linux 2.4.21/2.6)
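If enabled, the TCP stack makes decisions that prefer lower latency as opposed to higher throughput. If this option is disabled, then higher throughput is preferred. An example of an application where this default should be changed would be a Beowulf compute cluster.
In short: when enabled, the TCP stack favours low latency over high throughput. When disabled, throughput is preferred. A Beowulf compute cluster is an example of where the default should be changed (author: that is, set to true).
(Author's note: a high-performance computing cluster raises computing power by distributing tasks across its compute nodes, so it is used mainly in scientific computing. Popular HPC setups run parallel workloads on Linux and other free software, and this kind of configuration is usually called a Beowulf cluster. Such clusters normally run purpose-built programs that exploit the parallelism of the cluster, typically written against dedicated libraries such as MPI, which was designed for scientific computing.)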

tcp_max_orphans (integer; default: see below; since Linux 2.4)
The  maximum  number  of orphaned (not attached to any user file handle) TCP sockets allowed in the system.  When this number is exceeded, the orphaned connection is reset and a warning is printed.  This limit exists only to prevent simple denial-of-service attacks. Lowering this limit is not  recommended.   Network  conditions might  require you  to increase the  number  of orphans  allowed,  but note that each orphan can eat up to ~64K of unswappable memory. The default initial value is set equal to the kernel parameter NR_FILE. This initial default is adjusted depending on the memory in the system.
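In short: the maximum number of orphaned TCP sockets (not attached to any user file handle) allowed in the system. When the number is exceeded, the orphaned connection is reset and a warning is printed. The limit exists only to prevent simple DoS attacks, and lowering it is not recommended. Network conditions may require raising it, but each orphan can consume up to about 64K of unswappable memory. The initial default equals the kernel parameter NR_FILE and is then adjusted according to the memory in the system.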

tcp_max_syn_backlog (integer; default: see below; since Linux 2.2)
The maximum number of queued connection requests which have still not received an acknowledgement from the connecting client.  If this number is exceeded, the kernel will  begin dropping requests.   The default value of 256 is increased to 1024 when the memory present in the system is adequate or greater (>= 128Mb), and reduced to 128 for those systems with  very  low  memory  (<=  32Mb).   It is  recommended  that if  this  needs to  be increased  above  1024,  TCP_SYNQ_HSIZE  in  include/net/tcp.h be  modified  to  keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the kernel be recompiled.
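In short: the maximum number of queued connection requests that have not yet been acknowledged by the connecting client. Beyond this number the kernel starts dropping requests. The default of 256 is raised to 1024 when the system has enough memory (>= 128Mb) and lowered to 128 on systems with very little memory (<= 32Mb). If the value must exceed 1024, it is recommended to change TCP_SYNQ_HSIZE in include/net/tcp.h so that TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog still holds, and to recompile the kernel.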
(Author's note: TCP_SYNQ_HSIZE was removed in later kernels; the recommendation applies to kernels before Linux 2.6.20.)

tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
The  maximum  number of sockets in TIME_WAIT state allowed in the system. This limit exists only to prevent simple denial-of-service attacks.  The default value of NR_FILE*2 is adjusted depending on the memory in the system.  If this number is exceeded, the socket is closed and a warning is printed.
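In short: the maximum number of sockets in the TIME_WAIT state allowed in the system. The limit exists only to prevent simple DoS attacks. The default of NR_FILE*2 is adjusted according to the memory in the system. When the number is exceeded, the socket is closed and a warning is printed.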

tcp_moderate_rcvbuf (Boolean; default: enabled; since Linux 2.4.17/2.6.7)
If enabled, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer (no greater than tcp_rmem[2]) to match the size required by the path  for  full throughput.
In short: when enabled, TCP auto-tunes the receive buffer, trying to size it (never above tcp_rmem[2]) to what the path needs for full throughput.

tcp_mem (since Linux 2.4)
This  is a  vector  of 3 integers: [low, pressure, high].  These bounds, measured in units of the system page size, are used by TCP to track its memory usage.  The defaults are calculated at boot time from the amount of available memory.  (TCP can only use low memory for this, which is limited to around 900 megabytes on 32-bit systems. 64-bit systems do not suffer this limitation.)

low TCP doesn't regulate its memory allocation when the number of pages it has allocated globally is below this number.

pressure When the amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption. This memory pressure state is exited once the number of pages allocated falls below the low mark.

high The maximum number of pages, globally, that TCP will allocate. This value overrides any other limits imposed by the kernel.
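In short: a vector of 3 integers, [low, pressure, high]. These bounds are measured in units of the system page size, and TCP uses them to track its memory usage. The defaults are computed at boot time from the amount of available memory. (TCP can only use low memory for this, which is limited to about 900 megabytes on 32-bit systems. 64-bit systems have no such limit.)
(1) low. While the number of pages TCP has allocated globally is below this value, TCP does not regulate its memory allocation.
(2) pressure. When the memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption. The memory-pressure state ends once the allocation falls below the low mark.
(3) high. The maximum number of pages TCP will allocate globally. This value overrides any other limit imposed by the kernel. (Author's note: that is, it has the highest priority.)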
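
The low and pressure marks form a simple hysteresis: pressure starts above one threshold and ends only below another. The sketch below models just that rule as described in the text; it is a simplified illustration with invented names, not the kernel's accounting code.

```c
#include <stdbool.h>

struct tcp_mem { long low, pressure, high; };   /* all values in pages */

/* Return the new memory-pressure state for a given global allocation.
 * Pressure is entered above `pressure` and left only below `low`. */
bool update_pressure(const struct tcp_mem *m, long allocated,
                     bool under_pressure)
{
    if (allocated > m->pressure)
        return true;
    if (allocated < m->low)
        return false;
    return under_pressure;   /* between the marks: keep the previous state */
}

/* Convert a page count into bytes for a given page size. */
long long pages_to_bytes(long pages, long page_size)
{
    return (long long)pages * page_size;
}
```

The page size to pass to pages_to_bytes can be obtained at run time with sysconf(_SC_PAGESIZE).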

tcp_mtu_probing (integer; default: 0; since Linux 2.6.17)
This parameter controls TCP Packetization-Layer Path MTU Discovery.  The following values may be assigned to the file:
0 Disabled
1 Disabled by default, enabled when an ICMP black hole detected
2 Always enabled, use initial MSS of tcp_base_mss.
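In short: this parameter controls TCP Packetization-Layer Path MTU Discovery. The file accepts the following values:
0: disabled.
1: disabled by default, enabled when an ICMP black hole is detected.
2: always enabled, using tcp_base_mss as the initial MSS.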

tcp_no_metrics_save (Boolean; default: disabled; since Linux 2.6.6)
By default, TCP saves various connection metrics in the route cache when the connection closes, so that connections established in the near future can use these to  set initial conditions.   Usually,  this increases overall performance, but it may sometimes cause performance degradation.  If tcp_no_metrics_save is enabled, TCP will not cache metrics on closing connections.
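In short: by default TCP stores various connection metrics in the route cache when a connection closes, so that connections established soon afterwards can use them as initial conditions. This usually improves overall performance but can occasionally degrade it. With tcp_no_metrics_save enabled, TCP does not cache metrics when connections close.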

tcp_orphan_retries (integer; default: 8; since Linux 2.4)
The maximum number of attempts made to probe the other end of a connection which has been closed by our end.
In short: the maximum number of attempts to probe the other end of a connection that has been closed by our end.

tcp_reordering (integer; default: 3; since Linux 2.4)
The maximum a packet can be reordered in a TCP packet stream without TCP assuming packet loss and going into slow start. It is not advisable to change this number.  This  is  a packet reordering detection metric designed to minimize unnecessary back off and retransmits provoked by reordering of packets on a connection.
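In short: how far a packet may be reordered within a TCP stream before TCP assumes it was lost and enters slow start. Changing this number is not advisable. It is a reordering-detection metric meant to minimise the unnecessary back-off and retransmissions that packet reordering on a connection would otherwise provoke.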

tcp_retrans_collapse (Boolean; default: enabled; since Linux 2.2)
Try to send full-sized packets during retransmit.
In short: try to send full-sized packets when retransmitting.

tcp_retries1 (integer; default: 3; since Linux 2.2)
The  number  of  times  TCP  will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved. Once we exceed this number of retransmits, we first have the network layer update the route if possible before each new retransmit.  The default is the RFC specified minimum of 3.
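In short: the number of times TCP retransmits a packet on an established connection in the normal way, without involving the network layer. Once this number is exceeded, the network layer is first asked to update the route, if possible, before each further retransmission. The default is the RFC-specified minimum of 3.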

tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds  to a  duration  of  approximately between 13 to 30 minutes, depending on the retransmission timeout.  The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.
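In short: the maximum number of times a TCP packet is retransmitted in the established state before TCP gives up. The default of 15 corresponds to roughly 13 to 30 minutes, depending on the retransmission timeout. The minimum of 100 seconds specified by RFC 1122 is generally considered too short.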

tcp_rfc1337 (Boolean; default: disabled; since Linux 2.2)
Enable  TCP  behavior  conformant with  RFC 1337.   When disabled,  if a RST is received in TIME_WAIT state, we close the socket immediately without waiting for the end of the TIME_WAIT period.
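In short: enables the TCP behavior described in RFC 1337. When disabled, a RST received in the TIME_WAIT state closes the socket immediately, without waiting for the TIME_WAIT period to end.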

tcp_rmem (since Linux 2.4)
This is a vector of 3 integers: [min, default, max].  These parameters are used by TCP to regulate receive buffer sizes. TCP dynamically adjusts the size of the receive buffer from the defaults listed below, in the range of these values, depending on memory available in the system.

min minimum size  of the receive buffer used by each TCP socket.  The default value is the system page size.  (On Linux 2.4, the default value is 4K, lowered to PAGE_SIZE bytes in low-memory systems.)  This value is used to ensure that in memory pressure mode, allocations below this size will still succeed.  This is not used  to  bound the size of the receive buffer declared using SO_RCVBUF on a socket.

default the  default  size of the receive buffer for a TCP socket.  This value overwrites the initial default buffer size from the generic global net.core.rmem_default defined for all protocols.  The default value is 87380 bytes.  (On Linux 2.4, this will be lowered to 43689 in low-memory  systems.) If  larger  receive  buffer  sizes  are desired, this value should be increased (to affect all sockets).  To employ large TCP windows, the net.ipv4.tcp_window_scaling must be enabled (default).

max the  maximum size of the receive buffer used by each TCP socket.  This value does not override the global net.core.rmem_max.  This is not used to limit the size of the receive buffer declared using SO_RCVBUF on a socket.  The default value is calculated using the formula
        max(87380, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
    (On Linux 2.4, the default is 87380*2 bytes, lowered to 87380 in low-memory systems).
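In short: a vector of 3 integers, [min, default, max], which TCP uses to regulate receive buffer sizes. Starting from the defaults listed below, TCP adjusts the receive buffer dynamically within this range, depending on the memory available in the system.
(1) min. The minimum receive buffer size of each TCP socket. The default is the system page size. (On Linux 2.4 the default is 4K, lowered to PAGE_SIZE bytes on low-memory systems.) The value guarantees that allocations below this size still succeed under memory pressure. It does not bound a receive buffer size declared with SO_RCVBUF.
(2) default. The default receive buffer size of a TCP socket. It overrides the generic net.core.rmem_default that applies to all protocols. The default is 87380 bytes. (On Linux 2.4 it is lowered to 43689 on low-memory systems.) Raise this value if larger receive buffers are wanted; the change affects all sockets. Large TCP windows also require net.ipv4.tcp_window_scaling to be enabled (the default).
(3) max. The maximum receive buffer size of each TCP socket. It does not override the global net.core.rmem_max, and it does not limit a receive buffer size declared with SO_RCVBUF. The default is computed with the formula
max(87380, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
(On Linux 2.4 the default is 87380*2 bytes, lowered to 87380 on low-memory systems.)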
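
The formula for the default maximum can be worked through directly. The function below is a sketch of that arithmetic only; its name and parameters are invented for this example, and tcp_mem[1] is passed in as a page count.

```c
#define FOUR_MB (4ULL * 1024 * 1024)

/* Default tcp_rmem[2] according to the formula above:
 * max(87380, min(4MB, tcp_mem[1] * PAGE_SIZE / 128)). */
unsigned long long default_rmem_max(unsigned long long pressure_pages,
                                    unsigned long long page_size)
{
    unsigned long long v = pressure_pages * page_size / 128;
    if (v > FOUR_MB)
        v = FOUR_MB;    /* min(4MB, ...) */
    if (v < 87380)
        v = 87380;      /* max(87380, ...) */
    return v;
}
```

For example, with 4096-byte pages a pressure mark of 100000 pages gives 3200000 bytes, a mark of 1000 pages is raised to the 87380-byte floor, and a mark of 1000000 pages is capped at 4MB.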

tcp_sack (Boolean; default: enabled; since Linux 2.2)
Enable RFC 2018 TCP Selective Acknowledgements.
In short: enables RFC 2018 TCP Selective Acknowledgements.

(2017.06.09 continued)
tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period.  An idle period is defined as the current RTO (retransmission timeout).   If  disabled, the congestion window will not be timed out after an idle period.
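In short: when enabled, TCP follows RFC 2861 and times out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). When disabled, the congestion window is not timed out after an idle period.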

tcp_stdurg (Boolean; default: disabled; since Linux 2.2)
If this option is enabled, then use the RFC 1122 interpretation of the TCP urgent-pointer field. According to this interpretation, the urgent pointer points to the last byte of urgent data.  If this option is disabled, then use the BSD-compatible interpretation of the urgent pointer: the urgent pointer points to the first byte after  the  urgent  data.
Enabling this option may lead to interoperability problems.
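In short: when enabled, the TCP urgent-pointer field is interpreted as in RFC 1122, where the urgent pointer points to the last byte of urgent data. When disabled, the BSD-compatible interpretation is used, where it points to the first byte after the urgent data.
Enabling this option may cause interoperability problems.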

tcp_syn_retries (integer; default: 5; since Linux 2.2)
The  maximum  number  of times initial SYNs for an active TCP connection attempt will be retransmitted.  This value should not be higher than 255.  The default value is 5, which corresponds to approximately 180 seconds.
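In short: the maximum number of times the initial SYN of an active TCP connection attempt is retransmitted. The value should not exceed 255. The default of 5 corresponds to about 180 seconds.
(Author's note: the count covers retransmissions only and does not include the first SYN.)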
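
The figure of about 180 seconds can be reproduced with a rough model. The sketch below assumes an initial retransmission timeout of 3 seconds that doubles after every unanswered SYN; both the 3-second starting value and the uncapped doubling are assumptions of this example, not statements from the man page.

```c
/* Total seconds spent before a connection attempt gives up, assuming the
 * timeout starts at initial_rto and doubles after every unanswered SYN. */
long syn_give_up_seconds(int retries, long initial_rto)
{
    long total = 0, rto = initial_rto;
    /* One wait after the first SYN plus one after each retransmission. */
    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2;
    }
    return total;
}
```

Under these assumptions, 5 retries give 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds, which is in line with the approximate value quoted above.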

tcp_synack_retries (integer; default: 5; since Linux 2.2)
The maximum number of times a SYN/ACK segment for a passive TCP connection will be retransmitted. This number should not be higher than 255.
In short: the maximum number of times a SYN/ACK segment of a passive TCP connection is retransmitted. The value should not exceed 255.

tcp_syncookies (Boolean; since Linux 2.2)
Enable TCP syncookies.  The kernel must be compiled with CONFIG_SYN_COOKIES.  Send out syncookies when the syn backlog queue of  a  socket  overflows.   The  syncookies feature attempts to protect a socket from a SYN flood attack.  This should be used as a last resort, if at all. This is a violation of the TCP protocol, and conflicts with other areas of TCP such as TCP extensions.  It can cause problems for clients and relays.  It is not recommended as a tuning mechanism for heavily loaded servers to help with overloaded  or misconfigured conditions. For recommended alternatives see tcp_max_syn_backlog, tcp_synack_retries, and tcp_abort_on_overflow.
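In short: enables TCP syncookies. The kernel must be compiled with CONFIG_SYN_COOKIES. Syncookies are sent when the SYN backlog queue of a socket overflows, in an attempt to protect the socket from a SYN flood attack. Use this as a last resort, if at all. It violates the TCP protocol and conflicts with other parts of TCP, such as TCP extensions, and it can cause problems for clients and relays. It is not recommended as a tuning mechanism for heavily loaded servers that are overloaded or misconfigured. For the recommended alternatives see tcp_max_syn_backlog, tcp_synack_retries and tcp_abort_on_overflow.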

tcp_timestamps (Boolean; default: enabled; since Linux 2.2)
Enable RFC 1323 TCP timestamps.
In short: enables RFC 1323 TCP timestamps.

tcp_tso_win_divisor (integer; default: 3; since Linux 2.6.9)
This  parameter  controls what percentage of the congestion window can be consumed by a single TCP Segmentation Offload (TSO) frame.  The setting of this parameter is a tradeoff between burstiness and building larger TSO frames.
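In short: controls what percentage of the congestion window a single TCP Segmentation Offload (TSO) frame may consume. The setting is a trade-off between burstiness and building larger TSO frames.
(Author's note: TSO uses a small amount of processing power on the network card to reduce the CPU load of sending packets. It requires support from the NIC hardware and driver.)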

tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4)
Enable fast recycling of TIME_WAIT sockets.  Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
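In short: enables fast recycling of TIME_WAIT sockets. Enabling it is not recommended because it causes problems when working with NAT (Network Address Translation).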

tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
Allow to reuse TIME_WAIT sockets for new connections when it is safe from protocol viewpoint.  It should not be changed without advice/request of technical experts.
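In short: allows TIME_WAIT sockets to be reused for new connections when that is safe from the protocol's point of view. Do not change it without the advice or request of technical experts.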

tcp_vegas_cong_avoid (Boolean; default: disabled; Linux 2.2 to 2.6.13)
Enable TCP Vegas congestion avoidance algorithm. TCP Vegas is a sender-side only change to TCP that anticipates the onset of congestion by estimating the bandwidth.  TCP  Vegas adjusts the sending rate by modifying the congestion window.  TCP Vegas should provide less packet loss, but it is not as aggressive as TCP Reno.
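In short: enables the TCP Vegas congestion-avoidance algorithm. Vegas changes only the sender and anticipates the onset of congestion by estimating the bandwidth. It adjusts the sending rate by modifying the congestion window. Vegas should cause less packet loss, but it is not as aggressive as TCP Reno.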

tcp_westwood (Boolean; default: disabled; Linux 2.4.26/2.6.3 to 2.6.13)
Enable TCP Westwood+ congestion control algorithm.  TCP Westwood+ is a sender-side only modification of the TCP Reno protocol stack that optimizes the performance of TCP congestion control.  It is based on end-to-end bandwidth estimation to set congestion window and slow start threshold after a congestion episode.  Using this estimation, TCP Westwood+ adaptively  sets a  slow start threshold and a congestion window which takes into account the bandwidth used at the time congestion is experienced.  TCP Westwood+ significantly increases fairness with respect to TCP Reno in wired networks and throughput over wireless links.
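In short: enables the TCP Westwood+ congestion-control algorithm. Westwood+ is a sender-side-only modification of the TCP Reno protocol stack that optimises the performance of TCP congestion control. After a congestion episode it sets the congestion window and the slow-start threshold from an end-to-end bandwidth estimate, so both reflect the bandwidth in use when congestion occurred. Compared with TCP Reno, Westwood+ significantly improves fairness in wired networks and throughput over wireless links.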

tcp_window_scaling (Boolean; default: enabled; since Linux 2.2)
Enable RFC 1323 TCP window scaling.  This feature allows the use of a large window (> 64K) on a TCP connection, should the other end support it. Normally,  the 16  bit window length  field in the TCP header limits the window size to less than 64K bytes.  If larger windows are desired, applications can increase the size of their socket buffers and the window scaling option will be employed.  If tcp_window_scaling is disabled, TCP will not negotiate the use of window scaling with the other end during connection setup.
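In short: enables RFC 1323 TCP window scaling, which allows a large window (> 64K) on a TCP connection if the other end supports it. Normally the 16-bit window field in the TCP header limits the window to less than 64K bytes. If larger windows are wanted, applications can increase their socket buffer sizes and the window scaling option will be used. With tcp_window_scaling disabled, TCP does not negotiate window scaling with the peer during connection setup.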
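
To see how the 16-bit field and the scale factor fit together, the sketch below finds the smallest shift that lets a 16-bit window value cover a given buffer size. The upper bound of 14 for the shift comes from RFC 1323 and is not stated in the text above; the function name is invented for this example.

```c
/* Smallest window-scale shift (0..14) with which a 16-bit window field can
 * advertise `bytes`. Returns 14 if even the largest shift is not enough. */
int window_scale_for(unsigned long bytes)
{
    int shift = 0;
    while (shift < 14 && (65535UL << shift) < bytes)
        shift++;
    return shift;
}
```

A 64K-1 window needs no scaling, 65536 bytes already needs a shift of 1, and a 4MB buffer needs a shift of 7 because 65535 shifted left by 6 is still slightly below 4MB.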

tcp_wmem (since Linux 2.4)
This is a vector of 3 integers: [min, default, max].  These parameters are used by TCP to regulate send buffer sizes.  TCP dynamically adjusts the size of the send  buffer  from the default values listed below, in the range of these values, depending on memory available.

min Minimum size  of  the send buffer used by each TCP socket.  The default value is the system page size.  (On Linux 2.4, the default value is 4K bytes.)  This value is used to ensure that in memory pressure mode, allocations below this size will still succeed.  This is not used to bound the size of  the  send buffer declared  using SO_SNDBUF on a socket.

default The  default  size  of the send buffer for a TCP socket.  This value overwrites the initial default buffer size from the generic global /proc/sys/net/core/wmem_default defined for all protocols.  The default value is 16K bytes.  If larger send buffer sizes are desired, this value should be  increased  (to  affect  all sockets).   To employ large TCP windows, the /proc/sys/net/ipv4/tcp_window_scaling must be set to a nonzero value (default).

max The  maximum  size  of the send buffer used by each TCP socket.  This value does not override the value in /proc/sys/net/core/wmem_max.  This is not used to limit the size of the send buffer declared using SO_SNDBUF on a socket.  The default value is calculated using the formula
        max(65536, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
(On Linux 2.4, the default value is 128K bytes, lowered 64K depending on low-memory systems.)
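A vector of 3 integers: [min, default, max]. TCP uses these parameters to regulate send buffer sizes. Starting from the default values listed below, TCP adjusts the send buffer size dynamically within the range of these values, depending on available memory.
(1) min. The minimum size of the send buffer used by each TCP socket. The default is the system page size. (On Linux 2.4 the default is 4K bytes.) This value ensures that, in memory pressure mode, allocations below this size still succeed. It does not bound the size of a send buffer declared with SO_SNDBUF on a socket.
(2) default. The default size of the send buffer for a TCP socket. This value overrides the initial default buffer size that the global /proc/sys/net/core/wmem_default defines for all protocols. The default is 16K bytes. If larger send buffers are wanted, raise this value (which affects all sockets). To use large TCP windows, /proc/sys/net/ipv4/tcp_window_scaling must be set to a nonzero value (the default).
(3) max. The maximum size of the send buffer used by each TCP socket. This value does not override the value in /proc/sys/net/core/wmem_max. It does not limit the size of a send buffer declared with SO_SNDBUF on a socket. The default is calculated with the formula
            max(65536, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
(On Linux 2.4 the default is 128K bytes, lowered to 64K on low-memory systems.)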
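The formula for the default max value can be evaluated by hand. The sketch below does so in C. The function name default_tcp_wmem_max is hypothetical, and it assumes that "4MB" means 4 * 1024 * 1024 bytes and that tcp_mem[1] is counted in pages.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluates max(65536, min(4MB, tcp_mem[1] * PAGE_SIZE / 128)).
 * tcp_mem_pressure stands for tcp_mem[1] (assumed to be in pages);
 * "4MB" is assumed to mean 4 * 1024 * 1024 bytes. */
uint64_t default_tcp_wmem_max(uint64_t tcp_mem_pressure, uint64_t page_size)
{
    const uint64_t four_mb = 4ULL * 1024 * 1024;
    uint64_t scaled, capped;

    assert(page_size > 0);
    scaled = tcp_mem_pressure * page_size / 128;
    capped = scaled < four_mb ? scaled : four_mb;
    return capped > 65536 ? capped : 65536;
}
```

With 4096-byte pages, tcp_mem[1] = 100000 gives 3200000 bytes; very small values are raised to 65536 and very large ones are capped at 4194304.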

tcp_workaround_signed_windows (Boolean; default: disabled; since Linux 2.6.26)
If enabled, assume that no receipt of a window-scaling option means that the remote TCP is broken and treats the window as a signed  quantity.   If  disabled,  assume  that  the remote TCP is not broken even if we do not receive a window scaling option from it.
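If enabled, assume that not receiving a window-scaling option means the remote TCP is broken (that is, its implementation is faulty), and treat the window as a signed quantity. If disabled, assume the remote TCP is not broken even if no window-scaling option is received from it.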

(Continued 2017.06.10)

3.3 Socket options

To set or get a TCP socket option, call getsockopt(2) to read or setsockopt(2) to write the option with the option level argument set to IPPROTO_TCP.  Unless otherwise noted, optval is a pointer to an int.  In addition, most IPPROTO_IP socket options are valid on TCP sockets.  For more information see ip(7).
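Call getsockopt(2) to read a TCP socket option and setsockopt(2) to write one, with the option level argument set to IPPROTO_TCP. Unless otherwise noted, optval is a pointer to an int. In addition, most IPPROTO_IP socket options are also valid on TCP sockets. See ip(7) for more information.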

TCP_CONGESTION (since Linux 2.6.13)
  Get or set the congestion-control algorithm for this socket.  The optval argument is a pointer to a character-string buffer.

  For getsockopt() *optlen specifies the amount of space available in the buffer pointed to by optval, which should be at least 16 bytes (defined by the  kernel-internal  constant TCP_CA_NAME_MAX).  On  return,  the  buffer pointed to by optval is set to a null-terminated string containing the name of the congestion-control algorithm for this socket, and *optlen is set to the minimum of its original value and TCP_CA_NAME_MAX. If the value passed in *optlen is too small, then the string returned in *optval is silently truncated, and no terminating null byte is added.  If an empty string is returned, then the socket is using the default congestion-control algorithm, determined as described under tcp_congestion_control above.

  For setsockopt() optlen specifies the length of the congestion-control algorithm name contained in the buffer pointed to by optval; this length need not include any  terminating null byte.  The algorithm "reno" is always permitted; other algorithms may be available, depending on kernel configuration.  Possible errors from setsockopt() include: algorithm not found/available (ENOENT); setting this algorithm requires the CAP_NET_ADMIN capability (EPERM); and failure getting kernel module (EBUSY).
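  Gets or sets the congestion-control algorithm for this socket. optval points to a character-string buffer.
  For getsockopt(), *optlen specifies the amount of space available in the buffer pointed to by optval, which should be at least 16 bytes (the kernel-internal constant TCP_CA_NAME_MAX). On return, the buffer holds a null-terminated string naming the socket's congestion-control algorithm, and *optlen is set to the smaller of its original value and TCP_CA_NAME_MAX. If the *optlen passed in is too small, the returned string is silently truncated and no terminating null byte is added. If an empty string is returned, the socket is using the default congestion-control algorithm, determined as described under tcp_congestion_control.
  For setsockopt(), optlen specifies the length of the algorithm name in the buffer pointed to by optval; the length need not include a terminating null byte. The "reno" algorithm is always permitted; other algorithms may be available, depending on kernel configuration. Possible errors from setsockopt() include ENOENT (algorithm not found or unavailable), EPERM (setting this algorithm requires the CAP_NET_ADMIN capability), and EBUSY (failure getting the kernel module).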
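The truncation rule for getsockopt() is easy to trip over, so here is a minimal sketch of both directions. The wrapper names get_congestion and set_congestion are hypothetical. The sketch assumes a Linux system whose <netinet/tcp.h> defines TCP_CONGESTION, and it guards against a missing terminator by zeroing the buffer and holding one byte back.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Reads the congestion-control algorithm name of `fd` into `name`.
 * The buffer is zeroed and one byte is held back, so the result stays
 * null-terminated even if the kernel truncates the string.
 * Returns 0 on success, -1 on error (errno set by getsockopt). */
int get_congestion(int fd, char *name, size_t size)
{
    socklen_t len;

    assert(name != NULL && size >= 2);
    len = (socklen_t)(size - 1);
    memset(name, 0, size);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) < 0)
        return -1;
    return 0;
}

/* Selects an algorithm by name; the length passed excludes the null byte. */
int set_congestion(int fd, const char *name)
{
    assert(name != NULL);
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
                      (socklen_t)strlen(name));
}
```

A caller would typically pass a 16-byte buffer to get_congestion(), matching TCP_CA_NAME_MAX, and check errno for ENOENT or EPERM when set_congestion() fails.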

TCP_CORK (since Linux 2.2)
If set, don't send out partial frames.  All queued partial frames are sent when the option is cleared again.  This is useful for prepending headers before  calling  sendfile(2),  or  for  throughput  optimization.  As currently implemented, there is a 200 millisecond ceiling on the time for which output is corked by TCP_CORK.  If this ceiling is reached,  then queued data is automatically transmitted.  This option can be combined with TCP_NODELAY only since Linux 2.5.71.  This option should not be used in code intended to be portable.
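If set, partial frames are not sent out. All queued partial frames are sent when the option is cleared again. This is useful for prepending headers before calling sendfile(2), or for throughput optimization. As currently implemented, output can stay corked by TCP_CORK for at most 200 milliseconds; once that ceiling is reached, the queued data is transmitted automatically. The option can be combined with TCP_NODELAY only since Linux 2.5.71. It should not be used in code intended to be portable.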
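The cork-then-uncork pattern can be sketched as follows. The names set_cork and send_corked are hypothetical, and for brevity the sketch uses plain write(2) for the body where a real server might call sendfile(2). A short write is simply treated as a failure here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sets (on != 0) or clears (on == 0) TCP_CORK; returns 0 or -1. */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Writes a header and a body while corked so they can share segments,
 * then uncorks to flush any queued partial frame.
 * Returns 0 if both writes were complete, -1 otherwise. */
int send_corked(int fd, const void *hdr, size_t hdr_len,
                const void *body, size_t body_len)
{
    int ok;

    assert(hdr != NULL && body != NULL);
    if (set_cork(fd, 1) < 0)
        return -1;
    ok = write(fd, hdr, hdr_len) == (ssize_t)hdr_len &&
         write(fd, body, body_len) == (ssize_t)body_len;
    if (set_cork(fd, 0) < 0)   /* always uncork, even after a failed write */
        return -1;
    return ok ? 0 : -1;
}
```

Because of the 200 millisecond ceiling, the cork should be held only across a burst of writes, not across application-level waits.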

TCP_DEFER_ACCEPT (since Linux 2.4)
Allow  a listener to be awakened only when data arrives on the socket.  Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection.  This option should not be used in code intended to be portable.
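Allows a listener to be woken only when data arrives on the socket. The option takes an integer value in seconds, which can bound the maximum number of attempts TCP makes to complete the connection. It should not be used in code intended to be portable.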

TCP_INFO (since Linux 2.4)
Used to collect information about this socket.  The kernel returns a struct tcp_info as defined in the file /usr/include/linux/tcp.h.  This option should not  be used  in  code intended to be portable.
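Used to collect information about this socket. The kernel returns a struct tcp_info, defined in /usr/include/linux/tcp.h. This option should not be used in code intended to be portable.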

TCP_KEEPCNT (since Linux 2.4)
The maximum number of keepalive probes TCP should send before dropping the connection.  This option should not be used in code intended to be portable.
The maximum number of keepalive probes TCP sends before dropping the connection. This option should not be used in code intended to be portable.

TCP_KEEPIDLE (since Linux 2.4)
The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes, if the socket option SO_KEEPALIVE has been set on this socket.  This option should not be used in code intended to be portable.
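If the SO_KEEPALIVE socket option is set on the socket, the time (in seconds) the connection must stay idle before TCP starts sending keepalive probes. This option should not be used in code intended to be portable.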

TCP_KEEPINTVL (since Linux 2.4)
The time (in seconds) between individual keepalive probes.  This option should not be used in code intended to be portable.
The time (in seconds) between individual keepalive probes. This option should not be used in code intended to be portable.
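TCP_KEEPCNT, TCP_KEEPIDLE and TCP_KEEPINTVL only matter once SO_KEEPALIVE is switched on, so the four options are usually set together. A minimal sketch follows; the function name enable_keepalive and its parameter names are hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turns on keepalive for `fd` and overrides the three per-connection
 * timers: idle_s seconds of idleness before the first probe, intvl_s
 * seconds between probes, and `count` probes before the connection is
 * dropped. Returns 0 on success, -1 if any setsockopt(2) call fails. */
int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    assert(idle_s > 0 && intvl_s > 0 && count > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

With enable_keepalive(fd, 60, 10, 5), a dead peer should be noticed after roughly 60 + 10 * 5 seconds; treat that figure as an estimate, not a guarantee.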

TCP_LINGER2 (since Linux 2.4)
The lifetime of orphaned FIN_WAIT2 state sockets. This option can be used to override the system-wide setting in the file /proc/sys/net/ipv4/tcp_fin_timeout  for  this socket. This is not to be confused with the socket(7) level option SO_LINGER.  This option should not be used in code intended to be portable.
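The lifetime of orphaned sockets in the FIN_WAIT2 state. For this socket, the option can override the system-wide setting in /proc/sys/net/ipv4/tcp_fin_timeout. Do not confuse it with the socket(7) level option SO_LINGER. This option should not be used in code intended to be portable.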

TCP_MAXSEG
The  maximum  segment  size  for outgoing  TCP packets. In Linux 2.2 and earlier, and in Linux 2.6.28 and later, if this option is set before connection establishment, it also changes the MSS value announced to the other end in the initial packet.  Values greater than the (eventual) interface MTU have no effect. TCP will also impose its  minimum  and maximum bounds over the value provided.
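The maximum segment size for outgoing TCP packets. In Linux 2.2 and earlier, and in Linux 2.6.28 and later, setting this option before the connection is established also changes the MSS value announced to the other end in the initial packet. Values greater than the (eventual) interface MTU have no effect. TCP also imposes its own minimum and maximum bounds on the value provided.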

TCP_NODELAY
If set, disable the Nagle algorithm.  This means that segments are always sent as soon as possible, even if there is only a small amount of data. When not set, data is buffered until there is a sufficient amount to send out, thereby avoiding the frequent sending of small packets, which results in poor utilization of the network. This option  is  overridden by TCP_CORK; however, setting this option forces an explicit flush of pending output, even if TCP_CORK is currently set.
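If set, the Nagle algorithm is disabled, so segments are always sent as soon as possible, even when there is only a small amount of data. When not set, data is buffered until there is enough to send, which avoids the frequent sending of small packets and the poor network utilization it causes. This option is overridden by TCP_CORK; however, setting it forces an explicit flush of pending output, even if TCP_CORK is currently set.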
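Since optval is a plain int for this option, setting and reading it back takes only a few lines. The wrapper names set_nodelay and get_nodelay in this sketch are hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disables (on != 0) or re-enables (on == 0) the Nagle algorithm. */
int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}

/* Returns 1 if TCP_NODELAY is set, 0 if it is not, -1 on error. */
int get_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof value;

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    assert(len == sizeof value);
    return value != 0;
}
```

Interactive protocols that send many small messages are the usual candidates for set_nodelay(fd, 1); bulk transfers generally gain nothing from it.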

(Continued 2017.06.11)
TCP_QUICKACK (since Linux 2.4.4)
Enable  quickack mode if set or disable quickack mode if cleared.  In quickack mode, acks are sent immediately, rather than delayed if needed in accordance to normal TCP operation.  This flag is not permanent, it only enables a switch to or from quickack mode.  Subsequent operation of the TCP protocol will once again enter/leave quickack mode depending on internal protocol processing and factors such as delayed ack timeouts occurring and data transfer. This option should not be used in code intended to be portable.
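Enables quickack mode if set and disables it if cleared. In quickack mode, acks are sent immediately instead of being delayed when normal TCP operation would delay them. The flag is not permanent; it only switches into or out of quickack mode. Subsequent TCP operation will enter or leave quickack mode again depending on internal protocol processing and on factors such as delayed-ack timeouts and data transfer. This option should not be used in code intended to be portable.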

TCP_SYNCNT (since Linux 2.4)
Set the number of SYN retransmits that TCP should send before aborting the attempt to connect.  It cannot exceed 255.  This option should not be used in code intended to be portable.
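Sets the number of SYN retransmits TCP sends before it aborts the connection attempt. The value cannot exceed 255. This option should not be used in code intended to be portable.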

TCP_WINDOW_CLAMP (since Linux 2.4)
Bound the size of the advertised window to this value.  The kernel imposes a minimum size of SOCK_MIN_RCVBUF/2.  This option should not be used in code intended to be portable.
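Bounds the size of the advertised window to this value. The kernel imposes a minimum size of SOCK_MIN_RCVBUF/2. This option should not be used in code intended to be portable.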

3.4 Sockets API

TCP provides limited support for out-of-band data, in the form of (a single byte of) urgent data.  In Linux this means if the other end sends newer out-of-band data  the  older urgent data is inserted as normal data into the stream (even when SO_OOBINLINE is not set).  This differs from BSD-based stacks.
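TCP provides limited support for out-of-band (OOB) data, in the form of (a single byte of) urgent data. On Linux this means that if the other end sends newer out-of-band data, the older urgent data is inserted into the stream as normal data (even when SO_OOBINLINE is not set). This differs from BSD-based stacks.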

Linux uses the BSD compatible interpretation of the urgent pointer field by default.  This violates RFC 1122, but is required for interoperability with other stacks.  It can be changed via /proc/sys/net/ipv4/tcp_stdurg.
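By default Linux uses the BSD-compatible interpretation of the urgent pointer field. This violates RFC 1122 but is required for interoperability with other stacks. The behavior can be changed via /proc/sys/net/ipv4/tcp_stdurg.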

It is possible to peek at out-of-band data using the recv(2) MSG_PEEK flag.
Out-of-band data can be peeked at with the MSG_PEEK flag of recv(2).

Since version 2.4, Linux supports the use of MSG_TRUNC in the flags argument of recv(2) (and recvmsg(2)).  This flag causes the received bytes of data  to  be  discarded,  rather  than passed back in a caller-supplied buffer. Since Linux 2.4.4, MSG_PEEK also has this effect when used in conjunction with MSG_OOB to receive out-of-band data.
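Since version 2.4, Linux supports MSG_TRUNC in the flags argument of recv(2) (and recvmsg(2)). The flag causes the received bytes to be discarded rather than passed back in a caller-supplied buffer. Since Linux 2.4.4, MSG_PEEK also has this effect when combined with MSG_OOB to receive out-of-band data.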

3.5 ioctls

The following ioctl(2) calls return information in value.  The correct syntax is:
     int value;
     error = ioctl(tcp_socket, ioctl_type, &value);

ioctl_type is one of the following:

SIOCINQ
Returns  the  amount  of queued unread data  in  the  receive buffer. The socket must not be in LISTEN state, otherwise an error (EINVAL) is returned.  SIOCINQ is defined in <linux/sockios.h>.  Alternatively, you can use the synonymous FIONREAD, defined in <sys/ioctl.h>.

SIOCATMARK
Returns true (i.e., value is nonzero) if the inbound data stream is at the urgent mark.

If the SO_OOBINLINE socket option is set, and SIOCATMARK returns true, then the next read from the socket will return the urgent data.  If the SO_OOBINLINE socket option is  not set, and SIOCATMARK returns true, then the next read from the socket will return the bytes following the urgent data (to actually read the urgent data requires the recv(MSG_OOB) flag).

Note that a read never reads across the urgent mark.  If an application is informed of the presence of urgent data via select(2) (using the exceptfds argument) or through delivery of a SIGURG signal, then it can advance up to the mark using a loop which repeatedly tests SIOCATMARK and performs a read (requesting any number of bytes) as long as SIOCATMARK returns false.

SIOCOUTQ
Returns the amount of unsent data in the socket send queue.  The socket must not be in  LISTEN  state,  otherwise an  error  (EINVAL)  is  returned.   SIOCOUTQ is  defined  in <linux/sockios.h>.  Alternatively, you can use the synonymous TIOCOUTQ, defined in <sys/ioctl.h>.

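The following ioctl(2) calls return information in value. The correct syntax is:
     int value;
     error = ioctl(tcp_socket, ioctl_type, &value);
ioctl_type is one of the following:
(1) SIOCINQ.
    Returns the amount of queued unread data in the receive buffer. The socket must not be in LISTEN state, otherwise an error (EINVAL) is returned. SIOCINQ is defined in <linux/sockios.h>. Alternatively, you can use the synonymous FIONREAD, defined in <sys/ioctl.h>.
(2) SIOCATMARK. Returns true (that is, value is nonzero) if the inbound data stream is at the urgent mark.
If SO_OOBINLINE is set and SIOCATMARK returns true, the next read from the socket returns the urgent data. If SO_OOBINLINE is not set and SIOCATMARK returns true, the next read returns the bytes that follow the urgent data (actually reading the urgent data requires the recv(MSG_OOB) flag).
Note that a read never crosses the urgent mark. If an application learns of urgent data through select(2) (using the exceptfds argument) or through delivery of a SIGURG signal, it can advance to the mark with a loop that repeatedly tests SIOCATMARK and performs a read (requesting any number of bytes) for as long as SIOCATMARK returns false.
(3) SIOCOUTQ. Returns the amount of unsent data in the socket send queue. The socket must not be in LISTEN state, otherwise an error (EINVAL) is returned. SIOCOUTQ is defined in <linux/sockios.h>. Alternatively, you can use the synonymous TIOCOUTQ, defined in <sys/ioctl.h>.
Translator's note: my background here is limited, so I am not sure the translation of this subsection is correct.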
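The three ioctls can be wrapped as shown in this sketch, which also spells out the "advance to the mark" loop described for SIOCATMARK. All function names (unread_bytes, unsent_bytes, skip_to_mark) are hypothetical, and the 512-byte scratch buffer is an arbitrary choice.

```c
#include <assert.h>
#include <linux/sockios.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bytes queued in the receive buffer but not yet read, or -1 on error
 * (for example EINVAL on a listening socket). */
int unread_bytes(int fd)
{
    int value = 0;

    if (ioctl(fd, SIOCINQ, &value) < 0)
        return -1;
    return value;
}

/* Bytes still sitting in the send queue, or -1 on error. */
int unsent_bytes(int fd)
{
    int value = 0;

    if (ioctl(fd, SIOCOUTQ, &value) < 0)
        return -1;
    return value;
}

/* Reads and discards normal data until the stream is at the urgent mark.
 * Returns 0 once at the mark, -1 on error or end of stream. */
int skip_to_mark(int fd)
{
    char buf[512];

    for (;;) {
        int at_mark = 0;

        if (ioctl(fd, SIOCATMARK, &at_mark) < 0)
            return -1;
        if (at_mark)
            return 0;
        /* A read never crosses the mark, so any request size is safe. */
        if (read(fd, buf, sizeof buf) <= 0)
            return -1;
    }
}
```

After skip_to_mark() returns 0, what the next read delivers depends on SO_OOBINLINE, as explained in the text.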

3.6 Error handling

When a network error occurs, TCP tries to resend the packet.  If it doesn't succeed after some time, either ETIMEDOUT or the last received error on this connection is reported.

Some applications require a quicker error notification.  This can be enabled with the IPPROTO_IP level IP_RECVERR socket option.  When this option is enabled, all incoming errors are immediately passed to the user program.  Use this option with care: it makes TCP less tolerant to routing changes and other normal network conditions.

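When a network error occurs, TCP tries to resend the packet. If it does not succeed after some time, either ETIMEDOUT or the last error received on this connection is reported.

Some applications need faster error notification. It can be enabled with the IPPROTO_IP level IP_RECVERR socket option. When the option is enabled, all incoming errors are passed to the user program immediately. Use it with care: it makes TCP less tolerant of routing changes and other normal network conditions.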
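Enabling the option is a single setsockopt(2) call at the IPPROTO_IP level, as sketched below. The wrapper name set_recverr is hypothetical, and the sketch assumes an IPv4 (AF_INET) socket on Linux.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Enables (on == 1) or disables (on == 0) immediate error reporting
 * via IP_RECVERR on an AF_INET socket; returns 0 or -1. */
int set_recverr(int fd, int on)
{
    assert(on == 0 || on == 1);
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof on);
}
```

Given the warning above, it seems sensible to enable this only for connections where failing fast matters more than riding out transient network errors.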

四、ERRORS

EAFNOTSUPPORT
   Passed socket address type in sin_family was not AF_INET.
   The address type passed in sin_family was not AF_INET.
   
EPIPE  The other end closed the socket unexpectedly or a read is executed on a shut down socket.
   The other end closed the socket unexpectedly, or a read was executed on a socket that has been shut down.
   
ETIMEDOUT
   The other end didn't acknowledge retransmitted data after some time.
   The other end did not acknowledge retransmitted data after some time.

Any errors defined for ip(7) or the generic socket layer may also be returned for TCP.
Any error defined for ip(7) or the generic socket layer may also be returned for TCP.

五、VERSIONS

Support for Explicit Congestion Notification, zero-copy sendfile(2), reordering support and some SACK extensions (DSACK) were introduced in 2.4.  Support for forward acknowledgement (FACK), TIME_WAIT recycling, and per-connection keepalive socket options were introduced in 2.3.


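Support for Explicit Congestion Notification, zero-copy sendfile(2), reordering, and some SACK extensions (such as DSACK) was introduced in 2.4.
Support for FACK, TIME_WAIT recycling, and per-connection keepalive options was introduced in 2.3.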

六、BUGS

Not all errors are documented.
IPv6 is not described.

七、SEE ALSO

accept(2), bind(2), connect(2), getsockopt(2), listen(2), recvmsg(2), sendfile(2), sendmsg(2), socket(2), ip(7), socket(7)


RFC 793 for the TCP specification.
RFC 1122 for the TCP requirements and a description of the Nagle algorithm.
RFC 1323 for TCP timestamp and window scaling options.
RFC 1337 for a description of TIME_WAIT assassination hazards.
RFC 3168 for a description of Explicit Congestion Notification.
RFC 2581 for TCP congestion control algorithms.
RFC 2018 and RFC 2883 for SACK and extensions to SACK.

八、COLOPHON

This page is part of release 3.53 of the Linux man-pages project.  A description of the project, and information about reporting bugs, can be found at http://www.kernel.org/doc/man-pages/.


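九、List of all /proc options in this article


十、List of all socket options in this article


十一、List of all RFC documents in this article

  • RFC793: TCP protocol specification
  • RFC1122: TCP requirements and the Nagle algorithm
  • RFC1323: timestamps and window scaling
  • RFC1337: TIME_WAIT assassination hazards
  • RFC3168: Explicit Congestion Notification
  • RFC2581: congestion control algorithms
  • RFC2018: SACK
  • RFC2883: extensions to SACK
  • RFC2884: Explicit Congestion Notification
  • RFC2861: slow start after idle
  • RFC4138: F-RTO

More to be added...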


