Optimizing the TCP/IP Checksum

atmgnd

Original article: http://locklessinc.com/articles/tcp_checksum/

TCP/IP Checksum Optimization

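The TCP/IP checksum is used to detect corruption of data sent over TCP/IPv4. If a bit is flipped, or the data is damaged in some other way, the receiver sees that the checksum no longer matches and knows the data is corrupt. This gives an end-to-end integrity check.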

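IPv4 uses the checksum to protect the packet header, i.e. the source, destination, and other metadata. TCP carries an additional checksum that also covers the payload. IPv6 has no header checksum of its own; it assumes that a lower-level or higher-level protocol will provide one.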

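TCP and IPv4 use exactly the same checksum algorithm. The data is processed one word (16 bits, two bytes) at a time. (If the data has an odd number of bytes, a zero byte is appended to make the length even.) First the field that holds the checksum is set to zero. The words are then added using ones-complement arithmetic, and the sum is inverted and stored in that field. If the computation is repeated over the data with the checksum in place, the result is zero.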
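A minimal sketch of the procedure just described, in C. The function names (`ones_sum`, `inet_checksum`, `roundtrip_demo`) and the 8-byte toy header are made up for illustration and do not come from the original article; words are read in network (big-endian) order, and the 32-bit accumulator assumes buffers well under 128 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of 16-bit big-endian words. An odd trailing
 * byte is treated as if it were followed by a zero byte. */
uint16_t ones_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[i] << 8;      /* pad with a zero byte */

    while (sum >> 16)                       /* end-around carry */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* The checksum is the bitwise inverse of the ones-complement sum. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    return (uint16_t)~ones_sum(data, len);
}

/* Usage: fill in the checksum of a toy header whose checksum field is
 * bytes 6..7, then verify it the way a receiver would. */
void roundtrip_demo(void)
{
    uint8_t hdr[8] = { 0x45, 0x00, 0x00, 0x1c, 0x40, 0x11, 0x00, 0x00 };
    uint16_t csum = inet_checksum(hdr, sizeof hdr);  /* field is zero here */

    hdr[6] = (uint8_t)(csum >> 8);
    hdr[7] = (uint8_t)(csum & 0xFF);

    /* Recomputing over the data plus the stored checksum gives zero. */
    assert(inet_checksum(hdr, sizeof hdr) == 0);
}
```

Flipping any single bit of `hdr` after the checksum has been stored makes the final recomputation come out non-zero, which is how the receiver notices the damage.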

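This algorithm has several important properties. First, it is based on addition, so it is associative and commutative. That means we may be able to reorder the computation to make it faster.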

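The second property is more subtle: the result is independent of endianness. Ones-complement addition can be done on a twos-complement machine with ordinary twos-complement addition, provided any carry out of the top bit is added back into the result (an end-around carry). Because the carry wraps around like this, the result is the same whichever byte order we compute in.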
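To make the end-around carry and the byte-order claim concrete, here is a small sketch. `add1c`, `swap16`, `sum_words` and `endian_demo` are hypothetical helper names, not part of the original article. It sums the same bytes once as big-endian words and once as little-endian words; the two results should be byte-swapped copies of each other, so the bytes written back to memory are identical either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement add of two 16-bit values: do an ordinary add in a
 * wider type, then wrap the carry out of bit 15 back into bit 0. */
uint16_t add1c(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFF) + (s >> 16));
}

/* Swap the two bytes of a 16-bit word. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Sum nwords 16-bit words from p, reading each word as big-endian
 * (big_endian != 0) or little-endian (big_endian == 0). */
uint16_t sum_words(const uint8_t *p, size_t nwords, int big_endian)
{
    uint16_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint16_t first = p[2 * i], second = p[2 * i + 1];
        uint16_t w = big_endian ? (uint16_t)((first << 8) | second)
                                : (uint16_t)((second << 8) | first);
        sum = add1c(sum, w);
    }
    return sum;
}

void endian_demo(void)
{
    const uint8_t buf[8] = { 0x12, 0x34, 0xAB, 0xCD, 0xFF, 0xFE, 0x00, 0x01 };
    uint16_t be = sum_words(buf, 4, 1);
    uint16_t le = sum_words(buf, 4, 0);

    /* Same bytes, opposite byte order: the sums differ only by a swap. */
    assert(be == swap16(le));
}
```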

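The last property also follows from the use of addition: the checksum can be updated very cheaply. We only need to apply a ones-complement adjustment for the words that changed. For example, a router that updates the TTL does not have to recompute the checksum over the whole header.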
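As an illustration of such an incremental update, the sketch below adjusts a checksum after a single 16-bit word changes, using the form HC' = ~(~HC + ~m + m') given in RFC 1624. The helper names (`fold32`, `csum_full`, `csum_update16`, `ttl_update_demo`) and the three-word toy header are assumptions for this example, with the TTL placed in the high byte of the word that changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator down to a 16-bit ones-complement sum. */
static uint16_t fold32(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference version: full recomputation over an array of word values. */
uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold32(sum);
}

/* Incremental update when one word changes from old_word to new_word:
 * remove the old word (add its inverse) and add the new one. */
uint16_t csum_update16(uint16_t old_csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold32(sum);
}

void ttl_update_demo(void)
{
    /* Toy header as 16-bit words; words[2] holds the TTL (0x40) in its
     * high byte. */
    uint16_t words[3] = { 0x4500, 0x001c, 0x4011 };
    uint16_t csum = csum_full(words, 3);

    words[2] = 0x3f11;      /* TTL decremented from 0x40 to 0x3f */
    assert(csum_update16(csum, 0x4011, 0x3f11) == csum_full(words, 3));
}
```

The update touches three values regardless of how long the header is, which is why forwarding code can afford to do it per packet.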

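Computing the checksum can become a bottleneck for network applications, which is why so many network cards offer checksum offload in hardware. But there are still plenty of environments where the CPU does the work, and there optimization matters. The rest of this article looks at how to do it.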

The TCP/IP Checksum Algorithm

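In C, unsigned arithmetic is twos-complement (modular), and the representation of signed numbers is not specified. So we cannot compute in ones-complement directly. Nor can we read the carry flag or use an add-with-carry (ADC) instruction. But if we keep the sum in a variable wider than one word, the carries accumulate in the upper bits. Since addition is commutative and associative, we can fold those carries back into the low 16 bits at the end. A straightforward implementation looks like this: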

unsigned short checksum1(const char *buf, unsigned size)
{
    unsigned sum = 0;
    unsigned i;

    /* Accumulate checksum, one 16-bit word at a time.
       Testing i + 1 < size avoids wrap-around when size is 0. */
    for (i = 0; i + 1 < size; i += 2)
    {
        unsigned short word16 = *(const unsigned short *) &buf[i];
        sum += word16;
    }

    /* Handle odd-sized case */
    if (size & 1)
    {
        unsigned short word16 = (unsigned char) buf[i];
        sum += word16;
    }

    /* Fold to get the ones-complement result */
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);

    /* Invert to get the negative in ones-complement arithmetic */
    return ~sum;
}

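To measure performance we run the routine 2^24 times and record the elapsed time. The result obviously depends on the buffer size, so we test three sizes: 64, 1023, and 1024 bytes. The 64-byte case shows the overhead for small buffers, the 1024-byte case shows the throughput for large ones, and the 1023-byte case shows the effect of an odd length. The results: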

Size	64	1023	1024
Time (s)	0.88	10.99	11.03
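The article does not show its timing harness. A rough sketch of how such a measurement could be set up is below, assuming the standard `clock()` timer. `time_checksum`, `simple_checksum`, the `checksum_fn` typedef and the buffer fill pattern are all made up for this example, and absolute numbers will differ by machine and compiler.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

typedef unsigned short (*checksum_fn)(const char *buf, unsigned size);

/* Stand-in routine to time; any function with this signature works. */
unsigned short simple_checksum(const char *buf, unsigned size)
{
    unsigned sum = 0, i;

    for (i = 0; i + 1 < size; i += 2) {
        unsigned short w;
        memcpy(&w, buf + i, sizeof w);  /* alignment-safe 16-bit load */
        sum += w;
    }
    if (size & 1)
        sum += (unsigned char)buf[i];
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (unsigned short)~sum;
}

/* Call fn iters times on a size-byte buffer; return CPU seconds used. */
double time_checksum(checksum_fn fn, unsigned size, unsigned long iters)
{
    static char buf[1024];
    volatile unsigned short sink = 0;   /* keeps the calls from being elided */
    clock_t start;
    unsigned long n;

    assert(size <= sizeof buf);
    for (unsigned i = 0; i < size; i++)
        buf[i] = (char)(i * 31 + 7);    /* arbitrary non-zero test data */

    start = clock();
    for (n = 0; n < iters; n++)
        sink ^= fn(buf, size);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_checksum(simple_checksum, 1023, 1UL << 24)` would correspond to one cell of the table above.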

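The routine above processes two bytes at a time. In general, the more data the processor handles per operation, the better the performance. Our machine is 64-bit, so we would like to process 64 bits at a time. The question is how to handle the carry. Fortunately, detecting a carry in C is not hard: if a + b < a after an unsigned addition, a carry occurred. When that happens we add one to the result. At the end, the 64-bit sum is folded down to 16 bits. The resulting routine might look like this: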

unsigned short checksum2(const char *buf, unsigned size)
{
    unsigned long long sum = 0;
    const unsigned long long *b = (const unsigned long long *) buf;
    unsigned t1, t2;
    unsigned short t3, t4;

    /* Main loop - 8 bytes at a time */
    while (size >= sizeof(unsigned long long))
    {
        unsigned long long s = *b++;
        sum += s;
        if (sum < s) sum++;   /* carry out: wrap it back in */
        size -= 8;
    }

    /* Handle tail less than 8-bytes long */
    buf = (const char *) b;
    if (size & 4)
    {
        unsigned s = *(const unsigned *) buf;
        sum += s;
        if (sum < s) sum++;
        buf += 4;
    }

    if (size & 2)
    {
        unsigned short s = *(const unsigned short *) buf;
        sum += s;
        if (sum < s) sum++;
        buf += 2;
    }

    if (size & 1)
    {
        unsigned char s = *(const unsigned char *) buf;
        sum += s;
        if (sum < s) sum++;
    }

    /* Fold down to 16 bits */
    t1 = sum;
    t2 = sum >> 32;
    t1 += t2;
    if (t1 < t2) t1++;
    t3 = t1;
    t4 = t1 >> 16;
    t3 += t4;
    if (t3 < t4) t3++;

    return ~t3;
}

As expected, this version is considerably faster:

Size	64	1023	1024
Time (s)	0.29	2.90	2.93

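The speedup is not quite linear in the amount of data processed per step, but it is close. Now we want to go faster still. Since C gives us no access to the carry flag, it becomes very hard to express exactly what we want the machine to do. So we switch to assembly language, which lets us use every trick available.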
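For reference, the carry test that checksum2 relies on can be looked at in isolation. The helper below (`add64_1c`, a hypothetical name) is a sketch of a single 64-bit ones-complement addition in portable C, and the assertions show the wrap-around cases. It is this add-then-fix-up pair that the assembly versions express with the add and adc instructions.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit ones-complement add. Unsigned overflow wraps in C, so the
 * wrapped sum is smaller than an operand exactly when a carry left
 * bit 63; in that case the carry is added back in at bit 0. */
uint64_t add64_1c(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;

    if (s < a)
        s++;
    return s;
}

void add64_1c_demo(void)
{
    assert(add64_1c(1, 2) == 3);                    /* no carry */
    assert(add64_1c(UINT64_MAX, 2) == 2);           /* wraps to 1, carry makes 2 */
    assert(add64_1c(UINT64_MAX, UINT64_MAX) == UINT64_MAX);
}
```

The increment can never overflow a second time: after a wrap the sum is at most 2^64 - 2, so adding the carry stays in range.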

Translating checksum2 directly into assembly gives:

        .globl checksum3
        .type checksum3,@function
        .align 16
checksum3:
        xor %eax, %eax
        cmp $8, %esi
        jl 2f

# The main loop
1:      add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi
        sub $8, %esi
        cmp $8, %esi
        jge 1b

# Handle the tail
2:      test $4, %esi
        je 3f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi
3:      test $2, %esi
        je 4f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi
4:      test $1, %esi
        je 5f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

# Fold down to 16-bits
5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax

# Invert to get the final checksum
        not %ax
        retq
        .size checksum3, .-checksum3

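The main loop is quite simple. It uses an add instruction to accumulate into rax, and the adc instruction that follows folds in the carry. The rest of the loop just advances the data pointer and decrements the remaining byte count. After the loop comes the handling of the tail that is not a multiple of 8 bytes. The adc instruction also makes it easy to fold the final result down to 16 bits.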
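The fold at label 5 can be mirrored step by step in C, which may make the register shuffling easier to follow. This is only a sketch: `fold64` and `fold64_demo` are hypothetical names, and each `if` stands in for one `adc $0`.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 64-bit ones-complement accumulator down to 16 bits:
 * first 64 -> 32 bits, then 32 -> 16 bits, wrapping the carry each time. */
uint16_t fold64(uint64_t sum)
{
    uint32_t lo = (uint32_t)sum;            /* mov %eax, %edx */
    uint32_t hi = (uint32_t)(sum >> 32);    /* shr $32, %rax  */

    lo += hi;                               /* add %edx, %eax */
    if (lo < hi)                            /* adc $0, %eax   */
        lo++;

    uint16_t l16 = (uint16_t)lo;            /* mov %eax, %edx */
    uint16_t h16 = (uint16_t)(lo >> 16);    /* shr $16, %eax  */

    l16 = (uint16_t)(l16 + h16);            /* add %dx, %ax   */
    if (l16 < h16)                          /* adc $0, %ax    */
        l16++;
    return l16;
}

void fold64_demo(void)
{
    /* Four 16-bit words 1, 2, 3, 4 packed into one 64-bit value. */
    assert(fold64(0x0001000200030004ULL) == 0x000A);
    /* A carry out of the 32-bit add wraps back into the low bit. */
    assert(fold64(0x00000001FFFFFFFFULL) == 0x0001);
}
```

The routines in the article then apply `not %ax` to this folded value to get the final checksum.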

Since this is a direct translation of the C code, we should not expect much of a speedup. The results:

Size	64	1023	1024
Time (s)	0.28	2.88	2.91

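The hand-written assembly has slightly less overhead than the compiled C, so there is a small improvement. To go faster we need a smarter algorithm. The obvious place to look is the main loop. Notice that we do not have to fold in the carry on every iteration. We can drop the adc $0 from the loop, let each addition consume the carry of the previous one, and handle the final carry with a single adc after the loop ends. The main problem is that the instructions that advance the data pointer and count the iterations would clobber the carry flag. So we use lea to advance the pointer and dec to decrement the loop counter. Neither of these instructions modifies the carry flag.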

        .globl checksum4
        .type checksum4,@function
        .align 16
checksum4:
        mov %esi, %ecx
        xor %eax, %eax

# Divide by 8 to get the total number of iterations
        shr $3, %ecx
        je 2f

# Clear the carry before starting the loop
        clc

# The new smaller main loop
1:      adc (%rdi), %rax
        lea 8(%rdi), %rdi
        dec %ecx
        jne 1b

# Fold in the final carry
        adc $0, %rax

2:      test $4, %esi
        je 3f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi
3:      test $2, %esi
        je 4f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi
4:      test $1, %esi
        je 5f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

5:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
        .size checksum4, .-checksum4

This simplifies the main loop a great deal, so it ought to run much faster. Reality, however, is cruel:

Size	64	1023	1024
Time (s)	0.27	2.87	2.90

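Damn, the change made almost no difference. This suggests that adc is a very fast instruction: the adc in the main loop was already overlapping with the other instructions, so removing it gained very little. If that is what is happening, then loop unrolling should help: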

        .globl checksum5
        .type checksum5,@function
        .align 16
checksum5:
        mov %esi, %ecx
        xor %eax, %eax

# Now handle 16 bytes at a time
        shr $4, %ecx
        je 2f
        clc

# The main loop now uses two 64-bit additions
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        lea 16(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

# We need to handle anything up to 15 tail bytes.
2:      test $8, %esi
        je 3f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi
3:      test $4, %esi
        je 4f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi
4:      test $2, %esi
        je 5f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi
5:      test $1, %esi
        je 6f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

# Since we accumulate with 64-bits still, this doesn't change.
6:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
        .size checksum5, .-checksum5

This time our suspicion is confirmed, and performance improves substantially:

Size	64	1023	1024
Time (s)	0.20	1.54	1.56

The natural next question is whether unrolling further helps even more, given that unrolling by two did. Unrolling by four gives:

        .globl checksum6
        .type checksum6,@function
        .align 16
checksum6:
        mov %esi, %ecx
        xor %eax, %eax

# 32 bytes at a time
        shr $5, %ecx
        je 2f
        clc

# Let's make sure our quite-large loop is now aligned
        .align 16

# Four 64-bit adds per iteration
1:      adc (%rdi), %rax
        adc 8(%rdi), %rax
        adc 16(%rdi), %rax
        adc 24(%rdi), %rax
        lea 32(%rdi), %rdi
        dec %ecx
        jne 1b
        adc $0, %rax

# Handle the 31 bytes or less remaining
2:      test $16, %esi
        je 3f
        add (%rdi), %rax
        adc 8(%rdi), %rax
        adc $0, %rax
        add $16, %rdi
3:      test $8, %esi
        je 4f
        add (%rdi), %rax
        adc $0, %rax
        add $8, %rdi
4:      test $4, %esi
        je 5f
        movl (%rdi), %edx
        add %rdx, %rax
        adc $0, %rax
        add $4, %rdi
5:      test $2, %esi
        je 6f
        xor %edx, %edx
        movw (%rdi), %dx
        add %rdx, %rax
        adc $0, %rax
        add $2, %rdi
6:      test $1, %esi
        je 7f
        xor %edx, %edx
        movb (%rdi), %dl
        add %rdx, %rax
        adc $0, %rax

7:      mov %eax, %edx
        shr $32, %rax
        add %edx, %eax
        adc $0, %eax
        mov %eax, %edx
        shr $16, %eax
        add %dx, %ax
        adc $0, %ax
        not %ax
        retq
        .size checksum6, .-checksum6

Yes, performance improves again, though not by as much as last time. It is still worth having:

Size	64	1023	1024
Time (s)	0.18	1.10	1.08

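Unrolling further does not seem to help. To go faster we need a different kind of optimization. Loop unrolling lengthened the chain of dependent instructions, since each adc has to wait for the carry from the one before it. If we can break up that dependency chain, the CPU may be able to extract more parallelism. Unfortunately, Intel gives us only one carry flag. But we can still borrow a trick from the original C code: by using lea instructions to accumulate into a wider register, we can do more of the additions in parallel without them interfering with one another. The code looks like this: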

        .globl checksum7
        .type checksum7,@function
        .align 16
checksum7:
        mov %esi, %ecx
        xor %eax, %eax
        shr $5, %ecx
        je 2f

# Use %r8 to accumulate as well
        xor %r8, %r8
        .align 16
1:      movl 24(%rdi), %edx