Checksum tools
It is common practice to compute checksums of files to verify their integrity. For large files, computing the checksum is slow. I wondered why it is so slow and whether a different tool would do better. In this post, I try three common tools, md5sum, sha1sum and crc32, on a relatively large file to see which checksum tool on Linux is faster and to help decide which one to use.
The file to be checksummed is a 15GB text file:
$ ls -lha wiki.txt
-rw-r--r-- 1 zma zma 15G Jun 14 10:28 wiki.txt
The performance
Now, let's see how the three tools perform when computing the checksum of the file.
sha1sum speed
$ time sha1sum wiki.txt
251dcb5c08c6a2fabd258f2c8a9b95e15c0cc098 wiki.txt
real 1m21.143s
user 0m21.647s
sys 0m4.668s
crc32 speed
$ time crc32 wiki.txt
0080f7a1
real 1m21.051s
user 0m16.194s
sys 0m4.890s
md5sum speed
$ time md5sum wiki.txt
e2e649030c795ffa9f33a99bcb39dde7 wiki.txt
real 1m27.392s
user 0m25.563s
sys 0m3.936s
Summary
From the results, crc32 is the fastest, but only a tiny bit faster than sha1sum and md5sum. md5sum is the slowest, but only by a little.
Why is there not much difference? To compute the checksums, the tools need to read the file and do the computation. Now, let's check how much time is needed just to read the file content.
$ time dd if=wiki.txt of=/dev/null bs=8192
1953039+1 records in
1953039+1 records out
15999296457 bytes (16 GB) copied, 80.4203 s, 199 MB/s
real 1m20.447s
user 0m0.202s
sys 0m7.091s
The I/O read speed is around 200MB/s. That's not bad for a single magnetic disk.
So almost all of the time is spent reading the file content. The algorithms and the tools themselves are not the limiting factor yet; the disk I/O speed is.
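One way to see this in the numbers is to compare the CPU time (user + sys) with the wall-clock time (real) of each run. The following is a minimal sketch, not part of the original test; the helper name `cpu_share` is made up for illustration, and the timings are the ones reported above, converted to seconds by hand.

```shell
# cpu_share REAL USER SYS   (hypothetical helper, all values in seconds)
# Prints the percentage of wall-clock time spent on the CPU,
# i.e. 100 * (user + sys) / real, rounded to a whole number.
cpu_share() {
    awk -v real="$1" -v user="$2" -v sys="$3" \
        'BEGIN { printf "%.0f\n", 100 * (user + sys) / real }'
}

# Timings from the three checksum runs on wiki.txt.
echo "sha1sum: $(cpu_share 81.143 21.647 4.668)%"
echo "crc32:   $(cpu_share 81.051 16.194 4.890)%"
echo "md5sum:  $(cpu_share 87.392 25.563 3.936)%"
```

Each tool keeps the CPU busy for roughly a third of the elapsed time or less, which is consistent with the runs mostly waiting on the disk.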
The conclusion: on a relatively modern computer, use whichever tool works best for you without worrying much about the speed (it still takes time). You may need to be aware of collisions for these algorithms; see Simard's comment. If you want higher speed, improve your I/O speed first, until the CPU becomes the bottleneck (CPU usage reaches 100%).
What if I/O was not the bottleneck
Pádraig comments that we can avoid the I/O and measure the computational cost. I slightly changed the suggested command to checksum a file under /dev/shm/, because crc32 does not accept input from STDIN. The system is the same one on which I ran the previous tests. It could only support 3GB at the time I did this test. The results are as follows.
[zma@host:/dev/shm]$ head -c 3G /dev/zero >test
[zma@host:/dev/shm]$ for chk in crc32 md5sum sha1sum ; do echo $chk; time $chk test; done
crc32
480bbe37
real 0m3.411s
user 0m2.931s
sys 0m0.482s
md5sum
c698c87fb53058d493492b61f4c74189 test
real 0m5.103s
user 0m4.697s
sys 0m0.409s
sha1sum
6e7f6dca8def40df0b21f58e11c1a41c3e000285 test
real 0m4.451s
user 0m4.082s
sys 0m0.372s
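For the tools that do read from standard input (md5sum and sha1sum), the file under /dev/shm/ can be skipped entirely by piping the zeros straight into the tool. This is only a sketch under that assumption and is not necessarily the command Pádraig suggested; `sum_zeros` is a made-up helper name, and the size suffixes rely on GNU head.

```shell
# sum_zeros TOOL SIZE   (hypothetical helper)
# Feeds SIZE bytes of zeros to TOOL on standard input, so no file
# has to be stored anywhere. SIZE accepts the suffixes that GNU head
# understands (for example 1M or 3G).
sum_zeros() {
    head -c "$2" /dev/zero | "$1"
}

# Quick check on a small input; pass 3G to mirror the test above.
sum_zeros md5sum 1M
sum_zeros sha1sum 1M
```

In bash, a run such as `time sum_zeros md5sum 3G` would time the pipeline as a whole, which includes the small cost of head producing the data.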
To summarize the speeds, taking md5sum as the baseline:
md5sum: 1.00x
crc32: 1.50x
sha1sum: 1.15x
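These ratios are the baseline's "real" time divided by each tool's "real" time. A small sketch of that arithmetic, using the timings from the /dev/shm run (the helper name `speedup` is made up for illustration):

```shell
# speedup BASELINE_SECONDS TOOL_SECONDS   (hypothetical helper)
# Prints how many times faster the tool is than the baseline.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2fx\n", base / t }'
}

# "real" times from the /dev/shm run, with md5sum as the baseline.
echo "md5sum:  $(speedup 5.103 5.103)"
echo "crc32:   $(speedup 5.103 3.411)"
echo "sha1sum: $(speedup 5.103 4.451)"
```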
crc32 is the fastest here. It is a Perl 5 program that uses Archive::Zip::computeCRC32() to compute the CRC32.
The throughput of md5sum here is above 600MB/s. That is not out of reach for an SSD or a RAID of SSDs. On the system I tested, if the I/O were much faster, the computation would likely account for much of the time spent.
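The throughput figure follows from the file size and the "real" time. The sketch below redoes that division, assuming the 3G given to GNU head means 3 × 1024³ bytes and counting 1 MB as 1,000,000 bytes; `throughput_mb` is a made-up helper name.

```shell
# throughput_mb BYTES SECONDS   (hypothetical helper)
# Prints throughput in MB/s (1 MB = 1,000,000 bytes), rounded.
throughput_mb() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.0f\n", bytes / secs / 1000000 }'
}

size=3221225472   # 3 * 1024^3 bytes, as written by head -c 3G

echo "md5sum:  $(throughput_mb "$size" 5.103) MB/s"
echo "sha1sum: $(throughput_mb "$size" 4.451) MB/s"
echo "crc32:   $(throughput_mb "$size" 3.411) MB/s"
```

The same helper applied to the dd run, `throughput_mb 15999296457 80.4203`, gives the 199 MB/s that dd itself reported.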
CPU model and versions of checksum tools used
Here are the CPU model and the versions of the checksum tools used during the test.
$ lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
$ md5sum --version
md5sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper, Scott Miller, and David Madore.
$ sha1sum --version
sha1sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper, Scott Miller, and David Madore.
$ rpm -qf `which crc32`
perl-Archive-Zip-1.46-1.fc22.noarch
Translated from: https://www.systutorials.com/which-checksum-tool-on-linux-is-faster/