A detailed guide to ignoring comments when using diff to compare source code (such as C), with applications

Latest recommended article published on 2023-11-17 23:42:54

蛐蛐蛐

Categories: Research Tools, Python Tips

Article link: https://blog.csdn.net/qysh123/article/details/107157815

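Here I want to summarize this problem systematically. When we measure the differences between versions of a source file, we naturally want to ignore all comments. I wrote an earlier post on this (https://blog.csdn.net/qysh123/article/details/72866251), but today I found that summary was not careful enough, so here is a more thorough one. It builds on the example from that post, made slightly more complex. Suppose I have two C files, a.c and b.c: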

a.c:

test

//command
/*1
 *1
*1
 */

1

b.c:

test


//command2

/*2
 *2
*2
 */

12

We want diff to report only the difference between the last line of each file. On Ubuntu we can write:

diff -B --ignore-matching-lines='^\(//\| \*\|\*\|/\*\)' a.c b.c

The result is:

9c11
< 1
---
> 12

This means line 9 of a.c has to change to become line 11 of b.c (for details on this output format, see https://www.cnblogs.com/moxiaopeng/articles/4853352.html).

So the result is correct in this case: --ignore-matching-lines and the regular expression after it did filter out the comment lines (the regex rules are explained in my earlier post: https://blog.csdn.net/qysh123/article/details/72866251).
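To try this locally, the two files can be recreated with a short script. This is only a minimal sketch: the scratch directory and the variable names workdir and out are my own choices, not part of the original example, and it assumes GNU diff (for -B and --ignore-matching-lines).

```shell
# Recreate the two sample files in a scratch directory (hypothetical location).
workdir=$(mktemp -d)

# a.c: 9 lines, with "1" on line 9
printf '%s\n' 'test' '' '//command' '/*1' ' *1' '*1' ' */' '' '1' > "$workdir/a.c"

# b.c: 11 lines, with "12" on line 11
printf '%s\n' 'test' '' '' '//command2' '' '/*2' ' *2' '*2' ' */' '' '12' > "$workdir/b.c"

# Same options and pattern as the command above. diff exits with status 1 when
# it reports differences, so "|| true" keeps the script going under "set -e".
out=$(diff -B --ignore-matching-lines='^\(//\| \*\|\*\|/\*\)' "$workdir/a.c" "$workdir/b.c" || true)
printf '%s\n' "$out"

rm -rf "$workdir"
```

The printed hunk should contain only the lines "1" and "12"; none of the comment lines should appear.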

But things are not that simple. Suppose we change b.c to this:

test


//command2

2

/*2
 *2
*2
 */

12

We still expect the regular expression to filter out the comments, but this time the result is:

3,6c3,10
< //command
< /*1
<  *1
< *1
---
> 
> //command2
> 
> 2
> 
> /*2
>  *2
> *2
9c13
< 1
---
> 12

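In other words, the regular expression now seems to have stopped working altogether, which is very counterintuitive. The behavior is discussed here: https://stackoverflow.com/questions/2747091/how-to-ignore-lines-starting-with-a-string-with-diff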

However, -I only ignores the insertion or deletion of lines that contain the regular expression if every changed line in the hunk (every insertion and every deletion) matches the regular expression.

In other words, for each non-ignorable change, diff prints the complete set of changes in its vicinity, including the ignorable ones. You can specify more than one regular expression for lines to ignore by using more than one -I option. diff tries to match each line against each regular expression, starting with the last one given.

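Put simply, --ignore-matching-lines does not filter line by line. It filters by hunk (a block of adjacent changes as diff sees it): diff tests every inserted and deleted line in the hunk against the regular expression, and it ignores the hunk only if every one of those lines matches. In the modified b.c, the new code line 2 falls into the same hunk as the comment lines. It does not match, so the whole hunk is printed, comments included. This is really unintuitive!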
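The hunk rule is easy to see on two tiny pairs of files. The sketch below is my own illustration (the file names x1.c, y1.c, x2.c, y2.c and their contents are made up); it uses -I, the short form of --ignore-matching-lines that the quoted text refers to, and assumes GNU diff.

```shell
workdir=$(mktemp -d)

# Case 1: the only change is a comment line, so every changed line in the
# hunk matches '^//' and the hunk is ignored.
printf '%s\n' 'int a;' '// old note' 'int b;' > "$workdir/x1.c"
printf '%s\n' 'int a;' '// new note' 'int b;' > "$workdir/y1.c"
only_comment=$(diff -I '^//' "$workdir/x1.c" "$workdir/y1.c" || true)

# Case 2: a code line changes right next to the comment. Both changes land in
# the same hunk, one line does not match, and diff prints the whole hunk.
printf '%s\n' 'int a;' '// old note' 'int c = 1;' 'int b;' > "$workdir/x2.c"
printf '%s\n' 'int a;' '// new note' 'int c = 2;' 'int b;' > "$workdir/y2.c"
mixed=$(diff -I '^//' "$workdir/x2.c" "$workdir/y2.c" || true)

printf 'comment-only hunk: [%s]\n' "$only_comment"
printf 'mixed hunk:\n%s\n' "$mixed"

rm -rf "$workdir"
```

The first result is empty, while the second still lists the two comment lines alongside the code change. This is the same effect as in the a.c and b.c example.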

The second-to-last answer to the same question gave me an idea: https://stackoverflow.com/questions/2747091/how-to-ignore-lines-starting-with-a-string-with-diff

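Still using the modified b.c above, if we filter the comment lines out with grep before diffing (the <(...) syntax is bash process substitution):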

diff -B <(grep -v '^\(//\| \*\|\*\|/\*\)' a.c) <(grep -v '^\(//\| \*\|\*\|/\*\)' b.c)

we get the following result:

4c4,8
< 1
---
> 
> 2
> 
> 
> 12

This meets our needs fairly well. Building on it, how do we count the number of lines that differ?

I found this discussion: https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only/27236972. Again the second answer is the more reliable one (apparently the second answer on StackOverflow is often the correct one).

So we run:

diff -B -y --suppress-common-lines <(grep -v '^\(//\| \*\|\*\|/\*\)' a.c) <(grep -v '^\(//\| \*\|\*\|/\*\)' b.c) | wc -l

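The output is 5: the five rows of the side-by-side output correspond to the 4c4,8 hunk shown above. For the meaning of the various diff options, see https://www.cnblogs.com/peida/archive/2012/12/12/2814048.html. In my view this now does what we wanted reasonably well.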
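For repeated use, the filter-then-diff steps can be wrapped in a small function. This is a sketch under my own assumptions: the names COMMENT_RE, count_changed_lines and _ccl_tmp are hypothetical, and temporary files replace process substitution so that it also runs in a plain POSIX sh. It still relies on GNU grep and GNU diff for the pattern syntax and options used above.

```shell
# Pattern from the commands above, kept in one place.
COMMENT_RE='^\(//\| \*\|\*\|/\*\)'

# count_changed_lines OLD NEW
# Prints the number of side-by-side rows that differ once comment-looking
# lines have been filtered out of both files.
count_changed_lines() {
    _ccl_tmp=$(mktemp -d)
    # grep exits with status 1 when every line is filtered out; that is fine here.
    grep -v "$COMMENT_RE" "$1" > "$_ccl_tmp/old" || true
    grep -v "$COMMENT_RE" "$2" > "$_ccl_tmp/new" || true
    # tr strips the padding that some wc implementations add around the number.
    diff -B -y --suppress-common-lines "$_ccl_tmp/old" "$_ccl_tmp/new" | wc -l | tr -d ' '
    rm -rf "$_ccl_tmp"
}
```

It would be called as count_changed_lines a.c b.c.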

Update, July 7, 2020:

The grep pattern above does not filter out this case:

       /*
        * Defining _WIN32_WINNT here in e_os.h implies certain "discipline."
        * Most notably we ought to check for availability of each specific
        * routine with GetProcAddress() and/or guard NT-specific calls with
        * GetVersion() < 0x80000000. One can argue that in latter "or" case
        * we ought to /DELAYLOAD some .DLLs in order to protect ourselves
        * against run-time link errors. This doesn't seem to be necessary,
        * because it turned out that already Windows 95, first non-NT Win32
        * implementation, is equipped with at least NT 3.51 stubs, dummy
        * routines with same name, but which do nothing. Meaning that it's
        * apparently sufficient to guard "vanilla" NT calls with GetVersion
        * alone, while NT 4.0 and above interfaces ought to be linked with
        * GetProcAddress at run-time.
        */

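That is, cases where /*, * or */ is preceded by several spaces, and likewise // preceded by several spaces. Here \s\+ can be used to match one or more whitespace characters, as described at https://blog.csdn.net/tterminator/article/details/52792959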

The grep command then becomes:

grep -v '^\(//\|\*\|/\*\|\s\+//\|\s\+\*\|\s\+/\*\)' a.c

Does it look complicated? A little patient analysis makes it clear. That is all for this update.
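The six alternatives can also be folded into "optional leading whitespace, then //, /* or *". The sketch below is my own rewrite, not the original command: COMMENT_RE_COMPACT and the sample lines are hypothetical, it uses grep -E, and it spells \s as the POSIX class [[:space:]].

```shell
# Optional leading whitespace, then "//", "/*" or "*".
# -E enables "|" alternation without backslashes.
COMMENT_RE_COMPACT='^[[:space:]]*(//|/\*|\*)'

sample=$(mktemp)
printf '%s\n' \
    '       /*' \
    '        * indented block comment' \
    '        */' \
    '    // indented line comment' \
    'int x = 1; /* trailing comment stays */' \
    '*p = 2;' > "$sample"

# Keep only the lines that do not look like comment lines.
kept=$(grep -Ev "$COMMENT_RE_COMPACT" "$sample")
printf '%s\n' "$kept"

rm -f "$sample"
```

The sample also shows two limits of any line-based filter like this one: a comment that trails code on the same line is kept, and a code line that happens to start with * (such as a pointer dereference) is dropped along with the comments.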
