Computing the Intersection and Difference of Two Files on Linux

北纬34度停留

Article link: https://blog.csdn.net/fsx2550553488/article/details/80837647

Intersection and difference of two files

Take two files, for example:

[root@localhost grep]# cat 1.txt
a
b
c
a
d
aa
bb
aa
[root@localhost grep]# cat 2.txt
a
b
c
bb
fsx
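
To follow along, the two sample files can be recreated with the sketch below. The scratch directory created with mktemp is an assumption of this sketch, chosen so that no existing files get overwritten; the original session simply works in its current directory.

```shell
# Work in a scratch directory (an assumption of this sketch).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# printf '%s\n' writes each argument on its own line.
printf '%s\n' a b c a d aa bb aa > 1.txt
printf '%s\n' a b c bb fsx > 2.txt

cat 1.txt
cat 2.txt
```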

Analysis of 1.txt and 2.txt (entries in parentheses appear only because 1.txt contains duplicate lines):

1.txt - 2.txt	(a) d aa (aa)
2.txt - 1.txt	fsx
1.txt ∩ 2.txt	a b c bb

The comm command

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

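comm compares two files and writes its result in three columns: column 1 holds the lines found only in FILE1 (FILE1 - FILE2), column 2 holds the lines found only in FILE2 (FILE2 - FILE1), and column 3 holds the lines common to both (the intersection).

comm requires both inputs to be sorted. It does not itself require the lines to be unique, but the result only behaves like a set operation once duplicates have been removed.

If comm is run directly on unsorted input, it prints a warning (duplicate lines alone do not trigger one):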

[root@localhost grep]# comm 1.txt 2.txt 
        a
        b
        c
a
    bb
d
comm: file 1 is not in sorted order
aa
bb
aa
    fsx
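
Whether a file is already sorted can be checked up front with `sort -c`, which exits with a non-zero status on the first out-of-order line. The sketch below is only an illustration: `check_sorted` is a hypothetical helper name, and forcing `LC_ALL=C` is an assumption made here so that the ordering is byte-wise and reproducible (sort and comm should always run under the same locale).

```shell
# Recreate the sample files in a scratch directory (assumption of this sketch).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c a d aa bb aa > 1.txt
printf '%s\n' a b c bb fsx > 2.txt

# check_sorted is a hypothetical helper: it reports whether a file is sorted.
check_sorted() {
    if LC_ALL=C sort -c "$1" 2>/dev/null; then
        echo "$1: sorted"
    else
        echo "$1: not sorted"
    fi
}

check_sorted 1.txt
check_sorted 2.txt
```

Both sample files are reported as not sorted: 1.txt has `a` after `c`, and 2.txt has `bb` after `c`.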

Sort first with sort, then compare

[root@localhost grep]# comm <(sort 1.txt) <(sort 2.txt)
        a
a
aa
aa
        b
        bb
        c
d
    fsx
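Column 1: lines in 1.txt with no counterpart in 2.txt (including surplus duplicate copies)
Column 2: lines in 2.txt with no counterpart in 1.txt
Column 3: lines common to 1.txt and 2.txt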

Deduplicate with uniq

[root@localhost grep]# comm <(sort 1.txt |uniq) <(sort 2.txt | uniq)
        a
aa
        b
        bb
        c
d
    fsx
Column 1: lines that are in 1.txt but not in 2.txt
Column 2: lines that are in 2.txt but not in 1.txt
Column 3: lines common to 1.txt and 2.txt
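
To extract a single set instead of reading three columns, comm accepts the options -1, -2 and -3, each of which suppresses the corresponding column. The following is a minimal sketch under a few assumptions: the sorted copies are written to files named 1.sorted and 2.sorted (hypothetical names) instead of using process substitution, `sort -u` stands in for `sort | uniq`, and `LC_ALL=C` is set for a reproducible order.

```shell
export LC_ALL=C
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c a d aa bb aa > 1.txt
printf '%s\n' a b c bb fsx > 2.txt

# Sort and deduplicate both inputs first.
sort -u 1.txt > 1.sorted
sort -u 2.txt > 2.sorted

# -23 hides columns 2 and 3: lines only in 1.txt (1.txt - 2.txt).
only_in_1=$(comm -23 1.sorted 2.sorted)
# -13 hides columns 1 and 3: lines only in 2.txt (2.txt - 1.txt).
only_in_2=$(comm -13 1.sorted 2.sorted)
# -12 hides columns 1 and 2: lines in both files (intersection).
in_both=$(comm -12 1.sorted 2.sorted)

printf '1.txt - 2.txt:\n%s\n' "$only_in_1"
printf '2.txt - 1.txt:\n%s\n' "$only_in_2"
printf 'intersection:\n%s\n' "$in_both"
```

With the sample files this yields aa and d for 1.txt - 2.txt, fsx for 2.txt - 1.txt, and a, b, bb, c for the intersection, matching the three columns shown above.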

Difference

The grep command

NAME
       grep, egrep, fgrep - print lines matching a pattern

SYNOPSIS
       grep [OPTIONS] PATTERN [FILE...]
       grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

DESCRIPTION
       grep  searches the named input FILEs (or standard input if no files are
       named, or if a single hyphen-minus (-) is given as file name) for lines
       containing  a  match to the given PATTERN.  By default, grep prints the
       matching lines.
       
       -F, --fixed-strings, --fixed-regexp
              Interpret  PATTERN  as  a  list  of  fixed strings, separated by
              newlines, any of which is to be matched.  (-F  is  specified  by
              POSIX,  --fixed-regexp  is an obsoleted alias, please do not use
              it in new scripts.)
       -f FILE, --file=FILE
              Obtain patterns  from  FILE,  one  per  line.   The  empty  file
              contains  zero  patterns, and therefore matches nothing.  (-f is
              specified by POSIX.)
       -v, --invert-match
              Invert the sense of matching, to select non-matching lines.  (-v
              is specified by POSIX.)
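
The effect of these three options can be seen on a tiny input. This is only an illustrative sketch: the two sample lines `a.c` and `abc` and the file name patterns.txt are assumptions made here and do not come from the article.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'a.c' 'abc' > input.txt
printf '%s\n' 'a.c' > patterns.txt

# As a regular expression, "." matches any character: both lines match.
regex_hits=$(grep 'a.c' input.txt)
# With -F the pattern is a literal string: only the line "a.c" matches.
fixed_hits=$(grep -F 'a.c' input.txt)
# -f reads the patterns from a file, one per line.
file_hits=$(grep -F -f patterns.txt input.txt)
# -v inverts the selection: only the non-matching line "abc" remains.
inverted=$(grep -F -v -f patterns.txt input.txt)

printf 'regex: %s\n' "$regex_hits"
printf 'fixed: %s\n' "$fixed_hits"
printf 'file: %s\n' "$file_hits"
printf 'inverted: %s\n' "$inverted"
```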

Intersection

[root@localhost grep]# grep -F -f 1.txt 2.txt | sort | uniq
a
b
bb
c

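grep does not require sorted input, but because this is a set operation the result must be unique, hence the trailing sort | uniq.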
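
One caveat worth noting: -F on its own matches substrings, so the pattern `a` also matches the line `aa`. The command above happens to give the right answer for these files, but swapping the two files does not. The sketch below adds grep's -x option (match whole lines only), which the article does not use; treat it as a suggested hardening, with `LC_ALL=C` and `sort -u` as further assumptions of this sketch.

```shell
export LC_ALL=C
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c a d aa bb aa > 1.txt
printf '%s\n' a b c bb fsx > 2.txt

# Substring matching: pattern "a" from 2.txt also selects "aa" in 1.txt.
loose=$(grep -F -f 2.txt 1.txt | sort -u)
# Whole-line matching with -x: only lines identical to a pattern are selected.
strict=$(grep -F -x -f 2.txt 1.txt | sort -u)

printf 'without -x:\n%s\n' "$loose"
printf 'with -x:\n%s\n' "$strict"
```

Without -x the result wrongly contains aa, which is not in 2.txt; with -x the result is a, b, bb, c, the same intersection that comm reports.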

Difference
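
A minimal sketch of the difference, built from the -v option described earlier: the lines of one file are kept only if they match none of the patterns taken from the other. The -x option (whole-line matching), `sort -u` and `LC_ALL=C` are assumptions of this sketch, not part of the article's commands.

```shell
export LC_ALL=C
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' a b c a d aa bb aa > 1.txt
printf '%s\n' a b c bb fsx > 2.txt

# 1.txt - 2.txt: lines of 1.txt that equal no line of 2.txt.
diff_1_minus_2=$(grep -F -x -v -f 2.txt 1.txt | sort -u)
# 2.txt - 1.txt: lines of 2.txt that equal no line of 1.txt.
diff_2_minus_1=$(grep -F -x -v -f 1.txt 2.txt | sort -u)
# For comparison, without -x the pattern "a" also removes "aa".
loose_diff=$(grep -F -v -f 2.txt 1.txt | sort -u)

printf '1.txt - 2.txt:\n%s\n' "$diff_1_minus_2"
printf '2.txt - 1.txt:\n%s\n' "$diff_2_minus_1"
printf '1.txt - 2.txt without -x:\n%s\n' "$loose_diff"
```

With the sample files, 1.txt - 2.txt is aa and d, and 2.txt - 1.txt is fsx. Without -x only d survives, because aa contains the pattern a as a substring.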