Linux: removing lines from one file that appear in another file

yc_starlight

Last modified 2023-08-31 12:21:51

Category: linux; tags: linux

First published 2023-04-24 20:48:54

Article link: https://blog.csdn.net/qq_40837206/article/details/130348707

I. Background

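At work I needed to remove from one large file the lines that also appear in another. Doing this in Python was slow, and using Linux commands directly turned out to be more efficient. Suppose we have two files, a.txt: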

1212
asdsa
cdsa
cd121

and b.txt:

1212
asds
cdsa
cd12

II. Solutions

1: grep

grep --help

Below are some of grep's options:

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      --help                display this help text and exit

Output control:
  -m, --max-count=NUM       stop after NUM matches
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print the file name for each match
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the standard input file name prefix
  -o, --only-matching       show only the part of a line matching PATTERN
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text
  -I                        equivalent to --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is 'read', 'recurse', or 'skip'
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is 'read' or 'skip'
  -r, --recursive           like --directories=recurse
  -R, --dereference-recursive
                            likewise, but follow all symlinks
      --include=FILE_PATTERN
                            search only files that match FILE_PATTERN
      --exclude=FILE_PATTERN
                            skip files and directories matching FILE_PATTERN
      --exclude-from=FILE   skip files matching any file pattern from FILE
      --exclude-dir=PATTERN directories that match PATTERN will be skipped.
  -L, --files-without-match print only names of FILEs containing no match
  -l, --files-with-matches  print only names of FILEs containing matches
  -c, --count               print only a count of matching lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --group-separator=SEP use SEP as a group separator
      --no-group-separator  use empty string as a group separator
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is 'always', 'never', or 'auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS/Windows)
  -u, --unix-byte-offsets   report offsets as if CRs were not there
                            (MSDOS/Windows)

'egrep' means 'grep -E'.  'fgrep' means 'grep -F'.
Direct invocation as either 'egrep' or 'fgrep' is deprecated.
When FILE is -, read standard input.  With no FILE, read . if a command-line
-r is given, - otherwise.  If fewer than two FILEs are given, assume -h.
Exit status is 0 if any line is selected, 1 otherwise;
if any error occurs and -q is not given, the exit status is 2.

Report bugs to: bug-grep@gnu.org
GNU Grep home page: <http://www.gnu.org/software/grep/>
General help using GNU software: <http://www.gnu.org/gethelp/>

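Three options are used here: -v (select non-matching lines), -w (match whole words only), and -f (read the patterns from a file).

# Lines in b.txt that are not in a.txt
grep -vwf a.txt b.txt
# Output:
# asds
# cd12

To get the lines common to both files instead, drop -v: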

grep -wf a.txt b.txt
# 1212
# cdsa

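Note: each line of a.txt is interpreted as a regular expression, so special characters such as \ can cause errors or unexpected matches. Adding -F makes grep treat the patterns as fixed strings. Also, -w matches whole words rather than whole lines; use -x when a line should only count as a match if it is identical to a line of a.txt.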
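
The note above about special characters can be illustrated with a minimal sketch. It assumes a grep that supports the standard -F (fixed strings) and -x (whole-line match) options; the temporary directory and the sample lines are invented for illustration and are not the article's data.

```shell
# Hypothetical sample data in a temporary directory (not from the article)
tmp=$(mktemp -d)
printf '%s\n' 'C:\temp' 'cd' > "$tmp/a.txt"
printf '%s\n' 'C:\temp' 'cd 12' 'cd' 'x[1' > "$tmp/b.txt"

# -F: treat each line of a.txt as a literal string, not a regular expression
# -x: a pattern must match the entire line, not just a word inside it
only_in_b=$(grep -vxFf "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$only_in_b"   # prints "cd 12" and "x[1"

# For comparison: with -w, "cd 12" is also dropped because it contains the word "cd"
with_w=$(grep -vwFf "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$with_w"      # prints "x[1"

rm -rf "$tmp"
```

Whether -w or -x is the right choice depends on whether "the same content" means the same word or the same whole line; for line-by-line deduplication -x is usually the closer fit.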

2: awk

awk --help

Below are some of awk's options:

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:		GNU long options: (standard)
	-f progfile		--file=progfile
	-F fs			--field-separator=fs
	-v var=val		--assign=var=val
Short options:		GNU long options: (extensions)
	-b			--characters-as-bytes
	-c			--traditional
	-C			--copyright
	-d[file]		--dump-variables[=file]
	-e 'program-text'	--source='program-text'
	-E file			--exec=file
	-g			--gen-pot
	-h			--help
	-L [fatal]		--lint[=fatal]
	-n			--non-decimal-data
	-N			--use-lc-numeric
	-O			--optimize
	-p[file]		--profile[=file]
	-P			--posix
	-r			--re-interval
	-S			--sandbox
	-t			--lint-old
	-V			--version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
	gawk '{ sum += $1 }; END { print sum }' file
	gawk -F: '{ print $1 }' /etc/passwd

# Lines in b.txt that are not in a.txt
awk 'NR==FNR{a[$0]=1}NR>FNR{if(a[$0]!=1)print}' a.txt b.txt
# Output:
# asds
# cd12

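NR==FNR{a[$0]=1} runs while the first file (a.txt) is being read: each line is stored as a key of the array a with the value 1.
NR>FNR{if(a[$0]!=1)print} runs while the second file (b.txt) is being read: a line is printed if its value in a is not 1, that is, if it does not appear in a.txt.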

To get the lines common to both files instead, change != to ==:

awk 'NR==FNR{a[$0]=1}NR>FNR{if(a[$0]==1)print}' a.txt b.txt
# 1212
# cdsa
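
One edge case is worth checking before relying on the NR==FNR idiom: if a.txt is empty, NR==FNR remains true while b.txt is being read, so the command prints nothing instead of every line of b.txt. The sketch below shows the effect and one possible workaround, comparing FILENAME with ARGV[1]. The sample data and variable names are assumptions made for illustration.

```shell
tmp=$(mktemp -d)
: > "$tmp/a.txt"                          # empty exclusion file
printf '%s\n' 1212 asds > "$tmp/b.txt"

# With an empty first file, NR==FNR is still true for b.txt, so nothing is printed
naive=$(awk 'NR==FNR{a[$0]=1}NR>FNR{if(a[$0]!=1)print}' "$tmp/a.txt" "$tmp/b.txt")

# FILENAME==ARGV[1] is true only for records of the first file, even when it is empty
robust=$(awk 'FILENAME==ARGV[1] { seen[$0]; next } !($0 in seen)' "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$robust"                   # prints "1212" and "asds"

rm -rf "$tmp"
```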

References
https://blog.csdn.net/shishui07/article/details/52775361
https://blog.csdn.net/ysdaniel/article/details/7988140
