Shell scripts: reading a file into an array, array union/intersection/complement, and merging and deduplicating files

yogima

Last modified 2024-05-10 16:24:47

License: CC 4.0 BY-SA

First published 2023-02-01 16:16:46

Article link: https://blog.csdn.net/yogima/article/details/128833660

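This article shows how to read a file line by line in a shell script, store the contents in an array, iterate over the array, and compute the union, intersection, complement, and difference of two arrays. It also discusses compatibility problems with arrays and file operations across shell versions, and gives two ways to produce the difference of two files.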

1. Reading a file line by line

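#!/bin/bash

# Read one line at a time. A `for line in $(cat file)` loop would split on
# every space and tab as well as on newlines.
while IFS= read -r line
do
    echo "$line"
done < /tmp/readTest.txt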

The contents of readTest.txt are:

111
222
333
444

Running the script then prints the same lines:

111
222
333
444
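To see why the choice of loop matters, the sketch below compares a `for` loop over `$(cat ...)` with a `while read` loop on a file whose first line contains a space. The sample file and the names `sample`, `for_count`, and `read_count` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample file: two lines, the first one contains a space
sample=$(mktemp)
printf '%s\n' "hello world" "333" > "$sample"

# for + command substitution splits on every whitespace character
for_count=0
for word in $(cat "$sample"); do
    for_count=$((for_count + 1))
done

# while + read consumes exactly one line per iteration
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
done < "$sample"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $read_count"
rm -f "$sample"
```

With this input the `for` loop runs three times and the `while` loop twice, so the two forms only agree when no line contains whitespace.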

2. Reading a file's contents into an array

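#!/bin/bash
# Word splitting puts each whitespace-separated token into its own element,
# which matches one element per line only when lines contain no spaces
LISTS=($(cat /tmp/readTest.txt))
echo "size: ${#LISTS[*]}"
echo "${LISTS[*]}"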

Output:

size: 4
111 222 333 444

Print a single element: ${array_name[index]}
Print all elements: ${array_name[@]} or ${array_name[*]}
Print the number of elements: ${#array_name[@]} or ${#array_name[*]}
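The `[@]` and `[*]` forms behave the same when unquoted, but differ inside double quotes. The following is a small sketch of that difference; the array contents and the counters `at_count` and `star_count` are illustrative assumptions, not part of the original script.

```shell
#!/bin/bash
# Hypothetical array whose first element contains a space
LISTS=("111 aaa" "222" "333")

echo "first element: ${LISTS[0]}"
echo "element count: ${#LISTS[@]}"

# "${LISTS[@]}" expands to one word per element
at_count=0
for item in "${LISTS[@]}"; do
    at_count=$((at_count + 1))
done

# "${LISTS[*]}" expands to a single word, joined by the first
# character of IFS (a space by default)
star_count=0
for item in "${LISTS[*]}"; do
    star_count=$((star_count + 1))
done

echo "quoted [@] gives $at_count words, quoted [*] gives $star_count word"
```

For looping over elements, the quoted `[@]` form is usually the safer choice because it preserves elements that contain spaces.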

3. Iterating over an array

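for i in $( seq 0 $((${#LISTS[*]} - 1)) )
do
	# The $ before i is required here; without it, i is just the literal string "i"
	echo "index: $i"
	# In the next two lines the $ before i is optional, because i sits inside
	# an arithmetic context: $(( )) and an array subscript
	echo "item: $((i + 1))"
	echo "${LISTS[i]}"
done

Output:

index: 0
item: 1
111
index: 1
item: 2
222
index: 2
item: 3
333
index: 3
item: 4
444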

A more concise way:

for var in "${LISTS[@]}"
do
   echo "$var"
done
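When the index is needed as well as the value, bash can list an array's indices directly with `${!array[@]}`, which avoids calling `seq`. A minimal sketch, with the `indices` variable added only so the result is easy to inspect:

```shell
#!/bin/bash
LISTS=(111 222 333 444)

# "${!LISTS[@]}" expands to the list of indices (0 1 2 3 here)
indices=""
for i in "${!LISTS[@]}"; do
    echo "index: $i value: ${LISTS[i]}"
    indices="$indices$i "
done
```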

4. Union, intersection, and complement of two arrays

#!/bin/bash

file_list_1=("test1" "test2" "test3" "test4" "test5" "test6")
file_list_2=("test5" "test6" "test7" "test8")

# Union, A ∪ B
file_list_union=(`echo ${file_list_1[*]} ${file_list_2[*]}|sed 's/ /\n/g'|sort|uniq`)
echo ${file_list_union[*]}

# Intersection, A ∩ B (assumes neither array contains duplicates of its own)
file_list_inter=(`echo ${file_list_1[*]} ${file_list_2[*]}|sed 's/ /\n/g'|sort|uniq -c|awk '$1!=1{print $2}'`)
echo ${file_list_inter[*]}

# Symmetric difference: elements that are not in A ∩ B
file_list_4=(`echo ${file_list_1[*]} ${file_list_2[*]}|sed 's/ /\n/g'|sort|uniq -c|awk '$1==1{print $2}'`)
echo ${file_list_4[*]}

# Difference, A − B: elements of file_list_1 that are not in file_list_2
file_list_3=($(comm -23 <(printf "%s\n" "${file_list_1[@]}" | sort) <(printf "%s\n" "${file_list_2[@]}" | sort)))
echo ${file_list_3[*]}

Output:

test1 test2 test3 test4 test5 test6 test7 test8
test5 test6
test1 test2 test3 test4 test7 test8
test1 test2 test3 test4

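Here is a GPT-generated explanation, using file_list_4 as the example:
1. echo ${file_list_1[*]} ${file_list_2[*]}: joins the elements of file_list_1 and file_list_2 with spaces into a single string.
2. | sed 's/ /\n/g': pipes that string to sed, which replaces every space with a newline \n so that each element ends up on its own line.
3. | sort: pipes the result to sort, which orders the lines lexicographically.
4. | uniq -c: pipes the result to uniq -c, which collapses adjacent duplicate lines and prefixes each line with its count.
5. | awk '$1==1{print $2}': pipes the result to awk, which selects the lines whose count is 1 and prints their second field, the element itself.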
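The same counting idea can be sketched with `uniq -d` (lines seen more than once) and `uniq -u` (lines seen exactly once) in place of `uniq -c` plus `awk`. The helper name `merged_sorted` is hypothetical, and the sketch assumes that elements contain no whitespace and that neither array has duplicates of its own.

```shell
#!/bin/bash
file_list_1=("test1" "test2" "test3" "test4" "test5" "test6")
file_list_2=("test5" "test6" "test7" "test8")

# Hypothetical helper: print every element of both arrays, one per line, sorted
merged_sorted() {
    printf '%s\n' "${file_list_1[@]}" "${file_list_2[@]}" | sort
}

union=($(merged_sorted | uniq))        # every distinct element
inter=($(merged_sorted | uniq -d))     # elements seen more than once
symdiff=($(merged_sorted | uniq -u))   # elements seen exactly once

echo "union:   ${union[*]}"
echo "inter:   ${inter[*]}"
echo "symdiff: ${symdiff[*]}"
```

Using `printf '%s\n'` to emit one element per line also removes the need for the `sed` step.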

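Note that in practice some shell versions turned out not to support process substitution (a feature that runs a command and passes its output to another command as if it were a file argument); this can happen, for example, when the script is run as sh rather than bash. Taking the difference as the example, the error looks like this: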

syntax error near unexpected token `('
test.sh: command substitution: line 6: `comm -23 <(printf "%s\n" "${file_list_1[@]}" | sort) <(printf "%s\n" "${file_list_2[@]}" | sort))'

It can be rewritten like this:

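# Create the temporary files
temp1=$(mktemp)
temp2=$(mktemp)

# Print the elements of file_list_1 one per line, sort them, and save the result to temporary file 1
printf "%s\n" "${file_list_1[@]}" | sort > "$temp1"

# Print the elements of file_list_2 one per line, sort them, and save the result to temporary file 2
printf "%s\n" "${file_list_2[@]}" | sort > "$temp2"

# Compare the two temporary files with comm and assign the result to the result array
file_list_3=($(comm -23 "$temp1" "$temp2"))

# Delete the temporary files
rm -f "$temp1" "$temp2"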
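One possible refinement of the temporary-file approach is to keep both files in a single temporary directory and register the cleanup with `trap`, so the files are removed even if the script stops early. This is only a sketch; the names `tmp_dir`, `a`, and `b` are assumptions chosen for the example.

```shell
#!/bin/bash
file_list_1=("test1" "test2" "test3" "test4" "test5" "test6")
file_list_2=("test5" "test6" "test7" "test8")

tmp_dir=$(mktemp -d)
# Remove the temporary directory when the script exits, even after an error
trap 'rm -rf "$tmp_dir"' EXIT

printf '%s\n' "${file_list_1[@]}" | sort > "$tmp_dir/a"
printf '%s\n' "${file_list_2[@]}" | sort > "$tmp_dir/b"

# comm -23 keeps only the lines unique to the first file, i.e. A minus B
file_list_3=($(comm -23 "$tmp_dir/a" "$tmp_dir/b"))
echo "${file_list_3[*]}"
```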

5. Writing the difference of two files to another file

It can be done like this:

#!/bin/bash

file_list_1=($(cat readTest.txt))
file_list_2=($(cat readTest2.txt))
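# Keep the entries that occur exactly once, i.e. those present in only one of the two files
file_list_3=(`echo ${file_list_1[*]} ${file_list_2[*]}|sed 's/ /\n/g'|sort|uniq -c|awk '$1==1{print $2}'`)
# Append all results to readTest3.txt as a single space-separated line
echo ${file_list_3[*]} >> readTest3.txt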
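If only the one-sided difference is wanted (lines of the first file that do not appear in the second), `grep` can produce it directly from the files without sorting. The sketch below uses made-up sample data in a temporary directory `d`; the file names mirror the ones above, but their contents are assumptions.

```shell
#!/bin/bash
d=$(mktemp -d)
# Hypothetical sample inputs
printf '%s\n' 111 222 333 444 > "$d/readTest.txt"
printf '%s\n' 333 444 555 > "$d/readTest2.txt"

# -F: fixed strings, -x: match whole lines, -v: invert the match,
# -f: read the patterns (one per line) from a file
grep -Fxv -f "$d/readTest2.txt" "$d/readTest.txt" > "$d/readTest3.txt"

# Join the result lines with spaces for display
diff_result=$(paste -s -d ' ' "$d/readTest3.txt")
echo "$diff_result"
rm -rf "$d"
```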

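But often this much work is unnecessary. Files have file-level operations; reading a file's contents into the script only makes sense when the data itself needs processing. If the goal is simply to merge files, without manually filtering or modifying the data, the following is enough:
1. Take the union of the two files (each duplicated line is kept once)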

cat readTest.txt readTest2.txt| sort | uniq > readTest3.txt

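2. Take the intersection of the two files (keep only the lines present in both files; this assumes neither file contains duplicate lines of its own)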

cat readTest.txt readTest2.txt| sort | uniq -d > readTest3.txt

3. Remove the intersection and keep the remaining lines

cat readTest.txt readTest2.txt| sort | uniq -u > readTest3.txt
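The `uniq -d` and `uniq -u` tricks count occurrences across both files, so a line repeated inside one file can be mistaken for a shared line. An alternative sketch uses `sort -u` to deduplicate each file first and `comm` to split the result by side. The sample data, the temporary directory `d`, and the output file names are assumptions made for the example.

```shell
#!/bin/bash
d=$(mktemp -d)
# Hypothetical sample inputs; 111 is repeated inside the first file
printf '%s\n' 111 111 222 333 444 > "$d/readTest.txt"
printf '%s\n' 333 444 555 > "$d/readTest2.txt"

# comm expects sorted input; sort -u also drops duplicates within each file
sort -u "$d/readTest.txt"  > "$d/a.sorted"
sort -u "$d/readTest2.txt" > "$d/b.sorted"

comm -12 "$d/a.sorted" "$d/b.sorted" > "$d/both.txt"     # lines in both files
comm -23 "$d/a.sorted" "$d/b.sorted" > "$d/only_a.txt"   # lines only in readTest.txt
comm -13 "$d/a.sorted" "$d/b.sorted" > "$d/only_b.txt"   # lines only in readTest2.txt

both=$(paste -s -d ' ' "$d/both.txt")
only_a=$(paste -s -d ' ' "$d/only_a.txt")
only_b=$(paste -s -d ' ' "$d/only_b.txt")
echo "both: $both"
echo "only in readTest.txt: $only_a"
echo "only in readTest2.txt: $only_b"
rm -rf "$d"
```

Here the repeated 111 stays out of the intersection, whereas `cat ... | sort | uniq -d` would have reported it.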