How to Use the join Command on Linux

If you want to merge data from two text files by matching a common field, you can use the Linux join command. It adds a sprinkle of dynamism to your static data files. We’ll show you how to use it.

Matching Data Across Files

Data is king. Corporations, businesses, and households alike run on it. But data stored in different files and collated by different people is a pain. In addition to knowing which files to open to find the information you want, the layout and format of the files are likely to be different.

You also have to deal with the administrative headache of which files need to be updated, which need to be backed up, which are legacy, and which can be archived.

Plus, if you need to consolidate your data or conduct some analysis across an entire data set, you’ve got an additional problem. How do you rationalize the data across the different files before you can do what you need to do with it? How do you approach the data preparation phase?

The good news is if the files share at least one common data element, the Linux join command can pull you out of the mire.

The Data Files

All the data we’ll use to demonstrate the use of the join command is fictional, starting with the following two files:

cat file-1.txt
cat file-2.txt

The following is the contents of file-1.txt:

1 Adore Varian avarian0@newyorker.com Female 192.57.150.231
2 Nancee Merrell nmerrell1@ted.com Female 22.198.121.181
3 Herta Friett hfriett2@dagondesign.com Female 33.167.32.89
4 Torie Venmore tvenmore3@gmpg.org Female 251.9.204.115
5 Deni Sealeaf dsealeaf4@nps.gov Female 210.53.81.212
6 Fidel Bezley fbezley5@lulu.com Male 72.173.218.75
7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237
8 Odell Jursch ojursch7@utexas.edu Male 1.138.85.117

We have a set of numbered lines, and each line contains all the following information:

  • A number

  • A first name

  • A surname

  • An email address

  • The person’s sex

  • An IP address

The following is the contents of file-2.txt:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44
4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02
5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15
6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78
7 Standen ustanden6@geocities.jp Female Capital District $674,634.93
8 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09

Each line in file-2.txt contains the following information:

  • A number

  • A surname

  • An email address

  • The person’s sex

  • A region of New York

  • A dollar value

The join command works with “fields,” which, in this context, means a section of text surrounded by whitespace, the start of a line, or the end of a line. For join to match up lines between the two files, each line must contain a common field.

Therefore, we can only match a field if it appears in both files. The IP address only appears in one file, so that’s no good. The first name only appears in one file, so we can’t use that either. The surname is in both files, but it would be a poor choice, as different people can have the same surname.

You can’t tie the data together with the male and female entries, either, because they’re too vague. The regions of New York and the dollar values only appear in one file, too.

However, we can use the email address because it’s present in both files, and each is unique to an individual. A quick look through the files also confirms the lines in each correspond to the same person, so we can use the line numbers as our field to match (we’ll use a different field later).

Note there are a different number of fields in the two files, which is fine—we can tell join which field to use from each file.

However, watch out for fields like the regions of New York; in a space-separated file, each word in the name of a region looks like a field. Because some regions have two- or three-word names, you’ve actually got a different number of fields within the same file. This is okay, as long as you match on fields that appear in the line before the New York regions.
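
The uneven field count is easy to check for yourself. The following is a minimal sketch that uses awk (not part of the original walkthrough) to count the whitespace-separated fields on two lines copied from file-2.txt; the variable names are illustrative only.

```shell
# Two sample lines in the layout of file-2.txt (copied from the listing above).
# Single quotes keep the shell from expanding the dollar amounts.
lines='1 Varian avarian0@newyorker.com Female Western New York $535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10'

# NF is awk's count of whitespace-separated fields on the current line.
field_counts=$(printf '%s\n' "$lines" | awk '{ print $1 ": " NF " fields" }')
printf '%s\n' "$field_counts"
```

The three-word region gives line 1 eight fields, while the two-word region gives line 2 only seven, which is why matching on a field that comes before the region is the safe choice.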

The join Command

First, the field you’re going to match must be sorted. We’ve got ascending numbers in both files, so we meet that criterion. By default, join uses the first field in a file, which is what we want. Another sensible default is that join expects the field separators to be whitespace. Again, we’ve got that, so we can go ahead and fire up join.

As we’re using all the defaults, our command is simple:

join file-1.txt file-2.txt

join considers the files to be “file one” and “file two” according to the order in which they’re listed on the command line.

The output is as follows:

1 Adore Varian avarian0@newyorker.com Female 192.57.150.231 Varian avarian0@newyorker.com Female Western New York $535,304.73
2 Nancee Merrell nmerrell1@ted.com Female 22.198.121.181 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10
3 Herta Friett hfriett2@dagondesign.com Female 33.167.32.89 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44
4 Torie Venmore tvenmore3@gmpg.org Female 251.9.204.115 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02
5 Deni Sealeaf dsealeaf4@nps.gov Female 210.53.81.212 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15
6 Fidel Bezley fbezley5@lulu.com Male 72.173.218.75 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78
7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237 Standen ustanden6@geocities.jp Female Capital District $674,634.93
8 Odell Jursch ojursch7@utexas.edu Male 1.138.85.117 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09

The output is formatted in the following way: The field the lines were matched on is printed first, followed by the other fields from file one, and then the fields from file two without the match field.
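
If you want to reproduce that output layout without creating the full data files, the sketch below builds two tiny files in a temporary directory and joins them with the defaults. The file names left.txt and right.txt and the shortened lines are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Hypothetical two-line extracts in the spirit of file-1.txt and file-2.txt.
printf '%s\n' '1 Adore Varian' '2 Nancee Merrell' > "$workdir/left.txt"
printf '%s\n' '1 Western New York' '2 Finger Lakes' > "$workdir/right.txt"

# Default join: match on field one of each file, fields separated by whitespace.
joined=$(join "$workdir/left.txt" "$workdir/right.txt")
printf '%s\n' "$joined"

rm -r "$workdir"
```

Each output line starts with the match field, then the remaining fields from the first file, then the remaining fields from the second file.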

Unsorted Fields

Let’s try something we know won’t work. We’ll put the lines in one file out of order so join won’t be able to process the file correctly. The contents of file-3.txt are the same as file-2.txt, but line eight is between lines five and six.

The following is the contents of file-3.txt:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44
4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02
5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15
8 Jursch ojursch7@utexas.edu Male Hudson Valley $663,821.09
6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78
7 Standen ustanden6@geocities.jp Female Capital District $674,634.93

We type the following command to try to join file-3.txt to file-1.txt:

join file-1.txt file-3.txt

join reports that the seventh line in file-3.txt is out of order, so it’s not processed. Line seven is the one that begins with the number six, which should come before eight in a correctly sorted list. The sixth line in the file (which begins with “8 Jursch”) was the last one processed, so we see the joined output for it, beginning with “8 Odell.”

You can use the --check-order option if you want to see whether join is happy with the sort order of the files; join fails with an error message if either file is wrongly ordered.

To do so, we type the following:

join --check-order file-1.txt file-3.txt

join tells you in advance there’s going to be a problem with line seven of file file-3.txt.
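
One way out of this situation, which the walkthrough above doesn’t cover, is to sort the offending file on the join field before joining. The sketch below assumes GNU join (for --check-order) and uses made-up three-line files; LC_ALL=C is set so that sort and join agree on the ordering.

```shell
workdir=$(mktemp -d)

# Hypothetical data: the second file has keys 2 and 3 swapped.
printf '%s\n' '1 Adore' '2 Nancee' '3 Herta' > "$workdir/names.txt"
printf '%s\n' '1 Varian' '3 Friett' '2 Merrell' > "$workdir/unsorted.txt"

# With --check-order, join exits with a non-zero status on wrongly ordered input.
if LC_ALL=C join --check-order "$workdir/names.txt" "$workdir/unsorted.txt" >/dev/null 2>&1; then
    order_ok=yes
else
    order_ok=no
fi

# Sort on the first field only, then join against the sorted copy.
LC_ALL=C sort -k 1,1 "$workdir/unsorted.txt" > "$workdir/sorted.txt"
joined=$(LC_ALL=C join "$workdir/names.txt" "$workdir/sorted.txt")

printf 'order ok: %s\n%s\n' "$order_ok" "$joined"
rm -r "$workdir"
```

Note that join expects the same ordering that sort produces, which is a text comparison rather than a numeric one. Single-digit keys like the ones in this article sort the same either way.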

Files with Missing Lines

In file-4.txt, the last line has been removed, so there isn’t a line eight. The contents are as follows:

1 Varian avarian0@newyorker.com Female Western New York $535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $461,664.44
4 Venmore tvenmore3@gmpg.org Female Central New York $175,818.02
5 Sealeaf dsealeaf4@nps.gov Female North Country $126,690.15
6 Bezley fbezley5@lulu.com Male Mohawk Valley $366,733.78
7 Standen ustanden6@geocities.jp Female Capital District $674,634.93

We type the following and, surprisingly, join doesn’t complain and processes all the lines it can:

join file-1.txt file-4.txt

The output lists seven merged lines.

The -a (print unpairable) option tells join to also print the lines that couldn’t be matched.

Here, we type the following command to tell join to print the lines from file one that can’t be matched to lines in file two:

join -a 1 file-1.txt file-4.txt

Seven lines are matched, and line eight from file one is printed, unmatched. There isn’t any merged information because file-4.txt didn’t contain a line eight to which it could be matched. However, at least it still appears in the output so you know it doesn’t have a match in file-4.txt.

We type the following -v (suppress joined lines) command to reveal only the lines from file one that don’t have a match. Like -a, the -v option must be followed by a file number (1 or 2):

join -v 1 file-1.txt file-4.txt

We see that line eight is the only one that doesn’t have a match in file two.
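
To compare the two options side by side, here is a small sketch with made-up files in which the second file has no line for key 3. Both -a and -v take a file number (1 or 2) as their argument.

```shell
workdir=$(mktemp -d)

# Hypothetical data: surnames.txt has no line for key 3.
printf '%s\n' '1 Adore' '2 Nancee' '3 Herta' > "$workdir/names.txt"
printf '%s\n' '1 Varian' '2 Merrell' > "$workdir/surnames.txt"

# -a 1: print the joined lines plus any unpairable lines from file one.
with_unpaired=$(join -a 1 "$workdir/names.txt" "$workdir/surnames.txt")

# -v 1: print only the unpairable lines from file one.
only_unpaired=$(join -v 1 "$workdir/names.txt" "$workdir/surnames.txt")

printf -- '-a 1:\n%s\n-v 1:\n%s\n' "$with_unpaired" "$only_unpaired"
rm -r "$workdir"
```

With -a 1, the line for key 3 appears after the two joined lines; with -v 1, it is the only line printed.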

Matching Other Fields

Let’s match two new files on a field that isn’t the default (field one). The following is the contents of file-7.txt:

avarian0@newyorker.com Female 192.57.150.231
dsealeaf4@nps.gov Female 210.53.81.212
fbezley5@lulu.com Male 72.173.218.75
hfriett2@dagondesign.com Female 33.167.32.89
nmerrell1@ted.com Female 22.198.121.181
ojursch7@utexas.edu Male 1.138.85.117
tvenmore3@gmpg.org Female 251.9.204.115
ustanden6@geocities.jp Female 4.204.0.237

And the following is the contents of file-8.txt:

Female avarian0@newyorker.com Western New York $535,304.73
Female dsealeaf4@nps.gov North Country $126,690.15
Male fbezley5@lulu.com Mohawk Valley $366,733.78
Female hfriett2@dagondesign.com Southern Tier $461,664.44
Female nmerrell1@ted.com Finger Lakes $309,033.10
Male ojursch7@utexas.edu Hudson Valley $663,821.09
Female tvenmore3@gmpg.org Central New York $175,818.02
Female ustanden6@geocities.jp Capital District $674,634.93

The only sensible field to use for joining is the email address, which is field one in the first file and field two in the second. To accommodate this, we can use the -1 (file one field) and -2 (file two field) options. We’ll follow these with a number that indicates which field in each file should be used for joining.

We type the following to tell join to use the first field in file one and the second in file two:

join -1 1 -2 2 file-7.txt file-8.txt

The files are joined on the email address, which is displayed as the first field of each line in the output.
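
The same idea can be tried on a smaller scale. This sketch uses shortened, illustrative versions of the first two lines of file-7.txt and file-8.txt (the file names addresses.txt and regions.txt are made up) and joins them on the email address.

```shell
workdir=$(mktemp -d)

# Email address is field one in the first file and field two in the second.
printf '%s\n' 'avarian0@newyorker.com 192.57.150.231' \
              'dsealeaf4@nps.gov 210.53.81.212' > "$workdir/addresses.txt"
printf '%s\n' 'Female avarian0@newyorker.com Western' \
              'Female dsealeaf4@nps.gov North' > "$workdir/regions.txt"

# -1 1: join on field 1 of file one; -2 2: join on field 2 of file two.
joined=$(join -1 1 -2 2 "$workdir/addresses.txt" "$workdir/regions.txt")
printf '%s\n' "$joined"

rm -r "$workdir"
```

The email address moves to the front of each output line, followed by the rest of the first file’s fields and then the rest of the second file’s fields.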

Using Different Field Separators

What if you have files with fields that are separated by something other than whitespace?

The following two files are comma-delimited—the only whitespace is between the multiple-word place names:

cat file-5.txt
cat file-6.txt

We can use the -t (separator character) option to tell join which character to use as the field separator. In this case, it’s the comma, so we type the following command:

join -t, file-5.txt file-6.txt

All the lines are matched, and the spaces are preserved in the place names.
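
The contents of file-5.txt and file-6.txt aren’t reproduced in the text, so the sketch below uses made-up comma-delimited data purely to illustrate how -t behaves. The file names and field layout are assumptions, not the article’s actual files.

```shell
workdir=$(mktemp -d)

# Hypothetical comma-delimited files; spaces appear only inside place names.
printf '%s\n' '1,Varian,Female' '2,Merrell,Female' > "$workdir/people.csv"
printf '%s\n' '1,Western New York' '2,Finger Lakes' > "$workdir/regions.csv"

# -t, tells join to split fields on commas; the output uses commas as well.
joined=$(join -t, "$workdir/people.csv" "$workdir/regions.csv")
printf '%s\n' "$joined"

rm -r "$workdir"
```

Because the comma is now the only separator, a multi-word place name stays together as a single field.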

Ignoring Letter Case

Another file, file-9.txt, is almost identical to file-8.txt. The only difference is some of the email addresses have a capital letter, as shown below:

Female avarian0@newyorker.com Western New York $535,304.73
Female dsealeaf4@nps.gov North Country $126,690.15
Male Fbezley5@lulu.com Mohawk Valley $366,733.78
Female hfriett2@dagondesign.com Southern Tier $461,664.44
Female nmerrell1@ted.com Finger Lakes $309,033.10
Male Ojursch7@utexas.edu Hudson Valley $663,821.09
Female tvenmore3@gmpg.org Central New York $175,818.02
Female ustanden6@geocities.jp Capital District $674,634.93

When we joined file-7.txt and file-8.txt, it worked perfectly. Let’s see what happens with file-7.txt and file-9.txt.

We type the following command:

join -1 1 -2 2 file-7.txt file-9.txt

We only matched six lines. The differences in upper- and lowercase letters prevented the other two email addresses from being joined.

However, we can use the -i (ignore case) option to force join to ignore those differences and match fields that contain the same text, regardless of case.

We type the following command:

join -1 1 -2 2 -i file-7.txt file-9.txt

All eight lines are matched and joined successfully.
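
The effect of -i can be isolated with just the two addresses that differ in case. The sketch below uses shortened, illustrative lines (addresses.txt and regions.txt are made-up names) and runs the join with and without the option.

```shell
workdir=$(mktemp -d)

# The second file capitalizes the first letter of each email address.
printf '%s\n' 'fbezley5@lulu.com 72.173.218.75' \
              'ojursch7@utexas.edu 1.138.85.117' > "$workdir/addresses.txt"
printf '%s\n' 'Male Fbezley5@lulu.com Mohawk' \
              'Male Ojursch7@utexas.edu Hudson' > "$workdir/regions.txt"

# Without -i, the keys differ in case, so nothing matches.
case_sensitive=$(join -1 1 -2 2 "$workdir/addresses.txt" "$workdir/regions.txt")

# With -i, case is ignored when comparing the join fields.
case_insensitive=$(join -1 1 -2 2 -i "$workdir/addresses.txt" "$workdir/regions.txt")

printf 'without -i: [%s]\nwith -i:\n%s\n' "$case_sensitive" "$case_insensitive"
rm -r "$workdir"
```

The first join produces no output at all, while the second pairs both lines.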

Mix and Match

In join, you have a powerful ally when you’re wrestling with awkward data preparation. Perhaps you need to analyze the data, or maybe you’re trying to massage it into shape to perform an import to a different system.

No matter what the situation is, you’ll be glad you have join in your corner!

Translated from: https://www.howtogeek.com/542677/how-to-use-the-join-command-on-linux/
