A Gentle Introduction to Regular Expressions with R

We live in a data-centric age. Data has been described as the new oil. But just like oil, data isn’t always useful in its raw form. One form of data that is particularly hard to use in its raw form is unstructured data.

A lot of data is unstructured data. Unstructured data doesn’t fit nicely into a format for analysis, like an Excel spreadsheet or a data frame. Text data is a common type of unstructured data and this makes it difficult to work with. Enter regular expressions, or regex for short. They may look a little intimidating at first, but once you get started, using them will be a picnic!

The stringr Library

We’ll use the stringr library. stringr is built on top of the stringi package, which in turn uses the ICU C/C++ library, so its functions are fast.

To install and load the stringr library in R, use the following commands:
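
A minimal sketch of those commands, assuming a CRAN mirror is reachable. The requireNamespace guard is an addition made here so that the package is only installed when it is missing:

```r
# Install stringr only if it is not already available
# (assumption: a CRAN mirror is configured and reachable)
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Attach the package so the str_* functions can be called directly
library(stringr)
```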

See how easy that is? To make things even easier, most function names in the stringr package start with str_. Let’s take a look at a couple of the functions available to us in this package:

  1. str_extract_all(string, pattern): This function returns a list with a vector containing all instances of pattern in string

  2. str_replace_all(string, pattern, replacement): This function returns string with instances of pattern in string replaced with replacement

You may have already used these functions. They have pretty straightforward applications without adding regex. Think back to the times before social distancing and imagine a nice picnic in the park, like the image above. Here’s an example string with what everyone is bringing to the picnic. We can use it to demonstrate the basic usage of the regex functions:

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

If I want to pull every instance of one person’s name from this string, I would simply pass basicString and the name to str_extract_all():
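
The call itself is not reproduced in this copy of the article. A minimal sketch, assuming stringr is installed and using "Drew" as the name to look for:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Extract every occurrence of the literal pattern "Drew";
# the string comes first, the pattern second
basicExtractAll <- str_extract_all(basicString, "Drew")
basicExtractAll
```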

The result will be a list with all occurrences of the pattern. Using this example, basicExtractAll will have the following list with 1 vector as output:

[[1]]
[1] "Drew"

Now let’s imagine that Alex left his 4 hamburgers unattended at the picnic and they were stolen by Shawn. str_replace_all can replace any instances of Alex with Shawn:
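
A minimal sketch of that replacement, assuming stringr is installed. The variable name basicReplaceAll is hypothetical and chosen here only for illustration:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Replace every occurrence of "Alex" with "Shawn"
# (basicReplaceAll is a hypothetical name, not taken from the article)
basicReplaceAll <- str_replace_all(basicString, "Alex", "Shawn")
basicReplaceAll
```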

The resulting string will show that Shawn now has 4 hamburgers. What a lucky guy 🍔.

"Drew has 3 watermelons, Shawn has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

The examples so far are pretty basic. There is a time and place for them, but what if we want to know how many total food items there are at the picnic? Who are all the people with items? What if we need this data in a data frame for further analysis? This is where you will start to see the benefits of regex.

Regex Vocab

There are several concepts that drive regex:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Capture Groups

This is not an exhaustive list, but is plenty to help us hit the ground running.

Character Sets

Character sets represent options inside of brackets, with regex matching only one of the options. There are multiple things we can do with character sets:

  • Match a group of characters: We can find all of the vowels in our string by putting every vowel in brackets, for example,[aeiou]

[[1]]
[1] "e" "a" "a" "e" "e" "o" "e" "a" "a" "u" "e" "a" "i" "a"
[15] "a" "a" "a" "e" "a" "a" "a" "o" "e" "e"
  • Match a range of characters: We can find any capital letter from “A” to “F,” by using a hyphen, [A-F]. Character sets are case sensitive, so [A-F] is not the same as [a-f]

[[1]]
[1] "D" "A" "A"
  • Match a range of digits: We can find digits within a range by adding a digit range to our character set, for example [0-9] to find any digit. Notice that the digits are extracted as strings, not converted to numbers

[[1]]
[1] "3" "4" "1" "2" "6"
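
The calls behind the three character set examples are not reproduced in this copy. A minimal sketch, assuming stringr is installed; the variable names vowels, capitalsAtoF, and digits are hypothetical:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Any single lowercase vowel
vowels <- str_extract_all(basicString, "[aeiou]")

# Any single capital letter from A through F
# (the K in "Karina" falls outside this range, so it is not matched)
capitalsAtoF <- str_extract_all(basicString, "[A-F]")

# Any single digit; results are character strings, not numbers
digits <- str_extract_all(basicString, "[0-9]")

vowels
capitalsAtoF
digits
```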

Character sets can contain everything from this section simultaneously, so something like [A-Ct-z7-9] is still valid. It will match every character from capital “A” to capital “C,” lowercase “t” to lowercase “z,” and 7 through 9.

So far we can’t answer any of the questions posed earlier with just bracket groups. Let’s add some more weapons to our regex arsenal.

Meta Characters

Meta characters represent a type of character. They typically begin with a backslash \. Because the backslash is also the escape character inside R strings, each backslash in a pattern must itself be escaped with a second backslash. In other words, R requires 2 backslashes when using meta characters. Each meta character matches a single character. Here are some of the most important ones in action:

  • \\s: This meta character represents whitespace. It matches each space, tab, and newline character. You may also specify \\t and \\n for tab and newline characters respectively. Side note: our example string does not have any tabs, but be cautious when looking for them. Many integrated development environments, or IDEs, have a setting that replaces tabs with spaces as you type. In the example string, \\s returns a list containing a vector of 17 spaces, the exact number of spaces in our example string!

[[1]]
[1] " " " " " " " " " " " " " " " " " " " " " " " " " " " "
[15] " " " " " "
  • \\w: This meta character represents word characters. This includes all the letters a-z, capital and lowercase, the numbers 0–9, and the underscore. For plain English text this is the equivalent of the bracket group [A-Za-z0-9_], just much quicker to write. Remember that the \\w meta character on its own only captures a single character, not entire words or numbers. You’ll see that in the example. Don’t worry, we’ll handle that in the next section.

[[1]]
[1] "D" "r" "e" "w" "h" "a" "s" "3" "w" "a" "t" "e" "r" "m"
[15] "e" "l" "o" "n" "s" "A" "l" "e" "x" "h" "a" "s" "4" "h"
[29] "a" "m" "b" "u" "r" "g" "e" "r" "s" "K" "a" "r" "i" "n"
[43] "a" "h" "a" "s" "1" "2" "t" "a" "m" "a" "l" "e" "s" "a"
[57] "n" "d" "A" "n" "n" "a" "h" "a" "s" "6" "s" "o" "f" "t"
[71] "p" "r" "e" "t" "z" "e" "l" "s"
  • \\d: This meta character represents numeric digits. Using our picnic example, see how it only finds the digits in the string. You’ll notice that, as with the [0-9] character set, it picks up 5 numbers instead of the 4 we expect. This is because it is looking for each individual digit, not groups of digits. We’ll see how to fix that with quantifiers next.

[[1]]
[1] "3" "4" "1" "2" "6"
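
The calls that produce the three meta character outputs are not reproduced in this copy. A minimal sketch, assuming stringr is installed; the variable names are hypothetical. Note the doubled backslashes required inside R strings:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Each whitespace character (here: only plain spaces)
spaces <- str_extract_all(basicString, "\\s")

# Each single word character (letters, digits, underscore)
wordChars <- str_extract_all(basicString, "\\w")

# Each single digit, so "12" comes back as "1" and "2"
singleDigits <- str_extract_all(basicString, "\\d")

length(spaces[[1]])
length(wordChars[[1]])
singleDigits
```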

Quantifiers

As we saw in the previous section, a single meta character can have somewhat limited functionality. When it comes to words or numbers, we usually want to find more than 1 character at a time. This is where quantifiers come in. They allow you to quantify how many of a character you are expecting. They always come after the character they are quantifying and come in a few flavors:

  • + quantifies 1 or more matches. Let’s look at a new example to develop some intuition about what each quantifier will return: quantExample

When we use the + quantifier on quantExample, it will return 4 matches. This is a good point to mention that regex looks for non-overlapping matches. In this case, it looks at each B and the character that follows it. Since we used the + quantifier, it continues to match until it reaches the end of a group of B’s.

[[1]]
[1] "B" "BB" "BBB" "BBBB"
  • {} quantifies a specific number or range of matches. When written like {2} it will match exactly 2 of the preceding character. We’ll see some interesting results: it picks up 4 matches. This is because it is looking for each non-overlapping group of 2 B’s. There is no match in the 1st group (a single B), 1 match in the 2nd group, only 1 non-overlapping match in the 3rd group, and 2 non-overlapping matches in the 4th.

[[1]]
[1] "BB" "BB" "BB" "BB"

When written like {2,4}, it will match any run of 2 to 4 B’s. Note that putting a space inside the braces, as in {2, 4}, will NOT work. It will return an empty list.

[[1]]
[1] "BB" "BBB" "BBBB"

We can also write this quantifier and omit the upper bound like {2,}. This will match 2 or more instances. For quantExample, it will return the exact same result as {2,4}.

  • * quantifies zero or more matches. This can be helpful when we are looking for something that may or may not be in our string.

The * quantifier returns some strange matches when used by itself, so we can omit an example with quantExample. We will see in a following example how it can be applied when someone at our picnic is bringing a food item with a multiple word name. Without it, we wouldn’t correctly capture that Anna is bringing soft pretzels!
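
The definition of quantExample is not shown in this copy of the article, so the sketch below assumes a hypothetical string made of runs of 1, 2, 3, and 4 B’s separated by spaces, which is consistent with the outputs shown above. The result variable names are also hypothetical:

```r
library(stringr)

# Assumption: the original quantExample is not shown; this stand-in
# contains runs of one, two, three, and four B's
quantExample <- "B BB BBB BBBB"

# One or more B's: each whole run is a single match
onePlus <- str_extract_all(quantExample, "B+")

# Exactly two B's: non-overlapping pairs inside each run
exactlyTwo <- str_extract_all(quantExample, "B{2}")

# Between two and four B's, and two or more B's
twoToFour <- str_extract_all(quantExample, "B{2,4}")
twoOrMore <- str_extract_all(quantExample, "B{2,}")

onePlus
exactlyTwo
twoToFour
twoOrMore
```

Because matching is greedy by default, {2,4} takes the longest run it can, which is why the run of three B’s comes back as one "BBB" rather than a "BB" plus a leftover.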

Let’s combine what we know so far about character sets, meta characters, and quantifiers to answer some questions about our picnic string. We want to know all of the words that are in the string and also the numbers in the string.

For words, we can use a character set with all upper and lower case letters, adding a + quantifier to it. This will find any length of alpha characters grouped together. Said another way, it finds all of the words. Regex is starting to look much more helpful.
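
A minimal sketch of that word pattern, assuming stringr is installed. The variable name allWords is hypothetical:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# One or more consecutive upper or lower case letters, i.e. each word
allWords <- str_extract_all(basicString, "[A-Za-z]+")
allWords
```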

[[1]]
[1] "Drew" "has" "watermelons" "Alex"
[5] "has" "hamburgers" "Karina" "has"
[9] "tamales" "and" "Anna" "has"
[13] "soft" "pretzels"

To find the quantity of each food item, we can use the \\d meta character and the quantifier {1,2}. This will find the groups of digits that are 1 or 2 characters long. This is a much more useful output as we have the same number of quantities as we have food items and people!
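
A minimal sketch of that pattern, assuming stringr is installed. The variable name quantities is hypothetical:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Runs of one or two digits, so "12" is kept together
quantities <- str_extract_all(basicString, "\\d{1,2}")
quantities
```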

[[1]]
[1] "3" "4" "12" "6"

To find the quantity and name of each food item, we can combine quantifiers with meta characters. We know that each number has a food item directly after it, so we can just add on to the previous example. We know there is a space and a word (\\s\\w+) that could be followed by another word, as in “soft pretzels”. To specify that the second word might not be there, we can use the * quantifier with the second word. Just like that, we have a list containing the quantity and name of every food item at our picnic.
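
The exact pattern is not reproduced in this copy. One pattern consistent with the description is sketched below, assuming stringr is installed; applying * to both the second space and the second word is an assumption made here, and the variable name quantityAndItem is hypothetical:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# 1-2 digits, a space, a word, then an optional space and optional second word
# (assumed reconstruction; the * makes the second space and word optional)
quantityAndItem <- str_extract_all(basicString, "\\d{1,2}\\s\\w+\\s*\\w*")
quantityAndItem
```

The comma after single-word items stops the match, since a comma is neither whitespace nor a word character.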

[[1]]
[1] "3 watermelons" "4 hamburgers" "12 tamales"
[4] "6 soft pretzels"

Capture Groups

Capture groups allow you to look for entire phrases and only return parts of them. With our example, I want each person’s name, what they are bringing, and how much of it they are bringing. Up until this point we have been using str_extract_all. It has a clean output that is easy to read for our examples, but it only returns the full match, not the individual capture groups. Helpfully, stringr provides str_match_all, which does work with capture groups. It does, however, output the results in a list containing a matrix as opposed to a list containing a vector.

[[1]]
[,1]
[1,] "Drew has 3 watermelons"
[2,] "Alex has 4 hamburgers"
[3,] "Karina has 12 tamales"
[4,] "Anna has 6 soft pretzels"

The regex we used in captureGroup1 looks for a name, which starts with a capital letter and has one or more lowercase letters after it ([A-Z][a-z]+). Then it matches the pattern space, word, space (\\s\\w+\\s), which covers the word “has”. Next we are looking for a 1 to 2 digit number followed by a space and a word (\\d{1,2}\\s\\w+). You can see in the output that each row of the matrix is a character string with the details for each person.
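
The definition of captureGroup1 is not reproduced in this copy. A sketch of one pattern consistent with the description and with the output shown earlier, assuming stringr is installed; the trailing optional space and word (\\s*\\w*) is an assumption added so that “soft pretzels” is matched in full:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Name, " has ", 1-2 digit quantity, item word, optional second item word
# (assumed reconstruction; no capture groups yet, so the matrix has one column)
captureGroup1 <- str_match_all(
  basicString,
  "[A-Z][a-z]+\\s\\w+\\s\\d{1,2}\\s\\w+\\s*\\w*"
)
captureGroup1
```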

Now this is a big step up from where we started, but we don’t really care about the word “has”, and we want to be able to make a data frame out of the quantities. Let’s add in capture groups. By using capture groups, we can return a matrix where each column contains a specific piece of information. We’ll create capture groups containing each name, quantity, and item. Capture groups are simply sections of the regex that you wrap in parentheses.
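
The definition of captureGroup2 is not reproduced in this copy. A sketch that wraps the name, quantity, and item in parentheses, assuming stringr is installed; the pattern is an assumed reconstruction chosen to be consistent with the matrix output in this section:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Three capture groups: (name), (quantity), (item with optional second word)
# The word "has" is matched by \\w+ but sits outside any group
captureGroup2 <- str_match_all(
  basicString,
  "([A-Z][a-z]+)\\s\\w+\\s(\\d{1,2})\\s(\\w+\\s*\\w*)"
)
captureGroup2
```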

[[1]]
[,1] [,2] [,3] [,4]
[1,] "Drew has 3 watermelons" "Drew" "3" "watermelons"
[2,] "Alex has 4 hamburgers" "Alex" "4" "hamburgers"
[3,] "Karina has 12 tamales" "Karina" "12" "tamales"
[4,] "Anna has 6 soft pretzels" "Anna" "6" "soft pretzels"

The first column in the matrix holds the full match of the entire regex, ignoring the capture groups. The remaining columns of the matrix each correspond to the capture groups we defined for name, quantity, and item.

Combining our Text into a Data Frame

When doing data analysis, one of the most useful R data structures is a data frame. No doubt you already knew this if you clicked on this article. Data frames enable things like calculating column statistics and plotting data. Since we have a matrix with all of the information we want, turning it into a data frame isn’t too hard. We will use the data.frame function on everything except for the first column of the matrix. data.frame gives default column names, so we will change those to match what is in each column.

A quick note on the notation: the first set of brackets after captureGroup2 ([[1]]) accesses the first element of the list, our matrix. The second set of brackets ( [,-1]) selects all rows and every column except the first one.
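
A minimal sketch of the conversion, assuming stringr is installed. The captureGroup2 pattern is the same assumed reconstruction used for the capture group example, and the name captureGroup2DF is hypothetical:

```r
library(stringr)

basicString <- "Drew has 3 watermelons, Alex has 4 hamburgers, Karina has 12 tamales, and Anna has 6 soft pretzels"

# Assumed reconstruction of the capture group pattern
captureGroup2 <- str_match_all(
  basicString,
  "([A-Z][a-z]+)\\s\\w+\\s(\\d{1,2})\\s(\\w+\\s*\\w*)"
)

# [[1]] pulls the matrix out of the list; [, -1] drops the full-match column
captureGroup2DF <- data.frame(captureGroup2[[1]][, -1], stringsAsFactors = FALSE)

# Replace the default column names with descriptive ones
colnames(captureGroup2DF) <- c("Name", "Quantity", "Item")
captureGroup2DF
```

The Quantity column is still character at this point. Converting it with as.integer(captureGroup2DF$Quantity) makes column statistics possible, for example summing it to get the total number of food items at the picnic.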

|   | Name   | Quantity | Item          |
| - | ------ | -------- | ------------- |
| 1 | Drew | 3 | watermelons |
| 2 | Alex | 4 | hamburgers |
| 3 | Karina | 12 | tamales |
| 4 | Anna | 6 | soft pretzels |

Conclusion and Further Learning

We only covered a small subset of how regex can help handle unstructured text data. This is a good foundation to get started, but before long you will need to know concepts like how to find everything BUT a character (negation) or find something immediately before or after something else (lookarounds). Depending on your use case you may need to understand how lazy and greedy matching operate. Here are some more resources to help you learn more about these other concepts in regex:

  • The official stringr page on the tidyverse site: The folks over at RStudio have compiled resources to help learn packages like stringr. They even included a stringr cheat sheet that you can print out and reference.

  • R for Data Science: Co-written by Hadley Wickham, author of the stringr package, this book is a good reference for anything in R. There is even a chapter that covers more advanced regex in R. It is available online for free here, or you can purchase a hardcopy here. Disclaimer: I receive a commission on your purchase through this link.

  • Datacamp Courses: An online learning community dedicated to data science, machine learning, and data visualization. Check out their course “String Manipulation with stringr in R.” The first chapter of every course on the site is free! Disclaimer: You will receive a discount on your subscription and I receive a commission if you sign up for a monthly Datacamp subscription using this link. Support my writing while you learn!

As always, let me know if you enjoyed the content. Don’t like how I use regex? Tell me about it in the comments. Either way, subscribe so you get notified every time I post new content!

Translated from: https://towardsdatascience.com/a-gentle-introduction-to-regular-expressions-with-r-df5e897ca432
