[Python] Data Analysis Section 1.3: Regex | from Coursera "Applied Data Science with Python"

In this lecture we're going to talk about pattern matching in strings using regular expressions. Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern which you give to a regex processor along with some source data. The processor then parses that source data using that pattern, and returns chunks of text back to the data scientist or programmer for further manipulation. There are really three main reasons you would want to do this - to check whether a pattern exists within some source data, to get all instances of a complex pattern from some source data, or to clean your source data using a pattern, generally through string splitting. Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science application.

Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how the regex parsing engine works and efficient mechanisms for parsing text. In this lecture I want to give you a basic understanding of how regex works - enough knowledge that, with a little directed sleuthing, you'll be able to make sense of the regex patterns you see others use, and you can build up your practical knowledge of how to use regexes to improve your data cleaning. By the end of this lecture, you will understand the basics of regular expressions, how to define patterns for matching, how to apply these patterns to strings, and how to use the results of those patterns in data processing.

Finally, a note that in order to best learn regexes you need to write regexes. I encourage you to stop the video at any point and try out the new patterns or syntax as you learn them.

# First we'll import the re module, which is where Python's regular expression functionality lives
import re

# There are several main processing functions in re that you might use. The first, match(), checks for a match only at the beginning of the string. Similarly, search() checks for a match anywhere in the string. Both return a Match object when they succeed and None when they don't, so they can be used directly in conditionals.

# Let's create some text for an example
text = "This is a good day."

# Now, let's see if it's a good day or not:
if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

>>> Wonderful!
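
# A quick sketch of the difference with match(), which only succeeds at the very start of the string (using the same text):
print(re.match("This", text)) # "This" is at the beginning, so we get a Match object back
print(re.match("good", text)) # "good" appears later in the string, so match() returns None

>>> 
<re.Match object; span=(0, 4), match='This'>
None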

# In addition to using patterns in conditionals, we can segment a string. The work that regex does here is called
# tokenizing, where the string is separated into substrings based on patterns. Tokenizing is a core activity
# in natural language processing, which we won't talk much about here but which you will study in the future

# The findall() and split() functions will parse the string for us and return chunks. Let's try an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful."

# This is a bit of a fabricated example, but let's split this on all instances of Amy
re.split("Amy", text)

>>> 
['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

# You'll notice that split has returned an empty string, followed by a number of statements about Amy, all as elements of a list. If we wanted to count how many times we have talked about Amy, we could use findall()
re.findall("Amy", text)

>>> ['Amy', 'Amy', 'Amy']
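
# And to get the actual count of mentions, we can just take the length of that list
len(re.findall("Amy", text))

>>> 3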

# Ok, so we've seen that .search() looks for some pattern and returns a Match object (or None when there is no match), that .split() will use a pattern for creating a list of substrings, and that .findall() will look for a pattern and pull out all occurrences.
# Now that we know how the Python regex API works, let's talk about more complex patterns. The regex specification standard defines a markup language to describe patterns in text. Let's start with anchors.
# Anchors specify the start and/or the end of the string that you are trying to match. The caret character ^ means start and the dollar sign character $ means end. If you put ^ before a string, it means that the text the regex processor retrieves must start with the string you specify. For the ending, you put the $ character after the string, which means that the text the regex processor retrieves must end with the string you specify.

# Here's an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful."

# Let's see if this begins with Amy
re.search("^Amy",text)

>>> <re.Match object; span=(0, 3), match='Amy'>
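
# And a quick sketch of the $ anchor: here the match must sit at the very end of the string. Note the backslash before the period, so it matches a literal period rather than any character.
re.search("successful\.$",text)

>>> <re.Match object; span=(63, 74), match='successful.'>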

Patterns and Character Classes

# Let's talk more about patterns and start with character classes. Let's create a string of a single learner's
# grades over a semester in one course across all of their assignments
grades="ACAAAABCBCBAA"

# If we want to answer the question "How many B's were in the grade list?" we would just use B
re.findall("B",grades)

>>> ['B', 'B', 'B']

# If we wanted to count the number of A's or B's in the list, we can't use "AB" since that pattern would only match an A followed immediately by a B. Instead, we put the characters A and B inside square brackets
re.findall("[AB]",grades)

>>> ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

# This is called the set operator. You can also include a range of characters, which are ordered alphanumerically. For instance, if we want to refer to all lowercase letters we could use [a-z]. Let's build a simple regex to parse out all instances where this student received an A followed by a B or a C
re.findall("[A][B-C]",grades)

>>> ['AC', 'AB']

# Notice how the [AB] pattern describes a set of possible characters which could be either (A OR B), while the [A][B-C] pattern denotes two sets of characters which must be matched back to back. You can write this pattern equivalently by using the pipe operator, which means OR
re.findall("AB|AC",grades)

>>> ['AC', 'AB']

# We can use the caret with the set operator to negate our results. For instance, if we want to parse out only the grades which were not A's
re.findall("[^A]",grades)

>>> ['C', 'B', 'C', 'B', 'C', 'B']

# Note this carefully - outside of the set operator the caret anchors to the beginning of a string, but inside the set operator it takes on a different meaning, negation, while most other special characters lose their special meaning entirely. This can be a bit confusing. What do you think the result would be of this?
re.findall("^[^A]",grades)

>>> []

# The result is empty: the pattern anchors to the start of the string and then requires a first character which is not an A. Since the grades string begins with an A, there is no match.

Quantifiers

# Quantifiers specify the number of times you want a pattern to be matched in order to count as a match. The most basic quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times you want it to be matched, and n is the maximum number of times the item could be matched.

# Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?
re.findall("A{2,10}",grades) # we'll use 2 as our min, but ten as our max

>>> ['AAAA', 'AA']

# So we see that there were two streaks, one where the student had four A's, and one where they had only two A's

# We might try to do this using single values and just repeating the pattern
re.findall("A{1,1}A{1,1}",grades)

>>> ['AA', 'AA', 'AA']

# As you can see, this is different from the first example. The first pattern is looking for any combination of two A's up to ten A's in a row. So it sees four A's as a single streak. The second pattern is looking for two A's back to back, so it sees two A's followed immediately by two more A's. We say that the regex processor begins at the start of the string and consumes the characters which match patterns as it goes.

# It's important to note that the regex quantifier syntax does not allow you to deviate from the {m,n} pattern. In particular, if you have an extra space in between the braces the quantifier is treated as literal text, so you'll get an empty result
re.findall("A{2, 2}",grades)

>>> []

# And as we have already seen, if we don't include a quantifier then the default is {1,1}
re.findall("AA",grades)

>>> ['AA', 'AA', 'AA']

# Oh, and if you just have one number in the braces, it's considered to be both m and n
re.findall("A{2}",grades)

>>> ['AA', 'AA', 'AA']

# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

>>> ['AAAABC']

# Now, that's a bit of a hack, because we included a maximum that was just arbitrarily large. There are three other quantifiers that are used as shorthand: an asterisk * to match 0 or more times, a question mark ? to match 0 or 1 times, and a plus sign + to match 1 or more times.
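
# As a quick sketch of ? and +, still using the grades string from above:
re.findall("AB?",grades) # an A, optionally followed by a single B

>>> ['A', 'A', 'A', 'A', 'AB', 'A', 'A']

re.findall("A+",grades) # one or more consecutive A's

>>> ['A', 'AAAA', 'AA']

# Now let's look at a more complex example, and load some data scraped from Wikipedia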
with open("datasets/ferpa.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

# Scanning through this document, one of the things we notice is that the headers all have the text [edit] after them, followed by a newline character. So if we wanted to get a list of all of the headers in this article we could do so using re.findall
re.findall("[a-zA-Z]{1,100}\[edit\]",wiki)

>>> ['Overview[edit]', 'records[edit]', 'records[edit]']

# Ok, that didn't quite work. It got all of the headers, but only the last word of each header, and it really
# was quite clunky. Let's iteratively improve this. First, we can use \w to match any word character,
# which includes letters, digits, and the underscore.
re.findall("[\w]{1,100}\[edit\]",wiki)

>>> ['Overview[edit]', 'records[edit]', 'records[edit]']

# This is something new. \w is a metacharacter, and indicates a special pattern of any letter, digit, or underscore. There are actually a number of different metacharacters listed in the documentation. For instance, \s matches any whitespace character.
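
# A minimal sketch of \s on a fresh throwaway string (not the wiki data):
re.split("\s", "to be\tor not\nto be") # splits on any whitespace: spaces, tabs, and newlines

>>> ['to', 'be', 'or', 'not', 'to', 'be']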

# Next, let's use the shorthand quantifiers to shorten up the curly brace syntax. The asterisk * matches 0 or more times, so let's try that.
re.findall("[\w]*\[edit\]",wiki)

>>> ['Overview[edit]', 'records[edit]', 'records[edit]']

# Now that we have shortened the regex, let's improve it a little bit. We can add in spaces by including the space character in the set
re.findall("[\w ]*\[edit\]",wiki)

>>> 
['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

# Ok, so this gets us the list of section titles in the Wikipedia page! You can now create a list of titles by iterating through this and applying another regex
for title in re.findall("[\w ]*\[edit\]",wiki):
    # Now we will take that intermediate result and split on the square bracket [ just taking the first result
    print(re.split("[\[]",title)[0])

>>> Overview
Access to public records
Student medical records

Groups

# Ok, this works, but it's a bit of a pain. To this point we have been talking about a regex as a single pattern which is matched. But, you can actually match different patterns, called groups, at the same time, and then refer to the groups you want. To group patterns together you use parentheses, which is actually pretty natural. Let's rewrite our findall using groups
re.findall("([\w ]*)(\[edit\])",wiki)

>>> 
[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

# Nice - we see that the Python re module breaks out the result by group. We can also refer to groups by number with the Match objects that are returned. But how do we get back a list of Match objects? Thus far we've seen that findall() returns strings, and search() and match() return individual Match objects. When we want a list of Match objects, we use the function finditer()
for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.groups())

>>> 
('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')

# We see here that the groups() method returns a tuple of the groups. We can get an individual group using group(number), where group(0) is the whole match, and each other number is the portion of the match we are interested in. In this case, we want group(1)
for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.group(1))

>>> 
Overview
Access to public records
Student medical records

# One more feature of regex groups that I use less often, but which is a good idea, is labeling or naming groups. In the previous example I showed you how you can use the position of the group. But giving groups a label and looking at the results as a dictionary is pretty useful. For that we use the syntax (?P<name>), where the parenthesis starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>.
for item in re.finditer("(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

>>> 
Overview
Access to public records
Student medical records

# Of course, we can print out the whole dictionary for the item too, and see that the [edit] string is still in there. Here's the dictionary kept for the last match
print(item.groupdict())

>>> {'title': 'Student medical records', 'edit_link': '[edit]'}

Look-ahead and Look-behind

# One more concept to be familiar with is called "look ahead" and "look behind" matching. In this case, the pattern being given to the regex engine is for text either before or after the text we are trying to isolate. For example, in our headers we want to isolate the text which comes before the [edit] rendering, but we actually don't care about the [edit] text itself. Thus far we have been throwing the [edit] away, but if we want it to participate in the match without being captured, we can use look ahead with the ?= syntax
for item in re.finditer("(?P<title>[\w ]+)(?=\[edit\])",wiki):
    # What this regex says is: match a group named title, made up of any number of word characters or spaces,
    # which must be immediately followed by the characters [edit] - but since the [edit] is in a look-ahead,
    # it isn't included in the resulting match objects
    print(item)

>>> 
<re.Match object; span=(0, 8), match='Overview'>
<re.Match object; span=(2715, 2739), match='Access to public records'>
<re.Match object; span=(3692, 3715), match='Student medical records'>
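
# Look-behind is the mirror image: with the (?<= ) syntax, the engine requires some text to appear immediately before the match, without including it in the result. A minimal sketch on a fresh string:
re.findall("(?<=student )\w+", "Our student Amy is successful.")

>>> ['Amy']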

Example: Wikipedia Data

# Let's look at some more Wikipedia data. Here's some data on universities in the US which are Buddhist-based
with open("datasets/buddhist.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

# We can see that each university follows a fairly similar pattern, with the name followed by a dash (–), then the words "located in", followed by the city and state

# I'll actually use this example to show you the verbose mode of Python regexes. The verbose mode allows you to write multi-line regexes and increases readability. For this mode, we have to explicitly indicate all whitespace characters, either by prepending them with a \ or by using the \s special value. However, this means we can write our regex a bit more like code, and can even include comments with #
pattern="""
(?P<title>.*)        #the university title
(–\ located\ in\ )   #an indicator of the location
(?P<city>\w*)        #city the university is in
(,\ )                #separator for the state
(?P<state>\w*)       #the state the city is located in"""

# Now when we call finditer() we just pass the re.VERBOSE flag as the last parameter, this makes it much
# easier to understand large regexes!
for item in re.finditer(pattern,wiki,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())

>>>
{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}

Example: New York Times and Hashtags

# Here's another example, from the New York Times, covering health tweets on news items. This data came from the UC Irvine Machine Learning Repository, which is a great source of different kinds of data
with open("datasets/nytimeshealth.txt","r") as file:
    # We'll read everything into a variable and take a look at it
    health=file.read()
health

# So here we can see there are tweets with fields separated by pipes |.
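
# A quick aside: since the pipe means OR in a regex, we have to escape it if we want to split on a literal pipe. Here's a minimal sketch on a made-up line - the real fields in this file may differ:
re.split("\|", "id|timestamp|tweet text")

>>> ['id', 'timestamp', 'tweet text']

# Now, let's try to get a list of all of the hashtags that are included in this data. A hashtag begins with a pound sign (or hash mark) and continues until some whitespace is found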

# So let's create a pattern. We want to include the hash sign first, then any number of alphanumeric characters, and we end when we see some whitespace
pattern = '#[\w\d]*(?=\s)'

# Notice that the ending is a look ahead; we're not actually interested in matching the whitespace in the return value. Also notice that I used an asterisk * instead of the plus + for matching the alphanumeric characters. Here it makes little difference, since every hashtag in this data has at least one character after the #, but * allows zero word characters after the hash while + would require at least one. (As an aside, \w already matches digits, so the \d inside the set is redundant.)

# Let's search for and display all of the hashtags
re.findall(pattern, health)

>>> 
['#askwell',
 '#pregnancy',
 '#Colorado',
 '#VegetarianThanksgiving',
 '#FallPrevention',
 '#Ebola',
 '#Ebola',
 '#ebola',
 '#Ebola',
 '#Ebola',
 '#EbolaHysteria',
 '#AskNYT',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Excalibur',
 '#ebola',
 '#Ebola',
 '#dallas',
 '#nobelprize2014',
 '#ebola',
 '#ebola',
 '#monrovia',
 '#ebola',
 '#nobelprize2014',
 '#ebola',
 '#nobelprize2014',
 '#Medicine',
 '#Ebola',
 '#Monrovia',
 '#Ebola',
 '#smell',
 '#Ebola',
 '#Ebola',
 '#Ebola',
 '#Monrovia',
 '#Ebola',
 '#ebola',
 '#monrovia',
 '#liberia',
 '#benzos',
 '#ClimateChange',
 '#Whole',
 '#Wheat',
 '#Focaccia',
 '#Tomatoes',
 '#Olives',
 '#Recipes',
 '#Health',
 '#Ebola',
 '#Monrovia',
 '#Liberia',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Ebola',
 '#blood',
 '#Ebola',
 '#organtrafficking',
 '#EbolaOutbreak',
 '#SierraLeone',
 '#Freetown',
 '#SierraLeone',
 '#ebolaoutbreak',
 '#kenema',
 '#ebola',
 '#Ebola',
 '#ebola',
 '#ebola',
 '#Ebola',
 '#ASMR',
 '#AIDS2014',
 '#AIDS',
 '#MH17',
 '#benzos']

This lecture has been an overview of regular expressions, and really, we've just scratched the surface of what you can do. Now, I actually find regexes really frustrating - they're incredibly powerful, but if you don't use them for a while you're left grasping for memory of some of the details, especially named groups and look-ahead searches. But, there are lots of great examples and reference guides on the web, including the Python documentation for regex, and with these in hand you should be able to write concise and readable code which performs well too. Having basic regex literacy is a core skill for applied data scientists.
