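30 Days of Python: Day 18 -- Regular Expressions

<< Day 17 || Day 19 >>

Day 18


A regular expression, or RegEx, is a special text string that helps find patterns in data. A RegEx can be used to check whether a pattern exists in a string. To use RegEx in Python, we first import the RegEx module, which is named re.

The re Module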


import re



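  • re.match(): searches only at the beginning of the first line of the string and returns a match object if found, otherwise None.
  • re.search: returns a match object if the pattern occurs anywhere in the string, including multiline strings, otherwise None.
  • re.findall: returns a list containing all matches
  • re.split: takes a string, splits it at the match points and returns a list
  • re.sub: replaces one or more matches within a string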


# syntax
re.match(substring, string, re.I)
# substring is a string or a pattern, string is the text in which we look for the pattern, re.I is the ignore-case flag
import re

txt = 'I love to teach python and javaScript'
# It returns an object with span, and match
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (0, 15)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach

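As the example above shows, the pattern (or substring) we are looking for is I love to teach. The match function returns an object only if the text starts with the pattern.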

import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

The string does not start with I like to teach, so there is no match and the match method returns None.
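Because re.match returns None when nothing matches, calling span() or group() on the result directly would raise an AttributeError. A minimal sketch of a guard follows; the helper name matched_prefix is hypothetical and only used for illustration.

```python
import re

txt = 'I love to teach python and javaScript'


def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    match = re.match(pattern, text, re.I)
    # re.match returns None on failure, so check before calling methods on it
    if match is None:
        return None
    return match.group()


print(matched_prefix('I love to teach', txt))  # I love to teach
print(matched_prefix('I like to teach', txt))  # None
```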


# syntax
re.search(substring, string, re.I)
# substring is a pattern, string is the text in which we look for the pattern, re.I is the ignore-case flag
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns an object with span and match
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (100, 105)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first
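To make the difference between matching and searching concrete, here is a small sketch that runs both functions on the same text with the same pattern. The variable names at_start and anywhere are chosen only for this illustration.

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# re.match only tries position 0, so a word in the middle is not found
at_start = re.match('first', txt, re.I)
print(at_start)  # None

# re.search scans forward and returns the first occurrence
anywhere = re.search('first', txt, re.I)
print(anywhere.span())  # (100, 105)
```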




txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

As you can see, the word language occurs twice in the string. Let us practice some more. Now we will look for both Python and python in the string:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

Because we are using re.I, both lowercase and uppercase letters are included. Without the re.I flag, we would have to write our pattern differently. Let us check it out:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']


txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Pass re.I via flags=; the fourth positional argument of re.sub is count, not flags
match_replaced = re.sub('Python|python', 'JavaScript', txt, flags=re.I)
print(match_replaced)
# JavaScript is the most beautiful language that a human being has ever created.
# I recommend JavaScript for a first programming language
# OR
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # same output as above
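re.sub also accepts an optional count argument that limits how many matches are replaced. A short sketch, passing both count and flags by keyword so that they cannot be confused with each other:

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# flags passed by keyword: every case variant is replaced
all_replaced = re.sub('python', 'JavaScript', txt, flags=re.I)
print(all_replaced.count('JavaScript'))  # 2

# count=1 limits the substitution to the first match only
first_only = re.sub('python', 'JavaScript', txt, count=1, flags=re.I)
print(first_only.count('JavaScript'))  # 1
print('python' in first_only)          # True
```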

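Let us add one more example. The following string is really hard to read unless we remove the % symbols. Replacing each % with an empty string cleans the text.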

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)
# I am teacher and  I love teaching.
# There is nothing as rewarding as educating and empowering people.
# I found teaching more interesting than any other jobs.
# Does this motivate you to be a teacher?

Splitting Text Using RegEx Split

txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol
# ['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
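Splitting on a fixed character could also be done with str.split. re.split becomes useful when the separator is itself a pattern. The sketch below uses a made-up string (csv_like is not from the lesson) with inconsistent separators, and also shows the optional maxsplit argument.

```python
import re

csv_like = 'apple, banana;lemon ,  mango'

# split on a comma or semicolon, swallowing any surrounding whitespace
parts = re.split(r'\s*[,;]\s*', csv_like)
print(parts)  # ['apple', 'banana', 'lemon', 'mango']

# maxsplit limits the number of splits; the rest stays in the last item
head, rest = re.split(r'\s*[,;]\s*', csv_like, maxsplit=1)
print(head)  # apple
print(rest)  # banana;lemon ,  mango
```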

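Writing RegEx Patterns

To declare a string variable we use single or double quotes. To declare a RegEx variable we use a raw string, r''. The following pattern identifies apple only in lowercase. To make it case-insensitive, we should either rewrite the pattern or add a flag.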

import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# To make it case-insensitive, add a flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# or we can use a set of characters
regex_pattern = r'[Aa]pple'  # this means the first letter can be A or a
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']
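
  • []: a set of characters
    • [a-c] means a or b or c
    • [a-z] means any letter from a to z
    • [A-Z] means any letter from A to Z
    • [0-3] means 0 or 1 or 2 or 3
    • [0-9] means any digit from 0 to 9
    • [A-Za-z0-9] means any single character that is a to z, A to Z or 0 to 9
  • \: used to escape special characters
    • \d matches a single digit (0-9)
    • \D matches a single character that is not a digit
  • .: any character except the newline character (\n)
  • ^: starts with
    • r'^substring', e.g. r'^love', a sentence that starts with the word love
    • r'[^abc]' means not a, not b, not c
  • $: ends with
    • r'substring$', e.g. r'love$', a sentence that ends with the word love
  • *: zero or more times
    • r'[a]*' means a is optional or may occur many times
  • +: one or more times
    • r'[a]+' means at least once (or more)
  • ?: zero or one time
    • r'[a]?' means zero times or once
  • {3}: exactly 3 repetitions of the preceding item
  • {3,}: at least 3 repetitions
  • {3,8}: 3 to 8 repetitions
  • |: either or
    • r'apple|banana' means either apple or banana
  • (): capture and group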
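The list above mentions capturing groups and anchors without showing them in action. The following sketch combines several of the listed symbols on a made-up order string (the text and the name pairs are assumptions for illustration only).

```python
import re

txt = 'Orders: apple x3, banana x12, lemon x7'

# [a-z]+ : one or more lowercase letters; \d+ : one or more digits
# The parentheses capture the name and the quantity as separate groups
pairs = re.findall(r'([a-z]+) x(\d+)', txt)
print(pairs)  # [('apple', '3'), ('banana', '12'), ('lemon', '7')]

# ^ anchors the pattern at the start of the string, $ at the end
print(bool(re.search(r'^Orders', txt)))  # True
print(bool(re.search(r'x\d$', txt)))     # True

# [^...] negates the set: everything that is not a letter, digit or space
print(re.findall(r'[^A-Za-z0-9 ]', txt))  # [':', ',', ',']
```

When a pattern contains groups, re.findall returns tuples of the captured parts instead of the whole match.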




regex_pattern = r'[Aa]pple' # the square brackets mean either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']


regex_pattern = r'[Aa]pple|[Bb]anana' # square brackets pick A or a (B or b); | means either apple or banana
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

Using square brackets and the or operator, we can match Apple, apple, Banana and banana.


regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want


regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!

Period (.)

regex_pattern = r'[a].'  # this square bracket means a and . means any character except new line
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # . any character, + any character one or more times 
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']



txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']


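We can use curly braces to specify the length of the substring we are looking for in a text. Let us imagine we are interested in substrings that are 4 characters long: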

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4 digits; no space is allowed inside the braces
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']
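Whitespace inside the braces is significant in Python's re module. A minimal sketch on the same text, comparing an open-ended range, a bounded range, and a pattern with a stray space, which Python treats as literal text rather than as a quantifier:

```python
import re

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'

# {3,} : three or more digits in a row
print(re.findall(r'\d{3,}', txt))    # ['2019', '2021']

# {1,4} : between one and four digits
print(re.findall(r'\d{1,4}', txt))   # ['6', '2019', '8', '2021']

# With a space, '{1, 4}' is no longer a quantifier; it is matched as literal text
print(re.findall(r'\d{1, 4}', txt))  # []
```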


txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # ^ inside a character set means negation: not A to Z, not a to z, not a space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']
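By default, ^ matches only at the very start of the string. The re.M (multiline) flag, which is part of the standard re module but not otherwise covered in this lesson, makes ^ match at the start of every line as well. A small sketch with a made-up three-line text:

```python
import re

txt = '''Python is fun.
python is popular.
I recommend Python.'''

# Without re.M, ^ matches only at the very start of the string
start_only = re.findall(r'^python', txt, flags=re.I)
print(start_only)  # ['Python']

# With re.M, ^ also matches right after each newline
each_line = re.findall(r'^python', txt, flags=re.I | re.M)
print(each_line)   # ['Python', 'python']
```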

Exercises: Day 18

Exercises: Level 1

  1. What is the most frequent word in the following paragraph?
    paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.'
    (6, 'love'),
    (5, 'you'),
    (3, 'can'),
    (2, 'what'),
    (2, 'teaching'),
    (2, 'not'),
    (2, 'else'),
    (2, 'do'),
    (2, 'I'),
    (1, 'which'),
    (1, 'to'),
    (1, 'the'),
    (1, 'something'),
    (1, 'if'),
    (1, 'give'),
    (1, 'develop'),
    (1, 'capabilities'),
    (1, 'application'),
    (1, 'an'),
    (1, 'all'),
    (1, 'Python'),
    (1, 'If')

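  2. The positions of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at the origin, and 4 and 8 in the positive direction. Extract these numbers from the whole text and find the distance between the two furthest particles.
    points = ['-12', '-4', '-3', '-1', '0', '4', '8']
    sorted_points = [-12, -4, -3, -1, 0, 4, 8]
    distance = 8 - (-12)  # 20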

Exercises: Level 2

Write a pattern that identifies whether a string is a valid Python variable name

is_valid_variable('first_name') # True
is_valid_variable('first-name') # False
is_valid_variable('1first_name') # False
is_valid_variable('firstname') # True



Clean the following text, then find the three most frequent words in the cleaned string.

sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?'''

# Expected cleaned text:
# I am a teacher and I love teaching There is nothing as more rewarding as educating and empowering people I found teaching more interesting than any other jobs Does this motivate you to be a teacher
print(most_frequent_words(cleaned_text)) # [(3, 'I'), (2, 'teaching'), (2, 'teacher')]

