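30 Days of Python: Day 18 - Regular Expressions

<< Day 17 || Day 19 >>

Day 18

Regular Expressions

A regular expression, or RegEx, is a special text string that helps to find patterns in data. A RegEx can be used to check whether some pattern exists in a piece of text. To use RegEx in Python, we first import the RegEx module, which is called re.

The re Module

After importing the module, we can use it to detect or find patterns.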

import re

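Methods in the re Module

To find a pattern, we use different re functions that search a string for a match.

  • re.match(): searches only at the beginning of the first line of the string and returns a match object if found, otherwise returns None.
  • re.search(): returns a match object if there is a match anywhere in the string, including multiline strings.
  • re.findall(): returns a list containing all matches
  • re.split(): takes a string, splits it at the match points and returns a list
  • re.sub(): replaces one or more matches within a string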

Match

# syntax
re.match(substring, string, re.I)
# substring is a string or a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
import re

txt = 'I love to teach python and javaScript'
# It returns an object with span, and match
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (0, 15)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach

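As the example above shows, the pattern (or substring) we are looking for is I love to teach. The match function returns an object only if the text starts with the pattern.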

import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

The string does not start with I like to teach, so there is no match and the match method returns None.
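Because re.match() returns None when nothing matches, calling span() or group() directly on the result raises an AttributeError in that case. A small guard avoids this. The helper name find_prefix below is hypothetical and only serves the illustration.

```python
import re

def find_prefix(pattern, text):
    """Return the matched prefix, or None when text does not start with pattern."""
    match = re.match(pattern, text, re.I)
    if match is None:          # no match: there is no object to call span() or group() on
        return None
    return match.group()       # the matched text

txt = 'I love to teach python and javaScript'
print(find_prefix('I love to teach', txt))  # I love to teach
print(find_prefix('I like to teach', txt))  # None
```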

Search

# syntax
re.search(substring, string, re.I)
# substring is a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns an object with span and match
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (100, 105)
# Let's find the start and stop position from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

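As you can see, search is more flexible than match because it looks for the pattern throughout the text. Search returns a match object for the first match found, otherwise it returns None. When every occurrence is needed, findall is the better choice. This function checks the whole string for the pattern and returns all the matches as a list.

Searching for All Matches Using findall()

findall() returns all the matches as a list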

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

As you can see, the word language appears twice in the string. Let us practice some more. Now we will look for both Python and python in the string:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

Since we are using re.I, both lowercase and uppercase letters are included. Without the re.I flag, we have to write our pattern differently. Let us check it out:

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

# or, using a set of characters for the first letter
matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']

Replacing a Substring

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# flags must be passed by keyword: the fourth positional argument of re.sub is count
match_replaced = re.sub('python', 'JavaScript', txt, flags=re.I)
print(match_replaced)
# JavaScript is the most beautiful language that a human being has ever created.
# I recommend JavaScript for a first programming language
# OR, without a flag
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt)
print(match_replaced)  # prints the same two lines as above
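One detail of re.sub() is easy to miss: its signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag given as the fourth positional argument is read as count, not as a flag. The sketch below, with a made-up sample string, passes both by keyword to show the difference.

```python
import re

txt = 'Python, python, PYTHON'

# flags passed by keyword: all three spellings are replaced
by_keyword = re.sub('python', 'JavaScript', txt, flags=re.I)
print(by_keyword)   # JavaScript, JavaScript, JavaScript

# count limits how many matches are replaced, counted from the left
only_first = re.sub('python', 'JavaScript', txt, count=1, flags=re.I)
print(only_first)   # JavaScript, python, PYTHON
```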

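Let us add one more example. The following string is really hard to read unless we remove the % symbol. Replacing % with an empty string will clean the text.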


txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)
# I am teacher and  I love teaching.
# There is nothing as rewarding as educating and empowering people.
# I found teaching more interesting than any other jobs.
# Does this motivate you to be a teacher?

Splitting Text Using RegEx Split

txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol
# ['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
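Splitting on a single fixed character could also be done with str.split(). re.split() becomes useful when the separator is itself a pattern. A minimal sketch, with a sample string invented for the illustration:

```python
import re

txt = 'apple, banana;cherry  mango'

# Split on a comma, a semicolon or a whitespace character, plus any whitespace that follows
parts = re.split(r'[,;\s]\s*', txt)
print(parts)        # ['apple', 'banana', 'cherry', 'mango']

# maxsplit limits the number of splits, the rest of the string is kept as is
first_two = re.split(r'[,;\s]\s*', txt, maxsplit=1)
print(first_two)    # ['apple', 'banana;cherry  mango']
```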

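Writing RegEx Patterns

To declare a string variable we use single or double quotes. To declare a RegEx pattern we use a raw string, r''. The following pattern only identifies apple in lowercase. To make it case insensitive, we should either rewrite the pattern or add a flag.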

import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# To make it case insensitive, add the re.I flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# or we can use a set of characters
regex_pattern = r'[Aa]pple'  # the first letter can be either A or a
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']
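
  • []: a set of characters
    • [a-c] means a or b or c
    • [a-z] means any letter from a to z
    • [A-Z] means any letter from A to Z
    • [0-3] means 0 or 1 or 2 or 3
    • [0-9] means any digit from 0 to 9
    • [A-Za-z0-9] means any single character that is a to z, A to Z or 0 to 9
  • \: used to escape special characters
    • \d means: match where the string contains a digit (0-9)
    • \D means: match where the string contains a character that is not a digit
  • . : any character except the newline character (\n)
  • ^: starts with
    • r'^substring', e.g. r'^love', a sentence that starts with the word love
    • r'[^abc]' means not a, not b, not c
  • $: ends with
    • r'substring$', e.g. r'love$', a sentence that ends with the word love
  • *: zero or more times
    • r'[a]*' means a is optional or it can occur many times
  • +: one or more times
    • r'[a]+' means at least once (or more)
  • ?: zero or one time
    • r'[a]?' means zero times or once
  • {3}: exactly 3 repetitions of the preceding item
  • {3,}: at least 3 repetitions
  • {3,8}: 3 to 8 repetitions
  • |: either or
    • r'apple|banana' means either apple or banana
  • (): capture and group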
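The list above can be checked one entry at a time with re.findall(). The snippet below is a small sketch under the assumption that short made-up strings are enough to show each meta character; the variable names are illustrative only.

```python
import re

b_text = 'b ba baaa'
digits_text = '12 345 6789'

in_range = re.findall(r'[a-c]', 'abcdef')             # ['a', 'b', 'c']
non_digits = re.findall(r'\D+', 'ab12cd')             # ['ab', 'cd']
negated = re.findall(r'[^abc]', 'aabbccdd')           # ['d', 'd']
starts = re.findall(r'^I love', 'I love Python')      # ['I love']
ends = re.findall(r'Python$', 'I love Python')        # ['Python']
zero_or_more = re.findall(r'ba*', b_text)             # ['b', 'ba', 'baaa']
one_or_more = re.findall(r'ba+', b_text)              # ['ba', 'baaa']
zero_or_one = re.findall(r'ba?', b_text)              # ['b', 'ba', 'ba']
exactly_3 = re.findall(r'\d{3}', digits_text)         # ['345', '678']
at_least_3 = re.findall(r'\d{3,}', digits_text)       # ['345', '6789']
from_3_to_8 = re.findall(r'\d{3,8}', '12 123456789')  # ['12345678']
either = re.findall(r'apple|banana', 'apple pie, banana split')  # ['apple', 'banana']
# With groups, findall returns one tuple per match, one item per group
groups = re.findall(r'(\d+)-(\d+)', '10-20 and 30-40')           # [('10', '20'), ('30', '40')]

print(zero_or_more, one_or_more, zero_or_one)
print(exactly_3, at_least_3, from_3_to_8)
print(groups)
```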

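Let us use examples to clarify the meta characters above.

Square Brackets

Let us use square brackets to include both lowercase and uppercase.

regex_pattern = r'[Aa]pple' # the square brackets mean either A or a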
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

If we also want to look for banana, we write the pattern as follows:

regex_pattern = r'[Aa]pple|[Bb]anana' # square brackets allow either case of the first letter, | means or
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

Using square brackets and the or operator, the pattern matches Apple, apple, Banana and banana.

Escape Character (\) in RegEx

regex_pattern = r'\d'  # d is a special character which means digits
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want

One or More Times (+)

regex_pattern = r'\d+'  # d is a special character which means digits, + mean one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!

Period (.)

regex_pattern = r'[a].'  # this square bracket means a and . means any character except new line
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # . means any character, + means one or more times, so the match runs to the end of the line
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']
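Since the period does not match a newline, a pattern such as .+ stops at the end of each line. The sketch below shows this on a two-line string made up for the purpose, and, as an aside that goes beyond the examples above, how the standard re.DOTALL flag changes that behaviour.

```python
import re

txt = 'first line\nsecond line'

# Without a flag, . stops at the newline, so each line gives its own match
per_line = re.findall(r'l.+', txt)
print(per_line)     # ['line', 'line']

# With re.DOTALL, . also matches the newline, so one match spans both lines
across_lines = re.findall(r'l.+', txt, flags=re.DOTALL)
print(across_lines) # ['line\nsecond line']
```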

Zero or One Time (?)

Zero or one time. The pattern may not occur at all, or it may occur once.

txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

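Quantifiers in RegEx

We can specify the length of the substring we are looking for in a text using curly brackets. Let us imagine we are interested in substrings that are 4 characters long: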

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4 times; no space after the comma, otherwise the braces are read as literal text
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']

Caret (^)

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # ^ in set character means negation, not A to Z, not a to z, no space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']
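The caret therefore has two roles: at the start of a pattern it anchors the match to the start of the string, while as the first character inside square brackets it negates the set. A minimal sketch with a shortened sample sentence (the variable names are illustrative):

```python
import re

txt = 'This example was made on December 6, 2019'

# Anchor: only matches when the string starts with the word
starts_with_this = re.findall(r'^This', txt)
starts_with_made = re.findall(r'^made', txt)
print(starts_with_this)  # ['This']
print(starts_with_made)  # []

# Negation: replace every run of non-digit characters with a space, then split
digits_only = re.sub(r'[^0-9]+', ' ', txt).split()
print(digits_only)       # ['6', '2019']
```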

Exercises: Day 18

Exercises: Level 1

  1. What is the most frequent word in the following paragraph?
    paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.'
    [
    (6, 'love'),
    (5, 'you'),
    (3, 'can'),
    (2, 'what'),
    (2, 'teaching'),
    (2, 'not'),
    (2, 'else'),
    (2, 'do'),
    (2, 'I'),
    (1, 'which'),
    (1, 'to'),
    (1, 'the'),
    (1, 'something'),
    (1, 'if'),
    (1, 'give'),
    (1, 'develop'),
    (1, 'capabilities'),
    (1, 'application'),
    (1, 'an'),
    (1, 'all'),
    (1, 'Python'),
    (1, 'If')
    ]

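  2. The positions of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at the origin, and 4 and 8 in the positive direction. Extract these numbers from the whole text and find the distance between the two furthest particles.
    points = ['-12', '-4', '-3', '-1', '0', '4', '8']
    sorted_points = [-12, -4, -3, -1, 0, 4, 8]
    distance = 8 - (-12)  # 20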

Exercises: Level 2

Write a pattern which identifies whether a string is a valid Python variable name

is_valid_variable('first_name') # True
is_valid_variable('first-name') # False
is_valid_variable('1first_name') # False
is_valid_variable('firstname') # True

Exercises: Level 3

Clean the following text. After cleaning, count the three most frequent words in the string.

sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?'''

print(clean_text(sentence))
I am a teacher and I love teaching There is nothing as more rewarding as educating and empowering people I found teaching more interesting than any other jobs Does this motivate you to be a teacher
print(most_frequent_words(cleaned_text)) # [(3, 'I'), (2, 'teaching'), (2, 'teacher')]

🎉 Congratulations! 🎉
