Python Regular Expressions (Class Notes)

Latest recommended article published 2024-04-25 09:45:36

Sober97

Tags: python, regular expressions

Article link: https://blog.csdn.net/Sober97/article/details/105919571


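In brief:
These notes cover importing the re library and using its re.search() and re.findall() functions. They also cover some commonly used regular expressions which, combined with these functions, make it easy to extract the data you want. For example:
^ outside [] anchors the match to the start of the line, so the pattern that follows must appear at the beginning; as the first character inside [], it matches any character except those listed in the brackets.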
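
The two roles of ^ can be checked with a minimal sketch like the one below. The sample string and the variable names are made up for illustration.

```python
import re

text = 'From: alice'  # hypothetical sample string

# Outside brackets, ^ anchors the match to the start of the string.
anchored = re.findall('^From', text)
not_at_start = re.findall('^alice', text)

# As the first character inside brackets, ^ negates the set:
# [^a-z]+ matches runs of characters that are NOT lowercase letters.
negated = re.findall('[^a-z]+', text)

print(anchored)      # ['From']
print(not_at_start)  # []
print(negated)       # ['F', ': ']
```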

1. Importing the library

import re

2. Commonly used regular expressions
For more, see: https://docs.python.org/3/howto/regex.html
(Figure: regular expression quick guide)
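
As a rough illustration of a few of the metacharacters used later in these notes, here is a small sketch. The sample string is hypothetical, and the selection of patterns is an assumption about what is most useful here, not a reproduction of the quick guide.

```python
import re

s = 'X-DSPAM-Confidence: 0.8475 end'  # hypothetical sample string

starts_with_x = re.findall('^X', s)    # ^ : start of the string
ends_with_end = re.findall('end$', s)  # $ : end of the string
spaces = re.findall(r'\s', s)          # \s : one whitespace character
words = re.findall(r'\S+', s)          # \S+ : runs of non-whitespace characters
digits = re.findall('[0-9]+', s)       # [0-9]+ : runs of digits

print(starts_with_x)  # ['X']
print(ends_with_end)  # ['end']
print(spaces)         # [' ', ' ']
print(words)          # ['X-DSPAM-Confidence:', '0.8475', 'end']
print(digits)         # ['0', '8475']
```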

3. Examples of finding data
1) Using re.search() like find()
The find() version:

hand = open('mbox-short.txt')
for line in hand:
	line = line.rstrip()
	if line.find('From:') >= 0:
		print(line)

The re.search() version:

import re

hand = open('mbox-short.txt')
for line in hand:
	line = line.rstrip()
	if re.search('From:', line):
		print(line)

2) Using re.search() like startswith()
The startswith() version:

hand = open('mbox-short.txt')
for line in hand:
	line = line.rstrip()
	if line.startswith('From:'):
		print(line)

The re.search() version:

import re

hand = open('mbox-short.txt')
for line in hand:
	line = line.rstrip()
	if re.search('^From:', line):  # ^ anchors the match to the start of the line
		print(line)
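
The difference between the two patterns shows up as soon as 'From:' appears in the middle of a line. The sketch below uses a hypothetical in-memory list in place of mbox-short.txt so that it runs without the file. re.search() returns a match object when the pattern is found and None otherwise, which is why it can be used directly in an if test.

```python
import re

# Hypothetical sample lines standing in for mbox-short.txt
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Subject: note From: the team',
    'X-DSPAM-Confidence: 0.8475',
]

# 'From:' may appear anywhere in the line, like find()
anywhere = [line for line in lines if re.search('From:', line)]
# '^From:' must appear at the start of the line, like startswith()
at_start = [line for line in lines if re.search('^From:', line)]

print(anywhere)  # the first two lines
print(at_start)  # only the first line
print(re.search('^From:', lines[2]))  # None
```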

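3) Using a regular expression to match lines that start with X and contain a colon (^X.*:)
Wild-card characters
In ^X.*:, . matches any character and * repeats it zero or more times, so spaces between X and the colon are allowed.
To make the match more precise, use ^X-\S+:, which requires X- followed by one or more non-whitespace characters and then a colon.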
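
A quick way to see the difference between the loose and the precise pattern is to run both over a few lines. The sample lines and variable names below are assumptions chosen for illustration.

```python
import re

# Hypothetical sample lines
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

# Loose: starts with X, then anything (including spaces), then a colon
loose = [line for line in lines if re.search('^X.*:', line)]
# Precise: starts with X-, then non-whitespace characters only, then a colon
precise = [line for line in lines if re.search(r'^X-\S+:', line)]

print(len(loose))    # 3
print(len(precise))  # 2
```

The third line is rejected by the precise pattern because there are spaces between X- and the colon.
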
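5. Matching and extracting data
1) Using a library function and regular expressions to extract all matching data
Library function: re.findall()
Regular expressions: [0-9]+ and [AEIOU]+
[0-9] matches a single digit, and + means one or more repetitions of the preceding item, so [0-9]+ matches a run of one or more digits.
[AEIOU] matches a single character that is one of the uppercase vowels A, E, I, O, U, so [AEIOU]+ matches a run of one or more of them.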

import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)
# ['2', '19', '42'] (the extracted items are strings)
y = re.findall('[AEIOU]+', x)
print(y)
# [] (an empty list: x contains no uppercase vowels)
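
Because re.findall() returns strings, the results need to be converted before doing arithmetic. The sketch below shows one way to do that, and also shows that character classes are case-sensitive. The variable names are illustrative.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# findall() returns strings; convert them to int before summing
digits = re.findall('[0-9]+', x)
numbers = [int(d) for d in digits]
print(numbers)       # [2, 19, 42]
print(sum(numbers))  # 63

# Character classes are case-sensitive: adding the lowercase vowels
# to the class makes the pattern match the vowels in x.
vowels = re.findall('[AEIOUaeiou]+', x)
print(vowels)
```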

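2) Finding the matching string (greedy and non-greedy)
-> Greedy matching: match the largest possible string
The regular expression is ^F.+:
^F means the first character is F, . matches any character, + means one or more repetitions of the preceding item, and : means the match ends with a colon.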

import re
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)
# ['From: Using the :']

-> Non-greedy matching: use ?
The regular expression is ^F.+?:
+? means one or more repetitions but non-greedy, so it matches the shortest possible string.

import re
x = 'From: Using the : character'
y = re.findall('^F.+?:', x)
print(y)
# ['From:']
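
Putting the two patterns side by side makes the difference easier to see. This sketch uses re.search() and the match object's group() method instead of re.findall(); the variable names are illustrative.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ extends as far as possible, so the match ends at the LAST colon
greedy = re.search('^F.+:', x).group()
# Non-greedy: .+? stops as early as possible, so the match ends at the FIRST colon
lazy = re.search('^F.+?:', x).group()

print(repr(greedy))            # 'From: Using the :'
print(repr(lazy))              # 'From:'
print(len(greedy), len(lazy))  # 17 5
```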

3) Fine-tuning the extraction
-> Finding email addresses
The regular expression is \S+@\S+
\S matches a single non-whitespace character, so \S+ matches at least one non-whitespace character.
Parentheses are not part of the match, but they mark where the extracted string starts and stops.

import re
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall(r'\S+@\S+', x)  # raw string, so \S is passed to re unchanged
print(y)
# ['stephen.marquard@uct.ac.za']
y = re.findall(r'^From (\S+@\S+)', x)
print(y)
# ['stephen.marquard@uct.ac.za']
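
To see how parentheses change what re.findall() returns, here is a minimal sketch with zero, one, and two groups. With two or more groups, each result is a tuple. The patterns with split groups are illustrative variations, not part of the original notes.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: the whole match is returned
whole = re.findall(r'\S+@\S+', x)
# One group: only the parenthesized part is returned
one_group = re.findall(r'^From (\S+)@\S+', x)
# Two groups: each result is a tuple with one item per group
two_groups = re.findall(r'^From (\S+)@(\S+)', x)

print(whole)       # ['stephen.marquard@uct.ac.za']
print(one_group)   # ['stephen.marquard']
print(two_groups)  # [('stephen.marquard', 'uct.ac.za')]
```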

-> Finding the host name of an email address
The approach used previously (using find and string slicing):

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
atpos = data.find('@')
print(atpos) # 21
sppos = data.find(' ', atpos) # search for a space, starting at position atpos
print(sppos) # 31
host = data[atpos+1:sppos] # the slice includes the start index and excludes the end index
print(host) # uct.ac.za

A slightly simpler version (split twice):

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
words = data.split()
email = words[1]
pieces = email.split('@') # ['stephen.marquard', 'uct.ac.za']
print(pieces[1]) # uct.ac.za

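Regular expression version
Two regular expressions are used:
1. @([^ ]*) extracts the run of non-space characters that follows @ (zero or more of them, because of *).
[^ ] matches any character except a space; inside brackets, a leading ^ means "not".
2. ^From .*@([^ ]*) looks for a line that starts with "From ", where .* matches any character any number of times, and then extracts the non-space characters after @.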

import re
data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', data)
print(y) # ['uct.ac.za']
y = re.findall('^From .*@([^ ]*)', data)
print(y) # ['uct.ac.za']
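
The three approaches can be wrapped in functions and compared on the same input. The function names below are hypothetical, and this is only a sketch: the slicing and split versions assume the line really contains an address, while the regex version returns None when nothing matches.

```python
import re

def host_by_slicing(data):
    """Find '@' and the next space, then slice between them."""
    atpos = data.find('@')
    sppos = data.find(' ', atpos)
    return data[atpos + 1:sppos]

def host_by_split(data):
    """Split on whitespace, then split the second word on '@'."""
    email = data.split()[1]
    return email.split('@')[1]

def host_by_regex(data):
    """Extract the non-space characters after '@' on a 'From ' line."""
    found = re.findall('^From .*@([^ ]*)', data)
    return found[0] if found else None

data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
print(host_by_slicing(data))  # uct.ac.za
print(host_by_split(data))    # uct.ac.za
print(host_by_regex(data))    # uct.ac.za
print(host_by_regex('Subject: no address here'))  # None
```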

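6. A small example
This extracts the floating-point number that follows a specific prefix. The number may contain a decimal point, though it does not have to.
[0-9.] matches a single digit or a single decimal point.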

import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
	line = line.rstrip()
	stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
	if len(stuff) != 1 : continue
	num = float(stuff[0])
	numlist.append(num)
print('Maximum:', max(numlist))
# Maximum: 0.9907
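
The same loop can be tried without the mbox-short.txt file by running it over a list of strings. The sample lines below are made up for illustration. Note that [0-9.]+ would also accept a string such as 1.2.3, which float() cannot convert, so this sketch assumes well-formed numbers.

```python
import re

# Hypothetical sample lines; the values are made up for illustration
lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'Subject: hello',
    'X-DSPAM-Confidence: 0.9907',
    'X-DSPAM-Confidence: 0.6178',
]

numlist = []
for line in lines:
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue  # skip lines that do not carry a confidence value
    numlist.append(float(stuff[0]))

print(numlist)                   # [0.8475, 0.9907, 0.6178]
print('Maximum:', max(numlist))  # Maximum: 0.9907
```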

7. Escape character
If the character we want to find is one that has a special meaning in regular expressions, put a \ in front of it so that it is matched literally.

import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)  # raw string, so \$ is passed to re unchanged
print(y)
# ['$10.00']
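
To see what the backslash changes, the sketch below runs the same pattern with and without it. It also uses re.escape(), a standard-library helper that adds the backslashes to a literal string for you; using it here is an addition to the notes, not something they cover.

```python
import re

x = 'We just received $10.00 for cookies.'

# Without the backslash, $ means "end of the string", so nothing can follow it
unescaped = re.findall('$[0-9.]+', x)
# With the backslash, $ is matched as a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

print(unescaped)  # []
print(escaped)    # ['$10.00']

# re.escape() escapes the special characters in a literal string
literal = re.escape('$10.00')
print(literal)                 # \$10\.00
print(re.findall(literal, x))  # ['$10.00']
```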
