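Regular Expressions
Regular expressions, often shortened to regex, are typically used to search for and replace text that conforms to a given pattern (rule).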
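1. Purpose
Given a regular expression and a string, we can:
(1) check whether the string satisfies the filtering logic of the regular expression (this is called a "match");
(2) use the regular expression to extract the specific parts we want from the string.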
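Both goals can be sketched in a few lines. The sample string and the patterns below are made up for illustration only.

```python
import re

text = "order id: 20240315, status: paid"  # sample input, assumed for illustration

# Goal (1): test whether the string contains something that fits the pattern.
is_match = re.search(r"\d{8}", text) is not None

# Goal (2): pull out the specific part we want with a capturing group.
m = re.search(r"order id: (\d+)", text)
order_id = m.group(1) if m else None

print(is_match, order_id)
```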
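2. Characteristics
(1) Very flexible, logical, and functional;
(2) Complex string handling can be expressed quickly and very concisely;
(3) Rather obscure and hard to read for newcomers.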
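The re Module
1. Common functions
re.match(pattern, string, flags=0)
Tries to match pattern at the beginning of string. If it matches (an empty match counts), the corresponding Match object is returned; otherwise None is returned.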
import re
s1 = "hello world lla llb llc"
result = re.match("he", s1)
print(result)
# <re.Match object; span=(0, 2), match='he'>
re.search(pattern, string, flags=0)
Scans through string and finds the first place where pattern matches (an empty match counts), returning the corresponding Match object. Returns None if nothing matches.
result = re.search("he", s1)
print(result)
# <re.Match object; span=(0, 2), match='he'>
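The two examples above give the same result because "he" happens to sit at the start of s1. A minimal sketch of where re.match and re.search differ, using a sample string chosen for illustration:

```python
import re

s = "say hello"

# re.match only tries at position 0, so "hello" is not found here.
at_start = re.match("hello", s)

# re.search scans forward and finds the first occurrence anywhere.
anywhere = re.search("hello", s)

print(at_start)         # None
print(anywhere.span())  # (4, 9)
```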
re.fullmatch(pattern, string, flags=0)
Checks whether the whole of string matches pattern. Returns the corresponding Match object if it does, otherwise None.
result = re.fullmatch("hello world lla llb llc", s1)
print(result)
# <re.Match object; span=(0, 23), match='hello world lla llb llc'>
re.findall(pattern, string, flags=0)
Returns all non-overlapping matches of pattern in string as a list.
result = re.findall("ll", s1)
print(result, type(result))
# ['ll', 'll', 'll', 'll'] <class 'list'>
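What re.findall puts in the list depends on whether the pattern contains capturing groups. The following sketch, with a sample string assumed for illustration, shows the three cases:

```python
import re

s = "lla=1 llb=2 llc=3"

# No group: the full matches are returned.
whole = re.findall(r"ll\w=\d", s)

# One group: only the text captured by that group is returned.
keys = re.findall(r"(ll\w)=\d", s)

# Two groups: a list of tuples, one element per group.
pairs = re.findall(r"(ll\w)=(\d)", s)

print(whole)  # ['lla=1', 'llb=2', 'llc=3']
print(keys)   # ['lla', 'llb', 'llc']
print(pairs)  # [('lla', '1'), ('llb', '2'), ('llc', '3')]
```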
re.split(pattern, string, maxsplit=0, flags=0)
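Splits string at each match of pattern and returns the pieces as a list. If maxsplit is nonzero, at most maxsplit splits are made.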
result = re.split("hello", "hello china hello world")
print(result)
# ['', ' china ', ' world']
re.sub(pattern, repl, string, count=0, flags=0)
Replaces the text matched by pattern with repl. At most count replacements are made; the default count=0 replaces every match.
result = re.sub("ll", "666", s1)
print(result)
# he666o world 666a 666b 666c
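As a small sketch of the other options of re.sub, the example below limits the number of replacements with count and passes a function as repl. The lambda is only an illustration of the callable form.

```python
import re

s1 = "hello world lla llb llc"

# Replace only the first two matches.
first_two = re.sub("ll", "666", s1, count=2)

# repl may be a function that receives each Match object
# and returns the replacement text.
upper = re.sub(r"ll\w", lambda m: m.group().upper(), s1)

print(first_two)  # he666o world 666a llb llc
print(upper)      # heLLO world LLA LLB LLC
```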
re.finditer(pattern, string, flags=0)
Returns an iterator that yields a Match object for each non-overlapping match.
result = re.finditer("he", s1)
print(result)
# <callable_iterator object at 0x0000016D8AE0DA20>
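Printing the iterator itself is not very informative. A minimal sketch of looping over it to read each match and its position:

```python
import re

s1 = "hello world lla llb llc"

# Each item produced by finditer is a Match object,
# so both the text and the span are available.
spans = [(m.group(), m.span()) for m in re.finditer(r"ll\w", s1)]

print(spans)
# [('llo', (2, 5)), ('lla', (12, 15)), ('llb', (16, 19)), ('llc', (20, 23))]
```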
re.compile(pattern, flags=0)
Compiles pattern into a reusable Pattern object.
pat = re.compile("hello")
print(pat, type(pat))
# re.compile('hello') <class 're.Pattern'>
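A compiled Pattern object offers the same operations as methods, which is convenient when one pattern is used many times. A short sketch:

```python
import re

s1 = "hello world lla llb llc"
pat = re.compile(r"ll\w")

# The module-level functions have method counterparts on the Pattern object.
first = pat.search(s1).group()
all_hits = pat.findall(s1)
replaced = pat.sub("X", s1)

print(first)     # llo
print(all_hits)  # ['llo', 'lla', 'llb', 'llc']
print(replaced)  # heX world X X X
```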
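2. Flags
Some functions in the re module accept flags as an optional argument. The commonly used flags are listed below. They are integer bit flags, so they can be combined with bitwise OR (|). Flags change how a regular expression behaves:
re.I : ignore case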
result = re.match("hello", "Hello", re.I)
print(result)
# <re.Match object; span=(0, 5), match='Hello'>
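re.M : multi-line mode. ^ and $ also match at the start and end of every line. It does not affect ".", which still does not match a newline: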
result = re.findall(".", "hello \n china", re.M)
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', ' ', 'c', 'h', 'i', 'n', 'a']
re.S : dot-all mode (also called single-line mode). "." also matches the newline \n
result = re.findall(".", "hello \n china" , re.S)
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', '\n', ' ', 'c', 'h', 'i', 'n', 'a']
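Since flags can be combined with bitwise OR, here is a minimal sketch that uses re.I and re.M together. The sample text is assumed for illustration.

```python
import re

text = "Hello world\nhello china"

# re.I ignores case; re.M lets ^ match at the start of every line.
both = re.findall("^hello", text, re.I | re.M)

# With re.I alone, ^ only matches at the very start of the string.
only_i = re.findall("^hello", text, re.I)

print(both)    # ['Hello', 'hello']
print(only_i)  # ['Hello']
```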
Single Characters
1. Matching a single character
. : matches any single character except a newline
result = re.findall(".", "hello \n china")
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', ' ', 'c', 'h', 'i', 'n', 'a']
[] : matches any one of the characters listed inside the brackets
result = re.findall("[012].ello", "0hello 1hello 2hello 3hello 4hello")
print(result)
# ['0hello', '1hello', '2hello']
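Besides listing characters one by one, a character class can hold ranges such as a-z, and a leading ^ inside the brackets negates it. A small sketch with a made-up sample string:

```python
import re

s = "a1 b2 C3 d_"

# A lowercase letter followed by a digit.
lower_digit = re.findall(r"[a-z][0-9]", s)

# A letter followed by anything that is neither a digit nor a space.
letter_other = re.findall(r"[a-zA-Z][^0-9 ]", s)

print(lower_digit)   # ['a1', 'b2']
print(letter_other)  # ['d_']
```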
\d : matches a digit, i.e. 0-9 (for str patterns, other Unicode decimal digits too unless re.ASCII is set)
result = re.findall(r"\dhello", "hello 1hello 2hello 5hello")
print(result)
# ['1hello', '2hello', '5hello']
\D : matches any character that is not a digit
result = re.findall(r"\Dhello", "hello hello hello")
print(result)
# [' hello', ' hello']
\s : matches whitespace, such as a space, tab, or newline
result = re.findall(r"\shello", " hello 1hello 5hello 0hello")
print(result)
# [' hello']
\S : matches any non-whitespace character
result = re.findall(r"\Shello", "1hello shello hello")
print(result)
# ['1hello', 'shello']
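\w : matches a word character, i.e. a-z, A-Z, 0-9, and _ (for str patterns, Unicode letters and digits too unless re.ASCII is set)
result = re.findall(r"\wello", "hello aello5ello_ello .ello ^ello")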
print(result)
# ['hello', 'aello', '5ello', '_ello']
\W : matches any non-word character
result = re.findall(r"\Wello", "hello aelo5ello_ello .ello ^ello")
print(result)
# ['.ello', '^ello']
2. The non-greedy qualifier ?
Quantifiers are greedy by default: they match as many characters as possible
result = re.findall("he*", "hee heeee zzheee0")
print(result)
# ['hee', 'heeee', 'heee']
When ? follows +, ?, *, or {m,n}, the quantifier becomes non-greedy and matches as few characters as possible
result = re.findall("he*?", "hee heeee zzheee0")
print(result)
# ['h', 'h', 'h']
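The difference matters most when a pattern has a closing delimiter. A minimal sketch with a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b> in the string.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first </b> it can.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```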
Quantifiers
*: matches the preceding character zero or more times, i.e. it may be absent or repeated
result = re.findall("hi*", "hi china hello china")
print(result)
# ['hi', 'hi', 'h', 'hi']
+: matches the preceding character one or more times, i.e. at least once
result = re.findall("hi+", "hi china hello china")
print(result)
# ['hi', 'hi', 'hi']
?: matches the preceding character zero or one time, i.e. it is either present once or absent
result = re.findall("hi?", "hi china hello china")
print(result)
# ['hi', 'hi', 'h', 'hi']
{m}: matches the preceding character exactly m times
result = re.findall("hi{2}", "hi china hello china")
print(result)
# []
{m,}: matches the preceding character at least m times
result = re.findall("hi{1,}", "hi china hello chiia")
print(result)
# ['hi', 'hi', 'hii']
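The braces also accept a bounded form, {m,n}, which matches between m and n repetitions. The sketch below compares it with {m}; the \b at the end is added here only so that a longer run of "i" is not partially matched.

```python
import re

s = "hi hii hiii hiiii"

# Exactly two "i" characters, then a word boundary.
exact = re.findall(r"hi{2}\b", s)

# Between two and three "i" characters, then a word boundary.
ranged = re.findall(r"hi{2,3}\b", s)

print(exact)   # ['hii']
print(ranged)  # ['hii', 'hiii']
```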
Boundaries
^ : matches the start of the string (with re.M, also the start of each line)
result = re.findall("^hello", "hello world hello zhengzhou")
print(result)
# ['hello']
result = re.findall("^hello", "hello world\nhello zhengzhou", re.M)
print(result)
# ['hello', 'hello']
result = re.findall("^hello", r"hello world\nhello zhengzhou", re.M)
print(result)
# ['hello']
$ : matches the end of the string, or just before a trailing newline (with re.M, also the end of each line)
result = re.findall("zhengzhou$", "hello world hello zhengzhou")
print(result)
# ['zhengzhou']
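Like ^, the $ anchor responds to re.M. A short sketch with a two-line sample string assumed for illustration:

```python
import re

text = "hello zhengzhou\nhello world"

# Without re.M, $ only matches at the end of the whole string.
plain = re.findall("zhengzhou$", text)

# With re.M, $ also matches at the end of each line.
multi = re.findall("zhengzhou$", text, re.M)

print(plain)  # []
print(multi)  # ['zhengzhou']
```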
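\b : matches a word boundary, i.e. a position between a word character and a non-word character (or the start or end of the string)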
result = re.findall(r"\bhello\b", "hello world hello zhengzhou")
print(result)
# ['hello', 'hello']
result = re.findall(r"\bhello\b", "hello\n world hello\n zhengzhou")
print(result)
# ['hello', 'hello']
\B : matches a position that is not a word boundary
result = re.findall(r"\Bhello\B", "1helloworld hello zhengzhou")
print(result)
# ['hello']
Groups
| : matches either the expression on its left or the one on its right
result = re.findall(r"\bhello\b|\bworld\b|\bhi\b", "hello world hi world")
print(result)
# ['hello', 'world', 'hi', 'world']
(ab) : treats the characters inside the parentheses as a group
result = re.search("hello", "hello world hi world")
print(result.group())
# hello
result = re.search("(hello).*?w", "hello world hi world")
print(result.group(), result.group(1))
# hello w hello
\num : refers back to the text matched by group number num
result = re.match(r"(hello).*?\1", "hello world hello china")
print(result.group(), result.group(1))
# hello world hello hello
(?P<name>...) and (?P=name) : give a group a name, and refer back to the text matched by the group called name
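Python's syntax for a named group is (?P<name>...), and (?P=name) is a backreference to it. A minimal sketch that mirrors the \1 example, where the group name "word" is chosen only for illustration:

```python
import re

# (?P<word>hello) names the group; (?P=word) must match the same text again.
m = re.match(r"(?P<word>hello).*?(?P=word)", "hello world hello china")

whole = m.group()
named = m.group("word")
as_dict = m.groupdict()

print(whole)    # hello world hello
print(named)    # hello
print(as_dict)  # {'word': 'hello'}
```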
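Exercise
"""
Use re to extract each stock's name, code, and latest price.
"""
import re

import requests

response = requests.get("http://quote.stockstar.com/stock/ranklist_a_3_1_1.html")
# print(response.text)
# Grab the table body that holds the ranking rows; re.S lets "." span newlines.
result = re.search(r'<tbody class="tbody_right" id="datalist">(.*?)</tbody>', response.text, re.S)
# print(result.group(1))
# One list item per table row.
result = re.findall(r'<tr>(.*?)</tr>', result.group(1))
print(result)
with open("data.txt", "w", encoding="utf8") as f:
    for r in result:
        # Split the row into its cells.
        r1 = re.findall(r'<td.*?>(.*?)</td>', r)
        if len(r1) < 3:
            continue
        # print(r1[0], r1[1], r1[2])
        # Named "code" rather than "id" to avoid shadowing the built-in; dots are escaped so they match literally.
        code = re.search(r'<a href="//stock\.quote\.stockstar\.com/(.*?)\.shtml">\1</a>', r1[0])
        # print(code.group(1))
        name = re.search(r'<a href="//stock\.quote\.stockstar\.com/(.*?)\.shtml">(.*?)</a>', r1[1])
        # print(name.group(2))
        price = re.search(r'<span class="red">(.*?)</span>', r1[2])
        # Skip rows where a field is missing, e.g. a price that is not inside a "red" span.
        if not (code and name and price):
            continue
        print(price.group(1))
        info = "Stock code " + code.group(1) + " Stock name " + name.group(2) + " Stock price " + price.group(1)
        f.write(info)
        f.write('\n')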
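To try the same extraction steps without a network request, the sketch below runs them on a small HTML fragment. The fragment, the stock codes, the names, and the prices are all hypothetical and only imitate the structure the patterns expect; the real page may differ.

```python
import re

# Hypothetical HTML shaped like what the patterns expect (not real data).
sample = '''<tbody class="tbody_right" id="datalist">
<tr><td><a href="//stock.quote.stockstar.com/600000.shtml">600000</a></td><td><a href="//stock.quote.stockstar.com/600000.shtml">Sample Bank</a></td><td><span class="red">10.52</span></td></tr>
<tr><td><a href="//stock.quote.stockstar.com/600004.shtml">600004</a></td><td><a href="//stock.quote.stockstar.com/600004.shtml">Sample Airport</a></td><td><span class="red">9.87</span></td></tr>
</tbody>'''

body = re.search(r'<tbody class="tbody_right" id="datalist">(.*?)</tbody>', sample, re.S)
rows = re.findall(r"<tr>(.*?)</tr>", body.group(1))

records = []
for row in rows:
    cells = re.findall(r"<td.*?>(.*?)</td>", row)
    # \1 requires the link text to repeat the code found in the URL.
    code = re.search(r'\.com/(.*?)\.shtml">\1</a>', cells[0])
    name = re.search(r'\.shtml">(.*?)</a>', cells[1])
    price = re.search(r'<span class="red">(.*?)</span>', cells[2])
    records.append((code.group(1), name.group(1), price.group(1)))

print(records)
```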