Regular Expressions

顾一大人

Article link: https://blog.csdn.net/qq_41828603/article/details/89258995

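A regular expression (also called a regex or pattern expression) is typically used to search for, and replace, text that matches a given pattern (rule).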

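1. Purpose

Given a regular expression and a string, we can:
(1) check whether the string satisfies the filtering logic of the regular expression (this is called "matching");
(2) extract the specific parts we want from the string.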
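
To make the two purposes concrete, here is a minimal sketch. The sample text and the pattern are made up for illustration only.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
text = "order 2024-05 shipped"
pattern = r"(\d{4})-(\d{2})"

# Purpose (1): does the string contain something that fits the pattern?
m = re.search(pattern, text)
print(m is not None)   # True

# Purpose (2): pull out the specific parts we care about.
year, month = m.groups()
print(year, month)     # 2024 05
```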

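2. Characteristics

(1) Very flexible, logical, and powerful;
(2) Complex string handling can be done quickly and very concisely;
(3) For newcomers, the syntax is fairly cryptic and hard to read.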

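Using the re module

1. Common functions

re.match(pattern, string, flags=0)
Tries to match pattern at the beginning of string. If it matches (the match may be an empty string), the corresponding Match object is returned; otherwise None is returned.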

import re

s1 = "hello world lla llb llc"

result = re.match("he", s1)
print(result)
# <re.Match object; span=(0, 2), match='he'>

re.search(pattern, string, flags=0)
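Scans the whole string and finds the first place where pattern matches (the match may be an empty string), returning the corresponding Match object. Returns None if nothing matches.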

result = re.search("he", s1)
print(result)
# <re.Match object; span=(0, 2), match='he'>
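
Both examples above give the same result because "he" happens to sit at the start of s1. A small sketch with a pattern that appears later in the string shows the difference between the two functions:

```python
import re

s1 = "hello world lla llb llc"

at_start = re.match("world", s1)    # "world" is not at position 0
anywhere = re.search("world", s1)   # found later in the string

print(at_start)          # None
print(anywhere.span())   # (6, 11)
```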

re.fullmatch(pattern, string, flags=0)
Checks whether the entire string matches pattern. If so, the corresponding Match object is returned; otherwise None is returned.

result = re.fullmatch("hello world lla llb llc", s1)
print(result)
# <re.Match object; span=(0, 23), match='hello world lla llb llc'>

re.findall(pattern, string, flags=0)
Returns all non-overlapping matches of pattern in string as a list.

result = re.findall("ll", s1)
print(result, type(result))
# ['ll', 'll', 'll', 'll'] <class 'list'>
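
One detail worth knowing: when the pattern contains groups, findall returns the group contents instead of the whole match. A short sketch on the same s1 (the patterns are only illustrative):

```python
import re

s1 = "hello world lla llb llc"

no_group = re.findall(r"ll[a-c]", s1)        # whole matches
one_group = re.findall(r"ll([a-c])", s1)     # only the captured part
two_groups = re.findall(r"(l)l([a-c])", s1)  # one tuple per match

print(no_group)     # ['lla', 'llb', 'llc']
print(one_group)    # ['a', 'b', 'c']
print(two_groups)   # [('l', 'a'), ('l', 'b'), ('l', 'c')]
```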

re.split(pattern, string, maxsplit=0, flags=0)
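Splits string at every occurrence of pattern and returns the pieces as a list (maxsplit=0 means no limit on the number of splits).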

result = re.split("hello", "hello china hello world")
print(result)
# ['', ' china ', ' world']

re.sub(pattern, repl, string, count=0, flags=0)
Replaces the text matched by pattern with repl, making at most count replacements (count=0 replaces every occurrence).

result = re.sub("ll", "666", s1)
print(result)
# he666o world 666a 666b 666c
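
As a quick sketch of the other arguments: count limits the number of replacements, and repl may also be a function that receives each Match object. The lambda here is just an illustration.

```python
import re

s1 = "hello world lla llb llc"

# Replace only the first two occurrences.
limited = re.sub("ll", "666", s1, count=2)
print(limited)   # he666o world 666a llb llc

# repl can be a callable that builds the replacement from the match.
upper = re.sub(r"ll[a-c]", lambda m: m.group().upper(), s1)
print(upper)     # hello world LLA LLB LLC
```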

re.finditer(pattern, string, flags=0)
Returns an iterator that yields one Match object for each non-overlapping match.

result = re.finditer("he", s1)
print(result)
# <callable_iterator object at 0x0000016D8AE0DA20>
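
Printing the iterator itself is not very informative. A minimal sketch of how it is usually consumed, looping over the Match objects it yields:

```python
import re

s1 = "hello world lla llb llc"

spans = []
for m in re.finditer("ll", s1):
    spans.append(m.span())
    print(m.span(), m.group())
# (2, 4) ll
# (12, 14) ll
# (16, 18) ll
# (20, 22) ll
```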

re.compile(pattern, flags=0)
Compiles pattern into a reusable pattern object.

pat = re.compile("hello")
print(pat, type(pat))
# re.compile('hello') <class 're.Pattern'>
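
The compiled object has the same methods as the module-level functions, without the pattern argument. A short sketch (the pattern is chosen only for illustration):

```python
import re

s1 = "hello world lla llb llc"

pat = re.compile(r"ll[a-c]")

found = pat.findall(s1)
first = pat.search(s1).span()
replaced = pat.sub("X", s1)

print(found)      # ['lla', 'llb', 'llc']
print(first)      # (12, 15)
print(replaced)   # hello world X X X
```

Compiling once is mainly convenient when the same pattern is used many times, for example inside a loop.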

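2. flags

Several functions in the re module take flags as an optional argument. The most commonly used flags are listed below. Each one is an integer bit flag, so they can be combined with bitwise OR (|). Flags change how the regular expression behaves:

re.I : ignore case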

result = re.match("hello", "Hello", re.I)
print(result)
# <re.Match object; span=(0, 5), match='Hello'>

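re.M : multi-line mode. ^ and $ also match at the start and end of every line. It does not change ".", which still does not match a newline: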

result = re.findall(".", "hello \n china", re.M)
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', ' ', 'c', 'h', 'i', 'n', 'a']

re.S : dot-all ("single-line") mode. "." matches every character, including the newline \n:

result = re.findall(".", "hello \n china" , re.S)
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', '\n', ' ', 'c', 'h', 'i', 'n', 'a']
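
Since flags are bit values, several can be passed at once with |. A small sketch with a made-up two-line string:

```python
import re

text = "Hello\nWORLD"

# No flags: ^ only matches at the very start, and [a-z] is case-sensitive.
plain = re.findall(r"^[a-z]+$", text)

# re.I ignores case, re.M lets ^ and $ match on every line.
combined = re.findall(r"^[a-z]+$", text, re.I | re.M)

print(plain)      # []
print(combined)   # ['Hello', 'WORLD']
```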

Single characters

1. Single-character matching

. : matches any single character except a newline

result = re.findall(".", "hello \n china")
print(result)
# ['h', 'e', 'l', 'l', 'o', ' ', ' ', 'c', 'h', 'i', 'n', 'a']

[] : matches any one of the characters listed inside the brackets

result = re.findall("[012].ello", "0hello 1hello 2hello 3hello 4hello")
print(result)
# ['0hello', '1hello', '2hello']

\d : matches a digit, i.e. 0-9

result = re.findall(r"\dhello", "hello 1hello 2hello 5hello")
print(result)
# ['1hello', '2hello', '5hello']

\D : matches any character that is not a digit

result = re.findall(r"\Dhello", "hello hello      hello")
print(result)
# [' hello', ' hello']

\s : matches a whitespace character, e.g. a space, a tab, or a newline

result = re.findall(r"\shello", " hello 1hello 5hello 0hello")
print(result)
# [' hello']

\S : matches any non-whitespace character

result = re.findall(r"\Shello", "1hello shello hello")
print(result)
# ['1hello', 'shello']

\w : matches a word character, i.e. a-z, A-Z, 0-9, _ (for str patterns, letters and digits from other scripts as well)

result = re.findall(r"\wello", "hello aello5ello_ello .ello ^ello")
print(result)
# ['hello', 'aello', '5ello', '_ello']

\W : matches any non-word character

result = re.findall(r"\Wello", "hello aelo5ello_ello .ello ^ello")
print(result)
# ['.ello', '^ello']

2. The non-greedy modifier ?

By default, quantifiers are greedy: they match as many characters as possible.

result = re.findall("he*", "hee heeee zzheee0")
print(result)
# ['hee', 'heeee', 'heee']

When ? follows +, ?, *, or {m,n}, the quantifier becomes non-greedy and matches as few characters as possible.

result = re.findall("he*?", "hee heeee zzheee0")
print(result)
# ['h', 'h', 'h']
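
Where this matters most in practice is text with repeated delimiters. A minimal sketch on a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b> in the string.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first </b> it can.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)   # ['one</b><b>two']
print(lazy)     # ['one', 'two']
```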

Quantifiers

* : matches the preceding character zero or more times, i.e. it is optional

result = re.findall("hi*", "hi china hello china")
print(result)
# ['hi', 'hi', 'h', 'hi']

+ : matches the preceding character one or more times, i.e. at least once

result = re.findall("hi+", "hi china hello china")
print(result)
# ['hi', 'hi', 'hi']

? : matches the preceding character zero or one time, i.e. it is either there once or not at all

result = re.findall("hi?", "hi china hello china")
print(result)
# ['hi', 'hi', 'h', 'hi']

{m} : matches the preceding character exactly m times

result = re.findall("hi{2}", "hi china hello china")
print(result)
# []

{m,} : matches the preceding character at least m times

result = re.findall("hi{1,}", "hi china hello chiia")
print(result)
# ['hi', 'hi', 'hii']
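
The same syntax also accepts an upper bound: {m,n} means "between m and n times". A small sketch with a made-up string; \b (word boundary) is used here so that longer runs of digits are rejected rather than partially matched.

```python
import re

text = "1 22 333 4444"

# Whole "words" made of two or three digits.
bounded = re.findall(r"\b\d{2,3}\b", text)

# Without the boundaries, the first three digits of 4444 also match.
unbounded = re.findall(r"\d{2,3}", text)

print(bounded)     # ['22', '333']
print(unbounded)   # ['22', '333', '444']
```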

Boundaries

^ : matches the start of the string (with re.M, also the start of every line)

result = re.findall("^hello", "hello world hello zhengzhou")
print(result)
# ['hello']

result = re.findall("^hello", "hello world\nhello zhengzhou", re.M)
print(result)
# ['hello', 'hello']

result = re.findall("^hello", r"hello world\nhello zhengzhou", re.M)
print(result)
# ['hello']

$ : matches the end of the string (with re.M, also the end of every line)

result = re.findall("zhengzhou$", "hello world hello zhengzhou")
print(result)
# ['zhengzhou']
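
Like ^, $ reacts to re.M. A short sketch on a two-line string:

```python
import re

text = "hello world\nhello zhengzhou"

# Default: only the end of the whole string counts.
last_word = re.findall(r"\w+$", text)

# re.M: the end of every line counts.
line_ends = re.findall(r"\w+$", text, re.M)

print(last_word)   # ['zhengzhou']
print(line_ends)   # ['world', 'zhengzhou']
```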

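\b : matches a word boundary, i.e. the empty position between a word character (\w) and a non-word character or the start/end of the string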

result = re.findall(r"\bhello\b", "hello world hello zhengzhou")
print(result)
# ['hello', 'hello']
result = re.findall(r"\bhello\b", "hello\n world hello\n zhengzhou")
print(result)
# ['hello', 'hello']

\B : matches a position that is not a word boundary

result = re.findall(r"\Bhello\B", "1helloworld hello zhengzhou")
print(result)
# ['hello']

Alternation and groups

| : matches either the expression on its left or the one on its right

result = re.findall(r"\bhello\b|\bworld\b|\bhi\b", "hello world hi world")
print(result)
# ['hello', 'world', 'hi', 'world']

(ab) : treats the characters inside the parentheses as one group

result = re.search("hello", "hello world hi world")
print(result.group())
# hello

result = re.search("(hello).*?w", "hello world hi world")
print(result.group(), result.group(1))
# hello w hello

\num : refers back to the string matched by group number num

result = re.match(r"(hello).*?\1", "hello world hello china")
print(result.group(), result.group(1))
# hello world hello hello

(?P<name>...) and (?P=name) : the first gives a group a name; the second refers back to the string matched by the group called name
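
A minimal sketch of named groups. The group names tag and body and the sample strings are made up for illustration.

```python
import re

# (?P<tag>...) names the opening tag; (?P=tag) requires the same text again.
pat = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>")

ok = pat.match("<h1>title</h1>")
print(ok.group("tag"), ok.group("body"))   # h1 title
print(ok.groupdict())                      # {'tag': 'h1', 'body': 'title'}

# The closing tag differs, so the back-reference fails.
bad = pat.match("<h1>title</h2>")
print(bad)                                 # None
```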

Exercise

"""
Use re to extract each stock's name, code, and latest price.
"""
import re

import requests

response = requests.get("http://quote.stockstar.com/stock/ranklist_a_3_1_1.html", timeout=10)
# print(response.text)

# Grab the table body; re.S lets "." match across line breaks.
tbody = re.search(r'<tbody class="tbody_right" id="datalist">(.*?)</tbody>', response.text, re.S)
if tbody is None:
    raise SystemExit("data table not found; the page layout may have changed")
# print(tbody.group(1))

# One string per table row.
rows = re.findall(r'<tr>(.*?)</tr>', tbody.group(1), re.S)
print(rows)

with open("data.txt", "w", encoding="utf8") as f:
    for row in rows:
        cells = re.findall(r'<td.*?>(.*?)</td>', row, re.S)
        if len(cells) < 3:
            continue  # not a data row
        # print(cells[0], cells[1], cells[2])

        # \1 requires the link text to repeat the code captured from the URL.
        code = re.search(r'<a href="//stock\.quote\.stockstar\.com/(.*?)\.shtml">\1</a>', cells[0])
        name = re.search(r'<a href="//stock\.quote\.stockstar\.com/(.*?)\.shtml">(.*?)</a>', cells[1])
        price = re.search(r'<span class="red">(.*?)</span>', cells[2])
        if not (code and name and price):
            continue  # skip rows that do not have the expected layout
        print(price.group(1))

        info = "code: " + code.group(1) + "  name: " + name.group(2) + "  price: " + price.group(1)
        f.write(info)
        f.write('\n')
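
The exercise depends on a live page whose layout may change. To practise the same extraction offline, here is a sketch that runs on a hand-written fragment. The fragment, the stock names, and the prices are all made up, and they only imitate the layout the exercise assumes. It also combines the three per-cell searches into one pattern with named groups, which is a different approach from the exercise code.

```python
import re

# Hypothetical fragment imitating the table layout assumed by the exercise.
sample = '''
<tbody class="tbody_right" id="datalist">
<tr><td><a href="//stock.quote.stockstar.com/600000.shtml">600000</a></td><td><a href="//stock.quote.stockstar.com/600000.shtml">Demo Bank</a></td><td><span class="red">10.50</span></td></tr>
<tr><td><a href="//stock.quote.stockstar.com/000001.shtml">000001</a></td><td><a href="//stock.quote.stockstar.com/000001.shtml">Demo Corp</a></td><td><span class="red">8.20</span></td></tr>
</tbody>
'''

# One pattern per row: code (repeated as link text), name, price.
row_pat = re.compile(
    r'<tr><td><a href="[^"]*/(?P<code>\d+)\.shtml">(?P=code)</a></td>'
    r'<td><a [^>]*>(?P<name>.*?)</a></td>'
    r'<td><span class="red">(?P<price>.*?)</span></td></tr>'
)

stocks = [m.groupdict() for m in row_pat.finditer(sample)]
for s in stocks:
    print(s["code"], s["name"], s["price"])
# 600000 Demo Bank 10.50
# 000001 Demo Corp 8.20
```

For real pages, an HTML parser is usually more robust than regular expressions; the regex version is kept here because the point of the exercise is to practise re.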