Python Web Crawler (9): re

止步听风

Category: Web Crawler. Tags: regular expressions, matching rules, raw strings, re

Article link: https://blog.csdn.net/SAKURASANN/article/details/106257704

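Regular expressions

A regular expression is a text pattern made up of ordinary characters (such as a-z) and special characters (metacharacters).
A regular expression is written as a single string and describes a set of strings that follow a given syntactic rule.

Matching rules

Matching a literal string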

import re

text = 'hello'
ret = re.match('he',text)
print(ret.group())

Output:

he

Matching any character (.)

import re

text = "ab"
ret = re.match('.',text)
print(ret.group())

Output:

a

The pattern matched the first character. By default, however, . does not match a newline:

import re

text = "\nab"
ret = re.match('.',text)
print(ret.group())

Output:

AttributeError: 'NoneType' object has no attribute 'group'
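
The AttributeError above comes from calling group() on None, which is what re.match returns when nothing matches. The following is a minimal sketch of guarding against that, and of passing the re.DOTALL flag so that . also matches a newline. The helper name first_char is made up for this illustration.

```python
import re

def first_char(text, flags=0):
    """Return the first character matched by '.', or None if there is no match.

    first_char is a hypothetical helper used only for this illustration.
    """
    ret = re.match('.', text, flags)
    # re.match returns None on failure, so check before calling group()
    if ret is None:
        return None
    return ret.group()

print(first_char("ab"))                       # a
print(first_char("\nab"))                     # None
print(repr(first_char("\nab", re.DOTALL)))    # '\n'
```

Checking the return value before calling group() is the usual way to avoid the exception when a match is not guaranteed.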

Matching any digit (\d)

import re

text = "123"
ret = re.match('\d',text)
print(ret.group())

Output:

1

Matching any non-digit (\D)

import re

text = "abc"
ret = re.match('\D',text)
print(ret.group())

Output:

a

If the first character is a digit, the match fails and calling group() raises an error:

import re

text = "123"
ret = re.match('\D',text)
print(ret.group())

Output:

AttributeError: 'NoneType' object has no attribute 'group'

Matching whitespace characters (\s)

import re

text = "\t"
ret = re.match('\s',text)
print(ret.group())

Output (a single tab character, so nothing visible is printed):

Matching non-whitespace characters (\S)

import re

text = "abc"
ret = re.match('\S',text)
print(ret.group())

Output:

a

Matching a-z, A-Z, 0-9, and underscore (\w)

import re

text = "_abc"
ret = re.match('\w',text)
print(ret.group())

Output:

_

If the first character is not one of these, the match fails and group() raises AttributeError as well.

Matching the opposite of \w (\W)

import re

text = "+-*/"
ret = re.match('\W',text)
print(ret.group())

Output:

+

If the first character is one that \w would match, this also raises AttributeError.
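
One detail worth knowing: for str patterns in Python 3, \w is Unicode-aware by default, so it matches more than a-z, A-Z, 0-9, and underscore. The re.ASCII flag restricts it to exactly those characters. A short sketch (the sample text is arbitrary, and the r prefix marks a raw string):

```python
import re

text = "中文abc"

# By default (str pattern, Python 3), \w matches Unicode word characters
print(re.match(r'\w+', text).group())            # 中文abc

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the match fails at the first character
print(re.match(r'\w+', text, re.ASCII))          # None

# \W is the complement, so under re.ASCII it matches the non-ASCII characters
print(re.match(r'\W+', text, re.ASCII).group())  # 中文
```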

Character sets ([])

Matches any single character listed inside the square brackets.

import re

text = "86-8642354"
ret = re.match('[\d\-]+',text)
print(ret.group())

Output:

86-8642354

Matching zero or more repetitions (*)

import re

text = "86-8642354"
ret = re.match('\d*',text)
print(ret.group())

Output:

86

Matching one or more repetitions (+)

import re

text = "hello"
ret = re.match('\w+',text)
print(ret.group())

Output:

hello

+ requires at least one match. If not even one character matches, the match fails and calling group() raises an error.

import re

text = "+hello"
ret = re.match('\w+',text)
print(ret.group())

Output:

AttributeError: 'NoneType' object has no attribute 'group'

Matching zero or one repetition (?)

import re

text = "abc"
ret = re.match('\w?',text)
print(ret.group())

Output:

a

Matching exactly m repetitions ({m})

import re

text = "abc"
ret = re.match('\w{3}',text)
print(ret.group())

Output:

abc

Matching m to n repetitions ({m,n})

import re

text = "abcd"
ret = re.match('\w{2,5}',text)
print(ret.group())

Output:

abcd
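
Quantifiers are usually combined with character classes to describe a whole format. The sketch below assumes a made-up number format of two digits, a hyphen, and seven digits (the same shape as the 86-8642354 text used earlier). It uses re.fullmatch, which is not covered in this article, so that the entire string has to fit the pattern.

```python
import re

# Hypothetical format: 2 digits, a hyphen, then exactly 7 digits
pattern = r'\d{2}-\d{7}'

for candidate in ["86-8642354", "86-864235", "086-8642354"]:
    ret = re.fullmatch(pattern, candidate)
    print(candidate, "->", "ok" if ret else "no match")
```

Only the first candidate fits: the second has too few digits after the hyphen, and the third has three digits before it.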

Anchoring at the start (^)

import re

text = "hello"
ret = re.match('^h',text)
print(ret.group())

Inside square brackets, ^ negates the set instead:

import re

text = "hello"
ret = re.match('[^\W]',text)
print(ret.group())

Output:

h

Anchoring at the end ($)

import re

text = "hello"
ret = re.search('o$',text)
print(ret.group())

Output:

o

Matching one of several alternatives (|)

import re

text = "world"
ret = re.search('hello|world',text)
print(ret.group())

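Output:

world

Greedy and non-greedy modes

Greedy mode: the regular expression matches as many characters as possible; this is the default.
Non-greedy mode: the regular expression matches as few characters as possible.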

import re

text = "abcdefg"
ret = re.match('\w',text)
print(ret.group())
ret = re.match('\w+',text)
print(ret.group())
ret = re.match('\w+?',text)
print(ret.group())

Output:

a
abcdefg
a

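The first call matches only a. With +, the pattern matches every character, which is greedy mode. Appending ? after + switches it to non-greedy mode: + still requires at least one character, but it now matches as few as possible.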
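
The difference matters most when a delimiter follows the repeated part. A small sketch on an HTML-like string (the sample text is made up for illustration):

```python
import re

text = "<h1>title</h1>"

# Greedy: .+ runs to the last '>' it can reach
print(re.match('<.+>', text).group())    # <h1>title</h1>

# Non-greedy: .+? stops at the first '>' that lets the match succeed
print(re.match('<.+?>', text).group())   # <h1>
```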

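Escape characters

As the sections above show, some characters have a special meaning in a pattern. To match one of these characters literally, escape it with \.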

import re

text = "1+1=2"
ret = re.match('1\+1=2',text)
print(ret.group())

Output:

1+1=2

Raw strings

To match a literal \, you would need:

import re

text = "apple \c"
print(text)
ret = re.search('\\\c',text)
# ret = re.search('\\\\c',text)
print(ret.group())

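Python itself uses \ as an escape character, and the regular expression engine also uses \ for escaping. The '\\\c' passed to search is therefore first unescaped by Python into \\c (\c is not a recognized Python escape, so that backslash is kept, although recent Python versions warn about it). That string is passed to re, which in turn reads \\c as a literal backslash followed by c. With '\\\\c', Python likewise produces \\c, and re again reads it as a literal backslash followed by c.

This double conversion is tedious, and raw strings solve the problem.

A raw string is written by prefixing the string literal with r: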

import re

text = "apple \c"
print(text)
ret = re.search(r'\\c',text)
print(ret.group())

With this form, Python does no escape processing of its own, and the characters after r are passed to the regular expression engine unchanged.
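
To see that the two spellings really are the same pattern, the strings can be compared directly. A minimal sketch (the variable names normal and raw are only for illustration):

```python
import re

normal = '\\\\c'   # Python unescapes this to the three characters \\c
raw = r'\\c'       # raw string: the same three characters, written as-is

print(normal == raw)   # True
print(len(raw))        # 3

text = "apple \\c"     # the text itself contains one backslash followed by c
print(re.search(raw, text).group())   # \c
```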

re

Python ships with the re module, which provides regular expression support.

Common functions

match

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

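match tries the pattern at the start of the string only. On success it returns a match object (call group() on it to get the matched text); otherwise it returns None.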

search

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

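search scans the whole string for the first location where the pattern matches. On success it returns a match object; otherwise it returns None.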
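
The difference between the two is easiest to see on a string where the pattern does not occur at the start. A short sketch; re.fullmatch, which requires the whole string to match, is included for comparison and is an addition not covered in the article:

```python
import re

text = "say hello"

print(re.match('hello', text))                   # None: 'hello' is not at the start
print(re.search('hello', text).group())          # hello
print(re.search('hello', text).span())           # (4, 9)
print(re.fullmatch('say hello', text).group())   # say hello
```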

group

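In a regular expression, parts of the match can be grouped by wrapping them in parentheses:

group(): returns the whole matched string, equivalent to group(0)
groups(): returns a tuple of all subgroups; subgroups are numbered from 1
group(1): returns the first subgroup; passing several indices returns a tuple of those subgroups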

import re

text = "abc def ghi"
ret = re.search('(abc)\s(def)\s(ghi)',text)
print(ret.group())
print(ret.group(1))
print(ret.group(2))
print(ret.group(3))
print(ret.groups())
print(ret.group(1,2,3))

Output:

abc def ghi
abc
def
ghi
('abc', 'def', 'ghi')
('abc', 'def', 'ghi')
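
Groups can also be given names with the (?P<name>...) syntax, which tends to read better than numeric indices once a pattern has several groups. A sketch on the same text; the group names first, second, and third are made up here:

```python
import re

text = "abc def ghi"
# (?P<name>...) gives a group a name; the names below are illustrative only
ret = re.search(r'(?P<first>\w+)\s(?P<second>\w+)\s(?P<third>\w+)', text)

print(ret.group('first'))   # abc
print(ret.group(2))         # def (numbered access still works)
print(ret.groupdict())      # {'first': 'abc', 'second': 'def', 'third': 'ghi'}
```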

findall

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

findall finds every non-overlapping substring that matches the pattern and returns them as a list:

import re

text = "abc abc abc"
ret = re.match('abc',text)
print(type(ret))
ret = re.findall('abc',text)
print(type(ret))
print(ret)

Output (on Python 3.7 and later; older versions print <class '_sre.SRE_Match'> on the first line):

<class 're.Match'>
<class 'list'>
['abc', 'abc', 'abc']
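
As the docstring notes, the shape of the result changes when the pattern contains capturing groups. A brief sketch of that, plus re.finditer for cases where the match objects themselves are needed (the sample text is made up):

```python
import re

text = "a=1, b=2, c=3"

# No groups: a list of the full matches
print(re.findall(r'\w=\d', text))        # ['a=1', 'b=2', 'c=3']

# Two groups: a list of tuples, one tuple per match
print(re.findall(r'(\w)=(\d)', text))    # [('a', '1'), ('b', '2'), ('c', '3')]

# finditer yields match objects, so positions are available as well
for m in re.finditer(r'(\w)=(\d)', text):
    print(m.group(1), m.group(2), m.start())
```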

sub

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

sub performs replacement and returns a new str; the original string is left unchanged:

import re

text = "abc abc abc"
ret = re.sub('abc','def',text)
print(type(ret))
print(text)
print(ret)

Output:

<class 'str'>
abc abc abc
def def def
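
The docstring also mentions count, backslash escapes in the replacement, and callable replacements. A minimal sketch of each; the sample strings are made up for illustration:

```python
import re

text = "abc abc abc"

# count limits the number of replacements
print(re.sub('abc', 'def', text, count=1))          # def abc abc

# \1 and \2 in the replacement refer to the captured groups
print(re.sub(r'(\d+)/(\d+)', r'\2/\1', "12/13"))    # 13/12

# A callable receives each match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
print(doubled)                                      # 6 apples, 20 pears
```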

split

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

split splits the string wherever the pattern matches and returns a list:

import re

text = "abc abc abc"
ret = re.split('\s',text)
print(type(ret))
print(ret)

Output:

<class 'list'>
['abc', 'abc', 'abc']
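
The maxsplit argument and the capturing-group behaviour described in the docstring can be sketched as follows (the sample text, with mixed separators, is made up):

```python
import re

text = "abc  def,ghi"

# One or more whitespace characters or commas act as the separator
print(re.split(r'[\s,]+', text))               # ['abc', 'def', 'ghi']

# maxsplit=1 stops after the first split
print(re.split(r'[\s,]+', text, maxsplit=1))   # ['abc', 'def,ghi']

# A capturing group keeps the separators in the result
print(re.split(r'([\s,]+)', text))             # ['abc', '  ', 'def', ',', 'ghi']
```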

compile

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

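If a regular expression is used frequently, it can be compiled once with compile and the resulting pattern object reused, which avoids repeated pattern lookups and can improve efficiency.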

import re

text = "birthday is 2000/12/13"
r = re.compile(r"""
                \d+ # year
                / # separator
                \d+ # month
                / # separator
                \d+ # day
                """,re.VERBOSE)
ret = r.search(text)
print(ret.group())

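Output:

2000/12/13

The compile call above passes flags=re.VERBOSE, which makes the engine ignore whitespace in the pattern and allows comments to be added to it.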
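
A compiled pattern pays off when the same expression is applied to many strings. The sketch below adds capturing groups to the date pattern and reuses it across several lines; the names date_re and lines, and the sample data, are assumptions made for illustration.

```python
import re

# Compile once, then reuse the pattern object's own methods
date_re = re.compile(r"""
    (\d+)   # year
    /       # separator
    (\d+)   # month
    /       # separator
    (\d+)   # day
    """, re.VERBOSE)

lines = ["birthday is 2000/12/13", "no date here", "due 2024/05/11"]
for line in lines:
    ret = date_re.search(line)
    # search returns None for lines without a date, so check first
    if ret:
        print(ret.groups())
```

The pattern object exposes the same operations as the module-level functions (match, search, findall, sub, split), so nothing else about the calling code needs to change.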
