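1 To use regular expressions, import the re module. It is part of Python's standard library, so there is nothing to download.
1.1 Regular expressions have many commonly used rules
The points to watch here are greedy versus non-greedy matching, and escaping with backslashes.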
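A minimal sketch of those two pitfalls follows. The sample strings and variable names are made up for illustration.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: .* takes as much as it can, so it runs on to the LAST </b>.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first </b> that lets the pattern match.
lazy = re.findall(r"<b>(.*?)</b>", html)

# Backslash escaping: \. means a literal dot, and the r"" raw string
# keeps Python from interpreting the backslashes itself.
literal_dot = re.findall(r"\d+\.\d+", "price 3.14, version 2x7")

print(greedy)       # ['first</b> and <b>second']
print(lazy)         # ['first', 'second']
print(literal_dot)  # ['3.14']
```

Writing patterns as raw strings is a common habit because it avoids having to double every backslash.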
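1.2 When matching web pages, line breaks and letter case sometimes have to be taken into account
By default "." does not match a newline, so when a match has to span lines, pass the flag re.S. To ignore case, pass re.I.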
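As a rough illustration of both flags, assuming a made-up page fragment whose tags differ in case and whose content spans two lines:

```python
import re

page = "<DIV>line one\nline two</div>"

# No flags: "." stops at the newline and "<div>" does not match "<DIV>".
no_flags = re.findall(r"<div>(.*?)</div>", page)

# re.S lets "." match newlines, re.I ignores case; combine them with "|".
with_flags = re.findall(r"<div>(.*?)</div>", page, re.S | re.I)

print(no_flags)    # []
print(with_flags)  # ['line one\nline two']
```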
1.3 The re.findall() method, source code:
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
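It returns every match, collected in a list. If the pattern contains capturing groups, the list holds the group contents instead of the whole match. It is the one to reach for when you want to pull many pieces of information out of a text at once.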
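The docstring's point about groups is easy to miss, so here is a small sketch with zero, one, and two capturing groups. The sample text is invented.

```python
import re

text = "tom=18; jerry=20"

# No group: each item is the whole match.
no_group = re.findall(r"\w+=\d+", text)

# One group: each item is just that group's content.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_group)    # ['tom=18', 'jerry=20']
print(one_group)   # ['18', '20']
print(two_groups)  # [('tom', '18'), ('jerry', '20')]
```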
1.4 The re.compile() method, source code:
def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)
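This method compiles a regex string into a pattern object. It is useful when one expression is needed many times: compile it once, then call the object's methods as often as you like.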
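A short sketch of the compile-once, reuse-many-times idea. The pattern name `link_pattern` and the page snippets are hypothetical.

```python
import re

# Flags are given once, at compile time.
link_pattern = re.compile(r'href="(.*?)"', re.I)

pages = [
    '<a href="/a.html">A</a>',
    '<A HREF="/b.html">B</A>',
]

# The compiled object has its own findall(); no pattern argument needed.
links = [link_pattern.findall(page) for page in pages]

print(links)  # [['/a.html'], ['/b.html']]
```

The pattern object also offers sub(), match() and the other methods covered in this section, so one definition can serve several operations.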
1.5 The re.sub() method, source code:
def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a string, backslash escapes in it are processed. If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)
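In other words, every piece of text matched by pattern is replaced with repl, and the resulting string is returned. count limits how many replacements are made; the default 0 replaces them all. Its most common use is cleaning up extracted web page content that still carries unwanted leftovers.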
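Here is a small sketch of that cleanup use, plus the count parameter and a callable repl as described in the docstring. The input strings are invented.

```python
import re

raw = "<p>Hello,&nbsp;<b>world</b></p>"

# Strip every tag, then turn the leftover entity into a plain space.
no_tags = re.sub(r"<.*?>", "", raw)
clean = re.sub(r"&nbsp;", " ", no_tags)

# count=1 replaces only the leftmost match.
first_only = re.sub(r"<.*?>", "", raw, count=1)

# repl may be a callable: it receives the match object and returns a string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")

print(clean)       # Hello, world
print(first_only)  # Hello,&nbsp;<b>world</b></p>
print(doubled)     # 6 apples, 20 pears
```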
1.6 The re.match() method, source code:
def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return