Parsing Web Pages with Regular Expressions, XPath, and BeautifulSoup

机器猫666

Categories: spider, python. Tags: xpath, re, BeautifulSoup

Article link: https://blog.csdn.net/qq_39431562/article/details/81814056

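This article explains how to use Python's re module for regular expression matching, including greedy matching, non-greedy matching, and the handling of newlines and letter case. It then covers importing and using the BeautifulSoup library, recommends the lxml parser, and describes how CSS selectors and method selectors such as find_all() and find() are used when parsing web pages. It also covers how to get text content and attributes.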

1 To use regular expressions, import the re module. It is part of Python's standard library, so nothing needs to be installed.

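1.1 Regular expressions have many commonly used rules
[Image: common regular expression rules]
Pay particular attention to greedy versus non-greedy matching, and to escaping special characters with a backslash.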
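
To make the difference concrete, here is a minimal sketch of greedy matching, non-greedy matching, and backslash escaping. The HTML snippet and the variable names are made up for illustration and do not come from any real page.

```python
import re

# Hypothetical snippet with two links on one line.
html = '<a href="/page/1">first</a><a href="/page/2">second</a>'

# Greedy: .* takes as much as it can, so the match runs up to the LAST '">'.
greedy = re.findall(r'<a href="(.*)">', html)

# Non-greedy: .*? stops at the first place where the rest of the pattern fits.
lazy = re.findall(r'<a href="(.*?)">', html)

# Escaping: an unescaped . matches any character, \. matches a literal dot.
prices = re.findall(r'\d+\.\d+', 'now 12.50 yuan, was 15.00 yuan')

print(greedy)  # ['/page/1">first</a><a href="/page/2']
print(lazy)    # ['/page/1', '/page/2']
print(prices)  # ['12.50', '15.00']
```

When scraping, the non-greedy form is usually the one you want, because a page tends to repeat the same tag many times. Writing patterns as raw strings (r'...') keeps the backslashes from being interpreted by Python first.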
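1.2 When matching web pages, newlines and letter case sometimes matter
To let a pattern match across line breaks, pass the re.S flag (by default . does not match a newline); to ignore case, pass re.I.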
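
A small sketch of what these two flags change, using a made-up multi-line snippet whose tag is written in upper case. The snippet and variable names are assumptions for illustration only.

```python
import re

# Hypothetical snippet: the opening tag is upper case and the body spans lines.
html = """<DIV class="content">
line one
line two
</div>"""

pattern = r'<div class="content">(.*?)</div>'

# No flags: <div does not match <DIV, so nothing is found.
no_flags = re.findall(pattern, html)

# re.I alone fixes the case problem, but . still cannot cross a newline.
only_i = re.findall(pattern, html, re.I)

# Flags are combined with |: re.S lets . match newlines, re.I ignores case.
both = re.findall(pattern, html, re.S | re.I)

print(no_flags)  # []
print(only_i)    # []
print(both)      # ['\nline one\nline two\n']
```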

1.3 The re.findall() method, source code

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

It returns all matches in a list. Its typical use case is extracting many pieces of information from a page in a single call.
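
As the docstring says, the shape of the result depends on how many capturing groups the pattern has. The following sketch shows both cases on a made-up list of movies; the HTML and names are hypothetical.

```python
import re

# Hypothetical list markup, not taken from a real site.
html = '''
<li><a href="/movie/1">Movie A</a><span>9.1</span></li>
<li><a href="/movie/2">Movie B</a><span>8.7</span></li>
'''

# Several groups: each match becomes a tuple with one item per group.
rows = re.findall(r'<a href="(.*?)">(.*?)</a><span>(.*?)</span>', html)

# One group: the result is a plain list of strings.
titles = re.findall(r'<a href=".*?">(.*?)</a>', html)

print(rows)    # [('/movie/1', 'Movie A', '9.1'), ('/movie/2', 'Movie B', '8.7')]
print(titles)  # ['Movie A', 'Movie B']
```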

1.4 The re.compile() method, source code

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

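This method compiles a regular expression string into a pattern object. Its use case is a pattern that is compiled once and then reused many times.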
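
A minimal sketch of that reuse, assuming a handful of page fragments that all need the same pattern. The fragments and the name link_re are invented for the example.

```python
import re

# Compile once; flags are passed here instead of on every call.
link_re = re.compile(r'href="(.*?)"', re.I)

# Hypothetical page fragments to scan with the same pattern.
pages = [
    '<a HREF="/a">A</a>',
    '<a href="/b">B</a><a href="/c">C</a>',
]

# The compiled object has its own findall(), sub(), match() and so on.
links = [link_re.findall(page) for page in pages]

print(links)  # [['/a'], ['/b', '/c']]
```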
1.5 The re.sub() method, source code

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

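In other words, every non-overlapping piece of text that pattern matches is replaced by repl, and the resulting string is returned. It is most often used to strip unwanted leftovers from extracted page content.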
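
Here is a rough sketch of that kind of cleanup on a made-up fragment. The tag-stripping pattern is a simplification that is good enough for small snippets, not a general HTML cleaner.

```python
import re

# Hypothetical extracted fragment with tags and an HTML entity left in.
raw = '<p>Hello,&nbsp;<b>world</b>!</p>\n'

# Remove anything that looks like a tag.
no_tags = re.sub(r'<[^>]+>', '', raw)

# Turn &nbsp; and runs of whitespace into single spaces, then trim the ends.
clean = re.sub(r'&nbsp;|\s+', ' ', no_tags).strip()

# count limits how many replacements are made (0, the default, means all).
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

print(clean)       # Hello, world!
print(first_only)  # a#b2c3
```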

1.6 The re.match() method, source code

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return