(2) Getting Started with the BeautifulSoup Library | Python Web Scraping in Practice

Latest recommended article published on 2024-08-30 10:01:19

倞涼諒

Category: Notes. Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_44242764/article/details/109910798

Installing BeautifulSoup: pip3 install beautifulsoup4
Importing BeautifulSoup: from bs4 import BeautifulSoup

An HTML document can be viewed as a tree of tags.
BeautifulSoup is a library for parsing, traversing, and maintaining that "tag tree".

Common parsers for the BeautifulSoup class

Parser	Usage	Requirement
Python's built-in HTML parser (used through bs4)	BeautifulSoup(mk, 'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
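
As a minimal sketch of the first row of the table, the snippet below parses a small markup string with 'html.parser', which needs nothing beyond bs4 itself. The markup in `mk` is made up for this illustration and is not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
mk = "<html><body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"

# 'html.parser' ships with Python, so only bs4 has to be installed.
soup = BeautifulSoup(mk, "html.parser")

print(soup.p.b.string)    # text inside the <b> tag
print(soup.p.get_text())  # all text inside <p>, nested tags included
print(soup.prettify())    # the parsed tree, re-indented for reading
```

Swapping in 'lxml' or 'html5lib' only changes the second argument, provided the corresponding package is installed.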

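Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the basic unit for organizing information; it opens with <> and closes with </>
Name	The tag's name; the name of <p>…</p> is 'p'. Access: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Access: <tag>.attrs
NavigableString	The non-attribute string inside a tag, that is, the text between <> and </>. Access: <tag>.string
Comment	The comment part of a string inside a tag; Comment is a special kind of NavigableString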
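
Because `.string` returns a comment's text without the `<!-- -->` markers, printed output alone cannot tell a comment from ordinary text. One way to distinguish them, sketched here under the assumption that checking the node's class is acceptable, is `isinstance`. The markup and the helper `describe` are hypothetical names introduced only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a comment, one holding plain text.
mk = "<b><!--a comment--></b><p id='x'>plain text</p>"
soup = BeautifulSoup(mk, "html.parser")

def describe(node):
    """Return a short label for the kind of node passed in."""
    # Comment is tested first because it is a subclass of NavigableString.
    if isinstance(node, Comment):
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return "other"

print(soup.p.name, soup.p.attrs)  # name and attribute dictionary of <p>
print(describe(soup.b.string))    # the comment inside <b>
print(describe(soup.p.string))    # the ordinary text inside <p>
```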

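Traversing the tag tree

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over ancestors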
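
A short sketch of upward traversal, using made-up markup: starting from the innermost tag, `.parent` gives the direct parent and `.parents` walks all the way up to the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a few levels of nesting.
mk = "<html><body><div><p><b>deep</b></p></div></body></html>"
soup = BeautifulSoup(mk, "html.parser")

# Direct parent of <b>.
print(soup.b.parent.name)

# Every ancestor of <b>, from nearest to farthest.
ancestors = [parent.name for parent in soup.b.parents]
print(ancestors)
```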

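Downward traversal

Attribute	Description
.contents	A list of child nodes; all direct children of the Tag are stored in the list
.children	An iterator over child nodes; similar to .contents, used to loop over children
.descendants	An iterator over all descendant nodes, used to loop over the whole subtree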
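
The difference between direct children and all descendants can be sketched as follows. The markup is hypothetical and written without whitespace between tags, so that no whitespace-only text nodes show up among the children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; no whitespace between tags on purpose.
mk = "<div><p>one<b>two</b></p><span>three</span></div>"
soup = BeautifulSoup(mk, "html.parser")
div = soup.div

# .contents is a list, so it supports len() and indexing.
print(len(div.contents))

# .children iterates over the same direct children.
child_names = [child.name for child in div.children]
print(child_names)

# .descendants visits every node below <div> in document order,
# text nodes included; tags are shown by name, strings by their text.
labels = [node.name if isinstance(node, Tag) else str(node)
          for node in div.descendants]
print(labels)
```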

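Sibling traversal

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order

Siblings are nodes that share the same parent. A sibling is not necessarily a tag: text between two tags is a NavigableString node and counts as a sibling too.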
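
The sketch below, built on made-up markup, shows that text between tags also counts as a sibling, so code that expects tags usually has to filter by type. Filtering with `isinstance` is one option chosen here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: the text " and " sits between <a> and <b>.
mk = "<p><a>one</a> and <b>two</b><i>three</i></p>"
soup = BeautifulSoup(mk, "html.parser")

b = soup.b
print(repr(b.previous_sibling))  # the text node before <b>, not the <a> tag
print(b.next_sibling)            # the <i> tag

# Keep only tags among the siblings that follow <a>.
following_tags = [s.name for s in soup.a.next_siblings if isinstance(s, Tag)]
print(following_tags)

# Preceding siblings of <i>, nearest first.
preceding = [str(s) for s in soup.i.previous_siblings]
print(preceding)
```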

Example

test.html

<!DOCTYPE html>
<html>
    <title>
        HTML tag tree
    </title>
    <body>
        <a class="class01" id="id01"><!--This is a comment --></a>
        <p class="class02" id="id02">
            <h1>
                No.1
                <h2>
                    iterators
                </h2>
            </h1>
            <h1>
                No.2
            </h1>
        </p>
    </body>
</html>

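from bs4 import BeautifulSoup

def func():
    # Parse test.html with the built-in parser; "with" closes the file afterwards.
    with open("./test.html", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    tag = soup.a           # first <a> tag in the document
    print(type(tag))       # the Tag class
    print(tag.name)        # tag name
    print(tag.attrs)       # attributes as a dictionary
    print(tag.string)      # the comment's text, without the <!-- --> markers
    print(soup.p.string)   # None, because <p> has more than one child node

if __name__ == "__main__":
    func()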

Output

<class 'bs4.element.Tag'>
a
{'class': ['class01'], 'id': 'id01'}
This is a comment 
None