beautifulsoup4 Tutorial (2): The Four Main Objects in bs4

tyson Lee

Article link: https://blog.csdn.net/chinaltx/article/details/86748757

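This article explains the four core objects of the BeautifulSoup4 library (Tag, NavigableString, BeautifulSoup, and Comment) and uses examples to show how they are used to parse and manipulate HTML documents, including how to get tags, attributes, inner text, and comments.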

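beautifulsoup4 Tutorial (1): Basics and a first crawler

beautifulsoup4 Tutorial (2): The four main objects in bs4

beautifulsoup4 Tutorial (3): Traversing and searching the document tree

beautifulsoup4 Tutorial (4): CSS selectors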

3. The Four Kinds of Objects

Beautiful Soup converts a complex HTML document into a tree structure. Every node in the tree is a Python object, and all of these objects fall into four kinds:
Tag
NavigableString
BeautifulSoup
Comment
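As a quick orientation before the individual sections, the sketch below prints the class of one object of each kind. It is only an illustration: the two-link snippet is a shortened, hypothetical version of the tutorial's HTML, and the built-in "html.parser" is assumed instead of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical, shortened snippet (not the tutorial's full document).
snippet = '<p class="story"><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'

# Assumption: the built-in parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")

first_link = soup.find("a", id="link1")   # contains only a comment
second_link = soup.find("a", id="link2")  # contains plain text

# Map each of the four kinds to the class name of a concrete object.
kinds = {
    "BeautifulSoup": type(soup).__name__,
    "Tag": type(second_link).__name__,
    "NavigableString": type(second_link.string).__name__,
    "Comment": type(first_link.string).__name__,
}

for label, class_name in kinds.items():
    print(label, "->", class_name)
```

Each label should map to a class of the same name, which is the classification the following sections walk through one by one.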

3.1 Tag

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Print the first tag of each kind
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

result:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

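Writing soup followed by a tag name is an easy way to get that tag and its content.
It returns only the first tag in the whole document that matches the name.
The type of these objects is <class 'bs4.element.Tag'>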
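The first-match behavior can be checked directly. The sketch below is an illustration rather than part of the original tutorial: it assumes the built-in "html.parser" and a shortened, hypothetical snippet, and it compares attribute-style access with find and find_all, which are real Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened snippet with three links.
snippet = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(snippet, "html.parser")

first = soup.a                    # attribute-style access
same_as_find = soup.find("a")     # explicit search for the first <a>
all_links = soup.find_all("a")    # every <a> in the document
missing = soup.table              # no <table> exists in the snippet

print(first["id"])
print(first is same_as_find)
print([link["id"] for link in all_links])
print(missing)
```

Attribute-style access is convenient for a quick look at a document, but when a tag may be absent the None result has to be handled before using it.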
A Tag object has two important attributes:

name

Returns the name of the tag. For the BeautifulSoup object itself, the name is the special value [document].

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

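# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Print the name of the document object and of the <head> tag
print(soup.name)
print(soup.head.name)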

result:
[document]
head
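The name attribute is also writable: assigning to it renames the tag in the tree. The sketch below is a minimal illustration under the assumption that the built-in "html.parser" is used; the one-line snippet is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet.
soup = BeautifulSoup("<p><b>The Dormouse's story</b></p>", "html.parser")

tag = soup.b
original_name = tag.name      # name before the change

# Assigning to .name renames the tag in the parsed tree.
tag.name = "strong"
renamed_markup = str(soup)

print(original_name)
print(renamed_markup)
```

The change shows up in any markup generated from the tree afterwards, so this is a way to rewrite a document, not just to inspect it.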

attrs

Returns the tag's attributes as a dictionary.

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Get all attributes through the Tag object's attrs attribute
print(soup.p.attrs)
# Get a single attribute
print(soup.p.attrs['class'])
print(soup.p.get('class'))

result:
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
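The output above shows class as a list while name is a plain string. The sketch below illustrates that difference and the two ways of handling a missing attribute. It assumes the built-in "html.parser"; the snippet and the extra id and style attribute names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: class has two values, id is added for illustration.
soup = BeautifulSoup(
    '<p class="title big" name="dromouse" id="intro">text</p>', "html.parser"
)
p = soup.p

classes = p["class"]            # multi-valued attribute: a list of strings
name_attr = p["name"]           # ordinary attribute: a single string
missing = p.get("style")        # .get() returns None for a missing attribute
fallback = p.get("style", "")   # or a default of your choice

# Square-bracket access raises KeyError when the attribute is missing.
try:
    p["style"]
    raised = False
except KeyError:
    raised = True

has_id = p.has_attr("id")       # membership test without fetching the value

print(classes, name_attr, missing, repr(fallback), raised, has_id)
```

Using get is usually the safer choice when scraping pages whose markup is not fully under your control.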

Because attrs returns a dictionary, attributes can also be modified and deleted.

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Modify an attribute of the Tag object
soup.p['class'] = "newClassname"
print(soup.p)
# Delete an attribute of the Tag object
del soup.p['class']
print(soup.p)

result:
<p class="newClassname" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
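Modification is not limited to existing attributes. As a small sketch (assuming the built-in "html.parser", a hypothetical snippet, and a made-up data-note attribute), a list can be assigned to class and a brand-new attribute can be added in the same dictionary-like way:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet.
soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

# A list assigned to a multi-valued attribute is joined with spaces on output.
p["class"] = ["title", "highlight"]
# Assigning to a key that does not exist yet adds a new attribute.
p["data-note"] = "edited"
# Deleting works the same way as for a dictionary.
del p["name"]

markup = str(p)
print(markup)
print(p.attrs)
```

All of these changes affect only the parsed tree in memory; the original HTML string is left untouched.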

3.2 NavigableString

Purpose: gets the text inside a tag
Meaning of the name: a string that can be navigated (traversed)
Usage: soup.p.string
Object type: <class 'bs4.element.NavigableString'>

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Get the text inside the tag
print(soup.p.string)
print(type(soup.p.string))

result:
The Dormouse's story
<class 'bs4.element.NavigableString'>
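One detail worth knowing is that .string only works when a tag has a single child (or a single chain of nested children, as in the first paragraph above). The sketch below shows what happens otherwise. It assumes the built-in "html.parser" and a shortened, hypothetical snippet; get_text is a real Tag method.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet: the first <p> has one nested child, the second has three.
snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a>Lacie</a> lived.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string          # resolved through the single <b> child
multiple = story_p.string        # None: the tag has several children
full_text = story_p.get_text()   # concatenates all text inside the tag
plain = str(single)              # convert to an ordinary Python string

print(repr(single), multiple)
print(full_text)
print(type(plain))
```

Converting to str is useful when the text should be kept without a reference back to the whole parse tree.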

3.3 BeautifulSoup

The document object, which represents the content of the entire document.
In most cases it can be treated as a Tag object.

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

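# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

print(soup.name)
print(type(soup.name))
# Use attrs (with an s); soup.attr would search for an <attr> tag and return None
print(soup.attrs)

result:
[document]
<class 'str'>
{}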
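The claim that the document object behaves like a Tag can be verified with isinstance. The following sketch assumes the built-in "html.parser" and a shortened, hypothetical document; it is an illustration, not the tutorial's own code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened document.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)

is_tag = isinstance(soup, Tag)     # BeautifulSoup is a subclass of Tag
doc_name = soup.name               # special name of the document object
doc_attrs = soup.attrs             # the document itself has no attributes
title_text = soup.title.string     # tag-style navigation works from the root
top_level = [child.name for child in soup.children]

print(is_tag, doc_name, doc_attrs)
print(title_text, top_level)
```

Because of this inheritance, the navigation and search methods available on a Tag can be called on the soup object as well.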

3.4 Comment

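A Comment object is a special type of NavigableString object.
If the content inside a tag is a comment, for example <!-- Elsie -->, Beautiful Soup represents it as a Comment object rather than a plain NavigableString, and the comment markers are dropped when it is printed.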

# Uses the soup object created in the earlier examples
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
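Since printing a Comment looks just like printing ordinary text, a scraper can pick up comment content by accident. A minimal sketch of one way to guard against that is shown below. It assumes the built-in "html.parser" and a hypothetical two-link snippet, and visible_string is a helper name invented for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: the first link holds a comment, the second real text.
snippet = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")


def visible_string(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


# Text of each link, with comment content filtered out.
results = {a["id"]: visible_string(a) for a in soup.find_all("a")}

# Collect every comment in the document by testing the type of each string.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(results)
print(comments)
```

The isinstance check is the important part: Comment is a subclass of NavigableString, so testing against NavigableString alone would not tell the two apart.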