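beautifulsoup4 Tutorial (2): The Four Main Objects in bs4

beautifulsoup4 Tutorial (1): Basics and a First Crawler

beautifulsoup4 Tutorial (2): The Four Main Objects in bs4

beautifulsoup4 Tutorial (3): Traversing and Searching the Document Tree

beautifulsoup4 Tutorial (4): CSS Selectors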


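3. The Four Kinds of Objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:
Tag
NavigableString
BeautifulSoup
Comment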
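
As a quick orientation before the individual sections, the sketch below parses a tiny snippet and records the class name of one object of each kind. The snippet and the variable names are made up for illustration, and the built-in "html.parser" is an assumption chosen only because it needs no extra install; the tutorial's own examples use "lxml".

```python
from bs4 import BeautifulSoup

# Hypothetical two-tag snippet, not the document used in the rest of this tutorial.
snippet = "<p>plain text</p><b><!-- a comment --></b>"

# "html.parser" ships with Python; the tutorial itself uses features="lxml".
soup = BeautifulSoup(snippet, "html.parser")

# Collect the class name of one object of each kind.
kinds = {
    "document": type(soup).__name__,
    "tag": type(soup.p).__name__,
    "text": type(soup.p.string).__name__,
    "comment": type(soup.b.string).__name__,
}
print(kinds)
```

Each of the four class names in the dictionary corresponds to one of the sections that follow.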

3.1 Tag
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Print the first tag of each kind
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)

Result:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
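  1. Writing soup followed by a tag name, such as soup.title, is an easy way to get that tag together with its contents.
  2. This lookup returns only the first matching tag in the document.
  3. The objects returned are of type <class 'bs4.element.Tag'>.
  4. A Tag object has two important attributes:
  • name

The name of the tag. For the BeautifulSoup object itself, name is the special value [document].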

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

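# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Print the name of the document object and of the head tag
print(soup.name)
print(soup.head.name)

Result: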
[document]
head
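
The name attribute and the "first match" rule can be checked together in a small sketch. The one-line document here is made up for illustration, and "html.parser" is assumed only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document used only for this sketch.
soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>",
    "html.parser",
)

# name holds the tag's name as a string.
title_name = soup.title.name
print(title_name)

# Attribute-style access returns the first matching tag,
# the same object that find() returns for that tag name.
first_p = soup.p
same_object = first_p is soup.find("p")
print(first_p, same_object)

# Asking for a tag that does not occur in the document gives None.
missing = soup.table
print(missing)
```

Because a missing tag yields None rather than raising an error, chained access such as soup.table.name fails with an AttributeError, so it is worth checking for None first.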
  • attrs

Returns the tag's attributes as a dictionary.

#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

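# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Use the Tag object's attrs attribute to get all attributes as a dictionary
print(soup.p.attrs)
# Get a single attribute
print(soup.p.attrs['class'])
print(soup.p.get('class'))

Result:
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']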
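
The two ways of reading a single attribute behave differently when the attribute is missing. The following is a minimal sketch of that difference; the one-tag document is hypothetical and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document; "html.parser" is used to avoid the lxml dependency.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

# class can hold several values, so Beautiful Soup returns it as a list.
classes = p["class"]
# Most other attributes come back as plain strings.
name_value = p["name"]

# get() returns None for a missing attribute ...
missing_via_get = p.get("id")

# ... while subscript access raises KeyError, like a normal dictionary.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True

# has_attr() tests for presence without reading the value.
has_id = p.has_attr("id")

print(classes, name_value, missing_via_get, raised, has_id)
```

Using get() is the safer choice when an attribute such as id or href may be absent on some tags.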
  • Because attrs is a dictionary, attributes can also be modified and deleted.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Modify an attribute of the Tag object
soup.p['class'] = "newClassname"
print(soup.p)
# Delete an attribute of the Tag object
del soup.p['class']
print(soup.p)

Result:
<p class="newClassname" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
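
Besides changing and deleting an existing attribute, the same dictionary-style syntax can add a new one. This is a small sketch under the assumption of a made-up one-tag document and the built-in "html.parser"; the attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this sketch.
soup = BeautifulSoup('<p class="title">text</p>', "html.parser")
p = soup.p

# Add an attribute that did not exist before.
p["id"] = "intro"
# A multi-valued attribute can be assigned as a list of values.
p["class"] = ["title", "lead"]
after_edit = str(p)
print(after_edit)

# Delete an attribute; the change is reflected in attrs and in the markup.
del p["class"]
remaining = dict(p.attrs)
print(remaining)
```

The edits change the in-memory tree only; the original HTML string is left untouched.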

3.2 NavigableString

  • Purpose: holds the text inside a tag
  • Meaning of the name: a string that can be navigated (traversed)
  • Usage: soup.p.string
  • Object type: <class 'bs4.element.NavigableString'>
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

# Get the text inside the tag
print(soup.p.string)
print(type(soup.p.string))

Result:
The Dormouse's story
<class 'bs4.element.NavigableString'>
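
One detail worth knowing about .string: it only returns text when the tag has a single child. The sketch below illustrates this with a made-up snippet, again assuming "html.parser"; get_text() is shown as one way to collect all the text when .string gives None.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> contains a <b> tag followed by loose text.
soup = BeautifulSoup("<p><b>bold</b> and plain</p>", "html.parser")

# <b> has exactly one child, so .string returns its text.
single = soup.b.string
# <p> has two children (<b> and a text node), so .string is None.
mixed = soup.p.string
# get_text() joins every piece of text below the tag.
all_text = soup.p.get_text()

# NavigableString is a str subclass; str() gives a plain string
# that no longer keeps a reference to the parse tree.
plain = str(single)

print(repr(single), mixed, repr(all_text), type(plain).__name__)
```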

3.3 BeautifulSoup

  • The document object: it represents the content of the whole document.
  • It can be treated as a special Tag object.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

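# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'), features="lxml")
soup = BeautifulSoup(html, features="lxml")

print(soup.name)
print(type(soup.name))
# The document object has no attributes of its own, so attrs is empty.
# (soup.attr, without the s, would instead look for a tag named "attr" and give None.)
print(soup.attrs)

Result:
[document]
<class 'str'>
{}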
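
That the document object behaves like a Tag can be confirmed directly. This is a minimal sketch with a hypothetical one-tag document and "html.parser" assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-tag document used only for this sketch.
soup = BeautifulSoup('<p id="x">hi</p>', "html.parser")

# BeautifulSoup is a subclass of Tag, so Tag operations work on it.
is_tag = isinstance(soup, Tag)
doc_name = soup.name
doc_attrs = soup.attrs

# Searching from the document object works like searching from any tag.
found_id = soup.find("p")["id"]

print(is_tag, doc_name, doc_attrs, found_id)
```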

3.4 Comment

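  • A Comment object is a special type of NavigableString.
  • If the content inside a tag is a comment, for example <!-- Elsie -->, then .string returns a Comment object instead of a plain NavigableString, and the comment markers are stripped when it is printed.
# soup is the BeautifulSoup object created in the examples above
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

Result: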
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
  • Because a printed comment looks just like ordinary text, check whether the object is a Comment or a plain NavigableString before using it.
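
One possible way to make that check is isinstance(). The sketch below is an illustration only: the helper visible_text is a hypothetical name, the two-link snippet is a cut-down version of the tutorial's document, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Cut-down version of the tutorial's links: one holds a comment, one holds text.
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Return the tag's text, or None if it is missing or is a comment.

    Hypothetical helper written for this sketch, not part of bs4.
    """
    s = tag.string
    # Comment is a subclass of NavigableString, so test for Comment explicitly.
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_text(a) for a in soup.find_all("a")]
print(results)
```

Testing for Comment rather than for NavigableString matters here, since every Comment is also a NavigableString and the second test would not tell them apart.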