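beautifulsoup4 tutorial (1): basics and a first crawler
beautifulsoup4 tutorial (2): the four main objects in bs4
beautifulsoup4 tutorial (3): traversing and searching the document tree
beautifulsoup4 tutorial (4): CSS selectors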
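3. The four kinds of objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: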
Tag
NavigableString
BeautifulSoup
Comment
3.1 Tag
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# Print the first tag of each kind
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)
result:
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
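- Writing soup followed by a tag name (soup.title, soup.a and so on) is a quick way to get that tag together with its contents.
- It returns only the first tag in the document that matches.
- The objects returned have the type
<class 'bs4.element.Tag'>
- A Tag object has two important attributes:
- name
The tag's type name, for example head for a <head> tag.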
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# Print the name of the document object and of the <head> tag
print(soup.name)
print(soup.head.name)
result:
[document]
head
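The first-match behaviour and the name attribute described above can be sketched as follows. This is a minimal illustration, not part of the original tutorial: it uses a shortened HTML string and the standard-library "html.parser" backend (an assumption made here so the snippet runs without lxml installed).

```python
from bs4 import BeautifulSoup

# Shortened sample document (hypothetical, for illustration only).
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, features="html.parser")

# Attribute-style access returns only the first matching tag.
first = soup.a
print(first["id"])   # link1
print(first.name)    # a

# soup.a is shorthand for soup.find("a").
print(soup.find("a") == first)   # True

# find_all() is needed to reach every matching tag.
all_ids = [a["id"] for a in soup.find_all("a")]
print(all_ids)       # ['link1', 'link2']
```

To reach the second and later tags, use find_all() rather than attribute access.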
- attrs
The tag's attributes, returned as a dictionary.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# Read all attributes through the Tag's attrs dictionary
print(soup.p.attrs)
# Read a single attribute
print(soup.p.attrs['class'])
print(soup.p.get('class'))
result:
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
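The two ways of reading a single attribute differ when the attribute is missing. The sketch below illustrates this on a one-line sample document (hypothetical, not from the original) and again assumes the "html.parser" backend so that it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>',
                     features="html.parser")
p = soup.p

# class is a multi-valued attribute, so it comes back as a list.
classes = p["class"]
print(classes)            # ['title', 'big']

# Other attributes come back as plain strings.
name = p["name"]
print(name)               # dromouse

# get() returns None (or a supplied default) for a missing attribute.
missing = p.get("id")
fallback = p.get("id", "no-id")
print(missing, fallback)  # None no-id

# Square-bracket access raises KeyError instead.
try:
    p["id"]
    raised = False
except KeyError:
    raised = True
print(raised)             # True
```

get() is the safer choice when an attribute may not be present on every tag.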
- Because attrs is an ordinary dictionary, a tag's attributes can also be modified and deleted.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# Modify an attribute of the Tag object
soup.p['class']="newClassname"
print(soup.p)
# Delete an attribute of the Tag object
del soup.p['class']
print(soup.p)
result:
<p class="newClassname" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
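As a further sketch of the same idea, attributes can also be added, and a multi-valued attribute such as class can be assigned a list. The sample markup and the "html.parser" backend are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">text</p>',
                     features="html.parser")
p = soup.p

# Assign a list to set several class values at once.
p["class"] = ["intro", "lead"]

# Assigning to a new key adds an attribute.
p["id"] = "first"

# del removes an attribute, exactly as with a dictionary.
del p["name"]

attrs_after = dict(p.attrs)
has_name = p.has_attr("name")
print(attrs_after)   # {'class': ['intro', 'lead'], 'id': 'first'}
print(has_name)      # False
```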
3.2 NavigableString
- Purpose: holds the text inside a tag
- Literal meaning: a string that can be navigated
- Usage:
soup.p.string
- Object type:
<class 'bs4.element.NavigableString'>
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# Get the text inside the tag
print(soup.p.string)
print(type(soup.p.string))
result:
The Dormouse's story
<class 'bs4.element.NavigableString'>
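One caveat worth illustrating: .string only yields text when a tag has a single child. The sketch below uses a shortened two-paragraph sample (hypothetical) and the "html.parser" backend (an assumption, chosen to avoid the lxml dependency), and shows get_text() as one possible alternative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ("<p><b>The Dormouse's story</b></p>"
        '<p class="story">Once <a>Lacie</a> and more</p>')
soup = BeautifulSoup(html, features="html.parser")
title_p, story_p = soup.find_all("p")

# A tag with exactly one child passes .string through to that child.
title_text = title_p.string
print(title_text)                               # The Dormouse's story
print(isinstance(title_text, NavigableString))  # True

# With several children, .string is ambiguous and returns None.
story_string = story_p.string
print(story_string)                             # None

# get_text() joins all the text found under the tag.
story_text = story_p.get_text()
print(story_text)                               # Once Lacie and more
```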
3.3 BeautifulSoup
- The document object: it represents the content of the whole document.
- For most purposes it can be treated as a Tag object.
#-*-coding:utf-8-*-
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object.
# It can also be created from a local HTML file, for example:
#soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html,features="lxml")
# The document object has a name and an (empty) attrs dictionary, like a Tag
print(soup.name)
print(type(soup.name))
print(soup.attrs)
result:
[document]
<class 'str'>
{}
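The claim that the document object behaves like a Tag can be checked directly. This is a small sketch on a one-line sample document (hypothetical), assuming the "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p class='title'>Hello</p>", features="html.parser")

# BeautifulSoup is a subclass of Tag.
is_tag = isinstance(soup, Tag)
print(is_tag)        # True

# It has a special name and no attributes of its own.
doc_name = soup.name
doc_attrs = soup.attrs
print(doc_name)      # [document]
print(doc_attrs)     # {}

# Tag search methods work on it as well.
first_p = soup.find("p")
print(first_p.string)  # Hello
```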
3.4 Comment
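- A Comment object is a special type of NavigableString.
- If the content inside a tag is a comment, for example <!-- Elsie -->, then .string returns a Comment object instead of a plain NavigableString, and printing it omits the comment markers.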
# soup is built from the same html string as in the examples above
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>
- Because the printed text no longer shows the comment markers, check whether an object is a Comment or a plain NavigableString before using it.
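The type check described above could look like the following sketch. The shortened HTML and the "html.parser" backend are assumptions for illustration; isinstance() is used because Comment is a subclass of NavigableString.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, features="html.parser")

visible = []    # ordinary text
comments = []   # text that came from HTML comments

for a in soup.find_all("a"):
    s = a.string
    # Test for Comment first: every Comment is also a NavigableString.
    if isinstance(s, Comment):
        comments.append(str(s))
    else:
        visible.append(str(s))

print(visible)    # ['Lacie']
print(comments)   # [' Elsie ']
```

Checking for Comment before NavigableString matters, since the reverse order would classify comments as ordinary text.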