6. Using BeautifulSoup
- Beautiful Soup is a library for extracting data from HTML and XML documents.
6.1 Basic Usage
Method | Purpose |
---|---|
BeautifulSoup(html_doc, 'lxml') | build the bs object |
bs.prettify() | pretty-print the document |
bs.title | get a tag by tag name |
bs.title.name | get the tag's name |
bs.title.string | get the text inside the tag |
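The calls in the table can be tried end to end. The sketch below (not from the original notes) uses Python's built-in 'html.parser' so it runs without installing lxml:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# BeautifulSoup(doc, parser) builds the bs object
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())    # the document, pretty-printed with indentation
print(soup.title)         # <title>Demo</title>  (the whole tag)
print(soup.title.name)    # title  (the tag's name)
print(soup.title.string)  # Demo   (the text inside the tag)
```

The same calls work unchanged with the 'lxml' parser used in the rest of this section, once lxml is installed.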
6.2 bs4 Object Types
Object | Meaning |
---|---|
Tag | a tag |
NavigableString | a navigable string |
BeautifulSoup | the soup object itself |
Comment | a comment |
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# print(type(soup)) # <class 'bs4.BeautifulSoup'>
#
# print(type(soup.title)) # <class 'bs4.element.Tag'>
# print(type(soup.a)) # <class 'bs4.element.Tag'>
# print(type(soup.p)) # <class 'bs4.element.Tag'>
#
# print(soup.p.string) # The Dormouse's story
# print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>
p_tag = soup.p
print(p_tag)
print(p_tag.name)
print(p_tag.string)

# A comment inside a tag comes back as a Comment object
html_comment = '<a><!-- comment text here --></a>'
soup = BeautifulSoup(html_comment, 'lxml')
print(soup.a.string)
print(type(soup.a.string))  # <class 'bs4.element.Comment'>
6.3 Traversing the Tree: Child Nodes
In bs4 there are three kinds of operations: traversing, searching, and modifying.
- contents / children / descendants
  - contents returns a list
  - children returns an iterator that can be looped over
  - descendants returns a generator that walks every descendant
- .string / .strings / .stripped_strings
  - string gets the text inside a tag
  - strings returns a generator used to get the text of multiple tags
  - stripped_strings is like strings, but strips the extra whitespace
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# Tag
# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)
# all_p = soup.find_all('p')
# print(all_p)

# Use [] to read a tag's attribute
title_tag = soup.p
print(title_tag['class'])  # ['title']

# contents returns a list
# children returns an iterator that can be looped over
#   (iteration = looping; in Python, for ... in ... visits the items
#   of an iterable one by one)
# descendants returns a generator that walks every descendant

# contents returns a list
# links = soup.contents
# print(type(links))  # <class 'list'>
# print(links)

# children returns an iterator
html = '''
<div>
<a href='#'>百度</a>
<a href='#'>阿里</a>
<a href='#'>腾讯</a>
</div>
'''
# We want the data under the div tag
soup2 = BeautifulSoup(html, 'lxml')
# links = soup2.div.children
# print(type(links))  # <class 'list_iterator'>
# for link in links:
#     print(link)

# descendants returns a generator that walks every descendant
# print(len(soup.contents))
# print(len(soup.descendants))  # TypeError: object of type 'generator' has no len()
# for x in soup.descendants:
#     print('----------------')
#     print(x)

# string gets the text inside a tag
# strings returns a generator used to get the text of multiple tags
# stripped_strings is like strings, but strips the extra whitespace
# title_tag = soup.title
# print(title_tag)
# print(title_tag.string)
# head_tag = soup.head
# print(head_tag.string)
# print(soup.html.string)

# strings = soup.strings
# print(strings)  # <generator object _all_strings at 0x000001D9053745C8>
# for s in strings:
#     print(s)
strings = soup.stripped_strings
for s in strings:
    print(s)
6.4 Traversing the Tree: Parent Nodes
- parent gets the direct parent node
- parents gets all ancestor nodes
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# parent gets the direct parent node
# title_tag = soup.title
# print(title_tag)
# print(title_tag.parent)
# print(soup.html.parent)
# parents gets all ancestor nodes
a_tag = soup.a
# print(a_tag)
# print(a_tag.parents) # <generator object parents at 0x0000025F937E9678>
for x in a_tag.parents:
    print(x)
    print('----------------')
6.5 Traversing the Tree: Sibling Nodes
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
- next_siblings: all following sibling nodes
- previous_siblings: all preceding sibling nodes
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html = '<a><b>bbb</b><c>ccc</c><a>'
soup = BeautifulSoup(html_doc,'lxml')
#
# # print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# # print(c_tag.next_sibling)
# print(c_tag.previous_sibling)
a_tag = soup.a
# print(a_tag)
for x in a_tag.next_siblings:
    print(x)
6.6 Searching the Tree
Filters that can be passed to the search methods:
- string filter
- regular-expression filter: compile a pattern with re.compile() and pass it to find() or find_all() to search with a regex
- list filter
- True filter
- function filter
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# String filter
# a_tag2 = soup.a
# a_tags = soup.find_all('a')
# print(a_tags)

# Regular-expression filter: matches any tag whose name contains 't'
# (bs4 runs pattern.search against each tag name)
# print(soup.find_all(re.compile('t')))

# List filter: find both p tags and a tags
# print(soup.find_all(['p', 'a']))
# print(soup.find_all(['title', 'b']))

# True filter: matches every tag
# print(soup.find_all(True))

# Function filter: keep only tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')

print(soup.find_all(fn))
6.7 Review
Method / Attribute | Purpose |
---|---|
soup.prettify() | pretty-print the source |
soup.title | the whole title tag |
soup.title.name | the tag's name |
soup.title.string | the tag's text |
soup.contents | returns a list |
soup.div.children | returns an iterator |
soup.descendants | returns a generator walking every descendant |
soup.string | gets the text of a single tag |
soup.strings | gets the text of all tags |
soup.stripped_strings | gets the text of all tags, with extra whitespace removed |
soup.a.parent | gets the a tag's parent node |
soup.a.previous_sibling | the previous sibling node |
soup.a.next_sibling | the next sibling node |
soup.a.next_siblings | all following sibling nodes |
soup.a.previous_siblings | all preceding sibling nodes |
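The traversal attributes in the table above can be checked on a two-paragraph snippet; this is a minimal sketch (not from the original notes) using the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
print([t.name for t in div.children])  # ['p', 'p'] - direct children only
print(len(list(div.descendants)))      # 4: two <p> tags plus their two text nodes
print(list(soup.stripped_strings))     # ['one', 'two']

first_p = div.p
print(first_p.parent.name)             # div
print(first_p.next_sibling)            # <p>two</p>
```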
6.8 find
Function | Purpose |
---|---|
find('tag', class_='value') | find a single tag |
find_all() | find all matching tags |
find_parents() | search all ancestors |
find_parent() | search a single parent |
find_next_siblings() | search all following siblings |
find_next_sibling() | search a single following sibling |
find_previous_siblings() | search all preceding siblings |
find_previous_sibling() | search a single preceding sibling |
find_all_next() | search all elements after this one |
find_next() | search a single element after this one |

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name: the tag name
attrs: the tag's attributes
recursive: whether to search recursively
text: the text content
limit: cap on the number of results returned
**kwargs: keyword arguments, used as attribute filters
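Each parameter above can be exercised on a small, hypothetical snippet; a minimal sketch with the built-in 'html.parser' (recent bs4 releases prefer `string=` over `text=`, but both are accepted):

```python
from bs4 import BeautifulSoup

html = """
<div id="top">
  <a class="sister" href="#1">A</a>
  <span><a class="sister" href="#2">B</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("a")))                        # 2 - name filter
print(len(soup.find_all(attrs={"class": "sister"})))  # 2 - attrs dict
print(len(soup.div.find_all("a", recursive=False)))   # 1 - div's direct children only
print(len(soup.find_all("a", limit=1)))               # 1 - cap on results
print(soup.find_all(text="A"))                        # ['A'] - text filter
print(soup.find_all(href="#2"))                       # keyword-argument filter
```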
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# find_all(self, name=None, attrs={}, recursive=True, text=None,
#          limit=None, **kwargs)
# name: the tag name
# attrs: the tag's attributes
# recursive: whether to search recursively
# text: the text content
# limit: cap on the number of results returned
# **kwargs: keyword arguments, used as attribute filters
# a_tags = soup.find_all('a')
# p_tags = soup.find_all('p','title')
# print(soup.find_all(id = 'link1'))
# print(soup.find_all('a',limit=2))
# print(soup.a)
# print(soup.find('a'))
# print(soup.find_all('a',recursive=True))
# print(soup.find_all('a',limit=1)[0])
# print(soup.find('a'))
# find_parents() searches all ancestors
# find_parent() searches a single parent
# find_next_siblings() searches all following siblings
# find_next_sibling() searches a single following sibling
title_tag = soup.title
# print(title_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>
s = soup.find(text = 'Elsie')
# print(s.find_previous('p'))
# print(s.find_parents('p'))
# a_tag = soup.a
#
# # print(a_tag)
# #
# # print(a_tag.find_next_sibling('a'))
#
# print(a_tag.find_next_siblings('a'))
# find_previous_siblings() searches all preceding siblings
# find_previous_sibling() searches a single preceding sibling
# find_all_next() searches all elements after this one
# find_next() searches a single element after this one
a_tag = soup.find(id='link3')
# print(a_tag)
# print(a_tag.find_previous_sibling())
# print(a_tag.find_previous_siblings())
p_tag = soup.p
# print(p_tag.find_all_next())
print(p_tag.find_next('a'))
6.9 Modifying the Document Tree
- Modify a tag's name and attributes.
- Modify string: assigning to the string property replaces the original content.
- append(): add content to a tag, like Python's list .append() method.
- decompose(): delete a tag; useful for removing unnecessary parts of a document.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. Modify a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w'            # change the name
# tag_p['class'] = 'content'  # change the attribute
# print(tag_p)

# 2. Modify string
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = 'new text replaces the old'
# print(tag_p.string)

# 3. tag.append() adds content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)

# 4. decompose() deletes a tag
result = soup.find(class_='title')
result.decompose()
print(soup)
6.10 Scraping Data from the China Weather Site
- Key points
  - find('div', class_='conMidtab'): get an element by tag name and attribute
  - table.find_all('tr')[2:]: skip the first two tr (header) rows
  - enumerate returns two values: the index, and the element at that index
- BeautifulSoup supports several parsers; this example uses html5lib (html.parser and lxml are alternatives)
import requests
from bs4 import BeautifulSoup
# Parse one page of the weather table
def parse_page(url):
    response = requests.get(url)
    # Decode explicitly to avoid mojibake
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')  # pip install html5lib
    # Parse the page:
    # 1. the div with class="conMidtab"
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    # 2. the tables inside it
    tables = conMidtab.find_all('table')
    # print(tables)
    for table in tables:
        # print(table)
        # 3. the tr rows, skipping the first two header rows
        trs = table.find_all('tr')[2:]
        # enumerate returns two values: the index, and the element at that index
        for index, tr in enumerate(trs):
            # print(tr)
            tds = tr.find_all('td')
            # The first data row of each table starts with the province name,
            # so the city sits in the second td there
            city_td = tds[0]      # city
            if index == 0:
                city_td = tds[1]  # provincial capital
            # stripped_strings collects the text of all descendant nodes
            city = list(city_td.stripped_strings)[0]
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print('City:', city, 'Temperature:', temp)
            # break  # print Beijing first
def main():
    # url = 'http://www.weather.com.cn/textFC/hb.shtml'   # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'   # Northeast China
    # url = 'http://www.weather.com.cn/textFC/gat.shtml'  # Hong Kong, Macao and Taiwan
    urls = ['http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml']
    for url in urls:
        parse_page(url)

if __name__ == '__main__':
    main()