The beautifulsoup4 Module for Python Web Scraping

Latest recommended article published on 2024-05-22 10:59:39

琴酒网络

Category: Python Web Scraping. Tags: python, web scraping, beautifulsoup4

Article link: https://blog.csdn.net/pcn01/article/details/105933744

1: Introduction to the beautifulsoup4 module
2: Installing the module
3: Node selectors
4: Method selectors
5: CSS selectors
6: Modifying tags
7: A small case study

1: Introduction to the beautifulsoup4 module

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages, and it offers a capable API and a choice of parsing strategies.

Three features of Beautiful Soup:

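Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one; in that case you only need to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade speed for flexibility.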
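As a small illustration of the encoding behaviour described above, the sketch below feeds GBK-encoded bytes to Beautiful Soup and names the source encoding explicitly with from_encoding. The sample markup is invented for the example, and html.parser is used only so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as GBK, with no charset declaration.
raw = "<p>中文</p>".encode("gbk")

# Name the original encoding instead of letting Beautiful Soup guess it.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # decoded to a Unicode string
utf8_bytes = soup.encode("utf-8")   # document serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes.decode("utf-8"))
```

Once parsed, the text is an ordinary Python string, so the original byte encoding no longer matters to the rest of the program.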

Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports third-party parsers such as lxml.

The parsers Beautiful Soup supports, with their advantages and disadvantages:

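Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow (does not depend on external C extensions)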
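Because the parser is chosen by name, a script can fall back to the built-in parser when a faster one is not installed. The following is a minimal sketch of that idea; make_soup is a hypothetical helper written for this example and is not part of bs4, and the sample markup is invented.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The requested parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# The <p> tag is deliberately left unclosed; both parsers repair it.
soup, used = make_soup("<title>Demo</title><p>Hello")
print(used, soup.title.string, soup.p.string)
```

Since html.parser ships with Python, the loop always finds a usable parser when it is listed last.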

2: Installing the module

pip install beautifulsoup4
pip install lxml

Documentation for the bs4 module:

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

3: Node selectors

Accessing a node by its tag name as an attribute selects that element. Selections can be nested, and each one returns a bs4.element.Tag object.

from bs4 import BeautifulSoup
# All the code examples below are tested against this document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup=BeautifulSoup(html_doc,'lxml')
print(soup.prettify())      # print the HTML in a nicely indented form
print(soup.head)            # get the head tag
print(soup.p.b)             # get the b node under the p node
print(soup.a.string)        # get the text of the a tag; only the first a tag is used

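The attrs attribute returns a node's attributes, and you can also index the node like a dictionary. The result may be a list or a string depending on the attribute: multi-valued attributes such as class return a list, while others return a string.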

soup=BeautifulSoup(html_doc,'lxml')
print(soup.p.attrs)						# {'class': ['title']}
print(soup.p.attrs['class'])			# ['title']
print(soup.p['class'])					# ['title']

soup=BeautifulSoup(html_doc,'lxml')
a_list = soup.find_all('a')
for a in a_list:
    # Either of the following two forms returns the href attribute of the a tag
    # href = a.attrs['href']
    href = a['href']
    print(href)
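The difference between list-valued and string-valued attributes is easiest to see on a tag that carries both kinds. The sketch below uses invented markup and also shows .get(), which behaves like dict.get() and avoids a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-valued class attribute.
html = '<p class="title main" id="intro"><a href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

classes = soup.p["class"]       # multi-valued attribute: a list of strings
p_id = soup.p["id"]             # single-valued attribute: a plain string
href = soup.a.get("href")       # .get() works like dict.get()
missing = soup.a.get("title")   # absent attribute: None instead of KeyError

print(classes, p_id, href, missing)
```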

The string attribute returns the text contained in a node:

print(soup.p.string)                    # text of the first p node
infos = list(a.stripped_strings)        # non-whitespace text under the a tag
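One detail worth seeing in code: .string only has a value when a tag has a single child, so for mixed content it is None and stripped_strings or get_text() is the better tool. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b> !</p>"
soup = BeautifulSoup(html, "html.parser")

only_child = soup.b.string               # <b> has exactly one child string
ambiguous = soup.p.string                # <p> has several children, so this is None
pieces = list(soup.p.stripped_strings)   # every text fragment, whitespace stripped
full = soup.p.get_text()                 # all text joined together

print(only_child, ambiguous, pieces, repr(full))
```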

The contents attribute returns a node's direct children as a list:

soup.body.contents   # direct children only, not deeper descendants

The children attribute also returns the direct children, but as an iterator rather than a list:

soup.body.children

The descendants attribute returns all descendant nodes, as a generator:

soup.body.descendants
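To make the difference between the three attributes concrete, the sketch below counts what each one yields for a small invented fragment. Note that text strings count as nodes too.

```python
from bs4 import BeautifulSoup

html = "<div><p>a<b>b</b></p><p>c</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.contents                          # list holding the two <p> tags
child_names = [c.name for c in div.children]   # same nodes, produced lazily
all_nodes = list(div.descendants)              # tags and strings at every depth

print(len(direct), child_names, len(all_nodes))
```

Here descendants yields six nodes: the first p, the string 'a', the b tag, the string 'b', the second p, and the string 'c'.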

The parent attribute returns the parent node; parents returns all ancestor nodes, as a generator:

soup.b.parent
soup.b.parents
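A short sketch of walking upward, using invented markup. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.b.parent.name                     # the immediate parent
ancestor_names = [tag.name for tag in soup.b.parents]  # every ancestor, innermost first

print(parent_name, ancestor_names)
```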

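The next_sibling attribute returns the next sibling node and previous_sibling returns the previous one. Note that a line break is also a node, so the sibling you get back is often a string or whitespace rather than a tag: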

soup.a.next_sibling
soup.a.previous_sibling

next_siblings and previous_siblings return all following and all preceding sibling nodes respectively, as generators:

soup.a.next_siblings
soup.a.previous_siblings
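The whitespace caveat is easy to trip over, so here is a minimal sketch with invented markup. It shows the raw sibling (a text node), then two ways to reach the next tag instead: find_next_sibling(), and filtering next_siblings by type.

```python
from bs4 import BeautifulSoup, Tag

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling             # the text ',\n', not the second <a>
tag_next = first.find_next_sibling("a")   # skips over text nodes
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), tag_next["id"], later_ids)
```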

The next_element and previous_element attributes return the object that was parsed immediately after or immediately before the current one:

soup.a.next_element
soup.a.previous_element

The next_elements and previous_elements iterators walk forward or backward through the document in parse order:

soup.a.next_elements
soup.a.previous_elements
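next_element is easy to confuse with next_sibling. The sketch below, on invented markup, shows that the element parsed right after a tag is usually its own first child, while the sibling is the next node at the same level.

```python
from bs4 import BeautifulSoup, Tag

html = "<p><a>A</a><b>B</b></p>"
soup = BeautifulSoup(html, "html.parser")

sibling = soup.a.next_sibling   # same level: the <b> tag
element = soup.a.next_element   # parse order: the string 'A' inside <a>

# Walk everything parsed after <p>, showing tag names and raw strings.
order = [e.name if isinstance(e, Tag) else str(e) for e in soup.p.next_elements]

print(sibling.name, repr(element), order)
```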

4: Method selectors

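Everything so far selects nodes through attributes. That approach is very fast, but it is not flexible enough for more complex selections. Fortunately, Beautiful Soup also provides search methods such as find_all() and find().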

find_all(name, attrs, recursive, text, **kwargs): finds all elements that match the conditions. Its parameters:

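name: finds all tags with the given name; it can also be a filter function, a regular expression, a list, or True
attrs: the attributes to match, passed as a dictionary, for example attrs={'id': '123'} for the common id attribute. Because class is a reserved word in Python, append an underscore when passing it as a keyword argument: class_='element'. The result is a list of Tag objects
text: matches the text of a node; it accepts a string or a compiled regular expression (since Beautiful Soup 4.4.0 this argument is also available under the name string)
recursive: to search only direct children, set recursive=False
limit: limits the number of results returned, much like the LIMIT keyword in SQL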

import re
from bs4 import BeautifulSoup

# All the examples below are tested against this text
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""
soup=BeautifulSoup(html_doc,'lxml')
print(type(soup))
print(soup.find_all('span'))  # search by tag name
print(soup.find_all('a', id='link1'))  # tag name plus an attribute filter
print(soup.find_all('a', id='link1', class_='sister'))  # note the underscore after class
print(soup.find_all('a', attrs={'class': 'sister', 'id': 'link3'}))  # multiple attributes
print(soup.find_all('p', class_='title'))  # class is special; here it is passed through **kwargs
print(soup.find_all(text=re.compile('Tillie')))  # filter by text
print(soup.find_all('a', limit=2))  # limit the number of results
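The calls above pass name as a plain string. The other forms mentioned in the parameter list (a list, a regular expression, True, a function) and recursive=False can be sketched as follows; the markup is invented for the example.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>T</title></head>'
        '<body><p id="x">a<b>bold</b></p><a href="#">l</a></body></html>')
soup = BeautifulSoup(html, "html.parser")

by_list = [t.name for t in soup.find_all(["a", "b"])]          # any name in the list
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # names starting with "b"
by_func = soup.find_all(lambda tag: tag.has_attr("id"))        # custom filter function
every_tag = [t.name for t in soup.find_all(True)]              # True matches every tag

deep = soup.html.find_all("title")                     # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children of <html> only

print(by_list, by_regex, every_tag, len(deep), len(shallow))
```

Results come back in document order, which is why the b tag precedes the a tag in by_list.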

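find(name, attrs, recursive, text, **kwargs): returns a single element, the first match, still as a Tag object. It takes the same parameters as find_all().
There are many other search methods. They are used exactly like find_all() and take the same parameters; only the part of the document they search differs.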

find_parents(name, attrs, recursive, text, **kwargs)
find_parent(name, attrs, recursive, text, **kwargs)
# The former returns all matching ancestor nodes; the latter returns the closest matching one
find_next_siblings(name, attrs, recursive, text, **kwargs)
find_next_sibling(name, attrs, recursive, text, **kwargs)
# Iterate over the siblings after the current tag: the former returns all matching later siblings, the latter the first one
find_previous_siblings(name, attrs, recursive, text, **kwargs)
find_previous_sibling(name, attrs, recursive, text, **kwargs)
# Iterate over the siblings before the current tag: the former returns all matching earlier siblings, the latter the first one
find_all_next(name, attrs, recursive, text, **kwargs)
find_next(name, attrs, recursive, text, **kwargs)
# Iterate over the tags and strings after the current tag: the former returns all matching nodes, the latter the first one
find_all_previous()
find_previous()
# Iterate over the tags and strings before the current tag: the former returns all matching nodes before it, the latter the first one
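A few of these directional searches in action, as a minimal sketch on invented markup. Each call starts from the second link and looks in a different direction.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><p class="story">'
        '<a id="link1">Elsie</a> <a id="link2">Lacie</a> <a id="link3">Tillie</a>'
        '</p><span>end</span></div>')
soup = BeautifulSoup(html, "html.parser")

link2 = soup.find("a", id="link2")

container = link2.find_parent("div")        # upward: nearest enclosing <div>
after = link2.find_next_sibling("a")        # sideways: next <a> at the same level
before = link2.find_previous_sibling("a")   # sideways: previous <a> at the same level
following_tags = [t.name for t in link2.find_all_next(True)]  # forward in parse order

print(container["id"], after["id"], before["id"], following_tags)
```

find_all_next() is not limited to siblings, which is why the span outside the paragraph appears in its result.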

5: CSS selectors

Beautiful Soup also supports CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax:

In [10]: soup.select('title')
Out[10]: [<title>The Dormouse's story</title>]

Find tags level by level through nested tag names:

In [12]: soup.select('body a')
Out[12]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the direct child tags under a given tag:

In [13]: soup.select('head > title')
Out[13]: [<title>The Dormouse's story</title>]

Find sibling tags:

In [14]: soup.select('#link1 ~ .sister')
Out[14]: 
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by CSS class name:

In [15]: soup.select('.title')
Out[15]: [<p class="title"><b>The Dormouse's story</b></p>]

In [16]: soup.select('[class~=title]')
Out[16]: [<p class="title"><b>The Dormouse's story</b></p>]

Find by tag id:

In [17]: soup.select('#link1')
Out[17]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [18]: soup.select('a#link2')
Out[18]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by whether an attribute is present:

In [20]: soup.select('a[href]')
Out[20]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by attribute value:

In [22]: soup.select('a[href="http://example.com/elsie"]')
Out[22]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [23]: soup.select('a[href^="http://example.com/"]')  # match the start of the value
Out[23]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]: soup.select('a[href$="tillie"]')  # match the end of the value
Out[24]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]: soup.select('a[href*=".com/el"]')  # substring match
Out[25]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

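Selecting by tag name, method selectors, and CSS selectors reach much the same results. Selecting by tag name is the fastest of the three, while method selectors offer more convenient and more complex queries and are easier to get started with.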
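To show that the three approaches are interchangeable for simple cases, the sketch below fetches the same tag in each style. The markup is invented; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

html = '<div><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a></div>'
soup = BeautifulSoup(html, "html.parser")

by_node = soup.a                              # node selector
by_method = soup.find("a", class_="sister")   # method selector
by_css = soup.select_one("a.sister")          # CSS selector

# All three refer to the same Tag object in the parse tree.
print(by_node is by_method, by_node is by_css, by_css["href"])
```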

6: Modifying tags

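Beautiful Soup's strength is searching documents. Modification is needed less often, so this section gives only a brief introduction; see the official Beautiful Soup documentation for more.
Beautiful Soup can change the value of a tag's attributes and add or remove attributes and content. Some commonly used methods follow.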

In [26]: markup='<a href="http://www.baidu.com/">baidu</a>'
In [28]: soup=BeautifulSoup(markup,'lxml')
In [29]: soup.a.string='百度'
In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>
# If the a node contains child nodes, they are overwritten as well

Tag.append() adds content to a tag, just like the .append() method of a Python list:

In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>
In [31]: soup.a.append('一下')
In [32]: soup.a
Out[32]: <a href="http://www.baidu.com/">百度一下</a>

The new_tag() method creates a new tag:

In [33]: soup=BeautifulSoup('<b></b>','lxml')

In [34]: new_tag=soup.new_tag('a',href="http://www.python.org") # create a tag; the first argument must be the tag name

In [35]: soup.b.append(new_tag) # append it under the b node

In [36]: new_tag.string='python' # set the tag's text

In [37]: soup.b
Out[37]: <b><a href="http://www.python.org">python</a></b>

Other methods:

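insert(): inserts an element at a specified position
insert_before(): inserts content before the current tag or text node
insert_after(): inserts content after the current tag or text node
clear(): removes the contents of the current tag
extract(): removes the current tag from the document tree and returns it
prettify(): formats the Beautiful Soup document tree and returns it as a Unicode string; tag nodes can call it too
get_text(): returns the text contained in a tag, including the text of descendant tags
soup.original_encoding: attribute that records the encoding that was detected automatically
from_encoding: argument that can be passed when creating the BeautifulSoup object to specify the encoding, which reduces the time spent guessing it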
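Several of the editing methods listed above can be tried on a tiny invented fragment. This is only a sketch of how they combine; the comments trace the tree after each step.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")

new_tag = soup.new_tag("u")
new_tag.string = "under"
soup.b.insert_after(new_tag)      # <p><b>bold</b><u>under</u><i>italic</i></p>
soup.b.insert_before("start ")    # <p>start <b>bold</b><u>under</u><i>italic</i></p>

removed = soup.i.extract()        # detaches <i>italic</i> and returns it
text = soup.p.get_text()          # text of <p> and all its descendants

soup.u.clear()                    # empties <u> but keeps the tag itself

print(removed, repr(text), soup)
```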
To parse only part of a document, create a content filter with the SoupStrainer class; it accepts the same arguments as the search methods:

from bs4 import BeautifulSoup,SoupStrainer

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""
only_a_tags = SoupStrainer('a')  # filter that keeps only a tags
soup=BeautifulSoup(html_doc,'lxml',parse_only=only_a_tags)
print(soup.prettify())
# Output:
<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
</a>

7: A small case study

import requests
from bs4 import BeautifulSoup
from pyecharts.charts import Bar

ALL_DATA = []
def parse_page(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    response = requests.get(url, headers = headers)
    text = response.content.decode()
    # soup = BeautifulSoup(text, 'lxml')
    soup = BeautifulSoup(text, 'html5lib')
    conMidtab = soup.find('div', class_ = 'conMidtab')
    tables = conMidtab.find_all('table')
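    for table in tables:
        # Skip the first two rows of each table (header rows)
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            tds = tr.find_all('td')
            # In the first data row the city is in the second cell
            # (the first cell is presumably the province name)
            city_td = tds[0]
            if index == 0:
                city_td = tds[1]
            city = list(city_td.stripped_strings)[0]
            # The minimum temperature is in the second-to-last cell
            temp_td = tds[-2]
            min_temp = list(temp_td.stripped_strings)[0]
            ALL_DATA.append({'city': city, 'min_temp': int(min_temp)})
            # print({'city': city, 'min_temp': min_temp})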

def main():
    url = 'http://www.weather.com.cn/textFC/hb.shtml'
    parse_page(url)
    # print(ALL_DATA)
    ALL_DATA.sort(key = lambda data: data['min_temp'])
    # print(ALL_DATA)
    create_view(ALL_DATA[:10])

def create_view(data):
    chart = Bar()
    city_list = [i['city'] for i in data]
    tmp_list = [i['min_temp'] for i in data]
    # print(city_list)
    chart.add_xaxis(city_list)
    chart.add_yaxis('Lowest temperatures in China', tmp_list)
    chart.render()

if __name__ == '__main__':
    main()

The resulting chart looks like this:
[Figure: bar chart of the ten lowest minimum temperatures]
For the pyecharts module, see:

https://pyecharts.org/#/zh-cn/quickstart
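The parsing logic of the case study can be tried without network access or pyecharts. The sketch below runs the same row-handling steps on a hand-written table; the markup, the city names, and the parse_rows helper are all invented for illustration and only mimic the layout the script above expects.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one forecast table on the page.
html = """
<div class="conMidtab"><table>
<tr><th>header 1</th></tr>
<tr><th>header 2</th></tr>
<tr><td>Province</td><td>CityA</td><td>Sunny</td><td>-3</td><td>detail</td></tr>
<tr><td>CityB</td><td>Cloudy</td><td>5</td><td>detail</td></tr>
<tr><td>CityC</td><td>Snow</td><td>-8</td><td>detail</td></tr>
</table></div>
"""


def parse_rows(markup):
    """Hypothetical helper: return city/min_temp dicts sorted by temperature."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("div", class_="conMidtab").find("table")
    rows = []
    # Skip the two header rows, as the case study does.
    for index, tr in enumerate(table.find_all("tr")[2:]):
        tds = tr.find_all("td")
        # The first data row carries an extra leading cell.
        city_td = tds[1] if index == 0 else tds[0]
        city = next(city_td.stripped_strings)
        # The minimum temperature sits in the second-to-last cell.
        min_temp = int(next(tds[-2].stripped_strings))
        rows.append({"city": city, "min_temp": min_temp})
    return sorted(rows, key=lambda row: row["min_temp"])


coldest = parse_rows(html)
print(coldest)
```

Returning a sorted list instead of appending to a module-level ALL_DATA keeps the function easy to test; that is a design choice of this sketch, not of the original script.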
