Learning Python Web Scraping: Data Extraction (Beautiful Soup)

Latest recommended article published 2024-08-02 12:25:21

侠~~

Category: Web Scraping
Tags: python, web scraping

Article link: https://blog.csdn.net/iwantrain/article/details/130545708

Previously in This Series
Overview
Parsers
Setup
Example
Summary

Previously in This Series

Learning Python Web Scraping: requests
Learning Python Web Scraping: Data Extraction (XPath)

Overview

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

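Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python; decent speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, 'xml')	Fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, 'html5lib')	Most lenient; parses pages the way a browser does; produces valid HTML5	Slow; requires an external Python package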

As the table shows, lxml can parse both HTML and XML, and it is fast and lenient, so it is the recommended choice.
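
If lxml may not be installed everywhere the script runs, one option is to fall back to the built-in parser. The sketch below is only an illustration: `make_soup` is a hypothetical helper name, not part of Beautiful Soup, and the two parsers can build slightly different trees for malformed markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse markup with lxml when available, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        # The built-in parser needs no extra packages.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```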

Setup

To parse HTML with Beautiful Soup, install two Python libraries: Beautiful Soup and lxml. The install commands are:

# with pip
pip install lxml
pip install beautifulsoup4
# with pip3
pip3 install lxml
pip3 install beautifulsoup4
# with conda
conda install lxml
conda install beautifulsoup4
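
A quick way to confirm the installation worked is to check that both modules can be found. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name. Note that the beautifulsoup4 package is imported as `bs4`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


# The import name differs from the package name: beautifulsoup4 installs "bs4".
for module_name in ("bs4", "lxml"):
    status = "installed" if is_installed(module_name) else "missing"
    print(f"{module_name}: {status}")
```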

Example

The target site is http://www.biqugei.net/. First, fetch the content of its home page:

import requests

res = requests.get("http://www.biqugei.net/")

print(res.text)

# Only the "本周热读推荐" (this week's hot reads) part of the output is shown below

<div class="CDcac container body-content">
	<div class="FTlza panel panel-default">
		<div class="xLYBo panel-heading">
			<span class="kHSHX glyphicon glyphicon-fire" aria-hidden="true"></span> 本周热读推荐<a class="hau7I pull-right" href="/top.html">More+</a>
		</div>
		<div class="8wdDp panel-body">
			<div class="B7ImI row">
			    							<div class="T5EB3 col-xs-4 book-coverlist">
					<div class="FkNE9 row">
						<div class="iZDC6 col-sm-5">
							<a href="/page/detail1292.html" class="5tIvU thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/1/1287/1287s.jpg)"></a>
						</div>
						<div class="e9DDn col-sm-7 pl0">
							<div class="HhPUg caption">
								<h4 class="iH7NL fs-16 text-muted"><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>
								<small class="XxTAl fs-14 text-muted">石三</small>
								<p class="3QmUt fs-12 text-justify hidden-xs">万古八荒第一神挂！上溯三层世界,最巅峰律令！……三年前,天空坠落三个生灵。西岭秦王得其一,横扫六国统一西岭。南荒大周武曌得其一,纵横南荒十九教,登顶第一。孙长..</p>
							</div>
						</div>
					</div>
				</div>
											<div class="taypp col-xs-4 book-coverlist">
					<div class="xcP3a row">
						<div class="cah1s col-sm-5">
							<a href="/page/detail860.html" class="Y9Bd5 thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/0/855/855s.jpg)"></a>
						</div>
						<div class="GhUWN col-sm-7 pl0">
							<div class="yDSpp caption">
								<h4 class="fpiwN fs-16 text-muted"><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>
								<small class="EdfFX fs-14 text-muted">狼叔当道</small>
								<p class="3trdg fs-12 text-justify hidden-xs"> 入赘三年,受尽羞辱；扫墓归来,开启逆袭之路！</p>
							</div>
						</div>
					</div>
				</div>
											<div class="S89sS col-xs-4 book-coverlist">
					<div class="MXIgP row">
						<div class="lwg8D col-sm-5">
							<a href="/page/detail825.html" class="8keqN thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/0/820/820s.jpg)"></a>
						</div>
						<div class="qYuPy col-sm-7 pl0">
							<div class="8lD3a caption">
								<h4 class="XKajH fs-16 text-muted"><a href="/page/detail825.html" title="巨甜！我在禁欲冷王的怀里恃宠而骄">巨甜！我在禁欲冷王的怀里恃宠而骄</a></h4>
								<small class="VQmaJ fs-14 text-muted">浩瀚之渊</small>
								<p class="MFGaB fs-12 text-justify hidden-xs">一场事故,让恶名昭昭的医学博士江云缨穿到了相府又丑又哑的嫡女身上,开局就是与人偷腥？满级邪恶大佬华丽登场,不做柔弱小白花,谁来惹她,个个反杀！什么？渣爹要把她嫁给克妻的痴傻璃王,双腿残疾还不能人道？完美！药死他是不是可以妻承夫业？于是江云缨带着小算盘当上人人同情的璃王妃,摇身成了京都第一美人,还一边风风火火搞事业。开连锁医馆,建最大商会,立暗杀组织,各方势力纷纷栽在她手里,她还直呼不刺激！被甩满脸</p>
							</div>
						</div>
					</div>
				</div>
											<div class="OUbpT col-xs-4 book-coverlist">
					<div class="uvZM5 row">
						<div class="O3ArI col-sm-5">
							<a href="/page/detail1038.html" class="24IJf thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/1/1033/1033s.jpg)"></a>
						</div>
						<div class="gNfCV col-sm-7 pl0">
							<div class="ATzCK caption">
								<h4 class="53oTj fs-16 text-muted"><a href="/page/detail1038.html" title="此情惟你独钟">此情惟你独钟</a></h4>
								<small class="x22YR fs-14 text-muted">阮白</small>
								<p class="ynLzE fs-12 text-justify hidden-xs">定好的试管婴儿,突然变成了要跟那个男人同床怀孕,一夜缠绵,她被折磨的浑身瘫软！慕少凌,慕家高高在上的继承人,沉稳矜贵,冷厉霸道,这世上的事,只有他不想办的,没..</p>
							</div>
						</div>
					</div>
				</div>
											<div class="PCxJ1 col-xs-4 book-coverlist">
					<div class="Dz53C row">
						<div class="brF6V col-sm-5">
							<a href="/page/detail65541.html" class="Ghxgz thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/65/65536/65536s.jpg)"></a>
						</div>
						<div class="gFMd6 col-sm-7 pl0">
							<div class="9T3F5 caption">
								<h4 class="TshUL fs-16 text-muted"><a href="/page/detail65541.html" title="皇家的和尚">皇家的和尚</a></h4>
								<small class="lVT2Q fs-14 text-muted">蓅謃</small>
								<p class="I4QB7 fs-12 text-justify hidden-xs">冯小宝穿越了,竟然变成了名副其实的花和尚。别人穿越都是带着王霸之气,他却只想如何活下去！大唐高宗年间,那是一个多姿多彩的时代,既有威震天下的名臣武将,李靖,长..</p>
							</div>
						</div>
					</div>
				</div>
											<div class="pWrSz col-xs-4 book-coverlist">
					<div class="MvVkL row">
						<div class="QUwYH col-sm-5">
							<a href="/page/detail131077.html" class="5GNTc thumbnail" style="background-image:url(https://img.picturecdn.com/files/article/image/131/131072/131072s.jpg)"></a>
						</div>
						<div class="ZBgtd col-sm-7 pl0">
							<div class="n1Oo6 caption">
								<h4 class="bvVYO fs-16 text-muted"><a href="/page/detail131077.html" title="穿越无限之旅">穿越无限之旅</a></h4>
								<small class="ylGL5 fs-14 text-muted">神人无名</small>
								<p class="ZNDCH fs-12 text-justify hidden-xs">金庸武侠中有不少绝世高手,书中有提及名字的,也有不曾提及名字的,但都是拥有自己独有的绝世武功而名动天下。段誉有六脉神剑,欧阳锋有蛤蟆功,林朝英有玉女素心剑法,..</p>
							</div>
						</div>
					</div>
				</div>
								
								<div class="1hyYi clear"></div>
			</div>
		</div>
	</div>
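
The fetch above is the bare minimum. A slightly more defensive version might add a timeout, fail loudly on HTTP errors, and let requests guess the encoding from the body. The following is a sketch under those assumptions; `fetch_html` and the 10-second default are hypothetical choices, not something the site or the original code requires.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded text (hypothetical helper)."""
    res = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    res.raise_for_status()
    # Guess the charset from the response body; useful when the
    # Content-Type header omits or misreports the encoding.
    res.encoding = res.apparent_encoding
    return res.text
```

Calling `fetch_html("http://www.biqugei.net/")` would then return the same kind of text as `res.text` above.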

Node selectors

To select a node with Beautiful Soup, access the tag name directly as an attribute of the soup object.

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.biqugei.net/")

soup = BeautifulSoup(res.text, 'lxml')

h4 = soup.h4
print(h4)

The result:

<h4 class="iH7NL fs-16 text-muted"><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>

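This approach only returns the first matching element, so selecting several elements this way gets awkward and inflexible. Beautiful Soup also provides query methods such as find_all and find, which make searching much more flexible.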
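
Attribute-style access can also be chained to reach nested tags. The sketch below runs offline on a trimmed copy of one book entry from the output above, and uses html.parser so it works without lxml (both are choices made for this illustration).

```python
from bs4 import BeautifulSoup

# A trimmed copy of one book entry from the page output shown earlier.
html = """
<div class="caption">
  <h4 class="fs-16 text-muted"><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>
  <small class="fs-14 text-muted">石三</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.h4.name)       # the tag name
print(soup.h4["class"])   # class is multi-valued, so this is a list
print(soup.h4.a["href"])  # chained access: first <a> inside the first <h4>
print(soup.small.string)  # text of the first <small>
```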

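Method selectors

find_all

As the name suggests, find_all finds every element that matches the given conditions. You can pass it attributes or text to filter on. Its signature is:

find_all(name, attrs, recursive, string, limit, **kwargs)

find_all can search by name, where the value is the tag name: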

h4s = soup.find_all(name='h4')
print(h4s)

[<h4 class="Lr8df fs-16 text-muted"><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>, <h4 class="5FSmJ fs-16 text-muted"><a href="/page/detail131077.html" title="穿越无限之旅">穿越无限之旅</a></h4>, <h4 class="LM3lo fs-16 text-muted"><a href="/page/detail825.html" title="巨甜！我在禁欲冷王的怀里恃宠而骄">巨甜！我在禁欲冷王的怀里恃宠而骄</a></h4>, <h4 class="bVCAw fs-16 text-muted"><a href="/page/detail65541.html" title="皇家的和尚">皇家的和尚</a></h4>, <h4 class="maBUt fs-16 text-muted"><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>, <h4 class="UsJlw fs-16 text-muted"><a href="/page/detail1038.html" title="此情惟你独钟">此情惟你独钟</a></h4>]

This returns every h4 element. The result behaves like a list, so we can loop over it and get the a tag inside each element:

h4s = soup.find_all(name='h4')
for h4 in h4s:
    name = h4.find_all(name='a')
    print(name)

[<a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a>]
[<a href="/page/detail131077.html" title="穿越无限之旅">穿越无限之旅</a>]
[<a href="/page/detail825.html" title="巨甜！我在禁欲冷王的怀里恃宠而骄">巨甜！我在禁欲冷王的怀里恃宠而骄</a>]
[<a href="/page/detail65541.html" title="皇家的和尚">皇家的和尚</a>]
[<a href="/page/detail860.html" title="上门狂婿">上门狂婿</a>]
[<a href="/page/detail1038.html" title="此情惟你独钟">此情惟你独钟</a>]

To extract the text of each a tag, use the string attribute:

h4s = soup.find_all(name='h4')
for h4 in h4s:
    name = h4.find_all(name='a')[0]  # find_all returns a list, so index it to get the first element
    print(name.string)

神宠又给我开挂了
穿越无限之旅
巨甜！我在禁欲冷王的怀里恃宠而骄
皇家的和尚
上门狂婿
此情惟你独钟
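
One caveat with string: it only has a value when the tag contains a single piece of text (directly, or through a single nested tag). When a tag has several children, string is None and get_text() is the safer choice. The second h4 below is hypothetical markup added to show that case; it does not come from the target site.

```python
from bs4 import BeautifulSoup

html = """
<h4 id="simple"><a href="/page/detail860.html">上门狂婿</a></h4>
<h4 id="mixed"><a href="/page/detail825.html">Title</a><small>new</small></h4>
"""
soup = BeautifulSoup(html, "html.parser")

simple = soup.find(id="simple")
mixed = soup.find(id="mixed")

print(simple.string)                    # the single nested string is found
print(mixed.string)                     # None: two children, so no single string
print(mixed.get_text(" ", strip=True))  # joins every piece of text with a space
```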

What about the href attribute of each a tag? Use attrs to read the tag's attributes:

h4s = soup.find_all(name='h4')
for h4 in h4s:
    name = h4.find_all(name='a')[0]  # find_all returns a list, so index it to get the first element
    print(name.attrs['href'])

/page/detail65541.html
/page/detail1038.html
/page/detail860.html
/page/detail825.html
/page/detail1292.html
/page/detail131077.html
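
The href values above are site-relative paths. To request the detail pages they need to be joined with the site address, which the standard library's urljoin can do. The sketch below also shows tag["href"] and tag.get(), which are shorthand for reading from attrs; BASE_URL is just the target address used earlier in this article.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.biqugei.net/"
html = '<h4><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>'

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

print(link["href"])        # same value as link.attrs["href"]; KeyError if absent
print(link.get("target"))  # get() returns None for a missing attribute

absolute = urljoin(BASE_URL, link["href"])
print(absolute)
```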

That covers extracting elements with the name parameter of find_all. We can also select elements by their attributes:

a = soup.find_all(attrs={'title': '皇家的和尚'})
print(a)

[<a href="/page/detail65541.html" title="皇家的和尚">皇家的和尚</a>]

This returns the a tag whose title attribute equals 皇家的和尚, again as a list. Extracting the text and attributes works the same way as before:

a = soup.find_all(attrs={'title': '皇家的和尚'})[0]  # find_all returns a list, so index it to get the first element
href = a.attrs['href']
name = a.string
print(href)
print(name)

# Result:
/page/detail65541.html
皇家的和尚
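
In the outputs above, the first class on each tag (iH7NL in one run, Lr8df in another) differs between requests, so it is not a reliable thing to match on. The stable classes, such as book-coverlist, are a better target. The sketch below assumes that observation holds and uses the class_ keyword, which matches any one of a tag's classes, to collect one record per book from a trimmed sample of the page.

```python
from bs4 import BeautifulSoup

html = """
<div class="T5EB3 col-xs-4 book-coverlist">
  <h4 class="iH7NL fs-16 text-muted"><a href="/page/detail1292.html" title="神宠又给我开挂了">神宠又给我开挂了</a></h4>
  <small class="XxTAl fs-14 text-muted">石三</small>
</div>
<div class="taypp col-xs-4 book-coverlist">
  <h4 class="fpiwN fs-16 text-muted"><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>
  <small class="EdfFX fs-14 text-muted">狼叔当道</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

books = []
for card in soup.find_all("div", class_="book-coverlist"):
    # attrs={"title": True} matches any <a> that has a title attribute at all.
    link = card.find("a", attrs={"title": True})
    books.append({
        "title": link["title"],
        "author": str(card.find("small").string),
        "href": link["href"],
    })

for book in books:
    print(book)

# limit caps the number of results returned.
first_only = soup.find_all("div", class_="book-coverlist", limit=1)
print(len(first_only))
```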

The string parameter of find_all matches against the text of nodes. It accepts either a string or a compiled regular expression:

import re

a = soup.find_all(string=re.compile('和尚'))
print(a)

['皇家的和尚', '冯小宝穿越了,竟然变成了名副其实的花和尚。别人穿越都是带着王霸之气,他却只想如何活下去！大唐高宗年间,那是一个多姿多彩的时代,既有威震天下的名臣武将,李靖,长..']

Here string is given a compiled regular expression, and the result is a list of the text of every node that matches it.
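
Because a string search returns text nodes rather than tags, getting back to the enclosing element takes one more step: each result has a parent. Combining a tag name with string returns the tags directly. This is a sketch on a shortened sample, with the description text cut down for brevity.

```python
import re

from bs4 import BeautifulSoup

html = """
<h4><a href="/page/detail65541.html" title="皇家的和尚">皇家的和尚</a></h4>
<p class="fs-12">冯小宝穿越了,竟然变成了名副其实的花和尚。</p>
<h4><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("和尚")

# string alone: the results are text nodes; .parent is the tag holding each one.
texts = soup.find_all(string=pattern)
parent_names = [text.parent.name for text in texts]
print(parent_names)

# name + string: the results are the <a> tags whose text matches.
links = soup.find_all("a", string=pattern)
print([link["href"] for link in links])
```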

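find

Besides find_all, the find method also extracts matching elements. The difference is that find returns only the first matching element, while find_all returns a list of all matching elements.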

h4 = soup.find(name='h4')
print(h4)

<h4 class="Okwmx fs-16 text-muted"><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>

Only the first matching element with the tag name h4 is returned.
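
One difference worth handling in real code: when nothing matches, find returns None while find_all returns an empty list, so accessing an attribute on a failed find raises AttributeError. A minimal guard might look like this; `first_text` is a hypothetical helper name.

```python
from typing import Optional

from bs4 import BeautifulSoup

html = '<h4><a href="/page/detail860.html" title="上门狂婿">上门狂婿</a></h4>'
soup = BeautifulSoup(html, "html.parser")


def first_text(soup: BeautifulSoup, tag_name: str) -> Optional[str]:
    """Return the text of the first matching tag, or None if there is none."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.get_text(strip=True)


print(first_text(soup, "h4"))
print(first_text(soup, "h5"))  # no <h5> in the sample, so this is None
print(soup.find_all("h5"))     # find_all gives an empty list instead
```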

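Summary

That wraps up this introduction to Beautiful Soup. For selecting nodes, find_all and find are the recommended methods. Beautiful Soup also provides other search methods, which differ in the part of the document they search: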

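Method	Description
find_parents and find_parent	The former returns all matching ancestor nodes; the latter returns the nearest matching parent node
find_next_siblings and find_next_sibling	The former returns all following sibling nodes; the latter returns the first following sibling node
find_previous_siblings and find_previous_sibling	The former returns all preceding sibling nodes; the latter returns the first preceding sibling node
find_all_next and find_next	The former returns all matching nodes after the current node; the latter returns the first matching node after it
find_all_previous and find_previous	The former returns all matching nodes before the current node; the latter returns the first matching node before it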
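
A few of these methods in action, as a sketch on one book entry. The p text here is a placeholder standing in for the real summary, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="caption">
  <h4><a href="/page/detail1038.html" title="此情惟你独钟">此情惟你独钟</a></h4>
  <small>阮白</small>
  <p>summary text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")
h4 = soup.find("h4")

# Walk upward from the link to the enclosing div.
print(link.find_parent("div")["class"])

# Siblings of the h4: the author comes right after it.
print(h4.find_next_sibling("small").string)
print([tag.name for tag in h4.find_next_siblings(["small", "p"])])

# Walk backward from the summary paragraph to the h4 sibling before it.
print(soup.find("p").find_previous_sibling("h4").a["title"])

# find_next searches everything after the node in document order,
# not only its siblings.
print(link.find_next("small").string)
```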