Getting href with bs4: Python in Practice, the BS4 Parsing Framework for Web Scraping (7)

Latest recommended article published 2023-05-09 21:40:45

weixin_39957934

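Beautiful Soup 4.4.0 Documentation

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. It can save you hours or even days of work.

This document covers all the major features of Beautiful Soup 4, with small examples. It shows what the library is good for, how it works, how to use it, how to get the results you want, and how to handle unexpected situations.

The examples in this document produce the same results under Python 2.7 and Python 3.2.

You may be looking for the Beautiful Soup 3 documentation. Beautiful Soup 3 is no longer being developed, and we recommend Beautiful Soup 4 for current projects; see "Porting to BS4".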

Data source

Obtain the page's HTML source

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parse this markup with BeautifulSoup

Doing so gives a BeautifulSoup object, which can output the document as a conventionally indented structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

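Parameters:

1. html_doc: the markup to parse

2. 'html.parser': the parser to use (passed as the features argument); different parsers handle different kinds of data, such as HTML and XML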
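
The indented output mentioned above can be produced with prettify(). The following is a minimal sketch; `snippet` is a shortened, hypothetical stand-in for html_doc, chosen only to keep the printed result small.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened stand-in for the html_doc string used in this article.
snippet = '<p class="title"><b>The Dormouse\'s story</b></p>'

soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns the parse tree as a string, with each tag and each
# string on its own line, indented according to nesting depth.
pretty = soup.prettify()
print(pretty)
```

Because prettify() returns an ordinary string, it can also be written to a file or compared in a test instead of being printed.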

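Parsers

The main parsers are listed below. Each entry gives the parser, its usage, its advantages, and its disadvantages:

Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python; moderate speed; tolerant of malformed markup
    Disadvantages: poor tolerance of malformed markup in versions before Python 2.7.3 or 3.2.2

lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast; tolerant of malformed markup
    Disadvantages: requires an external C library

lxml XML parser
    Usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
    Advantages: fast; the only parser that supports XML
    Disadvantages: requires an external C library

html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: the most tolerant of malformed markup; parses documents the way a browser does; produces valid HTML5
    Disadvantages: slow; requires an external Python package (pure Python, no C extension)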
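
Which of these parsers can be used depends on what is installed. The sketch below assumes only that bs4 itself is available: it tries each parser name from the list above on a deliberately malformed fragment and records how each one repairs it. The names `markup` and `results` are illustrative, not part of the original article.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # deliberately malformed: a stray closing </p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        results[parser] = None

for parser, rendered in results.items():
    print(parser, "->", rendered if rendered is not None else "not installed")
```

Different parsers may repair the same broken markup differently, so it is safer to name the parser explicitly than to rely on whichever one happens to be installed.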

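Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

The HTML's own tags are converted by Beautiful Soup as follows:

the whole HTML document: a BeautifulSoup object

each tag: a Tag object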

# 1. Parse the data source
# (html5lib is a separate package; 'html.parser' needs no extra install)
soup = BeautifulSoup(html_doc,'html5lib')
print(type(soup))
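
To make the four kinds concrete, here is a small sketch on a hypothetical one-line fragment (not the article's html_doc) that contains a tag, a comment, and a plain string.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: one tag holding a comment followed by text.
soup = BeautifulSoup("<b><!--a note-->bold text</b>", "html.parser")

tag = soup.b
comment, text = tag.contents  # the two children of <b>, in document order

kinds = [type(obj).__name__ for obj in (soup, tag, comment, text)]
print(kinds)

# Comment is a special kind of NavigableString.
print(isinstance(comment, NavigableString))
```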

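Name

Every tag has a name, available through .name. Accessing a tag by its name as an attribute of the soup (for example soup.title) returns only the first tag with that name.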

# 2. soup.<tag name> returns the first tag with that name
title = soup.title
print(type(title),title)

p = soup.p
print(type(p),p)

Output:

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'> <title>The Dormouse's story</title>
<class 'bs4.element.Tag'> <p class="title"><b>The Dormouse's story</b></p>
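
The difference between .name and attribute-style access can be sketched as follows. The two-paragraph fragment is a hypothetical cut-down of html_doc, and the rename at the end is only there to show that .name is writable.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down of html_doc with two <p> tags.
fragment = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(fragment, "html.parser")

first_p = soup.p             # attribute access: only the first <p>
print(first_p.name)          # the tag's name as a string

all_p = soup.find_all("p")   # every <p> in the document
print(len(all_p))

# Assigning to .name renames the tag in the generated markup.
soup.b.name = "strong"
renamed = str(soup.p)
print(renamed)
```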

Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". A tag's attributes are accessed the same way as a dictionary:

# 3. Tag.attrs returns the attributes; access them like a dictionary
a = soup.a.attrs
print(a,a['href'],a.get('id'))

Output:

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'} http://example.com/elsie link1

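Note

attrs: a dictionary; each name=value attribute becomes a key: value pair
id is unique: its value is a single string
class may hold several values: its value is a list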
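
Since getting href is the topic of this article, here is a short sketch of the dictionary-style operations directly on a tag. The single-link fragment, with a second class value added, is a hypothetical variation on html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical link with two class values, to show the list behaviour.
fragment = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
a = BeautifulSoup(fragment, "html.parser").a

href = a["href"]          # raises KeyError if the attribute is missing
missing = a.get("title")  # get() returns None instead of raising
classes = a["class"]      # multi-valued attribute: a list of strings

print(href, missing, classes)

# Attributes can be changed or removed like dictionary entries.
a["id"] = "first-link"
del a["class"]
print(a)
```

Using get() is the safer choice when scraping real pages, where some links may lack an href.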

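.string and .text

The text inside a tag can be retrieved with .string or .text

.string: if the tag has exactly one child node, returns its string (following a single nested tag, as in <p><b>...</b></p>); if there are several children, it cannot tell which one is meant and returns None
.text: returns the strings of all nodes beneath the tag, joined together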

# 4. Strip the markup and keep only the text of the child nodes
body = soup.body.string
print(body)

p = soup.p.string
print(p)

body = soup.body.text
print(body)

Output:

None
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...
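
The whitespace in the output above comes from the newlines in the source markup. As a sketch of how .string, .text, and the related get_text() and stripped_strings behave, consider this hypothetical single-paragraph fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with mixed text and tag children.
fragment = (
    '<p class="story">Once upon a time: '
    '<a href="http://example.com/elsie">Elsie</a>, '
    '<a href="http://example.com/lacie">Lacie</a></p>'
)
p = BeautifulSoup(fragment, "html.parser").p

only_child = p.a.string   # <a> has a single string child
ambiguous = p.string      # <p> has several children, so this is None
full_text = p.text        # all strings joined together

# get_text() accepts a separator and can strip each piece of whitespace.
joined = p.get_text(" | ", strip=True)

# stripped_strings yields each non-empty string with whitespace removed.
pieces = list(p.stripped_strings)

print(only_child, ambiguous, full_text, joined, pieces, sep="\n")
```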

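Advanced queries: find_all() and find()

find_all( name , attrs , recursive , string , **kwargs )

find_all() searches all tags beneath the current tag and returns every one that matches the filter conditions

Returns: a list of tags, [tag, tag, ...]

find() searches the tags beneath the current tag and returns the first one that matches the filter conditions

Returns: a single tag, or None if nothing matches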

title = soup.find_all(name='title') # query by a single tag name
print(title)

tag_list = soup.find_all(name=['title','a']) # query by several tag names
print(tag_list)

a_list = soup.find_all(name='a',attrs={'id':'link3'}) # single tag name plus attribute filter
print(a_list)

tag_list = soup.find_all(name=['title','a'],attrs={'class':'sister'}) # several tag names plus attribute filter
print(tag_list)

a = soup.find(name='a',attrs={'id':'link3'}) # single tag name plus attribute filter, first match only
print(a)

tag = soup.find(name=['title','a'],attrs={'class':'sister'}) # several tag names plus attribute filter, first match only
print(tag)

Output:

[<title>The Dormouse's story</title>]
[<title>The Dormouse's story</title>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
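
Beyond name and attrs, find_all() accepts a few other filters that are handy for collecting href values. The sketch below uses a hypothetical cut-down of html_doc; class_, limit, string, and regular-expression filters are standard bs4 features, while the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical cut-down of html_doc containing only the three links.
fragment = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(fragment, "html.parser")

# href=True keeps only tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# class_ (with a trailing underscore) filters by CSS class; limit caps the count.
first_two = soup.find_all("a", class_="sister", limit=2)

# A compiled regular expression is matched against the attribute value.
by_regex = [a["id"] for a in soup.find_all(href=re.compile(r"/(elsie|tillie)$"))]

# string matches on the tag's text; find() returns None when nothing matches.
lacie = soup.find("a", string="Lacie")
nothing = soup.find("table")

print(hrefs, len(first_two), by_regex, lacie["id"], nothing, sep="\n")
```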

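CSS selectors

Beautiful Soup supports most CSS selectors; the reference below covers the full syntax

Selectors (www.w3.org)

Pass a selector string to the .select() or .select_one() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

Select by tag name: tagname
Select by class name: .classname
Select by id: #id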

# 6. CSS selection

title_list = soup.select('title')
print('select',title_list)

title = soup.select_one('title')
print('select_one',title)


sister_list = soup.select('.sister')
print('select',sister_list)

sister = soup.select_one('.sister')
print('select_one',sister)


link1_list = soup.select('#link1')
print('select',link1_list)

link1 = soup.select_one('#link1')
print('select_one',link1)

Output:

select [<title>The Dormouse's story</title>]
select_one <title>The Dormouse's story</title>
select [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
select_one <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
select [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
select_one <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
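
The three basic selectors can be combined. The following sketch, again on a hypothetical cut-down of html_doc, shows a child combinator, an attribute selector, a selector group, and href extraction; it assumes a bs4 version whose select() is backed by the soupsieve package, which is installed alongside recent releases.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down of html_doc containing only the three links.
fragment = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(fragment, "html.parser")

# Child combinator: <a> tags that are direct children of <p class="story">.
children = soup.select("p.story > a")

# Attribute selector: href values ending in "lacie".
ends_with = soup.select('a[href$="lacie"]')

# Selector group: either of two ids.
grouped = [a["id"] for a in soup.select("a#link1, a#link3")]

# Collect every href from the links matched by class.
hrefs = [a["href"] for a in soup.select("a.sister")]

# select_one() returns None when nothing matches.
nothing = soup.select_one("table")

print(len(children), ends_with[0]["id"], grouped, hrefs, nothing, sep="\n")
```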

Author: adapted from the official Beautiful Soup documentation
Original source: adapted from the official Beautiful Soup documentation
Original link: partly adapted, mostly original

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id14