[Python Web Scraping] Beautiful Soup Documentation

Article link: https://blog.csdn.net/weixin_43923790/article/details/106072391

Beautiful Soup is a Python library for extracting data from HTML and XML files.

Quick Start

Define an HTML document by hand; the examples that follow use it.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parse this document with BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

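Here are a few other ways to navigate the parsed document:

print(soup.title)              # the <title> tag
print(soup.title.name)         # the tag's name: title
print(soup.title.string)       # the text inside <title>
print(soup.title.parent.name)  # name of the enclosing tag: head
print(soup.p)                  # the first <p> tag in the document
print(soup.a)                  # the first <a> tag in the document
print(soup.find_all('a'))      # a list of all three <a> tags
print(soup.find(id='link3'))   # the tag whose id is link3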

Find the link of every <a> tag in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
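The two ways of reading an attribute behave differently when the attribute is absent. The sketch below uses a made-up two-link snippet (the variable names are illustrative only) to show that .get() returns None for a missing href, while subscript access raises KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration: the second <a> has no href.
snippet = '<a href="http://example.com/elsie">Elsie</a><a name="anchor">No link</a>'
soup = BeautifulSoup(snippet, "html.parser")

# .get() returns None when the attribute is missing.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)

# Subscript access raises KeyError instead.
try:
    soup.find_all("a")[1]["href"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```

Using .get() is usually the safer choice in a crawler, where some anchors may carry no href at all.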

Extract all the text from the document:

print(soup.get_text())
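By default get_text() concatenates every string exactly as it appears, whitespace included. As a minimal sketch on a made-up snippet, passing a separator together with strip=True gives tidier output:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration, with untidy whitespace.
snippet = "<p>  Once upon a time  <b>three</b> sisters  </p>"
soup = BeautifulSoup(snippet, "html.parser")

# Default: strings are joined as-is.
raw = soup.get_text()

# Strip each string and join the non-empty pieces with a single space.
joined = soup.get_text(" ", strip=True)

print(repr(raw))
print(repr(joined))
```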

Installing Beautiful Soup

First, install the Beautiful Soup library:

pip install beautifulsoup4

Installing a Parser

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml:

pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does:

pip install html5lib

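The main parsers, with their advantages and disadvantages:

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed markup	Slower than lxml; less lenient than html5lib
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the same way a browser does; produces valid HTML5	Very slow; requires an external Python package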
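The differences between parsers show up most clearly on invalid markup. The sketch below feeds the same broken fragment to each parser and skips any that are not installed. The results dictionary and loop are illustrative, not part of Beautiful Soup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a stray closing </p> after an unclosed <a>.
markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

With html.parser the stray </p> is simply dropped, leaving <a></a>. The third-party parsers, when available, typically wrap the fragment in a full document, so the same input can produce different trees depending on the parser chosen.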

How to Use

Pass a document to the BeautifulSoup constructor to get a document object. You can pass in either a string or an open file handle.

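from bs4 import BeautifulSoup

# Parse from an open file handle (index.html must exist).
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Parse from a string.
soup = BeautifulSoup("<html>data</html>", "html.parser")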

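First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

Then Beautiful Soup parses the document with the best parser available. If you name a parser explicitly, Beautiful Soup uses that one instead.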
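A minimal sketch of the conversion step, using made-up input: an HTML entity comes out as the character it names, and byte input comes out as ordinary Python text. The from_encoding argument is given here only to make the byte decoding explicit.

```python
from bs4 import BeautifulSoup

# An HTML entity in the input becomes the Unicode character it names.
soup = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")
text = soup.p.string
print(text)

# Bytes input is decoded to text; the encoding is stated explicitly here.
raw_bytes = "<p>café</p>".encode("utf-8")
soup_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
decoded = soup_bytes.p.string
print(decoded)
```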

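Kinds of Objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.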
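The four kinds can be seen side by side in one small tree. The sketch below assumes a made-up fragment containing a comment followed by text, and collects the class name of each node.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment: one tag holding a comment and a text string.
markup = '<b class="boldest"><!--a comment-->Extremely bold</b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b
comment, text = tag.contents  # children in document order

kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
```

Comment is a special kind of NavigableString, and the BeautifulSoup object itself behaves like a Tag in most respects, which is why the same search methods work on both.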

Tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)      # the tag's name: b
print(tag['class'])  # class is multi-valued, so this is a list: ['boldest']
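A tag's name and attributes can be changed as well as read, and attributes behave much like dictionary entries. A minimal sketch, with the id attribute added purely for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

# .attrs exposes all attributes as a dictionary.
attrs_before = dict(tag.attrs)

# Rename the tag, change one attribute, delete another.
tag.name = "strong"
tag["id"] = "summary"
del tag["class"]

rendered = str(tag)
print(attrs_before)
print(rendered)
```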

Searching the Tree

A string

soup.find_all('b')

A regular expression

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

A list

soup.find_all(["a", "b"])

True

The value True matches every tag. The code below finds all the tags in the document but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
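The four filter types above can be compared on one small document. In this sketch the document and the names() helper are made up for illustration; each call returns the names of the matching tags in document order.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document for illustration.
doc = (
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")


def names(tag_filter):
    """Return the names of all tags matched by the given filter."""
    return [tag.name for tag in soup.find_all(tag_filter)]


by_string = names("b")                    # exact tag name
by_regex_start = names(re.compile("^b"))  # names starting with "b"
by_regex_any = names(re.compile("t"))     # names containing "t"
by_list = names(["a", "b"])               # any name in the list
by_true = names(True)                     # every tag

for result in (by_string, by_regex_start, by_regex_any, by_list, by_true):
    print(result)
```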

find_all()

soup.find_all("title")
soup.find_all("p", "title")
soup.find_all("a")
soup.find_all(id="link2")
soup.find_all("a", class_="sister")
import re
soup.find(string=re.compile("sisters"))
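The calls above combine a tag name with attribute and string filters. The sketch below runs the same patterns on a shortened, made-up version of the sisters document so that the results are easy to check.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
doc = (
    '<p class="title"><b>Story</b></p>'
    '<p class="story">Three sisters: '
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>, '
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a></p>'
)
soup = BeautifulSoup(doc, "html.parser")

# A second positional argument is treated as a CSS class.
title_paragraphs = soup.find_all("p", "title")

# Keyword arguments filter on attributes of the same name.
link2 = soup.find_all(id="link2")

# class is a reserved word in Python, hence class_.
sisters = soup.find_all("a", class_="sister")

# string= searches text nodes instead of tags.
first_match = soup.find(string=re.compile("sisters"))

print(title_paragraphs, link2, sisters, repr(first_match), sep="\n")
```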

find()

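The following two lines are nearly equivalent. The only difference is that find_all() returns a list containing the single result, while find() returns the result itself: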

soup.find_all('title', limit=1)
soup.find('title')

# Navigating by tag name calls find() repeatedly, so these two lines give the same tag:
soup.head.title
soup.find("head").find("title")
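The difference between the two methods matters most when nothing matches. A minimal sketch, using a made-up tag name that does not occur in the document:

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(doc, "html.parser")

as_list = soup.find_all("title", limit=1)  # a list with one tag
as_tag = soup.find("title")                # the tag itself

# "nosuchtag" is a made-up name with no match in the document.
missing_list = soup.find_all("nosuchtag")  # an empty list
missing_tag = soup.find("nosuchtag")       # None

print(as_list, as_tag, missing_list, missing_tag, sep="\n")
```

Because find() can return None, it is worth checking the result before chaining another call onto it.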

CSS Selectors

Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
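To see what a few of these selectors return, the sketch below runs them on a shortened, made-up document. It also uses select_one(), which returns only the first match rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened document for illustration.
doc = (
    "<html><head><title>Story</title></head><body>"
    '<p class="story">'
    '<a id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

title = soup.select_one("head > title")       # first match only
second = soup.select("p > a:nth-of-type(2)")  # second <a> child of <p>
by_id = soup.select("p > #link1")             # child of <p> with id link1
not_direct = soup.select("body > a")          # <a> is not a direct child of <body>

print(title, second, by_id, not_direct, sep="\n")
```

The last selector returns an empty list because > matches direct children only, and every <a> here sits inside a <p>.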