Parsing HTML Pages with BeautifulSoup (Python 3 Web Scraping)

图图0-0

Article link: https://blog.csdn.net/weixin_40116618/article/details/80363367

HTML parser: a tool that extracts useful information from a web page.

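BeautifulSoup supports several parsers, including Python's built-in html.parser, lxml's HTML and XML parsers, and html5lib.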
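Because lxml and html5lib are separate packages that may not be installed, a script can try the parsers in order of preference and fall back to the built-in one. The following is a minimal sketch, not an official recipe: `make_soup` and its `preferred` argument are hypothetical names introduced here, while `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used_parser = make_soup("<p>hello</p>")
print(used_parser)          # which parser was actually picked
print(soup.p.get_text())    # text of the first p tag
```

html.parser ships with Python, so the last entry in the tuple should always be available.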

In short, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description is as follows:

----------------------------------------------------------------------------------------------------------------------------------

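Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding. Beautiful Soup sits on top of Python parsers such as lxml and html5lib, so you can try out different parsing strategies or trade flexibility for speed.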
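The encoding behavior described here can be sketched as follows. The GBK-encoded byte string is a made-up input for illustration; `from_encoding` is the constructor argument for stating the original encoding, and it only applies when the markup is passed as bytes.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK bytes with no charset declaration in the markup.
raw = "<p>中文网页</p>".encode("gbk")

# State the original encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()             # decoded to a Unicode str
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(soup.original_encoding)        # the encoding Beautiful Soup settled on
print(utf8_bytes.decode("utf-8"))
```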

----------------------------------------------------------------------------------------------------------------------------------

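The typical BeautifulSoup workflow is: create a BeautifulSoup object from the HTML document, search for nodes with find_all or find, then read each node's name, attributes, and text.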

Nodes can be searched by name, by attribute, or by text.

Syntax: find_all(name, attrs, string)

For example:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

a: name; href: attribute; Elsie: text
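As a small illustration of the three search keys, the sketch below parses just that one tag and looks it up by name, by attribute, and by text. The variable names are illustrative; `class_` and `string` are the keyword arguments find_all accepts for the class attribute and for the tag's text.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")

by_name = soup.find_all("a")                                    # match on tag name
by_attr = soup.find_all("a", href="http://example.com/elsie")   # match on an attribute
by_class = soup.find_all("a", class_="sister")                  # class needs a trailing underscore
by_text = soup.find_all("a", string="Elsie")                    # match on the tag's text

for result in (by_name, by_attr, by_class, by_text):
    print(len(result), result[0]["id"])
```

`class` is a reserved word in Python, which is why the keyword is spelled `class_`.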
# Author: zhang ying
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
 <title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Build a BeautifulSoup object from the downloaded HTML string, loading it as a DOM-like tree
soup = BeautifulSoup(
    html_doc,       # document string
    'html.parser')  # parser
# html_doc is already a str, so from_encoding is not passed; it only applies to bytes input

# Search for nodes: find_all, find
# soup.find('a') and soup.a give the same result: the first a tag
print(soup.find('a'))
print(soup.a)
Output:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Find the first p tag
print(soup.p)
print(soup.find('p'))
Output:
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
# Get all a tags (the result is a list), then loop over them and print each node's name, attribute, and text
links = soup.find_all('a')
print(links)
for link in links:
    print(link)
    print(link.name, link['href'], link.get_text())
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a http://example.com/elsie Elsie
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
a http://example.com/lacie Lacie
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
a http://example.com/tillie Tillie
# Find a tags whose href attribute is http://example.com/lacie
print(soup.find_all('a', href='http://example.com/lacie'))
Output:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
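The searches in this article all target nodes that exist. The sketch below shows what find, find_all, and attribute access return when nothing matches, which is worth guarding against in a real scraper. The one-tag snippet and the "#" default are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one a tag with no href, and no p tag at all.
soup = BeautifulSoup('<a id="link1">Elsie</a>', "html.parser")

missing_tag = soup.find("p")        # find returns None when nothing matches
missing_list = soup.find_all("p")   # find_all returns an empty list

link = soup.find("a")
href = link.get("href")                  # .get returns None for a missing attribute
href_or_default = link.get("href", "#")  # or a default you supply

if missing_tag is None:
    print("no p tag found")
print(len(missing_list), href, href_or_default)
```

Indexing with link['href'] would raise KeyError here, so .get is the safer choice when an attribute may be absent.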