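I have been scraping website data a lot recently and use BeautifulSoup frequently, but I am not yet fluent with it, so I am summarizing its usage here for future reference. The test data in this post comes from the official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html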
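1. Basic usage of BeautifulSoup
1.1 Introduction to BeautifulSoup
BeautifulSoup is a third-party Python library for extracting data from HTML or XML documents. Working with the parser of your choice, it provides idiomatic ways of navigating, searching, and modifying the document.
Constructing a BeautifulSoup object typically takes two arguments: the first is the HTML text to parse, as a string; the second tells BeautifulSoup which parser to use (for example Python's built-in html.parser, or the third-party parsers lxml and html5lib).
A BeautifulSoup object is constructed as follows:
soup = BeautifulSoup(html_doc, 'lxml')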
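Since lxml is a third-party parser, it may not be installed everywhere. The following is a minimal sketch of one way to handle that, assuming you are happy to fall back to the built-in html.parser. The helper name make_soup and the short markup string are hypothetical and are used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred='lxml'):
    """Hypothetical helper: parse with `preferred`, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


# Illustrative markup, shortened from the test data used in this post.
sample = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = make_soup(sample)
title_text = soup.title.string
print(title_text)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth using the same parser everywhere in a real project.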
1.2 Pretty-printing an HTML document
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Thu May 4 13:56:00 2017
@author: zch
"""
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.prettify())
The formatted output is as follows:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
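1.3 Ways of navigating the structured data
(1) Getting the attributes of the HTML document's title
(2) Getting the attributes of HTML hyperlinks (a)
(3) Getting the attributes of HTML paragraphs (p)
(4) Finding matches in the HTML with the find method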
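The four approaches listed above can be sketched roughly as follows, using the same test document. This is only an illustration: the variable names are my own, and html.parser is used here just so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Assumption: the built-in parser is used so no extra install is needed.
soup = BeautifulSoup(html_doc, 'html.parser')

# (1) The title tag: its name, its text, and the name of its parent tag.
title_name = soup.title.name
title_text = soup.title.string
title_parent = soup.title.parent.name

# (2) The first hyperlink: attributes are read like dictionary entries.
first_link = soup.a
link_href = first_link['href']
link_id = first_link.get('id')
link_classes = first_link['class']  # a list, because class is multi-valued

# (3) The first paragraph: its class attribute and its text.
first_p = soup.p
p_classes = first_p['class']
p_text = first_p.get_text()

# (4) find returns the first match only; here the match is by id.
link3 = soup.find(id='link3')
link3_text = link3.get_text()

print(title_name, title_text, title_parent)
print(link_href, link_id, link_classes)
print(p_classes, p_text)
print(link3_text)
```

Note that soup.a and soup.p return only the first matching tag; to get all of them, use find_all, as in the test code in section 2.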
2. BeautifulSoup test examples
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Thu May 4 15:11:23 2017
@author: zch
"""
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
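print('Test 1: get all links')
# find_all returns every matching tag in the document
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())
print('Test 2: get a link by regular-expression match')
# find returns only the first tag whose href matches the pattern
link_node = soup.find('a', href=re.compile(r"cie"))
print(link_node.name, link_node['href'], link_node.get_text())
print('Test 3: get the story text')
# class_ (with a trailing underscore) is used because class is a Python keyword
p_text = soup.find('p', class_='story')
print(p_text.name, p_text.get_text())
#print(soup.p.get_text())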
The test results are shown in the figure below: