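1. Introduction to Beautiful Soup
When we extracted web page information with regular expressions earlier, any mistake in the expression meant we could not get the result we needed. Because web pages have a particular structure and hierarchy, a parsing tool such as Beautiful Soup can use the structure and attributes of the page to parse it. Compared with regular expressions, it extracts page content with simpler statements.
In short, Beautiful Soup is a Python library for parsing HTML and XML, and it makes extracting data from web pages convenient. Its official description is as follows:
2. Parsers
Comparing the available parsers shows that the lxml parser can handle both HTML and XML, is fast, and tolerates malformed markup well, so it is the recommended choice. To use it, pass 'lxml' as the second argument when initializing BeautifulSoup.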
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>','lxml')
print(soup.p.string)
Output:
hello
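The parser choice above can also be made defensively. The following is a minimal sketch, not part of the original tutorial: the helper name make_soup and its parameters are hypothetical, and it assumes that falling back to Python's built-in html.parser is acceptable when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Parse markup with the preferred parser, falling back if it is missing.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested
        # parser library is not installed.
        return BeautifulSoup(markup, fallback)


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Note that the two parsers can build slightly different trees for incomplete documents, so the fallback is best suited to simple lookups like this one.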
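3. Installing Beautiful Soup
Before using it, make sure that both Beautiful Soup and lxml are installed correctly. They can be installed with pip directly from cmd, using the following commands: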
pip install beautifulsoup4
pip install lxml
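To confirm that the installation worked, a quick check like the one below may help. It is only a sketch: the helper name installed is hypothetical, and it relies on the fact that the beautifulsoup4 package is imported under the module name bs4.

```python
import importlib.util


def installed(module_name):
    """Return True if the named top-level module can be found."""
    return importlib.util.find_spec(module_name) is not None


# beautifulsoup4 is imported as "bs4"; lxml is imported as "lxml".
for name in ("bs4", "lxml"):
    status = "installed" if installed(name) else "missing"
    print(f"{name}: {status}")
```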
4. Basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())  # print the parsed tree with indentation; missing tags were filled in during parsing
print(soup.title.string)  # text content of the title node
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
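First we declare the variable html as a string. Note that it is not a complete HTML document: the body and html tags are never closed. Next, the string is passed as the first argument to BeautifulSoup, with the parser type (here lxml) as the second argument. This initializes the BeautifulSoup object, which is then assigned to the variable soup. After that, we can call the methods and attributes of soup to parse this HTML.
① Calling the prettify method. This outputs the parsed document as a string with standard indentation. The non-standard HTML string was already corrected when the BeautifulSoup object was initialized; prettify only formats the result.
② Accessing soup.title.string. This outputs the text content of the title node in the HTML.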
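The difference between parsing and pretty-printing can be seen in a small sketch. The markup string and variable names below are illustrative, and the example assumes Python's built-in html.parser so that it runs without lxml; with lxml the repaired tree may differ in detail.

```python
from bs4 import BeautifulSoup

# Deliberately incomplete: <p>, <body> and <html> are never closed.
broken = "<html><body><p>hello"

# Assumption: html.parser is used here so no extra install is needed.
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)      # serialized tree, without pretty-printing
pretty = soup.prettify()  # same tree, one node per line with indentation

print(repaired)
print(pretty)
```

Both strings contain the closing tags, which suggests the repair belongs to the parsing step and prettify only changes the layout.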
5. Node selectors
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title)  # the selected title node
print(type(soup.title))  # the type of the title node
print(soup.title.string)  # the text inside the title node
print(soup.head)  # the head node
print(soup.p)  # the first p tag only
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
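[Note] bs4.element.Tag is an important data structure in BeautifulSoup. Every result returned by a node selector is of this Tag type.
6. Extracting information
# The examples below all use this HTML text: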
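A short sketch of what working with a Tag looks like is shown below. The sample markup is made up for illustration, and html.parser is assumed so the example runs without lxml. It also shows one edge case worth knowing: selecting a tag name that does not occur in the document returns None rather than a Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative markup with two p tags; not taken from the tutorial's sample.
html = '<p class="title"><b>first</b></p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p  # attribute access returns only the first matching node

print(isinstance(first_p, Tag))    # the selection result is a Tag
print(first_p.name)                # the tag name
print(first_p["class"])            # class is multi-valued, so this is a list
print(isinstance(first_p.b, Tag))  # selecting inside a Tag yields a Tag again
print(isinstance(first_p.b.string, NavigableString))  # text is a NavigableString
print(soup.div)                    # no div in the markup
```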
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h