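1. Introduction to the BeautifulSoup4 module:
- What it is: a third-party Python library.
- Purpose: once the page source has been obtained, extract data from the HTML or XML document.
- Install command: pip install beautifulsoup4
- Installation note: besides the command above, the package can also be installed through PyCharm's graphical package installer.
- Calling the BeautifulSoup constructor on the page source parses the document and returns a BeautifulSoup object (essentially a tree structure). This parsing step requires a parser, such as Python's built-in 'html.parser'.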
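As a rough illustration of the "tree structure" mentioned above, the sketch below parses a tiny document and walks from parent to child and back. The HTML string is a made-up example, not part of the original notes, and the built-in 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document, used only for illustration.
html = "<html><body><p class='title'><b>Hello</b></p></body></html>"

# 'html.parser' ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.p.name)          # p      (first <p> tag in the tree)
print(soup.p.b.string)      # Hello  (text inside the nested <b>)
print(soup.b.parent.name)   # p      (navigating back up the tree)
```

Attribute-style access such as soup.p returns only the first matching tag; selecting several tags at once is what soup.select() is for.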
2. Example code:
html_str = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
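from bs4 import BeautifulSoup

# Parse the page source with Python's built-in HTML parser.
soup = BeautifulSoup(html_str, 'html.parser')

# Tag selector: every <p> element in the document.
p_list = soup.select('p')
print(p_list)
# Child combinator (>): only <p> elements that are direct children of <body>.
p_list2 = soup.select('html > body > p')
print(p_list2)
# Descendant combinator (space): <p> elements at any depth inside <html>.
p_list3 = soup.select('html p')
print(p_list3)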
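
In the sample above, all three selectors return the same three <p> elements, because every <p> happens to be a direct child of <body>. The following sketch uses a different, made-up document (not from the original notes) in which one <p> is nested inside a <div>, so the difference between the child combinator and the descendant combinator becomes visible.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one <p> directly under <body>, one nested in a <div>.
html = """
<html><body>
<p>direct child</p>
<div><p>nested in div</p></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# '>' matches direct children only, so the nested <p> is skipped.
children = soup.select("html > body > p")
# A space matches descendants at any depth, so both <p> tags are found.
descendants = soup.select("html p")

print([p.get_text() for p in children])     # ['direct child']
print([p.get_text() for p in descendants])  # ['direct child', 'nested in div']
```

select() always returns a list of tags, even for a single match, so get_text() is called on each element rather than on the result itself.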