BeautifulSoup
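1. Introduction to BeautifulSoup
Install the dependencies: pip install bs4 (the official package name on PyPI is beautifulsoup4) and pip install requests
Beautiful Soup is a Python library for parsing HTML and XML documents. Its main job is extracting data from web pages, which makes it especially useful for web scraping and data mining. It lets you walk the document tree, search for specific tags or text, and it offers convenient methods for working with the results.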
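As a rough illustration of the traversal and search features just described, the sketch below parses a small inline HTML string and pulls out a heading and all link targets. The markup is made up for this example and does not come from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html_doc = """
<html><body>
<h1>Courses</h1>
<a href="/basic">Basic Python</a>
<a href="/advanced">Advanced Python</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access (soup.h1) returns the first matching tag.
heading = soup.h1.string

# find_all() returns every matching tag in document order.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```

Attribute-style access is handy for the first match, while find_all() is the usual choice when every occurrence is needed.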
2. Basics
1.1. Ways to load a page
① From a web page
from bs4 import BeautifulSoup
import requests
html = requests.get('https://python123.io/ws/demo.html')
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
② From a string
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>this is a p tag</p>', 'html.parser')
print(soup.prettify())
<p>
this is a p tag
</p>
③ From a file
from bs4 import BeautifulSoup
# Assumes test.html is saved as UTF-8 (it declares charset="utf-8").
with open('test.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.prettify())
<!-- test.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>
Title
</title>
</head>
<body>
<p class="title">
<b>
this is b
</b>
</p>
</body>
</html>
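The three approaches above differ only in where the markup comes from: the BeautifulSoup constructor accepts a string, bytes, or an open file handle. Here is a small offline sketch, assuming a temporary file (created only for this demonstration) stands in for test.html:

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = (
    "<html><head><title>Title</title></head>"
    "<body><p class='title'><b>this is b</b></p></body></html>"
)

# 1) From a string.
soup_from_str = BeautifulSoup(markup, "html.parser")

# 2) From bytes (for example, the raw body of an HTTP response).
soup_from_bytes = BeautifulSoup(markup.encode("utf-8"), "html.parser")

# 3) From an open file handle; the temp file stands in for test.html.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, "r", encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

titles = [s.title.string for s in (soup_from_str, soup_from_bytes, soup_from_file)]
print(titles)
```

All three objects hold the same tree, so the rest of the code does not need to know how the markup was loaded.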
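1.2. The Tag object
Way to get text nodes | Description |
---|---|
string | If the tag's only child is a text node, .string returns that node. If the only child is another tag that itself has a .string, that value is returned. If the tag has more than one child, .string is None |
strings | Yields every text node inside the tag, including those in descendant tags; returns a generator |
stripped_strings | Like strings, but each string has leading and trailing whitespace removed, and whitespace-only strings are skipped |
get_text() | Joins the text of the tag and all its descendants into one string and returns it; accepts an optional separator placed between the text pieces, and strip=True to trim each piece |
Tip: the text inside a tag is itself a node. It sits at the same level as the tag's direct child tags, and both are children of the tag.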
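A minimal sketch of the .string behaviour and of the tip above, using markup invented for this example: .contents lists a tag's direct children, which makes it visible that text nodes sit beside child tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>plain text</p><div>before<span>inner</span>after</div>",
    "html.parser",
)

# <p> holds exactly one text node, so .string returns it.
p_string = soup.p.string

# <div> has several children (text, <span>, text), so .string is None.
div_string = soup.div.string

# .contents lists the direct children: text nodes sit beside child tags.
child_types = [type(child).__name__ for child in soup.div.contents]

print(p_string, div_string, child_types)
```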
from bs4 import BeautifulSoup
html_content = "<p>string test</p><div>strings and stripped_strings test<span>strings</span>stripped_strings<span></span></div>"
soup = BeautifulSoup(html_content, "html.parser")
p_tag = soup.p
div_tag = soup.div
print("Type:", type(p_tag), '\n')
print("string:", p_tag.string, '\n')
print("strings:")
for string in div_tag.strings:
    print(string)
print("\nstripped_strings:")
for string in div_tag.stripped_strings:
    print(string)
Type: <class 'bs4.element.Tag'>
string: string test
strings:
strings and stripped_strings test
strings
stripped_strings
stripped_strings:
strings and stripped_strings test
strings
stripped_strings
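In the example above both loops print the same lines because the markup contains no extra whitespace. A sketch with indented markup (invented for illustration) shows where they differ: .strings yields every text node verbatim, while .stripped_strings trims each one and skips those that contain only whitespace.

```python
from bs4 import BeautifulSoup

# Indented markup, so the text nodes carry newlines and spaces.
html_content = """<div>
    first line
    <span>  inner  </span>
</div>"""
soup = BeautifulSoup(html_content, "html.parser")

# Every text node exactly as it appears in the markup.
raw = list(soup.div.strings)

# Trimmed text nodes, with whitespace-only nodes dropped.
clean = list(soup.div.stripped_strings)

print(raw)
print(clean)
```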
from bs4 import BeautifulSoup
html_content = "<p>string test</p><div>strings and stripped_strings test<span>strings</span>stripped_strings<span></span></div>"
soup = BeautifulSoup(html_content, "html.parser")
print(soup.get_text())
print(soup.get_text('|', strip=True))
string teststrings and stripped_strings teststringsstripped_strings
string test|strings and stripped_strings test|strings|stripped_strings
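Tag object property or related type | Description |
---|---|
name | The tag's name; type: str |
attrs | The tag's attributes; type: dict |
NavigableString | The text part of a tag (what .string returns for ordinary text); type: NavigableString |
Comment | The comment part of a tag (what .string returns when the tag's only child is a comment); type: Comment |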
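The entries in the table can also be tried without a network request. The sketch below uses markup invented for this example. Two details are worth noting: class is a multi-valued attribute, so Beautiful Soup returns it as a list, and Comment is a subclass of NavigableString, so a comment has to be tested for explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Markup invented for this example.
soup = BeautifulSoup(
    '<a class="py1 intro" href="/basic" id="link1">Basic Python</a>'
    "<b><!--this is comment--></b>",
    "html.parser",
)

tag = soup.a
print(tag.name)          # tag name as a str
print(tag.attrs)         # all attributes as a dict
print(tag["class"])      # multi-valued attribute, returned as a list
print(tag.get("title"))  # .get() returns None for a missing attribute

text = soup.a.string
comment = soup.b.string

# Both values are NavigableString instances, but only one is a Comment.
both_navigable = all(isinstance(node, NavigableString) for node in (text, comment))
is_comment = [isinstance(node, Comment) for node in (text, comment)]
print(both_navigable, is_comment)
```

Checking isinstance(node, Comment) before treating .string as visible text is one way to keep comments out of extracted content.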
from bs4 import BeautifulSoup
import requests
page = requests.get("https://python123.io/ws/demo.html")
soup = BeautifulSoup(page.text, "html.parser")
soup1 = BeautifulSoup("<b><!--this is comment--></b><p>this is not comment</p>", "html.parser")
print(f"Tag object{type(soup.a)}:{soup.a}")
print(f"Tag name{type(soup.a.name)}:{soup.a.name}")
print(f"Tag attributes{type(soup.a.attrs)}:{soup.a.attrs}")
print(f"Accessing one attribute of the tag:{soup.a.attrs['class']}")
print(f"Tag text{type(soup.a.string)}:{soup.a.string}")
print(f"Comment{type(soup1.b.string)}:{soup1.b.string}")
Tag object<class 'bs4.element.Tag