Beautiful Soup is a library for parsing, traversing, and maintaining the HTML "tag tree".
Installing the library on Ubuntu
sudo apt-get install python3-bs4
HTML tag structure
<p>..</p> # Tag: tags come in pairs (an opening and a closing tag).
<p class='title'>...</p>
↑ Attributes describe the characteristics of a tag
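A minimal sketch of how the attributes of such a tag can be read once it is parsed. The one-line document, the extra id attribute, and the variable names are made up for illustration; they are not part of the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document mirroring the <p class='title'> example.
soup = BeautifulSoup("<p class='title' id='intro'>...</p>", "html.parser")
tag = soup.p

all_attrs = tag.attrs        # dict holding every attribute of the tag
css_classes = tag["class"]   # 'class' is multi-valued, so it comes back as a list
tag_id = tag.get("id")       # single-valued attributes come back as plain strings
missing = tag.get("href")    # .get returns None when the attribute is absent

print(all_attrs, css_classes, tag_id, missing)
```

Indexing with tag["href"] would raise KeyError here, which is why .get is the safer choice for optional attributes.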
The Beautiful Soup library
The BeautifulSoup class represents the entire content of one HTML document.
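As a small illustration, the sketch below builds a BeautifulSoup object from an inline string instead of a downloaded page. The HTML string is an assumption made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical document; any HTML string can be used in its place.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

doc_name = soup.name            # the object itself stands for the whole document
title_text = soup.title.string  # tags are reachable as attributes of the object

print(doc_name, title_text)
```

The special name '[document]' is what marks the BeautifulSoup object as the root of the tag tree.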
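Basic usage
import requests
from bs4 import BeautifulSoup
# Download the demo page and parse it with Python's built-in HTML parser.
url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# Without parentheses the method is not called, so this prints the bound method object.
print(soup.prettify)
Output: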
<bound method Tag.prettify of <html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>>
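Pretty-printed output
Calling prettify() adds line breaks to the HTML so that every tag and string sits on its own line.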
print(soup.prettify())
Output:
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
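The difference between the two print calls above can be reproduced offline. This is a minimal sketch on a made-up fragment (not the demo page); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, much smaller than the demo page.
soup = BeautifulSoup('<p class="title"><b>Hi</b></p>', "html.parser")

method_obj = soup.prettify   # no parentheses: only a reference to the method
pretty = soup.prettify()     # parentheses: the formatted string

print(callable(method_obj))
print(pretty)
```

Each nesting level is indented by one space, so the fragment comes out as five lines: the opening p tag, the b tag, the text, and the two closing tags.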
Basic elements of the class
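The notes give no details under this heading. The sketch below assumes the elements meant are the tag itself, its name, its string content, and comments; the HTML fragment and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment containing a tag, a string, and a comment.
html = '<p class="title"><b>Bold text</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # Tag: the element itself
name = tag.name            # Name: the tag's name as a str
text = tag.b.string        # NavigableString: the text inside <b>
comment = tag.contents[1]  # Comment: a special kind of NavigableString

print(type(tag).__name__, name, text, repr(str(comment)))
```

Because Comment is a subclass of NavigableString, code that must tell real text from comments usually checks isinstance(node, Comment) first.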
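Traversing the tag tree
Traversal is an important way to extract information from an HTML page.
- Three traversal directions: downward, upward, and sideways (between siblings)
- Code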
import requests
from bs4 import BeautifulSoup
# 1. requests: download the page
url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
# 2. bs4: parse the HTML
soup = BeautifulSoup(demo, 'html.parser')
# Sideways traversal: following and preceding siblings
# for sibling in soup.a.next_siblings:
# for sibling in soup.a.previous_siblings:
# Upward traversal: walk from the first <a> tag up to the document root
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
p
body
html
[document]
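The loop above covers the upward direction. For the downward and sideways directions, a minimal offline sketch could look like the following; the inline HTML is a made-up, shortened stand-in for the demo page, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened stand-in for the demo page.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='course'>Intro "
    "<a id='link1'>Basic Python</a> and "
    "<a id='link2'>Advanced Python</a>.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Downward: .children yields direct children, .descendants yields every node below.
body_children = [child.name for child in soup.body.children]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

# Sideways: siblings share one parent, and text nodes count as siblings too.
first_link = soup.a
next_nodes = [str(s) for s in first_link.next_siblings]
prev_nodes = [str(s) for s in first_link.previous_siblings]

print(body_children, descendant_tags)
print(next_nodes, prev_nodes)
```

Sibling iteration returns strings as well as tags, so a loop that only wants tags needs an isinstance(node, Tag) check like the one used for the descendants.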