Learning the BeautifulSoup Library
1. Installing BeautifulSoup
The BeautifulSoup library is generally used to parse web pages.
Installation:
Open a command prompt (cmd) and run:
pip install beautifulsoup4
Testing the installation
Open PyCharm (the Python IDE I use; any other editor or IDE works just as well) and enter:
from bs4 import BeautifulSoup
If the import raises no error, the installation works.
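For a slightly more informative check than a bare import, the sketch below also prints the installed version and parses a tiny snippet. The sample markup `<b>ok</b>` is made up for illustration, and the version printed depends on your environment.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed package version (the value depends on your environment).
print(bs4.__version__)

# Parse a tiny, made-up snippet to confirm that the built-in parser works.
soup = BeautifulSoup("<b>ok</b>", "html.parser")
print(soup.b.string)
```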
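2. Getting the page source
Demo HTML page: http://python123.io/ws/demo.html
[Screenshot of the page]
Right-click the page and choose "View Page Source" to see the page's source code.
[Screenshot of the page source (HTML5 code)]
There are two ways to get the page source:
- Open the page, choose "View Page Source", and copy the source by hand
- Fetch it with the Requests library
The following code fetches the page source with the Requests library: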
import requests
url = 'https://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
print(demo)
Running it prints the source (without indentation):
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
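The fetch above assumes the request succeeds and that the text is decoded correctly. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original tutorial; `timeout`, `raise_for_status()` and `apparent_encoding` are standard Requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url, raising on network or HTTP errors.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    r = requests.get(url, timeout=timeout)  # give up instead of waiting forever
    r.raise_for_status()                    # raise requests.HTTPError on 4xx/5xx
    r.encoding = r.apparent_encoding        # use the encoding guessed from the body
    return r.text


# Usage (requires network access):
# demo = fetch_html('https://python123.io/ws/demo.html')
# print(demo)
```

Overriding `r.encoding` with the guessed encoding is mainly useful for pages whose headers do not declare a character set; for the demo page it should make no visible difference.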
3. Basic use of BeautifulSoup
Using BeautifulSoup takes only two lines of code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
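To see what those two lines produce, the short sketch below reads the parsed tag back from the `soup` object. The variable name `first_p` is chosen here for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')

first_p = soup.p        # the first <p> tag in the parsed document
print(first_p.name)     # the tag's name
print(first_p.string)   # the text inside the tag
```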
Example:
Reusing the demo source from above, parse it with BeautifulSoup and print it in formatted form.
from bs4 import BeautifulSoup
import requests
url = 'https://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser') # parse demo as HTML
print(soup.prettify()) # pretty-print the HTML source
Output (now indented):
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
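Pretty-printing is only one use of the parsed tree. As a minimal sketch of what else the `soup` object offers, the example below pulls the title and the two links out of the same demo page. The source is pasted in as a string so that it runs without network access, and the variable names `title` and `links` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Demo page source copied from section 2, so no network access is needed.
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, 'html.parser')

# Text of the <title> tag.
title = soup.title.string

# (link text, href attribute) for every <a> tag, in document order.
links = [(a.get_text(), a['href']) for a in soup.find_all('a')]

print(title)
for text, href in links:
    print(text, href)
```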