BeautifulSoup
Beautiful Soup is a third-party Python library for extracting data from XML and HTML documents. Official site: https://www.crummy.com/software/BeautifulSoup/
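Before the full example below, here is a minimal sketch of the basic workflow: parse an HTML string with a chosen parser, then access tags as attributes of the resulting soup object. The HTML snippet here is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, just to show the parse-then-query pattern
html = '<html><body><p class="msg">Hello, <b>world</b>!</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p["class"])        # attribute access: ['msg'] (class is multi-valued)
print(soup.p.get_text())      # all text inside the tag: Hello, world!
print(soup.find("b").string)  # text of the first <b> tag: world
```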
Example
Website:
Page source code:
# coding:utf-8
# Import the BeautifulSoup and requests libraries
from bs4 import BeautifulSoup
import requests

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text  # the body of the server's response
soup = BeautifulSoup(demo, "html.parser")
"""
demo is the HTML content to be parsed;
"html.parser" names the parser to use
"""
print(soup)             # print the parsed HTML object
print(soup.prettify())  # prettify() pretty-prints the tree with indentation
Output:
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
# The same content after pretty-printing:
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
Process finished with exit code 0
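Beyond printing the whole tree, the parsed soup can be navigated directly. A short sketch, with the demo page's HTML inlined (trimmed from the output above) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# The demo page's HTML, inlined so this sketch needs no network access
demo = ('<html><head><title>This is a python demo page</title></head>'
        '<body><p class="course">You can learn Python from the following courses: '
        '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
        '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
        '</p></body></html>')
soup = BeautifulSoup(demo, "html.parser")

print(soup.title.string)      # text of the <title> tag
for a in soup.find_all("a"):  # iterate over every <a> tag in the tree
    print(a["href"], a.get_text())
```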
Hands-on exercise
Task description:
Use BeautifulSoup to extract the replies from a DXY (丁香园) forum thread.
Thread URL: http://www.dxy.cn/bbs/thread/626626#626626 .
Inspecting the page shows that each reply's text sits inside a td tag with class="postbody".
Code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dxy.cn/bbs/thread/626626#626626'
r = requests.get(url)
crawl = r.text  # the server's response body
soup = BeautifulSoup(crawl, "html.parser")
# To get only the text inside a tag, call the get_text() method:
# it collects all the text in the tag, including its descendants,
# and returns the result as a single Unicode string.
userid = soup.find("div", class_="auth").get_text(strip=True)
print(userid)
comment = soup.find("td", class_="postbody").get_text(strip=True)
print(comment)
Result:
Only a single reply is captured, because the code never loops over the posts; find() returns only the first match. The revised version iterates over every post block.
Improved code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dxy.cn/bbs/thread/626626#626626'
r = requests.get(url)
crawl = r.text  # the server's response body
html = BeautifulSoup(crawl, "html.parser")
# To get only the text inside a tag, call the get_text() method:
# it collects all the text in the tag, including its descendants,
# and returns the result as a single Unicode string.
datas = []  # holds the (username, reply) pairs we collect
for data in html.find_all("tbody"):
    try:
        userid = data.find("div", class_="auth").get_text(strip=True)
        print(userid)
        comment = data.find("td", class_="postbody").get_text(strip=True)
        print(comment)
        datas.append((userid, comment))
    except AttributeError:  # skip tbody blocks without an auth/postbody pair
        pass
print(datas)
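An alternative way to express the same pairing logic is CSS selectors via select(), which avoids the try/except entirely. The sketch below runs against a tiny made-up fragment that mimics the forum markup described above (div.auth for the poster, td.postbody for the reply); on the live page you would parse r.text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the forum's structure, for illustration only
html = """
<tbody><div class="auth">user_a</div><td class="postbody">first reply</td></tbody>
<tbody><div class="auth">user_b</div><td class="postbody">second reply</td></tbody>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns all matches in document order
users = [d.get_text(strip=True) for d in soup.select("div.auth")]
replies = [t.get_text(strip=True) for t in soup.select("td.postbody")]
datas = list(zip(users, replies))  # pair up user i with reply i
print(datas)
```

One caveat: zip() silently assumes the two lists line up one-to-one, so the per-tbody find() approach above is safer when some posts lack either element.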
Result: