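2.1 Learning Beautiful Soup
- Learn Beautiful Soup and use it to extract content from HTML.
- Use Beautiful Soup to extract the replies from a DXY (丁香园) forum thread.
Extracting replies from the DXY forum with Beautiful Soup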
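1. Introduction to Beautiful Soup
Official Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.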
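To make the four object kinds concrete, here is a minimal sketch. The markup in `sample` is made up for illustration, and the built-in `html.parser` is used so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document, invented for this example.
sample = "<p class='intro'>Hello<!-- a note --></p>"
soup = BeautifulSoup(sample, "html.parser")

p = soup.p            # a Tag: the <p> element
text = p.contents[0]  # a NavigableString: the text inside the tag
note = p.contents[1]  # a Comment: a special kind of NavigableString

for obj in (soup, p, text, note):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that collects text nodes sometimes needs an explicit `isinstance(node, Comment)` check to skip comments.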
2. Installing Beautiful Soup
I use the Anaconda Python distribution, which already includes bs4, so no separate installation is needed.
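Outside Anaconda, the package is usually installed with `pip install beautifulsoup4 lxml`. Either way, a quick check like the sketch below confirms that bs4 can be imported and whether the lxml parser is available. The fallback to `html.parser` is an assumption of this example, not something the original setup requires.

```python
import importlib.util

import bs4

# Print the installed Beautiful Soup version.
print("bs4 version:", bs4.__version__)

# lxml is an optional third-party parser; html.parser ships with Python.
has_lxml = importlib.util.find_spec("lxml") is not None
parser = "lxml" if has_lxml else "html.parser"
print("parser to use:", parser)
```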
3. Creating a Beautiful Soup object
Import the BeautifulSoup class:
from bs4 import BeautifulSoup
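Create a BeautifulSoup object. The first argument must be markup text or an open file handle, not a file name:
html = BeautifulSoup(open('example.html', encoding='utf-8'), 'lxml')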
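The constructor accepts either a string of markup or an open file handle. The sketch below shows both, using a temporary file and a made-up `markup` string so that it runs anywhere. It uses `html.parser` instead of lxml only to avoid the extra dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
markup = "<html><body><h1>Title</h1></body></html>"

# 1) Parse directly from a string.
soup_from_str = BeautifulSoup(markup, "html.parser")

# 2) Parse from an open file handle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.h1.get_text())
print(soup_from_file.h1.get_text())
```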
4. Get every tbody that contains a username and a comment
html.find_all("tbody")
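As a small illustration of what `find_all` returns, the sketch below runs it on a made-up table with two `tbody` elements. The markup is an assumption for the example and is not the real forum page.

```python
from bs4 import BeautifulSoup

# Hypothetical table with two tbody blocks.
sample = """
<table>
  <tbody><tr><td>first</td></tr></tbody>
  <tbody><tr><td>second</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

# find_all returns a list-like result of every matching Tag, in document order.
blocks = soup.find_all("tbody")
print(len(blocks))
print([b.get_text(strip=True) for b in blocks])

# limit caps the number of matches that are collected.
first_only = soup.find_all("tbody", limit=1)
print(len(first_only))
```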
5. Get the username and the comment from each tbody (named data here)
userid = data.find("div", class_="auth").get_text(strip=True)
content = data.find("td", class_="postbody").get_text(strip=True)
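`find` returns `None` when nothing matches, so calling `get_text` on the result fails for any `tbody` that lacks one of the two elements. The sketch below shows one way to skip such blocks explicitly. The sample markup only imitates the `auth` and `postbody` class names used above and is not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one complete post and one tbody without author or body.
sample = """
<table>
  <tbody><tr>
    <td><div class="auth"> alice </div></td>
    <td class="postbody"> Hello there </td>
  </tr></tbody>
  <tbody><tr><td>no author here</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

pairs = []
for data in soup.find_all("tbody"):
    auth = data.find("div", class_="auth")
    body = data.find("td", class_="postbody")
    if auth is None or body is None:
        continue  # not a post block, skip it
    # strip=True removes the surrounding whitespace from the text.
    pairs.append((auth.get_text(strip=True), body.get_text(strip=True)))

print(pairs)
```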
Source code:
import urllib.request
from bs4 import BeautifulSoup as bs


def main():
    # Send a browser-like User-Agent with the request.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
    }
    url = 'http://www.dxy.cn/bbs/thread/626626'
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request).read().decode("utf-8")
    html = bs(response, 'lxml')
    getItem(html)


def getItem(html):
    datas = []  # holds the (username, comment) pairs
    for data in html.find_all("tbody"):
        try:
            userid = data.find("div", class_="auth").get_text(strip=True)
            print(userid)
            content = data.find("td", class_="postbody").get_text(strip=True)
            print(content)
            datas.append((userid, content))
        except AttributeError:
            # find() returned None: this tbody is not a post, so skip it
            pass
    print(datas)


if __name__ == '__main__':
    main()
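The script above assumes the request always succeeds and that the page is UTF-8. A slightly more defensive fetch step might look like the sketch below. The function name `fetch_html`, the shortened User-Agent, and the 10-second timeout are choices made for this example, not part of the original script.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    headers = {"User-Agent": "Mozilla/5.0"}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"request failed: {exc}")
        return None
```

The returned text can be passed to `BeautifulSoup` in the same way as `response` in `main()`, after checking that it is not `None`.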
Output:
References:
A very detailed tutorial on using the Beautiful Soup library in Python, jb51.net (脚本之家): https://www.jb51.net/article/65287.htm
A first look at Python crawlers: Beautiful Soup in practice, wwq114's blog on CSDN: https://blog.csdn.net/wwq114/article/details/88085875
Beautiful Soup 4.2.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html