BeautifulSoup
Beautiful Soup is a third-party Python library for extracting data from XML and HTML documents. Official site: https://www.crummy.com/software/BeautifulSoup/
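Before the full example below, here is a minimal sketch of the basic workflow: parse an HTML string with a chosen parser, then access tags as attributes of the resulting soup object. The HTML snippet here is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, just to show the parse-then-query pattern
html = '<html><body><p class="msg">Hello, <b>world</b>!</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p["class"])        # attribute access: ['msg'] (class is multi-valued)
print(soup.p.get_text())      # all text inside the tag: Hello, world!
print(soup.find("b").string)  # text of the first <b> tag: world
```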
Example
Website:
Page source code:
# coding:utf-8
# Import the BeautifulSoup and requests libraries
from bs4 import BeautifulSoup
import requests

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text  # the body of the server's response
soup = BeautifulSoup(demo, "html.parser")
"""
demo is the HTML content to be parsed;
"html.parser" names the parser to use
"""
print(soup)             # print the parsed HTML object
print(soup.prettify())  # prettify() pretty-prints the tree with indentation
Output:
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
# The same content after pretty-printing:
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
Process finished with exit code 0
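Beyond printing the whole tree, the parsed soup can be navigated directly. A short sketch, with the demo page's HTML inlined (trimmed from the output above) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# The demo page's HTML, inlined so this sketch needs no network access
demo = ('<html><head><title>This is a python demo page</title></head>'
        '<body><p class="course">You can learn Python from the following courses: '
        '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
        '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
        '</p></body></html>')
soup = BeautifulSoup(demo, "html.parser")

print(soup.title.string)      # text of the <title> tag
for a in soup.find_all("a"):  # iterate over every <a> tag in the tree
    print(a["href"], a.get_text())
```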
Hands-on exercise
Task description:
Use BeautifulSoup to extract the replies from a DXY (丁香园) forum thread.
Thread URL: http://www.dxy.cn/bbs/thread/626626#626626 .
Inspecting the page shows that each reply's text sits inside a td tag with class="postbody".
Code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dxy.cn/bbs/thread/626626#626626'
r = requests.get(url)
crawl = r.text  # the server's response body
soup = BeautifulSoup(crawl, "html.parser")
# To get only the text inside a tag, call the get_text() method:
# it collects all the text in the tag, including its descendants,
# and returns the result as a single Unicode string.
userid = soup.find("div", class_="auth").get_text(strip=True)
print(userid)
comment = soup.find("td", class_="postbody").get_text(strip=True)
print(comment)
Result:
Only a single reply is captured, because the code never loops over the posts; find() returns only the first match. The revised version iterates over every post block.
Improved code:
from bs4 import BeautifulSoup
import requests

url = 'http://www.dxy.cn/bbs/thread/626626#626626'
r = requests.get(url)
crawl = r.text  # the server's response body
html = BeautifulSoup(crawl, "html.parser")
# To get only the text inside a tag, call the get_text() method:
# it collects all the text in the tag, including its descendants,
# and returns the result as a single Unicode string.
datas = []  # holds the (username, reply) pairs we collect
for data in html.find_all("tbody"):
    try:
        userid = data.find("div", class_="auth").get_text(strip=True)
        print(userid)
        comment = data.find("td", class_="postbody").get_text(strip=True)
        print(comment)
        datas.append((userid, comment))
    except AttributeError:  # skip tbody blocks without an auth/postbody pair
        pass
print(datas)
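An alternative way to express the same pairing logic is CSS selectors via select(), which avoids the try/except entirely. The sketch below runs against a tiny made-up fragment that mimics the forum markup described above (div.auth for the poster, td.postbody for the reply); on the live page you would parse r.text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the forum's structure, for illustration only
html = """
<tbody><div class="auth">user_a</div><td class="postbody">first reply</td></tbody>
<tbody><div class="auth">user_b</div><td class="postbody">second reply</td></tbody>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns all matches in document order
users = [d.get_text(strip=True) for d in soup.select("div.auth")]
replies = [t.get_text(strip=True) for t in soup.select("td.postbody")]
datas = list(zip(users, replies))  # pair up user i with reply i
print(datas)
```

One caveat: zip() silently assumes the two lists line up one-to-one, so the per-tbody find() approach above is safer when some posts lack either element.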
Result: