Scraping Notices from a School's Official Website

ScrapingBoy

Category: Web scraping

Article link: https://blog.csdn.net/y_h_k_666/article/details/107431376

Part 1: Scraping all notices on the page

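Motivation: after every major exam, students are anxious to see their scores and keep checking the notices posted on the school's official website. This note focuses on that case: fetching the notice list with a script instead of refreshing the page by hand. The implementation follows.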

# Implemented with urllib and the BeautifulSoup library
import urllib.request
import urllib.parse

from bs4 import BeautifulSoup

# Step 1: fetch the page's HTML source (response.read() returns it as bytes)
request = urllib.request.Request('http://sjxy.whpu.edu.cn/index/tzgg.htm')
# Step 2: the site has anti-scraping measures, so a request header is needed
request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # mimic a person visiting the page in a browser
response = urllib.request.urlopen(request)
html = response.read()

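# Step 3: parse the HTML page with a BeautifulSoup object, using Python's built-in 'html.parser'
bs = BeautifulSoup(html, 'html.parser')
# Step 4: locate the table tag that holds the notices, using find_all() with a class filter
tables = bs.find_all('table', class_='in_list2')  # equivalent: bs.find_all('table', {'class': 'in_list2'})
# Step 5: find the tr row tags under the table; find_all() returns a list-like result
tab = tables[0].find_all('tr')
# print(tab)
# Step 6: iterate over the tr rows to get every notice on this page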
print('--------------------------------')
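for tr in tab:
    for td in tr.find_all('td'):
        # The second <p> in the cell holds the date
        print('Notice date:', td.find_all('p')[1].get_text())
        # The first <p> holds the title; [2:] drops its first two characters
        print('Notice title:', td.p.get_text()[2:])
        # The href is relative; [2:] drops its first two characters before joining
        print('Link:', 'http://sjxy.whpu.edu.cn/' + td.a['href'][2:])
        print('-----------separator----------------')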
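The script above builds each link by slicing the first two characters off the href and prepending the site root, which only works while every href has the same relative form. A possible alternative, sketched here under assumptions, is to resolve the href against the listing page's URL with urllib.parse.urljoin (the script already imports urllib.parse). The helper name resolve_link and the sample href values are hypothetical and are not taken from the real page.

```python
from urllib.parse import urljoin

# URL of the notice listing page used in the script above.
PAGE_URL = 'http://sjxy.whpu.edu.cn/index/tzgg.htm'


def resolve_link(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into an absolute URL.

    Hypothetical helper: urljoin handles '../', plain relative paths,
    root-relative paths, and already-absolute URLs.
    """
    return urljoin(page_url, href)


if __name__ == '__main__':
    # Made-up href values, for illustration only.
    samples = [
        '../info/1001/2002.htm',     # relative, one directory up
        '/info/1001/2002.htm',       # root-relative
        'http://example.com/a.htm',  # already absolute
    ]
    for href in samples:
        print(resolve_link(href))
```

In the loop above, the link line could then read print('Link:', resolve_link(td.a['href'])), assuming the hrefs on the page are ordinary relative or absolute URLs.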