A Tiny Python Crawler (Part 1): Scraping Notices from a University's Official Website (For Practice Only)

薛定猫

Categories: Crawlers, Python. Tags: python, html, crawler, request, bs4

Article link: https://blog.csdn.net/weixin_44456692/article/details/115449857

I wrote this out of boredom, just for fun; it is for practice only.

Contents

Analysis
Code
Result

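Analysis

Find the div that contains the text you want. Note that this should be the div one level above the text itself (the text usually appears as <span>text</span>), so the span holding the text sits directly under that div. Once you have located the tag that contains the text, call .get_text() on it to extract the text.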
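As a minimal sketch of the idea above, the snippet below parses a small hand-written HTML fragment instead of a live page. The class name `notice-title`, the helper `extract_titles`, and the sample notices are made up for illustration; only `BeautifulSoup`, `select`, and `get_text` are actual bs4 calls. It uses the built-in `html.parser` backend so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment: each notice title is a <span> directly inside a <div>.
SAMPLE_HTML = """
<div class="notice-title"><span>Notice A</span></div>
<div class="notice-title"><span>Notice B</span></div>
<div class="other"><span>Not a notice</span></div>
"""


def extract_titles(html):
    """Return the text of every div carrying the (assumed) class 'notice-title'."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the divs one level above the spans that hold the text.
    divs = soup.select("div.notice-title")
    # get_text() gathers all text inside each div; strip=True trims surrounding whitespace.
    return [div.get_text(strip=True) for div in divs]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Selecting the parent div rather than the span keeps the selector tied to the class that identifies the notice list, which is usually more stable than matching every span on the page.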

Code

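import requests
from bs4 import BeautifulSoup


# Fetch the HTML document
def get_html(url):
    """Fetch the page at url and return its HTML as text."""
    response = requests.get(url, timeout=10)  # timeout so a dead server does not hang the script
    response.encoding = 'utf-8'  # if Chinese text comes out garbled, try 'gb2312'
    return response.text


# Extract the content
def get_certain_web(html):
    """Extract the notice titles from the HTML, one per line."""
    global soup  # kept global so it can be inspected while debugging
    soup = BeautifulSoup(html, 'lxml')  # parse the HTML with the lxml parser into a soup object
    # Filter for the content you need; here, the divs that hold the notice titles
    web_content_temp = soup.select('div.index-tab-notice-right-list-title')
    web_content = ''
    for i in web_content_temp:
        web_content += i.get_text() + '\n'
    # Alternative (find returns a single tag, so no indexing is needed):
    # web_content = soup.find('div', {'class': 'zzj_5b_2d'}).get_text()
    return web_content


url_web = "http://www.zzu.edu.cn/"
html = get_html(url_web)
web_content = get_certain_web(html)
print(web_content)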
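
The comment in get_html suggests trying gb2312 when Chinese text comes out garbled. One way to automate that guess, sketched here on raw bytes rather than a live response, is to try a list of candidate encodings in order. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration and are not part of the original script; with requests, the raw bytes would come from `response.content`.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Decode bytes with the first candidate encoding that raises no error."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


utf8_bytes = "中文".encode("utf-8")
gb_bytes = "中文".encode("gb2312")
print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(gb_bytes))
```

gb18030 is used as the fallback because it is a superset of gb2312. This is only a heuristic: some gb2312 byte sequences also happen to be valid UTF-8, so if the page declares its charset, trusting that declaration is the safer choice.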