Scraping the Guangdong University of Technology Website and Sending the Content to a QQ Mailbox

Target website: http://old.gdut.edu.cn/

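The scraper works in roughly the following steps:

1. Fetch the homepage
2. Extract the link of each news item
3. Extract the news titles of each section
4. Send the result to a QQ mailbox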

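1. First, fetch the homepage with the Requests library:
import requests

def get_html(url):
    """Fetch url and return its HTML text, or None if the request fails."""
    print("Fetching page...")
    headers = {
        # Cookie copied from a browser session; replace it with your own if it has expired.
        'Cookie': "UM_distinctid=17101abc69635b-0e556116b0f673-f313f6d-144000-17101abc6973c8; JSESSIONID=3178C10CD6DE2F5EA6033F90566F562C; wzws_cid=7a15963ee9210949b0d09b2f2889a0907ed8418df0e1e8b8122cd34a54d6be425da4ae3433c5ca7b3146755fc4cfcc31069f2f47f9468431388ba3ddfcac6c9f875fc30f80771a437b1ce7a07185b1d9",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
    }
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()  # treat any non-2xx status as a failure
        r.encoding = r.apparent_encoding
        print("Page fetched successfully!")
        return r.text
    except requests.RequestException as e:
        # Return here: r may not exist if the request itself failed.
        print("Failed to fetch page: %s" % e)
        return None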

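2. Extract the link of each news item

Here, XPath Helper and Chrome's developer tools are used to inspect the page and work out the XPath of each news list:

XPath Helper installation guide: https://blog.csdn.net/weixin_45961774/article/details/104534166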

from lxml import etree

def get_url():
    """Return five lists of news links, one list per homepage section."""
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
    }
    url = "http://old.gdut.edu.cn/"
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)

    # link1..link5 follow the same section order as the title lists in parse_html.
    link1 = html.xpath(
        "/html/body/div[@class='box-c3']/div[@class='news-s']/div[@class='news-zv']/ul/li/a/@href")
    link2 = html.xpath(
        "/html/body/div[@class='box-c']/div[@class='news-s']/div[@class='news-zv']/ul/li/a/@href")
    link3 = html.xpath(
        "/html/body/div[@class='box-c3']/div[@class='news-g']/div[@class='news-zv2']/ul/li/a/@href")
    link4 = html.xpath(
        "/html/body/div[@class='box-c']/div[@class='news-g']/div[@class='news-zv2']/ul/li/a/@href")
    link5 = html.xpath(
        "/html/body/div[@class='box-c']/div[@class='news-x']/div[@class='news-zv3']/ul/li/a/@href")

    return link1, link2, link3, link4, link5
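
The XPath expressions above return whatever is written in each href attribute. If some of those values turn out to be relative paths (an assumption; the post does not say either way), they can be resolved against the homepage URL with the standard library. A minimal sketch, where the helper name absolutize and the sample paths are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "http://old.gdut.edu.cn/"  # homepage used throughout this post


def absolutize(links, base=BASE_URL):
    """Resolve each link against base; absolute links pass through unchanged."""
    return [urljoin(base, link) for link in links]


# Hypothetical sample values, not real paths from the site.
sample = ["info/1001/2345.htm", "http://example.com/news.htm"]
print(absolutize(sample))
```

Each list returned by get_url() could be passed through such a helper before the links are appended to the titles.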

3. Extract the news titles of each section
def parse_html(html):
    print("Parsing page...")

    html = etree.HTML(html)

    gdut_news = html.xpath(
        "/html/body/div[@class='box-c3']/div[@class='news-s']/div[@class='news-zv']/ul/li/a/@title")
    gdut_media = html.xpath("/html//div[6]/div[1]/div[2]/ul/li/a/@title")
    bwcx_fdsz = html.xpath("/html//div[5]/div[2]/div[2]/ul/li/a/@title")
    Academic_Notice = html.xpath(
        "/html/body/div[6]/div[2]/div[2]/ul/li/a/@title")
    stu_work = html.xpath("/html/body/div[6]/div[3]/div[2]/ul/li/a/@title")

    print("Page parsed successfully!")

    # Fetch the links once; calling get_url() inside the loop would
    # download the homepage again on every access.
    links = get_url()
    sections = [gdut_news, gdut_media, bwcx_fdsz, Academic_Notice, stu_work]

    for titles, urls in zip(sections, links):
        # Append the link to the first five titles of each section;
        # zip() stops early if a section has fewer than five entries.
        for i, link in zip(range(min(5, len(titles))), urls):
            titles[i] = titles[i] + " Details: " + link

    # One title per line, sections in the order listed above.
    all_news = '\n'.join('\n'.join(titles) for titles in sections)

    return all_news
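
The titles and links above come from separate XPath queries and are then matched by position. As an alternative sketch (not the author's method), both attributes can be read from the same a element so they cannot drift apart. The HTML snippet and the helper name extract_items are made up for illustration; only the class name news-zv is taken from the XPath expressions in this post.

```python
from lxml import etree

# Made-up snippet that mimics one news block; not copied from the real site.
SAMPLE_HTML = """
<html><body>
  <div class="news-zv"><ul>
    <li><a href="/info/1.htm" title="First headline">First...</a></li>
    <li><a href="/info/2.htm" title="Second headline">Second...</a></li>
  </ul></div>
</body></html>
"""


def extract_items(html_text, block_class="news-zv", limit=5):
    """Return up to `limit` (title, href) pairs from one news block."""
    tree = etree.HTML(html_text)
    anchors = tree.xpath("//div[@class='%s']/ul/li/a" % block_class)
    # Read both attributes from the same element so they stay paired.
    return [(a.get("title"), a.get("href")) for a in anchors[:limit]]


for title, href in extract_items(SAMPLE_HTML):
    print(title + " Details: " + href)
```
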
4. Send to a QQ mailbox
import smtplib
from email.mime.text import MIMEText
from email.header import Header

def sent_email(mail_body):
    sender = 'sender address'
    receiver = 'recipient address'
    smtpServer = 'smtp.qq.com'  # SMTP server (QQ Mail's in this case)
    username = 'username'
    password = 'SMTP authorization code'
    mail_title = '[Guangdong University of Technology website notice]'

    # Build a UTF-8 plain-text message.
    message = MIMEText(mail_body, 'plain', 'utf-8')
    message["Accept-Language"] = "zh-CN"
    message["Accept-Charset"] = "ISO-8859-1,utf-8"
    message['From'] = sender
    message['To'] = receiver
    message['Subject'] = Header(mail_title, 'utf-8')

    try:
        smtp = smtplib.SMTP()
        smtp.connect(smtpServer)
        smtp.login(username, password)
        smtp.sendmail(sender, receiver, message.as_string())
        print('Email sent successfully!')
        smtp.quit()
    except smtplib.SMTPException as e:
        print("Failed to send email: %s" % e)

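PS: sender is the sending address, receiver is the receiving address, and username is the mailbox account. Note that password is not the mailbox login password but the SMTP authorization code.

How to obtain the SMTP authorization code: https://blog.csdn.net/weixin_45961774/article/details/105040536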
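
Since the login relies on an authorization code, it can help to keep the connection encrypted and the credentials out of the source file. The sketch below is one way to do that and is not the original script: it assumes QQ Mail accepts SMTP over SSL on port 465, and the environment variable names QQ_MAIL_USER and QQ_MAIL_AUTH_CODE, as well as both function names, are hypothetical.

```python
import os
import smtplib
from email.header import Header
from email.mime.text import MIMEText


def build_message(sender, receiver, subject, body):
    """Build a UTF-8 plain-text message with From, To and Subject set."""
    message = MIMEText(body, "plain", "utf-8")
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = Header(subject, "utf-8")
    return message


def send_via_ssl(message, host="smtp.qq.com", port=465):
    """Send `message` over SMTP-over-SSL (port 465 is an assumption)."""
    username = os.environ["QQ_MAIL_USER"]        # hypothetical variable name
    auth_code = os.environ["QQ_MAIL_AUTH_CODE"]  # hypothetical variable name
    # The with-block closes the connection even if login or sending fails.
    with smtplib.SMTP_SSL(host, port, timeout=10) as smtp:
        smtp.login(username, auth_code)
        smtp.sendmail(message["From"], [message["To"]], message.as_string())


# Building the message needs no network access; sending is a separate step.
msg = build_message("me@example.com", "you@example.com", "Test", "hello")
print(msg["To"])
```

Separating message construction from sending also makes the first half easy to check without touching the mail server.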


Main function:

if __name__ == '__main__':
    url = 'http://old.gdut.edu.cn/'
    html = get_html(url)
    if html:  # skip parsing and sending if the page could not be fetched
        sent_email(mail_body=parse_html(html))

Finally, run the program and check the mailbox.

The email arrives successfully!


Full source code: https://github.com/Giyn/PythonScraper/blob/master/GDUT/old_official_website.py
