Scrapy in Practice: Scraping News from the College of Mechanical and Electronic Engineering, Northwest A&F University

Latest recommended article published on 2023-03-02 17:02:00

real_Luyouqi

Tags: Python, web scraping, Scrapy

Article link: https://blog.csdn.net/real_Luyouqi/article/details/88958498

Requirements

Python
Scrapy framework
pywin32

Run the following command in cmd:
pip3 install scrapy

1. Create a Scrapy project

Open a terminal in the directory where the crawler project should live and run the following command:

scrapy startproject University

A new "University" folder now appears in that directory. This is the project Scrapy created; opening it shows the crawler skeleton that Scrapy generated for us:

University/
├── scrapy.cfg
└── University/
    ├── __pycache__/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── __pycache__/

2. Switch cmd to the University project root and write items.py

Open items.py to see the initial code Scrapy generated:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class UniversityItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

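This code is the initial definition of what will be scraped. We replace the placeholder with fields that match the content we want.

The example here is the college notices page of the College of Mechanical and Electronic Engineering, Northwest A&F University: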

import scrapy

class UniversityItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    newstitle = scrapy.Field()  # news title
    data = scrapy.Field()  # publication date

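3. Write the spider file

From the University project root, generate the spider in a terminal:

scrapy genspider UniversityNews "nwsuaf.edu.cn"

In this command, UniversityNews is the name of the spider and also the name of the new .py file created in the spiders directory. The domain that follows is the scope the spider is allowed to crawl; it is usually the root domain of the site you want to scrape. Open UniversityNews.py to see its initial code: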

# -*- coding: utf-8 -*-
import scrapy


class UniversitynewsSpider(scrapy.Spider):
    name = 'UniversityNews'
    allowed_domains = ['nwsuaf.edu.cn']
    start_urls = ['http://nwsuaf.edu.cn/']

    def parse(self, response):
        pass

Next we write the spider itself:

# -*- coding: utf-8 -*-
import scrapy
from University.items import UniversityItem

class UniversitynewsSpider(scrapy.Spider):
    name = 'UniversityNews'
    allowed_domains = ['nwsuaf.edu.cn']
    # start_urls = ['http://nwsuaf.edu.cn/']
    url = "https://cmee.nwsuaf.edu.cn/xwzx/xytz/index"
    offset = 1  # page index; numbering starts at 0 and pages up to 127 still have data, start from 1 for now
    start_urls = [url + str(offset) + ".htm"]  # first URL to request

    def parse(self, response):
        # Use `scrapy shell http://cmee.nwsuaf.edu.cn/xwzx/xytz/index` to verify the XPath
        for each in response.xpath("/html/body/div[3]/div[3]/div[1]/ul/li"):
            item = UniversityItem()  # create a new item object
            item['newstitle'] = each.xpath("./a/text()").extract()[0]
            item['data'] = each.xpath("./span/text()").extract()[0]
            yield item  # parse is a generator; each item is handed on as it is produced

        # After finishing one page, increment self.offset by 1, build the next
        # URL, and request it with self.parse as the callback.
        if self.offset < 127:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset) + ".htm", callback=self.parse)
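
To see which listing pages the spider ends up requesting, the URL pattern can be reproduced without Scrapy. The following is only a minimal sketch: `page_url`, `pages_to_crawl`, and `MAX_OFFSET` are names invented here for illustration, and the upper bound 127 is taken from the comment in the spider.

```python
BASE_URL = "https://cmee.nwsuaf.edu.cn/xwzx/xytz/index"
MAX_OFFSET = 127  # highest page index the spider requests (hypothetical constant name)


def page_url(offset):
    """Build a listing-page URL the same way the spider does."""
    return BASE_URL + str(offset) + ".htm"


def pages_to_crawl(start=1, stop=MAX_OFFSET):
    """Return every URL from index `start` through index `stop`, inclusive."""
    return [page_url(n) for n in range(start, stop + 1)]


urls = pages_to_crawl()
print(urls[0])
print(urls[-1])
print(len(urls))
```

Changing `start` here corresponds to changing the initial value of `offset` in the spider.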

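The hardest part of this step is working out the XPath. If it really does not make sense, searching on Google helps; below is a short description of the method I use.

How to determine the xpath()

First, open the college's website in Google Chrome, right-click a news item you want to scrape, and choose Inspect. The browser opens the developer tools and highlights the matching node in the "Elements" panel. Right-click the highlighted node and choose Copy –> Copy XPath.
After that, I recommend verifying in a terminal that the copied XPath is correct.

In the terminal, enter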

scrapy shell https://cmee.nwsuaf.edu.cn/xwzx/xytz/index.htm

The terminal then receives the page data. If *response <200 https://cmee.nwsuaf…>* appears in the last few lines, the page was fetched successfully.
Now enter the XPath you just copied:

response.xpath("/html/body/div[3]/div[3]/div[1]/ul/li").extract()

The output should contain the data we need. If it does not, adjust the XPath and try again until it does.
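
The spider first selects every `li` node and then applies the relative paths `./a` and `./span` to each one. The sketch below imitates that two-level lookup with Python's standard-library `xml.etree.ElementTree` instead of Scrapy selectors, so it runs without a network connection. The markup in `SAMPLE` and the function name `extract_rows` are made up for illustration; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating one news list; not copied from the real site.
SAMPLE = """
<ul>
  <li><a href="/info/1.htm">Notice A</a><span>2019-04-01</span></li>
  <li><a href="/info/2.htm">Notice B</a><span>2019-03-28</span></li>
</ul>
"""


def extract_rows(markup):
    """Return (title, date) pairs, one per <li>, using relative paths."""
    root = ET.fromstring(markup.strip())
    rows = []
    for li in root.findall("./li"):
        title = li.find("./a").text    # same idea as ./a/text() in the spider
        date = li.find("./span").text  # same idea as ./span/text()
        rows.append((title, date))
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

Keeping the per-item paths relative means only the outer path to the list has to change if the page layout shifts.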

4. Write pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

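class UniversityPipeline(object):
    def __init__(self):
        # Open the output file once, in binary mode, when the pipeline is created
        self.file = open("UniversityNew.json", "wb")

    def process_item(self, item, spider):
        # Write each item as one line of JSON; ensure_ascii=False keeps Chinese text readable
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.file.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()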
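
Because the pipeline writes each item as a JSON object followed by a comma and a newline, the resulting file is not a single valid JSON document and cannot be loaded with one `json.loads` call. One possible way to read it back is to parse it line by line, as in the sketch below. The helper names `write_item` and `read_items` and the sample items are assumptions made for this example, and an in-memory buffer stands in for UniversityNew.json.

```python
import io
import json


def write_item(stream, item):
    """Write one item in the pipeline's format: JSON text, a comma, a newline."""
    text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
    stream.write(text.encode("utf-8"))


def read_items(raw_bytes):
    """Parse the pipeline's output by dropping each line's trailing comma."""
    items = []
    for line in raw_bytes.decode("utf-8").splitlines():
        line = line.strip().rstrip(",")
        if line:
            items.append(json.loads(line))
    return items


buffer = io.BytesIO()  # stands in for UniversityNew.json
write_item(buffer, {"newstitle": "示例通知", "data": "2019-04-01"})
write_item(buffer, {"newstitle": "Sample notice", "data": "2019-03-28"})

items = read_items(buffer.getvalue())
print(items)
```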

5. Write settings.py

Add the following code at the end of settings.py:

DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'University.pipelines.UniversityPipeline': 300,
}
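
The number assigned to each entry in `ITEM_PIPELINES` is its priority: Scrapy passes items through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0 to 1000 range. The sketch below only illustrates that ordering with plain Python. The second pipeline path is hypothetical and does not exist in this project, and `run_order` is a helper invented for the example.

```python
# Hypothetical settings: HypotheticalCleanupPipeline is not part of this project.
ITEM_PIPELINES = {
    "University.pipelines.UniversityPipeline": 300,
    "University.pipelines.HypotheticalCleanupPipeline": 100,
}


def run_order(pipelines):
    """List pipeline paths in ascending order of their priority number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
print(order)
```

With a single pipeline, as in this article, the exact number does not matter.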

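Finally, open a terminal in the inner "University" folder (the one inside the project's "University" folder) and run the following command to start the spider: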

scrapy crawl UniversityNews

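Once the program finishes, the JSON file produced by the spider will be there : )

Note: this example does not start crawling from the first page of news. To start from the first page, change the range of the offset parameter in UniversityNews.py.

This document was written on the basis of a blog post by YangPython. If it infringes on your rights, please contact the author to have it removed. Email: real_luyouqi@163.com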