Scrapy, a Web Crawling Framework for Python

阿龙的代码在报错

Last modified: 2022-09-02 15:23:58

Category: python. Tags: python, web crawler, scrapy, pycharm, programming language

First published: 2022-09-02 15:19:57

Article link: https://blog.csdn.net/yujinlong2002/article/details/126663205

Note: the code in this article was written in PyCharm.

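Contents

Preface
Installing Scrapy
Creating a Scrapy project
Scraping wallpaper image links
- Changing the settings
Writing items.py
Writing the spider
Writing pipelines.py
Running the spider
Notes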

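Preface

I have wanted to learn a crawling framework for a long time. I recently came across a good article on Scrapy, so these are my notes on it.

Note: the body of the article follows; the examples below are for reference.

Installing Scrapy

1. Installing with Anaconda
If your Python was installed through Anaconda, you can use this method (it is the one I used).
Enter the following command in cmd: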

conda install Scrapy

2. Installing on Windows
Installing on Windows is more involved. The following dependencies need to be installed first:

lxml
pyOpenSSL
Twisted
PyWin32

Once these libraries are installed, install Scrapy with the following command:

pip install Scrapy
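
To confirm that the dependencies and Scrapy itself ended up in the active environment, a small check along the following lines can help. It is only a sketch: the mapping from package names to import names (for example pyOpenSSL is imported as OpenSSL, and PyWin32 is probed through its win32api module) and the helper name missing_packages are choices made here for illustration.

```python
from importlib.util import find_spec

# Import names for the packages listed above (they differ from the pip names).
REQUIRED_MODULES = {
    "lxml": "lxml",
    "pyOpenSSL": "OpenSSL",
    "Twisted": "twisted",
    "PyWin32": "win32api",
    "Scrapy": "scrapy",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Running the script in the same environment that will run the crawler lists whatever still needs to be installed.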

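Creating a Scrapy project

Open cmd and change to the directory where the project should be created.
Then enter the following command:

scrapy startproject <project_name>

If creating the project fails with "ImportError: DLL load failed: 找不到指定的模块。" (in English: "The specified module could not be found."), see the article on fixing this ImportError when creating a Scrapy project.

In cmd, change into the newly created project directory (the project in this article is named firstpro):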

cd firstpro

Generate the spider with the following command:

scrapy genspider scenery pic.netbian.com

If no error is reported, the spider has been created.

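Scraping wallpaper image links

Changing the settings

Open settings.py and make the following changes. The line numbers refer to the file generated by the author's Scrapy version and may differ in other versions.

Line 20: change the robots.txt setting (ROBOTSTXT_OBEY).
Line 28: change the download delay (DOWNLOAD_DELAY). It is commented out by default, and uncommenting it gives 3 seconds, which is too long, so set it to 1 second.
Line 40: add a request header (DEFAULT_REQUEST_HEADERS).
Line 66: enable an item pipeline (ITEM_PIPELINES).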
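
The article does not show the edited values, so the fragment below is only a sketch of what the four changes might look like. Turning ROBOTSTXT_OBEY off, the placeholder User-Agent string, and the priority 300 are assumptions; the pipeline path assumes the project is named firstpro, as in the commands above.

```python
# settings.py (fragment): a sketch of the four changes; the values are assumptions.

# Do not consult robots.txt before crawling (assumed; the generated file sets True).
ROBOTSTXT_OBEY = False

# Wait 1 second between requests instead of the 3 seconds in the generated file.
DOWNLOAD_DELAY = 1

# Extra request headers; the User-Agent is a placeholder, use your own browser's.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Enable the project's pipeline; 300 is its priority (lower numbers run first).
ITEM_PIPELINES = {
    "firstpro.pipelines.FirstproPipeline": 300,
}
```

Before disabling ROBOTSTXT_OBEY, check the target site's terms, and keep the download delay at a level that does not burden the server.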

Writing items.py

Open items.py and enter the following code:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstproItem(scrapy.Item):
    # Fields collected for each wallpaper
    name = scrapy.Field()  # image name, taken from the alt attribute
    link = scrapy.Field()  # image link, taken from the src attribute

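Writing the spider

Open scenery.py, the file that genspider created in the spiders directory, and enter the following code:
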
import scrapy
from ..items import FirstproItem


class ScenerySpider(scrapy.Spider):
    name = 'scenery'
    allowed_domains = ['pic.netbian.com']
    start_urls = ['https://pic.netbian.com/4kfengjing/']  # start URL
    page = 1  # number of the list page currently being parsed

    def parse(self, response):
        for li in response.css('.clearfix li'):
            # Create a fresh item per image so yielded items do not share state
            item = FirstproItem()
            item['name'] = li.css('a img::attr(alt)').get()  # image name
            item['link'] = li.css('a img::attr(src)').get()  # image link
            yield item

        if self.page < 10:  # crawl 10 pages in total
            self.page += 1
            url = f'https://pic.netbian.com/4kfengjing/index_{self.page}.html'  # build the next page URL
            yield scrapy.Request(url=url, callback=self.parse)  # parse the next page with the same callback
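
The spider treats page 1 as the bare category URL and later pages as index_N.html. The standalone sketch below reproduces that URL scheme outside Scrapy so it can be checked on its own. It also shows how a relative src value could be turned into a full URL. Whether the site really returns relative paths is an assumption, and the helper names and the /uploads/example.jpg path are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://pic.netbian.com/4kfengjing/"


def list_page_url(page: int) -> str:
    """Return the URL of a list page; page 1 is the bare category URL."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return BASE_URL
    return urljoin(BASE_URL, f"index_{page}.html")


def absolute_link(page_url: str, src: str) -> str:
    """Resolve a possibly relative img src against the page it came from."""
    return urljoin(page_url, src)


if __name__ == "__main__":
    for page in range(1, 4):
        print(list_page_url(page))
    # "/uploads/example.jpg" is a made-up path used only for illustration
    print(absolute_link(list_page_url(2), "/uploads/example.jpg"))
```

Inside a spider callback, response.urljoin(src) performs the same resolution against the URL of the current response.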

Writing pipelines.py

Open pipelines.py and enter the following code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class FirstproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
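
The pipeline above only prints each item. To keep the results, a pipeline could write them to a file instead. The following is a minimal sketch, not part of the original article: the class name JsonLinesPipeline and the output file scenery.jsonl are hypothetical, while open_spider, close_spider and process_item are the hooks Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="scenery.jsonl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict() works for scrapy.Item objects as well as plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline
```

To use it, the class would have to be added to ITEM_PIPELINES in settings.py, for example as firstpro.pipelines.JsonLinesPipeline if it lives in this project's pipelines.py.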

Running the spider

Enter the following command in cmd:

scrapy crawl scenery

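Alternatively, create a run.py file in PyCharm in the project root (the directory that contains scrapy.cfg) with the following code: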

from scrapy import cmdline

cmdline.execute('scrapy crawl scenery'.split())  # remember to change the spider name to your own

Notes

The code comes from the original author's blog.
