Scrapy Crawler Notes for the Budejie (百思不得姐) Website

EEEchoy

Published 2020-02-08 23:36:59


Category: Crawlers

Article link: https://blog.csdn.net/weixin_40448659/article/details/104229784



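settings.py configuration

Robots protocol (robots.txt)

ROBOTSTXT_OBEY = False
# Whether to obey robots.txt. The settings.py generated by `scrapy startproject` sets this to True; change it to False to stop obeying it.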
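
To make concrete what this setting switches on or off, here is a minimal sketch using Python's standard `urllib.robotparser` rather than Scrapy itself. The robots.txt content, the `example.com` URLs, and the helper name `is_allowed` are all made up for illustration; they are not taken from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="*"):
    """Return True if the hypothetical robots.txt permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A crawler that obeys robots.txt would skip URLs for which this is False.
    print(is_allowed("http://example.com/text/1"))
    print(is_allowed("http://example.com/private/page"))
```

With ROBOTSTXT_OBEY = True, Scrapy performs a comparable check before each request and drops the disallowed ones; with False, the check is skipped.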

Download delay

DOWNLOAD_DELAY = 1
# Wait about 1 second between requests to the same site, so the crawl does not overload the target server
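
The wait is only "about" one second because Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is on by default, in which case the actual wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. A rough sketch of that rule in plain Python follows; `next_delay` is a hypothetical helper for illustration, not a Scrapy function.

```python
import random

DOWNLOAD_DELAY = 1  # seconds, same value as in settings.py


def next_delay(base=DOWNLOAD_DELAY, randomize=True):
    """Return how long to wait before the next request, in seconds.

    Mirrors the documented rule: with randomization on, the wait is
    drawn uniformly from [0.5 * base, 1.5 * base]; otherwise it is `base`.
    """
    if randomize:
        return random.uniform(0.5 * base, 1.5 * base)
    return base


if __name__ == "__main__":
    print([round(next_delay(), 2) for _ in range(5)])
```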

Request headers

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}
Set a browser-like User-Agent so the requests look like they come from an ordinary browser 😀
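
These headers are defaults: Scrapy only fills in a header that a request has not already set itself. The sketch below imitates that "set if missing" behavior with plain dictionaries. It is a simplification (real Scrapy headers are case-insensitive, this sketch is not), the User-Agent string is shortened, and `effective_headers` is a hypothetical helper name.

```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # shortened for the example
}


def effective_headers(request_headers, defaults=DEFAULT_REQUEST_HEADERS):
    """Merge per-request headers with the defaults; the request's own values win."""
    merged = dict(request_headers)
    for name, value in defaults.items():
        merged.setdefault(name, value)  # only fills headers that are missing
    return merged


if __name__ == "__main__":
    # A request that sets its own Accept-Language keeps it; the rest come from the defaults.
    print(effective_headers({'Accept-Language': 'zh-CN'}))
```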

ITEM_PIPELINES

ITEM_PIPELINES = {
   'budejie.pipelines.BudejiePipeline': 300,
   # The number sets the order: pipelines with lower values run first (values are conventionally 0-1000)
}
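
As a small illustration of the ordering rule, the sketch below sorts a pipeline dictionary by its numbers to show the order in which items would pass through. `CleanTextPipeline` is a hypothetical second pipeline added only for this example, and `run_order` is a helper written here, not part of Scrapy.

```python
ITEM_PIPELINES = {
    'budejie.pipelines.BudejiePipeline': 300,
    'budejie.pipelines.CleanTextPipeline': 100,  # hypothetical, for illustration only
}


def run_order(pipelines):
    """Return pipeline paths in the order items pass through them (lowest number first)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(path)
```

Here the hypothetical cleaning step (100) would see each item before the saving step (300).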

pipelines.py

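Purpose: save the scraped data

Three methods are used most often:

open_spider(self, spider): called once when the spider is opened
process_item(self, item, spider): called for every item the spider passes to the pipeline
close_spider(self, spider): called once when the spider is closed

To activate a pipeline, register it in ITEM_PIPELINES in settings.py.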
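
The call order of the three methods can be sketched without Scrapy. In the example below, `CollectingPipeline` and the driver function `run_pipeline` are hypothetical stand-ins: in a real project Scrapy itself makes these calls, and the items come from the spider.

```python
class CollectingPipeline:
    """A toy pipeline that records which hooks were called, and in what order."""

    def open_spider(self, spider):
        # Set up resources once, before any item arrives.
        self.items = []
        self.events = ["open"]

    def process_item(self, item, spider):
        # Handle one item and return it so a later pipeline could receive it.
        self.events.append("item")
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Release resources once, after the last item.
        self.events.append("close")


def run_pipeline(pipeline, items, spider=None):
    """Stand-in for Scrapy's engine: open, feed every item, then close."""
    pipeline.open_spider(spider)
    results = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return results


if __name__ == "__main__":
    p = CollectingPipeline()
    run_pipeline(p, [{"content": "a"}, {"content": "b"}])
    print(p.events)
```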

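JsonItemExporter and JsonLinesItemExporter

These two classes make saving items as JSON simpler.

JsonItemExporter: writes all items into a single JSON array, which only becomes valid JSON once finish_exporting() has been called and which a reader has to parse in one go. It is suited to small amounts of data.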

from scrapy.exporters import JsonItemExporter
# Suitable when the amount of data is small

class BudejiePipeline(object):
    def __init__(self):
        # Open in binary mode: the exporter writes encoded bytes
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("Spider started")

    def process_item(self, item, spider):
        # Write the item, then return it so any later pipeline still receives it
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Close the JSON array before closing the file
        self.exporter.finish_exporting()
        self.fp.close()
        print("Spider finished")
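
The other class named in the heading, JsonLinesItemExporter, writes one JSON object per line instead of one big array, which is why it copes better with larger amounts of data: a reader can process the file line by line. The sketch below shows the two output shapes using only the standard `json` module. The sample items and the helper names are made up for illustration, and the output is an approximation of the format, not the exporters' exact bytes.

```python
import json

# Made-up sample items, standing in for scraped jokes.
items = [
    {"author": "user_a", "content": "first joke"},
    {"author": "user_b", "content": "second joke"},
]


def to_json_array(items):
    """One JSON array holding every item (the JsonItemExporter style)."""
    return json.dumps(items, ensure_ascii=False)


def to_json_lines(items):
    """One JSON object per line (the JsonLinesItemExporter style)."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def read_json_lines(text):
    """Parse JSON Lines one line at a time, never holding the whole file as one value."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    print(to_json_array(items))
    print(to_json_lines(items), end="")
```

In the pipeline itself, switching formats should only require importing JsonLinesItemExporter from scrapy.exporters and using it in place of JsonItemExporter; a file name such as duanzi.jsonl would be a reasonable choice but is an assumption here.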

start.py

Run this file to launch the spider

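from scrapy import cmdline

# Equivalent to running `scrapy crawl budejie_scrapy` on the command line
cmdline.execute("scrapy crawl budejie_scrapy".split())

# The argument list can also be written out explicitly:
# cmdline.execute(["scrapy", "crawl", "budejie_scrapy"])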
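
The two ways of writing the command are interchangeable because `str.split()` with no arguments breaks the string on whitespace, giving exactly the list form. A quick check in plain Python, with no Scrapy needed:

```python
command = "scrapy crawl budejie_scrapy"

# Form 1: let split() build the argument list from the command string.
argv_from_split = command.split()

# Form 2: write the argument list by hand.
argv_explicit = ["scrapy", "crawl", "budejie_scrapy"]

if __name__ == "__main__":
    print(argv_from_split == argv_explicit)
```

The explicit list is the safer choice if an argument ever contains spaces, since `split()` would break it apart.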

GitHub

https://github.com/Echoyy9/budejie
