python3 [Web Scraping for Beginners, Hands-On] Crawling Five Million Records from Panduoduo with Scrapy and Storing Them in MongoDB


徐代龙


Category: python. Tags: mongodb, python, data, crawler, multithreading

Article link: https://blog.csdn.net/snake_son/article/details/75577992


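This article describes using Scrapy to crawl file information (file name, link, type, and so on) from the Panduoduo site and storing nearly 5 million records in MongoDB. Along the way it resolves a 403 error, encoding problems, and data storage issues, and it highlights Scrapy's stability for large-scale crawling.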


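Summary: although this was my second time crawling this site, I still ran into a few pitfalls. The overall result was good, though. Scrapy held up far better than my multithreaded and multiprocess versions, and the crawl was never interrupted.

This is the Scrapy version of the Panduoduo crawler. The data volume is large: nearly 5 million records so far.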

1. What was scraped


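The main fields scraped were: file name, file link, file type, file size, view count, and the time the file was indexed.

I. The items.py code in the Scrapy project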

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PanduoduoItem(scrapy.Item):
    # Fields stored for each scraped file record
    # File name
    docName = scrapy.Field()
    # File link
    docLink = scrapy.Field()
    # File category
    docType = scrapy.Field()
    # File size
    docSize = scrapy.Field()
    # Cloud drive type
    docPTpye = scrapy.Field()
    # View count
    docCount = scrapy.Field()
    # Time the file was indexed
    docTime = scrapy.Field()
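
A Scrapy Item accepts only its declared fields and can be built from keyword arguments, so it can help to assemble each record as a plain dict with exactly those keys first. The sketch below uses only the standard library. The helper `build_record` and the sample values are hypothetical and not part of the original project; the field names are copied from `PanduoduoItem` above, spelling included.

```python
# Field names copied from PanduoduoItem (original spelling kept).
ITEM_FIELDS = (
    "docName", "docLink", "docType", "docSize",
    "docPTpye", "docCount", "docTime",
)


def build_record(raw):
    """Return a dict holding exactly the PanduoduoItem keys.

    Hypothetical helper: raises KeyError if a field is missing and
    strips surrounding whitespace from every value.
    """
    missing = [name for name in ITEM_FIELDS if name not in raw]
    if missing:
        raise KeyError("missing fields: " + ", ".join(missing))
    return {name: str(raw[name]).strip() for name in ITEM_FIELDS}


# Made-up sample values, for illustration only.
sample = {
    "docName": "  example.pdf ",
    "docLink": "http://example.com/r/1",
    "docType": "document",
    "docSize": "1.2 MB",
    "docPTpye": "example drive",
    "docCount": "10",
    "docTime": "2017-07-20",
    "extra": "ignored",
}
record = build_record(sample)
```

Inside a spider, such a dict could then be turned into an item with `PanduoduoItem(**record)`.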

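Problems encountered while crawling with the spider: (1) Because no request headers were set, the Panduoduo server returned a 403 error and refused to serve the pages. The fix is to set a user-agent in settings.py: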

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

COOKIES_ENABLED = False

ROBOTSTXT_OBEY = False
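
Of these three settings, `USER_AGENT` is the one that addresses the 403: Scrapy sends its value as the `User-Agent` header on every request. `COOKIES_ENABLED = False` turns off cookie handling, and `ROBOTSTXT_OBEY = False` stops Scrapy from consulting the site's robots.txt. As a minimal sketch of what the header setting amounts to outside Scrapy, the standard-library snippet below attaches the same string to a request object. The URL is a placeholder rather than the real Panduoduo address, and nothing is sent over the network.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)

# Placeholder URL (assumption); substitute the page you want to check.
url = "http://example.com/"

# Build the request with a browser-like User-Agent header.
request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

# urllib stores header names capitalized as "User-agent".
sent_header = request.get_header("User-agent")
```

Passing `request` to `urllib.request.urlopen` would send it with that header, which is a quick way to check whether a site rejects the default client string.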