Python Web Scraping: Downloading Files and Images with Scrapy

Latest recommended article published on 2024-05-16 11:47:28

琴酒网络

Category: Python web scraping. Tags: Python web scraping, scrapy, ImagesPipeline, scrapy image download

Article link: https://blog.csdn.net/pcn01/article/details/99094369

1: Pipeline
2: Downloading images with Scrapy
3: The Files Pipeline for downloading files
4: The Images Pipeline for downloading images
5: A simple Images Pipeline example

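1: Pipeline

Scrapy provides reusable item pipelines for downloading files attached to an item (for example, when you scrape a product and also want to save its images). These pipelines share some methods and structure and are collectively called media pipelines. In most cases you will use either the Files Pipeline or the Images Pipeline.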

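Why use Scrapy's built-in download pipelines:
1: They avoid re-downloading data that was downloaded recently
2: The storage path for files is easy to configure
3: Downloaded images can be converted to a common format (JPEG) and mode (RGB)
4: Thumbnails are easy to generate
5: Image width and height can be checked so that images below a minimum size are filtered out
6: Downloads run asynchronously, so they do not block the rest of the crawl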
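
As a rough illustration of how the features listed above are switched on, the built-in Images Pipeline is configured through settings.py. The sketch below uses Scrapy's documented setting names; the storage directory name and all numeric values are placeholder assumptions, not values taken from this article.

```python
# settings.py fragment (all values are illustrative assumptions)

# Enable the built-in Images Pipeline (it needs the Pillow package).
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored.
IMAGES_STORE = "images"

# Do not re-download files fetched within the last 90 days.
IMAGES_EXPIRES = 90

# Generate two thumbnails per image, keyed by name: (width, height).
IMAGES_THUMBS = {"small": (50, 50), "big": (270, 270)}

# Drop images smaller than these dimensions (in pixels).
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```

By default this pipeline reads URLs from an item field named image_urls and writes the download results to a field named images; the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings can point it at different field names.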

2: Downloading images with Scrapy

2.1 Create the Scrapy project

(crawler) F:\WWWROOT\crawler>scrapy startproject bmw

2.2 Create the spider

(crawler) F:\WWWROOT\crawler>scrapy genspider bmw5 "car.autohome.com.cn"

2.3 Configure the settings file

ROBOTSTXT_OBEY = False  # do not apply robots.txt rules
DOWNLOAD_DELAY = 1  # base delay in seconds between requests to the same site
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
}
# Register the project's item pipeline; lower numbers run first.
ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}
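
The ITEM_PIPELINES entry above refers to a BmwPipeline class in bmw/pipelines.py, which this section does not show. Here is a minimal sketch of what such a class might look like, assuming it saves every image under images/<category>/. The target_path helper, the base_dir parameter, and the use of urllib are illustrative assumptions, not the author's code.

```python
import os
from urllib import request
from urllib.parse import urlsplit


def target_path(base_dir, category, url):
    """Return base_dir/<category>/<file name taken from the URL path>."""
    filename = os.path.basename(urlsplit(url).path)
    return os.path.join(base_dir, category, filename)


class BmwPipeline:
    """Hypothetical pipeline: stores each item's images in a per-category folder."""

    def __init__(self, base_dir="images"):
        self.base_dir = base_dir

    def process_item(self, item, spider):
        for url in item["urls"]:
            path = target_path(self.base_dir, item["category"], url)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            request.urlretrieve(url, path)  # blocking download
        return item
```

Note that urlretrieve blocks while each file downloads, which is one reason the built-in media pipelines are usually the better choice.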

2.4 Write a launcher script

from scrapy import cmdline
cmdline.execute("scrapy crawl bmw5".split())

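The launcher script replaces starting the crawl from the command line. Place it in the project root directory (the one that contains scrapy.cfg).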

2.5 Scrape the data

import scrapy

from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/587.html']

    def parse(self, response):
        # Each "uibox" div holds one image category; the first one is skipped.
        ui_boxes = response.xpath('//div[@class="uibox"]')[1:]
        for ui_box in ui_boxes:
            category = ui_box.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = ui_box.xpath('.//ul/li/a/img/@src').getall()
            # Turn relative image URLs into absolute ones.
            urls = [response.urljoin(url) for url in urls]
            yield BmwItem(category=category, urls=urls)
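
The urljoin step matters because img src attributes are often relative or protocol-relative. response.urljoin resolves them with the same rules as the standard library's urllib.parse.urljoin, using the page URL as the base unless the page declares a <base> tag, so the behaviour can be checked without running a crawl. The image paths below are made-up examples.

```python
from urllib.parse import urljoin

page_url = "https://car.autohome.com.cn/pic/series/587.html"

# Hypothetical src values of the kinds a page may contain.
srcs = [
    "//img.example.com/cars/1.jpg",        # protocol-relative
    "/static/cars/2.jpg",                  # root-relative
    "https://img.example.com/cars/3.jpg",  # already absolute
]

# Resolve every src against the page URL.
absolute = [urljoin(page_url, src) for src in srcs]
for url in absolute:
    print(url)
```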

2.6 Define the item fields
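
The spider above builds BmwItem(category=..., urls=...), so the item needs at least those two fields. Here is a minimal sketch of bmw/items.py under that assumption; the original definition is not shown here and may contain more fields.

```python
import scrapy


class BmwItem(scrapy.Item):
    category = scrapy.Field()  # title of the image category
    urls = scrapy.Field()      # absolute image URLs found in that category
```

A scrapy.Item only accepts fields that have been declared, so a misspelled field name in the spider raises a KeyError instead of passing silently.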
