Notes on scraping images from a static web page with the Scrapy framework.
Target site: the animal section of Jandan (煎蛋网).
- Configure settings.py
BOT_NAME = 'pictures'
SPIDER_MODULES = ['pictures.spiders']
NEWSPIDER_MODULE = 'pictures.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1.5
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
#ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
# IMAGES_STORE = '/path/to/valid/dir'
IMAGES_STORE = 'D:/scrapy/images'
# Custom field names; if these are not set, items.py must use the default keys (images, image_urls)
IMAGES_URLS_FIELD = 'custom_image_urls'
IMAGES_RESULT_FIELD = 'custom_images'
- Configure items.py. Note that the field names set in settings.py for images and image_urls must match the fields defined in the item.
import scrapy


class PicturesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # default field names: image_urls, images
    custom_image_urls = scrapy.Field()
    custom_images = scrapy.Field()
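To make the name-matching rule concrete, here is a minimal, standard-library-only sketch of the lookup. It is not Scrapy's actual implementation. The helpers `resolve_image_fields` and `missing_fields` are hypothetical names made up for illustration. The only facts taken from Scrapy are the two setting names and their default values (`image_urls` and `images`).

```python
# Default field names the ImagesPipeline falls back to when no override is set.
DEFAULT_URLS_FIELD = "image_urls"
DEFAULT_RESULT_FIELD = "images"


def resolve_image_fields(settings):
    """Return the (urls_field, result_field) pair the pipeline would look for.

    `settings` is a plain dict standing in for the values in settings.py.
    """
    urls_field = settings.get("IMAGES_URLS_FIELD", DEFAULT_URLS_FIELD)
    result_field = settings.get("IMAGES_RESULT_FIELD", DEFAULT_RESULT_FIELD)
    return urls_field, result_field


def missing_fields(settings, item_fields):
    """List the pipeline field names that the item does not declare."""
    return [name for name in resolve_image_fields(settings) if name not in item_fields]


# Settings and item fields as configured in this note.
custom_settings = {
    "IMAGES_URLS_FIELD": "custom_image_urls",
    "IMAGES_RESULT_FIELD": "custom_images",
}
item_fields = {"custom_image_urls", "custom_images"}

# With the overrides in place, every name the pipeline needs is declared.
print(missing_fields(custom_settings, item_fields))

# Without the overrides, the pipeline expects the default names,
# which this item does not declare.
print(missing_fields({}, item_fields))
```

The first call reports no missing names because the settings and the item agree. The second shows what goes wrong when the two override lines are dropped from settings.py while the item still uses the custom names.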
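If you do not define custom field names in items.py, you can omit the last two lines of settings.py; the defaults are as follows:
# These two lines can be omitted together with the ones in items.py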