1. Acknowledgements
First of all, special thanks to the Zhihu author 晚来天欲雪 for the knowledge shared in their article; the content below builds on that original post.
2. Environment Setup (Python 3.7)
2.1 Pull the image
Following the official documentation, pull the Splash image:
docker pull scrapinghub/splash
2.2 Start the Splash container
Map host port 8050 to container port 8050:
docker run -p 8050:8050 scrapinghub/splash
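Once the container is up, you can sanity-check it by fetching a page through Splash's render.html HTTP endpoint. A minimal smoke test (the target URL is just a placeholder):
# smoke test: ask Splash to render a page and return its HTML
from urllib.request import urlopen

resp = urlopen('http://localhost:8050/render.html?url=https://example.com&wait=1')
print(resp.status, len(resp.read()))  # expect 200 and a non-trivial byte count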
2.3 Install scrapy-splash
pip install scrapy-splash
2.4 Install Pillow (required by Scrapy's ImagesPipeline for image processing)
pip install Pillow
3. Create the project
Create a Scrapy project named spider; this matches the spider.* module paths and the SpiderItem/SpiderPipeline class names used throughout the rest of this post (the spider itself will be named netbian):
scrapy startproject spider
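This generates the standard Scrapy scaffold:
spider/
    scrapy.cfg
    spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py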
4. Configure settings.py
# Scrapy settings for spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'spider'
SPIDER_MODULES = ['spider.spiders']
NEWSPIDER_MODULE = 'spider.spiders'
FEED_EXPORT_ENCODING = 'utf-8'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'spider.middlewares.SpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'spider.middlewares.SpiderDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Register the item pipelines you want to run. Note: lower values run first (higher priority).
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,  # project-defined pipeline
    'scrapy.pipelines.images.ImagesPipeline': 1,  # built into Scrapy
}
IMAGES_STORE = 'images'
IMAGES_URLS_FIELD = 'img_url'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Splash
# Splash server address
SPLASH_URL = 'http://localhost:8050'
# Enable the Splash downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Enable SplashDeduplicateArgsMiddleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Use Splash-aware request deduplication
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# If you use Scrapy's HTTP cache with Splash, also set a Splash-aware cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
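To confirm the project picks these settings up, you can query one from the project root with Scrapy's standard CLI:
scrapy settings --get SPLASH_URL
which should print http://localhost:8050.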
5. Configure items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    img_url = scrapy.Field()
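Because settings.py maps IMAGES_URLS_FIELD to img_url, ImagesPipeline reads its download URLs from this field and expects a list of absolute URLs, even for a single image. A tiny illustration (the URL is a made-up placeholder):
item = SpiderItem()
item['img_url'] = ['https://example.com/some-image.jpg']  # must be a list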
6. Write the spider
import scrapy
from scrapy_splash import SplashRequest

lua_script = '''
function main(splash)
    splash:go(splash.args.url)  -- open the page
    splash:wait(2)              -- wait for it to load
    return splash:html()        -- return the rendered HTML
end
'''

class NetbianSpider(scrapy.Spider):
    name = 'netbian'
    allowed_domains = ['jd.com']
    start_urls = ['https://item.jd.com/34637635130.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                endpoint='execute',
                                args={'lua_source': lua_script,
                                      # slow pages can time out with 504; raise this to avoid that
                                      # (it cannot exceed Splash's --max-timeout, 90 by default)
                                      'timeout': 90,
                                      'wait': 0.5},
                                cache_args=['lua_source'],
                                callback=self.parse)

    def parse(self, response):
        price = response.xpath('//span[@class="price J-p-34637635130"]/text()').extract_first()
        print("price:", price)
7. Run the spider
scrapy crawl netbian
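Since FEED_EXPORT_ENCODING = 'utf-8' is already set in settings.py, you can optionally export any yielded items while crawling:
scrapy crawl netbian -o items.json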
8. Pitfalls encountered
WARNING: /xxx…/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
Fix:
In /xxx…/scrapy_splash/request.py, add
from scrapy.utils.python import to_unicode
then on line 41 change
url = to_native_str(url)
to
url = to_unicode(url)
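After the patch, the relevant part of /xxx…/scrapy_splash/request.py reads (surrounding code omitted):
from scrapy.utils.python import to_unicode  # added import

url = to_unicode(url)  # line 41, previously: url = to_native_str(url)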