See this guide for the packaging steps: https://zhuanlan.zhihu.com/p/41875047
Below are the problems I personally ran into:
(1) In the import section, you also need to import your own Scrapy modules, such as custom middlewares and pipelines:
import robotparser # Python 2 name; on Python 3 this is urllib.robotparser
import scrapy.spiderloader
import scrapy.statscollectors
import scrapy.logformatter
import scrapy.dupefilters
import scrapy.squeues
import scrapy.extensions.spiderstate
import scrapy.extensions.corestats
import scrapy.extensions.telnet
import scrapy.extensions.logstats
import scrapy.extensions.memusage
import scrapy.extensions.memdebug
import scrapy.extensions.feedexport
import scrapy.extensions.closespider
import scrapy.extensions.debug
import scrapy.extensions.httpcache
import scrapy.extensions.statsmailer
import scrapy.extensions.throttle
import scrapy.core.scheduler
import scrapy.core.engine
import scrapy.core.scraper
import scrapy.core.spidermw
import scrapy.core.downloader
import scrapy.downloadermiddlewares.stats
import scrapy.downloadermiddlewares.httpcache
import scrapy.downloadermiddlewares.cookies
import scrapy.downloadermiddlewares.useragent
import scrapy.downloadermiddlewares.httpproxy
import scrapy.downloadermiddlewares.ajaxcrawl
import scrapy.downloadermiddlewares.chunked
import scrapy.downloadermiddlewares.decompression
import scrapy.downloadermiddlewares.defaultheaders
import scrapy.downloadermiddlewares.downloadtimeout
import scrapy.downloadermiddlewares.httpauth
import scrapy.downloadermiddlewares.httpcompression
import scrapy.downloadermiddlewares.redirect
import scrapy.downloadermiddlewares.retry
import scrapy.downloadermiddlewares.robotstxt
import scrapy.spidermiddlewares.depth
import scrapy.spidermiddlewares.httperror
import scrapy.spidermiddlewares.offsite
import scrapy.spidermiddlewares.referer
import scrapy.spidermiddlewares.urllength
import scrapy.pipelines
import scrapy.core.downloader.handlers.http
import scrapy.core.downloader.contextfactory
import scrapy.pipelines.images # the images pipeline is used
import openpyxl # the openpyxl library is used
# custom middlewares and pipelines
import spiders.myscrapy.middlewares.agentmiddleware
import spiders.myscrapy.pipelines
import spiders.myscrapy.settings
import spiders.myscrapy.items
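Listing every module explicitly in the entry script works, but if the project is packaged with PyInstaller (as the guide linked above describes), the same modules can instead be declared as hidden imports in the .spec file. This is only a sketch: the entry-script name is hypothetical, and the module names are the ones from the list above.

```python
# build.spec -- sketch of a PyInstaller spec fragment (assumes PyInstaller;
# 'run_spider.py' is a hypothetical entry script)
a = Analysis(
    ['run_spider.py'],
    hiddenimports=[
        'scrapy.pipelines.images',
        'openpyxl',
        # custom middlewares and pipelines
        'spiders.myscrapy.middlewares.agentmiddleware',
        'spiders.myscrapy.pipelines',
        'spiders.myscrapy.settings',
        'spiders.myscrapy.items',
    ],
)
```

The equivalent on the command line is `pyinstaller --hidden-import <name>`, once per module.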
(2) When the script that launches the spider is not inside the Scrapy project directory, the pipelines and middlewares stopped working. The log showed that none of the middlewares and pipelines configured in settings were recognized; the enabled lists came back empty.
I first tried changing the paths configured in settings:
ITEM_PIPELINES = {
    'scrapys.spider1.wallpaper.pipelines.myPipeline': 300,
}
That still did not take effect; the enabled pipelines in the log remained empty.
Reading the Scrapy source showed that middlewares and pipelines that fail to import are silently dropped, so the cause had to be an import-path problem. After packaging, however, the project no longer keeps its original directory layout, so appending entries to sys.path cannot fix it. Following the error traceback, I patched the Scrapy source at scrapy/utils/misc.py as follows:
def walk_modules(path):
    """Loads a module and all its submodules from the given module path and
    returns them. If *any* module throws an exception while importing, that
    exception is thrown back.

    For example: walk_modules('scrapy.utils')
    """
    mods = []
    try:
        mod = import_module(path)
    except ImportError:  # catch only import failures, not a bare except
        mod = import_module('spiders.' + path)  # retry with our own package prefix
    mods.append(mod)
    if hasattr(mod, '__path__'):
        for _, subpath, ispkg in iter_modules(mod.__path__):
            fullpath = path + '.' + subpath
            if ispkg:
                mods += walk_modules(fullpath)
            else:
                submod = import_module(fullpath)
                mods.append(submod)
    return mods
With that, the problem was solved.