Following the tutorial, writing the code as I go.
Add the following to settings.py in the project directory:
# Pool of User-Agent strings
UAPOOL = [
    'Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0',  # Firefox
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',  # Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',  # Edge
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',  # IE (note the comma here; without it Python silently concatenates this string with the next one)
    'Mozilla/5.0 (Windows NT 6.1) AppleWebkit/536.5',
]
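To see what the pool is for before wiring up the middleware, here is a minimal sketch (the shortened stand-in strings and the helper name pick_user_agent are just for illustration): each request will draw one entry uniformly at random with random.choice.

```python
import random

# Stand-in pool with shortened strings; the real UAPOOL lives in settings.py.
UAPOOL = [
    'Mozilla/5.0 ... Firefox/61.0',
    'Mozilla/5.0 ... Chrome/68.0.3440.106',
    'Mozilla/5.0 ... Edge/17.17134',
]

def pick_user_agent(pool):
    """Pick one User-Agent string uniformly at random, as the middleware will."""
    return random.choice(pool)

ua = pick_user_agent(UAPOOL)
assert ua in UAPOOL
```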
Create a new file uamid.py in the same directory.
Key line: from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
# uamid: downloader middleware
import random
from myweb.myfirstpjt.myfirstpjt.settings import UAPOOL
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class Uamid(UserAgentMiddleware):
    def __init__(self, ua=''):
        self.ua = ua

    def process_request(self, request, spider):
        thisua = random.choice(UAPOOL)
        print("Current user_agent in use: " + thisua)
        request.headers.setdefault('User-Agent', thisua)

if __name__ == '__main__':
    UA = Uamid()
Go back to settings and register the uamid middleware.
Search for "DOWNLOADER_MIDDLEWARES", uncomment the dict,
and don't forget the trailing comma "," separating the dict entries.
Key entries:
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 2,
'myfirstpjt.uamid.Uamid': 1,
DOWNLOADER_MIDDLEWARES = {
    # 'myfirstpjt.middlewares.MyfirstpjtDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 2,
    'myfirstpjt.uamid.Uamid': 1,
}
Running it throws an error. Youdao-translate the error message:
scrapy.contrib.downloadermiddleware.useragent
is deprecated (the module is no longer supported);
use scrapy.downloadermiddlewares.useragent instead.
D:\py3\python.exe D:/py3/myweb/myfirstpjt/myfirstpjt/uamid.py
['D:\\py3\\myweb\\myfirstpjt\\myfirstpjt', 'D:\\py3\\myweb', 'D:\\py3\\python37.zip', 'D:\\py3\\DLLs', 'D:\\py3\\lib', 'D:\\py3', 'D:\\py3\\lib\\site-packages', 'D:\\py3\\lib\\site-packages\\win32', 'D:\\py3\\lib\\site-packages\\win32\\lib', 'D:\\py3\\lib\\site-packages\\Pythonwin', 'D:\\py3\\myweb\\myfirstpjt\\myfirstpjt']
D:/py3/myweb/myfirstpjt/myfirstpjt/uamid.py:4: ScrapyDeprecationWarning: Module `scrapy.contrib.downloadermiddleware.useragent` is deprecated, use `scrapy.downloadermiddlewares.useragent` instead
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
Not sure what that means, so a quick Baidu search turns up a link to the Scrapy docs.
Ctrl+F: search for UserAgentMiddleware.
UserAgentMiddleware
class scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
Middleware that allows spiders to override the default user agent.
For a spider to override the default user agent, its user_agent attribute must be set.
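The docs' point about the user_agent attribute can be illustrated without Scrapy installed. Below is a pure-Python sketch of that fallback (FakeSpider, PlainSpider, and the helper effective_user_agent are assumptions for illustration; 'Scrapy' stands in for the middleware's default): the spider's user_agent attribute wins when set, otherwise the default applies.

```python
class FakeSpider:
    # Hypothetical spider that overrides the default user agent.
    user_agent = 'MyBot/1.0'

class PlainSpider:
    # Spider with no user_agent attribute: the default wins.
    pass

def effective_user_agent(spider, default='Scrapy'):
    """Prefer the spider's user_agent attribute, falling back to the default."""
    return getattr(spider, 'user_agent', None) or default

assert effective_user_agent(FakeSpider()) == 'MyBot/1.0'
assert effective_user_agent(PlainSpider()) == 'Scrapy'
```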
Modify uamid.py, replacing:
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
with:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
and in __init__, replace ua='' with user_agent=''.
import random
from myxml.settings import UAPOOL
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class Uamid(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        thisua = random.choice(UAPOOL)
        print("Current user_agent in use: " + thisua)
        # The header name must be 'User-Agent' (with a hyphen);
        # 'user_agent' would be sent as a different, ignored header.
        request.headers.setdefault('User-Agent', thisua)

if __name__ == '__main__':
    UA = Uamid()
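The fixed process_request logic can be smoke-tested without running a crawl. A sketch using stand-ins (StubRequest and UamidSketch are assumptions; a real scrapy.Request carries a case-insensitive Headers object, not a plain dict):

```python
import random

UAPOOL = ['UA-one', 'UA-two', 'UA-three']  # stand-in pool

class StubRequest:
    """Minimal stand-in for scrapy.Request: headers as a plain dict."""
    def __init__(self):
        self.headers = {}

class UamidSketch:
    """Same process_request logic as Uamid, minus the Scrapy base class."""
    def process_request(self, request, spider):
        thisua = random.choice(UAPOOL)
        # setdefault only fills the header if nothing set it earlier.
        request.headers.setdefault('User-Agent', thisua)

req = StubRequest()
UamidSketch().process_request(req, spider=None)
assert req.headers['User-Agent'] in UAPOOL
```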
Make the same change in settings:
DOWNLOADER_MIDDLEWARES = {
# 'myxml.middlewares.MyxmlDownloaderMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware':2,
'myxml.uamid.Uamid':1,
}
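A note on the numbers: in DOWNLOADER_MIDDLEWARES, lower values sit closer to the engine, so process_request runs in increasing priority order and Uamid (1) sets the header before the built-in middleware (2) gets a chance. Since Uamid replaces the User-Agent logic entirely, an alternative sketch is to disable the built-in middleware with None (Scrapy's documented convention for switching a default middleware off) and give Uamid an ordinary priority:

```python
# settings.py (alternative sketch): disable the built-in middleware outright.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myxml.uamid.Uamid': 543,
}
```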
Go to the working folder D:\py3\myweb\myxml and type cmd in the Explorer address bar to open a prompt already at that path, no need to fiddle with cd commands in DOS.
D:\py3\myweb\myxml>scrapy crawl myxmlspider
Run it: success. scrapy crawl myxmlspider --nolog suppresses the log output; optional reading, but check the log when errors show up.
Summary: Scrapy 1.5.0 deprecated the old middleware module path, and that's what broke the run. Climbing out of these pits isn't easy; cherish every step!
Thanks to Youdao online translation for giving this English-challenged coder a way out.