Avoiding bans in a Scrapy project:
(1) Disable cookies
(2) Set a download delay
(3) Use an IP (proxy) pool
(4) Use a user-agent pool
(5) Other methods, such as distributed crawling
(1) Disable cookies
This is configured in the project's settings.py file; uncomment the line shown:

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
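Cookies can also be switched off per request instead of globally; dont_merge_cookies is a standard Request.meta key in Scrapy. A minimal sketch (the spider name and URL are placeholders):

import scrapy

class NoCookieSpider(scrapy.Spider):
    name = "nocookie"

    def start_requests(self):
        # dont_merge_cookies disables cookie handling for this request only
        yield scrapy.Request("http://example.com/",
                             meta={"dont_merge_cookies": True})

    def parse(self, response):
        pass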
(2) Set a download delay
Again in the project's settings.py file:

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

Uncomment the DOWNLOAD_DELAY line to enable the delay; its value is the number of seconds to wait between requests to the same site (3 seconds here).
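A perfectly fixed interval is itself easy to fingerprint. Scrapy ships two related settings that can soften this; a minimal settings.py sketch (both are standard Scrapy settings):

# Wait between 0.5 * and 1.5 * DOWNLOAD_DELAY instead of a fixed interval
RANDOMIZE_DOWNLOAD_DELAY = True
# Or let Scrapy adapt the delay to server responsiveness automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3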
(3) Use an IP pool
First, create a downloader middleware file in the project's core directory (the file name is up to you). From cmd:
cd myfirstspjt
echo #>middlewares.py
Then add the following to the project's settings.py file:

# IP pool settings
IPPOOL = [
    {"ipaddr": "121.33.226.167:3128"},
    {"ipaddr": "118.187.10.11:80"},
    {"ipaddr": "123.56.245.138:808"},
]

IPPOOL is the pool of proxy servers: the outer structure is a list, and each entry is a dictionary whose "ipaddr" key holds one proxy address.
With the IP pool configured, write the middlewares.py file:

# middlewares.py -- downloader middleware
import random                             # to pick a random IP from the pool
from myfirstspjt.settings import IPPOOL   # the pool defined in settings.py
# HttpProxyMiddleware moved out of scrapy.contrib in Scrapy 1.0; use the current path:
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
class IPPOOLS(HttpProxyMiddleware):
    # Initializer
    def __init__(self, ip=''):
        self.ip = ip

    # process_request() handles each outgoing request
    def process_request(self, request, spider):
        # Randomly pick one IP from the pool
        thisip = random.choice(IPPOOL)
        # Print the chosen IP so it can be observed while debugging
        print("Current IP in use: " + thisip["ipaddr"])
        # Register the chosen IP as the proxy for this request
        request.meta["proxy"] = "http://" + thisip["ipaddr"]
Then enable the middleware in settings.py (the numbers are priorities; lower values sit closer to the engine):

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'myfirstspjt.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    'myfirstspjt.middlewares.IPPOOLS': 125,
}
Run the spider from cmd:
scrapy crawl weisuen --nolog
(4) Use a user-agent pool
Define a user-agent pool in the settings.py file (the entries are up to you):

UAPOOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5",
]
Create the middleware file from cmd: echo #>uamid.py
Then edit uamid.py:
import random
from myfirstspjt.settings import UAPOOL
# UserAgentMiddleware's current import path (scrapy.contrib was removed in Scrapy 1.0+):
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class Uamid(UserAgentMiddleware):
    def __init__(self, ua=''):
        self.user_agent = ua

    def process_request(self, request, spider):
        # Randomly pick a user agent from the pool
        thisua = random.choice(UAPOOL)
        print("Current user-agent in use: " + thisua)
        # Attach it to the request headers
        request.headers.setdefault('User-Agent', thisua)
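To verify the rotation actually reaches the server, a throwaway spider can fetch a header-echo endpoint. This sketch assumes httpbin.org is reachable; the spider name is hypothetical:

import scrapy

class UACheckSpider(scrapy.Spider):
    name = "uacheck"
    start_urls = ["https://httpbin.org/headers"]  # echoes request headers back

    def parse(self, response):
        # The User-Agent shown here should match one entry from UAPOOL
        self.logger.info(response.text)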
Then update settings.py (note that the stock UserAgentMiddleware is enabled alongside Uamid, and that the earlier IP-pool entries are commented out here):

DOWNLOADER_MIDDLEWARES = {
    # 'myfirstspjt.middlewares.MyCustomDownloaderMiddleware': 543,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    # 'myfirstspjt.middlewares.IPPOOLS': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
    'myfirstspjt.uamid.Uamid': 1,
}
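If you want proxy rotation and user-agent rotation at the same time, both custom middlewares can be enabled together; a sketch (the priority values are illustrative, not prescribed by the original setup):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    'myfirstspjt.middlewares.IPPOOLS': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
    'myfirstspjt.uamid.Uamid': 1,
}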
Run from cmd: scrapy crawl weisuen --nolog
(5) Other methods, such as distributed crawling, are not covered here.