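I. Settings for Avoiding Anti-Scraping Blocks
1. Reduce the request rate
Set in settings.py:
DOWNLOAD_DELAY = 3  # wait 3 seconds between consecutive requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # use a random delay between 0.5 and 1.5 times DOWNLOAD_DELAY (Scrapy's default)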
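To make the delay range concrete, the sketch below mimics the calculation in plain Python. `randomized_delay` is a hypothetical helper written for illustration only; it is not a Scrapy function, and Scrapy applies the real delay internally. With DOWNLOAD_DELAY = 3, each wait falls between 1.5 and 4.5 seconds.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, same value as in settings.py above


def randomized_delay(base_delay: float) -> float:
    """Return a delay drawn uniformly from [0.5 * base_delay, 1.5 * base_delay].

    Hypothetical helper: it only illustrates the range that
    RANDOMIZE_DOWNLOAD_DELAY produces, it is not part of Scrapy.
    """
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    # Print a few sample delays to see the spread around 3 seconds.
    for _ in range(3):
        print(f"{randomized_delay(DOWNLOAD_DELAY):.2f} s")
```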
2. Disable cookies
Set in settings.py:
COOKIES_ENABLED = False  # do not store or send cookies, so the site cannot track the crawler through them
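3. Disguise requests as a random browser
Method 1: use the third-party fake-useragent library
- Pros: convenient and simple to call; there is no need to add many user-agent strings to settings.py by hand
- Cons: depends on the third-party library; fetching a random user-agent often fails with a timeout error, so no user-agent is obtained
(1) Install the library that supplies the browser list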
sudo pip install fake-useragent
(2) In a middleware derived from UserAgentMiddleware, pick a random browser from that list
Set in middlewares.py:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from fake_useragent import UserAgent


class KaidailiUserAgentMiddleware(UserAgentMiddleware):
    # Sets a random User-Agent on each request; inherits from UserAgentMiddleware.

    def process_request(self, request, spider):
        # Called for every outgoing request.
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
        # Print the header that was just set, for debugging.
        print(request.headers['User-Agent'])
(3) Enable the custom middleware
Set in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'pcProxy.middlewares.PcproxyDownloaderMiddleware': None,  # None disables this middleware
    'pcProxy.middlewares.KaidailiUserAgentMiddleware': 100,
}
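Because the main drawback of Method 1 is that fake-useragent can fail to return a user-agent, one possible mitigation is to fall back to a small static list. The sketch below is only an illustration under stated assumptions: `FallbackUserAgentMiddleware` and `FALLBACK_USER_AGENTS` are hypothetical names, the class is a plain Python class (Scrapy does not require downloader middlewares to inherit from a base class), and the `UserAgent` object is created once rather than on every request.

```python
import random

# Hypothetical fallback list; any real browser strings would do here.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; "
    "Acoo Browser 1.98.744; .NET CLR 3.5.30729)",
]


class FallbackUserAgentMiddleware:
    """Downloader middleware sketch: try fake-useragent, else use a static list."""

    def __init__(self):
        try:
            from fake_useragent import UserAgent

            # Build the object once instead of once per request.
            self._ua = UserAgent()
        except Exception:
            # Library not installed, or its browser data could not be loaded.
            self._ua = None

    def _pick(self):
        """Return a user-agent string, preferring fake-useragent when it works."""
        if self._ua is not None:
            try:
                return self._ua.random
            except Exception:
                pass  # fall through to the static list
        return random.choice(FALLBACK_USER_AGENTS)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self._pick()
        return None  # let Scrapy continue processing the request
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES in place of KaidailiUserAgentMiddleware, using whatever module path it is saved under.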
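Method 2: add user-agent strings by hand
- Pros: no dependence on external factors
- Cons: many user-agent strings must be added to settings.py by hand, and the list will never be complete
(1) Define the browser list
Add some user-agent strings to settings.py by hand; extend the list as needed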
# User-agent strings to choose from
MY_USER_AGENT = [
    'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)',
    'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)',
]
(2) In a middleware derived from UserAgentMiddleware, pick a random browser from that list
Set in middlewares.py:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
from pcProxy.settings import MY_USER_AGENT


class MyUserAgentMiddleware(UserAgentMiddleware):
    # Sets a random User-Agent on each request for every spider; inherits from UserAgentMiddleware.

    def process_request(self, request, spider):
        # Called for every outgoing request.
        agent = random.choice(MY_USER_AGENT)
        request