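Several ways to set custom request headers in the Scrapy framework, put together in my spare time
Preface
Scrapy crawler framework diagram
In this Scrapy project, the add_headers spider crawls the Sina home page and prints the default headers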
import scrapy


class AddHeadersSpider(scrapy.Spider):
    name = 'add_headers'
    # 'sina.com.cn' so that www.sina.com.cn counts as on-site ('sina.com' does not match it)
    allowed_domains = ['sina.com.cn']
    start_urls = ['https://www.sina.com.cn']

    def parse(self, response):
        # Print the response headers and the headers of the request that produced the response
        print("---------------------------------------------------------")
        print("response headers: %s" % response.headers)
        print("request headers: %s" % response.request.headers)
        print("---------------------------------------------------------")
Default headers
Method 1
Change the default User-Agent in settings.py
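A minimal sketch of what such a change might look like, assuming the standard `USER_AGENT` setting is the one being edited. The User-Agent string is only a placeholder borrowed from the examples later in this post.

```python
# settings.py (sketch)
# USER_AGENT is the Scrapy setting used as the default User-Agent header.
# The value below is an illustrative placeholder, not a recommendation.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)"
```

Running the spider again (typically `scrapy crawl add_headers`) should then print the new value under "request headers".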
Run add_headers.py
The default User-Agent has been changed
You can also go a step further and change the default headers in settings.py
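A hedged sketch of this variant, assuming the `DEFAULT_REQUEST_HEADERS` setting is used. The header values are placeholders taken from the examples later in this post.

```python
# settings.py (sketch)
# DEFAULT_REQUEST_HEADERS holds headers applied to requests that do not
# already define them. All values here are illustrative placeholders.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
}
```

Headers passed explicitly to a `Request` still take precedence, since these defaults are only filled in when a header is missing.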
Run add_headers.py
The default headers have been changed
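Method 2
Define headers as a class attribute in the spider, then add a start_requests method that yields a Request for each URL in start_urls, passing the headers argument and using parse as the callback
The code: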
import scrapy


class AddHeadersSpider(scrapy.Spider):
    name = 'add_headers'
    # 'sina.com.cn' so that www.sina.com.cn counts as on-site ('sina.com' does not match it)
    allowed_domains = ['sina.com.cn']
    start_urls = ['https://www.sina.com.cn']
    # Headers sent with every request created in start_requests
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        'Content-Length': '0',
        "Connection": "keep-alive"
    }

    def start_requests(self):
        # Build the initial requests ourselves so the custom headers can be attached
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print("---------------------------------------------------------")
        print("response headers: %s" % response.headers)
        print("request headers: %s" % response.request.headers)
        print("---------------------------------------------------------")
Method 3
Set the User-Agent in middlewares.py
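class AaaDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # setdefault only sets the header if the request does not already have one
        # (the User-Agent string is truncated in the source)
        request.headers.setdefault('User-Agent', 'Mozilla/5.0 ...')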
Set the complete headers in middlewares.py
from scrapy.http.headers import Headers

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    'Content-Length': '0',
    "Connection": "keep-alive"
}


class AaaDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Replace the request's headers wholesale with the dict defined above
        request.headers = Headers(headers)
Enable the middleware in settings.py
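A sketch of how the middleware might be enabled, assuming the project package is named `aaa` (as the import in Method 4 suggests) and using 543, the order value the project template normally generates. Both are assumptions rather than values confirmed by this post.

```python
# settings.py (sketch)
# Maps the middleware's import path to its order in the downloader chain.
# The package name `aaa` and the value 543 are assumptions.
DOWNLOADER_MIDDLEWARES = {
    "aaa.middlewares.AaaDownloaderMiddleware": 543,
}
```

Order can matter here. As far as I know, the built-in UserAgentMiddleware runs at 500 by default, so a custom middleware that relies on `setdefault` at a higher number may find the User-Agent already filled in. Using a lower number, or assigning the header directly, avoids that.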
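Method 4
Use a dynamic User-Agent: add a list of User-Agent strings to settings.py, then have the middleware in middlewares.py pick one at random from that list, so the User-Agent varies between requests
settings.py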
USER_AGENT_LIST=[
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
middlewares.py
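import random

from aaa.settings import USER_AGENT_LIST


class AaaSpiderMiddleware(object):
    # process_request is a downloader-middleware hook, so despite its name this
    # class has to be registered under DOWNLOADER_MIDDLEWARES to take effect
    def process_request(self, request, spider):
        # Pick a random User-Agent; setdefault leaves an existing header untouched
        request.headers.setdefault('User-Agent', random.choice(USER_AGENT_LIST))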
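A variation on the middleware above, offered as a sketch rather than the definitive way to do it. The class name `RandomUserAgentMiddleware` is hypothetical. It reads `USER_AGENT_LIST` through `crawler.settings` instead of importing `aaa.settings`, and assigns the header directly so the result does not depend on whether an earlier middleware already set a User-Agent.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list from the project settings rather than importing
        # the settings module by its package name.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Direct assignment overrides any User-Agent set earlier in the chain.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets the request continue through the remaining middlewares.
        return None
```

Like the other middlewares, it would need an entry in `DOWNLOADER_MIDDLEWARES` before it is used.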