1. Configuring a random User-Agent in a downloader middleware
A downloader middleware can implement two hook methods, process_request and process_response (a minimal skeleton of both is sketched below); for rotating User-Agents only process_request is actually needed.
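For orientation, here is a bare skeleton of the two hooks (ExampleDownloaderMiddleware is a hypothetical name; the return-value rules follow Scrapy's downloader-middleware contract):

import scrapy

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # Returning None lets the request continue through the remaining
        # middlewares to the downloader; returning a Response or Request
        # short-circuits the chain instead.
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (pass it on) or a Request (re-schedule),
        # or raise scrapy.exceptions.IgnoreRequest.
        return response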
A site that echoes the current browser's User-Agent: http://httpbin.org/user-agent
A catalog of User-Agent strings from browsers worldwide: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser
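As a quick sanity check (a sketch using the requests library, separate from the Scrapy project), you can inspect the JSON shape that httpbin returns; this is exactly what the spider below parses with json.loads():

import requests

resp = requests.get('http://httpbin.org/user-agent')
print(resp.json())  # e.g. {'user-agent': 'python-requests/2.x'}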
Modify UseragentrandomDownloaderMiddleware in middlewares.py:
import random

class UseragentrandomDownloaderMiddleware:
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random pick from the pool
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
        return None  # None lets the request continue down the chain
Enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
}
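Optionally, Scrapy's built-in UserAgentMiddleware (priority 500) can be switched off in the same dict. Strictly speaking this isn't required here: since 543 > 500, the custom process_request runs later and overwrites whatever the built-in middleware set, but disabling it makes the intent explicit:

DOWNLOADER_MIDDLEWARES = {
    'userAgentRandom.middlewares.UseragentrandomDownloaderMiddleware': 543,
    # disable Scrapy's default User-Agent handling
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}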
In the spider, keep re-issuing the same request and print the User-Agent the server saw each time:
import json

import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent back as JSON
        user_agent = json.loads(response.text)['user-agent']
        print(user_agent)
        # dont_filter=True skips the duplicate filter so the same URL
        # can be requested over and over
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Run the spider and watch the output:
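From the project root, start it with Scrapy's CLI; the argument is the spider's name attribute:

scrapy crawl httpbin

Each response should then print a User-Agent drawn at random from USER_AGENTS.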
2. Configuring an IP proxy pool in a downloader middleware
import random

class IPProxyrandomDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    PROXIES = [
        '183.166.138.181:4216',
        '183.166.162.157:4216',
        '112.194.112.175:8118',
    ]
    # Scrapy expects a full proxy URL; omitting the scheme raises an error
    PROXIES = ['https://' + p for p in PROXIES]
    print(PROXIES)  # debug: show the normalized pool at import time

    def process_request(self, request, spider):
        # Route each request through a random proxy from the pool
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
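This middleware also has to be enabled in settings.py; the dotted path below assumes a project module named ipProxyRandom, so adjust it to your own project:

DOWNLOADER_MIDDLEWARES = {
    # module path is an assumption; match your project's middlewares.py
    'ipProxyRandom.middlewares.IPProxyrandomDownloaderMiddleware': 543,
}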
In the spider, query an IP-lookup site and print the IP the server sees:
import scrapy

class IpproxySpider(scrapy.Spider):
    name = 'ipproxy'
    allowed_domains = ['ip.cn', 'httpbin.org']
    # start_urls = ['http://httpbin.org/ip']
    start_urls = ['https://www.ip.cn/']

    def parse(self, response):
        print('running parse() in ipproxy.py')
        # ip.cn shows the visitor's IP inside this element
        ip = response.xpath('//div[@class="well"]/p[1]/code/text()').get()
        # ip = json.loads(response.text)['origin']  # variant for httpbin.org/ip
        print(ip)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Run it! That said, the quality of free proxy IPs is dubious...
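One common mitigation, sketched below under assumptions (the class name RetryWithNextProxyMiddleware and the drop-on-failure policy are not from the original project), is to evict a proxy from the pool once it causes a download error, using the third middleware hook, process_exception:

import random

class RetryWithNextProxyMiddleware:
    # Same pool idea as above; a real pool would be refreshed periodically
    PROXIES = [
        'https://183.166.138.181:4216',
        'https://183.166.162.157:4216',
        'https://112.194.112.175:8118',
    ]

    def process_request(self, request, spider):
        if self.PROXIES:
            request.meta['proxy'] = random.choice(self.PROXIES)

    def process_exception(self, request, exception, spider):
        # Called when the download raised (e.g. connection refused, timeout)
        bad = request.meta.get('proxy')
        if bad in self.PROXIES:
            self.PROXIES.remove(bad)
            spider.logger.warning('dropped dead proxy %s', bad)
        # Returning a Request re-schedules it; process_request will then
        # pick a different proxy (or none, once the pool is empty)
        if self.PROXIES:
            return request.copy()
        return None

Like the middlewares above, this one would need its own entry in DOWNLOADER_MIDDLEWARES to take effect.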