Setting random request headers + building an IP proxy pool
Setting random request headers
Overview
To start, create a Scrapy spider; how to do this was covered in detail in Python crawler - Scrapy study notes (1).
I have already created a Scrapy spider here. First, open http://httpbin.org/headers in a browser; the page shows the request headers we send.
Next we will access the same URL from the spider (but first, in settings.py, turn off the robots protocol and modify the default request headers).
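For reference, a minimal sketch of those two changes in settings.py (the header values here are placeholders; any browser-like User-Agent will do):

ROBOTSTXT_OBEY = False  # do not fetch or obey robots.txt

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36',
}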
In the spider we just created, access http://httpbin.org/headers with the following code:
import scrapy
import json


class HeadersettingSpider(scrapy.Spider):
    name = 'headersetting'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print("=" * 60)
        # httpbin.org/headers echoes the received request headers back as JSON
        result = json.loads(response.text)['headers']['User-Agent']
        print(result)
The result is as follows (the same as when accessing directly through a browser):
Setting the request header
Here I recommend a website where you can search for the request headers (User-Agent strings) of all major browsers.
Next, open the Scrapy project's middlewares.py and add a new class:
import random


class UserAgentDownloadMiddlewares(object):
    # A small pool of real browser User-Agent strings (Chrome, Flock, Firefox, Edge, IE)
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Flock/3.5.3.4628',
        'Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14931',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Browzar)',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
Then open settings.py and enable the middleware in DOWNLOADER_MIDDLEWARES.
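A sketch of that change, assuming a placeholder project name of myproject (substitute your own project's module path):

DOWNLOADER_MIDDLEWARES = {
    # 543 is the priority slot Scrapy's generated settings use by default
    'myproject.middlewares.UserAgentDownloadMiddlewares': 543,
}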
In the main spider, append yield scrapy.Request(self.start_urls[0], dont_filter=True) at the end of parse() so that the same request is sent over and over:
import scrapy
import json


class HeadersettingSpider(scrapy.Spider):
    name = 'headersetting'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print("=" * 60)
        result = json.loads(response.text)['headers']['User-Agent']
        print(result)
        # dont_filter=True bypasses the duplicate filter so the same URL is fetched again
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Then run the spider (from the command line, or by creating a start.py file):
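A minimal start.py sketch, placed in the project root; it is equivalent to typing scrapy crawl headersetting on the command line:

from scrapy import cmdline

# Run the spider named 'headersetting' just like the CLI would
cmdline.execute('scrapy crawl headersetting'.split())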
The output shows that the User-Agent is now different on every request.
Building an IP proxy pool
Here is a recommended site for checking your current IP address (the spider below uses http://httpbin.org/ip).
First, create another spider in the existing project from the command line, as shown below.
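With Scrapy's generator, that command would look like this (the name ipporxy matches the spider code further down):

scrapy genspider ipporxy httpbin.org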
Then create a new class in middlewares.py:
class IpProxyDownloadMiddlewares(object):
    # Free public proxies; they expire quickly, so replace them with live ones
    IP_Pool = [
        '182.35.85.61:9999', '123.163.27.103:9999', '58.22.177.109:9999',
        '175.155.142.249:1133', '112.84.99.134:9999', '182.149.83.56:9999',
    ]

    def process_request(self, request, spider):
        # random is already imported at the top of middlewares.py
        proxy = random.choice(self.IP_Pool)
        request.meta['proxy'] = 'http://' + proxy
Then open settings.py again and update DOWNLOADER_MIDDLEWARES.
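Again a sketch with the same placeholder project name; both middlewares can be enabled at once:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.UserAgentDownloadMiddlewares': 543,
    'myproject.middlewares.IpProxyDownloadMiddlewares': 544,
}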
The main spider:
# -*- coding: utf-8 -*-
import scrapy
import json


class IpporxySpider(scrapy.Spider):
    name = 'ipporxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        # httpbin.org/ip reports the IP address the request came from
        result = json.loads(response.text)['origin']
        print('=' * 60)
        print(result)
        # Keep re-requesting so each response can show a different proxy IP
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Run it (since these IPs are free rather than purchased, the results are not as clear-cut):
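As before, launch it with scrapy crawl ipporxy (or a start.py variant). Free proxies go stale quickly, so expect some requests to fail or time out; swapping in fresh or paid proxies makes the IP rotation show up clearly in the printed origin values.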