Crawlers often need proxy IPs: if you crawl from your own IP, it will likely get banned before long, and that is when proxy IPs come in.
Scrapy actually ships with built-in proxy support. I won't walk through the source code here; let's get straight to how to use it.
The proxy must be configured before the spider starts crawling, so the first place to set it is in our start_requests:
Built-in approach (1):
def start_requests(self):
    # Set the proxy IPs via environment variables
    import os
    os.environ["HTTPS_PROXY"] = '37.187.149.129:1080'
    os.environ["HTTP_PROXY"] = '202.29.212.213:443'
    # When you override start_requests you must still return/yield the
    # requests yourself, otherwise nothing gets crawled.
    # Option 1:
    # for url in self.start_urls:
    #     yield Request(url=url)
    # Option 2:
    request_list = []
    for url in self.start_urls:
        request_list.append(Request(url=url))
    return request_list
This approach injects the proxies into our code through environment variables. One caveat: Scrapy's built-in HttpProxyMiddleware reads the environment via urllib's getproxies() when the middleware is initialized, so the safest place to set these variables is as early as possible (for example, at the top of the spider module).
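As a quick sanity check (a minimal standalone sketch, assuming Python 3), you can print what getproxies() will report back to the middleware once the variables are set:
import os
from urllib.request import getproxies

os.environ["HTTPS_PROXY"] = '37.187.149.129:1080'
os.environ["HTTP_PROXY"] = '202.29.212.213:443'
# prints something like {'https': '37.187.149.129:1080', 'http': '202.29.212.213:443'}
print(getproxies())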
Built-in approach (2):
The second built-in way is to pass the proxy through the request's meta parameter:
def start_requests(self):
    # Attach the proxy IP to each request via meta
    for url in self.start_urls:
        yield Request(url=url, meta={'proxy': "https://37.187.149.129:1080/"})
When both built-in approaches are used at once, the meta setting takes priority over the environment variables.
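That precedence means you can set a global fallback proxy via the environment (at import time, so the middleware sees it) and still override it per request. A minimal sketch; the spider name, URLs, and the alternating condition are all hypothetical:
import os
import scrapy

# global fallback, set at import time so HttpProxyMiddleware picks it up
os.environ["HTTP_PROXY"] = 'http://202.29.212.213:443'

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            if i % 2 == 0:
                # meta['proxy'] wins over HTTP_PROXY for this request only
                yield scrapy.Request(url, meta={'proxy': "https://37.187.149.129:1080/"})
            else:
                # this request falls back to the environment proxy
                yield scrapy.Request(url)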
Custom proxy middleware:
I created a myproxy.py file under the project root to hold the custom proxy classes (the settings.py entry below references it by this name).
myproxy.py:
# -*- coding: utf-8 -*-
# @Time    : 2019/7/18 9:40 PM
# @Author  : lh
# @Email   : xx@lh.com
# @File    : myproxy.py
# @Software: PyCharm
import base64
import random
from six.moves.urllib.parse import unquote, urlunparse

try:
    from urllib2 import _parse_proxy          # Python 2
except ImportError:
    from urllib.request import _parse_proxy  # Python 3

from scrapy.utils.python import to_bytes
class MyProxyMiddleware(object):
    """
    Custom approach one: pick a random proxy, split out any credentials,
    and set meta['proxy'] plus a Proxy-Authorization header when needed.
    """

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "62.176.126.67:35830",
            "37.187.149.129:1080",
            "117.191.11.110:80",
            "202.29.212.213:443",
            "176.197.87.50:49742",
            "183.88.214.47:8080",
        ]
        url = random.choice(PROXIES)
        orig_type = "http"  # fall back to http when the proxy entry carries no scheme
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))
        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
class OneProxyMiddleware(object):
    """
    Custom approach two: proxies kept as dicts of address plus optional
    credentials, with Basic auth added only when credentials exist.
    """

    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        # meta['proxy'] should be a plain string, not bytes
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:  # '' is falsy, so no empty auth header is sent
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
Those are the two custom approaches; note that MyProxyMiddleware already handles proxies that require authentication.
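If one of your proxies requires credentials, you can embed them in the URL; the _parse_proxy call in MyProxyMiddleware splits them out, and they get sent as a Basic Proxy-Authorization header. A standalone check, with a purely hypothetical user:password pair:
try:
    from urllib2 import _parse_proxy          # Python 2
except ImportError:
    from urllib.request import _parse_proxy  # Python 3

# hypothetical credentials, for illustration only
proxy_type, user, password, hostport = _parse_proxy("http://myuser:mypass@62.176.126.67:35830")
print(proxy_type, user, password, hostport)  # -> http myuser mypass 62.176.126.67:35830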
To make a middleware take effect, register it under DOWNLOADER_MIDDLEWARES in settings.py:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy_tesy.middlewares.ScrapyTesyDownloaderMiddleware': 543,
    'scrapy_tesy.myproxy.MyProxyMiddleware': 500,
}
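A note on the priority value: Scrapy's built-in HttpProxyMiddleware sits at 750 by default, so at 500 our middleware runs first and the built-in one then leaves the meta['proxy'] we set untouched. If you would rather have only your own middleware manage proxies, a sketch (reusing the same project path) is to disable the built-in one entirely:
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in proxy middleware (default priority 750)
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy_tesy.myproxy.MyProxyMiddleware': 500,
}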
After that, every request goes through our custom proxy middleware.
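To verify that a proxy is actually being applied, one quick check (a hypothetical parse callback, not part of the original project) is to log the proxy recorded on each request:
def parse(self, response):
    # response.request is the Request that produced this response;
    # its meta carries the proxy chosen by the middleware
    self.logger.info("fetched %s via proxy %s",
                     response.url, response.request.meta.get('proxy'))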