Recently, while scraping some UK e-commerce sites, I needed a proxy to avoid getting blocked. I used a provider's username/password-authenticated proxies. Chrome has no command-line flag for proxy credentials, so the usual workaround is to package a small Chrome extension that sets the proxy and answers the authentication challenge. Build the extension in extension.py:
import zipfile

def proxies(username, password, endpoint, port):
    """Package a Chrome extension (Manifest V2) that routes all traffic
    through the given proxy and answers its auth challenge."""
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Proxies",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: parseInt(%s)
            },
            bypassList: ["localhost"]
        }
    };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        {urls: ["<all_urls>"]},
        ['blocking']
    );
    """ % (endpoint, port, username, password)

    extension = 'proxies_extension.zip'
    with zipfile.ZipFile(extension, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)
    return extension
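Before wiring the extension into the browser, it is worth sanity-checking the zip round-trip itself. The sketch below (with a placeholder endpoint and port, not a real proxy) uses the same writestr/read pattern to confirm the credentials and endpoint really end up interpolated inside the archived script:

```python
import zipfile

# Placeholder values for illustration only
endpoint, port = "proxy.example.com", 8000
background_js = 'var host = "%s"; var port = parseInt(%s);' % (endpoint, port)

# Write the string into a zip, exactly as proxies() does
archive = "check_extension.zip"
with zipfile.ZipFile(archive, "w") as zp:
    zp.writestr("background.js", background_js)

# Read it back and confirm the interpolation survived
with zipfile.ZipFile(archive) as zp:
    restored = zp.read("background.js").decode()

print(restored)  # var host = "proxy.example.com"; var port = parseInt(8000);
```

If the restored string still contains literal %s placeholders, the formatting tuple on the template string was dropped.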
Then use it in the spider:
from selenium import webdriver
from extension import proxies

# Substitute your own proxy credentials here
proxies_extension = proxies(proxies_user_name, proxies_password, proxies_endpoint, proxies_port)

options = webdriver.ChromeOptions()
# options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_extension(proxies_extension)

# Launch Chrome (Selenium 3 style; Selenium 4 removed executable_path in
# favour of webdriver.Chrome(service=Service("/home/spider/chromedriver"), options=options))
driver = webdriver.Chrome(options=options, executable_path="/home/spider/chromedriver")
driver.set_window_size(1920, 1080)
With that, the proxy works normally. Note that headless mode cannot be used here: Chrome's classic headless mode does not load extensions, so the driver fails to start (newer Chrome releases do support extensions in the --headless=new mode). If deploying on a server with no display, add the following before creating the driver:
from pyvirtualdisplay import Display

# Start a virtual display before chromedriver launches
display = Display(visible=0, size=(800, 800))
display.start()
This gives chromedriver a virtual (Xvfb) display to render into; call display.stop() once the driver has quit. Two dependencies must be installed first:
yum install xorg-x11-server-Xvfb    # RHEL/CentOS package name; on Debian/Ubuntu: apt install xvfb
pip install pyvirtualdisplay
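Whether the virtual display is needed can also be decided at runtime: on a desktop machine the DISPLAY environment variable is normally set, while on a headless server it is not. A small helper along these lines (my own convention, not part of pyvirtualdisplay):

```python
import os

def needs_virtual_display():
    """Return True when no X display is available (e.g. a headless server)."""
    return not os.environ.get("DISPLAY")

# Usage sketch: only start Xvfb when there is no real display.
# if needs_virtual_display():
#     display = Display(visible=0, size=(800, 800))
#     display.start()
```

This lets the same spider script run unchanged both locally and on the server.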