I originally wrote this crawler on Windows and it worked fine; on Linux it hit quite a few problems, mostly caused by the server's proxy configuration. Recording them here.
1. crontab job fails to request the URL
#!/bin/bash
source /etc/profile  # load http_proxy/https_proxy so the requests go through the proxy
cd /usr/local/python_spider/test
nohup pipenv run python3 ./test/main.py >> spider.log 2>&1 &
The cause was that the script did not `source /etc/profile`. /etc/profile is where http_proxy and https_proxy are configured, and cron does not load it, so the requests never went through the proxy. Lesson: whatever a crontab script runs, it's best to put `source /etc/profile` at the top.
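The point above can be seen from Python itself: the standard library (and libraries like requests) pick up the proxy from the http_proxy/https_proxy environment variables, which are empty under cron unless /etc/profile is sourced. A minimal sketch, using a hypothetical proxy address for illustration:

```python
import os
import urllib.request

# Hypothetical proxy address; in the post's setup the real values
# come from /etc/profile, which cron does not source automatically.
os.environ['http_proxy'] = 'http://10.1.1.1:8888'
os.environ['https_proxy'] = 'http://10.1.1.1:8888'

# urllib reads the proxy settings from the environment on each call
print(urllib.request.getproxies()['http'])  # -> http://10.1.1.1:8888
```

If the environment variables are missing, `getproxies()` returns an empty mapping and every request goes out directly and, behind this kind of server, fails.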
2. Since I had to wait for the page to load, I added Selenium with PhantomJS, but the page it fetched came back empty. I first assumed PhantomJS was the problem because it's no longer supported, so I switched to Firefox, and found the request itself was failing. That reminded me of the proxy issue above; after adding proxy configuration it worked.
Firefox proxy configuration
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
profile.set_preference('network.proxy.http', 'url')  # proxy host
profile.set_preference('network.proxy.http_port', 8888)
profile.set_preference('network.proxy.ssl', 'url')  # proxy host for HTTPS
profile.set_preference('network.proxy.ssl_port', 8888)
profile.update_preferences()

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, firefox_profile=profile)
driver.get(request.url)
PhantomJS proxy configuration
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType

caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = 'url:port'
proxy.ssl_proxy = '10.1.1.1:8888'  # note: the attribute is ssl_proxy, not https_proxy
proxy.add_to_capabilities(caps)

driver = webdriver.PhantomJS(desired_capabilities=caps)  # phantomjs binary must be on PATH
driver.get(request.url)
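Before blaming the browser driver, it's worth confirming the proxy itself works with a plain HTTP request. A minimal sketch with the standard library, reusing the hypothetical 10.1.1.1:8888 proxy address from above:

```python
import urllib.request

# Hypothetical proxy address; substitute the one configured in /etc/profile.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://10.1.1.1:8888',
    'https': 'http://10.1.1.1:8888',
})
opener = urllib.request.build_opener(proxy_handler)

try:
    resp = opener.open('http://example.com', timeout=10)
    print(resp.status)  # 200 means the proxy forwarded the request
except OSError as exc:
    print('proxy request failed:', exc)
```

If this fails too, the problem is the proxy or the environment, not Selenium.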