I originally wrote this crawler on Windows and it worked fine; on Linux it hit quite a few problems, mostly caused by the server's proxy configuration. Recording them here.
1. crontab job fails to request the URL
#!/bin/bash
source /etc/profile  # load http_proxy/https_proxy so the requests go through the proxy
cd /usr/local/python_spider/test
nohup pipenv run python3 ./test/main.py >> spider.log 2>&1 &
The cause was that the script did not `source /etc/profile`. /etc/profile is where http_proxy and https_proxy are configured, and cron does not load it, so the requests never went through the proxy. Lesson: whatever a crontab script runs, it's best to put `source /etc/profile` at the top.
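The point above can be seen from Python itself: the standard library (and libraries like requests) pick up the proxy from the http_proxy/https_proxy environment variables, which are empty under cron unless /etc/profile is sourced. A minimal sketch, using a hypothetical proxy address for illustration:

```python
import os
import urllib.request

# Hypothetical proxy address; in the post's setup the real values
# come from /etc/profile, which cron does not source automatically.
os.environ['http_proxy'] = 'http://10.1.1.1:8888'
os.environ['https_proxy'] = 'http://10.1.1.1:8888'

# urllib reads the proxy settings from the environment on each call
print(urllib.request.getproxies()['http'])  # -> http://10.1.1.1:8888
```

If the environment variables are missing, `getproxies()` returns an empty mapping and every request goes out directly and, behind this kind of server, fails.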
2. Since I had to wait for the page to load, I added Selenium with PhantomJS, but the page it fetched came back empty. I first assumed PhantomJS was the problem because it's no longer supported, so I switched to Firefox, and found the request itself was failing. That reminded me of the proxy issue above; after adding proxy configuration it worked.
Firefox proxy configuration
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
profile.set_preference('network.proxy.http', 'url')  # proxy host
profile.set_preference('network.proxy.http_port', 8888)
profile.set_preference('network.proxy.ssl', 'url')  # proxy host for HTTPS
profile.set_preference('network.proxy.ssl_port', 8888)
profile.update_preferences()

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, firefox_profile=profile)
driver.get(request.url)
PhantomJS proxy configuration
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType

caps = webdriver.DesiredCapabilities.PHANTOMJS.copy()
proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = 'url:port'
proxy.ssl_proxy = '10.1.1.1:8888'  # note: the attribute is ssl_proxy, not https_proxy
proxy.add_to_capabilities(caps)

driver = webdriver.PhantomJS(desired_capabilities=caps)  # phantomjs binary must be on PATH
driver.get(request.url)
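Before blaming the browser driver, it's worth confirming the proxy itself works with a plain HTTP request. A minimal sketch with the standard library, reusing the hypothetical 10.1.1.1:8888 proxy address from above:

```python
import urllib.request

# Hypothetical proxy address; substitute the one configured in /etc/profile.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://10.1.1.1:8888',
    'https': 'http://10.1.1.1:8888',
})
opener = urllib.request.build_opener(proxy_handler)

try:
    resp = opener.open('http://example.com', timeout=10)
    print(resp.status)  # 200 means the proxy forwarded the request
except OSError as exc:
    print('proxy request failed:', exc)
```

If this fails too, the problem is the proxy or the environment, not Selenium.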