Python crawlers: setting request headers and proxy parameters in scrapy-splash

彭世瑜

This is an original article by the author. Reposting is welcome; please credit the source.

Article link: https://blog.csdn.net/mouday/article/details/82151257

Any one of the following three approaches will do.

1. Set the proxy and request headers in the Lua script:

function main(splash, args)
    -- Set the proxy for every outgoing request
    -- (example address; replace with your own proxy host and port)
    splash:on_request(function(request)
        request:set_proxy{
            host = "127.0.0.1",
            port = 8000,
        }
    end)

    -- Set the User-Agent header
    splash:set_user_agent("Mozilla/5.0")

    -- Set custom request headers
    splash:set_custom_headers({
        ["Accept"] = "application/json, text/plain, */*"
    })

    splash:go("https://www.baidu.com/")
    return splash:html()
end
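
When the script is driven from Scrapy, the proxy does not have to be hard-coded in Lua. The following is a minimal sketch, not the author's code: it keeps a variant of the script above as a Python string and passes the proxy host and port as script arguments. The argument names `proxy_host`, `proxy_port` and `user_agent` and the helper `build_splash_args` are made up for this example; `args.url` is assumed to be filled in by scrapy-splash from the request URL.

```python
# Lua script kept as a Python string. proxy_host, proxy_port and user_agent
# are hypothetical argument names chosen for this sketch; Lua reads them
# from the `args` table that Splash passes to main().
LUA_SOURCE = """
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{host = args.proxy_host, port = args.proxy_port}
    end)
    splash:set_user_agent(args.user_agent)
    assert(splash:go(args.url))
    return splash:html()
end
"""


def build_splash_args(proxy_host, proxy_port, user_agent="Mozilla/5.0"):
    """Build the `args` dict to hand to SplashRequest(..., args=...)."""
    return {
        "lua_source": LUA_SOURCE,
        "proxy_host": proxy_host,
        "proxy_port": int(proxy_port),  # Lua expects a number here
        "user_agent": user_agent,
    }
```

In a spider this would be used as `SplashRequest(url, endpoint='execute', args=build_splash_args('127.0.0.1', 8000))`.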

2. Set the proxy in Scrapy:

from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(
            url,
            endpoint='execute',
            args={
                'wait': 5,
                'lua_source': source,  # the Lua script, as a string
                'proxy': 'http://proxy_ip:proxy_port',
            },
        )
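
The `proxy` value is a plain URL string, so it can be assembled from configuration. The sketch below assumes the format given in the Splash documentation, `[protocol://][user:password@]proxyhost[:port]`. The helper name `build_proxy_url` is hypothetical, and the function does no escaping, so credentials containing characters such as `@` or `:` may need separate handling.

```python
def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Assemble a proxy URL for the Splash `proxy` argument.

    Assumed format: [protocol://][user:password@]proxyhost[:port]
    No escaping is applied to the credentials.
    """
    auth = ""
    if user is not None:
        auth = user
        if password is not None:
            auth += ":" + password
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"
```

For example, `build_proxy_url('proxy_ip', 8080)` yields `http://proxy_ip:8080`, which can be passed as `args['proxy']`.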

Request headers are set in Scrapy the usual way, through the headers argument of the request.

3. Set the proxy in a middleware:
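
One detail worth illustrating: scrapy-splash forwards the Scrapy request headers to Splash in the `headers` field, and with the `execute` endpoint the Lua script has to apply them itself, for example through the `headers` argument of `splash:go`. The following is a sketch under that assumption; the helper `splash_request_kwargs` is a hypothetical name, and it only builds the keyword arguments so that it runs without Scrapy installed.

```python
# Lua script that applies the headers forwarded by scrapy-splash.
LUA_WITH_HEADERS = """
function main(splash, args)
    assert(splash:go{args.url, headers = args.headers})
    return splash:html()
end
"""


def splash_request_kwargs(url, headers):
    """Keyword arguments for SplashRequest (hypothetical helper)."""
    return {
        "url": url,
        "endpoint": "execute",
        "headers": headers,  # ordinary Scrapy request headers
        "args": {"lua_source": LUA_WITH_HEADERS},
    }


kwargs = splash_request_kwargs(
    "https://www.baidu.com/",
    {"Accept": "application/json, text/plain, */*"},
)
# In a spider: yield SplashRequest(**kwargs)
```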

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # proxyServer and proxyAuth must be defined elsewhere (e.g. module-level constants)
        request.meta['splash']['args']['proxy'] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
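
The middleware above leaves `proxyServer` and `proxyAuth` undefined. A minimal sketch of how they might be built follows, assuming the proxy uses HTTP Basic authentication (the original does not say which scheme is used). The host, port, credentials and the module path `myproject.middlewares` are all placeholders. The settings fragment gives the custom middleware a lower order number than `SplashMiddleware` so that its `process_request` runs first, while `request.meta['splash']` is still available.

```python
import base64


def basic_proxy_auth(user, password):
    """Return a Proxy-Authorization value for HTTP Basic auth (assumed scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


# Placeholder values; replace with your own proxy details.
proxyServer = "http://proxy_ip:8080"
proxyAuth = basic_proxy_auth("user", "pass")

# settings.py fragment; "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 700,  # runs before SplashMiddleware
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
```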