I. A collection of common IP proxy sites
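- When a single IP sends requests too frequently, the target site may demand a CAPTCHA or block the IP outright. Routing requests through a proxy hides the real IP, so the server believes the proxy is the client. Switching proxies continually during a crawl makes a block much less likely and keeps the crawl running smoothly.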
- Proxy sites: 多贝云, 阿布云, 鲸鱼, ET, 熊猫, 站大爷, 讯代理, 品易, 无忧, 闪臣, 代理云, 蜻蜓, 神龙, 微秒云, 星速云, Proxy302, http://proxylist.fatezero.org
- The code samples in this post request http://httpbin.org/get. If the "origin" field in the response is the proxy's IP, the proxy has been set up successfully.
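The check described above can be sketched as a small helper. This is only an illustration: the function name origin_matches_proxy and the sample addresses are assumptions, and the comma-splitting is there because "origin" may list several addresses when a request was forwarded. Some proxies also exit from a different IP than the one you connect to, in which case this comparison would report False even though the proxy works.

```python
def origin_matches_proxy(origin: str, proxy: str) -> bool:
    """Return True if the proxy's host appears in httpbin's "origin" field.

    origin: value of the "origin" key in the JSON reply; it may contain
            several comma-separated addresses.
    proxy:  proxy in "host:port" form.
    """
    proxy_host = proxy.rsplit(":", 1)[0].strip()
    seen = [part.strip() for part in origin.split(",")]
    return proxy_host in seen


# Sample values only; in practice `origin` would be read from the JSON
# returned by http://httpbin.org/get when requested through the proxy.
print(origin_matches_proxy("58.240.220.86", "58.240.220.86:5281"))  # True
print(origin_matches_proxy("203.0.113.7", "58.240.220.86:5281"))    # False
```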
II. Using proxy pools
1. Cui's (崔大) open-source proxy pool
- Download the complete source code of Cui's proxy pool, then install the dependencies:
pip install -r requirements.txt -i https://pypi.douban.com/simple
Next, run run.py. Once it is running, open http://localhost:5555/random to get a random proxy IP.
- Code to test fetching a proxy IP
import logging

import requests


def main():
    """
    main method, entry point
    :return: none
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
    # Try up to 10 times, asking the pool for a fresh proxy on every attempt
    for retry_times in range(10):
        # The pool answers with a plain-text "host:port" string
        proxy = requests.get('http://127.0.0.1:5555/random').text.strip()
        print('get random proxy', proxy)
        proxies = {'http': f'http://{proxy}'}
        try:
            html = requests.get('http://httpbin.org/get', proxies=proxies, timeout=5, headers=headers)
            print(html.status_code, html.text)
            break  # stop at the first proxy that works
        except Exception as err:
            logging.warning(err)


if __name__ == '__main__':
    main()
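The loop above builds a proxies dict with only an 'http' entry. requests picks the entry by the scheme of the target URL, so an https:// target would not go through the proxy. A minimal sketch of a helper that covers both schemes follows; the name build_proxies is hypothetical, and it assumes the pool hands out ordinary HTTP proxies.

```python
def build_proxies(raw: str) -> dict:
    """Turn the pool's plain-text "host:port" reply into a requests proxies dict."""
    proxy = raw.strip()
    if not proxy:
        raise ValueError("the proxy pool returned an empty reply")
    # Assumption: the proxy itself speaks plain HTTP, so both entries
    # use the http:// scheme, including the one for https:// targets.
    url = f"http://{proxy}"
    return {"http": url, "https": url}


# Sample reply with the trailing newline a text endpoint may include
print(build_proxies("58.240.220.86:5281\n"))
```

The result can be passed directly as the proxies argument of requests.get.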
2. jhao104's open-source proxy pool
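- Download the complete source code of the jhao104 proxy pool, then install the dependencies:
pip install -r requirements.txt -i https://pypi.douban.com/simple
In setting.py, change the database connection to:
DB_CONN = 'redis://@127.0.0.1:6379/0'
Then start the scheduler and the API server:
python proxyPool.py schedule
python proxyPool.py server
Finally, open http://127.0.0.1:5010/get/ to get a random proxy IP.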
- Sample code that calls the proxy pool
import logging

import requests


def main():
    """
    main method, entry point
    :return: none
    """
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
    # Try up to 10 times, asking the pool for a fresh proxy on every attempt
    for retry_times in range(10):
        # This pool answers with JSON; the "proxy" field holds "host:port"
        proxy = requests.get('http://127.0.0.1:5010/get/').json()["proxy"]
        print('get random proxy', proxy)
        proxies = {'http': f'http://{proxy}'}
        try:
            html = requests.get('http://httpbin.org/get', proxies=proxies, timeout=5, headers=headers)
            print(html.status_code, html.text)
            break  # stop at the first proxy that works
        except Exception as err:
            logging.warning(err)


if __name__ == '__main__':
    main()
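The sample above indexes the JSON reply with ["proxy"], which raises a KeyError if the reply carries no such field (for example when the pool has nothing to offer). A defensive sketch is shown below; extract_proxy is a hypothetical name, and the shape of the "no proxy" reply used in the example is an assumption, not something taken from the project.

```python
from typing import Optional


def extract_proxy(payload: dict) -> Optional[str]:
    """Return the "host:port" string from the pool's JSON reply, or None."""
    proxy = payload.get("proxy")
    # Accept only a string that at least looks like "host:port"
    if isinstance(proxy, str) and ":" in proxy:
        return proxy.strip()
    return None


print(extract_proxy({"proxy": "58.240.220.86:5281"}))  # 58.240.220.86:5281
# Assumed shape of a reply without a usable proxy
print(extract_proxy({"code": 0}))                      # None
```

Inside the retry loop, a None result can simply be skipped with continue before any request is attempted.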
III. Setting a proxy in each library
1. Setting a proxy IP in requests
- Set an HTTP proxy, or use a SOCKS proxy instead
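import requests

proxy = "58.240.220.86:5281"
# The URL scheme says how to reach the proxy itself: an ordinary HTTP proxy
# is addressed with http:// even when the target site is https://
proxies = {'https': f'http://{proxy}', 'http': f'http://{proxy}'}
# For a SOCKS proxy, pass socks_proxies instead (needs: pip install "requests[socks]")
socks_proxies = {'https': f'socks5://{proxy}', 'http': f'socks5://{proxy}'}
try:
    response = requests.get('https://httpbin.org/get', proxies=proxies)
    print(response.json()["origin"])
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)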
- Set a global proxy by patching the socket module (uses the PySocks package)
import socket

import requests
import socks  # provided by the PySocks package

# Set a global proxy: every socket created afterwards goes through it.
# The port must be an integer, not a string.
socks.set_default_proxy(socks.SOCKS5, '58.240.220.86', 5281)
socket.socket = socks.socksocket
try:
    response = requests.get('https://httpbin.org/get')
    print(response.json()["origin"])
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)
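socks.set_default_proxy takes the host and the port as separate arguments, with the port as an integer, while proxy pools usually hand out a single "host:port" string. A minimal sketch of the conversion follows; split_host_port is a hypothetical helper name.

```python
from typing import Tuple


def split_host_port(proxy: str) -> Tuple[str, int]:
    """Split "host:port" into (host, port) with the port as an int."""
    host, sep, port = proxy.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {proxy!r}")
    return host, int(port)


host, port = split_host_port("58.240.220.86:5281")
print(host, port)  # 58.240.220.86 5281
```

The two values can then be passed as socks.set_default_proxy(socks.SOCKS5, host, port).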
2. Setting a proxy IP in httpx
- Set the proxy
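import httpx

proxy = "58.240.220.86:5281"
# httpx keys this mapping by URL prefix ('http://', 'https://'), not by bare scheme name
proxies = {'https://': f'http://{proxy}', 'http://': f'http://{proxy}'}
# Newer httpx releases replace the proxies argument with proxy;
# on those versions use httpx.Client(proxy=f'http://{proxy}') instead
with httpx.Client(proxies=proxies) as client:
    response = client.get('https://httpbin.org/get')
    print(response.json()["origin"])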
3. Setting a proxy IP in aiohttp
- Set the proxy
import asyncio

import aiohttp

proxy = "http://58.240.220.86:5281"


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get', proxy=proxy) as response:
            print(await response.text())


asyncio.run(main())
4. Setting a proxy IP in selenium
- Using Chrome as an example
from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
# Proxy as ip:port, e.g. 58.240.220.86:53281
proxy = '58.240.220.86:53281'
chromeOptions.add_argument(f'--proxy-server=http://{proxy}')
# For a SOCKS proxy:
# chromeOptions.add_argument(f'--proxy-server=socks5://{proxy}')
browser = webdriver.Chrome(options=chromeOptions)
browser.get('http://httpbin.org/get')
print(browser.page_source)
browser.close()
5. Setting a proxy IP in pyppeteer
- Set the proxy
import asyncio

from pyppeteer import launch


async def main():
    proxy = '58.240.220.86:53281'
    browser = await launch({"args": [f'--proxy-server=http://{proxy}'], "headless": False})
    # For a SOCKS proxy:
    # browser = await launch({"args": [f'--proxy-server=socks5://{proxy}'], "headless": False})
    page = await browser.newPage()
    await page.goto('https://httpbin.org/get')
    print(await page.content())
    await browser.close()  # close the browser object


asyncio.get_event_loop().run_until_complete(main())
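The selenium and pyppeteer examples both pass the same Chrome command-line switch, --proxy-server, and differ only in the scheme in front of the address. A small sketch that builds the switch for either case is shown below; proxy_server_arg and ALLOWED_SCHEMES are hypothetical names, and the helper deliberately accepts only the two schemes used in this post.

```python
# Only the two schemes shown in the examples above
ALLOWED_SCHEMES = ("http", "socks5")


def proxy_server_arg(proxy: str, scheme: str = "http") -> str:
    """Build the --proxy-server switch for a "host:port" proxy."""
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return f"--proxy-server={scheme}://{proxy.strip()}"


print(proxy_server_arg("58.240.220.86:53281"))
# --proxy-server=http://58.240.220.86:53281
print(proxy_server_arg("58.240.220.86:53281", "socks5"))
# --proxy-server=socks5://58.240.220.86:53281
```

The returned string can be given to chromeOptions.add_argument(...) in selenium or placed in the "args" list passed to launch(...) in pyppeteer.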
6. Setting a proxy IP in Playwright
- Set the proxy
from playwright.sync_api import sync_playwright

proxy = '58.240.220.86:5281'
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, proxy={'server': f'http://{proxy}'})
    # For a SOCKS proxy:
    # browser = p.chromium.launch(headless=False, proxy={'server': f'socks5://{proxy}'})
    page = browser.new_page()
    page.goto('https://httpbin.org/get')
    print(page.content())
    browser.close()