Scraping a Proxy Site with Python's Requests Library and Filtering for Working Proxies

木易为生，男一为名

Last modified 2024-10-11 13:58:34

Tags: programming languages, python, web scraping

First published 2024-10-11 13:57:16

Article link: https://blog.csdn.net/weixin_53367199/article/details/142852334

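When building web crawlers, routing requests through a proxy server can help bypass some access restrictions and reduce the risk of being blocked by the target site. This article shows how to use Python's `requests` library to scrape a list of free proxies from a public proxy site, test whether each proxy works, and collect the working proxies (IP and port) in a list.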

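## Why use a proxy?

1. **Avoid IP bans**: Sending frequent requests to the same site may get your IP address blocked.
2. **Faster access**: In some cases, a suitable proxy can speed up access to particular resources.
3. **Hide your real IP**: Accessing the network through a proxy server hides your real IP address and helps protect your privacy.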

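## Preparation

First, make sure the `requests` library is installed. If it is not, install it with pip:

```bash
pip install requests
```

We will also use `BeautifulSoup` to parse the HTML. If `beautifulsoup4` and the `lxml` parser are not installed yet, install them with:

```bash
pip install beautifulsoup4 lxml
```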

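## Scraping the proxy list

Suppose we want to fetch a proxy list from a site that publishes free proxies. The example below uses a fictional proxy site; replace it with a real URL when you run the code.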

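### Step 1: Send the request

```python
import requests
from bs4 import BeautifulSoup

def fetch_proxies(url):
    """Download the page and return its HTML, or None on failure."""
    try:
        # A timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as exc:
        print(f"Failed to retrieve data: {exc}")
        return None
    if response.status_code == 200:
        return response.text
    print(f"Failed to retrieve data, status code: {response.status_code}")
    return None

# Placeholder URL for the proxy site; replace it with a real one
proxy_url = "http://example-proxy-website.com/free-proxies"
html_content = fetch_proxies(proxy_url)

if html_content is None:
    raise SystemExit("No content to parse.")
soup = BeautifulSoup(html_content, 'lxml')
```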

### Step 2: Parse the HTML and extract the proxies

We need to find the part of the page that holds the proxy data. This is usually a table or a list in which each entry contains an IP address and a port number.

```python
def extract_proxies(soup):
    proxies = []
    # Assume the proxies sit in a <table>, one proxy per <tr> row
    for row in soup.find_all('tr'):
        columns = row.find_all('td')
        if len(columns) >= 2:  # need at least two cells: IP and port
            ip = columns[0].text.strip()
            port = columns[1].text.strip()
            proxies.append(f"{ip}:{port}")
    return proxies

proxies_list = extract_proxies(soup)
print(f"Found {len(proxies_list)} potential proxies.")
```
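Real proxy tables often contain rows that are not proxies at all, such as repeated header rows or malformed entries. As a minimal sketch, assuming the candidates are plain `IPv4:port` strings as built above, the list could be sanity-checked with the standard library before any network test. The helper name `is_valid_candidate` and the sample rows are hypothetical and not part of the original tutorial.

```python
import ipaddress

def is_valid_candidate(candidate):
    """Return True if candidate looks like 'IPv4:port' with a port in 1-65535."""
    host, sep, port = candidate.rpartition(":")
    # Require a colon and a port made only of ASCII digits
    if not sep or not (port.isascii() and port.isdigit()):
        return False
    try:
        ipaddress.IPv4Address(host)  # raises ValueError for malformed addresses
    except ValueError:
        return False
    return 1 <= int(port) <= 65535

# Made-up sample rows (documentation address ranges), including junk entries
raw_candidates = [
    "203.0.113.10:8080",
    "IP Address:Port",      # text picked up from a stray table row
    "198.51.100.7:99999",   # port out of range
    "192.0.2.300:3128",     # invalid octet
]
valid_candidates = [c for c in raw_candidates if is_valid_candidate(c)]
print(valid_candidates)
```

In the tutorial's flow, the same filter could be applied to `proxies_list` so that only well-formed entries are tested in the next step.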

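### Step 3: Test whether each proxy works

Now that we have a list of candidate proxies, we need to check whether each one actually works. To do that, we can try sending a request to another website through each proxy.

```python
def test_proxy(proxy):
    # requests expects proxy URLs with a scheme, so prepend http:// to "ip:port"
    proxy_with_scheme = f"http://{proxy}"
    try:
        # Send a request to google.com through the proxy
        response = requests.get('https://www.google.com', proxies={'http': proxy_with_scheme, 'https': proxy_with_scheme}, timeout=5)
        if response.status_code == 200:
            return True
    except requests.exceptions.RequestException:
        # Timeout is a subclass of RequestException, so this also covers timeouts
        pass
    return False

working_proxies = [proxy for proxy in proxies_list if test_proxy(proxy)]
print(f"Working proxies: {working_proxies}")
```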
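Testing proxies one at a time can be slow, because each dead proxy may take the full timeout to fail. One possible approach, not from the original article, is to run the checks in a thread pool. The sketch below assumes a checker function that takes a candidate string and returns a boolean; `filter_working` and `fake_check` are hypothetical names, and the stand-in checker is there only so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_working(candidates, check, max_workers=20):
    """Run check(candidate) in a thread pool and keep candidates where it returns True."""
    candidates = list(candidates)
    if not candidates:
        return []
    workers = min(max_workers, len(candidates))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        results = list(pool.map(check, candidates))
    return [c for c, ok in zip(candidates, results) if ok]

# Stand-in checker so the sketch runs offline; it does NOT test a real proxy
def fake_check(candidate):
    return candidate.endswith(":8080")

sample = ["203.0.113.10:8080", "198.51.100.7:3128", "192.0.2.44:8080"]
print(filter_working(sample, fake_check))
```

With the tutorial's own function, the call would look like `filter_working(proxies_list, test_proxy)`. Threads suit this task because the work is dominated by waiting on the network rather than by computation.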

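## Conclusion

With the steps above, we scraped a list of free proxies from a proxy site and filtered it down to the proxies that actually work. This process helps manage a proxy pool in a crawler project and makes the crawler more robust and efficient.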
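As a rough illustration of using the resulting list as a simple proxy pool, the sketch below rotates through the working proxies in round-robin order and builds the dictionary shape that `requests` accepts for its `proxies` argument. The function name `proxy_cycle` and the sample addresses are hypothetical, and a real pool would likely also need to drop proxies that stop working.

```python
import itertools

def proxy_cycle(working):
    """Return an endless iterator of requests-style proxies dicts, in round-robin order."""
    addresses = list(working)
    if not addresses:
        raise ValueError("no working proxies to rotate through")
    return (
        {"http": f"http://{address}", "https": f"http://{address}"}
        for address in itertools.cycle(addresses)
    )

# Made-up addresses standing in for the working_proxies list
rotation = proxy_cycle(["203.0.113.10:8080", "198.51.100.7:3128"])
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again
print(first)
print(second)
```

Each outgoing request could then take the next entry, for example `requests.get(url, proxies=next(rotation), timeout=5)`.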

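In real-world use, comply with the applicable laws and regulations and with the target site's terms of service, and use proxy services lawfully and responsibly. I hope this tutorial helps! If you have any questions or need further help, feel free to leave a comment.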