Adding a Proxy IP Pool to a Python Crawler (for Beginners)

xiaoxianerqq

Original article: https://blog.csdn.net/weixin_41996197/article/details/89427121

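Adding a proxy IP pool to a crawler
When a crawler sends too many requests, the target site will often ban its IP address. Proxy IPs are the usual fix. In simple terms, a proxy works like a relay station between your machine and the web server: your machine hands its request to the proxy server, the proxy sends that request to the web server on your behalf, and then passes the response back to you.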
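To make the relay idea concrete, here is a minimal sketch using `requests`. The helper names `build_proxies` and `fetch_via_proxy` are made up for illustration, and `203.0.113.10:8080` is only a placeholder address from a documentation-only IP range, not a working proxy.

```python
import requests


def build_proxies(proxy_addr):
    """Build the mapping that requests accepts in its `proxies` argument.

    proxy_addr is a "host:port" string. Both keys point at the same HTTP
    proxy, so http:// and https:// URLs are both relayed through it.
    """
    proxy_url = "http://{}".format(proxy_addr)
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxy_addr, timeout=10):
    """Send a GET request to `url` through the proxy and return the body text."""
    response = requests.get(url, proxies=build_proxies(proxy_addr), timeout=timeout)
    response.raise_for_status()  # treat HTTP error statuses as failures
    return response.text


# Example call (needs a proxy that actually works):
# text = fetch_via_proxy("http://fund.eastmoney.com/", "203.0.113.10:8080")
```

From the web server's point of view, the request comes from the proxy's address rather than from your machine, which is why rotating proxies helps avoid bans.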

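Downloading and installing the proxy IP pool
This is a fairly popular proxy pool on GitHub, and it uses a Redis database. Because all of its proxies are free, their quality is not high, but it is good enough for learning. You can also configure paid proxy IPs in it, but this article does not cover that.
URL: https://github.com/jhao104/proxy_pool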

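After downloading and extracting the archive, open cmd, change to the extracted directory, and run this command to install the dependencies: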

pip install -r requirements.txt

In the extracted folder, find Config\setting.py and open it in Notepad++ or your IDE to edit it.

Next, in cmd, change into the Run directory and run:

python main.py

The proxy pool is now running!
We can also open RedisDesktopManager to take a look at the stored proxies.
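Besides looking at Redis, you can check from Python that the pool's HTTP interface is answering. The sketch below assumes the pool listens on `http://127.0.0.1:5010` and exposes `/get/`, as in the demo code used later in this article; `pool_is_up` is a hypothetical helper name.

```python
import requests

POOL_API = "http://127.0.0.1:5010"  # assumed address, taken from the demo code


def pool_is_up(base_url=POOL_API, timeout=3):
    """Return True if the pool's /get/ endpoint answers with HTTP 200."""
    try:
        response = requests.get("{}/get/".format(base_url), timeout=timeout)
    except requests.RequestException:
        # Connection refused, timeout, and similar errors all mean "not reachable"
        return False
    return response.status_code == 200


# Example: print(pool_is_up())
```

A `False` result usually means main.py is not running or the port differs from the one assumed here.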

Next, let's test whether this proxy pool actually works.

Scraping the fund names under "Hot Themes" on Tiantian Fund (fund.eastmoney.com)

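First, find the proxy pool's usage demo: open https://github.com/jhao104/proxy_pool and scroll all the way down.
Open http://fund.eastmoney.com/ and press F12 to open the developer tools.
In the Console tab, type document.charset and press Enter to see the page's encoding.
Inspect the page to find where the target elements are located.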
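The encoding check can also be done in Python instead of the browser console. The following is a rough sketch, not an exact equivalent of document.charset: it only reads the charset declared in a `<meta>` tag, and `declared_charset` is a hypothetical helper name.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE
)


def declared_charset(raw_html, default=None):
    """Return the charset declared in a <meta> tag of raw HTML bytes.

    Only the first 4096 bytes are searched, since the declaration normally
    sits near the top of the document. Returns `default` if none is found.
    """
    match = _META_CHARSET.search(raw_html[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Example with requests (network access needed):
# raw = requests.get("http://fund.eastmoney.com/").content
# print(declared_charset(raw))
```

With `requests`, the `response.encoding` and `response.apparent_encoding` attributes give two further hints: the first comes from the HTTP headers, the second is guessed from the body.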

The code is as follows:

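import requests
from bs4 import BeautifulSoup


def get_proxy():
    # Ask the local proxy pool for one proxy ("host:port")
    return requests.get("http://127.0.0.1:5010/get/").text.strip()


def delete_proxy(proxy):
    # Tell the pool to discard a proxy that no longer works
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))


def getHtml():
    retry_count = 5  # number of attempts allowed
    proxy = get_proxy()  # get a proxy IP from the pool
    while retry_count > 0:
        try:
            # Access the page through the proxy via the proxies argument;
            # the timeout keeps a dead proxy from hanging the request
            html = requests.get('http://fund.eastmoney.com/',
                                proxies={"http": "http://{}".format(proxy)},
                                timeout=10)
            html.encoding = 'utf-8'
            return html.text  # return the HTML text
        except Exception:
            retry_count -= 1
    # Failed 5 times: delete this proxy from the pool
    delete_proxy(proxy)
    return None  # return None if the request failed


def get_info():
    html = getHtml()
    if html is None:
        # Without this check BeautifulSoup would raise on None
        print('Request failed; run again to try another proxy.')
        return
    soup = BeautifulSoup(html, 'lxml')
    item = soup.find_all('div', attrs={'class': 'content dataShow-itemB'})
    for i in item:
        # Print the text of the first six links in each block
        for a in i.find_all('a')[:6]:
            print(a.string)


get_info()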
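The code above retries the same proxy five times before giving up. Since free proxies fail often, one possible variation is to ask the pool for a fresh proxy after every failure. This is only a sketch under the same assumptions as the demo (pool at `http://127.0.0.1:5010`, `/get/` returning a plain "host:port" string, which may differ between pool versions). `fetch_once` and `fetch_with_rotation` are hypothetical names, and the three callable parameters exist only so the loop can be tried without a network.

```python
import requests

POOL_API = "http://127.0.0.1:5010"  # assumed local pool address, as in the demo


def get_proxy():
    """Ask the pool for one proxy; assumed to come back as plain "host:port"."""
    return requests.get(POOL_API + "/get/", timeout=5).text.strip()


def delete_proxy(proxy):
    """Ask the pool to drop a proxy that failed."""
    requests.get(POOL_API + "/delete/", params={"proxy": proxy}, timeout=5)


def fetch_once(url, proxy):
    """Fetch `url` through one proxy; raises a requests exception on failure."""
    response = requests.get(url, proxies={"http": "http://" + proxy}, timeout=10)
    response.raise_for_status()
    response.encoding = "utf-8"
    return response.text


def fetch_with_rotation(url, max_proxies=5, proxy_source=get_proxy,
                        proxy_remover=delete_proxy, fetch=fetch_once):
    """Try up to `max_proxies` different proxies; return the page text or None."""
    for _ in range(max_proxies):
        proxy = proxy_source()
        try:
            return fetch(url, proxy)
        except requests.RequestException:
            # This proxy failed: remove it from the pool and try the next one
            proxy_remover(proxy)
    return None


# Example: html = fetch_with_rotation("http://fund.eastmoney.com/")
```

Catching `requests.RequestException` instead of a bare `Exception` keeps programming errors visible while still covering timeouts, refused connections, and HTTP error statuses.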
————————————————
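Copyright notice: This is an original article by CSDN blogger 「巴赤赤」, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original article: https://blog.csdn.net/weixin_41996197/article/details/89427121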