How to Use Web Scraping and Data Analysis to Choose the Genre of Your First Novel: Chapter 3, Building an IP Pool from Free Proxy Sites

Article link: https://blog.csdn.net/yueguangfan/article/details/137250048

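Contents

How to Use Web Scraping and Data Analysis to Choose the Genre of Your First Novel: Chapter 3, Building an IP Pool from Free Proxy Sites
Preface
I. Getting Free IPs
II. Filtering Usable IPs
III. Building the IP Pool and Filtering Usable IPs
Summary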

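Preface

The previous article, Chapter 2: Fetching Novel Data with Proxy IPs, explained how to fetch novel data through a proxy IP. You should now have a rough idea of how to scrape a novel site through a proxy, but two questions remain:
1. How to quickly collect IPs from free proxy sites and build an IP pool;
2. How to quickly find the usable IPs in that pool.
The sections below work through a solution to each.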

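I. Getting Free IPs

This section shows how to collect proxy IPs from the following five sites and build an IP pool:
66ip (66网): http://www.66ip.cn/
89ip (89网): https://www.89ip.cn/
Kuaidaili (快代理): https://www.kuaidaili.com/free/dps/
Kaixin Daili (开心代理): http://www.kxdaili.com/dailiip.html
Yun Daili (云代理): http://www.ip3366.net/free/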

1. Wrapping the requests call in a helper

The code is as follows:

import requests
from bs4 import BeautifulSoup

# Request a page and return it parsed as BeautifulSoup, or False on any failure
def get_url(url='', headers=None, proxies=False):
    try:
        if proxies:
            # proxies is an 'ip:port' string; route both http and https through it
            response = requests.get(url, headers=headers, proxies={'http': proxies, 'https': proxies}, timeout=(5, 10))
        else:
            response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'html.parser')
        return False
    except Exception as e:
        print(e)
        return False

2. Fetching proxy IPs: Kaixin Daili

The code is as follows:

import re

# Collect free proxy IPs from Kaixin Daili
def getIpByKxdl(headers):
    try:
        ips = []
        for i in range(1, 3):  # pages 1 and 2
            url = r'http://www.kxdaili.com/dailiip/1/' + str(i) + r'.html'
            soup = get_url(url, headers)
            if soup:
                # An IP followed by a port on the next line; dots are escaped so they match only a literal "."
                pattern = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*\r?\n\s*(\d{1,5})"
                matches = re.findall(pattern, soup.text)
                for ip, port in matches:
                    ips.append(str(ip) + ':' + str(port))
                print("Kaixin Daili:", ips)
        return ips
    except Exception as e:
        print(e)
        return []
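
To make the extraction step easier to follow, here is a minimal offline sketch of how such a regular expression turns page text into 'ip:port' strings. The sample text and the helper name extract_proxies are made up for illustration; the real page text may be laid out differently, so treat this as a way to experiment with the pattern rather than a description of the site's actual output.

```python
import re

# Hypothetical sample of what soup.text might look like for a proxy table:
# each IP sits on one line and its port on the next.
SAMPLE_TEXT = """
IP address
Port
1.2.3.4
8080
anonymous
10.20.30.40
  3128
"""

# Dots are escaped so "." matches only a literal dot, not any character.
PROXY_PATTERN = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*\r?\n\s*(\d{1,5})")


def extract_proxies(text):
    """Return 'ip:port' for every IP that is followed by a port on the next line."""
    return [f"{ip}:{port}" for ip, port in PROXY_PATTERN.findall(text)]


proxies = extract_proxies(SAMPLE_TEXT)
print(proxies)
```

Testing the pattern against saved page text like this is a quick way to notice when a site changes its layout and the scraper silently starts returning an empty list.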

3. Fetching proxy IPs: 66ip

The code is as follows:

# Collect free proxy IPs from 66ip
def getIpBy66dl(headers):
    try:
        ips = []
        # The URL does not depend on the loop variable, so this simply calls the same extraction endpoint twice
        for _ in range(2):
            url = r'http://www.66ip.cn/nmtq.php?getnum=&isp=0&anonymoustype=0&start=&ports=&export=&ipaddress=&area=1&proxytype=2&api=66ip'
            soup = get_url(url, headers)
            if soup:
                # Entries look like ip:port; dots are escaped so they match only a literal "."
                pattern = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*:\s*(\d{1,5})"
                matches = re.findall(pattern, str(soup))
                for ip, port in matches:
                    ips.append(str(ip) + ':' + str(port))
                print("66ip:", ips)
        return ips
    except Exception as e:
        print(e)
        return []

4. Fetching proxy IPs: 89ip

The code is as follows:

# Collect free proxy IPs from 89ip
def getIpBy89dl(headers):
    try:
        ips = []
        for i in range(1, 3):  # pages 1 and 2
            url = r'https://www.89ip.cn/index_' + str(i) + '.html'
            soup = get_url(url, headers)
            if soup:
                table_ipc = soup.find('table', 'layui-table')
                if table_ipc is None:
                    # Table not found (layout changed or request blocked); skip this page
                    continue
                pattern = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*\r?\n\s*(\d{1,5})"
                matches = re.findall(pattern, table_ipc.text)
                for ip, port in matches:
                    ips.append(str(ip) + ':' + str(port))
                print("89ip:", ips)
        return ips
    except Exception as e:
        print(e)
        return []

5. Fetching proxy IPs: Kuaidaili

The code is as follows:

# Collect free proxy IPs from Kuaidaili
def getIpByKdl(headers):
    try:
        ips = []
        for i in range(1, 2):  # page 1 only
            url = r'https://www.kuaidaili.com/free/dps/' + str(i)
            soup = get_url(url, headers)
            if soup:
                page = str(soup)
                # IPs and ports are matched separately in the page source
                ip_list = re.findall(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", page)
                port_list = re.findall(r'"port": "(\d+)"', page)
                # zip stops at the shorter list, so a count mismatch cannot raise IndexError
                for ip, port in zip(ip_list, port_list):
                    ips.append(str(ip) + ':' + str(port))
                print("Kuaidaili:", ips)
        return ips
    except Exception as e:
        print(e)
        return []

6. Fetching proxy IPs: Yun Daili

The code is as follows:

# Collect free proxy IPs from Yun Daili
def getIpByYdl(headers):
    try:
        ips = []
        for i in range(1, 3):  # pages 1 and 2
            url = r'http://www.ip3366.net/free/?stype=1&page=' + str(i)
            soup = get_url(url, headers)
            if soup:
                # An IP followed by a port on the next line; dots are escaped so they match only a literal "."
                pattern = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*\r?\n\s*(\d{1,5})"
                matches = re.findall(pattern, soup.text)
                for ip, port in matches:
                    ips.append(str(ip) + ':' + str(port))
                print("Yun Daili:", ips)
        return ips
    except Exception as e:
        print(e)
        return []

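II. Filtering Usable IPs

The code is as follows:

# Test whether a proxy IP works
def textProxies(headers, proxies):
    # First check that the proxy can reach a simple IP-echo service
    text_url = r'http://icanhazip.com/'
    soup = get_url(text_url, headers, proxies)
    if soup:
        # Then check that it can also reach the target novel site
        qd_url = r'https://www.qidian.com/'
        qd_soup = get_url(qd_url, headers, proxies)
        if qd_soup:
            return proxies
    # Return False explicitly when either request fails
    return False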
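
Because each check can wait several seconds on a timeout, testing the candidates one at a time is slow for a large pool. One possible way to speed this up, sketched below, is to run the checks in a thread pool from the standard library. The helper name filter_usable and the stand-in fake_check are assumptions made for this example; in practice the check argument could be something like lambda p: textProxies(headers, p).

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(candidates, check, max_workers=20):
    """Run check(proxy) on every candidate in parallel and keep those that pass.

    `check` is any callable that returns a truthy value for a working proxy.
    """
    if not candidates:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        results = list(executor.map(check, candidates))
    return [proxy for proxy, ok in zip(candidates, results) if ok]


# Stand-in checker so the example runs without network access (hypothetical rule).
def fake_check(proxy):
    return proxy.endswith(":8080")


usable = filter_usable(["1.2.3.4:8080", "5.6.7.8:3128", "9.9.9.9:8080"], fake_check)
print(usable)
```

Threads suit this job because the work is almost entirely waiting on the network; the worker count of 20 is an arbitrary starting point, not a tuned value.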

III. Building the IP Pool and Filtering Usable IPs

The code is as follows (remember to replace the cookie with the one from your own browser session). It assumes the functions from the previous sections are defined in the same script:

import re
import requests
from bs4 import BeautifulSoup

# Remember to replace the cookie with your own
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Mobile Safari/537.36',
    'Cookie': 'newstatisticUUID=1687606367_135137479; fu=1132719504; supportwebp=true; supportWebp=true; _ga_D20NXNVDG2=GS1.1.1698680222.2.0.1698680231.0.0.0; _ga_VMQL7235X0=GS1.1.1698680222.2.0.1698680231.0.0.0; _csrfToken=8c77025c-97fa-44ff-',
}
# Collect candidate proxies from all five sources
ips = getIpByKxdl(headers)
ips += getIpBy66dl(headers)
ips += getIpBy89dl(headers)
ips += getIpByKdl(headers)
ips += getIpByYdl(headers)
# Keep only the proxies that pass the test instead of discarding the result
usable_ips = []
for proxies in ips:
    if textProxies(headers, proxies):
        usable_ips.append(proxies)
print("Usable IPs:", usable_ips)
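
Since the five sources can list the same proxy more than once, and a loose regular expression can pick up strings that are not valid addresses, it may be worth cleaning the combined list before testing it. The sketch below is one way to do that; the helper name clean_pool and the sample entries are invented for this example.

```python
import re

# Matches 'a.b.c.d:port' made of numeric fields only.
_PROXY_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})")


def clean_pool(candidates):
    """Drop malformed entries and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for entry in candidates:
        key = entry.strip()
        match = _PROXY_RE.fullmatch(key)
        if not match:
            continue
        *octets, port = (int(part) for part in match.groups())
        # Each octet must fit in a byte and the port must be a valid TCP port
        if any(octet > 255 for octet in octets) or not 1 <= port <= 65535:
            continue
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


raw = ["1.2.3.4:8080", "1.2.3.4:8080", "999.1.1.1:80",
       "5.6.7.8:99999", " 5.6.7.8:3128 ", "not a proxy"]
print(clean_pool(raw))
```

Running the candidates through a step like this before the availability test avoids spending timeouts on duplicates and impossible addresses.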

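Summary

By building a free proxy IP pool automatically and filtering it quickly for usable IPs, and combining that with the two earlier articles (Chapter 1: Fetching and Parsing Novel Data, and Chapter 2: Fetching Novel Data with Proxy IPs), we can now scrape essentially all of the public data for the strongly recommended (强推) novels on the novel site. In practice, however, the code still needs several improvements.

First, when a request fails, we need a solid handling mechanism so that the crawler keeps running stably.

Second, when the current IP pool contains no usable IP, the crawler should automatically fetch new IPs and refresh the pool until a usable one is found.

Third, even after a usable IP has been found, it may stop working partway through a scraping job. When that happens, the crawler should switch to the next usable IP automatically so the job is not interrupted.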
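
As a rough illustration of the third point only, the sketch below tries the proxies in a pool in order and discards each one that fails. It is not the author's solution, which is left to a later article. The function name fetch_with_rotation and the stand-in fake_fetch are hypothetical; the fetch argument is assumed to behave like get_url, returning a falsy value on failure.

```python
def fetch_with_rotation(url, pool, fetch):
    """Try each proxy in order; return (result, proxy) for the first that works.

    `fetch(url, proxy)` must return a falsy value on failure.
    Proxies that fail are removed from `pool` in place.
    """
    while pool:
        proxy = pool[0]
        result = fetch(url, proxy)
        if result:
            return result, proxy
        pool.pop(0)  # discard the dead proxy and move on to the next one
    return None, None


# Stand-in fetcher so the example runs offline: only proxy "b" works.
def fake_fetch(url, proxy):
    return f"page from {url}" if proxy == "b" else False


pool = ["a", "b", "c"]
result, used = fetch_with_rotation("http://example.com/", pool, fake_fetch)
print(result, used, pool)
```

When the function returns (None, None) the pool is empty, which is the point at which a crawler would need to refill it from the proxy sites.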

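Later articles will look at these problems in detail and offer solutions. Specifically, they will cover:

1. How to scrape the strongly recommended (强推), Sanjiang (三江), and other novel lists effectively while keeping the data accurate and complete.
2. How to analyze the scraped novel data in depth and extract value from it.
3. A practical error-handling strategy for failed requests.
4. How to make the crawler refresh the IP pool automatically when it contains no usable IP.
5. How to switch cleanly to the next usable IP when the current one fails while the scraping code is running.
6. Finally, how to use PyQt5 to build a desktop application that builds and manages the IP pool, finds usable IPs, scrapes novel data, and visualizes the results, so that users can scrape and analyze novel data more efficiently.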