Scraping novels from a certain 趣阁 site with Python (2.0): 1,600 chapters in ten minutes

Latest recommended article published 2024-07-19 16:36:18

鑫xing

Category: web scraping. Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_52612318/article/details/120479160

Efficiently scraping novels from a certain 趣阁 site with a Python crawler

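This code is a modified version of my earlier 笔趣阁 (Biquge) scraper. That version sent every request from my own IP, so it had to sleep 4 to 5 seconds before each chapter request to avoid an IP ban. Counting the time spent saving the text, each chapter took 6 to 7 seconds, which adds up to a very long run for a long novel. This time I use proxy IPs instead, so no sleep is needed and the requests can be sent in bulk.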
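To put those numbers side by side, here is a small back-of-the-envelope calculation. It uses only the figures quoted in this post (1,600 chapters, 6 to 7 seconds per chapter with sleeping, ten minutes in total with proxies); the function name `total_minutes` is made up for illustration.

```python
def total_minutes(chapters, seconds_per_chapter):
    """Total run time in minutes for a strictly sequential crawl."""
    return chapters * seconds_per_chapter / 60


CHAPTERS = 1600  # chapter count from the title

# Old approach: 6 to 7 seconds per chapter, including the sleep.
slow_low = total_minutes(CHAPTERS, 6)
slow_high = total_minutes(CHAPTERS, 7)

# New approach: ten minutes in total implies this average time per chapter.
implied_seconds_per_chapter = 10 * 60 / CHAPTERS

print(f"with sleep: {slow_low:.0f} to {slow_high:.0f} minutes")
print(f"with proxies: about {implied_seconds_per_chapter:.3f} s per chapter")
```

By this estimate the old version needs roughly two and a half to three hours for the same novel.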

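1. Getting free IPs
For free IPs I chose 站大爷 (zdaye.com). Free IPs are short-lived, so use ones that were fetched as recently as possible. I wrote a dedicated getip.py for this: it scrapes the latest thirty IPs and returns them as dictionaries of two kinds, e.g. {'http': 'ip'} and {'https': 'ip'}.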
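Because free proxies die quickly, whatever code consumes these dictionaries will probably need to skip dead ones. The following is a minimal sketch of that idea, not the code from this post: `fetch_with_proxies`, `proxy_pool`, `max_tries` and the injectable `get` parameter are all hypothetical names. It relies only on the `proxies=` and `timeout=` keyword arguments of `requests.get`.

```python
import random


def fetch_with_proxies(url, proxy_pool, get=None, max_tries=5, timeout=5):
    """Try up to max_tries random proxies; return the first successful response.

    proxy_pool is a list of dicts such as {'http': 'http://1.2.3.4:8080'}.
    `get` defaults to requests.get; it is a parameter only so the retry
    logic can be exercised without network access (an assumption of this sketch).
    """
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get

    # Pick distinct proxies in random order, at most max_tries of them.
    candidates = random.sample(proxy_pool, min(max_tries, len(proxy_pool)))
    last_error = None
    for proxies in candidates:
        try:
            return get(url, proxies=proxies, timeout=timeout)
        except Exception as exc:  # free proxies fail often; try the next one
            last_error = exc
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error
```

A caller would pass the list of dictionaries returned by getip.py as `proxy_pool`. The short timeout matters here: without it, a single dead proxy can stall the whole crawl.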

Note: this is a separate .py file. The actual crawler, written later, will call it.

import requests
from lxml import etree
from time import sleep

def getip():
    base_url = 'https://www.zdaye.com'
    url = 'https://www.zdaye.com/dayProxy.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
    }

    res = requests.get(url, headers=headers)
    res.encoding = "utf-8"
    dom = etree.HTML(res.text)
    sub_urls = dom.xpath('//h3[@class ="thread_title"]/a/@href')

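    sub_pages = []
    for sub_url in sub_urls:
        # Remove the '.html' suffix explicitly (str.rstrip would strip a character set).
        page_base = base_url + sub_url
        if page_base.endswith('.html'):
            page_base = page_base[:-len('.html')]
        for i in range(1, 11):
            sub_page = page_base + '/' + str(i) + '.html'
            sub_pages.append(sub_page)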
    http_list = []
    https_list = []
    for sub in sub_pages[:3]:
        sub_res = requests.get(sub, headers=headers)
        sub_res.encoding = 'utf-8'
        sub_dom = etree.HTML(sub_res.text)
        ips = sub_dom.xpath('//tbody/tr/td[1]/text()')
        ports = sub_dom.xpath('//tbody/tr/td[2]/text()')
        types = sub_dom.xpath('//tbody/tr/td[4]/text()')
        sleep(3)
        sub_res.close()

        for ip, port, type in zip(ips, ports, types):
            proxies_http = {}
            proxies_https = {}
            http = 'http://' + ip + ':' + port
            https = 'https://'
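The listing above breaks off partway through the loop. As a rough sketch of how the pairing step could be finished so that it returns the two kinds of dictionaries described earlier, the helper below sorts each proxy by its type. The function name `build_proxy_dicts`, the variable `proxy_type`, and the assumption that the fourth table column contains the text "HTTP" or "HTTPS" are all guesses for illustration, not the author's code.

```python
def build_proxy_dicts(ips, ports, types):
    """Pair up scraped columns into requests-style proxy dictionaries.

    Assumption: each entry in `types` is the text 'HTTP' or 'HTTPS'
    (case and surrounding whitespace are ignored).
    Returns (http_list, https_list).
    """
    http_list = []
    https_list = []
    for ip, port, proxy_type in zip(ips, ports, types):
        address = ip.strip() + ':' + port.strip()
        if proxy_type.strip().upper() == 'HTTPS':
            https_list.append({'https': 'https://' + address})
        else:
            http_list.append({'http': 'http://' + address})
    return http_list, https_list


# Example with made-up addresses:
http_list, https_list = build_proxy_dicts(
    ['10.0.0.1', '10.0.0.2'], ['8080', '443'], ['HTTP', 'HTTPS']
)
print(http_list)
print(https_list)
```

Keeping this step in a pure function, separate from the network requests, makes it easy to check against a few hand-written rows before running the scraper.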
