Python 3 Web Scraping Quick-Start Walkthrough: Fixes for the Parts of the Original Author's Code That No Longer Work

Latest recommended article published 2024-09-21 08:01:20

aec153

License: CC 4.0 BY-SA

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/aec153/article/details/114477775

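This article shows how to use a Python scraper to collect every chapter link of the novel 《一念永恒》 from https://www.qu.la/book/16431/ and how to fix the encoding problem. By analyzing the page structure and parsing the HTML with BeautifulSoup, it iterates over the 67 index pages, collects the name and link of every chapter, and writes the chapter content to a file.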

Link to the original author's post. Do read it; there is a lot to learn from it.
https://blog.csdn.net/c406495762/article/details/78123502

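After reading the original author's first example, which scrapes the novel 《一念永恒》, I found that:

1. The novel site it targets no longer works.

2. The scraped text does not display Chinese correctly.

3. Most sites do not let you scrape all the links in one go.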

I. Here is a site I found through Baidu that is fairly easy to scrape

Site: https://www.qu.la/book/
Link for 《一念永恒》: https://www.qu.la/book/16431/

II. Set req.encoding = 'utf-8' on the response (req) returned by the GET request

[Image]
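
To illustrate why this one line matters, here is a minimal sketch using only the standard library. It assumes the garbling comes from the response bytes being UTF-8 while requests decodes them with another codec; requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, which is one common cause. The `raw` bytes are a stand-in for `req.content`, not data taken from the site.

```python
# Stand-in for req.content: the UTF-8 bytes a server would send for the title.
raw = '一念永恒'.encode('utf-8')

# Decoding with the wrong codec, as happens when req.encoding is ISO-8859-1.
garbled = raw.decode('iso-8859-1')

# Decoding with the right codec, as happens after req.encoding = 'utf-8'.
correct = raw.decode('utf-8')

print(repr(garbled))  # unreadable Latin-1 characters, one per byte
print(correct)        # the original Chinese title
```

Since `req.text` is decoded at the moment it is read, the encoding has to be set before the first access to `req.text`, as the code in this post does.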

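III. On this site each page shows only 20 links. Comparing the URLs of the pages shows that every page's link follows the pattern

https://www.qu.la/book/16431/index_X.html (X is the page number, from 1 to 67; I know there are 67 pages because I clicked through to the last one)
[Image: only 20 links found]
As the image above shows, there are only 20 links on a page.

[Image]

These three images reveal the pattern in the links.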

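3.1 First, locating the links:

[Image]
Here I locate the ul elements whose class attribute (class_ in BeautifulSoup) is section-list. Inspecting the result shows that the chapter list is the second element of the returned list.

div = div_bf.find_all('ul', class_='section-list')  # modified
a_bf = BeautifulSoup(str(div[1]), 'html.parser')  # second element of the list; indices start at 0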
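
The selection step can be tried offline on a small piece of HTML. The sketch below is only an illustration: `SAMPLE_HTML` is made up to mimic the layout described above (two ul blocks with class section-list, the second one holding the chapters), and the contents of the first block and all hrefs are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two <ul class="section-list"> blocks; the second is the chapter list.
SAMPLE_HTML = """
<ul class="section-list"><li><a href="/book/16431/9.html">Latest chapter</a></li></ul>
<ul class="section-list">
  <li><a href="/book/16431/1.html">Chapter 1</a></li>
  <li><a href="/book/16431/2.html">Chapter 2</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
lists = soup.find_all('ul', class_='section-list')

# Take the second match (index 1) and read the name and href of each link in it.
chapter_list = lists[1]
links = [(a.string, a.get('href')) for a in chapter_list.find_all('a')]
print(links)
```

Each element returned by find_all is itself searchable, so calling find_all('a') on it directly gives the same links as re-parsing str(div[1]) into a new BeautifulSoup object.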

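3.2 Second, scraping the links:

This solves the problem of how to scrape all 67 pages of links:

3.2.1 Use a for loop and build each page's URL from the loop variable i

    for i in range(67):  # i runs from 0 to 66
        server = 'https://www.qu.la/book/16431/index_' + str(i+1) + '.html'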
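
The same URL pattern can be wrapped in a small helper so that the page count is not buried in the loop. This is a sketch; the function name `index_page_urls` is hypothetical and not part of the original code.

```python
def index_page_urls(target, pages):
    """Return the URLs of index pages 1..pages for the book at `target`."""
    return [f'{target}index_{page}.html' for page in range(1, pages + 1)]


urls = index_page_urls('https://www.qu.la/book/16431/', 67)
print(urls[0])   # first index page
print(urls[-1])  # last index page
```

Counting from 1 with range(1, pages + 1) removes the need for the i+1 adjustment inside the loop.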

3.2.2 Getting the total number of links:

def get_download_url(self):
    a_all = 0  # running total of links
    for i in range(67):
        url = self.target + 'index_' + str(i+1) + '.html'
        req = requests.get(url)
        req.encoding = 'utf-8'
        html = req.text
        div_bf = BeautifulSoup(html, 'html.parser')
        div = div_bf.find_all('ul', class_='section-list')  # modified
        a_bf = BeautifulSoup(str(div[1]), 'html.parser')
        a = a_bf.find_all('a')
        a_all += len(a)  # add the number of <a> links on this page
        for each in a:
            self.names.append(each.string)
            self.urls.append(self.server + each.get('href'))
    self.nums = a_all  # total number of links
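
The method builds each chapter URL by concatenating self.server and the href. Whether that yields a clean URL depends on how the site writes its hrefs, which this post does not show. As a hedged alternative, the standard library's urljoin resolves an href against the page it was found on. Both href values below are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

server = 'https://www.qu.la/'
page_url = 'https://www.qu.la/book/16431/index_1.html'

absolute_href = '/book/16431/5678.html'  # hypothetical root-relative href
relative_href = '5678.html'              # hypothetical page-relative href

# Plain concatenation keeps both slashes when the href starts with '/'.
concatenated = server + absolute_href

# urljoin resolves either form against the URL of the page being parsed.
joined_absolute = urljoin(page_url, absolute_href)
joined_relative = urljoin(page_url, relative_href)

print(concatenated)
print(joined_absolute)
print(joined_relative)
```

Many servers tolerate the doubled slash, so the concatenation may still work; urljoin simply removes the dependence on the href format.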

Final result:
[Image]
Full code:

from bs4 import BeautifulSoup
import requests
import sys
class downloader(object):
    """
    Class: downloads the novel 《一念永恒》 from the 笔趣阁 site
    Parameters:
        none
    Returns:
        none
    Modify:
        2021-03-06
    """
    def __init__(self):
        self.server = 'https://www.qu.la/'  # modified
        self.target = 'https://www.qu.la/book/16431/'  # modified
        self.names = []  # chapter names
        self.urls = []  # chapter links
        self.nums = 0  # number of chapters
    def get_download_url(self):
        """
        Function: collect the download links
        Parameters:
            none
        Returns:
            none
        Modify:
            2021-03-06
        """
        a_all = 0  # running total of links
        for i in range(67):
            url = self.target + 'index_' + str(i+1) + '.html'
            req = requests.get(url)
            req.encoding = 'utf-8'
            html = req.text
            div_bf = BeautifulSoup(html, 'html.parser')
            div = div_bf.find_all('ul', class_='section-list')  # modified
            a_bf = BeautifulSoup(str(div[1]), 'html.parser')  # second element of the list
            a = a_bf.find_all('a')
            a_all += len(a)  # add the number of <a> links on this page
            for each in a:
                self.names.append(each.string)
                self.urls.append(self.server + each.get('href'))
        self.nums = a_all  # total number of links
    def get_contents(self, target):
        """
        Function: fetch the content of one chapter
        Parameters:
            target - URL of the chapter to download (string)
        Returns:
            texts - chapter content (string)
        Modify:
            2021-03-06
        """
        req = requests.get(url=target)
        req.encoding = 'utf-8'  # added: decode the response as UTF-8
        html = req.text
        bf = BeautifulSoup(html, 'html.parser')
        texts = bf.find_all('div', class_='content')  # modified
        texts = texts[0].text.replace('\xa0', '\n')  # partly modified
        return texts
    def writer(self, name, path, text):
        """
        Function: append a scraped chapter to a file
        Parameters:
            name - chapter name (string)
            path - file name for the novel, relative to the current directory (string)
            text - chapter content (string)
        Returns:
            none
        Modify:
            2021-03-07
        """
        with open(path, 'a', encoding='utf-8') as f:
            f.write(name + '\n')
            f.write(text)
            f.write('\n\n')
if __name__ == '__main__':
    dl = downloader()
    dl.get_download_url()
    print('Starting download of 《一念永恒》:')
    print(dl.nums)  # total number of chapters
    for i in range(dl.nums):
        dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
        # Percentage of chapters written so far, counting the one just saved
        percent = 100 * (i + 1) / dl.nums
        print("Downloaded chapter " + str(i + 1) + ": %.3f%%" % percent)
        sys.stdout.write("Downloaded: %.3f%%" % percent + '\r')
        sys.stdout.flush()
    print('Download of 《一念永恒》 complete')
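
One detail of get_contents may be worth a second look: replace('\xa0', '\n') turns every single non-breaking space into its own line break. If the site indents paragraphs with runs of several non-breaking spaces (an assumption; the page source is not shown in this post), the saved text ends up with several blank lines between paragraphs. The sketch below collapses each run into one line break instead; `clean_chapter_text` and the sample string are hypothetical.

```python
import re


def clean_chapter_text(raw):
    """Turn each run of non-breaking spaces into one line break and trim the ends."""
    return re.sub(r'\xa0+', '\n', raw).strip()


# Hypothetical chapter text with four non-breaking spaces before each paragraph.
sample = '\xa0\xa0\xa0\xa0First paragraph.\xa0\xa0\xa0\xa0Second paragraph.'

per_character = sample.replace('\xa0', '\n')  # behaviour of the code above
collapsed = clean_chapter_text(sample)

print(repr(per_character))
print(repr(collapsed))
```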

If you found this useful, please give it a like. Thanks!