Python Web Scraping in Practice | (2) Scraping Web Novels

CoreJT

Column: Python3网络爬虫从理论到实践Base. Tags: Python scraping in practice, request, scraping novels with regular expressions

Article link: https://blog.csdn.net/sdu_hao/article/details/96016737


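In this post we use requests together with regular expressions to scrape novels from Biquge (笔趣阁), collecting each novel's title, its chapter list, and the chapter text.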

http://www.xbiquge.la/xiaoshuodaquan/

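Open the URL above and you will see a list of novels. Clicking a novel opens its chapter list, and only after opening a chapter do you reach the actual text. So the crawl has three levels: first fetch the novel list, then open one novel and fetch its chapter list, and finally scrape the text of each chapter. As before, the program follows the same four-step routine: send the request, get the response, parse the response, and store the result.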

First, set up the skeleton of the program:

import os
import re
import time
import requests
from requests import RequestException


def get_page(url):
    pass

def get_list(page):
    pass

def get_chapter(novel_url):
    pass

def get_content(chapter,name):
    pass

def write_tofile(chapter_content,chapter_name,name):
    pass

if __name__ == '__main__':
    # URL of the index page
    url = 'http://www.xbiquge.la/xiaoshuodaquan/'
    # Send the request and get the response
    page = get_page(url)
    # Parse the response into the list of novels
    novel_list = get_list(page)
    print(novel_list)
    # Pick one novel to scrape
    name = '全职法师'

    for item in novel_list:
        if item[1] == name:
            # The novel is in the list, so fetch its chapter list
            novel_chapter = get_chapter(item[0])
            print(novel_chapter)
            # Save each chapter to its own text file
            for chapter in novel_chapter:
                get_content(chapter, name)

Send the request and get the response:

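def get_page(url):
    try:
        # Send a browser-like User-Agent with the request
        headers = {
            'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # The timeout keeps a stalled connection from hanging the script
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            # Guess the encoding from the body so the Chinese text decodes correctly
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except RequestException:
        return None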
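
When a request fails, get_page simply returns None. One possible way to cope, sketched here as an assumption rather than as part of the original script, is to retry a few times with a pause between attempts. The helper name fetch_with_retry and its parameters are hypothetical; it accepts any function that follows the same "text or None" contract as get_page.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, pausing `delay` seconds between attempts.

    `fetch` is any callable that returns the page text, or None on failure.
    """
    for attempt in range(retries):
        page = fetch(url)
        if page is not None:
            return page
        # Wait before the next attempt, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay)
    return None
```

It could be used as page = fetch_with_retry(url, get_page). The pause also spaces out requests to the site a little.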

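Parse the index page response to get the list of novels:

On the index page, every novel sits in an li tag, and each one is wrapped in an a tag that carries its link.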

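def get_list(page):
    # If the request failed there is nothing to parse
    if not page:
        return []
    # Matching on the a tag alone is enough; each match is a (url, title) tuple
    pattern = re.compile('<a href="(.*?)">(.*?)</a>', re.S)
    novels = pattern.findall(page)
    # The first 10 links matched this way are not novels, so start from the 11th
    return novels[10:]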
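
To see what this pattern returns, here is a small sketch run on a made-up HTML fragment. The sample markup and URLs are assumptions that only imitate the structure described above; the real page contains far more links.

```python
import re

# Hypothetical snippet imitating the structure of the novel index page.
sample = (
    '<li><a href="http://www.xbiquge.la/1/1/">Novel A</a></li>\n'
    '<li><a href="http://www.xbiquge.la/2/2/">Novel B</a></li>'
)

# The non-greedy (.*?) stops at the first closing quote or </a>,
# so each link yields one (url, title) tuple.
pattern = re.compile('<a href="(.*?)">(.*?)</a>', re.S)
novels = pattern.findall(sample)
print(novels)

# A title-to-URL dictionary makes looking up a single novel easier.
by_title = {title: url for url, title in novels}
print(by_title['Novel B'])
```

Because findall returns tuples in (url, title) order, item[0] is the link and item[1] is the title in the main loop.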

Get the novel's chapter list:

All of the novel's chapters sit in dd tags, and each chapter is wrapped in an a tag that carries its link.

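def get_chapter(novel_url):
    html = get_page(novel_url)
    if html is None:
        # The request failed, so there are no chapters to return
        return []
    # Each match is a (relative_url, chapter_title) tuple
    pattern = re.compile("<dd><a href='(.*?)' >(.*?)</a></dd>", re.S)
    chapters = pattern.findall(html)
    return chapters[:5]  # first 5 chapters only; return chapters to get them all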
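
The chapter links captured here are site-relative paths, so they have to be combined with the site address before they can be requested. A minimal sketch using urljoin from the standard library, on a made-up fragment (the paths and the base URL of the novel are assumptions):

```python
import re
from urllib.parse import urljoin

# Hypothetical snippet imitating a chapter list; the paths are made up.
sample = (
    "<dd><a href='/0/1/100.html' >Chapter 1</a></dd>\n"
    "<dd><a href='/0/1/101.html' >Chapter 2</a></dd>"
)

pattern = re.compile("<dd><a href='(.*?)' >(.*?)</a></dd>", re.S)
chapters = pattern.findall(sample)

# urljoin resolves a relative href against the page it was found on.
base = 'http://www.xbiquge.la/0/1/'
full_urls = [urljoin(base, href) for href, _ in chapters]
print(full_urls)
```

Plain string concatenation with the site root works as long as every href starts with a slash; urljoin also handles hrefs that are already absolute or relative to the current directory.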

Get the chapter content:

The chapter text is inside the div tag with id="content".

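def get_content(chapter, name):
    # chapter is a (relative_url, chapter_title) tuple
    chapter_url = 'http://www.xbiquge.la' + chapter[0]
    html = get_page(chapter_url)
    if html is None:
        return  # skip chapters whose page could not be fetched
    # Capture everything between the content div and the first <p> after it
    pattern = re.compile('<div id="content">(.*?)<p>', re.S)
    chapter_content = pattern.findall(html)
    write_tofile(chapter_content, chapter[1], name)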
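
The sketch below runs the same pattern on a made-up chapter fragment to show what is captured. The sample markup is an assumption; it only mirrors the parts the pattern relies on, namely the content div and a p tag that follows the text.

```python
import re

# Hypothetical chapter page fragment; real pages are much longer.
sample = (
    '<div id="content">&nbsp;&nbsp;&nbsp;&nbsp;First line.<br /><br />'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second line.<p>site footer</p></div>'
)

pattern = re.compile('<div id="content">(.*?)<p>', re.S)
match = pattern.search(sample)
# search() returns None when nothing matches, so check before using it.
raw = match.group(1) if match else ''
print(raw)
```

The captured text still contains the &nbsp; entities and br tags, which is why a cleanup step is needed before saving. Note that the pattern only matches when a p tag follows the text; if the page layout changes, the result is empty.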

Store the scraped text: the novel's title becomes a directory, and each chapter is saved inside it as a text file named after the chapter:

def write_tofile(chapter_content, chapter_name, name):
    # Create the novel's directory on first use
    if not os.path.exists(name):
        os.mkdir(name)
    for content in chapter_content:
        # Strip the &nbsp; indentation and the <br /> line-break tags
        content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "").replace("<br />", "")
        with open(os.path.join(name, '{}.txt'.format(chapter_name)), 'w', encoding='utf-8') as f:
            f.write(content)
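
The function above deletes the br tags outright, so paragraphs run together, and it uses the chapter title as a file name without checking it. The following is one possible variation, not the original author's method: it turns br tags into newlines, decodes HTML entities, and replaces characters that are invalid in file names. The helper names clean_text, safe_filename and save_chapter are hypothetical.

```python
import html
import os
import re


def clean_text(raw):
    """Turn <br /> tags into newlines and decode entities such as &nbsp;."""
    text = re.sub(r'<br\s*/?>', '\n', raw)
    text = html.unescape(text)        # &nbsp; becomes the character U+00A0
    text = text.replace('\xa0', ' ')
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.split('\n')]
    return '\n'.join(line for line in lines if line)


def safe_filename(title):
    """Replace characters that are not allowed in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def save_chapter(directory, title, text):
    """Write one chapter to directory/<title>.txt and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path
```

For example, save_chapter(name, chapter_name, clean_text(content)) could stand in for the body of the loop.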

Full code:

import os
import re
import time
import requests
from requests import RequestException


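def get_page(url):
    try:
        # Send a browser-like User-Agent with the request
        headers = {
            'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # The timeout keeps a stalled connection from hanging the script
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            # Guess the encoding from the body so the Chinese text decodes correctly
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except RequestException:
        return None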


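def get_list(page):
    # If the request failed there is nothing to parse
    if not page:
        return []
    # Matching on the a tag alone is enough; each match is a (url, title) tuple
    pattern = re.compile('<a href="(.*?)">(.*?)</a>', re.S)
    novels = pattern.findall(page)
    # The first 10 links matched this way are not novels, so start from the 11th
    return novels[10:]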

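def get_chapter(novel_url):
    html = get_page(novel_url)
    if html is None:
        # The request failed, so there are no chapters to return
        return []
    # Each match is a (relative_url, chapter_title) tuple
    pattern = re.compile("<dd><a href='(.*?)' >(.*?)</a></dd>", re.S)
    chapters = pattern.findall(html)
    return chapters[:5]  # first 5 chapters only; return chapters to get them all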

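def get_content(chapter, name):
    # chapter is a (relative_url, chapter_title) tuple
    chapter_url = 'http://www.xbiquge.la' + chapter[0]
    html = get_page(chapter_url)
    if html is None:
        return  # skip chapters whose page could not be fetched
    # Capture everything between the content div and the first <p> after it
    pattern = re.compile('<div id="content">(.*?)<p>', re.S)
    chapter_content = pattern.findall(html)
    write_tofile(chapter_content, chapter[1], name)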

def write_tofile(chapter_content, chapter_name, name):
    # Create the novel's directory on first use
    if not os.path.exists(name):
        os.mkdir(name)
    for content in chapter_content:
        # Strip the &nbsp; indentation and the <br /> line-break tags
        content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "").replace("<br />", "")
        with open(os.path.join(name, '{}.txt'.format(chapter_name)), 'w', encoding='utf-8') as f:
            f.write(content)


if __name__ == '__main__':
    # URL of the index page
    url = 'http://www.xbiquge.la/xiaoshuodaquan/'
    # Send the request and get the response
    page = get_page(url)
    # Parse the response into the list of novels
    novel_list = get_list(page)
    print(novel_list)
    # Pick one novel to scrape (you could also loop over all of them)
    name = '全职法师'

    for item in novel_list:
        if item[1] == name:
            # The novel is in the list, so fetch its chapter list
            novel_chapter = get_chapter(item[0])
            print(novel_chapter)
            # Save each chapter to its own text file
            for chapter in novel_chapter:
                get_content(chapter, name)