Scraping a Novel with a Python Crawler and Converting It to EPUB

Latest recommended article published on 2024-04-16 10:05:38

七夕猛虎

Category column: Python tools. Tags: python, crawler

Article link: https://blog.csdn.net/qiximenghu/article/details/111058915

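Preface

Python version: Python 3.7

Python libraries: requests, bs4

EPUB conversion tool: pandoc (https://pandoc.org/installing.html); download and install the Windows version.

Browser: Chrome

This article only covers scraping content from a simple novel site and using a tool that converts text to EPUB. It is meant for learning about crawlers; please support the officially licensed novels.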

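I. Basic operations

1. Viewing the page source

This step is simple and supported by most browsers. Open the page and press F12 to view the page source, or right-click the page content and choose Inspect.

2. Copying the Selector value of a specific link in the page

Novel sites usually have a regular structure: body text, previous chapter, next chapter, table of contents, and similar links. The position of each of these elements can be expressed as a Selector value (a CSS selector). In Chrome's developer tools, right-click the element and choose Copy, then Copy selector. With that value, the select function of BeautifulSoup in bs4 can quickly locate the corresponding content in the page source.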

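3. Converting a txt document to an EPUB document

According to the Pandoc guide at https://pandoc.org/epub.html, the following points need attention when converting to EPUB:

a. Prefix the book title and the author name with a "%" sign;

b. End every paragraph with two line breaks;

c. Prefix each chapter title with a "#" sign.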
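The three rules above can be sketched as a small helper that assembles Pandoc-ready text. The function name `build_pandoc_text` and the sample book data are hypothetical, introduced only for illustration. The sketch also leaves a blank line after the title block and after each heading, which Pandoc accepts.

```python
def build_pandoc_text(book_name, author, chapters):
    """Assemble Pandoc-ready text from (title, paragraphs) pairs.

    Hypothetical helper for illustration; not part of any library.
    """
    # Rule a: the title and author lines start with "%"
    parts = ['% ' + book_name + '\n', '% ' + author + '\n', '\n']
    for title, paragraphs in chapters:
        # Rule c: chapter titles start with "#"
        parts.append('# ' + title + '\n\n')
        for paragraph in paragraphs:
            # Rule b: every paragraph ends with two line breaks
            parts.append(paragraph + '\n\n')
    return ''.join(parts)


sample = build_pandoc_text(
    'Sample Book', 'Sample Author',
    [('Chapter 1', ['First paragraph.', 'Second paragraph.'])],
)
print(sample)
```

Keeping the formatting in one function means the scraping loop only has to collect titles and paragraphs.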

[Screenshots: the cover and Chapter 2 of the converted EPUB document]

Pandoc offers many more advanced options for EPUB conversion; see https://pandoc.org/demos.html

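II. Steps

1. Test scraping

First, create a Python file in VS Code and import the modules below. (Debugging Python code in VS Code is not covered here; plenty of tutorials are available online.)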

import requests
import codecs
import os
from bs4 import BeautifulSoup

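Taking the site https://qxs.la as an example, pick a novel and open the reading page of its first chapter. Copy the Selector of the novel title, of the content, and of the next-chapter link. The code below then gets the Tag for each of these three Selectors: requests.get first fetches the source of the whole page, and BeautifulSoup then parses it.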

link = 'https://qxs.la/262333/53712930/'

# get total html page
page = requests.get(link)
soup = BeautifulSoup(page.text, 'html.parser')
title = soup.select('body > div.text.t_c > h1')[0]
content = soup.select('#content')[0]
next_page_link = soup.select('#nextLink')[0]

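The argument to select is the selector value you copied. The contents of the resulting Tag variables can be inspected in the VS Code Debug Console and the Run side bar.

Next, extract the text from the title and content Tags and the link from the next_page_link Tag. Text can be obtained by calling the Tag's getText method:

title is a single line, so call it without arguments. content contains many paragraphs, so pass the separator argument; otherwise the paragraphs in the returned text are hard to tell apart.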

title = title.getText()
content = content.getText('\n')

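Next, get the link to the next page. Debugging shows that the value under the href key of next_page_link is what we need. Read it with the Tag's get method, then combine it with the site's base URL to form the address to request.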

next_page_link = next_page_link.get('href')
next_page_link = 'https://qxs.la' + next_page_link
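Plain string concatenation works here because the href starts with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative, absolute, and protocol-relative hrefs against the current page URL. The two sample hrefs below are assumed for illustration; they mirror the values that appear in the final code.

```python
from urllib.parse import urljoin

current_page = 'https://qxs.la/262333/53712930/'

# An href that starts with "/" is resolved against the site root
next_url = urljoin(current_page, '/262333/53712931/')

# A protocol-relative href ("//host/path") keeps the scheme of the current page
end_url = urljoin(current_page, '//qxs.la/end.htm?aid=262333&cid=65153984')

print(next_url)
print(end_url)
```

With urljoin, the end marker becomes a normal URL instead of a string with a doubled host name.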

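At this point one chapter has been scraped. All that remains is to request next_page_link with requests.get and keep looping. When should the loop stop? Open the last chapter of the novel in the browser, look at what its next-chapter link points to, and use that link as the loop's termination condition.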
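The loop described above can be sketched as a generator that follows next-chapter links until the end marker is reached. `crawl_chapters`, `fetch_chapter`, the `max_chapters` guard, and the in-memory `fake_site` are all hypothetical names, used so the sketch runs without network access. In a real script, `fetch_chapter` would wrap the requests.get call and the BeautifulSoup parsing.

```python
def crawl_chapters(first_link, end_link, fetch_chapter, max_chapters=10000):
    """Yield (title, content) pairs until the next link equals end_link.

    fetch_chapter(link) must return (title, content, next_link).
    max_chapters is a safety guard against an endless loop in case
    the end marker is never seen.
    """
    link = first_link
    count = 0
    while link != end_link and count < max_chapters:
        title, content, link = fetch_chapter(link)
        count += 1
        yield title, content


# In-memory stand-in for the website (made-up data, no network access)
fake_site = {
    '/1/': ('Chapter 1', 'text one', '/2/'),
    '/2/': ('Chapter 2', 'text two', '/end'),
}

chapters = list(crawl_chapters('/1/', '/end', fake_site.get))
print(chapters)
```

The guard matters because a changed end link on the site would otherwise keep the loop running indefinitely.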

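2. Processing the novel content

Looking at the scraped content, the ads in the first lines hurt the reading experience, and each line starts with two full-width space characters (\u3000). Since the goal is to convert the novel to EPUB, the scraped content first has to be put into the format described earlier: two line breaks after every paragraph and a "#" in front of every chapter title.

The ads occupy a fixed number of lines, so simply split content into lines, discard the leading ad lines, and strip each remaining line to remove surrounding whitespace. What is left is the clean novel text.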

# remove ad line
content = content.splitlines()[10:]
new_content = []
for line in content:
    new_content.append(line.strip())
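As a variation on the loop above, the cleaning step can be wrapped in a function. `clean_content` and its `ad_lines` parameter are hypothetical names; the default of 10 mirrors the slice used above and depends on the site's layout. Unlike the original loop, this sketch also drops lines that are empty after stripping.

```python
def clean_content(raw_text, ad_lines=10):
    """Drop the leading ad lines, strip whitespace, and skip empty lines."""
    lines = raw_text.splitlines()[ad_lines:]
    # str.strip() removes Unicode whitespace, including the full-width space
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]


raw = 'ad line 1\nad line 2\n\u3000\u3000First paragraph.\n\n\u3000\u3000Second paragraph.  '
paragraphs = clean_content(raw, ad_lines=2)
print(paragraphs)
```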

Then write the scraped content to a file in the format Pandoc expects for EPUB conversion, saved as UTF-8:

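# book title and author, as used in the final code
book_name = '外挂傍身的杂草'
book_author = '低调青年'

# write to file
f = codecs.open(book_name + '.txt', 'w', encoding='utf-8')
# write book name and author (Pandoc title block)
f.write('% ' + book_name + '\n')
f.write('% ' + book_author + '\n')

# write chapter heading
f.write('# ' + title + '\n')

# write content, one paragraph per line followed by a blank line
for line in new_content:
    f.write(line + '\n\n')

f.close()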

3. Converting the txt file to an EPUB file

This step is also simple: call the external executable from Python through os.system.

# convert txt to epub
cmd = 'pandoc.exe %s -o %s'%(book_name + '.txt', book_name + '.epub')
os.system(cmd)
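Building the command as a single string can break when the file name contains spaces. A hedged alternative is to pass an argument list to `subprocess.run` from the standard library. `build_pandoc_command` and `convert_to_epub` are hypothetical helper names, and the sketch assumes pandoc is on the PATH under the name `pandoc`.

```python
import shutil
import subprocess


def build_pandoc_command(txt_path, epub_path, executable='pandoc'):
    """Return the argument list for: pandoc <txt_path> -o <epub_path>."""
    return [executable, txt_path, '-o', epub_path]


def convert_to_epub(txt_path, epub_path):
    """Run pandoc; raise if it is missing or exits with an error."""
    if shutil.which('pandoc') is None:
        raise FileNotFoundError('pandoc was not found on PATH')
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(build_pandoc_command(txt_path, epub_path), check=True)


# Only the command is built here; nothing is executed
command = build_pandoc_command('my book.txt', 'my book.epub')
print(command)
```

Because each argument is a separate list item, no shell quoting is needed, and a failed conversion raises an exception instead of passing silently.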

Final code

import requests
import codecs
import re
import os
from bs4 import BeautifulSoup

book_name = '外挂傍身的杂草'
book_author = '低调青年'
site = 'https://qxs.la'

first_page_href = '/262333/53712930/'
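# end marker: links are built as site + href, so this must equal
# site + the href of the last chapter's "next" link
end_href = site + '//qxs.la/end.htm?aid=262333&cid=65153984'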
# only save 1 chapter for debug
# end_href = site + '/262333/53712931/'

link = site + first_page_href

# write to file
f = codecs.open(book_name + '.txt', 'w', encoding='utf-8')
# write book name and author
f.write('% ' + book_name + '\n')
f.write('% ' + book_author + '\n')

while link != end_href:
    # get total html page
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    title = soup.select('body > div.text.t_c > h1')[0]
    content = soup.select('#content')[0]
    next_page_link = soup.select('#nextLink')[0]

    title = title.getText()
    content = content.getText('\n')
    next_page_link = next_page_link.get('href')
    link = site + next_page_link

    # remove ad line
    content = content.splitlines()[10:]
    new_content = []
    for line in content:
        new_content.append(line.strip())

    # write chapter
    f.write('# ' + title + '\n')

    # write content without leading spaces: Pandoc's Markdown reader
    # treats lines indented by four spaces as a code block
    for line in new_content:
        f.write(line + '\n\n')

    print(title + ' has been saved!')

f.close()

# convert txt to epub
cmd = 'pandoc.exe %s -o %s'%(book_name + '.txt', book_name + '.epub')
os.system(cmd)

print("convert over!")

How it looks when read on a phone (screenshot):

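Afterword

Some sites run anti-crawler systems, and the content can only be fetched when a browser identifier (the User-Agent header) is passed to requests.get, as shown below: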

import requests
import codecs
import re
from bs4 import BeautifulSoup

site = 'https://bing.ioliu.cn'
url = site + '/ranking'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
}

html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

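Some sites limit the request rate per IP address. In that case, wait a random amount of time after each requests.get call before sending the next request.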
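A minimal sketch of such a randomized delay, using only the standard library. `fetch_with_delay`, the delay bounds, and the injected `fetch` and `sleep` callables are assumptions made for illustration; in a real script `fetch` would be a wrapper around requests.get.

```python
import random
import time


def fetch_with_delay(links, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(link) for each link, pausing a random time after each request."""
    results = []
    for link in links:
        results.append(fetch(link))
        # random.uniform returns a float between min_delay and max_delay
        sleep(random.uniform(min_delay, max_delay))
    return results


# Demo with stand-ins so nothing is downloaded and nothing really sleeps
recorded_delays = []
pages = fetch_with_delay(
    ['/1/', '/2/'],
    fetch=lambda link: 'page ' + link,
    sleep=recorded_delays.append,
)
print(pages, recorded_delays)
```

Suitable delay bounds depend on the site and are best chosen conservatively.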

Finally, this site actually lets you download its novels as txt files directly, so no crawler is needed.