Scraping a Novel with Python

Latest recommended article published 2024-08-07 09:00:00

qq_43350118


Tags: python, web scraping, html

Original link: https://mp.csdn.net/mp_blog/creation/editor/126488245

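This article walks through downloading a novel with a Python crawler: collecting the chapter links, parsing each chapter's content, and saving it to a file. It sets request headers and parses the HTML with BeautifulSoup to automate scraping of a novel site, then loops over all the chapter links to download the whole book.

Summary generated automatically by CSDN.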

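I have recently been learning a bit about web scraping and wanted some practice. After some thought I settled on downloading a novel, which also gives me something to read when I am bored.

The first step is a Baidu search. There are plenty of sites like this, so just search for the novel's title. Open any of the results and you have the URL.

Unlike a simple scraper, the first thing we need here is the URL of every chapter.

Right-click the page and choose "View page source" to see the same content as HTML. There is no need to understand the HTML in depth; it is enough to search it and find where the chapter links live.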

Here I found that the chapter links sit under the ul element with class = 'chapters', so let's try writing some code to collect them.

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': UserAgent().chrome
}
# The link has been altered; look up a working one yourself
url = 'https://www.pilokibook.com/0/19009/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
# Every <a> inside <ul class="chapters"> is a chapter link
links = soup.select('ul.chapters a')
for link in links:
    print(link.get('href'))

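The requests library sends the request, fake_useragent supplies a realistic User-Agent header, and BeautifulSoup extracts the links. At the end we have the link to every chapter. On most sites these links are relative to the current page, so they have to be completed into full URLs before use. With a link in hand we can fetch the chapter's content. Open any chapter and look at its source.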
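Completing a relative link can be done with `urljoin` from the standard library. The sketch below is only an illustration: the `hrefs` values are made up, and the real site may use a different link format.

```python
from urllib.parse import urljoin

# Index page URL from the snippet above (the author altered it, so it may not resolve)
base_url = 'https://www.pilokibook.com/0/19009/'

# Hypothetical href values of the three kinds a site might emit
hrefs = [
    '2044378.html',                                      # relative to the index page
    '/0/19009/2044379.html',                             # relative to the site root
    'https://www.pilokibook.com/0/19009/2044380.html',   # already absolute
]

# urljoin completes relative links and leaves absolute ones unchanged
full_urls = [urljoin(base_url, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)
```

Because absolute links pass through untouched, the same call is safe to apply whichever form the site happens to use.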

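Clearly this is the same kind of work. On every chapter page the body text is inside the element with class=novelcontent. That being so, we simply carry on.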

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().chrome
}
url = 'https://www.pifflibook.com/0/199/2044378.html'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
# The body text is in the div with class "novelcontent"
content = soup.find_all('div', class_='novelcontent')
# Parse that div's markup again and collect its <p> tags
soup_content = BeautifulSoup(str(content), 'lxml')
text = soup_content.find_all('p')
print(text)

The resulting text is the body we want, but its format is a bit odd. No matter: a few small replacements fix it.

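# Work on the string form of the list of <p> tags
text1 = str(text)
# Drop the leading residue (the trailing '一' matches the start of this particular chapter)
text2 = text1.replace('[<p>[</p>, <p>一', '\n  ')
# Each boundary between paragraphs becomes a line break plus indent
text3 = text2.replace('</p>, <p>', '\n  ')
# Drop the closing tag and bracket at the end
text4 = text3.replace('</p>]', '')
# Append the cleaned text to a UTF-8 file
with open('rerer.txt', 'a', encoding='utf-8') as f:
    f.write(text4)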
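The replacements above depend on the exact string form of the result list. As an alternative, here is a minimal sketch that reads the text of each `<p>` tag with `get_text()` instead. It is not the approach used in this post; the `sample_html` markup is made up, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded chapter page
sample_html = '''
<div class="novelcontent">
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
content = soup.find('div', class_='novelcontent')

# get_text() returns a tag's text with any nested markup removed
paragraphs = [p.get_text().strip() for p in content.find_all('p')]

# Same layout as the replace-based version: a line break and indent before each paragraph
chapter_text = ''.join('\n  ' + paragraph for paragraph in paragraphs)
print(chapter_text)
```

This avoids hard-coding fragments such as `'[<p>[</p>, <p>一'`, so it should keep working when a chapter starts with different text.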

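To sum up the approach: get the chapter links from the novel's index page, fetch the body text through each chapter link, and write it to a file. A single loop will do. All the main pieces are already in place, so I won't bother with threads, functions, or class methods; let's just crawl it directly.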

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

start = time.time()
headers = {
    'User-Agent': UserAgent().chrome
}
base_url = 'https://www.pilibookkk.com/0/19999/'
# Fetch and parse the source of the novel's index page
response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
chapters_url = soup.find_all('ul', class_='chapters')
soup_chapters = BeautifulSoup(str(chapters_url), 'lxml')
urls = soup_chapters.find_all('a')

# Collect the chapter links into chapters_urls;
# urljoin completes relative links and leaves absolute ones unchanged
chapters_urls = []
for link in urls:
    chapters_urls.append(urljoin(base_url, link.get('href')))

with open('人生海海.txt', 'a', encoding='utf-8') as f:
    for chapternum in range(len(chapters_urls)):
        response = requests.get(chapters_urls[chapternum], headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        content = soup.find_all('div', class_='novelcontent')
        soup_content = BeautifulSoup(str(content), 'lxml')
        text = soup_content.find_all('p')
        # Get the body text: strip the list's string form down to indented paragraphs
        text1 = str(text)
        text2 = text1.replace('[<p>[</p>, <p>一', '\n  ')
        text3 = text2.replace('</p>, <p>', '\n  ')
        text4 = text3.replace('</p>]', '')
        text5 = text4.replace('[<p>[', '')
        # Write the chapter heading ("Chapter N" in Chinese) and the text to the file
        f.write('第{}章'.format(chapternum + 1))
        f.write('\n')
        f.write(text5)
        f.write('\n\n')
        # Report progress as a percentage of chapters downloaded
        print('Downloaded {:.1%}'.format((chapternum + 1) / len(chapters_urls)))
end = time.time()

print('Download complete')
print('Elapsed: {:.1f} s'.format(end - start))

After a few small formatting tweaks, I was quite happy with the result of the run.
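The post deliberately skips functions, but the write loop could also be pulled into one. The following is a rough sketch under assumptions, not the code used here: `write_chapters`, `fetch_paragraphs`, `fake_fetch`, and the `delay` parameter are all hypothetical names introduced for illustration, and the stand-in fetcher makes the example run without touching the network.

```python
import io
import time


def write_chapters(chapter_urls, fetch_paragraphs, out, delay=0.0):
    """Write every chapter to `out` under a numbered heading.

    `fetch_paragraphs` is a hypothetical callable that takes a chapter URL
    and returns that chapter's paragraphs as a list of strings.
    """
    total = len(chapter_urls)
    for number, chapter_url in enumerate(chapter_urls, start=1):
        paragraphs = fetch_paragraphs(chapter_url)
        out.write('第{}章\n'.format(number))  # "Chapter N" heading, as in the script
        out.write('\n'.join('  ' + paragraph for paragraph in paragraphs))
        out.write('\n\n')
        print('Downloaded {:.1%}'.format(number / total))
        if delay:
            time.sleep(delay)  # optional pause between requests


# Stand-in fetcher for an offline run; a real one would download and parse the page
def fake_fetch(chapter_url):
    return ['text of ' + chapter_url]


buffer = io.StringIO()
write_chapters(['a.html', 'b.html'], fake_fetch, buffer)
result = buffer.getvalue()
```

Passing the fetcher in as an argument keeps the file-writing logic testable on its own, and a non-zero `delay` spaces the requests out if the site is slow or rate-limited.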