Python novel crawler for the Xiaoxiang Shuyuan (潇湘书院) website

Latest recommended article published on 2024-02-28 11:34:14

weixin_39552204


Article link: https://blog.csdn.net/weixin_39552204/article/details/114913906


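It has been a long time since I last wrote a crawler. I recently took on a project that scrapes novels, so I am writing it up here as practice. Crawlers may also come in handy at work later, for example to collect competitors' data for analysis.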

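Target site: Xiaoxiang Shuyuan (潇湘书院)

Environment:

python3

requests library

BeautifulSoup library

lxml (the parser passed to BeautifulSoup in the code in this post)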

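Overall approach

The goal is to scrape everything in the site's free section. The listing shows 6697 pages with 20 novels per page, so the plan is: first collect the 20 novel titles and URLs on each page, then open each novel's reading address to get every chapter's title and URL, and finally fetch the body text of each chapter and write it to a local txt file.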
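To give a rough sense of scale, the figures above can be worked out as follows. This is a minimal sketch rather than part of the original script: the constant and function names are made up for illustration, the pn query parameter is the one that appears in the full script later in this post, and whether the site numbers its pages from 0 or from 1 is not confirmed here.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = 'http://www.xxsy.net/search'  # listing URL used in this post
TOTAL_PAGES = 6697      # page count observed in the free section
BOOKS_PER_PAGE = 20     # novels shown on each listing page


def page_url(page_number):
    """Build the listing URL for one page of the free section."""
    query = urlencode({'vip': 0, 'sort': 2, 'pn': page_number})
    return BASE_SEARCH_URL + '?' + query


# Upper bound on the number of novels to visit (the last page may hold fewer).
max_books = TOTAL_PAGES * BOOKS_PER_PAGE

print(max_books)
print(page_url(3))
```

With that many novels, and many chapters per novel, the crawl issues a very large number of requests, which is why a daily limit and some form of retry handling are worth having.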


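Single-page analysis

Requests are sent with requests, and the page is parsed with BeautifulSoup, which is very convenient. Right-click a novel title and choose Inspect, then right-click the highlighted a tag and choose Copy selector. That selector is what locates the novel in the BeautifulSoup parse result.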

body > div.content > div > div > div.inner-mainbar > div.search-result > div.result-list > ul > li:nth-child(1) > div > h4 > a

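Here li:nth-child(1) clearly means that this novel is the first item in the page's list. We want all 20 novels on the page, so we drop :nth-child(1) and shorten the selector to the minimum that still matches: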

div.result-list > ul > li > div > h4 > a

Then put it into code and print the result.

import requests
from bs4 import BeautifulSoup

webdata = requests.get('http://www.xxsy.net/search?vip=0&sort=2')
soup = BeautifulSoup(webdata.text, 'lxml')
books = soup.select('div.result-list > ul > li > div > h4 > a')
print(books)

The result is a list of 20 elements, one per novel. What we want from each element is the link in its href and the novel's name, so we extract them in a loop.

...
for book in books:
    bookname = book.text
    bookurl = 'http://www.xxsy.net' + book.get('href')
    bookid = book.get('href').split('/')[-1].split('.')[0]
    print(bookname, bookurl, bookid)

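That easily gives us the titles and URLs of the 20 novels on a single page. Note that I pulled the string of digits out of the URL separately: it is the novel's unique ID, which is needed later.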
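The ID extraction can also be done with the standard library's URL helpers, which tolerate both relative and absolute hrefs. The sketch below is an alternative under assumptions, not the code this post uses: parse_book_link and SITE_ROOT are names invented here, and '/info/945608.html' is a hypothetical href shape chosen only to match the split logic above.

```python
import posixpath
from urllib.parse import urljoin, urlparse

SITE_ROOT = 'http://www.xxsy.net'


def parse_book_link(href):
    """Return (absolute_url, book_id) for a relative or absolute book href."""
    absolute_url = urljoin(SITE_ROOT, href)
    # Take the last path segment and drop its extension: '945608.html' -> '945608'.
    filename = posixpath.basename(urlparse(absolute_url).path)
    book_id = posixpath.splitext(filename)[0]
    return absolute_url, book_id


# '/info/945608.html' is a hypothetical href used only for this example.
print(parse_book_link('/info/945608.html'))
```

Parsing the URL first means a query string or fragment on the href does not end up inside the ID.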

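Chapter analysis

Click a novel to open its detail page and switch to the "作品目录" (table of contents) tab to see all of its chapters. Right-click a chapter and inspect it; with the same method as above, it is easy to get the name, URL, and chapter ID of every chapter in the novel.

To cover as many techniques as possible in one post, I will not use that page to get the chapter information and will go in through another entry point instead. Click "开始阅读" (Start reading) to open the body of the first chapter. On the left side of the page there is a "目录" (Contents) button; clicking it shows all the chapter names. However, right-clicking these chapter names reveals that the page has disabled the right mouse button, so we cannot inspect the element and therefore cannot copy its selector.

A button click like this presumably sends a request to the server. Open Fiddler to capture traffic and click the Contents button again. The request is now captured; select it to view the details. This endpoint returns all the chapter information.

The actual request URL is http://www.xxsy.net/partview/GetChapterListNoSub?bookid=945608&isvip=0, where bookid is the bookid we found earlier. It can be parameterized so that the IDs of other novels are passed in one after another.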
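Parameterizing the endpoint might look like the sketch below. The helper name chapter_list_url and the constant are invented for illustration; the endpoint and its two query parameters are the ones captured above, and what values isvip accepts other than 0 is not something this post verifies.

```python
from urllib.parse import urlencode

CHAPTER_LIST_ENDPOINT = 'http://www.xxsy.net/partview/GetChapterListNoSub'


def chapter_list_url(book_id, is_vip=0):
    """Build the chapter-list URL for one novel ID."""
    query = urlencode({'bookid': book_id, 'isvip': is_vip})
    return CHAPTER_LIST_ENDPOINT + '?' + query


print(chapter_list_url('945608'))
```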

...
url = 'http://www.xxsy.net/partview/GetChapterListNoSub?bookid=945608&isvip=0'
titles = BeautifulSoup(requests.get(url).text, 'lxml').select('ul > li > a')
for title in titles:
    titlename = title.text  # chapter name
    titleurl = 'http://www.xxsy.net' + title.get('href')  # chapter URL
    print(titlename, titleurl)

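Downloading the body text

The titleurl obtained in the previous step is the reading address of each chapter. Request it directly with requests and parse the response to get the body text.

contents = BeautifulSoup(requests.get(titleurl).text, 'lxml').select('div#auto-chapter > p')  # body text

The result is a list in which each element is one paragraph. Strip the unneeded tags around each paragraph and write it to a local txt file. The file is opened in a+ mode, so each write appends to the end without overwriting earlier content, and the file is named automatically after the novel name obtained earlier.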

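...
for content in contents:
    # Strip the surrounding paragraph tags; each paragraph starts on a new line.
    content = str(content).replace('<p>', '\n').replace('</p>', '')
    # Open with an explicit encoding so the text does not depend on the system default.
    with open(path + '%s.txt' % bookname, 'a+', encoding='utf-8') as f:
        f.write(content)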
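Because the file is named after the novel, a title containing characters that Windows forbids in file names would make open() fail. The sketch below shows one way to guard against that; safe_filename and append_text are hypothetical helpers that do not appear in the original script, and replacing bad characters with an underscore is just one possible policy.

```python
import os
import re
import tempfile

# Characters that Windows does not allow in file names.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(bookname):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = _ILLEGAL_CHARS.sub('_', bookname).strip()
    return cleaned or 'untitled'


def append_text(folder, bookname, text):
    """Append text to <folder>/<bookname>.txt as UTF-8 and return the path."""
    target = os.path.join(folder, safe_filename(bookname) + '.txt')
    with open(target, 'a', encoding='utf-8') as f:
        f.write(text)
    return target


# Demo in a temporary folder so nothing is written to the real novel directory.
with tempfile.TemporaryDirectory() as demo_dir:
    append_text(demo_dir, 'a/b: demo?', 'first paragraph\n')
    demo_path = append_text(demo_dir, 'a/b: demo?', 'second paragraph\n')
    demo_name = os.path.basename(demo_path)
    with open(demo_path, encoding='utf-8') as f:
        demo_text = f.read()

print(demo_name)
```

If file names are sanitized this way, the check for already-downloaded novels has to compare against the sanitized name as well.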

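At this point all three steps are done and we can successfully scrape a chapter's content. Now we combine the steps with a few for loops, and the script can download novels one after another.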

# coding=utf-8
import os
import time

import requests
from bs4 import BeautifulSoup

path = r'C:\Users\lipei\Desktop\潇湘书院爬虫\小说\\'  # local folder where the novels are stored


def get_books(url):
    webdata = requests.get(url, timeout=60)
    soup = BeautifulSoup(webdata.text, 'lxml')
    books = soup.select('div.result-list > ul > li > div > h4 > a')
    for book in books:
        bookid = book.get('href').split('/')[-1].split('.')[0]  # novel ID, used as a parameter in the next request URL
        bookname = book.text  # novel name
        bookurl = 'http://www.xxsy.net' + book.get('href')  # novel URL
        if bookname + '.txt' in oldlists:  # skip novels that were already downloaded
            continue
        print('=====================Downloading [' + bookname + ']=====================')
        url = 'http://www.xxsy.net/partview/GetChapterListNoSub?bookid=%s&isvip=0' % bookid
        titles = BeautifulSoup(requests.get(url, timeout=60).text, 'lxml').select('ul > li > a')
        for title in titles:
            titleurl = 'http://www.xxsy.net' + title.get('href')  # chapter URL
            titlename = title.text  # chapter name
            # print(titlename, titleurl)
            try:
                # write the chapter name to the txt file
                with open(path + '%s.txt' % bookname, 'a+', encoding='utf-8') as f:
                    f.write('\n' * 2 + titlename + '\n')
            except OSError:
                pass
            contents = BeautifulSoup(requests.get(titleurl, timeout=60).text, 'lxml').select('div#auto-chapter > p')
            for content in contents:
                content = str(content).replace('<p>', '\n').replace('</p>', '')
                try:
                    # write the body text to the txt file, right below the chapter name
                    with open(path + '%s.txt' % bookname, 'a+', encoding='utf-8') as f:
                        f.write(content)
                except OSError:
                    pass
            print(titlename + ' [downloaded]')


if __name__ == "__main__":
    # each page of the free section is selected by the trailing pn parameter
    urls = ['http://www.xxsy.net/search?vip=0&sort=2&pn={}'.format(i) for i in range(6697)]
    oldlists = os.listdir(path)  # files already in the folder before the crawl starts
    for url in urls:
        get_books(url)
        # Limit how much is crawled per day: stop once the quota is reached,
        # judged by how many files were added to the folder since the task started.
        newlists = os.listdir(path)
        num = len(newlists) - len(oldlists)
        if num >= 20:
            print("Today's task is done: %d novels downloaded today" % num)
            break

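Final notes

During testing I found that sending requests continuously over a short period can get the connection refused by the server, although the IP is not banned. This may be the site's only anti-crawling measure, and it is not effective: put the request in an infinite loop, reconnect automatically whenever it is refused, and break out of the loop once the connection succeeds.

If I later run into a site that bans IPs, I will cover how to get around that by rotating IPs.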

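...
while True:
    try:
        webdata = requests.get(url, timeout=60)
        break  # connected successfully, so leave the loop
    except requests.exceptions.RequestException:
        time.sleep(3)  # wait a moment before reconnecting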
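An infinite loop never gives up, so if the site goes down the script hangs forever. A bounded variant with a growing wait is sketched below as one possible alternative, not the approach this post uses: retry is a hypothetical helper, the attempt count and doubling delay are arbitrary choices, and flaky_fetch merely stands in for the network call so the example runs without touching the site.

```python
import time


def retry(action, attempts=5, delay=3.0, exceptions=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Waits `delay` seconds after the first failure and doubles the wait each
    time. Re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


# Demo with a stand-in for the network call: it fails twice, then succeeds.
calls = {'count': 0}
waits = []  # records the requested waits instead of actually sleeping


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('refused')
    return 'page body'


result = retry(flaky_fetch, attempts=5, delay=3.0,
               exceptions=(ConnectionError,), sleep=waits.append)
print(result, calls['count'], waits)
```

With requests, the call would be along the lines of retry(lambda: requests.get(url, timeout=60), exceptions=(requests.exceptions.RequestException,)).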

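If anything here no longer works, please leave a comment. When reposting, please credit the original article: Python novel crawler for the Xiaoxiang Shuyuan website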
