A Detailed Walkthrough of Scraping a Novel with Python

qq_41871270

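Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. Reposts must include a link to the original source and this notice.

Article link: https://blog.csdn.net/qq_41871270/article/details/80027473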

The novel to scrape is at http://www.biquw.com/book/29142/

Step 1: fetch the URL and parse the page (with BeautifulSoup)

start_url="http://www.biquw.com/book/29142/"

url=start_url+str(11987333)+'.html'

html=requests.get(url,timeout=15)

soup=BeautifulSoup(html.content,'lxml')  # html.text would also work here

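Inspecting the page shows that the chapter text sits in the tag with id='htmlContent'.

The title sits in the tag with class='h1title'.

So we look up every tag with id='htmlContent' or class='h1title' and take its content: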

title=soup.find_all('div',class_='h1title')

content=soup.find_all('div',id='htmlContent')

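s=''.join('%s' %id for id in content)  # why this instead of s=''.join(content)?

join() only accepts strings. The list returned by find_all holds Tag objects rather than strings, so passing it to join() directly raises a TypeError; each element has to be converted to a string first, which is what '%s' %id does. The same applies to a list that contains numbers.

Reference on join: https://blog.csdn.net/laochu250/article/details/67649210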
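The join behavior described above can be sketched without any network access. `FakeTag` and `join_as_text` are hypothetical names introduced only for this illustration; `FakeTag` stands in for whatever non-string objects the list holds.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag: not a str, but convertible to one."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def join_as_text(items):
    """Convert every item to str first, then concatenate the results."""
    return ''.join(str(item) for item in items)


items = [FakeTag('<div>chapter text</div>'), FakeTag('<div>more</div>')]

try:
    ''.join(items)  # str.join rejects items that are not strings
    plain_join_failed = False
except TypeError:
    plain_join_failed = True

joined = join_as_text(items)
```

Using `str(item)` is equivalent to the `'%s' % item` form used in the post for these objects; either one performs the conversion that join() will not do on its own.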

After running this, a '<br/>' is still left in the output. My first idea was to use replace to get rid of the '<br/>',

so I ran:

s.replace('<br/>','')

print(s)

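But the result was not what I expected: the <br/> was still there.

Then I realized how silly I had been.

In Python, strings are immutable objects. replace does not modify the string in place; it creates and returns a new one. The returned string has to be assigned back to a variable.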
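A minimal sketch of the difference between discarding and rebinding the result of replace. The sample string is made up for the illustration.

```python
s = 'line one<br/>line two'

# The return value is thrown away here, so s is left untouched.
s.replace('<br/>', '')
unchanged = s

# Binding the return value to a name keeps the cleaned copy.
cleaned = s.replace('<br/>', '')
```

In the scraper the cleaned copy is simply assigned back to the same name, as in `s=s.replace('<br/>','')`.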

Now those annoying <br/> tags are finally gone.

The title turns out to have leftover tags as well,

so remove them with replace too:

s=''.join('%s' %id for id in content)

t=''.join('%s' %id for id in title)

s=s.replace('<br/>','')

t=t.replace('<div class="h1title">','')

t=t.replace('</h1>','')

t=t.replace('</div>','')
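As an alternative to the chain of replace calls, BeautifulSoup can return just the text of a tag through get_text(). The sketch below is not the approach used in this post, and the sample markup is a hypothetical reconstruction based only on the id and class names mentioned above; it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page described in the post.
sample = ('<div class="h1title"><h1>Chapter 1</h1></div>'
          '<div class="contentbox clear" id="htmlContent">'
          'First line.<br/>Second line.</div>')

soup = BeautifulSoup(sample, 'html.parser')
title_tag = soup.find('div', class_='h1title')
content_tag = soup.find('div', id='htmlContent')

# get_text() drops all markup; strip=True trims whitespace around each piece.
title_text = title_tag.get_text(strip=True)
# The first argument is the separator placed between text pieces,
# so each <br/> effectively becomes a newline.
content_text = content_tag.get_text('\n', strip=True)
```

This avoids having to list every tag by hand, at the cost of depending on how the real page nests its text.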

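The next step is a for loop over all the chapters.

URL of the first chapter: http://www.biquw.com/book/29142/11987333.html

URL of the last chapter: http://www.biquw.com/book/29142/11989832.html

Only the last four digits differ, so let's loop over that range.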

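Looks fine, right? Then open the file we wrote and take a look.

What? Where did all these question marks come from?

replace to the rescue again. What are they? They are spaces, specifically the non-breaking space \xa0 (&nbsp; in HTML), at least as far as I understand it. So replace them with s=s.replace('\xa0','').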
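A small sketch of what \xa0 is and one plausible way it turns into a question mark. The ASCII encoding below is only an illustration of a character that the target encoding cannot represent; the post does not say which encoding was actually involved.

```python
import unicodedata

nbsp = '\xa0'
nbsp_name = unicodedata.name(nbsp)  # official Unicode name of the character

raw = '\xa0\xa0Opening paragraph.'

# A character the target encoding cannot represent is replaced by '?'
# when errors='replace' is used.
lossy = raw.encode('ascii', errors='replace')

# Removing the character beforehand, as the post does, avoids the problem.
cleaned = raw.replace('\xa0', '')
```

Writing the file with an encoding that can represent the character, such as UTF-8, is another way to avoid the substitution.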

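That fixed it, and everything seemed to be done. But, hmm... there is one more problem: inefficiency. Yes, inefficiency, because for i in range(11987333,11989832): requests every single number in the range, while the chapter numbers are not consecutive.

If they were consecutive, the URL of chapter 2 would end in 11987334, but it actually ends in 11987335.

Chapter 2 is 2 higher than chapter 1. And chapter 3? It is 3 higher than chapter 2:

http://www.biquw.com/book/29142/11987338.html

But never mind, I'm tired anyway.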
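One way around the gaps in the numbering is to read the chapter links from the book's index page instead of guessing the numbers. The sketch below uses only the standard library and runs on a made-up index snippet: the assumption that the index page links to chapters with hrefs ending in digits plus ".html" is not confirmed by the post, and `ChapterLinkParser` and `extract_chapter_urls` are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect links whose last path segment looks like '<digits>.html'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        if re.search(r'(^|/)\d+\.html$', href):
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.urls:            # keep page order, skip repeats
                self.urls.append(url)


def extract_chapter_urls(index_html, base_url):
    parser = ChapterLinkParser(base_url)
    parser.feed(index_html)
    return parser.urls


# Made-up index markup, assumed for illustration only.
sample_index = '''
<ul>
<li><a href="11987333.html">Chapter 1</a></li>
<li><a href="11987335.html">Chapter 2</a></li>
<li><a href="/book/29142/11987338.html">Chapter 3</a></li>
<li><a href="/about.html">About</a></li>
</ul>
'''

chapter_urls = extract_chapter_urls(sample_index,
                                    'http://www.biquw.com/book/29142/')
```

With a list like this, the main loop would issue one request per real chapter rather than one per number in the range.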

Finally, the full source code:

import requests
from bs4 import BeautifulSoup

start_url="http://www.biquw.com/book/29142/"

# range() excludes its end value, so add 1 to include the last chapter
for i in range(11987333,11989832+1):
    url=start_url+str(i)+'.html'
    html=requests.get(url,timeout=15)
    soup=BeautifulSoup(html.content,'lxml')
    title=soup.find_all('div',class_='h1title')
    content=soup.find_all('div',id='htmlContent')
    # find_all returns Tag objects, so convert each one to a string before joining
    s=''.join('%s' %tag for tag in content)
    t=''.join('%s' %tag for tag in title)
    # strip the leftover markup and the non-breaking spaces
    s=s.replace('<br/>','')
    s=s.replace('<div class="contentbox clear" id="htmlContent">','')
    s=s.replace('\xa0','')
    t=t.replace('<div class="h1title">','')
    t=t.replace('</h1>','')
    t=t.replace('</div>','')
    print(t)
    # the with block closes the file on exit, so no explicit f.close() is needed
    with open("召唤千军.txt",'a',encoding='utf-8') as f:
        f.write(t)
        f.write(s)
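The file-writing step can be sketched on its own, assuming UTF-8 output is acceptable. `append_chapter` is a hypothetical helper, and the temporary directory and sample strings exist only so the sketch can run anywhere.

```python
import os
import tempfile


def append_chapter(path, title, body):
    """Append one chapter to the file at path as UTF-8 text."""
    # The with block closes the file automatically when it exits.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(body + '\n\n')


out_path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
append_chapter(out_path, '第一章', '正文\xa0内容')
append_chapter(out_path, '第二章', '更多内容')

with open(out_path, encoding='utf-8') as f:
    saved = f.read()
```

Passing the encoding explicitly keeps the output independent of the platform default, and adding newlines keeps each title on its own line.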

Experts, please share your suggestions; I'm still a newbie. This took me quite a while, and I feel like my intelligence got crushed.
