Scraping the Novel Dream of the Red Chamber (红楼梦): A Worked Example

Scraping the novel

Target site: https://www.ddshu.net


On the site, Dream of the Red Chamber carries the numeric ID 148. Clicking through the first ten chapters, we can reason out the URL numbering:

  1. Chapters 1 through 5 map to 781449.html through 781453.html.

  2. At chapter 6, however, the numbering jumps: its page is 782003.html, a clear break in the sequence. Following the chapters onward, the IDs then run unbroken all the way to chapter 120 (782117.html).

  3. So we scrape starting from chapter 6, where the IDs form one contiguous run. The sketch below shows how to confirm the break programmatically.
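
To confirm the jump from 781453 to 782003 without opening every page by hand, we can probe a few candidate IDs around the suspected gap. A minimal sketch, assuming the site returns a non-200 status for pages that don't exist (the IDs probed here are illustrative):

import requests

# Probe page IDs around the suspected gap; 200 means the page exists.
# These IDs are illustrative -- widen the range if the boundary is unclear.
for page_id in (781453, 781454, 782002, 782003):
    url = 'https://www.ddshu.net/148_' + str(page_id) + '.html'
    resp = requests.get(url)
    print(page_id, resp.status_code)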

Scraping code

url_base = 'https://www.ddshu.net/148_'
for i in range(115):  # chapters 6 through 120: 115 chapters, 115 URLs
    print(url_base + str(i + 782003) + '.html')  # print every URL the loop generates
url_1 = 'https://www.ddshu.net/148_782003.html'  # keep chapter 6's URL as a sample for later steps
## Output:
> https://www.ddshu.net/148_782003.html
> https://www.ddshu.net/148_782004.html
> https://www.ddshu.net/148_782005.html
> https://www.ddshu.net/148_782006.html
> https://www.ddshu.net/148_782007.html
> https://www.ddshu.net/148_782008.html
> https://www.ddshu.net/148_782009.html
> https://www.ddshu.net/148_782010.html
> https://www.ddshu.net/148_782011.html
> https://www.ddshu.net/148_782012.html
> https://www.ddshu.net/148_782013.html
> https://www.ddshu.net/148_782014.html
> https://www.ddshu.net/148_782015.html
> https://www.ddshu.net/148_782016.html
> https://www.ddshu.net/148_782017.html
> https://www.ddshu.net/148_782018.html
> https://www.ddshu.net/148_782019.html
> https://www.ddshu.net/148_782020.html
> https://www.ddshu.net/148_782021.html
> https://www.ddshu.net/148_782022.html
> https://www.ddshu.net/148_782023.html
> https://www.ddshu.net/148_782024.html
> https://www.ddshu.net/148_782025.html
> https://www.ddshu.net/148_782026.html
> https://www.ddshu.net/148_782027.html
> https://www.ddshu.net/148_782028.html
> https://www.ddshu.net/148_782029.html
> https://www.ddshu.net/148_782030.html
> https://www.ddshu.net/148_782031.html
> https://www.ddshu.net/148_782032.html
> https://www.ddshu.net/148_782033.html
> https://www.ddshu.net/148_782034.html
> https://www.ddshu.net/148_782035.html
> https://www.ddshu.net/148_782036.html
> https://www.ddshu.net/148_782037.html
> https://www.ddshu.net/148_782038.html
> https://www.ddshu.net/148_782039.html
> https://www.ddshu.net/148_782040.html
> https://www.ddshu.net/148_782041.html
> https://www.ddshu.net/148_782042.html
> https://www.ddshu.net/148_782043.html
> https://www.ddshu.net/148_782044.html
> https://www.ddshu.net/148_782045.html
> https://www.ddshu.net/148_782046.html
> https://www.ddshu.net/148_782047.html
> https://www.ddshu.net/148_782048.html
> https://www.ddshu.net/148_782049.html
> https://www.ddshu.net/148_782050.html
> https://www.ddshu.net/148_782051.html
> https://www.ddshu.net/148_782052.html
> https://www.ddshu.net/148_782053.html
> https://www.ddshu.net/148_782054.html
> https://www.ddshu.net/148_782055.html
> https://www.ddshu.net/148_782056.html
> https://www.ddshu.net/148_782057.html
> https://www.ddshu.net/148_782058.html
> https://www.ddshu.net/148_782059.html
> https://www.ddshu.net/148_782060.html
> https://www.ddshu.net/148_782061.html
> https://www.ddshu.net/148_782062.html
> https://www.ddshu.net/148_782063.html
> https://www.ddshu.net/148_782064.html
> https://www.ddshu.net/148_782065.html
> https://www.ddshu.net/148_782066.html
> https://www.ddshu.net/148_782067.html
> https://www.ddshu.net/148_782068.html
> https://www.ddshu.net/148_782069.html
> https://www.ddshu.net/148_782070.html
> https://www.ddshu.net/148_782071.html
> https://www.ddshu.net/148_782072.html
> https://www.ddshu.net/148_782073.html
> https://www.ddshu.net/148_782074.html
> https://www.ddshu.net/148_782075.html
> https://www.ddshu.net/148_782076.html
> https://www.ddshu.net/148_782077.html
> https://www.ddshu.net/148_782078.html
> https://www.ddshu.net/148_782079.html
> https://www.ddshu.net/148_782080.html
> https://www.ddshu.net/148_782081.html
> https://www.ddshu.net/148_782082.html
> https://www.ddshu.net/148_782083.html
> https://www.ddshu.net/148_782084.html
> https://www.ddshu.net/148_782085.html
> https://www.ddshu.net/148_782086.html
> https://www.ddshu.net/148_782087.html
> https://www.ddshu.net/148_782088.html
> https://www.ddshu.net/148_782089.html
> https://www.ddshu.net/148_782090.html
> https://www.ddshu.net/148_782091.html
> https://www.ddshu.net/148_782092.html
> https://www.ddshu.net/148_782093.html
> https://www.ddshu.net/148_782094.html
> https://www.ddshu.net/148_782095.html
> https://www.ddshu.net/148_782096.html
> https://www.ddshu.net/148_782097.html
> https://www.ddshu.net/148_782098.html
> https://www.ddshu.net/148_782099.html
> https://www.ddshu.net/148_782100.html
> https://www.ddshu.net/148_782101.html
> https://www.ddshu.net/148_782102.html
> https://www.ddshu.net/148_782103.html
> https://www.ddshu.net/148_782104.html
> https://www.ddshu.net/148_782105.html
> https://www.ddshu.net/148_782106.html
> https://www.ddshu.net/148_782107.html
> https://www.ddshu.net/148_782108.html
> https://www.ddshu.net/148_782109.html
> https://www.ddshu.net/148_782110.html
> https://www.ddshu.net/148_782111.html
> https://www.ddshu.net/148_782112.html
> https://www.ddshu.net/148_782113.html
> https://www.ddshu.net/148_782114.html
> https://www.ddshu.net/148_782115.html
> https://www.ddshu.net/148_782116.html
> https://www.ddshu.net/148_782117.html

After printing, you can click a few of the URLs to confirm that the loop generates them correctly. The same check can also be automated, as sketched below.
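
A rough sketch of that automated check; a HEAD request is assumed to be enough here (fall back to a full GET if the server doesn't support it):

import requests

url_base = 'https://www.ddshu.net/148_'
for i in range(115):
    url = url_base + str(i + 782003) + '.html'
    status = requests.head(url, allow_redirects=True).status_code
    if status != 200:
        print('unexpected status', status, 'for', url)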
import requests  # import the requests library

req = requests.get(url_1)
req.status_code  # check the response status; 200 means the request succeeded
# Output: 200

req.encoding = req.apparent_encoding
## encoding is taken from the charset field of the HTTP response header; if the header has no charset field, it defaults to ISO-8859-1, which cannot decode Chinese text. apparent_encoding is instead inferred by analyzing the page content itself, so it is more accurate than encoding. When a page renders as mojibake, assign apparent_encoding to encoding.
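
To see the difference in practice, compare the two attributes before reassigning (the exact values depend on what the server sends):

import requests

req = requests.get('https://www.ddshu.net/148_782003.html')
print(req.encoding)           # taken from the Content-Type header; may be ISO-8859-1
print(req.apparent_encoding)  # inferred from the page bytes via charset detection
req.encoding = req.apparent_encoding  # use the inferred encoding so .text decodes correctly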

req.text  # inspect the decoded HTML

html=req.text

from bs4 import BeautifulSoup  # import the BeautifulSoup library

soup = BeautifulSoup(html, 'html.parser')
## soup = BeautifulSoup(markup, 'html.parser') tells BeautifulSoup to use the 'html.parser' parser. There are many alternatives, such as BeautifulSoup(markup, 'lxml'), BeautifulSoup(markup, 'lxml-xml'), and BeautifulSoup(markup, 'xml').
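
As a quick illustration of the parser argument (html.parser ships with Python; 'lxml' and the XML parsers need the lxml package installed separately):

from bs4 import BeautifulSoup

markup = '<div id="content">sample chapter text</div>'

soup_std = BeautifulSoup(markup, 'html.parser')  # built-in, no extra install
# soup_lxml = BeautifulSoup(markup, 'lxml')      # faster, requires: pip install lxml
print(soup_std.find('div', id='content').text)   # -> sample chapter text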

Inspecting the HTML we printed, we find the novel's text is contained in the div with id="content", so that is the block to extract.

soup.find_all('div', id='content')

len(soup.find_all('div', id='content'))  # confirm how many matching blocks there are
type(soup.find_all('div', id='content'))  # find_all returns a list-like ResultSet

content = soup.find_all('div', id='content')
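
Running those checks on a chapter page should come back roughly as follows, assuming each page holds exactly one content div (which the indexing with [0] later relies on):

content = soup.find_all('div', id='content')
print(len(content))          # expect 1: one content block per chapter page
print(type(content))         # <class 'bs4.element.ResultSet'>, a list-like container
print(content[0].text[:60])  # peek at the first 60 characters of the chapter text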

Putting the pieces above together, we can now write the complete scraping block:

import requests
from bs4 import BeautifulSoup

url_1 = 'https://www.ddshu.net/148_782003.html'
req = requests.get(url_1)
req.encoding = req.apparent_encoding  # avoid mojibake in the Chinese text
html = req.text

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('div', id='content')[0].text.replace('\n', ''))

Outputting the full text of chapters 6 through 120

Below, each chapter is scraped with its URL printed first, which neatly separates one chapter from the next:

for i in range(115):  # chapters 6 through 120: 115 chapters, 115 URLs
    url = url_base + str(i + 782003) + '.html'
    print(url)  # the URL line marks where each chapter begins
    req = requests.get(url)
    req.encoding = req.apparent_encoding
    html = req.text

    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all('div', id='content')[0].text.replace('\n', ''))
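
In practice you will likely want to save the chapters rather than just print them, and pause between requests so as not to hammer the server. A minimal sketch; the output filename and the one-second delay are arbitrary choices, not part of the original walkthrough:

import time
import requests
from bs4 import BeautifulSoup

url_base = 'https://www.ddshu.net/148_'

with open('hongloumeng_6_120.txt', 'w', encoding='utf-8') as f:
    for i in range(115):  # chapters 6 through 120
        url = url_base + str(i + 782003) + '.html'
        req = requests.get(url)
        req.encoding = req.apparent_encoding
        soup = BeautifulSoup(req.text, 'html.parser')
        text = soup.find_all('div', id='content')[0].text.replace('\n', '')
        f.write(url + '\n' + text + '\n\n')  # the URL line separates the chapters
        time.sleep(1)  # be polite: wait a second between requests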

