Python novel-scraping project overview: a hands-on beginner web-scraping project that scrapes novels from Xinbiquge (新笔趣阁)

Latest recommended article published on 2024-03-24 18:00:20

weixin_39892447


Article tags: Python novel-scraping project overview

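1. Inspecting the web page

Open the "All novels" page. It lists the novels we are going to scrape, which is enough to keep you reading for a long time.

2. Complete code with annotated comments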

import os
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"
}

# Root directory where the novels are saved
path = "./小说"
# Create the directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

# URL of the "all novels" index page
url = "http://www.xbiquge.la/xiaoshuodaquan/"
# Send the GET request
response = requests.get(url=url, headers=headers)
# Set the encoding explicitly; otherwise the text comes out garbled
response.encoding = "utf-8"
# Parse the page
data = BeautifulSoup(response.text, "html.parser")
# See Figure 1: collect every li under the novellist element
ul = data.find(class_="novellist").find_all("li")

# Loop over the novels
for li in ul:
    # See Figure 2
    link = li.find("a")
    # Novel title
    name = link.text
    # URL of the novel's detail page
    page_url = link["href"]
    # Build the per-novel directory. A separate variable is used here because
    # reassigning `path` would nest each novel inside the previous one.
    novel_path = path + "/" + name
    print("Scraping: " + name)
    if not os.path.exists(novel_path):
        os.mkdir(novel_path)

    # Request the detail page
    page_response = requests.get(url=page_url, headers=headers)
    page_response.encoding = "utf-8"
    page_data = BeautifulSoup(page_response.text, "html.parser")
    # See Figure 3: collect every dd under the dl
    dl = page_data.find("dl").find_all("dd")

    # Loop over the chapters
    for dd in dl:
        # See Figure 4
        chapter_link = dd.find("a")
        chapter = chapter_link.text
        chapter_url = "http://www.xbiquge.la" + chapter_link["href"]
        # Request each chapter page
        res = requests.get(url=chapter_url, headers=headers)
        res.encoding = "utf-8"
        try:
            # See Figure 5
            # Locate the chapter text with a CSS selector
            text = BeautifulSoup(res.text, "html.parser").select("#content")[0].text
        except IndexError:
            # The page has no #content element; skip this chapter
            continue
        # Normalize whitespace with a regex, then strip the site's ad text
        section_text = re.sub(r'\s+', '\r\n\t', text).strip('\r\n').replace("亲,点击进去,给个好评呗,分数越高更新越快,据说给新笔趣阁打满分的最后都找到了漂亮的老婆哦!手机站全新改版升级地址：http://m.xbiquge.la，数据和书签与电脑站同步，无广告清新阅读！","")
        # Save the chapter to a file
        with open(novel_path + "/" + chapter + ".txt", 'wb') as f:
            f.write(section_text.encode("UTF-8"))
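
The whitespace-replacement step near the end of the listing can be tried on its own, without touching the network. The following is a minimal sketch using only the standard `re` module; the helper name `clean_chapter_text`, the `ad` parameter, and the sample string are hypothetical and do not appear in the original script.

```python
import re

def clean_chapter_text(text, ad):
    """Turn each whitespace run into CRLF + tab, trim CR/LF at the ends, drop the ad."""
    body = re.sub(r"\s+", "\r\n\t", text).strip("\r\n")
    return body.replace(ad, "")

# Made-up input: two indented paragraphs followed by an ad marker
sample = "\n  para1\n\n  para2  AD\n"
cleaned = clean_chapter_text(sample, ad="AD")
print(repr(cleaned))
```

Note that `\s+` also matches ordinary spaces, so text with spaces between words would be broken onto a new line at every space. Chinese prose normally has no such spaces, which is why the approach suits this site.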


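3. Figures for reference

Figure 1: the li elements under the element with class novellist on the index page

Figure 2: the a tag inside each li, which holds the novel title and the detail-page URL

Figure 3: the dd elements under the dl on the detail page

Figure 4: the a tag inside each dd, which holds the chapter title and the chapter URL

Figure 5: the #content element that holds the chapter text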
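
The markup that the first two figures point to can be approximated from the selectors used in the code. The sketch below runs against made-up HTML: the sample markup, titles, and `example.com` links are assumptions, not the real page source. It also swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages; `LinkCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical markup mimicking the structure behind Figures 1 and 2
SAMPLE_INDEX = """
<div class="novellist">
  <ul>
    <li><a href="http://example.com/1/">Novel A</a></li>
    <li><a href="http://example.com/2/">Novel B</a></li>
  </ul>
</div>
"""

class LinkCollector(HTMLParser):
    """Collect (text, href) for every <a> nested inside an <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
        elif tag == "a" and self.in_li:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a tracked <a>
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.links.append(("".join(self.current_text), self.current_href))
            self.current_href = None
        elif tag == "li":
            self.in_li = False

collector = LinkCollector()
collector.feed(SAMPLE_INDEX)
print(collector.links)
```

Each tuple corresponds to what the script reads as `name` and `page_url` for one novel.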

4. Results
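
The script is meant to produce one directory per novel under `./小说`, with one `.txt` file per chapter. Novel and chapter titles are used directly as directory and file names, so a title containing a character such as `/` or `?` can make the write fail. Below is a minimal sketch of building those paths defensively; `safe_name` and `chapter_path` are hypothetical helpers that are not part of the original script, and the set of replaced characters is an assumption based on what Windows forbids in file names.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names (assumed set)
_UNSAFE = re.compile(r'[\\/:*?"<>|]')

def safe_name(title):
    """Replace characters that are invalid in file names with underscores."""
    return _UNSAFE.sub("_", title).strip()

def chapter_path(root, novel, chapter):
    """Return root/<novel>/<chapter>.txt without touching the file system."""
    return Path(root) / safe_name(novel) / (safe_name(chapter) + ".txt")

print(chapter_path("./小说", "Novel A", "Chapter 1: What?"))
```

Keeping the root directory fixed and deriving each novel's directory from it also avoids accumulating path segments from one novel to the next.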


Source: https://www.cnblogs.com/cy0628/p/14164440.html
