Scraping a Web Novel with Python

等离子带花奶酪蛋糕

Last modified: 2024-08-24 11:02:07


First published: 2023-07-21 15:54:52

Article link: https://blog.csdn.net/qq_42831302/article/details/131843441


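This article describes web scraping on Windows 10 with Visual Studio Code and Python 3.8. It uses User-Agent (UA) spoofing so the site does not identify the client as a crawler, fetches pages with the Requests module, and parses the HTML with BeautifulSoup to extract the needed content. It also shows how to save the scraped data and how to analyze a page's elements.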


Development environment: Windows 10

Editor: Visual Studio Code

Python version: Python 3.8.0

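Contents

Preface

I. Result

II. Background

1. UA spoofing

2. Installing and using the Requests module

3. Installing and using BeautifulSoup

4. Saving data

5. Analyzing page elements

III. Source code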

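Preface

This article was written on July 21, 2023, and the code worked at that time; it may stop working if the site changes its page structure. Corrections are welcome.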

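I. Result

II. Background

UA spoofing

UA (User-Agent) identifies the client that is making a request. When you visit a site in a browser, the User-Agent field under Request Headers contains the browser's name (for example, Chrome), which tells the server that the client is a browser. A request sent from a Python scraper identifies itself as a Python client by default (Requests sends a value of the form python-requests/<version>), and some sites reject such requests, so the scraper needs to spoof its User-Agent.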
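To make the difference concrete, the sketch below compares the default User-Agent of Requests with a browser-like one. It only builds the request locally and sends nothing over the network. The URL `https://example.com/` and the exact User-Agent string are placeholders chosen for illustration.

```python
import requests

# Default User-Agent that Requests sends when no header is supplied,
# of the form "python-requests/<installed version>".
default_ua = requests.utils.default_user_agent()

# Browser-like header. The exact string is illustrative; copy a real one
# from the Request Headers panel of your own browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/114.0.0.0 Safari/537.36"
}

# Build the request without sending it, to inspect the header we set.
# https://example.com/ is a placeholder URL.
prepared = requests.Request("GET", "https://example.com/", headers=headers).prepare()

print("default :", default_ua)
print("spoofed :", prepared.headers["User-Agent"])
```

In a real request the same dictionary is passed as `requests.get(url, headers=headers)`.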

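Installing and using the Requests module

# Install the module (run in a terminal)
pip install requests

# Import the module
import requests

# URL of the page to fetch (the Requests documentation is used as an example)
url = 'https://requests.readthedocs.io/projects/cn/zh_CN/latest/user/quickstart.html'
# Request the page with get()
req = requests.get(url)
# The text attribute holds the page source as a string
text = req.text
# Print the source
print(text)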

Installing and using BeautifulSoup

# Install the bs4 module (run in a terminal)
pip install bs4
# Install the lxml parser
pip install lxml

# Import BeautifulSoup
from bs4 import BeautifulSoup

# Parse the page source with the lxml parser into a BeautifulSoup object; text is the source fetched with requests
soup = BeautifulSoup(text,'lxml')
# Find the first <div> tag whose class attribute is "head_wrapper"
soup.find('div',class_= "head_wrapper")
# Return all <a> tags
soup.find_all('a')
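As a self-contained illustration of `find()` and `find_all()`, the sketch below parses a small hand-written HTML string instead of a downloaded page. The markup is hypothetical (its class names mirror those used in the source code at the end of this article), and it uses Python's built-in `html.parser` so that it runs even without lxml.

```python
from bs4 import BeautifulSoup

# A small hand-written page standing in for downloaded source (hypothetical markup).
html = """
<div class="nr_function"><h1>Chapter 1</h1></div>
<div class="novelcontent"><p>First paragraph.</p><p>Second paragraph.</p></div>
<a class="p4" href="/read/2.html">Next page</a>
<a href="/index.html">Index</a>
"""

# "html.parser" ships with Python; "lxml" can be used instead once it is installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None if nothing matches.
title = soup.find("div", class_="nr_function").find("h1").text

# find_all() returns a list of every match.
links = [a.get("href") for a in soup.find_all("a")]

# Check for None before reading attributes of an element that may be absent.
next_link = soup.find("a", class_="p4")
next_href = next_link.get("href") if next_link is not None else None

print(title)
print(links)
print(next_href)
```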

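Saving data

# File to write to (example name)
filename = 'demo.txt'
# Open the file and write to it; encoding='utf-8' keeps the Chinese text intact
with open(filename, "w", encoding="utf-8") as f:
    # write() writes a string to the file
    f.write('hello world 你好 \n')
# No f.close() is needed: the with statement closes the file automatically


# In w and a modes, open() creates the file automatically if it does not exist.
# Parameters: name is the file name, mode is the access mode
# w write only   r read only   a append   w+ read and write   r+ read and write   a+ read and append   wb+ read and write binary data
# Opening an existing file in w mode overwrites its previous contents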
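The difference between the `w` and `a` modes can be checked with a short sketch like the one below. It writes to a temporary directory so nothing is left on disk; the file name `demo.txt` is just an example.

```python
import os
import tempfile

# Work in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # example file name

    # "w" creates the file, or empties it if it already exists.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\n")
    with open(path, "w", encoding="utf-8") as f:
        f.write("second 你好\n")  # replaces "first"

    # "a" keeps the existing content and appends to the end.
    with open(path, "a", encoding="utf-8") as f:
        f.write("third\n")

    # "r" opens the file for reading only.
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

print(content)
```

This is why the scraper in the source code below opens its output file in an append mode: each chapter is added after the previous one instead of replacing it.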

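Analyzing page elements

The usual approach is to collect every chapter link from the novel's index page, work out the element structure of a single chapter page, and then loop over the links to save each chapter. This site, however, paginates chapters so heavily that scraping only the chapters linked from the index page yields just part of the content. It is simpler to open the first chapter's page and repeatedly follow the "next page" link.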
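The "follow the next-page link" idea can be sketched without touching the network. In the example below, the `PAGES` dictionary is a hypothetical in-memory stand-in for the site, `crawl()` and its `fetch` parameter are illustrative names, and the class names `novelcontent` and `p4` are borrowed from the source code in the next section. It is a minimal sketch under those assumptions, not the article's actual implementation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "site": URL -> page source. A real crawler would
# download each page with requests.get() instead.
PAGES = {
    "http://example.com/read/1.html":
        '<h1>Chapter 1</h1><div class="novelcontent">aaa</div>'
        '<a class="p4" href="/read/1_2.html">next</a>',
    "http://example.com/read/1_2.html":
        '<h1>Chapter 1 (2)</h1><div class="novelcontent">bbb</div>'
        '<a class="p4" href="/read/2.html">next</a>',
    "http://example.com/read/2.html":
        '<h1>Chapter 2</h1><div class="novelcontent">ccc</div>',
}


def crawl(start_url, fetch):
    """Follow next-page links from start_url and return (title, text) pairs."""
    results = []
    seen = set()  # guards against a cycle of links looping forever
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Collect the content first, then look for the next link.
        results.append((soup.find("h1").text,
                        soup.find("div", class_="novelcontent").text))
        next_link = soup.find("a", class_="p4")
        # urljoin resolves a relative href against the current page URL.
        url = urljoin(url, next_link["href"]) if next_link is not None else None
    return results


chapters = crawl("http://example.com/read/1.html", PAGES.__getitem__)
for title, text in chapters:
    print(title, text)
```

Collecting the content before looking for the next link means the final page is still recorded when it has no next-page link.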

Source code

import requests
from bs4 import BeautifulSoup
from datetime import datetime

# Pose as a browser by sending a browser User-Agent
headers = {
#    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.50',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.82'
}

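# Define the nol class
class nol:
    # Fetch and save chapters in a loop
    def save_content(self):

        # List of page URLs to scrape (replace the entry with the first chapter page of the novel you want)
        circle_url = ['http://m.zgzl.net/read_uad/rfuyg.html']
        # Base URL of the site
        base_url = 'http://m.zgzl.net'
        # Output file name (change as needed)
        save_path = '和前女友上综艺gl.txt'

        # Iterate over circle_url; the list grows as next-page links are appended
        for url in circle_url:
            # Download the page, sending the browser-like headers defined above
            response_1 = requests.get(url, headers=headers)
            # Prevent garbled Chinese characters
            response_1.encoding = 'utf-8'
            # Parse the page source with the lxml parser
            detail_soup = BeautifulSoup(response_1.text, "lxml")

            try:
                # Extract the chapter title
                div_title = detail_soup.find('div', class_='nr_function').find('h1')
                # Extract the chapter body
                div_content = detail_soup.find('div', class_='novelcontent')
                # Get the next-page link
                nextpage_href = detail_soup.find(name='a', class_='p4').get('href')
                # Build the absolute URL
                nextpage_url = base_url + nextpage_href
                # Append it to the list so the loop visits it next
                circle_url.append(nextpage_url)

                # Pause for a random 0 to 5 seconds (requires import time and import random)
                #time.sleep(random.randint(0,5))

                # Open the file in append mode and write to it
                with open(save_path, 'a+', encoding='utf-8') as f:
                    # Write the chapter title
                    f.write(div_title.text)
                    # Write the chapter body
                    f.write(div_content.text)
                    # The with statement closes the file; no f.close() is needed
                print("Scraped ----" + div_title.text + "---- successfully")
            except Exception as e:
                # Any failure above (typically a missing next-page link) ends the crawl
                print("-----------------All chapters scraped-----------------")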
    # Timestamp helper
    def time(self):
        # Get the current date and time
        currentDateAndTime = datetime.now()
        # Format it as year-month-day hour:minute:second
        currentTime = currentDateAndTime.strftime("%Y-%m-%d %H:%M:%S")
        # Return a string such as 2023-07-21 14:47:17
        return currentTime

if __name__ == '__main__':
    # Create an instance of the nol class
    n = nol()
    # Record the start time
    startTime = n.time()
    print("Start time ----------"+startTime)
    # Run the scraper
    n.save_content()
    # Record the end time
    endTime = n.time()

    # Convert the strings back to datetime objects
    st = datetime.strptime(startTime,"%Y-%m-%d %H:%M:%S")
    et = datetime.strptime(endTime,"%Y-%m-%d %H:%M:%S")
    print("End time ----------"+endTime)
    print("Elapsed: "+str(et-st))
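The script above formats both timestamps as strings and parses them back before subtracting, which discards anything below one second. An alternative sketch, not the author's code, keeps the `datetime` objects and subtracts them directly; the `time.sleep(0.1)` call is only a stand-in for the crawl.

```python
import time
from datetime import datetime

start = datetime.now()
time.sleep(0.1)  # stand-in for the actual crawl
end = datetime.now()

# Subtracting two datetime objects yields a timedelta.
elapsed = end - start

print("Start time ----------" + start.strftime("%Y-%m-%d %H:%M:%S"))
print("End time ----------" + end.strftime("%Y-%m-%d %H:%M:%S"))
print("Elapsed: " + str(elapsed))
```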