Scraping Web Novels with a Python Crawler

Latest recommended article published on 2024-08-19 22:50:52

空城机

Article link: https://blog.csdn.net/qq_36171287/article/details/90180862

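What a web crawler is:

A crawler is essentially a program (a script).
It automatically collects, in bulk, the text, images, and other resources we need.
In the vast majority of cases (roughly 99%), it works by simulating a browser that visits web pages automatically.

Python is a common first choice for writing crawlers.

Network resources: web pages, images, video, audio, files

URL: Uniform Resource Locator (a web address)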
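
To make the parts of a URL concrete, here is a minimal sketch that splits the novel index address downloaded later in this post, using only the standard library. It is an illustration, not part of the crawler itself.

```python
from urllib.parse import urlsplit

# The novel index URL that the crawler in this post downloads.
parts = urlsplit('http://www.linlida.com/0_646/')

print(parts.scheme)  # protocol used to fetch the resource
print(parts.netloc)  # host that serves it
print(parts.path)    # location of the resource on that host
```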

Install Python, then install the requests module from the command line with pip install requests.

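A Python crawler that downloads a web page

Create a new Python project with the following code:

import requests
# Download a web page

url = 'http://www.linlida.com/0_646/'
# Send an HTTP request, as a browser would
response = requests.get(url)

# Prints the Response object, e.g. <Response [200]> on success
print(response)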
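
A bare requests.get call waits indefinitely if the server never answers, and it returns error pages as if they were normal ones. The sketch below wraps the call in a helper that sets a timeout and raises on HTTP errors. The name fetch, the User-Agent string, and the 10-second timeout are assumptions made for illustration, not values from the original post.

```python
import requests

# Assumed values for illustration; adjust them for the site being crawled.
HEADERS = {'User-Agent': 'Mozilla/5.0'}
TIMEOUT_SECONDS = 10


def fetch(url):
    """Download a page and return the Response, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT_SECONDS)
    # Turn 4xx/5xx status codes into an exception instead of returning an error page.
    response.raise_for_status()
    return response
```

With this helper, response = fetch(url) could stand in for requests.get(url) in the examples that follow.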

To print the page source as text, use response.text:

import requests
# Download a web page

url = 'http://www.linlida.com/0_646/'
# Send an HTTP request, as a browser would
response = requests.get(url)

print(response.text)

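The output is garbled at this point, so the encoding has to be changed.

First check which encoding requests is currently using and which one it detects from the page content:

print(response.encoding)           # encoding taken from the HTTP headers
print(response.apparent_encoding)  # encoding guessed from the response body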

Then set the encoding manually to match the page. Most tutorials online switch to utf-8 to fix garbled text, but this site needs gbk.

import requests

# Download a web page
url = 'http://www.linlida.com/0_646/'
# Send an HTTP request, as a browser would
response = requests.get(url)
# Set the encoding
#response.encoding = "utf-8"
response.encoding = "gbk"
print(response.text)
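
To see why the wrong codec produces garbled text, the sketch below decodes the same bytes twice without contacting the site. The sample string is made up, and ISO-8859-1 is used as the wrong codec on the assumption that the response headers declared no charset, a case in which requests may fall back to it.

```python
# Sample text standing in for a page body; the site itself is not contacted.
raw = '第一章 小说'.encode('gbk')   # bytes as a gbk server would send them

garbled = raw.decode('iso-8859-1')  # wrong codec: every byte becomes one Latin-1 character
correct = raw.decode('gbk')         # right codec: the original text comes back

print(garbled)
print(correct)
```

Setting response.encoding = "gbk" has the same effect on response.text as the second decode call here.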

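Steps for developing a crawler

Identify the target data: the website and the page

Analyze how the data is loaded: find the URL that corresponds to the target data

Download the data
Clean and process the data
Persist the data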
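
The steps above can be sketched as a small pipeline. The download step is replaced by a hard-coded sample string so the sketch runs offline. The function names (extract, clean, persist, run) and the sample markup are hypothetical and are not the code used later in this post.

```python
import os
import re
import tempfile

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = '<ul><li>&nbsp;First item </li><li>Second&nbsp;item</li></ul>'


def extract(html):
    """Pull the raw target data out of the page source."""
    return re.findall(r'<li>(.*?)</li>', html)


def clean(item):
    """Replace HTML non-breaking-space entities and trim whitespace."""
    return item.replace('&nbsp;', ' ').strip()


def persist(items, path):
    """Write one cleaned item per line to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(item + '\n')


def run(html, path):
    """Extract, clean, and persist in order, returning the cleaned items."""
    items = [clean(item) for item in extract(html)]
    persist(items, path)
    return items


# Write into a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'items.txt')
    items = run(SAMPLE_HTML, path)
    with open(path, encoding='utf-8') as f:
        saved = f.read()

print(items)
print(saved)
```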

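A small crawler that downloads a novel

Building on the page-download program above, we only need to collect the href of every chapter's a tag; these are the chapter URLs.

The chapter links sit inside a dl tag.

Use the re module to extract the dl.

The following statement alone will not find the dl, because . does not match newline characters by default:

dl = re.findall(r'<dl>.*?</dl>',html)
print(dl)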

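Pass re.S as an extra argument so that . matches every character, including newlines. The code now looks like this:

import requests
import re

# Download a web page
url = 'http://www.linlida.com/0_646/'
# Send an HTTP request, as a browser would
response = requests.get(url)
# Set the encoding
#response.encoding = "utf-8"
response.encoding = "gbk"
# Source of the novel's index page
html = response.text
# Get the dl block that holds each chapter's info (URL, name)
dl = re.findall(r'<dl>.*?</dl>',html,re.S)
print(dl)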
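
The effect of re.S can be checked offline on a small multi-line string. The markup below is a made-up sample, and the chapter path in it is hypothetical.

```python
import re

# Hypothetical sample standing in for the downloaded page source.
html = '<dl>\n<dd><a href="/0_646/1.html">Chapter 1</a></dd>\n</dl>'

without_flag = re.findall(r'<dl>.*?</dl>', html)
with_flag = re.findall(r'<dl>.*?</dl>', html, re.S)

print(without_flag)  # empty list: "." cannot cross the newline after <dl>
print(with_flag)     # one match spanning all three lines
```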

findall returns a list, so add [0] to take the matched string out of it:

dl = re.findall(r'<dl>.*?</dl>',html,re.S)[0]
print(dl)

Next, extract each chapter's URL and name:

# Get each chapter's info (URL, name)
dl = re.findall(r'<dl>.*?</dl>',html,re.S)[0]
chapter_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>',dl)
print(chapter_list)
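
When a pattern contains two capture groups, re.findall returns a list of two-element tuples, in the order the groups appear in the pattern. The sketch below shows this on a made-up chapter list; the real page's markup and paths may differ.

```python
import re

# Hypothetical chapter list; the real page's markup may differ.
dl = ('<dl>'
      '<dd><a href="/0_646/1.html">Chapter 1</a></dd>'
      '<dd><a href="/0_646/2.html">Chapter 2</a></dd>'
      '</dl>')

chapter_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', dl)

# Each tuple unpacks as (url, name) because href is the first group.
for chapter_url, chapter_name in chapter_list:
    print(chapter_name, chapter_url)
```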

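At this point we have the chapter names and URLs.

The novel's title can be extracted the same way and used to name the txt file that stores the novel:

# Get the novel's title ([0] takes the string out of the list findall returns)
title = re.findall(r'<h1>(.*?)</h1>',html)[0]
print(title)

Code to create the txt file:

# Create a txt file to store the novel
fb = open('%s.txt' %title,'w',encoding='utf-8')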
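
A title scraped from a page can contain characters that are not valid in file names, in which case open would fail. One possible guard is sketched below. The helper safe_filename and the sample title are hypothetical and not part of the original code.

```python
import re


def safe_filename(title):
    """Replace characters that are not allowed in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


# Made-up title containing three problematic characters.
print(safe_filename('A/B: C?'))
```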

# Loop over the chapters (first 10) and print each (url, name) tuple
for chapter_info in chapter_list[:10]:
    print(chapter_info)

# Loop over the chapters (first 10) and build each chapter's absolute URL
for chapter_info in chapter_list[:10]:
    chapter_url,chapter_name = chapter_info
    chapter_url = 'http://www.linlida.com'+chapter_url
    print(chapter_name,chapter_url)
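
String concatenation works as long as every href starts with a slash. As an alternative sketch, urljoin from the standard library resolves an href against the index page URL whether it is root-relative or page-relative. The two href values below are hypothetical.

```python
from urllib.parse import urljoin

base_url = 'http://www.linlida.com/0_646/'

root_relative = urljoin(base_url, '/0_646/1.html')  # href starting at the site root
page_relative = urljoin(base_url, '2.html')         # href relative to the index page

print(root_relative)
print(page_relative)
```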

Download the chapter content:

# Loop over the chapters and download each one (only the first chapter here, for testing)
for chapter_info in chapter_list[:1]:
    chapter_url,chapter_name = chapter_info
    chapter_url = 'http://www.linlida.com%s' % chapter_url
    print(chapter_name,chapter_url)
    # Download the chapter page
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = 'gbk'
    # Extract the chapter text
    chapter_text = re.findall(r'<div id="content">(.*?)</div>',chapter_response.text,re.S)[0]
    # Clean the chapter text: remove spaces, &nbsp; entities, and <br/> tags
    chapter_text = chapter_text.replace(' ','')
    chapter_text = chapter_text.replace('&nbsp;', '')
    chapter_text = chapter_text.replace('<br/>', '')
    print(chapter_text)
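
The replace calls above remove the line breaks along with the tags, so paragraphs run together. A variant that keeps one paragraph per line is sketched below. The helper clean_chapter and the sample fragment are hypothetical, and the real chapter markup may need other rules.

```python
import re
from html import unescape


def clean_chapter(raw):
    """Turn an extracted chapter fragment into plain text, one paragraph per line."""
    text = re.sub(r'<br\s*/?>', '\n', raw)  # <br>, <br/> and <br /> all become newlines
    text = unescape(text)                   # &nbsp; becomes U+00A0, &amp; becomes &
    text = text.replace('\xa0', '')         # drop the non-breaking spaces used for indentation
    lines = [line.strip() for line in text.split('\n')]
    return '\n'.join(line for line in lines if line)


# Made-up fragment imitating the inside of the content div.
sample = '&nbsp;&nbsp;First line.<br/><br />&nbsp;&nbsp;Second &amp; last.<br>'
cleaned = clean_chapter(sample)
print(cleaned)
```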

Next, write chapter_text to the txt file.

The full code so far:

# -*- coding: utf-8 -*-
import requests
import re

# Download the novel's index page
url = 'http://www.linlida.com/0_646/'
# Send an HTTP request, as a browser would
response = requests.get(url)
# Set the encoding (this site needs gbk rather than utf-8)
#response.encoding = "utf-8"
response.encoding = "gbk"
# Source of the novel's index page
html = response.text
# Get each chapter's info (URL, name)
dl = re.findall(r'<dl>.*?</dl>',html,re.S)[0]
chapter_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>',dl)
# Get the novel's title ([0] takes the string out of the list findall returns)
title = re.findall(r'<h1>(.*?)</h1>',html)[0]
print(title)
# Create a txt file to store the novel; the with block closes it when done
with open('%s.txt' %title,'w',encoding='utf-8') as fb:
    # Loop over the chapters and download each one (first 10 chapters)
    for chapter_info in chapter_list[:10]:
        chapter_url,chapter_name = chapter_info
        chapter_url = 'http://www.linlida.com%s' % chapter_url

        # Download the chapter page
        chapter_response = requests.get(chapter_url)
        chapter_response.encoding = 'gbk'
        # Extract the chapter text
        chapter_text = re.findall(r'<div id="content">(.*?)</div>',chapter_response.text,re.S)[0]
        # Clean the chapter text: remove spaces, &nbsp; entities, and <br/> tags
        chapter_text = chapter_text.replace(' ','')
        chapter_text = chapter_text.replace('&nbsp;', '')
        chapter_text = chapter_text.replace('<br/>', '')
        # Persist: write the chapter name, then the chapter text
        fb.write(chapter_name)
        fb.write('\n')
        fb.write(chapter_text)
        fb.write('\n')

Result in the txt file: