Dependencies
import requests
import re
from retry import retry
Steps
First, you need a novel site that is actually reachable.
Example: crawl the novel 武炼巅峰 from http://www.xbiquge.la/0/10/.
Right-click the page and view its source.
Set the response encoding and extract the novel's title:
url = 'http://www.xbiquge.la/0/10/'
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text
print(html)  # optional: inspect the raw page source
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
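As a quick sanity check, the same title pattern can be exercised on a tiny inline snippet; the HTML below is a made-up stand-in for the real page, not the site's actual source:

```python
import re

# Hypothetical fragment mimicking the real page's <meta> tag.
html = '<head><meta property="og:novel:book_name" content="武炼巅峰"/></head>'

# Same pattern as above: non-greedy capture of the content attribute.
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
print(title)  # 武炼巅峰
```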
Create a local txt file to save the novel:
fb = open('%s.txt' % title, 'w', encoding='utf-8')
Get the chapter list. Inspecting the source shows the chapter links sit inside the div with id `list`, with each chapter inside an `<a ...>` tag. Different novel sites mark this up differently, so adjust the patterns to match your target site.
dl = re.findall(r'<div id="list">.*?</div>', html, re.S)[0]
chapter_info_list = re.findall(r"<a href='(.*?)' >(.*?)</a>", dl)
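The two patterns can be verified against a small made-up fragment that mimics the site's chapter-list markup (the hrefs and chapter titles below are hypothetical):

```python
import re

# Hypothetical fragment in the same shape as the site's chapter list.
html = ('<div id="list"><dl>'
        "<dd><a href='/0/10/1.html' >第一章</a></dd>"
        "<dd><a href='/0/10/2.html' >第二章</a></dd>"
        '</dl></div>')

# Step 1: non-greedy grab of the list div (re.S lets . span newlines).
dl = re.findall(r'<div id="list">.*?</div>', html, re.S)[0]
# Step 2: pull out (href, title) pairs from the anchor tags.
chapter_info_list = re.findall(r"<a href='(.*?)' >(.*?)</a>", dl)
print(chapter_info_list)  # [('/0/10/1.html', '第一章'), ('/0/10/2.html', '第二章')]
```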
On some networks, requests may time out. Simply re-requesting on every failure works, but hammering the site with unbounded retries risks getting your IP banned, so keep the timeout and retry behavior sensible.
@retry()
def make_trouble(url):
    '''Retry until the request succeeds.'''
    out = requests.get(url, timeout=10)  # timeout is in seconds
    if out.status_code == 200:
        return out
    print('retrying...')
    # Raise so the @retry decorator triggers another attempt.
    raise requests.RequestException('bad status: %d' % out.status_code)
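If you'd rather not depend on the third-party `retry` package, the same idea can be sketched with only the standard library. The `flaky` function below is a hypothetical stand-in for the HTTP request, wired to fail twice before succeeding:

```python
import time

def retry_call(func, tries=3, delay=0.0):
    """Call func(), retrying on any exception, up to `tries` attempts."""
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception:
            if attempt == tries:
                raise          # out of attempts: propagate the error
            time.sleep(delay)  # back off before the next attempt

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('simulated timeout')
    return 'ok'

result = retry_call(flaky, tries=5)
print(result)  # ok (after two failed attempts)
```

Bounding `tries` (rather than retrying forever) is one simple way to avoid the IP-ban risk mentioned above.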
Then request each URL in **chapter_info_list** in turn and scrape the chapter content:
for chapter_url, chapter_title in chapter_info_list:
    chapter_url = "http://www.xbiquge.la%s" % chapter_url
    chapter_response = make_trouble(chapter_url)
    # chapter_response = requests.get(chapter_url, timeout=10)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)
    if len(chapter_content) > 0:
        chapter_content = chapter_content[0]
        # Strip the site's &nbsp; indentation and <br /> line breaks.
        chapter_content = chapter_content.replace('&nbsp;', '')
        chapter_content = chapter_content.replace('<br />', '')
        fb.write(chapter_title)
        fb.write('\n')
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url, chapter_title)
    else:
        fb.write(chapter_title + '缺失')
        fb.write('\n')
fb.close()
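The content-extraction and cleanup steps inside the loop can be checked offline against a made-up chapter fragment. One small variation here: `<br />` is turned into a newline instead of being dropped, which preserves paragraph breaks; the snippet itself is hypothetical:

```python
import re

# Hypothetical chapter page fragment in the site's typical markup.
chapter_html = '<div id="content">&nbsp;&nbsp;第一段<br /><br />&nbsp;&nbsp;第二段</div>'

# Same extraction as in the loop above.
chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)[0]
chapter_content = chapter_content.replace('&nbsp;', '')   # strip entity indentation
chapter_content = chapter_content.replace('<br />', '\n')  # keep paragraph breaks
print(chapter_content)  # 第一段\n\n第二段
```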
GAME OVER
Nice! The crawl is now running; just wait for it to finish.