Dependencies
import requests
import re
from retry import retry
Steps
First, you need a novel site that is actually reachable.
Example: crawl the novel 武炼巅峰 from http://www.xbiquge.la/0/10/.
Right-click the page and view its source.
Set the response encoding and extract the novel's title:
url = 'http://www.xbiquge.la/0/10/'
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text
print(html)  # optional: inspect the raw page source
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
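As a quick sanity check, the same title pattern can be exercised on a tiny inline snippet; the HTML below is a made-up stand-in for the real page, not the site's actual source:

```python
import re

# Hypothetical fragment mimicking the real page's <meta> tag.
html = '<head><meta property="og:novel:book_name" content="武炼巅峰"/></head>'

# Same pattern as above: non-greedy capture of the content attribute.
title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
print(title)  # 武炼巅峰
```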
Create a local txt file to save the novel:
fb = open('%s.txt' % title, 'w', encoding='utf-8')
Get the chapter list. Inspecting the source shows the chapter links sit inside the div with id `list`, with each chapter inside an `<a ...>` tag. Different novel sites mark this up differently, so adjust the patterns to match your target site.
dl = re.findall(r'<div id="list">.*?</div>', html, re.S)[0]
chapter_info_list = re.findall(r"<a href='(.*?)' >(.*?)</a>", dl)
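The two patterns can be verified against a small made-up fragment that mimics the site's chapter-list markup (the hrefs and chapter titles below are hypothetical):

```python
import re

# Hypothetical fragment in the same shape as the site's chapter list.
html = ('<div id="list"><dl>'
        "<dd><a href='/0/10/1.html' >第一章</a></dd>"
        "<dd><a href='/0/10/2.html' >第二章</a></dd>"
        '</dl></div>')

# Step 1: non-greedy grab of the list div (re.S lets . span newlines).
dl = re.findall(r'<div id="list">.*?</div>', html, re.S)[0]
# Step 2: pull out (href, title) pairs from the anchor tags.
chapter_info_list = re.findall(r"<a href='(.*?)' >(.*?)</a>", dl)
print(chapter_info_list)  # [('/0/10/1.html', '第一章'), ('/0/10/2.html', '第二章')]
```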
On some networks, requests may time out. Simply re-requesting on every failure works, but hammering the site with unbounded retries risks getting your IP banned, so keep the timeout and retry behavior sensible.
@retry()
def make_trouble(url):
    '''Retry until the request succeeds.'''
    out = requests.get(url, timeout=10)  # timeout is in seconds
    if out.status_code == 200:
        return out
    print('retrying...')
    # Raise so the @retry decorator triggers another attempt.
    raise requests.RequestException('bad status: %d' % out.status_code)
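If you'd rather not depend on the third-party `retry` package, the same idea can be sketched with only the standard library. The `flaky` function below is a hypothetical stand-in for the HTTP request, wired to fail twice before succeeding:

```python
import time

def retry_call(func, tries=3, delay=0.0):
    """Call func(), retrying on any exception, up to `tries` attempts."""
    for attempt in range(1, tries + 1):
        try:
            return func()
        except Exception:
            if attempt == tries:
                raise          # out of attempts: propagate the error
            time.sleep(delay)  # back off before the next attempt

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('simulated timeout')
    return 'ok'

result = retry_call(flaky, tries=5)
print(result)  # ok (after two failed attempts)
```

Bounding `tries` (rather than retrying forever) is one simple way to avoid the IP-ban risk mentioned above.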
Then request each URL in **chapter_info_list** in turn and scrape the chapter content:
for chapter_url, chapter_title in chapter_info_list:
    chapter_url = "http://www.xbiquge.la%s" % chapter_url
    chapter_response = make_trouble(chapter_url)
    # chapter_response = requests.get(chapter_url, timeout=10)
    chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)
    if len(chapter_content) > 0:
        chapter_content = chapter_content[0]
        # Strip the site's &nbsp; indentation and <br /> line breaks.
        chapter_content = chapter_content.replace('&nbsp;', '')
        chapter_content = chapter_content.replace('<br />', '')
        fb.write(chapter_title)
        fb.write('\n')
        fb.write(chapter_content)
        fb.write('\n')
        print(chapter_url, chapter_title)
    else:
        fb.write(chapter_title + '缺失')
        fb.write('\n')
fb.close()
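The content-extraction and cleanup steps inside the loop can be checked offline against a made-up chapter fragment. One small variation here: `<br />` is turned into a newline instead of being dropped, which preserves paragraph breaks; the snippet itself is hypothetical:

```python
import re

# Hypothetical chapter page fragment in the site's typical markup.
chapter_html = '<div id="content">&nbsp;&nbsp;第一段<br /><br />&nbsp;&nbsp;第二段</div>'

# Same extraction as in the loop above.
chapter_content = re.findall(r'<div id="content">(.*?)</div>', chapter_html, re.S)[0]
chapter_content = chapter_content.replace('&nbsp;', '')   # strip entity indentation
chapter_content = chapter_content.replace('<br />', '\n')  # keep paragraph breaks
print(chapter_content)  # 第一段\n\n第二段
```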
GAME OVER
Nice! The crawl is now running; just wait for it to finish.