This post documents the problems I ran into while scraping a novel from Fanqie Novel (番茄小说) and how I worked around them.
First, open the first chapter of the novel I want.
1. Find the network endpoint

Right-click → Inspect, clear the network log, and refresh. The document-type request 6893843740742386183 has the request URL https://fanqienovel.com/reader/6893843740742386183, and the site applies no extra protection to it, so a plain requests call with a spoofed User-Agent works. The code for this part:
Comparing the first three chapter URLs shows the chapter IDs have no direct relationship to one another, so the next URL cannot be derived from the current one:

```python
import requests

cap01_url = 'https://fanqienovel.com/reader/6893843740742386183?enter_from=reader'
# cap02_url = 'https://fanqienovel.com/reader/6893843740834660878?enter_from=reader'
# cap03_url = 'https://fanqienovel.com/reader/6893843740910158344?enter_from=reader'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
response = requests.get(url=cap01_url, headers=headers)
print(response.text)
```
2. Next, extract the content with XPath

*(Ignore the garbled output for now; we'll fix it shortly.)*

Looking at the DOM hierarchy, the XPath `//div[@class="muye-reader-content noselect"]/div//p` selects the chapter body, and `//h1[@class="muye-reader-title"]` selects the chapter title. Let's grab both first:
```python
from lxml import etree

tree = etree.HTML(response.text)
title = tree.xpath('//h1[@class = "muye-reader-title"]/text()')
content_tags = tree.xpath('//div[@class="muye-reader-content noselect"]/div//p/text()')
print(len(content_tags))
for content_tag in content_tags:
    print(content_tag)
print(type(title))
```
The extracted content comes out garbled.
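A quick check (a sketch reusing `content_tags` from the snippet above) confirms the odd characters are codepoints in Unicode's Private Use Area (U+E000 to U+F8FF), which ordinary fonts have no glyphs for:

```python
# Collect every character from the extracted text that falls in the
# Private Use Area; these only render correctly with the site's custom font.
pua = {c for tag in content_tags for c in tag if 0xE000 <= ord(c) <= 0xF8FF}
print(len(pua), sorted(hex(ord(c)) for c in pua)[:10])
```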
Analyzing the page structure shows that some characters fall outside the normal encoding range, which suggests the font is hiding something. After locating the site's web font file and downloading it, opening it in FontForge reveals glyphs only from e3e8 to e55b. The conclusion: Fanqie loads the text with two fonts, and once a character falls inside this special range, the other (custom) font takes over.
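The same range can also be verified programmatically by dumping the font's character map. A minimal sketch with fontTools, assuming the downloaded font was saved locally as `fanqie_font.woff` (hypothetical filename; a .woff2 file additionally needs the brotli package installed):

```python
from fontTools.ttLib import TTFont

font = TTFont('fanqie_font.woff')   # hypothetical local path to the downloaded font
cmap = font.getBestCmap()           # {codepoint: glyph name}
codepoints = sorted(cmap)
print(hex(codepoints[0]), hex(codepoints[-1]))  # expect 0xe3e8 and 0xe55b
print(len(codepoints))              # number of obfuscated glyphs
```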
The decoding routine looks like this (CODE_ST and CODE_ED are the decimal values of 0xE3E8 and 0xE55B, and charset is the mapping table; all three are defined in the full script at the end):

```python
def interpreter(cc):
    # Subtract the start of the range (0xE3E8) to index into the
    # second font's character table
    bias = cc - CODE_ST
    if charset[bias] == '?':  # '?' marks an unmapped slot; keep the original character
        return chr(cc)
    return charset[bias]

# Decode the chapter content
print(title)
content = []
for content_tag in content_tags:
    para = ''
    for char in content_tag:
        cc = ord(char)
        if CODE_ST <= cc <= CODE_ED:
            para += interpreter(cc)
        else:
            para += char  # append the character itself, not its code point
    content.append(para)
print(content)
for para in content:   # fp is the output file handle (opened in the full script)
    fp.write(' ')
    fp.write(para)
    fp.write('\n')
```
At this point the natural next step would be to look for the hyperlink behind the "next chapter" button at the bottom of the page, but Fanqie has anti-scraped that too: the button carries no useful information.
The only option is the network panel. After clicking next chapter, one of the requests returns JSON that contains not only the chapter content but also the next chapter's **ItemID**. So we parse that out and splice it into the URL, replacing the original id. *Note: some of the query parameters in the original URL can be dropped without affecting the response; the final URL looks like* https://fanqienovel.com/api/reader/full?itemId=6893843740834660878

Note: experiments show the query parameter here is itemId, not nextItemId; the value of nextItemId from the JSON goes into the itemId parameter.
```python
cap01_url = 'https://fanqienovel.com/api/reader/full?itemId=6893843740834660878'
fp = open('人类不死以后.txt', 'a', encoding='utf-8')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'}
response = requests.get(url=cap01_url, headers=headers)
json_obj = response.json()  # parse the JSON body into a Python dict
next_id = json_obj['data']['chapterData']['nextItemId']  # unwrap the nested fields
next_url = 'https://fanqienovel.com/api/reader/full?itemId=' + str(next_id)
```
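For reference, the part of the JSON response this code relies on looks roughly like the sketch below. Only data, chapterData and nextItemId are confirmed by the code above; the content field name is an assumption based on the observation that the response also carries the chapter text:

```python
# Assumed (abridged) shape of the parsed response:
# {
#     "data": {
#         "chapterData": {
#             "content": "...",                     # chapter text (field name assumed)
#             "nextItemId": "6893843740910158344"   # id of the following chapter
#         }
#     }
# }
```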
At this point all the information we need is reachable; what's left is the loop logic and writing everything to a file.

The complete program, with some tidying, is below:
""" # Time: 2024/1/22/22:59 # Theme: 爬虫程序 # Author: 0zxm # E-mail: m15813109801@163.com # Dependencies: urllib.request, lxml """ import requests from lxml import etree CODE_ST = 58344 # 十六进制e3e8的十进制 CODE_ED = 58715 # 十六进制e55b的十进制 charset = ['D', '在', '主', '特', '家', '军', '然', '表', '场', '4', '要', '只', 'v', '和', '?', '6', '别', '还', 'g', '现', '儿', '岁', '?', '?', '此', '象', '月', '3', '出', '战', '工', '相', 'o', '男', '首', '失', '世', 'F', '都', '平', '文', '什', 'V', 'O', '将', '真', 'T', '那', '当', '?', '会', '立', '些', 'u', '是', '十', '张', '学', '气', '大', '爱', '两', '命', '全', '后', '东', '性', '通', '被', '1', '它', '乐', '接', '而', '感', '车', '山', '公', '了', '常', '以', '何', '可', '话', '先', 'p', 'i', '叫', '轻', 'M', '士', 'w', '着', '变', '尔', '快', 'l', '个', '说', '少', '色', '里', '安', '花', '远', '7', '难', '师', '放', 't', '报', '认', '面', '道', 'S', '?', '克', '地', '度', 'I', '好', '机', 'U', '民', '写', '把', '万', '同', '水', '新', '没', '书', '电', '吃', '像', '斯', '5', '为', 'y', '白', '几', '日', '教', '看', '但', '第', '加', '候', '作', '上', '拉', '住', '有', '法', 'r', '事', '应', '位', '利', '你', '声', '身', '国', '问', '马', '女', '他', 'Y', '比', '父', 'x', 'A', 'H', 'N', 's', 'X', '边', '美', '对', '所', '金', '活', '回', '意', '到', 'z', '从', 'j', '知', '又', '内', '因', '点', 'Q', '三', '定', '8', 'R', 'b', '正', '或', '夫', '向', '德', '听', '更', '?', '得', '告', '并', '本', 'q', '过', '记', 'L', '让', '打', 'f', '人', '就', '者', '去', '原', '满', '体', '做', '经', 'K', '走', '如', '孩', 'c', 'G', '给', '使', '物', '?', '最', '笑', '部', '?', '员', '等', '受', 'k', '行', '一', '条', '果', '动', '光', '门', '头', '见', '往', '自', '解', '成', '处', '天', '能', '于', '名', '其', '发', '总', '母', '的', '死', '手', '入', '路', '进', '心', '来', 'h', '时', '力', '多', '开', '己', '许', 'd', '至', '由', '很', '界', 'n', '小', '与', 'Z', '想', '代', '么', '分', '生', '口', '再', '妈', '望', '次', '西', '风', '种', '带', 'J', '?', '实', '情', '才', '这', '?', 'E', '我', '神', '格', '长', '觉', '间', '年', '眼', '无', '不', '亲', '关', '结', '0', '友', '信', '下', '却', '重', '己', '老', '2', '音', '字', 'm', '呢', '明', '之', '前', '高', 'P', 'B', '目', '太', 'e', '9', '起', '稜', '她', '也', 'W', '用', '方', '子', '英', '每', '理', '便', '西', '数', '期', '中', 'C', '外', '样', 'a', '海', '们', '任'] # 解析章节加密内容 def interpreter(cc): # 原字符减去e338获取到另一套字体的该编码字符 bias = cc - CODE_ST if charset[bias] == '?': # 特殊处理 return chr(cc) return charset[bias] cap_url = 'https://fanqienovel.com/api/reader/full?itemId=6893843740742386183' cap02_url = 'https://fanqienovel.com/reader/6893843740910158344' # cap02_url = 'https://fanqienovel.com/reader/6893843740834660878?enter_from=reader' # cap03_url = 'https://fanqienovel.com/reader/6893843740910158344?enter_from=reader' fp = open('人类不死以后.txt', 'a', encoding='utf-8') headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'} """第一章爬取""" content_url = 'http://fanqienovel.com/reader/6893843740742386183' response = requests.get(url=content_url, headers=headers) tree = etree.HTML(response.text) title = tree.xpath('//h1[@class = "muye-reader-title"]/text()') fp.write(title[0] + '\n') content_tags = tree.xpath('//div[@class="muye-reader-content noselect"]/div//p/text()') # 获取小说章节 # print(title) content = [] for content_tag in content_tags: para = '' for char in content_tag: cc = ord(char) if CODE_ST <= cc <= CODE_ED: ch = interpreter(cc) para += ch else: para += char # 这里应该是拼接字符,而不是其ASCII码 content.append(para) # print(content) print('正在下载第1章') for para in content: fp.write(' ') fp.write(para) fp.write('\n') index = 1 while True: # TODO:获取下一章节ID response = requests.get(url=cap_url, headers=headers) data = 
response.json() json_obj = response.json() # 解析JSON数据为Python字典 next_id = json_obj['data']['chapterData']['nextItemId'] # 解嵌套 next_id_url = 'https://fanqienovel.com/api/reader/full?itemId=' + str(next_id) next_content_url = 'http://fanqienovel.com/reader/' + str(next_id) # print(next_id_url) cap_url = next_id_url # 迭代更新获取下一章id的url # TODO:获取每章节内容 response = requests.get(url=next_content_url, headers=headers) tree = etree.HTML(response.text) title = tree.xpath('//h1[@class = "muye-reader-title"]/text()') fp.write(title[0] + '\n') content_tags = tree.xpath('//div[@class="muye-reader-content noselect"]/div//p/text()') # print(len(content_tags)) # 获取小说章节 content = [] for content_tag in content_tags: para = '' for char in content_tag: cc = ord(char) if CODE_ST <= cc <= CODE_ED: ch = interpreter(cc) para += ch else: para += char # 这里应该是拼接字符,而不是其ASCII码 content.append(para) # print(content) index += 1 for para in content: fp.write(' ') fp.write(para) fp.write('\n') print('正在下载第' + str(index) + '章')
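One refinement worth considering (a sketch; `fetch` is a hypothetical helper, not part of the original script): add a delay and simple retries, so the crawl is gentler on the server and survives transient network errors. Each `requests.get(...)` call in the loop above could then be replaced by `fetch(...)`:

```python
import time

def fetch(url, retries=3):
    """GET with a polite pause and simple exponential backoff on failure."""
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            time.sleep(1)  # pause between chapters
            return r
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError('giving up on ' + url)
```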
This script still requires a VIP account to fetch locked chapters.

(Note: this program is for learning and reference only.)
Reference:
- https://www.bilibili.com/video/BV1Pj41197rh