This section is for personal study only.
Problem description
- Crawl the catalog page to get the list of chapters
- Fetch each chapter's page to get its content
Problem analysis
URL of chapter 1: https://fanqienovel.com/reader/6998108582835126784?enter_from=page
URL of chapter 2: https://fanqienovel.com/reader/6998109170566169092?enter_from=page
- The long number is each chapter's unique id
- Simplified, each chapter's URL is: https://fanqienovel.com/reader/ + id
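That URL pattern can be sketched as a one-line helper (the function name `chapter_url` is my own, not from the original script; the ids are the ones quoted above):

```python
# Prefix shared by every chapter page, per the analysis above
BASE = 'https://fanqienovel.com/reader/'

def chapter_url(chapter_id: str) -> str:
    """Join the fixed reader prefix with a chapter's numeric id."""
    return BASE + chapter_id

print(chapter_url('6998108582835126784'))
# https://fanqienovel.com/reader/6998108582835126784
```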
Press F12 to open DevTools and inspect where the content sits and which tags wrap it
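Once DevTools reveals the wrapping tags, CSS selectors can pull them out. A minimal offline sketch, using hand-written HTML that only mimics the catalog structure described above (assumed, not the site's real markup; `html.parser` is used here so the sketch needs no extra parser install, while the full script below uses lxml):

```python
from bs4 import BeautifulSoup

# Stand-in for the catalog page: each chapter is a <div class="chapter-item">
# holding an <a> whose href carries the chapter id (assumed structure)
html = '''
<div class="chapter-item"><a href="/reader/6998108582835126784">Chapter 1</a></div>
<div class="chapter-item"><a href="/reader/6998109170566169092">Chapter 2</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
# .select() takes a CSS selector; '.chapter-item' matches by class
for item in soup.select('.chapter-item'):
    print(item.a.string, item.a['href'])
```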
Code
# -*- coding:utf-8 -*-
# @Time   : 2022/7/24 19:13
# @Author : booozai
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://fanqienovel.com/page/6998103530665937956'
    # Spoof the User-Agent so the request looks like a normal browser
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
    # Create a BeautifulSoup object with the lxml parser
    soup = BeautifulSoup(response, 'lxml')
    # Grab every <div class="chapter-item"> in the catalog
    li_list = soup.select('.chapter-item')
    with open('./xiaoshuo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            # Chapter title
            title = li.a.string
            # Relative link carrying the chapter id
            href = li.a['href']
            # Build each chapter's full URL
            detail_url = 'https://fanqienovel.com' + href
            detail_response = requests.get(url=detail_url, headers=headers).text
            # Parse the chapter page
            detail_soup = BeautifulSoup(detail_response, 'lxml')
            # Get the <div class="muye-reader-content noselect"> holding the text
            page = detail_soup.find('div', class_="muye-reader-content noselect")
            # Extract the chapter text
            content = page.text
            # Write title and content to the file
            fp.write(title + ':' + '\n' + content + '\n')
            print(title + ' scraped successfully!')
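The script above fires requests back-to-back and assumes every one succeeds; on a long catalog a single timeout kills the whole run. A hedged sketch of a retry wrapper with a polite delay (`fetch_text` is my own helper name, not part of the original script):

```python
import time
import requests

def fetch_text(url, headers, retries=3, delay=1.0):
    """GET a URL and return its body text, retrying on transient failures."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)  # back off before retrying
```

Swapping each requests.get(...).text call in the loop for fetch_text(detail_url, headers) keeps a transient network hiccup from aborting the crawl, and the delay between retries is gentler on the server.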