Python crawler: scraping novels from the Qidian website

Latest recommended article published on 2023-09-21 20:51:25

saber_sss


Category: python crawler

Article link: https://blog.csdn.net/saber_sss/article/details/108464288


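This article shows how to use Python's bs4 and os modules to scrape information about xuanhuan (fantasy) novels from the Qidian website. It examines the endpoint https://www.qidian.com/xuanhuan, sets request headers to get past basic anti-scraping checks, wraps BeautifulSoup in a helper class, and collects novel links across multiple pages.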

Summary generated automatically by CSDN.

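Another crawler exercise in Python, this time mainly scraping xuanhuan (fantasy) novels.
Target site: Qidian
Modules used: bs4 and os (the code below also uses requests for the HTTP calls)
Basic approach:
Fetch the HTML of the page we need, load it into a bs4 object, and then work on that object.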
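
As a minimal sketch of this approach, the snippet below parses a small hand-written HTML string instead of a live page. Only the `book-list` class name comes from this article; the rest of the markup, the URLs, the titles, and the helper name `extract_links` are hypothetical, and the real page structure on Qidian may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="book-list">
  <ul>
    <li><a href="//book.example/info/1">Sample Book One</a></li>
    <li><a href="//book.example/info/2">Sample Book Two</a></li>
  </ul>
</div>
"""


def extract_links(html):
    """Return (title, href) pairs for every link inside a .book-list element."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for block in soup.select(".book-list"):   # CSS class selector
        for a in block.select("a"):           # all anchors inside the block
            links.append((a.get_text(strip=True), a.get("href")))
    return links


if __name__ == "__main__":
    for title, href in extract_links(SAMPLE_HTML):
        print(title, href)
```

Once the HTML is inside a `BeautifulSoup` object, every later step is a selector query on that object, which is why the rest of the article only needs to change the URL.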

First identify the endpoint: https://www.qidian.com/xuanhuan. The request method is GET.
Then fetch the full HTML of the xuanhuan page and load it into a bs4 object to work on:

import requests
from bs4 import BeautifulSoup

url = "https://www.qidian.com/xuanhuan"
method = 'get'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.qidian.com",
}
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
# print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')
# Select every element that carries the book-list class
xuanhuan = soup.select('.book-list')
print('book-list:', xuanhuan)
number = 0

The headers are a simple way of dealing with some basic anti-scraping checks.
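
To see what these headers change, the sketch below prepares a request without sending it and inspects the result. By default requests identifies itself with a `python-requests/...` user agent, which is easy for a site to flag; supplying a browser-like `user-agent` and a `Referer` replaces that. The helper name `build_request` is hypothetical, and whether these two headers are enough for Qidian today is an assumption.

```python
import requests

HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.qidian.com",
}


def build_request(url, headers=HEADERS):
    """Prepare (but do not send) a GET request so its headers can be inspected."""
    return requests.Request("GET", url, headers=headers).prepare()


if __name__ == "__main__":
    prepared = build_request("https://www.qidian.com/xuanhuan")
    # Header lookup is case-insensitive in requests.
    print(prepared.headers["User-Agent"])
    print(prepared.headers["Referer"])
```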
Because there are many pages and links to process, it is worth wrapping the BeautifulSoup setup in a class:

from bs4 import BeautifulSoup
import requests


class soupx:
    def soup(self, method, url):
        """Request url with the given HTTP method and return the parsed page."""
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Referer": "https://www.qidian.com",
        }
        res = requests.request(method, url, headers=headers)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        return soup
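
When the same wrapper is reused for many pages, it helps to pause between requests. The sketch below is one possible way to do that and is not taken from the article: `fetch_all`, its `fetch_html` parameter, and the one-second default delay are all hypothetical. The fetcher is passed in as a callable so the loop can be tried with a fake before pointing it at a real site.

```python
import time

from bs4 import BeautifulSoup


def fetch_all(urls, fetch_html, delay=1.0):
    """Parse each URL's HTML into a soup, pausing between requests.

    fetch_html is any callable mapping a URL to an HTML string (a hypothetical
    hook, so the network layer can be swapped or faked).
    """
    soups = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        soups[url] = BeautifulSoup(fetch_html(url), "html.parser")
    return soups


if __name__ == "__main__":
    # Fake fetcher: returns a tiny page whose title is the URL itself.
    def fake_fetch(url):
        return "<html><head><title>" + url + "</title></head></html>"

    pages = fetch_all(["page-1", "page-2"], fake_fetch, delay=0)
    for url, soup in pages.items():
        print(url, "->", soup.title.string)
```

With the class above, a real fetcher could be as simple as `lambda url: str(soupx().soup('get', url))`, although calling `soupx().soup` directly inside the loop would avoid parsing twice.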

Complete code:

import os
from reptile.soup4 import soupx
import time

path = 'D:/xiaoshuo/'
# Windows cannot create a directory that already exists, so check first
if os.path.exists(path):
    print('Directory already exists')
    flag = 1
else:
    os.makedirs(path)
    flag = 0


url = "https://www.qidian.com/xuanhuan"
method = 'get'
# headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64)","Referer":"https://www.qidian.com"}
soup = soupx().soup(method=method,url=url)
# The bs4 handling is wrapped in a class, so from here on we just call that module
# res = requests.get(url,headers=headers)
# res.encoding = 'utf-8'
# print(res.text)
# soup = BeautifulSoup(res.text,'html.parser')
xuanhuan = soup.select('.book-list')
print('book-list:',xuanhuan)
number = 0
for book in xuanhuan:
    # Get the daily, weekly, and monthly top-ten xuanhuan entries
    print('book:',book)
    soup1 = book.select('a')
    soup1.pop(1)
    soup1.pop(1)
    soup1.pop(1)
    number += 1
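
The listing stops here in the source, so what happens to the remaining links is not shown. To illustrate only the effect of the three `soup1.pop(1)` calls, the sketch below runs the same steps on hypothetical markup with five links; the link texts, the `href` values, and the helper name `links_after_pops` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical block with five links; the real page layout may differ.
SAMPLE_HTML = """
<div class="book-list">
  <a href="/a">first</a>
  <a href="/b">second</a>
  <a href="/c">third</a>
  <a href="/d">fourth</a>
  <a href="/e">fifth</a>
</div>
"""


def links_after_pops(html):
    """Return the link texts left in each .book-list after three pop(1) calls."""
    soup = BeautifulSoup(html, "html.parser")
    remaining = []
    for book in soup.select(".book-list"):
        anchors = book.select("a")   # list-like result of Tag objects
        for _ in range(3):
            anchors.pop(1)           # always drops the current second item
        remaining.append([a.get_text() for a in anchors])
    return remaining


if __name__ == "__main__":
    print(links_after_pops(SAMPLE_HTML))
```

Each `pop(1)` removes whatever is currently second, so the calls drop the original second, third, and fourth links and keep the first one plus everything from the fifth onward. A slice such as `anchors[:1] + anchors[4:]` gives the same result and does not raise an error when a block has fewer than four links.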
