[Python crawler] Scraping a meme site (斗图网) with an object-oriented approach

Article link: https://blog.csdn.net/flyskymood/article/details/124310272

Contents

Preface
😎 Full code

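Preface

It has been a while since I posted a practical little example, so in this post let's write a small meme-scraping project!
(This time it is implemented with some basic object-oriented programming.) Most of us enjoy trading memes when chatting: one disagreement and the meme battle starts, and before long you run out of images. Below is a way to use Python to batch-download a large number of funny meme images from the meme site 斗图网, so you will never lose a meme battle again.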

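📕 Previous posts

💡 [Python] Dictionary tutorial (super detailed): how will you keep up with everyone else without reading it
💡 [Python tutorial] A step-by-step guide to connecting to MySQL with the pymysql module for create, read, update, and delete operations
💡 Selenium automated testing in practice: saving Bilibili information to Excel
💡 In the time it took my roommate to play one game, I built a Selenium automation script and saved the data to MySQL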


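😍 Final result

Let's look at the final result first. With this many memes, you will never run out in a meme battle.
The other information for each image (title, time, image URL, and detail-page link) is also saved to a CSV file, which can be opened in Excel.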

🎃 Development environment

PyCharm
Python 3.8

Main modules

requests
BeautifulSoup (from bs4, with the lxml parser)
fake_useragent
csv

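🚗 Page analysis

As always, start by analyzing the page so the scraping strategy is clear. Open the site, right-click and view the page source, then search the source for a keyword that is visible on the page. If the keyword is present in the source, the content is delivered in the initial HTML, so the page can be treated as static.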
Inspecting the elements shows that every image sits in an li tag under ul id='post_container'.
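The keyword check described above can also be done in code rather than by eye. This is only a minimal sketch: `keyword_in_source` and the sample HTML are hypothetical, and in practice the string would be the text returned by the page request.

```python
def keyword_in_source(html: str, keyword: str) -> bool:
    """Return True if the keyword appears in the raw page source (case-insensitive)."""
    return keyword.lower() in html.lower()


# Hypothetical page source standing in for the real response text
sample_html = '<ul id="post_container"><li><h2><a href="/1.html">Funny cat</a></h2></li></ul>'

print(keyword_in_source(sample_html, "Funny cat"))   # True: the text is in the static HTML
print(keyword_in_source(sample_html, "Grumpy dog"))  # False: not in the source
```

If a keyword you can see in the browser is missing from the source, the content is probably loaded by JavaScript and this approach would not be enough.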

Next, turn the page and watch how the URL changes. After paging, the number after page changes, following this pattern:
/page/1
/page/2
/page/3

So we can use a variable for the page number to turn pages, which lets us scrape every page.
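The pagination pattern above can be sketched as a small helper that builds the URL for each page. The base URL is the one used in the full code of this post; the helper name `page_urls` is hypothetical.

```python
# URL pattern taken from the full code in this post
BASE_URL = "http://www.bbsnet.com/page/{page}"


def page_urls(first: int, last: int) -> list:
    """Build the list of page URLs from first to last, inclusive."""
    return [BASE_URL.format(page=n) for n in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```

Keeping URL construction in one place means the page range can be changed without touching the request or parsing code.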

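🤓 Main idea

Use requests to request the page, use BeautifulSoup (a parsing library) to get the li tag of each image, extract the title, image URL, and other information, and then save the extracted data in the appropriate way.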

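🤩 Implementation steps

First import the required libraries, create a class and give it its attributes, then write a method that requests a page and returns its source (get_content in the full code at the end of this post).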
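With the page source in hand, parse it with BeautifulSoup: find all the image li tags, loop over them, and use BeautifulSoup's tag and attribute lookups to pull out what we need, namely the image URL and the other details. Build a dictionary for each item and append it to a list, so the data can later be saved to a CSV file that opens in Excel. Each image URL is passed to another method of the class to be downloaded.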
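The parsing step can be tried offline on a small piece of HTML before running it against the site. The sketch below makes several assumptions: the sample markup is hypothetical and only mirrors the tags the full code looks for, it uses the built-in html.parser instead of lxml so nothing extra needs installing, and it uses select_one with a None check instead of chained find calls so that an li without the expected tags is skipped rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the structure the full code expects
SAMPLE_HTML = """
<div class="mainleft">
  <ul id="post_container">
    <li>
      <a href="http://example.com/post/1"><img src="http://example.com/img/1.gif"></a>
      <div class="article"><h2><a href="http://example.com/post/1">Funny cat</a></h2></div>
      <div class="info"><span>2022-04-20</span></div>
    </li>
    <li><!-- an item without image or title, for example an ad slot --></li>
  </ul>
</div>
"""


def parse_items(html: str) -> list:
    """Extract one dictionary per li that contains all four expected tags."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for li in soup.select("div.mainleft ul li"):
        title_tag = li.select_one("div.article h2 a")
        time_tag = li.select_one("div.info span")
        img_tag = li.select_one("a img")
        link_tag = li.find("a")
        # Skip list items that do not have the expected structure
        if any(tag is None for tag in (title_tag, time_tag, img_tag, link_tag)):
            continue
        items.append({
            "title": title_tag.get_text(strip=True),
            "time": time_tag.get_text(strip=True),
            "images": img_tag.get("src"),
            "details": link_tag.get("href"),
        })
    return items


print(parse_items(SAMPLE_HTML))
```

The second li in the sample has none of the expected tags, so only one dictionary is returned.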
The save_images method downloads each image to the local disk.
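When saving downloads, it can help to build the file path separately from the download itself. The sketch below is an optional variation, not part of the original code: `image_path` and the folder name `memes` are hypothetical. It keeps the extension found in the image URL, falls back to .gif when there is none, and uses os.path.join so the path works on any operating system.

```python
import os
from urllib.parse import urlparse


def image_path(folder: str, index: int, url: str, default_ext: str = ".gif") -> str:
    """Build a local file path for an image, keeping the extension found in the URL."""
    # urlparse(...).path drops any query string before the extension is read
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return os.path.join(folder, f"{index}{ext or default_ext}")


print(image_path("memes", 1, "http://example.com/img/cat.JPG?size=large"))
print(image_path("memes", 2, "http://example.com/img/no-extension"))
```

The returned path can then be passed to open(..., mode='wb') to write the downloaded bytes.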

The save_csv method stores the list data in a CSV file.
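When rows are appended to the same CSV file one page at a time, the header row should be written only once. A minimal sketch of that idea, assuming the same four field names as this post; `append_rows` and the demo rows are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ["title", "time", "images", "details"]


def append_rows(path: str, rows: list) -> None:
    """Append rows to a CSV file, writing the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo: two "pages" appended to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    append_rows(path, [{"title": "a", "time": "t1", "images": "u1", "details": "d1"}])
    append_rows(path, [{"title": "b", "time": "t2", "images": "u2", "details": "d2"}])
    with open(path, newline="", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines[0])    # title,time,images,details
print(len(lines))  # 3: one header row plus two data rows
```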

Finally, the main method runs a loop over the page numbers, so the scraper works through the site page by page.

😯 Result

😎 Full code

# @Author : 王同学
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import csv
import os.path


num = 1
class spider():
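    # Constructor
    def __init__(self):
        # Generate a random User-Agent so requests look like they come from a browser
        ua = UserAgent()
        # Store the request headers as an instance attribute
        self.headers = {'User-Agent': ua.random}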


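    def get_content(self, url):
        """Request a page and return its HTML, or None if the request fails."""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException as e:
            print(e)
        return None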



    def get_data(self, response):
        all_data = []
        # Parse the page source
        soup = BeautifulSoup(response, 'lxml')
        # Find every li tag in the image list
        all_li = soup.find('div', class_="mainleft").find('ul').find_all('li')
        # Loop over the list items
        for i in all_li:
            title = i.find('div', class_="article").find('h2').find('a').text
            time = i.find('div', class_="info").find('span').text
            images = i.find('a').find('img').get('src')
            details = i.find('a').get('href')

            # One dictionary per image
            item = {
                'title': title,
                'time': time,
                'images': images,
                'details': details
                }

            all_data.append(item)

            # Download the image with another method of this class
            self.save_images(images)

        # Keep this page's rows on the instance so save_csv can write them
        self.all_data = all_data


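    def save_csv(self):
        headers = ['title', 'time', 'images', 'details']
        filename = '斗图.csv'
        # Write the header row only once, when the file is new or empty
        write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0
        # Open the file in append mode so each page adds to the same CSV
        with open(filename, mode='a', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, headers)
            if write_header:
                writer.writeheader()
            writer.writerows(self.all_data)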



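    def save_images(self, images):
        global num
        # Create the output folder on first use
        os.makedirs('斗图', exist_ok=True)

        result = requests.get(url=images, headers=self.headers, timeout=10).content
        # Write the image bytes to a numbered file; os.path.join keeps the path portable
        with open(os.path.join('斗图', str(num) + '.gif'), mode='wb') as f:
            f.write(result)
        print('Saving image', num)
        num += 1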



    def main(self):
        # Loop over pages 1 to 17
        for i in range(1, 18):
            url = f'http://www.bbsnet.com/page/{i}'
            print(f'=====================Saving page {i}==========================')
            response = self.get_content(url)
            if response is None:
                # Skip pages that could not be fetched
                continue
            self.get_data(response)
            self.save_csv()




if __name__ == '__main__':
    mood = spider()  # Create the crawler object
    mood.main()