MiziSpider Crawler Source Code [Single-Threaded] [Class-Based Version] [Function-Based Version]

松花蛋卷蛋

Last modified: 2023-06-23 14:54:32

Tags: web crawler, python, programming languages

First published: 2023-06-23 14:24:32

Article link: https://blog.csdn.net/Lping9147/article/details/131349805

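I have been studying object detection lately. The Python code there runs to roughly 1,000 lines and rests on a lot of math, and the longer I studied the more I wanted to run back to the countryside: no trade, no harm. Then I happened to come across web crawlers, spent a day learning the basics, picked a website, and tested a crawler on it. It worked reasonably well. If you follow the workflow step by step, writing a crawler is fairly simple. This crawler is meant only for exercising your thinking and for learning and discussion. It can be improved further later on.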

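1. What you will get out of this crawler project:

(1) Familiarity with defining and calling your own functions, file management, writing classes, and a basic understanding of web page HTML.

(2) Familiarity with third-party libraries such as BeautifulSoup, lxml, and requests.

(3) Practice typing code by hand and applying what you learn.

(4) Writing a class and passing arguments to its methods (the key point).

2. Directions for improvement

(1) Multi-process crawler

(2) Multi-threaded crawler

(3) Distributed crawler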

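3. Principles and applications

I. Principles

Web scraping (crawling) is the process of automatically extracting data from web pages. By writing a crawler, you can visit websites on the internet and pull the data you need from their pages, such as text, images, and links. The general workflow of a crawler is outlined below.

The typical implementation steps are:

1. **Define the target:** Decide which website or specific pages you want to crawl, what kind of data you want to extract, and how much of it you need.

2. **Analyze the page structure:** Use the browser's developer tools or view the page source to understand the target site's page structure, its HTML tags, and where the data sits.

3. **Choose a suitable crawler tool or library:** Pick a tool or library that fits your programming language. Common choices in Python include Scrapy, BeautifulSoup, and Requests.

4. **Write the crawler code:** Use the chosen tool or library to visit the target site, fetch the data, and process it. This may involve sending HTTP requests, parsing HTML, and extracting the fields you need.

5. **Process the data:** Clean, process, and organize the crawled data so that it fits your needs. This may require regular expressions, string operations, or other data-processing methods.

6. **Store and analyze the data:** Save the crawled data to a database or in another format for later analysis and use.

7. **Set a crawling policy:** Take the target site's rate limits and crawler ethics into account. Set appropriate request headers, delays, and request counts so that you comply with the site's terms of service and avoid putting too much load on it.

8. **Run and monitor the crawler:** Run your crawler code and monitor its status and output as it runs. Adjust the code and fix problems as needed.

Note that when you write a crawler you must respect the site's terms of service and the applicable laws and regulations, and make sure the data is collected for a legitimate purpose. A basic knowledge of network programming and of HTML, CSS, and XPath also helps.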
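
The "parse HTML and extract the data you need" part of steps 4 and 5 can be sketched as follows. This is a minimal sketch, not the code used later in this post: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages. The names `LinkCollector`, `extract_page_links`, and `SAMPLE_HTML`, as well as the example.com URLs, are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_page_links(html):
    """Return absolute links that end in 'html', in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        link for link in collector.links
        if link.startswith("https") and link.endswith("html")
    ]


# Hypothetical sample page; example.com stands in for a real site.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/gallery/1.html">kept</a>
  <a href="/relative/2.html">relative link, skipped</a>
  <a>no href, skipped</a>
  <a href="https://example.com/about">not an html page, skipped</a>
</body></html>
"""

page_links = extract_page_links(SAMPLE_HTML)
print(page_links)
```

The filter (absolute `https` link ending in `html`) mirrors the condition used in the crawler code later in this post. On a real site you would adjust it after inspecting the page structure.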

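II. Applications

Crawler technology serves many different purposes. Here are some common tasks a crawler can handle:

1. **Data collection and scraping:** A crawler can help you gather large amounts of data from the internet, such as articles from news sites, product information from e-commerce sites, and user information from social media platforms.

2. **Data analysis and mining:** A crawler can supply large datasets for further analysis and mining. By analyzing the crawled data you can find trends, patterns, and relationships, and extract valuable information from them.

3. **Search engine indexing:** Search engines use crawlers to fetch pages across the internet and build an index for fast search. Crawlers help search engines discover new content, refresh existing content, and determine the importance and ranking of pages.

4. **Price comparison and monitoring:** Crawling product information from e-commerce sites makes price comparison and monitoring possible, which is very useful for online shopping and for competitive market analysis.

5. **Public opinion monitoring:** A crawler can help you track opinion and user comments on a given topic across social media, news sites, and forums.

6. **Content aggregation and recommendation:** A crawler can collect data from multiple sources and aggregate it into a single platform or application. This provides richer content and allows personalized recommendations based on user interests.

7. **Website testing and monitoring:** By simulating user behavior and crawling page content, you can test and monitor a website. This helps uncover errors, vulnerabilities, and performance problems.

8. **Automation:** A crawler can carry out routine, repetitive web operations such as logging in, submitting forms, and downloading files.

Whatever you use a crawler for, comply with the applicable laws and regulations and with the site's terms of service. Respect the site's robots protocol and data privacy, and make sure your crawling does not overload or damage the target site.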
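
Respecting the robots protocol and limiting load can be sketched as below. This is a minimal sketch, assuming a robots.txt that has already been downloaded as text. `ROBOTS_TXT`, `build_robot_rules`, `fetch_allowed`, and the example.com URLs are hypothetical names for illustration, while `RobotFileParser.parse` and `can_fetch` come from the standard library's `urllib.robotparser`. The `fetch` argument stands in for whatever function actually downloads a page.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def fetch_allowed(urls, rules, fetch, delay_seconds=1.0, user_agent="*"):
    """Call fetch(url) only for URLs that robots.txt allows, pausing between calls."""
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        results[url] = fetch(url)
        time.sleep(delay_seconds)  # spread requests out to limit server load
    return results


rules = build_robot_rules(ROBOTS_TXT)
candidate_urls = [
    "https://example.com/public/a.html",
    "https://example.com/private/b.html",
]
# A stand-in fetch function and a zero delay keep this demo offline and fast.
fetched = fetch_allowed(candidate_urls, rules, fetch=lambda u: "<html></html>",
                        delay_seconds=0)
print(list(fetched))
```

In a real run you would pass a function built on `requests.get` as `fetch` and keep a delay of a second or more.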

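4. Steps

(1) Fetch the HTML page. [get_html]

(2) Parse the page and get the first-layer targets. [parse_html]

(3) Get all second-layer targets and count them. [get_pic_html]

(4) Get the source address of every target. [get_pic_src]

(5) Download the target files. [download_pic]

Note: every case needs its own analysis. Always analyze the structure of the particular pages you want to crawl. Some sites need several rounds of structural analysis, depending on how many files you want a single .py file to download: the bigger your ambition, the more analysis you need. Behind all programming there is logic, and there is always more to learn.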
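
As an example of the structural analysis this note calls for, step (3) relies on a URL pattern: the code later in this post treats a link as belonging to a gallery when stripping everything after the last underscore leaves the gallery URL without its extension. The sketch below isolates that check as a pure function. It is a sketch under stated assumptions: the layout `<gallery>_<n>.html` is inferred from the code in this post rather than documented by the site, and `gallery_prefix`, `is_gallery_page`, and the example.com URLs are hypothetical.

```python
def gallery_prefix(gallery_url):
    """Strip the final extension: '.../1234.html' -> '.../1234'."""
    return gallery_url.rsplit(".", 1)[0]


def is_gallery_page(gallery_url, candidate_url):
    """Return True if candidate_url looks like '<prefix>_<something>.html'.

    Assumes every image page of a gallery shares the gallery URL's prefix
    followed by an underscore and a page suffix.
    """
    prefix = gallery_prefix(gallery_url)
    stem, sep, suffix = candidate_url.rpartition("_")
    return sep == "_" and stem == prefix and suffix.endswith(".html")


gallery = "https://example.com/gallery/1234.html"
candidates = [
    "https://example.com/gallery/1234_2.html",  # page 2 of the same gallery
    "https://example.com/gallery/5678_2.html",  # a different gallery
    "https://example.com/gallery/1234.html",    # the gallery page itself
]
matches = [u for u in candidates if is_gallery_page(gallery, u)]
print(matches)
```

Keeping this check in a small function makes it easy to adapt when a different site uses a different page-numbering scheme.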

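5. Environment setup

(1) Computer: any computer and any operating system will do. It is not limited to Windows, macOS, or Linux.

(2) Environment: put simply, it is the place where you settle down. You can move it, but that is not recommended, because it costs effort and time.

(3) Version: any Python 3 release. I have not looked at anything older.

(4) This post uses a Mac, Python 3.9, and PyCharm 22.1.

(5) Target site: a gallery of photos of women (chosen purely because it is an image library!! You have to pick something to crawl).

6. Code implementation

I. Function-based version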

import os
import re
import time

import requests       
from requests import exceptions  
from bs4 import BeautifulSoup   
import numpy as np

url = "https://www.3gbizhi.com/meinv"
img_path = "./picture"  # directory where downloaded images are stored
ds_list = []            # global list of per-image HTML page URLs
src_list = []           # global list of image source URLs
count = 0               # global counter for the number of images
index_lst = []          # global list of index-page URLs

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36",
    'Connection': 'Keep-Alive',
    'Referer': "https://www.3gbizhi.com/meinv/xgmn/"
}


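def down_html_first(url):
    """Fetch an HTML page and return its text, or None if the request fails."""
    try:
        # Without a timeout, requests can wait forever and Timeout is never raised.
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.text
    except exceptions.RequestException as e:
        # RequestException covers Timeout, HTTPError, ConnectionError, etc.
        print(e)
        return None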


def parse_html(html):
    """Parse the first-layer page and return the de-duplicated gallery page URLs."""
    soup = BeautifulSoup(html, "lxml")
    a_list = soup.find_all("a")  # every <a> tag in the parsed document, as a list
    href_list_f = []
    for a in a_list:
        url_ = a.get('href')
        if url_ is None:
            continue
        if url_.startswith('https') and url_.endswith('html'):
            href_list_f.append(url_)
    href_list_new = []
    for b in href_list_f:
        if re.search("index", b):    # index pages are recorded separately
            index_lst.append(b)
            continue
        if re.search("connect", b):  # skip links that contain "connect"
            continue
        href_list_new.append(b)
    save(index_lst, "./index.txt")
    # href_list_new = [x for i, x in enumerate(href_list_new) if x not in href_list_new[:i]]
    href_list_new = list(set(href_list_new))  # drop duplicates (order is not preserved)
    print(f"Step 2: first-layer page parsed, {len(href_list_new)} gallery pages found in total!")
    return href_list_new


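def save(lst, path):
    """Append every item of lst to the text file at path, one item per line."""
    # Open the file once instead of reopening it for every item.
    with open(path, "a+", encoding="utf-8") as f:
        for i in lst:
            f.write(i + "\n")
    print(f"Saved {len(lst)} targets!")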


def get_pic_html_num(html):
    """Collect the HTML page of every image in each gallery and count them."""
    count = 0
    print(html)
    for url2 in html:
        addr = url2.rsplit(".", 1)[0]
        # headers must be passed by keyword; the second positional argument is params.
        resp = requests.get(url2, headers=headers, timeout=10)
        soup = BeautifulSoup(resp.text, "lxml")
        a_list = soup.find_all("a")
        for a in a_list:
            url_ = a.get('href')
            if url_ is None:
                continue
            if url_.startswith('https') and url_.endswith('html'):
                # Keep only pages of the same gallery, i.e. "<addr>_<n>.html".
                if url_.rsplit("_", 1)[0] == addr:
                    ds_list.append(url_)
                    count = count + 1
    print(f"Step 3: collected the HTML pages of all images, {count} in total")
    print(ds_list[:9])
    return ds_list


def get_pic_url(href_list):
    """Collect the source URL of every image."""
    for href_ in href_list:
        html_img = down_html_first(href_)
        if html_img is None:  # the request failed, skip this page
            continue
        soup = BeautifulSoup(html_img, "lxml")
        img_list = soup.find_all("img", attrs={"id": "contpic"})
        for a in img_list:
            url_ = a.get('src')
            if url_ is None:
                continue
            if url_.startswith('https') and url_.endswith('.jpg'):
                src_list.append(url_)
    print(f"Step 4: collected {len(src_list)} original image URLs!! Please wait, preparing to download")
    print(src_list[:2])
    return src_list


def download_pic(src_list):
    """Download every image in src_list into img_path."""
    if not os.path.exists(img_path):
        print(f'{img_path} does not exist')
        os.makedirs(img_path)
        print(f'{img_path} has been created!!')
    # Number the files so that names stay unique and ordered.
    for number, url1 in enumerate(src_list, start=1):
        print('downloading... ', url1)
        img = requests.get(url1, headers=headers, timeout=10)
        now_time = int(round(time.time() * 1000))  # millisecond timestamp
        # Write into img_path rather than a hard-coded "./picture" directory.
        with open(f"{img_path}/{number}_{now_time}.jpg", 'wb') as f:
            f.write(img.content)
        img.close()
        time.sleep(1)  # pause between requests to go easy on the server
    print('All the pictures you wanted have been downloaded!')


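if __name__ == '__main__':
    html = down_html_first(url)
    if html is None:
        raise SystemExit("Step 1 failed: could not fetch the first-layer page.")
    print("Step 1: first-layer page fetched successfully!!!")
    time.sleep(np.random.randint(0, 2))  # random pause of 0 or 1 second
    # Parse the HTML document and get a list of gallery page URLs.
    href = parse_html(html)
    destination = get_pic_html_num(href)
    drs = get_pic_url(destination)
    print("Step 5: all target URLs collected, downloading the full-size images!!")
    save(drs, "./source_url.txt")
    download_pic(drs)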
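
A side note on the de-duplication in parse_html: `list(set(...))` removes duplicates but does not keep the original order, which is what the commented-out list comprehension in that function was aiming for. A minimal sketch of an order-preserving alternative follows. `dedupe_keep_order` and the sample URLs are hypothetical names for illustration. The sketch relies on dictionaries keeping insertion order, which Python guarantees from version 3.7.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


sample_links = [
    "https://example.com/b.html",
    "https://example.com/a.html",
    "https://example.com/b.html",
    "https://example.com/c.html",
    "https://example.com/a.html",
]
unique_links = dedupe_keep_order(sample_links)
print(unique_links)
```

Keeping the order makes repeated runs visit the galleries in the same sequence, which is handy when comparing logs between runs.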

II. Class-based version

# Editor -> File and Code Templates -> Python Script
# Author: 亮
# Created: 2023/6/23 11:37
import os
import re
import time
import datetime

import requests
from requests import exceptions
from bs4 import BeautifulSoup    # HTML parsing library
import numpy as np


class MeiziSpider:
    def __init__(self, url, img_path):
        self.url = url
        self.img_path = img_path
        self.html_list = []
        self.src_list = []
        self.count = 0
        self.num = 0
        self.index_lst = []
        self.step = 0
        self.headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36",
            'Connection': 'Keep-Alive',
            'Referer': "https://www.3gbizhi.com/meinv/xgmn/"
        }

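    def get_html(self, url):
        """1. Fetch an HTML page; return its text, or None if the request fails."""
        try:
            # Without a timeout, requests can wait forever and Timeout is never raised.
            resp = requests.get(url, headers=self.headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except exceptions.RequestException as e:
            # RequestException covers Timeout, HTTPError, ConnectionError, etc.
            print(e)
            return None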

    def parse_html(self, html):
        """2. Parse the HTML page and get the first-layer targets."""
        soup = BeautifulSoup(html, "lxml")
        a_list = soup.find_all("a")
        href_list_f = []
        for a in a_list:
            url_ = a.get('href')
            if url_ is None:
                continue
            if url_.startswith('https') and url_.endswith('html'):
                href_list_f.append(url_)
        href_list_s = []
        for link in href_list_f:
            if re.search("index", link):    # index pages are recorded separately
                self.index_lst.append(link)
                continue
            if re.search("connect", link):  # skip links that contain "connect"
                continue
            href_list_s.append(link)
        self.save(self.index_lst, "./index.txt")
        href_list_s = list(set(href_list_s))  # drop duplicates (order is not preserved)
        self.step += 1
        print(f"Step {self.step}/5: First-layer page parsed, found {len(href_list_s)} targets.")
        return href_list_s

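    def save(self, lst, path):
        """Append every item of lst to the text file at path, one item per line."""
        # Open the file once instead of reopening it for every item.
        with open(path, "a+", encoding="utf-8") as f:
            for i in lst:
                f.write(i + "\n")
        print(f"Saved {len(lst)} targets!")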

    def get_pic_html(self, html):
        """3. Parse the gallery pages and get the second-layer targets."""
        for url2 in html:
            addr = url2.rsplit(".", 1)[0]
            # headers must be passed by keyword; the second positional argument is params.
            resp = requests.get(url2, headers=self.headers, timeout=10)
            soup = BeautifulSoup(resp.text, "lxml")
            a_list = soup.find_all("a")
            for tag in a_list:
                url_ = tag.get('href')
                if url_ is None:
                    continue
                if url_.startswith('https') and url_.endswith('html'):
                    # Keep only pages of the same gallery, i.e. "<addr>_<n>.html".
                    if url_.rsplit("_", 1)[0] == addr:
                        self.html_list.append(url_)
                        self.count += 1
        self.step += 1
        print(f"Step {self.step}/5: Successfully obtained all the image pages, {self.count} in total")
        print(self.html_list[:1])
        return self.html_list

    def get_pic_src(self, href_list):
        """4. Parse the image pages and get the source address of every target."""
        for href_ in href_list:
            html_img = self.get_html(href_)
            if html_img is None:  # the request failed, skip this page
                continue
            soup = BeautifulSoup(html_img, "lxml")
            img_list = soup.find_all("img", attrs={"id": "contpic"})
            for a in img_list:
                url_ = a.get('src')
                if url_ is None:
                    continue
                if url_.startswith('https') and url_.endswith('.jpg'):
                    self.src_list.append(url_)
        self.step += 1
        print(f"Step {self.step}/5: Successfully obtained {len(self.src_list)} original image addresses! ")
        print(self.src_list[:1])
        return self.src_list

    def download_pic(self, src_list):
        """5. Download every target from its https address into self.img_path."""
        if not os.path.exists(self.img_path):
            print(f'{self.img_path} does not exist')
            print(f'{self.img_path} is being created now, please wait a second!')
            os.makedirs(self.img_path)
            print(f'{self.img_path} has been created! Please wait while preparing to download.')
        # Report step 5 whether or not the directory already existed.
        self.step += 1
        print(f"Step {self.step}/5: Successfully obtained all the target URLs, downloading source images!")

        for url1 in src_list:
            img = requests.get(url1, headers=self.headers, timeout=10)
            self.num += 1
            now_time = int(round(time.time() * 1000))  # millisecond timestamp
            print('downloading... ', url1)
            with open(f"{self.img_path}/{self.num}_{now_time}.jpg", 'wb') as f:
                f.write(img.content)
            img.close()
            time.sleep(1)  # pause between requests to go easy on the server
        print('The pictures you selected have been downloaded completely!')

    def start_spider(self):
        """Run the whole crawl from start to finish."""
        start = time.time()
        html = self.get_html(self.url)
        if html is None:
            print("Step 1/5 failed: could not fetch the first-layer page.")
            return
        self.step += 1
        print(f"Step {self.step}/5: Successfully obtained the first layer of web pages!")
        time.sleep(np.random.randint(0, 2))  # random pause of 0 or 1 second
        href = self.parse_html(html)
        destination = self.get_pic_html(href)
        pic_src = self.get_pic_src(destination)
        self.save(pic_src, "./source_url.txt")
        self.download_pic(pic_src)
        end = time.time()
        total_time = end - start
        # timedelta renders as h:mm:ss, so the message should not say "seconds".
        time_format = str(datetime.timedelta(seconds=total_time))
        print(f"MeiziSpider ran for {time_format} (h:mm:ss) in total!")


if __name__ == '__main__':
    url = "https://www.3gbizhi.com/meinv"
    img_path = "./picture1"
    spider = MeiziSpider(url, img_path)
    spider.start_spider()

Results section

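III. Comparison

The function-based version is fairly casual: you write the functions one after another and let them call each other. It is simple, but the structure is harder to see and the layering is weaker than in the class-based version. The class-based version is more pleasant to read, its hierarchy is clear, and you do not have to worry about where a variable lives. On the other hand it asks more of you, because you need a clear overall picture of how a class and its method calls work. The two styles are related rather than isolated from each other.

IV. Drawbacks

(1) Time efficiency

The code in this post is still slow, for four reasons. First, the code itself is not optimized and does not use techniques such as multithreading. Second, it uses nested loops in several places. Third, there is the Python language itself. Fourth, the computer and the network bandwidth play a part. The first two are subjective factors and can be optimized, the third is half objective and half subjective, and the last is purely objective: if you have no money, there is not much you can do about it. The code is therefore suitable only for casual use and for learning and discussion, not for commercial use.

(2) Space efficiency

As with time efficiency, there is room for improvement: memory could be reclaimed more actively to reduce memory usage.

V. Statement

The code in this post was compiled from resources found online and rewritten to fit my own needs. Its functionality is simple. Any resemblance to other code is purely coincidental.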
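
On the time-efficiency point, one possible way to apply multithreading is a small thread pool from the standard library's `concurrent.futures`. The sketch below is only an illustration under stated assumptions, not a tuned implementation: `download_all` and `fake_download` are hypothetical names, and `fake_download` stands in for a real function that would fetch one image and write it to disk, so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, max_workers=4):
    """Run download_one(url) for every URL on a small thread pool.

    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, urls))


def fake_download(url):
    """Stand-in for a real download; returns the number of bytes 'saved'."""
    return len(url.encode("utf-8"))


image_urls = [f"https://example.com/img/{n}.jpg" for n in range(1, 6)]
sizes = download_all(image_urls, fake_download)
print(sizes)
```

Threads help here because downloading is mostly waiting on the network. A small `max_workers` value, combined with a delay inside the real download function, keeps the load on the target site reasonable.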
