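Scraping the Honor of Kings (王者荣耀) hero list with Python and the BeautifulSoup library: an example of batch-scraping a single element from many pages.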

前程的前程也迷茫

Last modified 2023-09-13 22:36:16

Tags: python, web scraping

First published 2023-09-13 22:27:22

Article link: https://blog.csdn.net/HQC66666/article/details/132866030

1. Project overview

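Honor of Kings (王者荣耀) is a multiplayer online battle arena game that is very popular in mainland China. It has a large roster of heroes, each with its own abilities and play style. In this article we write a web scraper in Python that collects the names of all heroes from the official Honor of Kings website and saves them to a text file.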

This project is for learning purposes only. Do not reuse it for anything unlawful!

2. Code walkthrough (the full code is at the bottom)

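First, we import the libraries we need: requests to send HTTP requests, os for file-system operations (here, reading the size of the output file), and BeautifulSoup to parse HTML pages.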

import requests
import os
from bs4 import BeautifulSoup

Next, we define a function named get_hero_name that fetches the hero names.

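def get_hero_name():
    count = 0
    # Clear the file first so names from an earlier run are not kept
    with open('王者荣耀英雄名单.txt', 'w', encoding='utf-8') as f:
        f.write('')
    for num in range(100, 600):
        myUrl = 'https://pvp.qq.com/web201605/herodetail/{}.shtml'.format(num)
        # A timeout keeps one stalled request from hanging the whole run
        html = requests.get(myUrl, timeout=10)
        # Avoid garbled text by using the encoding detected from the body
        html.encoding = html.apparent_encoding
        if html.status_code == 404:
            continue
        soup = BeautifulSoup(html.text, "html.parser")
        # select_one returns None when the page has no matching tag
        hero_tag = soup.select_one('h2.cover-name')
        if hero_tag is None:
            continue
        hero_name = hero_tag.get_text(strip=True)
        print(hero_name)
        with open('王者荣耀英雄名单.txt', 'a+', encoding='utf-8') as f:
            f.write(hero_name + '\n')
        count += 1
    size, unit = get_file_size('王者荣耀英雄名单.txt')
    print(f'Scrape finished: {count} entries saved to "王者荣耀英雄名单.txt", file size {size} {unit}')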

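In this function we first clear any content left in the file by an earlier run. We then loop over the candidate hero URLs (IDs 100 to 599), skip the ones that return 404, extract the hero name from each remaining page, and append it to the text file. BeautifulSoup parses the HTML and pulls out the name. At the end we compute the size of the resulting file and print the number of entries scraped together with the file name.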
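
To see the extraction step in isolation, here is a minimal sketch that runs the same `h2.cover-name` selector against an inline HTML string instead of a live page. The sample markup and the helper name `extract_hero_name` are assumptions made for illustration; the real hero pages contain far more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a hero detail page
SAMPLE_HTML = """
<html><body>
  <div class="cover">
    <h2 class="cover-name">Example Hero</h2>
    <h3 class="cover-title">Example Title</h3>
  </div>
</body></html>
"""


def extract_hero_name(html_text):
    """Return the text of the first <h2 class="cover-name">, or None if absent."""
    soup = BeautifulSoup(html_text, "html.parser")
    tag = soup.select_one("h2.cover-name")
    if tag is None:
        # No matching tag: the page is not a hero page
        return None
    return tag.get_text(strip=True)


print(extract_hero_name(SAMPLE_HTML))                    # Example Hero
print(extract_hero_name("<html><body></body></html>"))  # None
```

Checking for `None` matters because `select_one` returns `None` when nothing matches, and reading an attribute such as `.string` from `None` raises an `AttributeError`.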

Next, we define a function named get_file_size that computes the file size.

def get_file_size(file):
    size = os.path.getsize(file)  # size in bytes
    # Pick the largest unit that keeps the displayed number below 1024
    if size < 1024:
        return round(size, 2), 'Byte'
    KBX = size / 1024
    if KBX < 1024:
        return round(KBX, 2), 'KB'
    MBX = KBX / 1024
    if MBX < 1024:
        return round(MBX, 2), 'MB'
    # Round to two decimals, like the other branches
    return round(MBX / 1024, 2), 'GB'
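
The unit selection can also be tried without touching the disk. The sketch below applies the same divide-by-1024 idea to plain byte counts; `format_size` is a hypothetical helper that is not part of the original script, and it returns a ready-to-print string instead of a tuple.

```python
def format_size(num_bytes):
    """Scale a byte count to the largest unit that keeps the value below 1024."""
    value = num_bytes
    for unit in ("Byte", "KB", "MB"):
        if value < 1024:
            return f"{round(value, 2)} {unit}"
        value /= 1024
    # Anything that is still 1024 MB or more is reported in GB
    return f"{round(value, 2)} GB"


for n in (512, 1536, 5 * 1024 ** 2, 3 * 1024 ** 3):
    print(n, "->", format_size(n))
# 512 -> 512 Byte
# 1536 -> 1.5 KB
# 5242880 -> 5.0 MB
# 3221225472 -> 3.0 GB
```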

Finally, we call get_hero_name from the main program to run the scrape.
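
The loop logic (try every ID, skip the ones with no page, count the rest) can be exercised without any network access. The following is a rough sketch under that assumption: `collect_hero_names` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request and parsing.

```python
def collect_hero_names(fetch_name, hero_ids):
    """Call fetch_name(hero_id) for each ID and keep the names that exist.

    fetch_name is a hypothetical callable that returns the hero name as a
    string, or None when there is no page for that ID (for example a 404).
    """
    names = []
    for hero_id in hero_ids:
        name = fetch_name(hero_id)
        if name is None:
            continue  # nothing behind this ID, same as the 404 branch
        names.append(name)
    return names


def fake_fetch(hero_id):
    # Stand-in for the real request: only two IDs "exist" here
    pages = {105: "Hero A", 106: "Hero B"}
    return pages.get(hero_id)


if __name__ == '__main__':
    found = collect_hero_names(fake_fetch, range(100, 110))
    print(f"{len(found)} names: {found}")
```

Passing the fetch step in as a parameter is only one possible design. It makes the loop easy to test, while the script in this article keeps everything in a single function.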

3. Full code

import requests
import os
from bs4 import BeautifulSoup


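def get_hero_name():
    count = 0
    # Clear the file first so names from an earlier run are not kept
    with open('王者荣耀英雄名单.txt', 'w', encoding='utf-8') as f:
        f.write('')

    for num in range(100, 600):
        myUrl = 'https://pvp.qq.com/web201605/herodetail/{}.shtml'.format(num)
        # A timeout keeps one stalled request from hanging the whole run
        html = requests.get(myUrl, timeout=10)
        # Avoid garbled text by using the encoding detected from the body
        html.encoding = html.apparent_encoding
        if html.status_code == 404:
            continue

        # Parse the page content with BeautifulSoup
        soup = BeautifulSoup(html.text, "html.parser")

        # Find the tag that holds the hero name; skip pages that lack it
        hero_tag = soup.select_one('h2.cover-name')
        if hero_tag is None:
            continue
        hero_name = hero_tag.get_text(strip=True)
        print(hero_name)
        with open('王者荣耀英雄名单.txt', 'a+', encoding='utf-8') as f:  # open in append mode
            f.write(hero_name + '\n')  # one name per line
        count += 1
    size, unit = get_file_size('王者荣耀英雄名单.txt')
    print(f'Scrape finished: {count} entries saved to "王者荣耀英雄名单.txt", file size {size} {unit}')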


# Compute the file size
def get_file_size(file):
    size = os.path.getsize(file)  # size in bytes
    # For readability, pick the largest unit that keeps the number below 1024
    if size < 1024:
        return round(size, 2), 'Byte'
    KBX = size / 1024
    if KBX < 1024:
        return round(KBX, 2), 'KB'
    MBX = KBX / 1024
    if MBX < 1024:
        return round(MBX, 2), 'MB'
    # Round to two decimals, like the other branches
    return round(MBX / 1024, 2), 'GB'


if __name__ == '__main__':
    get_hero_name()
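
After a run, it can be handy to read the saved list back and check it. The snippet below is a small sketch of that, assuming the one-name-per-line format the script writes; `load_hero_names` is a hypothetical helper, and the demo uses a temporary file with placeholder names instead of the real output.

```python
import tempfile
from pathlib import Path


def load_hero_names(path):
    """Read one name per line, dropping blank lines and surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo with a temporary file so nothing on disk is overwritten
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_names.txt"
    demo.write_text("Hero A\nHero B\n\n", encoding="utf-8")
    print(load_hero_names(demo))  # ['Hero A', 'Hero B']
```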