Python crawler: scraping game rankings with requests and lxml

Article link: https://blog.csdn.net/weixin_44632609/article/details/105538296

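This article walks through a simple Python crawler that fetches the online game rankings from hao123. It analyzes the page structure and uses XPath to extract the ranking data from the static page.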


Target URL: http://wy.hao123.com/top
Development environment:

PyCharm 2019.2.3
Python 3.6
Firefox

Third-party libraries used:

requests
lxml
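
Before running the crawler, it can help to confirm that both libraries are importable. The snippet below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name introduced here, not part of the original script.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["requests", "lxml"])
if missing:
    # Both packages are installable from PyPI under these names.
    print("Install first: pip install " + " ".join(missing))
else:
    print("requests and lxml are available.")
```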

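Execution result

Getting started

Fetching the page

Open Firefox and go to http://wy.hao123.com/top.
Press F12 to open the developer tools.
Inspecting the page in the developer tools shows that the following XPath expression selects the data we need: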

"//div[@class='list1 margin-right']|//div[@class='list1 ']"

Match result
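
As a minimal sketch of what the expression selects, the snippet below runs it against a small hand-written HTML fragment. The fragment is an assumption reconstructed from the XPath expressions used in this article, not a copy of the real hao123 page. Because `@class='...'` compares the whole attribute string, the trailing space in `'list1 '` matters: a div whose class is exactly `list1` is not selected.

```python
from lxml import etree

# Hypothetical fragment: the markup is assumed from the XPath expressions,
# not copied from the real hao123 page.
SAMPLE_HTML = """
<html><body>
  <div class="list1 margin-right"><div class="tlt">Board A</div></div>
  <div class="list1 "><div class="tlt">Board B</div></div>
  <div class="list1"><div class="tlt">Board C</div></div>
</body></html>
"""

RANK_XPATH = "//div[@class='list1 margin-right']|//div[@class='list1 ']"

doc = etree.HTML(SAMPLE_HTML)
# The union operator (|) merges both node sets, returned in document order.
matched = doc.xpath(RANK_XPATH)

# Read the title of each matched board.
titles = [div.xpath("./div[@class='tlt']")[0].text for div in matched]
print(titles)
```

Only the first two boards are selected here, since the third one's class string matches neither literal exactly.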

Writing the crawler code

spider_hao123_01.py

"""
    Scrape the online game rankings from hao123
    http://wy.hao123.com/top
"""
import requests
from lxml import etree


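def load_context():
    """Fetch the content of http://wy.hao123.com/top."""

    # Define the HTTP request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
    }
    # Target URL
    url = "http://wy.hao123.com/top"

    # Fetch the page with requests.get; the timeout keeps the call from hanging forever
    rsp_ctx = requests.get(url, headers=headers, timeout=10)
    # Raise an exception on an HTTP error status instead of parsing an error page
    rsp_ctx.raise_for_status()
    # Return the page content as text
    return rsp_ctx.text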


def parse_context(c):
    """Parse the page source and write each ranking to hao123_wy_top.txt."""
    html = etree.HTML(c)
    # Select every ranking <div>
    divs = html.xpath("//div[@class='list1 margin-right']|//div[@class='list1 ']")
    # Write as UTF-8 so Chinese game names are saved correctly on any platform
    with open('./hao123_wy_top.txt', 'w', encoding='utf-8') as f:
        for div in divs:
            # Title of this ranking
            title = div.xpath("./div[@class='tlt']")[0].text
            # print(title)
            f.write(title + '\n')
            # Entries in this ranking
            games = div.xpath(".//li")
            for idx, game in enumerate(games):
                # Game name
                game_name = game.xpath("./p/a")[0].text
                # Game type
                game_type = game.xpath("./em")[0].text
                f.write("\t" + str(idx+1) + "\t" + game_name + "\t" + game_type + '\n')
    print("Finished fetching the hao123 online game rankings.")


if __name__ == "__main__":
    ctx = load_context()
    parse_context(ctx)
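
The script above indexes `[0]` on every XPath result, so one missing node raises an `IndexError`. The following is a sketch of a more tolerant variant, not the author's method: it swaps the exact class comparison for the standard XPath 1.0 token test (which may select more divs than the original expression) and uses `string(...)`, which yields an empty string when a node is absent. `extract_rankings` and the sample markup are hypothetical, assumed from the XPath expressions in the script.

```python
from lxml import etree

# Token-based class test (standard XPath 1.0 idiom): matches any div whose
# class list contains "list1", regardless of extra classes or spaces.
BOARD_XPATH = "//div[contains(concat(' ', normalize-space(@class), ' '), ' list1 ')]"


def extract_rankings(page_source):
    """Return [(board_title, [(rank, name, type), ...]), ...] (hypothetical helper)."""
    doc = etree.HTML(page_source)
    boards = []
    for div in doc.xpath(BOARD_XPATH):
        # string(...) yields '' instead of raising when the node is missing.
        title = div.xpath("string(./div[@class='tlt'])").strip()
        rows = []
        for rank, li in enumerate(div.xpath(".//li"), start=1):
            name = li.xpath("string(./p/a)").strip()
            kind = li.xpath("string(./em)").strip()
            rows.append((rank, name, kind))
        boards.append((title, rows))
    return boards


# Hypothetical markup, assumed from the XPath expressions in the script.
SAMPLE_HTML = """
<div class="list1 margin-right">
  <div class="tlt">Hot games</div>
  <ul>
    <li><p><a href="#">Game One</a></p><em>RPG</em></li>
    <li><p><a href="#">Game Two</a></p></li>
  </ul>
</div>
"""

for title, rows in extract_rankings(SAMPLE_HTML):
    print(title)
    for rank, name, kind in rows:
        print(f"\t{rank}\t{name}\t{kind}")
```

Returning plain tuples keeps parsing separate from file output, so the same function can feed a text file, a CSV writer, or a test.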