Scraping NBA Player Data

lyc_QAQ

Last modified: 2024-02-04 19:08:46

Tags: web crawler, python

First published: 2024-02-04 18:26:15

Article link: https://blog.csdn.net/weixin_64426947/article/details/136030494

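Introduction: To collect basic information on every player in the NBA, I analyzed the page structure of the Chinese NBA site https://china.nba.cn/playerindex/ and wrote a crawler that scrapes the player data from it and saves it to a local CSV file.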

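1. Finding the Data Source

First, open the page I want to scrape, the player index.

Then open the browser's developer tools, switch to the Network tab, and select the XHR filter. After refreshing the page, a file named playerlist.json appears; it contains the data for all players listed on the page. On my first attempt, however, the target JSON did not show up under XHR. After several tries I found that it appeared once I turned on a VPN and accessed the site through a proxy IP.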
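
The proxy workaround above can also be applied from Python instead of only in the browser. The following is a minimal sketch under stated assumptions: the helper names `build_proxies` and `fetch_via_proxy` are hypothetical and not part of the original program, and the proxy address is whatever your own proxy exposes. The `proxies` keyword itself is a standard argument of `requests.get`.

```python
def build_proxies(host, port):
    """Build the proxies mapping that requests expects (hypothetical helper)."""
    proxy_url = f"http://{host}:{port}"
    # requests picks the entry whose key matches the scheme of the target URL
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10):
    """GET a URL through the given proxy and return the parsed JSON body."""
    # Imported here so build_proxies can be used without requests installed
    import requests

    response = requests.get(url, proxies=proxies, timeout=timeout)
    # Raise on 4xx/5xx instead of trying to parse an error page
    response.raise_for_status()
    return response.json()
```

For example, with a proxy assumed to listen locally on port 7890, `fetch_via_proxy(url, build_proxies("127.0.0.1", 7890))` would route the request through it. Whether this is enough to receive the JSON depends on the proxy and on how the site filters traffic.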

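2. Analyzing the Storage Structure

Open playerlist.json and examine its hierarchy.

The JSON structure shows that the basic player information sits under payload/players. Each entry there has a playerProfile object that holds the player's country, name, height, weight, and so on.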
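
To make the payload/players layout concrete, here is a small sketch that walks a trimmed-down stand-in for the file. The `sample` dictionary is not real data: every value in it is a made-up placeholder, only a few fields are shown, and the helper names `list_profile_keys` and `summarize` are hypothetical.

```python
# Trimmed-down stand-in for playerlist.json; all values are made-up placeholders.
sample = {
    "payload": {
        "players": [
            {
                "playerProfile": {
                    "code": "player_a",
                    "displayName": "Player A",
                    "country": "Country X",
                    "height": "2.01",
                    "weight": "100.0",
                },
                "teamProfile": {"displayAbbr": "Team A"},
            }
        ]
    }
}


def list_profile_keys(json_data):
    """Return the sorted field names found in the first player's profile."""
    first_player = json_data["payload"]["players"][0]
    return sorted(first_player["playerProfile"].keys())


def summarize(json_data):
    """Return (player count, profile keys) as a quick view of the layout."""
    players = json_data["payload"]["players"]
    return len(players), list_profile_keys(json_data)
```

Running `summarize` on the real response is a quick way to see which fields are available before deciding what to extract.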

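3. Scraping the Data

First, I implemented a function that fetches the JSON. It builds an HTTP request header with a specific User-Agent to imitate a browser, uses the requests library to send an HTTP GET request to the given URL, and finally parses the text of the HTTP response as JSON.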

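import requests
import json
import csv

# JSON endpoint that backs the player index page
url = 'https://china.nba.cn/stats2/league/playerlist.json?locale=zh_CN'

def getJson(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53'
    }
    # A timeout keeps the script from hanging if the site never answers
    response = requests.get(url, headers=headers, timeout=10)
    # Fail early on an HTTP error instead of parsing an error page
    response.raise_for_status()
    json_data = json.loads(response.text)
    return json_data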
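
Since the endpoint did not always serve the expected file (see section 1), it may help to check the shape of the parsed response before extracting anything. This is a minimal sketch; `has_player_list` is a hypothetical helper, not part of the original program.

```python
def has_player_list(json_data):
    """Return True if json_data has a non-empty list under payload -> players."""
    if not isinstance(json_data, dict):
        return False
    payload = json_data.get("payload")
    if not isinstance(payload, dict):
        return False
    players = payload.get("players")
    return isinstance(players, list) and len(players) > 0
```

One way to use it is to call it on the value returned by `getJson` and stop with a clear message when it returns False, rather than letting a later `KeyError` hide the real cause.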

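The getData function builds a list of dictionaries and, following the structure analyzed in section 2, extracts each player's fields level by level and stores them.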

def getData(json_data):
    playerList = []
    # Visit each player entry
    for item in json_data['payload']['players']:
        player_dataDict = {}
        # Player name (the "code" field, used as the English name)
        name = item['playerProfile']['code']
        # Nationality
        country = item['playerProfile']['country']
        # Chinese name
        chinese_name = item['playerProfile']['displayName']
        # NBA experience
        experience = item['playerProfile']['experience']
        # Draft year
        draft_year = item['playerProfile']['draftYear']
        # Height
        height = item['playerProfile']['height']
        # Weight
        weight = item['playerProfile']['weight']
        # Player ID
        playerId = item['playerProfile']['playerId']
        # Court position
        position = item['playerProfile']['position']
        # Draft source
        schoolType = item['playerProfile']['schoolType']
        # Team
        team = item['teamProfile']['displayAbbr']

        # The keys are the Chinese CSV column headers used by writeData
        player_dataDict['球员编号'] = playerId       # player ID
        player_dataDict['英文名'] = name             # English name
        player_dataDict['中文名'] = chinese_name     # Chinese name
        player_dataDict['国籍'] = country            # nationality
        player_dataDict['选秀年'] = draft_year       # draft year
        player_dataDict['nba经验'] = experience      # NBA experience
        player_dataDict['身高'] = height             # height
        player_dataDict['体重'] = weight             # weight
        player_dataDict['球场位置'] = position       # court position
        player_dataDict['效力球队'] = team           # current team
        player_dataDict['选秀来源'] = schoolType     # draft source

        print(player_dataDict)
        playerList.append(player_dataDict)
    return playerList
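
The same extraction can be written in a table-driven way that tolerates missing fields. This is a sketch of an alternative, not the method used in this article: `PROFILE_FIELDS` and `extract_player` are hypothetical names, and filling missing values with an empty string is an assumption about how gaps should appear in the CSV.

```python
# CSV column name -> key inside playerProfile (the same fields getData reads)
PROFILE_FIELDS = {
    '球员编号': 'playerId',
    '英文名': 'code',
    '中文名': 'displayName',
    '国籍': 'country',
    '选秀年': 'draftYear',
    'nba经验': 'experience',
    '身高': 'height',
    '体重': 'weight',
    '球场位置': 'position',
    '选秀来源': 'schoolType',
}


def extract_player(item, default=''):
    """Build one CSV row from a player entry, using `default` for missing keys."""
    profile = item.get('playerProfile') or {}
    team = item.get('teamProfile') or {}
    row = {column: profile.get(key, default)
           for column, key in PROFILE_FIELDS.items()}
    # The team abbreviation lives in teamProfile, not playerProfile
    row['效力球队'] = team.get('displayAbbr', default)
    return row
```

With this helper, the loop body reduces to `playerList.append(extract_player(item))`, and a player without a team entry yields an empty cell instead of raising `KeyError`.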

Finally, the extracted data is written to a CSV file using UTF-8 encoding, which preserves the Chinese text.

def writeData(playerList):
    # Write the header row, then one row per player
    with open('player_data.csv', 'w', encoding='utf-8', newline='') as f:
        write = csv.DictWriter(f, fieldnames=['球员编号','英文名','中文名','国籍',
                                              '选秀年','nba经验','身高','体重',
                                              '球场位置','效力球队','选秀来源'])
        write.writeheader()
        for each in playerList:
            write.writerow(each)
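
A quick way to confirm that the Chinese headers and values survive the write is a round trip through the csv module. The sketch below does this in memory so it does not touch any file on disk; `rows_to_csv_text` and `csv_text_to_rows` are hypothetical helpers, and the sample row in the usage note is a placeholder.

```python
import csv
import io


def rows_to_csv_text(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text, newline='')))
```

For instance, `csv_text_to_rows(rows_to_csv_text([{'中文名': '球员A'}], ['中文名']))` gives back the same row. To check the real file, open player_data.csv with `encoding='utf-8'` and `newline=''` and pass the file object to `csv.DictReader`. If the file is meant to be opened in Excel, writing with `encoding='utf-8-sig'` is a common way to keep the Chinese text from appearing garbled.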

The functions are called as follows:

if __name__ == "__main__":
    json_data = getJson(url)
    playerList = getData(json_data)
    writeData(playerList)

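4. Summary

This crawler exercise improved my ability to use the browser's developer tools to analyze a target site's structure and data format. Using Python's scraping libraries, I successfully extracted the NBA player data from the page. Along the way I also ran into the site's anti-crawling measures and learned some strategies for dealing with them, such as setting appropriate request headers and using a proxy IP, to keep the scraping stable. In this exercise, it was the proxy IP that let me locate the data I wanted on the site.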