Scraping Every NBA Team's Game Records with Python


Summary

Goal: scrape every NBA team's game records for the 2018-2019 season from the stat-nba website.
Tech stack: Python, the requests library to fetch the data, the lxml library to parse it, and XPath to match the fields.


1. Analyzing the URLs first

  • Lakers game-record page URL:

    http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=LAL&PageNum=1000&Season0=2018&Season1=2019

  • Warriors game-record page URL:

    http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=GSW&PageNum=1000&Season0=2018&Season1=2019

  • Hawks game-record page URL:

    http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=ATL&PageNum=1000&Season0=2018&Season1=2019

    Notice that swapping a different Team_id into the URL jumps to the corresponding team's page; the sketch below builds such a URL programmatically.
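
A minimal sketch (the base URL and parameter names are copied verbatim from the examples above; letting requests assemble the query string avoids hand-concatenation mistakes):

import requests

# Query parameters observed in the URLs above; only Team_id varies per team.
params = {
    "crtcol": "date_out",
    "order": "0",
    "QueryType": "game",
    "GameType": "season",
    "Team_id": "LAL",       # swap in GSW, ATL, ... for other teams
    "PageNum": "1000",
    "Season0": "2018",
    "Season1": "2019",
}
res = requests.get("http://www.stat-nba.com/query_team.php", params=params)
print(res.url)  # should match the Lakers URL above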


2. Getting every NBA team's Team_id and Chinese name

import requests
from lxml import etree


def team_name():
    # Fetch the HTML
    url = "http://www.stat-nba.com/teamList.php"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    res = requests.get(url=url, headers=headers)
    res.encoding = "utf-8"
    text = res.text

    # Parse the HTML
    parse_html = etree.HTML(text)
    base_xpath1 = '//td/div[@class="team"]/a/div/text()'
    base_xpath2 = '//td/div[@class="team"]/a/@href'
    chinese_name = parse_html.xpath(base_xpath1)
    english_name = parse_html.xpath(base_xpath2)
    print(chinese_name)

    # Store the team names
    name = {}
    for e, c in zip(english_name, chinese_name):
        # each href ends in something like "LAL.html"; slice out the 3-letter Team_id
        e = e[-8:-5]
        name[e] = c
    with open("teamName.csv", "w", encoding="utf-8") as f:
        for key, value in name.items():
            # in text mode, Python translates "\n" into the OS-specific line ending
            line = key+","+value+"\n"
            f.write(line)


team_name()
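
A quick sanity check on the output (a hedged sketch: it assumes teamName.csv was written by team_name() above, one "Team_id,Chinese name" pair per line):

# Load the Team_id -> Chinese-name mapping back from the CSV.
teams = {}
with open("teamName.csv", "r", encoding="utf-8") as f:
    for line in f:
        team_id, cn_name = line.strip().split(",")
        teams[team_id] = cn_name
print(len(teams), "teams loaded")  # expect roughly 30 entries
print(teams.get("LAL"))            # the Lakers' Chinese name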



3. Scraping every NBA team's detailed game records for the 2018-2019 season

import requests
from lxml import etree
import time


def team_record(f, english_name, season0, season1):
    # Fetch the HTML. Keep the query string free of stray whitespace: a
    # backslash line continuation inside the string literal would splice the
    # next line's indentation into the URL and break the request.
    url = ("http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0"
           "&QueryType=game&GameType=season&Team_id=" + english_name +
           "&PageNum=1000&Season0=" + season0 + "&Season1=" + season1)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    res = requests.get(url=url, headers=headers)
    res.encoding = "utf-8"
    html = res.text
    print(url)
    # print(html)  # uncomment to inspect the raw page when debugging

    # Parse the HTML
    parse_html = etree.HTML(html)
    base_xpath_th = "//th/text()"
    base_xpath_tr = "//tbody/tr"
    all_th = parse_html.xpath(base_xpath_th)
    all_tr = parse_html.xpath(base_xpath_tr)
    # Write the header row to the file
    title = ",".join(all_th)+"\n"
    f.write(title)
    for tr in all_tr:
        # td_list = tr.xpath(".//*/text()")
        td_list = tr.xpath("./td/text() | ./td/a/text()")
        data = ",".join(td_list)+"\n"
        # Write each of this team's game rows to the file
        f.write(data)


def main(season_start, season_end):
    # Open the team-name file
    team_name = open("teamName.csv", "r", encoding="utf-8")
    # Open the game-record output file
    record = open(season_start+"-"+season_end+"teamRecord.csv", "w", encoding="utf-8")

    # Scrape each team's records
    for name in team_name:
        name = name.replace("\n", "")
        english_name = name.split(",")[0]
        chinese_name = name.split(",")[1]
        # write the team's Chinese name (plus "队", "team") as a section header
        record.write(chinese_name+"队\n")
        team_record(record, english_name, season_start, season_end)
        record.write("\n")
        # be polite to the server: pause between teams
        time.sleep(1)

    # Close the game-record file
    record.close()
    # Close the team-name file
    team_name.close()


if __name__ == "__main__":
    start = time.time()
    season_begin = "2018"
    season_finish = "2019"
    main(season_begin, season_finish)
    end = time.time()
    print("Total running time:", end-start)


Note:
Comparing the game-record page URLs across different seasons shows that changing the values in "&Season0=2018&Season1=2019" switches the season, which is why the code above replaces the two strings 2018 and 2019 with variables.
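
For example (a sketch, assuming the site serves older seasons through the same query parameters), you can pull several seasons in one run:

# Reuse main() above to scrape a range of seasons back to back.
for year in range(2015, 2019):
    main(str(year), str(year + 1))  # 2015-2016 through 2018-2019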

After the code finishes, two new files appear in your folder;
open them to check that the data actually came down.
(Screenshot: the NBA team Team_id file)
Easter egg
The URLs on the original site only show each NBA team's 82 regular-season games,
but by modifying the "&GameType=season" field in the URL,
you can display each team's regular-season and playoff data together!

How to modify it:
Just append a newline or a space after "&GameType=season", or delete it outright, and that's it.
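
The same idea as a hedged sketch (parameter names copied from the URLs above; that omitting GameType returns both game types is the behavior observed here, not a documented API):

import requests

params = {
    "crtcol": "date_out", "order": "0", "QueryType": "game",
    # "GameType": "season",  # leave this key out to get regular season + playoffs
    "Team_id": "LAL", "PageNum": "1000",
    "Season0": "2018", "Season1": "2019",
}
res = requests.get("http://www.stat-nba.com/query_team.php", params=params)
res.encoding = "utf-8"  # the page text includes Chinese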


4. Afterword

I recently needed some NBA data and stumbled across the stat-nba site.
Its color scheme is a bit retro, but the data is remarkably complete,
and the site is... really easy to scrape.
Don't worry: the code above pulls only a small amount of data, and slowly,
so it won't trouble the site.

This post is drawing to a close.
It was my first time writing a blog post in Markdown, and it felt pretty good.
Questions are welcome in the comments,
and if you liked it, feel free to like and bookmark!
