Summary
Goal: scrape every NBA team's complete game records for the 2018-2019 season from the stat-nba website.
Stack: Python, the requests library for fetching pages, the lxml library for parsing, and XPath for matching the data.
1. Analyzing the URLs
- Lakers game-record page URL:
http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=LAL&PageNum=1000&Season0=2018&Season1=2019
- Warriors game-record page URL:
http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=GSW&PageNum=1000&Season0=2018&Season1=2019
- Hawks game-record page URL:
http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0&QueryType=game&GameType=season&Team_id=ATL&PageNum=1000&Season0=2018&Season1=2019

Notice that changing the Team_id in the URL takes you to the corresponding team's page.
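Since only the query parameters vary, the URL pattern above can be captured in a small helper. This is an illustrative sketch (the `build_url` function is my own, not from the post); the parameter names are copied from the URLs observed above:

```python
from urllib.parse import urlencode

# Base endpoint for team game-record queries on stat-nba
BASE = "http://www.stat-nba.com/query_team.php"


def build_url(team_id, season0="2018", season1="2019"):
    # Parameter names taken verbatim from the observed URLs;
    # only Team_id (and the season bounds) change between pages.
    params = {
        "crtcol": "date_out",
        "order": 0,
        "QueryType": "game",
        "GameType": "season",
        "Team_id": team_id,
        "PageNum": 1000,
        "Season0": season0,
        "Season1": season1,
    }
    return BASE + "?" + urlencode(params)


print(build_url("LAL"))
```

Swapping in "GSW" or "ATL" for "LAL" reproduces the other two URLs above.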
2. Getting every NBA team's Team_id and Chinese name
- The team list page http://www.stat-nba.com/teamList.php contains each team's Team_id and Chinese name.
- Scrape the Team_id and Chinese name:
```python
import requests
from lxml import etree


def team_name():
    # Request the HTML
    url = "http://www.stat-nba.com/teamList.php"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    res = requests.get(url=url, headers=headers)
    res.encoding = "utf-8"
    text = res.text
    # Parse the HTML
    parse_html = etree.HTML(text)
    base_xpath1 = '//td/div[@class="team"]/a/div/text()'
    base_xpath2 = '//td/div[@class="team"]/a/@href'
    chinese_name = parse_html.xpath(base_xpath1)
    english_name = parse_html.xpath(base_xpath2)
    print(chinese_name)
    # Store the team names, keyed by Team_id
    name = {}
    for e, c in zip(english_name, chinese_name):
        e = e[-8:-5]  # cut the three-letter Team_id out of the href
        name[e] = c
    with open("teamName.csv", "w", encoding="utf-8") as f:
        for key, value in name.items():
            # Python handles the newline differences between operating systems
            line = key + "," + value + "\n"
            f.write(line)


team_name()
```
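The `e[-8:-5]` slice deserves a word: it implies each scraped href ends in an 8-character tail of the form `<Team_id>.html`, so taking characters -8 through -6 keeps just the three-letter id. A minimal illustration, using a hypothetical href in that assumed format:

```python
# Hypothetical href in the format the slice implies (not taken from the site):
# the last 8 characters are "LAL.html", so [-8:-5] keeps "LAL".
href = "team/LAL.html"
team_id = href[-8:-5]
print(team_id)  # LAL
```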
3. Scraping every NBA team's detailed 2018-2019 game records
```python
import requests
from lxml import etree
import time


def team_record(f, english_name, season0, season1):
    # Fetch the HTML
    url = ("http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0"
           "&QueryType=game&GameType=season&Team_id=" + english_name
           + "&PageNum=1000&Season0=" + season0 + "&Season1=" + season1)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
    res = requests.get(url=url, headers=headers)
    res.encoding = "utf-8"
    html = res.text
    print(url)
    # Parse the HTML
    parse_html = etree.HTML(html)
    base_xpath_th = "//th/text()"
    base_xpath_tr = "//tbody/tr"
    all_th = parse_html.xpath(base_xpath_th)
    all_tr = parse_html.xpath(base_xpath_tr)
    # Write the header row to the file
    title = ",".join(all_th) + "\n"
    f.write(title)
    for tr in all_tr:
        # Grab text sitting directly in a <td> as well as text inside <td><a>
        td_list = tr.xpath("./td/text() | ./td/a/text()")
        data = ",".join(td_list) + "\n"
        # Write each of this team's games to the file
        f.write(data)


def main(season_start, season_end):
    # Open the team-name file
    team_name = open("teamName.csv", "r", encoding="utf-8")
    # Open the team-record file
    record = open(season_start + "-" + season_end + "teamRecord.csv", "w", encoding="utf-8")
    # Scrape each team's records
    for name in team_name:
        name = name.replace("\n", "")
        english_name = name.split(",")[0]
        chinese_name = name.split(",")[1]
        record.write(chinese_name + "队\n")
        team_record(record, english_name, season_start, season_end)
        record.write("\n")
        time.sleep(1)  # pause between requests to go easy on the site
    # Close the team-record file
    record.close()
    # Close the team-name file
    team_name.close()


if __name__ == "__main__":
    start = time.time()
    season_begin = "2018"
    season_finish = "2019"
    main(season_begin, season_finish)
    end = time.time()
    print("Total running time:", end - start)
```
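The union XPath `./td/text() | ./td/a/text()` used above is worth a closer look: it collects cell text whether or not the value is wrapped in a link, and lxml returns the matches in document order. A self-contained illustration on a tiny hand-written table (the row data here is made up for the demo):

```python
from lxml import etree

# A minimal table mimicking the structure scraped above: some cells hold
# plain text, one wraps its value in an <a> link.
html = etree.HTML(
    "<table><tbody><tr>"
    "<td>2019-04-10</td><td><a href='#'>LAL</a></td><td>104</td>"
    "</tr></tbody></table>"
)
row = html.xpath("//tbody/tr")[0]
# Union XPath: direct <td> text OR text inside <td><a>, in document order
cells = row.xpath("./td/text() | ./td/a/text()")
print(cells)  # ['2019-04-10', 'LAL', '104']
```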
Note:
Comparing the URLs of the team game-record pages across different seasons shows that changing the values in "&Season0=2018&Season1=2019" switches the season, which is why the code above replaces the strings 2018 and 2019 with variables.
After the code finishes, two new files will appear in your folder; open them to check that all the data came down.
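Because the season bounds are plain parameters, several seasons can be scraped in one run by generating the pairs of year strings that `main(season_start, season_end)` expects. A small sketch (the range chosen here is just an example):

```python
# Generate consecutive (Season0, Season1) string pairs for main();
# each NBA season spans two calendar years.
seasons = [(str(y), str(y + 1)) for y in range(2016, 2019)]
print(seasons)  # [('2016', '2017'), ('2017', '2018'), ('2018', '2019')]
```

Each pair would then be passed to `main()` in a loop.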
Bonus:
The URLs above only return each NBA team's 82 regular-season games, but tweaking the "&GameType=season" field in the URL makes the page show both regular-season and playoff data!
The tweak: add a newline or a space after "&GameType=season", or simply delete it, and you're done.
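In code, the deletion variant of that tweak is a one-line string replacement on the query URL. A sketch, reusing the Lakers URL from section 1:

```python
# Regular-season-only URL from section 1
url = ("http://www.stat-nba.com/query_team.php?crtcol=date_out&order=0"
       "&QueryType=game&GameType=season&Team_id=LAL&PageNum=1000"
       "&Season0=2018&Season1=2019")

# Drop the GameType field so the page returns regular-season
# and playoff games together (the bonus tweak described above)
all_games_url = url.replace("&GameType=season", "")
print(all_games_url)
```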
4. Postscript
I recently needed some NBA data and stumbled upon the stat-nba site. Its color scheme is a bit retro, but the data is remarkably complete, and the site is genuinely easy to scrape.
Don't worry: the code above fetches only a small amount of data, and slowly, so it won't put any strain on the site.
That wraps up this post. It was my first time writing a blog post in Markdown, and it felt pretty good!
Questions are welcome in the comments, and if you enjoyed this, feel free to like and bookmark it!