《Python数据挖掘入门与实践》 (Learning Data Mining with Python): the NBA dataset for decision tree prediction

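Preface: When I reached the section on predicting team wins and losses with a decision tree, I tried to download the dataset from the URL given in the book, but the download never succeeded. The Excel file I did manage to download was corrupted. Well, I have learned Python, so I decided to scrape the data myself. (The files are linked below.)

Code: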

'''
    Scrape the NBA game results referred to in 《python数据挖掘入门与实践》
    https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
    Usage: load this .py file in an interpreter, then call save()
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = 'https://www.basketball-reference.com/leagues/NBA_2014_games-{month}.html'
ALL_MONTHS = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']

def get_content():
    rows = []  # avoid naming this "list", which would shadow the built-in
    for month in ALL_MONTHS:
        url = BASE_URL.format(month=month)
        print(url)
        html = urlopen(url).read()
        bs_obj = BeautifulSoup(html, 'lxml')
        # select() takes a CSS selector, so nested tags can be matched in one call
        for tr in bs_obj.select('tbody tr'):
            # within each tr tag, filter again for the td tags and keep their text
            cells = [td.text for td in tr.find_all('td')]
            rows.append(cells)
    return rows  # two-dimensional list, one inner list per table row

# Save in CSV format
def save():
    path = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\matches.csv'  # change this path to your own
    rows = get_content()
    df_data = pd.DataFrame(columns=[1, 2, 3, 4, 5, 6, 7, 8, 9], data=rows)
    # pass the path rather than an open file handle, so pandas opens and closes the file itself
    df_data.to_csv(path)
    print('done')

Output:

>>> save()
https://www.basketball-reference.com/leagues/NBA_2014_games-october.html
https://www.basketball-reference.com/leagues/NBA_2014_games-november.html
https://www.basketball-reference.com/leagues/NBA_2014_games-december.html
https://www.basketball-reference.com/leagues/NBA_2014_games-january.html
https://www.basketball-reference.com/leagues/NBA_2014_games-february.html
https://www.basketball-reference.com/leagues/NBA_2014_games-march.html
https://www.basketball-reference.com/leagues/NBA_2014_games-april.html
https://www.basketball-reference.com/leagues/NBA_2014_games-may.html
https://www.basketball-reference.com/leagues/NBA_2014_games-june.html
done

Data preview:
[image]
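
The row extraction in get_content() can be tried offline on a small HTML string. The following is only a minimal sketch: the table fragment, the team names, and the helper name extract_rows are made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies. It also shows one thing to watch for: any cell marked up as th rather than td is skipped by find_all('td'), and a repeated header row inside tbody comes back as an empty list.

```python
from bs4 import BeautifulSoup

# Made-up table fragment (not real data), loosely shaped like a schedule page.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>Visitor</th><th>PTS</th><th>Home</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th>Day 1</th><td>Team A</td><td>90</td><td>Team B</td><td>100</td></tr>
    <tr class="thead"><th>Date</th><th>Visitor</th><th>PTS</th><th>Home</th><th>PTS</th></tr>
    <tr><th>Day 2</th><td>Team C</td><td>101</td><td>Team D</td><td>99</td></tr>
  </tbody>
</table>
"""

def extract_rows(html):
    """Return one list of td texts per tr inside tbody, dropping rows without any td."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # a repeated header row holds only th cells, so its list is empty
            rows.append(cells)
    return rows

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

In this sketch the "Day 1" and "Day 2" cells do not appear in the result because they are th cells. If a column seems to be missing from matches.csv, checking whether the page marks it up as th is a reasonable first step.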

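Addendum: Reading further, I found that a second dataset is needed, but the code above cannot be used for it. The reason is that the team standings table is commented out in the page (you can see this by viewing the page source). So here a regular expression is used to pull the table body out of the comment.

Code: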

'''
    #get_standing_data.py
    Fetch the team standings data for the NBA decision tree prediction in 《python数据挖掘入门与实践》
    Change the output path to suit your machine
'''
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re

#pattern = re.compile(r'<!--[\s\S]*?-->')  # regex for an HTML comment: <!--[\s\S]*?-->
# modelled on the HTML comment regex; a raw string keeps \s from being read as a string escape
pattern = re.compile(r'<tbody>[\s\S]*?</tbody>')
url = 'https://www.basketball-reference.com/leagues/NBA_2013_standings.html'
html = urlopen(url).read()
bs_obj = BeautifulSoup(html, 'lxml')
content = bs_obj.find(id='all_expanded_standings').prettify()
match = re.search(pattern, content)
str_tbody = match.group()
html_tbody = BeautifulSoup(str_tbody, 'lxml')  # parse the matched string back into an HTML object
rows = []  # avoid naming this "list", which would shadow the built-in
for tr in html_tbody.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    rows.append(cells)

# Convert to CSV format
file = 'D:\\Python\\PythonProject\\nba_decisiontree_test\\standing.csv'  # change this path to your own
df_data = pd.DataFrame(data=rows)
df_data.to_csv(file)
print('done')



Partial data preview:

>>> df_data
                        0      1      2      3      4      5     6     7   \
0               Miami Heat  66-16   37-4  29-12  41-11   25-5  14-4  12-6   
1    Oklahoma City Thunder  60-22   34-7  26-15   21-9  39-13   7-3   8-2   
2        San Antonio Spurs  58-24   35-6  23-18   25-5  33-19   8-2   9-1   
3           Denver Nuggets  57-25   38-3  19-22  19-11  38-14   5-5  10-0   
4     Los Angeles Clippers  56-26   32-9  24-17   21-9  35-17   7-3   8-2   
5        Memphis Grizzlies  56-26   32-9  24-17   22-8  34-18   8-2   8-2   
6          New York Knicks  54-28  31-10  23-18  37-15  17-13  10-6  12-6   
7            Brooklyn Nets  49-33  26-15  23-18  36-16  13-17  11-5  13-5   
8           Indiana Pacers  49-32  30-11  19-21  31-20  18-12  6-11  13-3   
9    Golden State Warriors  47-35  28-13  19-22  19-11  28-24   7-3   5-5   
10           Chicago Bulls  45-37  24-17  21-20  34-18  11-19  13-5   9-7   
11         Houston Rockets  45-37  29-12  16-25   21-9  24-28   7-3   7-3   
12      Los Angeles Lakers  45-37  29-12  16-25  17-13  28-24   6-4   6-4   
13           Atlanta Hawks  44-38  25-16  19-22  29-23  15-15  7-11  11-7   
14               Utah Jazz  43-39  30-11  13-28  17-13  26-26   5-5   5-5   
15          Boston Celtics  41-40  27-13  14-27  27-24  14-16   7-9   8-9   
16        Dallas Mavericks  41-41  24-17  17-24  17-13  24-28   5-5   6-4   
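
As an alternative to the regular expression, Beautiful Soup exposes HTML comments as Comment nodes, so a commented-out table can be located by type and then parsed again. The following is only a sketch on a made-up HTML string: the team names, records, and the helper name rows_from_commented_table are hypothetical, and whether the live page still wraps the table this way is an assumption. It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Made-up fragment (not real data): a table hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_expanded_standings">
  <!--
  <table>
    <tbody>
      <tr><th>1</th><td>Team A</td><td>60-22</td></tr>
      <tr><th>2</th><td>Team B</td><td>55-27</td></tr>
    </tbody>
  </table>
  -->
</div>
"""

def rows_from_commented_table(html, container_id):
    """Collect the td texts of every tbody row found inside comments under the container."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(id=container_id)
    rows = []
    # Comment nodes are strings in the parse tree, so filter string nodes by type
    for comment in container.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(str(comment), 'html.parser')  # parse the comment body as HTML
        for tr in inner.select('tbody tr'):
            rows.append([td.get_text(strip=True) for td in tr.find_all('td')])
    return rows

standing_rows = rows_from_commented_table(SAMPLE_HTML, 'all_expanded_standings')
print(standing_rows)
```

Compared with the regex, this does not depend on how prettify() lays out the text, at the cost of parsing the comment body a second time.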

File resources: if this was useful, please give it a like.

Link: https://pan.baidu.com/s/1eUfa914 Password: 5ptu
