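I don't follow the NBA. Some teams have been renamed and I couldn't tell which ones, so I couldn't get the later steps to work.
The source code comes from the book Python数据挖掘入门分析与实践; I adapted it a little.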
https://github.com/iamjiangmioamiao/NBA
I downloaded the data. The script below merges the downloaded Excel files into one:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20 18:47:09 2017
@author: Administrator
"""
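# Merge several Excel files into one
# Note: xlrd 2.0 and later read only .xls files, so reading .xlsx this way needs xlrd < 2.0
import xlrd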
import xlsxwriter
import glob
from openpyxl import workbook
# Open an Excel file
def open_xls(file):
    fh=xlrd.open_workbook(file)
    return fh
# Get all sheets in the workbook
def getsheet(fh):
    return fh.sheets()
# Get the number of rows in a sheet
def getnrows(fh,sheet):
    table=fh.sheets()[sheet]
    return table.nrows
# Read one sheet and append its rows to the global list datavalue
def getFilect(file,shnum):
    fh=open_xls(file)
    table=fh.sheets()[shnum]
    num=table.nrows
    for row in range(num):
        rdata=table.row_values(row)
        datavalue.append(rdata)
    return datavalue
# Get the number of sheets in the workbook
def getshnum(fh):
    return len(getsheet(fh))
if __name__=='__main__':
    # Build the list of Excel files to merge
    import os
    import os.path
    rootdir = '../data/seventeen'
    allxls=[]
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for i in filenames:
            allxls.append(os.path.join(dirpath,i))
    # Holds every row read so far
    datavalue=[]
    for fl in allxls:
        fh=open_xls(fl)
        x=getshnum(fh)
        for shnum in range(x):
            print("Reading file: "+str(fl)+", sheet "+str(shnum)+" ...")
            rvalue=getFilect(fl,shnum)
    # Path of the merged output file
    # Note: end.xlsx lands in the directory being scanned, so remove it before re-running
    endfile='../data/seventeen/end.xlsx'
    wb1=xlsxwriter.Workbook(endfile)
    # Create a worksheet
    ws=wb1.add_worksheet()
    for a in range(len(rvalue)):
        for b in range(len(rvalue[a])):
            c=rvalue[a][b]
            ws.write(a,b,c)
    wb1.close()
    print("Merge complete")
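One thing to watch with a row-by-row merge like this: every row of every sheet is appended, so if each file starts with its own header row (an assumption, since the raw files are not shown here), those headers would land in the middle of the merged data. A minimal sketch of keeping the header only once, using plain lists instead of Excel sheets; `merge_tables` and the sample rows are made up for illustration:

```python
def merge_tables(tables):
    """tables: list of tables, each a list of rows whose first row is the header.
    Returns the header once, followed by the data rows of every table."""
    merged = []
    for table in tables:
        if not table:
            continue  # skip empty sheets
        header, body = table[0], table[1:]
        if not merged:
            merged.append(header)  # keep the first header only
        elif header != merged[0]:
            raise ValueError("header mismatch: {!r}".format(header))
        merged.extend(body)
    return merged

# Hypothetical monthly tables, not taken from the real files
april = [["Date", "Visitor", "Home"], ["Apr 1", "A", "B"]]
may = [["Date", "Visitor", "Home"], ["May 1", "C", "D"]]
merged = merge_tables([april, may])
print(merged)
```

The same idea would carry over to the xlrd version by skipping row 0 of every sheet after the first.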
Now that the data is ready, load it with pandas:
from collections import defaultdict
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data_filename = '../data/seventeen/end.xlsx'
results = pd.read_excel(data_filename, skiprows=[0])
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]
print (results.loc[:5])
               Date  Score Type        Visitor Team  VisitorPts  \
0  Sat, Apr 1, 2017   Box Score       Atlanta Hawks         104
1  Sat, Apr 1, 2017   Box Score  Los Angeles Lakers         104
2  Sat, Apr 1, 2017   Box Score    Sacramento Kings         123
3  Sat, Apr 1, 2017   Box Score        Phoenix Suns         117
4  Sun, Apr 2, 2017   Box Score       Atlanta Hawks          82
5  Sun, Apr 2, 2017   Box Score      Indiana Pacers         130

                Home Team  HomePts  OT?  Notes
0           Chicago Bulls      106  NaN    NaN
1    Los Angeles Clippers      115  NaN    NaN
2  Minnesota Timberwolves      117  NaN    NaN
3  Portland Trail Blazers      130  NaN    NaN
4           Brooklyn Nets       91  NaN    NaN
5     Cleveland Cavaliers      135  2OT    NaN
results["HomeWin"] = results["VisitorPts"] < results["HomePts"]
y_true = results["HomeWin"].values
print (y_true[:5])
print("Home Win percentage: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))
print (results.loc[:5])
The code above creates a new feature, HomeWin, and computes the share of games won by the home team. The output is:
[ True True False True True]
Home Win percentage: 58.3%
               Date  Score Type        Visitor Team  VisitorPts  \
0  Sat, Apr 1, 2017   Box Score       Atlanta Hawks         104
1  Sat, Apr 1, 2017   Box Score  Los Angeles Lakers         104
2  Sat, Apr 1, 2017   Box Score    Sacramento Kings         123
3  Sat, Apr 1, 2017   Box Score        Phoenix Suns         117
4  Sun, Apr 2, 2017   Box Score       Atlanta Hawks          82
5  Sun, Apr 2, 2017   Box Score      Indiana Pacers         130

                Home Team  HomePts  OT?  Notes  HomeWin
0           Chicago Bulls      106  NaN    NaN     True
1    Los Angeles Clippers      115  NaN    NaN     True
2  Minnesota Timberwolves      117  NaN    NaN    False
3  Portland Trail Blazers      130  NaN    NaN     True
4           Brooklyn Nets       91  NaN    NaN     True
5     Cleveland Cavaliers      135  2OT    NaN     True
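To make the HomeWin rule concrete, here is the same comparison done by hand on the six rows printed above, without pandas. It is only a sketch: the 58.3% figure comes from the full dataset, while these six rows on their own give a different percentage.

```python
# (VisitorPts, HomePts) for the six games printed above
scores = [(104, 106), (104, 115), (123, 117), (117, 130), (82, 91), (130, 135)]

# The home team wins when it scores more than the visitor
home_win = [visitor_pts < home_pts for visitor_pts, home_pts in scores]

# Share of True values, as a percentage
home_win_pct = 100 * sum(home_win) / len(home_win)
print(home_win)
print("Home Win percentage: {0:.1f}%".format(home_win_pct))
```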
Next, create two new features that use the result of each team's previous game as a rough measure of team strength:
results["HomeLastWin"] = False
results["VisitorLastWin"] = False
## This creates two new columns, all set to False
print (results.loc[:5])
print (results["HomeLastWin"])
Output:
               Date  Score Type        Visitor Team  VisitorPts  \
0  Sat, Apr 1, 2017   Box Score       Atlanta Hawks         104
1  Sat, Apr 1, 2017   Box Score  Los Angeles Lakers         104
2  Sat, Apr 1, 2017   Box Score    Sacramento Kings         123
3  Sat, Apr 1, 2017   Box Score        Phoenix Suns         117
4  Sun, Apr 2, 2017   Box Score       Atlanta Hawks          82
5  Sun, Apr 2, 2017   Box Score      Indiana Pacers         130

                Home Team  HomePts  OT?  Notes  HomeWin  HomeLastWin  \
0           Chicago Bulls      106  NaN    NaN     True        False
1    Los Angeles Clippers      115  NaN    NaN     True        False
2  Minnesota Timberwolves      117  NaN    NaN    False        False
3  Portland Trail Blazers      130  NaN    NaN     True        False
4           Brooklyn Nets       91  NaN    NaN     True        False
5     Cleveland Cavaliers      135  2OT    NaN     True        False

   VisitorLastWin
0           False
1           False
2           False
3           False
4           False
5           False
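Now for the main point. We created all these features so that a decision tree can use them, but creating the columns is not enough: they still have to be filled in row by row. iterrows() does that. A quick demo first: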
import pandas as pd
df = pd.DataFrame([[1, 1.5],[2,2]], columns=['int', 'float'])
print (df)
for index,row in df.iterrows():
    print (index,row)
Output:
   int  float
0    1    1.5
1    2    2.0
0 int      1.0
float    1.5
Name: 0, dtype: float64
1 int      2.0
float    2.0
Name: 1, dtype: float64
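Notice in the demo output that the int column comes back as 1.0 and 2.0. iterrows() builds one Series per row, and a Series has a single dtype, so the integers are upcast to float. A small sketch of the difference compared with itertuples(), which builds each row from the individual column values; the variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 1.5], [2, 2]], columns=["int", "float"])

# iterrows() yields (index, Series); the Series has one dtype, so the int becomes a float
_, first_row = next(df.iterrows())

# itertuples() yields namedtuples built column by column, so each value keeps its own type
first_tuple = next(df.itertuples(index=False))

print(type(first_row["int"]).__name__, type(first_tuple.int).__name__)
```

The row returned by iterrows() is also a separate object from the DataFrame, which is why the loops in this post write it back with `results.loc[index] = row`.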
Putting it to use right away:
from collections import defaultdict
won_last = defaultdict(int)
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.loc[index] = row
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
# This loop fills the two new features with the result of each team's previous game.
# For every row, HomeLastWin and VisitorLastWin are read from won_last first; won_last is
# then updated with the current HomeWin, so the next row involving these teams sees it.
print (results.loc[20:25])
Output:
                Date  Score Type       Visitor Team  VisitorPts  \
20  Tue, Apr 4, 2017   Box Score     Denver Nuggets         134
21  Tue, Apr 4, 2017   Box Score      Chicago Bulls          91
22  Tue, Apr 4, 2017   Box Score    Milwaukee Bucks          79
23  Tue, Apr 4, 2017   Box Score      Brooklyn Nets         141
24  Tue, Apr 4, 2017   Box Score   Dallas Mavericks          87
25  Tue, Apr 4, 2017   Box Score  Memphis Grizzlies          89

                Home Team  HomePts  OT?  Notes  HomeWin  HomeLastWin  \
20   New Orleans Pelicans      131  NaN    NaN    False        False
21        New York Knicks      100  NaN    NaN     True        False
22  Oklahoma City Thunder      110  NaN    NaN     True        False
23     Philadelphia 76ers      118  NaN    NaN    False        False
24       Sacramento Kings       98  NaN    NaN     True         True
25      San Antonio Spurs       95   OT    NaN     True         True

    VisitorLastWin
20            True
21            True
22           False
23            True
24            True
25           False
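The bookkeeping behind HomeLastWin and VisitorLastWin is easier to follow without the DataFrame around it. Below is a minimal pure-Python sketch of the same idea; the function name and the three-game schedule are invented for illustration, and it uses `defaultdict(bool)` so that a team with no earlier game counts as "did not win":

```python
from collections import defaultdict

def add_last_win(games):
    """games: list of (visitor, home, home_win) in chronological order.
    Returns a list of (home_last_win, visitor_last_win) pairs, one per game."""
    won_last = defaultdict(bool)  # team -> did it win its previous game?
    features = []
    for visitor, home, home_win in games:
        # Read the features BEFORE updating, so a game never sees its own result
        features.append((won_last[home], won_last[visitor]))
        won_last[home] = home_win
        won_last[visitor] = not home_win
    return features

# Hypothetical schedule, not taken from the dataset
games = [("A", "B", True), ("B", "C", False), ("C", "A", True)]
features = add_last_win(games)
print(features)
```

The order inside the loop matters: reading first and updating afterwards is what keeps the current game's outcome out of its own features.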
Now for the decision tree:
# Create a dataset with just the necessary information
X_previouswins = results[["HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Using just the last result from the home and visitor teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Prediction output:
Using just the last result from the home and visitor teams
Accuracy: 58.3%
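The 58.3% accuracy is the same figure as the home win percentage printed earlier. That suggests, without proving it, that with these two features the tree does no better than always predicting a home win. A majority-class baseline makes this kind of comparison explicit; the sketch below uses made-up labels (7 home wins in 12 games) rather than the real data:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 7 home wins out of 12 games
labels = [True] * 7 + [False] * 5
baseline = majority_baseline_accuracy(labels)
print("Baseline accuracy: {0:.1f}%".format(baseline * 100))
```

Any feature set worth keeping should beat this number on the real `y_true`.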
That is still not enough features, so the next step is to create more:
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0
# How many games in a row had each team won before this one?
from collections import defaultdict
win_streak = defaultdict(int)
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeWinStreak"] = win_streak[home_team]
    row["VisitorWinStreak"] = win_streak[visitor_team]
    results.loc[index] = row
    # Set current win
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1
print (results.loc[:5])
clf = DecisionTreeClassifier(random_state=14)
X_winstreak = results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
# The label below was copied from the ranking step; this run uses the win-streak features
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
ladder = pd.read_excel('../data/sportsref_download.xlsx')
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Renamed teams must be mapped to the name used in the ladder, for example:
    # if home_team == "New Orleans Pelicans":
    #     home_team = "New Orleans Hornets"
    # elif visitor_team == "New Orleans Pelicans":
    #     visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values
    try:
        row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    except Exception:
        # A team name missing from the ladder gives an empty array; fall back to 0
        row["HomeTeamRanksHigher"] = 0
    results.loc[index] = row
X_homehigher = results[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
parameter_space = {
    "max_depth": list(range(1, 21)),
}
clf = DecisionTreeClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_homehigher, y_true)
#print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
last_match_winner = defaultdict(int)
results["HomeTeamWonLast"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Set in the row, who won the last encounter
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    results.loc[index] = row
    # Who won this one?
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
X_home_higher = results[["HomeTeamRanksHigher", "HomeTeamWonLast"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_home_higher, y_true, scoring='accuracy')
# Features here: ladder ranking plus the winner of the previous meeting
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
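Output, in the order the code prints it (win streaks, ladder ranking, ranking plus previous head-to-head winner):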
Using whether the home team is ranked higher
Accuracy: 63.3%
Using whether the home team is ranked higher
Accuracy: 66.0%
Using whether the home team is ranked higher
Accuracy: 68.4%
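The win-streak and head-to-head features both come down to a dictionary that is read before it is updated. A pure-Python sketch of the two together, under the assumption that games arrive in chronological order; the function name and the sample schedule are invented for illustration:

```python
from collections import defaultdict

def streak_and_head_to_head(games):
    """games: list of (visitor, home, home_win) in chronological order.
    Returns one dict of feature values per game."""
    win_streak = defaultdict(int)  # team -> current run of consecutive wins
    last_match_winner = {}         # sorted (team, team) pair -> winner of their previous meeting
    rows = []
    for visitor, home, home_win in games:
        pair = tuple(sorted([home, visitor]))  # same key whichever team is at home
        rows.append({
            "HomeWinStreak": win_streak[home],
            "VisitorWinStreak": win_streak[visitor],
            "HomeTeamWonLast": int(last_match_winner.get(pair) == home),
        })
        # Update the state only after the features for this game are recorded
        winner, loser = (home, visitor) if home_win else (visitor, home)
        win_streak[winner] += 1
        win_streak[loser] = 0
        last_match_winner[pair] = winner
    return rows

# Hypothetical schedule, not taken from the dataset
games = [("A", "B", True), ("C", "B", True), ("A", "B", False)]
rows = streak_and_head_to_head(games)
print(rows)
```

Sorting the pair before using it as a key is what lets a B-at-home game and an A-at-home game share one head-to-head record.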
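Beyond this point I could no longer make sense of the team names. The code fails with an error about new labels, which means some team names have changed. Fixing it was giving me a headache, so I stopped here.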
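One possible way around the renaming problem is to normalise team names before comparing them, and to list whatever still fails to match. This is only a sketch: `TEAM_ALIASES`, `canonical_team` and `unmatched_teams` are hypothetical helpers, and the single alias is the Hornets/Pelicans pair from the commented-out code earlier; the real files may need other entries.

```python
# Hypothetical alias table: old name -> name used in the results file
TEAM_ALIASES = {
    "New Orleans Hornets": "New Orleans Pelicans",
}

def canonical_team(name):
    """Strip stray whitespace and map a known old name to the current one."""
    name = name.strip()
    return TEAM_ALIASES.get(name, name)

def unmatched_teams(result_teams, ladder_teams):
    """Names from the results with no match in the ladder after normalising both sides."""
    ladder = {canonical_team(t) for t in ladder_teams}
    return sorted({canonical_team(t) for t in result_teams} - ladder)

# Made-up name lists for illustration
missing = unmatched_teams(
    ["New Orleans Pelicans", "Chicago Bulls "],
    ["New Orleans Hornets", "Chicago Bulls", "Atlanta Hawks"],
)
print(missing)
```

Running `unmatched_teams` on the real team columns first would show exactly which names need an alias before any encoding step is attempted.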