import os
import numpy as np
import pandas as pd
home_folder = "./PythonDataMining/"
data_folder = os.path.join(home_folder,'data')
data_filename = os.path.join(data_folder, "leagues_NBA_2014_games_games.csv")
3.1.2 Loading the dataset with pandas
results = pd.read_csv(data_filename)
results.iloc[:5]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Orlando Magic | 87 | Indiana Pacers | 97 | NaN | NaN |
1 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
2 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
3 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
4 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |
3.1.3 Cleaning the dataset
results = pd.read_csv(data_filename, skiprows=[0,])
# Fix the name of the columns
results.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]
results.iloc[:5]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes |
---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN |
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN |
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN |
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN |
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN |
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
y_true = results['HomeWin'].values
results.iloc[:5]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin |
---|---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True |
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True |
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True |
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True |
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True |
print("Home win percentage: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))
results["HomeLastWin"] = False
results["VisitorLastWin"] = False
# This creates two new columns, all set to False
results.iloc[:5]
Home win percentage: 58.0%
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | False | False |
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | False | False |
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | False | False |
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | False | False |
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | False | False |
Now we compute the actual values for these two columns: did the home team and the visiting team each win their previous game?
# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(int)  # teams with no earlier game default to 0 (falsy)
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Record each team's previous result before updating it
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.loc[index] = row  # iterrows yields index labels, so use .loc
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results.iloc[20:25]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin |
---|---|---|---|---|---|---|---|---|---|---|---|
20 | Fri Nov 1 2013 | Box Score | Miami Heat | 100 | Brooklyn Nets | 101 | NaN | NaN | True | False | False |
21 | Fri Nov 1 2013 | Box Score | Cleveland Cavaliers | 84 | Charlotte Bobcats | 90 | NaN | NaN | True | False | True |
22 | Fri Nov 1 2013 | Box Score | Portland Trail Blazers | 113 | Denver Nuggets | 98 | NaN | NaN | False | False | False |
23 | Fri Nov 1 2013 | Box Score | Dallas Mavericks | 105 | Houston Rockets | 113 | NaN | NaN | True | True | True |
24 | Fri Nov 1 2013 | Box Score | San Antonio Spurs | 91 | Los Angeles Lakers | 85 | NaN | NaN | False | False | True |
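The key point of the loop above is that each row is filled in with the teams' state before the game, and only then is the state updated. A minimal pure-Python sketch of the same idea, on a made-up schedule (the team names and results are hypothetical, and `defaultdict(bool)` is used here instead of `defaultdict(int)` purely for readability):

```python
from collections import defaultdict

# Hypothetical toy schedule: (home team, visitor team, home team won?)
games = [
    ("A", "B", True),
    ("B", "C", False),
    ("A", "C", True),
    ("C", "B", True),
]

won_last = defaultdict(bool)  # teams with no earlier game default to False
features = []
for home, visitor, home_win in games:
    # Record the state *before* this game is played, so no result leaks in
    features.append((won_last[home], won_last[visitor]))
    # Then update the state with this game's outcome
    won_last[home] = home_win
    won_last[visitor] = not home_win

for game, feature in zip(games, features):
    print(game, "->", feature)
```

Updating before recording would leak the current result into its own feature, which would make the later accuracy figures meaningless.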
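3.2 Decision trees
A decision tree is a supervised machine learning algorithm. It looks like a flowchart made of a series of nodes, where the value tested at each node determines which node to visit next.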
%%html
<img src="./image/决策树1.png" width="100" height="100">
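Like most classification algorithms, decision trees work in two stages.
First is the training stage, in which a tree is built from the training data. The nearest-neighbor algorithm from the previous chapter had no training stage, but decision trees need one. In this sense nearest neighbor is a lazy learner: it only does any work when asked to classify a sample. Decision trees, like most machine learning methods, are eager learners that build their model during training.
Second is the prediction stage, in which the trained tree predicts the class of new samples. In the tree shown above, the sample ["is raining", "very windy"] is classified as "Bad" (bad weather).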
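The prediction stage can be sketched as a walk from the root to a leaf. The tree below is hypothetical: the original figure is not reproduced here, so its structure and the leaf labels other than "Bad" are assumptions chosen only to match the example in the text.

```python
# Hypothetical tree; only the "is raining" + "very windy" -> "Bad" path
# comes from the text, the rest is an illustrative assumption.
tree = {
    "feature": "is raining",
    "yes": {
        "feature": "very windy",
        "yes": "Bad",
        "no": "Okay",
    },
    "no": "Good",
}

def predict(node, sample):
    """Walk from the root to a leaf; `sample` is the set of features that hold."""
    while isinstance(node, dict):
        branch = "yes" if node["feature"] in sample else "no"
        node = node[branch]
    return node

print(predict(tree, {"is raining", "very windy"}))
```

Training is the expensive part; once the tree exists, a prediction costs only one comparison per level.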
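There are many algorithms for building decision trees, and most of them work iteratively. They start at the root node and choose the best feature for the first decision, then move to each resulting node and choose the next best feature, and so on. The process stops when adding further levels to the tree would yield no more information.
scikit-learn implements the Classification and Regression Trees (CART) algorithm as its default decision tree learner. CART supports both continuous and categorical features, although scikit-learn's implementation expects categorical features to be encoded as numbers first.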
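To make "choosing the best feature" concrete, here is a minimal sketch that scores boolean features by Gini impurity, the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names `gini` and `split_impurity` and the toy data are assumptions made for illustration; this is not the library's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after splitting on one boolean feature."""
    left = [y for x, y in zip(rows, labels) if x[feature]]
    right = [y for x, y in zip(rows, labels) if not x[feature]]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Hypothetical toy data: feature 0 separates the classes perfectly, feature 1 does not
rows = [(True, True), (True, False), (False, True), (False, False)]
labels = ["win", "win", "loss", "loss"]

# The best feature is the one that leaves the purest child nodes
best = min(range(2), key=lambda f: split_impurity(rows, labels, f))
print("impurity before split:", gini(labels))
print("best feature:", best)
```

A tree builder repeats this choice recursively on each child node until no split reduces the impurity any further.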
3.2.1 Parameters of decision trees
3.2.2 Using decision trees
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)
from sklearn.model_selection import cross_val_score
X_previouswins = results[['HomeLastWin','VisitorLastWin']].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf,X_previouswins,y_true,scoring = 'accuracy')
print('Using just the last result from the home and visitor teams')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
Using just the last result from the home and visitor teams
Accuracy: 59.1%
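The accuracy reported by `cross_val_score` is the mean over several train/test splits. The following sketch shows the mechanics with plain contiguous folds and a stand-in "model" that always predicts the majority class. Both are simplifying assumptions: scikit-learn uses stratified folds for classifiers, and the labels here are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def majority_class(labels):
    """A stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

# Hypothetical labels: True means the home team won
y = [True, True, False, True, False, True, True, False, True, True]

scores = []
for train, test in k_fold_indices(len(y), k=5):
    prediction = majority_class([y[i] for i in train])
    correct = sum(y[i] == prediction for i in test)
    scores.append(correct / len(test))

print("fold accuracies:", scores)
print("mean accuracy: {0:.1f}%".format(100 * sum(scores) / len(scores)))
```

A majority-class predictor like this one is also a useful baseline: a real model is only interesting if it beats it, which is why the home-win percentage was computed earlier.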
3.3 Predicting sports results
# What about win streaks?
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0
# How many games in a row has each team won coming into this game?
from collections import defaultdict
win_streak = defaultdict(int)
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeWinStreak"] = win_streak[home_team]
    row["VisitorWinStreak"] = win_streak[visitor_team]
    results.loc[index] = row
    # Update the streaks with the current result
    if row["HomeWin"]:
        win_streak[home_team] += 1
        win_streak[visitor_team] = 0
    else:
        win_streak[home_team] = 0
        win_streak[visitor_team] += 1
clf = DecisionTreeClassifier(random_state=14)
X_winstreak = results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(clf, X_winstreak, y_true, scoring='accuracy')
print("Using the last result and the win streaks of both teams")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using the last result and the win streaks of both teams
Accuracy: 58.4%
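The streak bookkeeping can be checked on a small made-up schedule. This is a sketch with hypothetical teams and results; as before, the feature is recorded before the counters are updated.

```python
from collections import defaultdict

# Hypothetical toy schedule: (home team, visitor team, home team won?)
games = [
    ("A", "B", True),
    ("A", "C", True),
    ("B", "A", False),
    ("C", "A", True),
]

win_streak = defaultdict(int)  # every team starts with a streak of 0
streak_features = []
for home, visitor, home_win in games:
    # Streaks going into the game
    streak_features.append((win_streak[home], win_streak[visitor]))
    winner, loser = (home, visitor) if home_win else (visitor, home)
    win_streak[winner] += 1  # extend the winner's streak
    win_streak[loser] = 0    # a loss resets the streak

print(streak_features)
```

Team A enters its fourth game on a three-game streak, which is then reset to zero by the loss.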
Next, let's check which of the two teams is ranked higher, using the standings (the "ladder") from the previous season.
ladder_filename = os.path.join(data_folder, "leagues_NBA_2013_standings_expanded-standings.csv")
ladder = pd.read_csv(ladder_filename)
ladder.head()
| | Rk | Team | Overall | Home | Road | E | W | A | C | SE | ... | Post | ≤3 | ≥10 | Oct | Nov | Dec | Jan | Feb | Mar | Apr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Miami Heat | 66-16 | 37-4 | 29-12 | 41-11 | 25-5 | 14-4 | 12-6 | 15-1 | ... | 30-2 | 9-3 | 39-8 | 1-0 | 10-3 | 10-5 | 8-5 | 12-1 | 17-1 | 8-1 |
1 | 2 | Oklahoma City Thunder | 60-22 | 34-7 | 26-15 | 21-9 | 39-13 | 7-3 | 8-2 | 6-4 | ... | 21-8 | 3-6 | 44-6 | NaN | 13-4 | 11-2 | 11-5 | 7-4 | 12-5 | 6-2 |
2 | 3 | San Antonio Spurs | 58-24 | 35-6 | 23-18 | 25-5 | 33-19 | 8-2 | 9-1 | 8-2 | ... | 16-12 | 9-5 | 31-10 | 1-0 | 12-4 | 12-4 | 12-3 | 8-3 | 10-4 | 3-6 |
3 | 4 | Denver Nuggets | 57-25 | 38-3 | 19-22 | 19-11 | 38-14 | 5-5 | 10-0 | 4-6 | ... | 24-4 | 11-7 | 28-8 | 0-1 | 8-8 | 9-6 | 12-3 | 8-4 | 13-2 | 7-1 |
4 | 5 | Los Angeles Clippers | 56-26 | 32-9 | 24-17 | 21-9 | 35-17 | 7-3 | 8-2 | 6-4 | ... | 17-9 | 3-5 | 38-12 | 1-0 | 8-6 | 16-0 | 9-7 | 8-5 | 7-7 | 7-1 |
5 rows × 24 columns
# All the features here seem to be reduced to a few categories (e.g. True and False); otherwise information gain would have to be computed over many values
# We can create a new feature -- HomeTeamRanksHigher
results["HomeTeamRanksHigher"] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # The Pelicans appear under their old name in last season's standings
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = ladder[ladder["Team"] == home_team]["Rk"].values[0]
    visitor_rank = ladder[ladder["Team"] == visitor_team]["Rk"].values[0]
    # Note: Rk 1 is the best team, so this flag is 1 when the home team has the larger Rk value
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    results.loc[index] = row
results[:5]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 |
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 |
X_homehigher = results[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Using whether the home team is ranked higher")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using whether the home team is ranked higher
Accuracy: 60.2%
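The rank lookup in the loop above filters the whole standings table twice per game. A dictionary lookup is a common alternative; the sketch below assumes a hand-written excerpt of the standings (the rank given for the Hornets is a placeholder, not the real value) and a hypothetical helper `home_ranked_better`. It also spells out the direction of the comparison: since Rk 1 is the best team, "ranked higher" means a smaller Rk. The loop above uses `>` instead, which flips the flag; for a binary feature a decision tree can learn either orientation, so the accuracy should not depend on it.

```python
# Excerpt of last season's standings: team -> Rk (1 is the best).
# The Hornets' rank is a placeholder chosen for illustration.
rank = {"Miami Heat": 1, "Oklahoma City Thunder": 2, "New Orleans Hornets": 26}

# Teams that changed name between seasons (the rename handled in the loop above)
renamed = {"New Orleans Pelicans": "New Orleans Hornets"}

def home_ranked_better(home_team, visitor_team):
    """Return 1 if the home team finished higher in the standings (smaller Rk)."""
    home_rank = rank[renamed.get(home_team, home_team)]
    visitor_rank = rank[renamed.get(visitor_team, visitor_team)]
    return int(home_rank < visitor_rank)

print(home_ranked_better("Miami Heat", "New Orleans Pelicans"))
print(home_ranked_better("New Orleans Pelicans", "Oklahoma City Thunder"))
```

With the real data, the `rank` dictionary could be built once from the `Team` and `Rk` columns of the standings table.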
from sklearn.model_selection import GridSearchCV
parameter_space = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
}
clf = DecisionTreeClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_homehigher, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
Accuracy: 60.5%
# Who won the last match between these two teams? We ignore home/visitor for this bit
last_match_winner = defaultdict(int)
results['HomeTeamWonLast'] = 0
for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Record, in the current row, who won the previous meeting
    row["HomeTeamWonLast"] = 1 if last_match_winner[teams] == row["Home Team"] else 0
    results.loc[index] = row
    # Winner of this match
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
results.loc[:5]
| | Date | Score Type | Visitor Team | VisitorPts | Home Team | HomePts | OT? | Notes | HomeWin | HomeLastWin | VisitorLastWin | HomeWinStreak | VisitorWinStreak | HomeTeamRanksHigher | HomeTeamWonLast |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue Oct 29 2013 | Box Score | Los Angeles Clippers | 103 | Los Angeles Lakers | 116 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
1 | Tue Oct 29 2013 | Box Score | Chicago Bulls | 95 | Miami Heat | 107 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Wed Oct 30 2013 | Box Score | Brooklyn Nets | 94 | Cleveland Cavaliers | 98 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
3 | Wed Oct 30 2013 | Box Score | Atlanta Hawks | 109 | Dallas Mavericks | 118 | NaN | NaN | True | 0 | 0 | 0 | 0 | 1 | 0 |
4 | Wed Oct 30 2013 | Box Score | Washington Wizards | 102 | Detroit Pistons | 113 | NaN | NaN | True | 0 | 0 | 0 | 0 | 0 | 0 |
5 | Wed Oct 30 2013 | Box Score | Los Angeles Lakers | 94 | Golden State Warriors | 125 | NaN | NaN | True | 0 | True | 0 | 1 | 0 | 0 |
X_home_higher = results[["HomeTeamRanksHigher", "HomeTeamWonLast"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_home_higher, y_true, scoring='accuracy')
print("Using the home team's ranking and the last head-to-head result")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using the home team's ranking and the last head-to-head result
Accuracy: 60.5%
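The head-to-head feature relies on one small trick: sorting the two team names gives the same dictionary key no matter which team is at home. A minimal sketch on a made-up schedule (hypothetical teams and results; a plain `dict` with `.get` is used here instead of a `defaultdict`):

```python
# Hypothetical toy schedule: (home team, visitor team, home team won?)
games = [
    ("A", "B", True),
    ("B", "A", True),
    ("B", "A", False),
]

last_match_winner = {}  # maps a sorted team pair to the winner of their last meeting
home_won_last = []
for home, visitor, home_win in games:
    pair = tuple(sorted([home, visitor]))  # same key regardless of who hosts
    # Feature first: did today's home team win the previous meeting?
    home_won_last.append(int(last_match_winner.get(pair) == home))
    # Then remember the winner of this meeting
    last_match_winner[pair] = home if home_win else visitor

print(home_won_last)
```

Without the sort, ("A", "B") and ("B", "A") would be tracked as two unrelated pairings.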
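Finally, let's see whether a decision tree can still produce a useful classification model when it is given a much larger amount of data. We will add the teams themselves as features to check whether the tree can make use of the extra information.
Although decision trees can in principle handle categorical features, the implementation in scikit-learn requires such features to be preprocessed first. The LabelEncoder transformer converts the string team names into integers. The code is as follows: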
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoding = LabelEncoder()
encoding.fit(results["Home Team"].values)
home_teams = encoding.transform(results["Home Team"].values)
visitor_teams = encoding.transform(results["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
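The decision tree can be trained on these feature values, but DecisionTreeClassifier still treats them as continuous. For example, with 17 teams numbered 0 to 16, the algorithm would consider teams 1 and 2 similar and teams 4 and 10 different. That makes no sense: two teams are either the same team or they are not, with nothing in between.
To remove this mismatch, we can use the OneHotEncoder transformer to turn the integers into binary features, one per possible value. For example, if LabelEncoder assigns the Chicago Bulls the value 7, then the binary feature at index 7 produced by OneHotEncoder is 1 for the Bulls and 0 for every other team. This is done for every possible value, so the dataset becomes much wider. The code is as follows: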
onehot = OneHotEncoder()
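# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
X_teams = onehot.fit_transform(X_teams).toarray()
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Accuracy: 60.1%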
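The two encoding steps used above can be reproduced by hand to see what they produce. This is a pure-Python sketch on three team names, not the scikit-learn implementation; the helper `one_hot` is a hypothetical name.

```python
teams = ["Chicago Bulls", "Miami Heat", "Atlanta Hawks", "Miami Heat"]

# Label encoding: sorted unique names get consecutive integers
classes = sorted(set(teams))
label_of = {name: i for i, name in enumerate(classes)}
labels = [label_of[name] for name in teams]

def one_hot(label, n_classes):
    """Return a list with a single 1 at position `label`."""
    return [1 if i == label else 0 for i in range(n_classes)]

# One-hot encoding: one binary column per possible label
encoded = [one_hot(label, len(classes)) for label in labels]

print(labels)
for row in encoded:
    print(row)
```

After one-hot encoding, any two different teams differ in exactly two columns, so the spurious ordering introduced by the integer labels is gone.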
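The accuracy is about 60%, which is better than the baseline but not as good as the earlier result. The likely cause is that the decision tree does not cope well with the larger number of features. With that in mind, we try a different algorithm to see whether it helps. Data mining is often an iterative process of trying new algorithms and new features.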
3.4 Random forests
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Using full team labels")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using full team labels
Accuracy: 61.5%
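A random forest trains many decision trees, each on a bootstrap sample of the training data, and combines their predictions by voting. The sketch below shows only those two ingredients, with made-up votes and hypothetical helper names; it is not how `RandomForestClassifier` is implemented (which, among other things, averages class probabilities and also samples features at each split).

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine the votes of the individual trees into one prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(14)
rows = list(range(10))  # stand-in for 10 training samples
sample = bootstrap_sample(rows, rng)
# Same size as the original data, but usually with repeated rows
print(len(sample), len(set(sample)))

# Hypothetical votes from five trees for one match (True = home win)
votes = [True, False, True, True, False]
print(majority_vote(votes))
```

Because each tree sees a slightly different dataset, their individual errors tend to differ, and voting averages some of them out.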
X_all = np.hstack([X_home_higher,X_teams])
print(X_all.shape)
(1229, 62)
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Using the ranking, the last head-to-head result and the team labels")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
Using the ranking, the last head-to-head result and the team labels
Accuracy: 62.9%
We can also use the GridSearchCV class to try other parameter values.
parameter_space = {
    # 'auto' means sqrt(n_features) for classifiers; it was removed in scikit-learn 1.3, where 'sqrt' is the equivalent
    "max_features": [2, 10, 'auto'],
    "n_estimators": [100,],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)
Accuracy: 65.4%
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='entropy', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=6, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=14, verbose=0,
warm_start=False)
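A grid search evaluates one model per combination of parameter values, so the cost grows multiplicatively with each added parameter. The sketch below enumerates the combinations of the grid above with `itertools.product`; it only illustrates the enumeration and does not train anything.

```python
from itertools import product

# Same grid as above, copied here so the example is self-contained
parameter_space = {
    "max_features": [2, 10, "auto"],
    "n_estimators": [100],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

names = sorted(parameter_space)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameter_space[name] for name in names))
]

print(len(candidates))  # 3 * 1 * 2 * 3 combinations
print(candidates[0])
```

Each of these candidates is cross-validated, so the number of model fits is the number of candidates times the number of folds.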