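This chapter introduces another classification algorithm: the decision tree. Compared with other algorithms, its main advantage is that the decision process can be read by both machines and people, so the model the machine learns can be used directly for prediction and also inspected by hand. Another advantage is that it can handle many different types of features.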
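To make the readability point concrete, here is a minimal sketch that trains a tree with scikit-learn and prints its rules as text. The toy data and the feature names `home_last_win` and `visitor_last_win` are made up for illustration; they are not taken from this chapter's dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [home_last_win, visitor_last_win]
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
# Hypothetical labels: 1 = home team wins, 0 = home team loses
y = [1, 1, 0, 0, 1, 0]

clf = DecisionTreeClassifier(random_state=14)
clf.fit(X, y)

# export_text renders the learned tree as nested, human-readable rules
rules = export_text(clf, feature_names=["home_last_win", "visitor_last_win"])
print(rules)

# The same fitted model is used directly for prediction
prediction = clf.predict([[1, 1]])
print(prediction)
```

The printed rules are a plain list of threshold tests, which is what makes a decision tree easy to check by hand.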
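The data used in this chapter can be found in the data source linked at the top of the article, under python数据挖掘/Chapter4.
The data consists of game results from the 2013-2014 NBA season. It is a CSV file, so let's read it into pandas and take a look.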
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: dataset = pd.read_csv('leagues_NBA_2014_games_games.csv')
In [4]: dataset.head()
Out[4]:
Date NaN Visitor/Neutral ... PTS.1 NaN.1 Notes
0 Tue Oct 29 2013 Box Score Orlando Magic ... 97 NaN NaN
1 Tue Oct 29 2013 Box Score Los Angeles Clippers ... 116 NaN NaN
2 Tue Oct 29 2013 Box Score Chicago Bulls ... 107 NaN NaN
3 Wed Oct 30 2013 Box Score Brooklyn Nets ... 98 NaN NaN
4 Wed Oct 30 2013 Box Score Atlanta Hawks ... 118 NaN NaN
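The data as loaded has a few problems:
- The first column, Date, is stored as strings rather than dates
- The column headers need cleaning up
So let's load it again.
Fortunately, pandas can convert many string date formats into standard date objects.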
In [5]: dataset = pd.read_csv('/Users/gn/scikit--learn/data/leagues_NBA_2014_games_games.csv', parse_dates=['Date'], skiprows=[0,])
In [6]: dataset.columns = ['Data', 'Scire Type', 'Visitor Team', 'VisitorPts', 'Home Team', 'HomePts', 'OT', 'Notes']
In [7]: dataset.head()
Out[7]:
Data Scire Type Visitor Team VisitorPts Home Team HomePts OT Notes
0 2013-10-29 Box Score Orlando Magic 87 Indiana Pacers 97 NaN NaN
1 2013-10-29 Box Score Los Angeles Clippers 103 Los Angeles Lakers 116 NaN NaN
2 2013-10-29 Box Score Chicago Bulls 95 Miami Heat 107 NaN NaN
3 2013-10-30 Box Score Brooklyn Nets 94 Cleveland Cavaliers 98 NaN NaN
4 2013-10-30 Box Score Atlanta Hawks 109 Dallas Mavericks 118 NaN NaN
That looks much better now.
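The date conversion can also be checked in isolation. The sketch below uses a small inline CSV with hypothetical column names and converts the column with `pd.to_datetime` and an explicit format string, which is assumed here to match the `Tue Oct 29 2013` style of the file; `parse_dates` in `read_csv` performs a comparable conversion at load time.

```python
import io
import pandas as pd

# Hypothetical two-row CSV using the same date style as the chapter's file
csv_text = (
    "Date,Visitor,VisitorPts,Home,HomePts\n"
    "Tue Oct 29 2013,Orlando Magic,87,Indiana Pacers,97\n"
    "Wed Oct 30 2013,Brooklyn Nets,94,Cleveland Cavaliers,98\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# Without any parsing, the Date column is not a datetime column
raw_is_datetime = pd.api.types.is_datetime64_any_dtype(raw["Date"])

# Explicit format: weekday name, month name, day of month, four-digit year
dates = pd.to_datetime(raw["Date"], format="%a %b %d %Y")
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(dates)

print(raw_is_datetime, parsed_is_datetime)
print(dates)
```

Once the column holds real dates, it can be sorted and compared chronologically, which matters for any feature that depends on the order of games.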
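The data does not include a win/loss column, so we need to convert the scores into an explicit win/loss value.
We also need to create some features for data mining. Here we use whether the home team won its previous game and whether the visiting team won its previous game.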
In [11]: dataset['HomeWin'] = dataset['VisitorPts'] < dataset['HomePts']
In [12]: from collections import defaultdict
In [13]: won_last = defaultdict(int)
In [14]: dataset["HomeLastWin"] = False
In [15]: dataset["VisitorLastWin"] = False
In [16]: for index, row in dataset.iterrows():
...:     home_team = row["Home Team"]
...:     visitor_team = row["Visitor Team"]
...:     # Look up whether each team won its previous game
...:     row["HomeLastWin"] = won_last[home_team]
...:     row["VisitorLastWin"] = won_last[visitor_team]
...:     dataset.loc[index] = row
...:     # Record this game's result for each team's next game
...:     won_last[home_team] = row["HomeWin"]
...:     won_last[visitor_team] = not row["HomeWin"]
...:
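The same feature construction can be tried on a tiny hand-checkable schedule. This is a sketch, not the chapter's code: the teams `A`, `B`, `C` and their scores are hypothetical, and it collects the values in lists and uses `defaultdict(bool)` instead of writing each row back into the DataFrame.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical mini-schedule, already in chronological order
games = pd.DataFrame({
    "Visitor Team": ["A", "C", "B", "A"],
    "VisitorPts":   [90, 101, 88, 95],
    "Home Team":    ["B", "A", "C", "C"],
    "HomePts":      [100, 99, 80, 97],
})

# The home team wins when it scores more than the visitor
games["HomeWin"] = games["VisitorPts"] < games["HomePts"]

# Teams with no earlier game default to False (no previous win)
won_last = defaultdict(bool)
home_last, visitor_last = [], []
for home, visitor, home_win in zip(
    games["Home Team"], games["Visitor Team"], games["HomeWin"]
):
    # Read the previous results before updating them with this game
    home_last.append(won_last[home])
    visitor_last.append(won_last[visitor])
    won_last[home] = bool(home_win)
    won_last[visitor] = not home_win

games["HomeLastWin"] = home_last
games["VisitorLastWin"] = visitor_last
print(games)
```

Reading `won_last` before updating it is the important detail: each row must only see results from earlier games, never its own outcome.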
In [17]: dataset.loc[20:25]
Out[17]:
Data Scire Type Visitor Team VisitorPts Home Team HomePts OT Notes HomeWin HomeLastWin VisitorLastWin
20 2013-11-01 Box Score Milwaukee Bucks 105 Boston Celtics 98 NaN NaN False False False
21 2013-11-01 Box Score Miami