Python decision tree prediction model: [Learning Data Mining with Python] Chapter 3, Predicting Winning Teams with Decision Trees...


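Cleaning the dataset

The output above reveals a few problems:

(1) The Date column holds strings, not date objects

(2) The header row is incomplete, or some columns carry the wrong names

Both can be fixed through the pd.read_csv function.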

NOTES

# Don't read the first row, as it is blank, and parse the date column as a date

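# usecols: select which columns of the table to load

# parse_dates: indexes of the columns to parse as dates

# dayfirst: a boolean, not a column index; True reads ambiguous dates as day-first (DD/MM)

# results.columns: assign new column names

results = pd.read_csv(data_filename, usecols=[0,1,2,3,4,5,6,7,8], parse_dates=[0], dayfirst=True, skiprows=[0,])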

# Fix the name of the columns

results.columns = ["Date","Start","Visitor Team","VisitorPts","Home Team","HomePts","OT","Notes",'Score Type']

# .ix was removed in pandas 1.0; .loc slices by label and includes the end label

results.loc[:5]
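To see what each read_csv argument does without the real file, here is a minimal sketch on an in-memory CSV. The rows, the reduced set of five columns, and the ISO date format are made up for illustration; the real file's layout may differ.

```python
import io

import pandas as pd

# Hypothetical miniature of the results file: a blank first row,
# then a header row with a repeated column name, then two games.
raw = (
    ",,,,\n"
    "Date,Visitor,PTS,Home,PTS\n"
    "2013-10-29,Orlando Magic,87,Indiana Pacers,97\n"
    "2013-10-29,Chicago Bulls,95,Miami Heat,107\n"
)

games = pd.read_csv(
    io.StringIO(raw),
    usecols=[0, 1, 2, 3, 4],  # keep only these column positions
    parse_dates=[0],          # parse column 0 as dates
    skiprows=[0],             # skip the blank first row
)

# Replace the ambiguous header names with descriptive ones
games.columns = ["Date", "Visitor Team", "VisitorPts", "Home Team", "HomePts"]
print(games.dtypes)
```

After loading, the Date column has a datetime dtype instead of holding plain strings, which is the first problem listed above.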

NOTES

This note explains the differences between four members of a NumPy array: the attributes ndim, shape, and dtype, and the method astype.

##### 1.ndim

![image](https://upload-images.jianshu.io/upload_images/24215864-a0b2219229cd94b5?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

ndim returns the number of dimensions of the array as a single integer.

##### 2.shape

![image](https://upload-images.jianshu.io/upload_images/24215864-c8a6e4f365bd046b?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

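shape: a tuple giving the size of each dimension.

For a 1-D array: you may wonder why the shape is not (1, 6). Because arr1.ndim is 1, the tuple holds only one number, so the shape is (6,).

For a 2-D array: the first number is the row count and the second is the column count. Its ndim is 2, so the tuple holds two numbers.

For a 3-D array: the shape is harder to read off, so arr3 is printed below to show its structure.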

![image](https://upload-images.jianshu.io/upload_images/24215864-224f3d0c45afa6e9?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

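Start with the outermost brackets, which contain [[1,2,3],[4,5,6]] and [[7,8,9],[10,11,12]]. Call them A and B, so the array is [A, B]. If A and B were plain numbers, this would be a 1-D array of length 2, which gives the first number of the shape. But A and B are each arrays of shape (2, 3). Combining the two gives the shape of arr3: (2, 2, 3).

The same reasoning extends to the shape of 4-D and 5-D arrays.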
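Since the screenshots are not reproduced here, the following sketch rebuilds the three arrays. arr3 is taken from the text; the contents of arr1 and arr2 are assumptions chosen to match the shapes discussed.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5, 6])        # assumed contents, 6 elements
arr2 = np.array([[1, 2, 3], [4, 5, 6]])    # assumed contents, 2 rows x 3 columns
arr3 = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[7, 8, 9], [10, 11, 12]]])

for name, arr in [("arr1", arr1), ("arr2", arr2), ("arr3", arr3)]:
    # ndim is always equal to the length of the shape tuple
    print(name, arr.ndim, arr.shape)
# arr1 1 (6,)
# arr2 2 (2, 3)
# arr3 3 (2, 2, 3)
```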

##### 3.dtype

![image](https://upload-images.jianshu.io/upload_images/24215864-97b1d88e27731659?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

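dtype: an object describing the data type of the array's elements. Because all the data in the figure are integers, every array reports int32 (the default integer type depends on the platform and NumPy version; on most Linux and macOS builds it is int64). If any value in the array has a decimal point, the result is float64.

A natural question: shouldn't integer data be int, and floating-point data be float?

Answer: int32 and float64 are NumPy's own data types, and their names include the size in bits.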

##### 4.astype

![image](https://upload-images.jianshu.io/upload_images/24215864-2e8e3017a26445c3?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

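astype: converts the array to another data type.

int32 --> float64 works without any problem

float64 --> int32 truncates the fractional part

string_ --> float64 if every string in the array represents a number, astype can convert the array to a numeric type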

![image](https://upload-images.jianshu.io/upload_images/24215864-63cd20e84ff740fd?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Note the float used there: it is Python's built-in type, yet NumPy accepts it. NumPy maps Python types to the equivalent dtype.
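The three conversions can be tried with a short sketch. The array values are made up for illustration; the negative number is included to show that truncation goes toward zero rather than rounding down.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.5, -2.7, 3.0])
strings = np.array(["1.25", "-9.6", "42"])

as_float = ints.astype(np.float64)    # integer -> float64, no information lost
truncated = floats.astype(np.int32)   # fractional part dropped: 1, -2, 3
parsed = strings.astype(float)        # Python's float is mapped to float64

print(as_float.dtype, truncated, parsed)
```

astype returns a new array each time; ints, floats, and strings keep their original dtypes.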

NOTES

df.dtypes # data type of every column

df.team.dtype # data type of one column

s.dtype # data type of the Series s

df.dtypes.value_counts() # how many columns have each data type


NOTES - checking data types

pd.api.types.is_bool_dtype(s)

pd.api.types.is_categorical_dtype(s)

pd.api.types.is_datetime64_any_dtype(s)

pd.api.types.is_datetime64_ns_dtype(s)

pd.api.types.is_datetime64_dtype(s)

pd.api.types.is_float_dtype(s)

pd.api.types.is_int64_dtype(s)

pd.api.types.is_numeric_dtype(s)

pd.api.types.is_object_dtype(s)

pd.api.types.is_string_dtype(s)

pd.api.types.is_timedelta64_dtype(s)

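A short sketch of a few of these checks on a small DataFrame. The column names and values are invented for illustration.

```python
import pandas as pd
from pandas.api import types

df = pd.DataFrame({
    "team": ["Indiana Pacers", "Miami Heat"],
    "score": [97, 107],
    "ratio": [0.5, 1.5],
    "played": pd.to_datetime(["2013-10-29", "2013-10-30"]),
})

print(df.dtypes)                  # data type of every column
print(df.dtypes.value_counts())   # how many columns share each type

s = df["score"]
print(types.is_numeric_dtype(s))                       # True for integer columns
print(types.is_float_dtype(df["ratio"]))               # True
print(types.is_datetime64_any_dtype(df["played"]))     # True
print(types.is_bool_dtype(s))                          # False
```

Each function accepts either a Series or a dtype, so types.is_numeric_dtype(s.dtype) gives the same answer.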

NOTES

1-type():

Returns the type of the data structure itself (list, dict, numpy.ndarray)

>>> k = [1, 2]

>>> type(k)

<class 'list'>

>>> import numpy as np

>>> p = np.array(k)

>>> type(p)

<class 'numpy.ndarray'>

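2-dtype:

Returns the type of the data elements (int, float). It is an attribute, not a function.

>>> k = [1, 2]

>>> k.dtype

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

AttributeError: 'list' object has no attribute 'dtype'

# A list or dict can hold elements of different types, so it has no dtype attribute

>>> import numpy as np

>>> p = np.array(k)

>>> p.dtype

dtype('int32')

# Every element of an np.array shares one data type, so the array has a dtype attribute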

3-astype():

Returns a copy of the np.array with every element converted to the given data type

>>> import numpy as np

>>> p = np.array(k)

>>> p

array([1, 2])

>>> p.astype(float)

array([1., 2.])

NOTES

1-loc

2-iloc

3-ix
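A minimal sketch of the difference between the first two indexers, on made-up data with a non-default index so that labels and positions differ. The third, ix, mixed both behaviours and was removed in pandas 1.0, so it is not shown.

```python
import pandas as pd

df = pd.DataFrame({"pts": [87, 95, 103, 116]}, index=[10, 20, 30, 40])

by_label = df.loc[10:30]      # label-based, the end label 30 is included
by_position = df.iloc[0:3]    # position-based, the end position 3 is excluded

print(by_label)
print(df.loc[20, "pts"], df.iloc[1, 0])   # the same cell reached both ways
```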

This part of the code raised an error because r was not defined; initializing r = 0 before the loop fixes it.

results["HomeWin"] = results["VisitorPts"] < results["HomePts"]

# Our "class values"

y_true = results["HomeWin"].values

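r = 0

for i in range(len(results)):

    # Count the games won by the home team

    if results["HomeWin"][i]:

        r += 1

print(r)

print("Home Win percentage: {0:.1f}%".format(100 * r / results["HomeWin"].count()))

The whole block above reduces to a single expression, which returns the home-win rate as a fraction between 0 and 1 (multiply by 100 for the percentage):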

results["HomeWin"].mean()
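The shortcut works because True counts as 1 and False as 0 when a boolean column is summed or averaged. A small sketch with four made-up games:

```python
import pandas as pd

results = pd.DataFrame({
    "VisitorPts": [87, 95, 116, 90],
    "HomePts": [97, 107, 103, 101],
})
results["HomeWin"] = results["VisitorPts"] < results["HomePts"]

wins = int(results["HomeWin"].sum())   # number of home wins
rate = results["HomeWin"].mean()       # fraction of home wins

print(wins)                                             # 3
print("Home Win percentage: {0:.1f}%".format(100 * rate))  # Home Win percentage: 75.0%
```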

NOTES

1-iterrows()

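iterrows() yields one tuple per row, of the form (index, row).

The for loop in the code below names two variables, index and row, so each tuple is unpacked into them: the row label goes to index and the row's data (a Series) goes to row.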

from collections import defaultdict

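won_last = defaultdict(bool)  # a team with no earlier game defaults to False

results["HomeLastWin"] = False

results["VisitorLastWin"] = False

for index, row in results.iterrows(): # Note that this is not efficient

    home_team = row["Home Team"]

    visitor_team = row["Visitor Team"]

    # Write into this row only; results["HomeLastWin"] = ... would overwrite the whole column

    results.loc[index, "HomeLastWin"] = won_last[home_team]

    results.loc[index, "VisitorLastWin"] = won_last[visitor_team]

    # Set current win

    won_last[home_team] = bool(row["HomeWin"])

    won_last[visitor_team] = not row["HomeWin"]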

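Why can't row in the second and third lines of the loop be replaced with results? Doing so raises the error "'Series' objects are mutable, thus they cannot be hashed".

The reason is that results["Home Team"] is a whole column, that is, a Series. A Series is mutable and therefore unhashable, so it cannot serve as a key when reading from or assigning to won_last.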
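The error can be reproduced in isolation with a sketch like the one below. The two team names are placeholders, and the exact wording of the message varies between pandas versions, but the exception type is TypeError in both cases.

```python
from collections import defaultdict

import pandas as pd

results = pd.DataFrame({"Home Team": ["Indiana Pacers", "Miami Heat"]})
won_last = defaultdict(int)

try:
    won_last[results["Home Team"]]   # the key is an entire column (a Series)
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

for index, row in results.iterrows():
    won_last[row["Home Team"]]       # the key is a single string, which is hashable

print(sorted(won_last))
```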

NOTES

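The order of the statements inside the loop matters. The previous results must be read from won_last before it is updated with the current game; if the update comes first, HomeLastWin and VisitorLastWin just repeat the current game's outcome instead of recording whether each team won its last game.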

# Now compute the actual values for these

# Did the home and visitor teams win their last game?

from collections import defaultdict

won_last = defaultdict(bool)  # a team with no earlier game defaults to False

results["HomeLastWin"] = False

results["VisitorLastWin"] = False

for index, row in results.iterrows(): # Note that this is not efficient

    home_team = row["Home Team"]

    visitor_team = row["Visitor Team"]

    # Read the previous results first

    results.loc[index, "HomeLastWin"] = won_last[home_team]

    results.loc[index, "VisitorLastWin"] = won_last[visitor_team]

    # Set current win

    won_last[home_team] = bool(row["HomeWin"])

    won_last[visitor_team] = not row["HomeWin"]

results
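A quick way to check a last-game feature is to run it on a handful of games small enough to trace by hand. The sketch below uses four invented games between hypothetical teams A, B, and C, reads won_last before updating it, and treats a team with no earlier game as not having won.

```python
from collections import defaultdict

import pandas as pd

games = pd.DataFrame({
    "Visitor Team": ["A", "C", "A", "B"],
    "Home Team":    ["B", "A", "C", "A"],
    "HomeWin":      [True, False, True, False],
})
games["HomeLastWin"] = False
games["VisitorLastWin"] = False

won_last = defaultdict(bool)
for index, row in games.iterrows():
    home, visitor = row["Home Team"], row["Visitor Team"]
    # Read the outcome of each team's previous game...
    games.loc[index, "HomeLastWin"] = won_last[home]
    games.loc[index, "VisitorLastWin"] = won_last[visitor]
    # ...and only then record the outcome of the current one
    won_last[home] = bool(row["HomeWin"])
    won_last[visitor] = not row["HomeWin"]

print(games)
```

In the third game, home team C won its previous game (as visitor in the second game), so HomeLastWin is True there. In the fourth game, visitor B won its previous game (as home team in the first game), so VisitorLastWin is True.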

NOTES

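Predicting with decision trees

scikit-learn implements the Classification and Regression Trees (CART) algorithm as its default decision tree algorithm. CART handles both categorical and continuous features, although scikit-learn's implementation expects numeric input, so categorical features have to be encoded as numbers first.

Decision tree parameters

One of the most important parameters of a decision tree is the stopping criterion. Toward the end of tree construction, the last few decisions rest on only a handful of samples and are largely random; a tree that keeps those final nodes overfits the training data. A stopping criterion prevents the tree from fitting the training data too closely.

Instead of a stopping criterion, the tree can also be grown in full from the available samples and then pruned to obtain a model that generalizes. Pruning removes nodes that contribute very little information to the tree as a whole.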

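The decision tree in scikit-learn offers the following two options as stopping criteria:

(1) min_samples_split: the minimum number of samples a node must hold before it can be split into new nodes.

(2) min_samples_leaf: the minimum number of samples each resulting node must contain for it to be kept.

The first parameter controls whether a decision node is created; the second decides whether it is kept.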
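The effect of the two stopping parameters can be seen by comparing an unrestricted tree with a restricted one. This is only a sketch on synthetic, deliberately noisy data; the sample size, noise level, and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(14)
X = rng.rand(200, 2)
# Noisy labels, so that an unrestricted tree ends up memorizing noise
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

full = DecisionTreeClassifier(random_state=14).fit(X, y)
limited = DecisionTreeClassifier(
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    random_state=14,
).fit(X, y)

# Leaves are the nodes without children (children_left == -1)
is_leaf = limited.tree_.children_left == -1
smallest_leaf = int(limited.tree_.n_node_samples[is_leaf].min())

print(full.get_n_leaves(), limited.get_n_leaves(), smallest_leaf)
```

The restricted tree has far fewer leaves, and none of them holds fewer than 10 training samples.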

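Another parameter of a decision tree is the criterion used to create a decision. The two main choices are Gini impurity and information gain.

(1) Gini impurity: measures how often a decision node would predict the wrong class for a new sample.

(2) Information gain: uses entropy from information theory to express how much new information a decision node provides.

These criteria do roughly the same job: they decide which rule and which value to use when splitting a node into subnodes. The computed value is the measure by which a split is chosen, so the choice of criterion can have a significant effect on the final model.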
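Both measures are simple to compute by hand. The sketch below uses the standard textbook definitions in plain Python (it is not scikit-learn's internal code) and a made-up node with two home wins and two losses. In scikit-learn the choice between the two is made with the criterion parameter of DecisionTreeClassifier ("gini" or "entropy").

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Chance of mislabeling a sample drawn from the node if it is
    labeled at random according to the node's class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children


parent = [1, 1, 0, 0]
print(gini_impurity(parent))                        # 0.5
print(entropy(parent))                              # 1.0
print(information_gain(parent, [1, 1], [0, 0]))     # 1.0, a perfect split
print(information_gain(parent, [1, 0], [1, 0]))     # 0.0, an uninformative split
```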

scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')

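cross_val_score takes a scoring parameter. Its own documentation entry does not spell out how to select different evaluation metrics; the page linked below explains it well.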

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
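As a sketch of how the scoring parameter is used, the loop below evaluates one classifier under three of the predefined metric names listed on that page. The data is synthetic and the number of folds is set explicitly; both are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(14)
X = rng.rand(300, 2)
y = (X[:, 0] > 0.5).astype(int)   # synthetic binary target

clf = DecisionTreeClassifier(random_state=14)

mean_scores = {}
for metric in ("accuracy", "f1", "roc_auc"):
    # scoring accepts the name of a predefined metric as a string
    scores = cross_val_score(clf, X, y, scoring=metric, cv=5)
    mean_scores[metric] = scores.mean()
    print("{0}: {1:.3f}".format(metric, mean_scores[metric]))
```

Each call returns one score per fold, so scores has five entries here.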