Data Science, Chapter 5: The Modeling Process and Decision Tree Models

Latest recommended article published on 2023-04-25 21:30:22

weixin_34380781

Tags: artificial intelligence, data structures and algorithms, big data

Original link: https://segmentfault.com/a/1190000017135112

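I haven't written a post in a week. So what have I been up to? The first half of this week I was stuck on plotting, and another third of my time went to data processing. Data processing is currently my biggest obstacle in both plotting and machine learning: doing it in Python keeps throwing errors and never quite fits. Just wait until I've learned Kettle and Spark, then try to block me again. Hmph!
Two days ago I started going to the provincial library to read. Apart from the 10 yuan subway fare and the hour-plus of travel time, it has no drawbacks. The heating alone makes it worthwhile, and two days in a row I ran into a man in his seventies who takes notes in a notebook as he reads. Inspiring.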

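Here is what I implemented over these two days. I first read Sections 1-4 of Chapter 5.
Section 1: mainly covers the types of machine learning: unsupervised, supervised, and semi-supervised. I couldn't tell them apart before, but after reading more material it clicked. Supervised: the existing data is already labeled, and new data is assigned to one of those classes based on its features. Unsupervised: the original data has no labels, and the algorithm works out how many groups there are, which can be used for fine-grained operations. Semi-supervised: I haven't run into it yet.
Section 2: the scikit-learn API
It describes the machine learning workflow:
a. Choose a model
b. Choose hyperparameters
c. Define x and y. Some models don't need y
d. Train the model, i.e. fit(x, y)
e. Predict values with predict
If in step c you also split the dataset, i.e. split x and y into a training set and a test set, you can check how accurate the trained model is after predicting. That is the main content of Section 3.
Section 3: splitting into training and test sets, checking model accuracy, and choosing the best model
Section 4: features of all kinds (text, categorical, image), missing-value imputation, and feature pipelines. Feature pipelines seem quite useful to me, since they can cut down on the manual missing-value handling.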
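
To make steps a-e concrete, here is a minimal sketch of the whole workflow, including the train/test split from Section 3 and a small feature pipeline from Section 4. It is only an illustration under assumptions: the iris dataset is a stand-in (it is not the data used later in this post), the missing values are injected artificially, and the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# c. Define x and y (iris is only a stand-in dataset for this sketch)
x, y = load_iris(return_X_y=True)

# Knock out a few values so the imputation step has something to do
rng = np.random.default_rng(0)
x = x.copy()
missing_rows = rng.choice(len(x), size=10, replace=False)
x[missing_rows, 0] = np.nan

# Split before fitting so the test set stays unseen during training
x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=0, train_size=0.7
)

# a + b. Choose a model and its hyperparameters; the pipeline fills
# missing values with the column mean before the tree sees the data
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeClassifier(max_depth=3, random_state=0),
)

# d. Train, then e. predict on the held-out test set
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

workflow_accuracy = accuracy_score(y_test, y_pred)
print(workflow_accuracy)
```

Because the imputer sits inside the pipeline, its column means are learned from the training set only and then reused on the test set, so the missing-value handling doesn't have to be repeated by hand.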

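With the overall process understood, I started on decision trees. I had followed a video course on them before and typed out the code once, but this time I'm mainly following the book.

1. Background

Data: a dataset I found online on whether visitors to an e-commerce site make a purchase.
Model: decision tree
Goal: see which features influence the purchase decision the most, and predict whether a purchase happens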

2. Implementation

2.1 Importing the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
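# (the libraries above are needed from the start; more are imported later)

inputfile =  'C:/Users/xiaom/Desktop/data/online_shoppers_intention.csv'
df = pd.read_csv(inputfile)
# print(df.head())   # print the first 5 rows
# print(df.info())   # print a summary of the data

'''Summary of the raw data: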
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12330 non-null int64
Administrative_Duration    12330 non-null float64
Informational              12330 non-null int64
Informational_Duration     12330 non-null float64
ProductRelated             12330 non-null int64
ProductRelated_Duration    12330 non-null float64
BounceRates                12330 non-null float64
ExitRates                  12330 non-null float64
PageValues                 12330 non-null float64
SpecialDay                 12330 non-null float64
Month                      12330 non-null object
OperatingSystems           12330 non-null int64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12330 non-null object
Weekend                    12330 non-null bool
Revenue                    12330 non-null bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB             '''

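These are the last few columns of the dataset.

2.2 Data processing

Dropping unneeded columns and converting data

# 1. Data processing: drop the first 6 columns (there are too many features, so some are removed) and also drop Month
df.drop(['Administrative', 'Administrative_Duration', 'Informational',
 'Informational_Duration' ,'ProductRelated', 'ProductRelated_Duration','Month'],axis=1, inplace=True)
df.head()

# 2. Convert the values of VisitorType. Weekend and Revenue are bool; I haven't found a way to convert them yet
# print(df.groupby('VisitorType').count()['Weekend'])  # first check which values this field takes

# This field separates new and returning customers. After the groupby it falls into essentially 2 groups, so it is encoded as 0 and 1
# (every value other than 'Returning_Visitor' becomes 0)
df['VisitorType'] = np.where(df['VisitorType'] == 'Returning_Visitor',1,0)  
print(df.head())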
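
On the bool columns: one common way to turn them into 0/1 is pandas' `astype(int)`. The sketch below uses a tiny made-up frame (the `demo` name and its three rows are hypothetical) that only borrows the column names `Weekend` and `Revenue` from the dataset. scikit-learn generally accepts bool columns as they are, so this step is optional.

```python
import pandas as pd

# Tiny hypothetical frame reusing two column names from the dataset
demo = pd.DataFrame({
    "Weekend": [False, True, False],
    "Revenue": [False, False, True],
})

# astype(int) maps False -> 0 and True -> 1
demo = demo.astype({"Weekend": int, "Revenue": int})

print(demo["Weekend"].tolist())  # [0, 1, 0]
print(demo["Revenue"].tolist())  # [0, 0, 1]
```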

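2.3 Choosing the model, setting hyperparameters, and defining x and y

This corresponds to the first three steps.
Choose the split criterion for the decision tree: the default is gini, and entropy can be specified instead. I set the maximum depth to 5 and required a node to hold at least 100 samples before it can be split (min_samples_split=100). The arguments in the parentheses are optional; anything left out takes its default value.

# Imports: the old cross_validation module has been replaced by model_selection
from sklearn import  tree
from sklearn.model_selection import train_test_split
clf = tree.DecisionTreeClassifier(criterion='entropy',max_depth=5, min_samples_split= 100)  

# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
x = df.iloc[:,:10].to_numpy()   # the first 10 columns are the features
y = df.iloc[:,10].to_numpy()    # the 11th column (Revenue) is the target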
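
To see what the two `criterion` options actually measure, here is a small hand-rolled sketch of Gini impurity and entropy for a list of class labels. The helper names (`class_probabilities`, `gini`, `entropy`) are my own for illustration and are not scikit-learn functions; the library computes these internally when it evaluates candidate splits.

```python
import numpy as np

def class_probabilities(labels):
    """Share of each distinct class among the labels."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    p = class_probabilities(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    p = class_probabilities(labels)
    return float(-np.sum(p * np.log2(p)))

mixed = [0, 0, 1, 1]  # evenly mixed node: most impure case for 2 classes
pure = [1, 1, 1, 1]   # pure node: both measures are zero

print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are 0 for a pure node and largest for an evenly mixed one. The tree picks the split that lowers the chosen measure the most.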

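2.4 Training and prediction

I used 2 approaches: one without a train/test split and one with it, as follows.

Method 1: train on all the data, then save the tree diagram as a PDF

clf.fit(x,y)

import graphviz  # used to draw the tree
## feature_names=list(df.columns[:10]) labels the nodes of the rendered PDF with the column names, so you can tell which field each score belongs to. It must cover the same columns as x; after rendering, double-check that nothing is misaligned.
## (list(...) is used because some scikit-learn versions reject a pandas Index here)
dot_data = tree.export_graphviz(clf,out_file=None, feature_names= list(df.columns[:10]))
graph = graphviz.Source(dot_data)
graph.render("tree5")

Open tree5.pdf to see the result. Judging from it, the predictions look a bit odd; if this model were handed to me it would definitely not pass. For now the priority is getting the hang of the process, and quality can come later.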
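
Since the goal was to see which features matter most, one option besides reading the PDF is the fitted tree's `feature_importances_` attribute. The sketch below is not run on the shopping data: it uses synthetic data and made-up feature names (`signal`, `noise_a`, `noise_b`) in which only the first column determines the label, so the expected ranking is known in advance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: only the first column determines the label
rng = np.random.default_rng(0)
n_samples = 500
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names
x_demo = rng.random((n_samples, 3))
y_demo = (x_demo[:, 0] > 0.5).astype(int)

demo_clf = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
demo_clf.fit(x_demo, y_demo)

# Pair each feature name with its importance and sort, largest first
ranked = sorted(
    zip(feature_names, demo_clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

On the real data the same pattern would be `zip(df.columns[:10], clf.feature_importances_)`, assuming `clf` has been fitted on `x` as above.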

Method 2: split into training and test sets

x1,x2,y1,y2 = train_test_split(x,y,random_state=0,train_size=0.7)
clf.fit(x1,y1)
ypred = clf.predict(x2)

# Inspect the classifier's results:
from sklearn import metrics
# classification_report expects (y_true, y_pred), so the true labels go first
print(metrics.classification_report(y2,ypred))  # item 1 in the screenshot


# Plot the results above as a confusion matrix: item 2 in the screenshot
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y2,ypred)
sns.heatmap(mat.T,square = True, annot = True, fmt= 'd',cbar = False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

2.5 Checking the model's accuracy

y2model = clf.predict(x2)
# Compute the model's accuracy
from sklearn.metrics import accuracy_score
score = accuracy_score(y2, y2model)
print('Model accuracy:')
print(score)

Output:
Model accuracy:
0.8796972154636388
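
The accuracy figure and the confusion matrix are two views of the same counts: accuracy is the diagonal of the matrix divided by the total. The sketch below shows this on six hand-made labels (hypothetical values, not taken from the dataset), and also computes recall for the positive class, which can be much lower than accuracy when one class is rare.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made labels for illustration: 4 negatives, 2 positives
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall for class 1: of the true positives, how many were found
recall_positive = recall_score(y_true_demo, y_pred_demo)

print(cm)
print(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo))
print(recall_positive)
```

Here 4 of 6 predictions are correct, but only 1 of the 2 positives is found. If purchases are the minority class in the shopping data (I haven't checked this in the post), the per-class numbers in the classification report are worth reading alongside the overall accuracy.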