In data mining, the decision tree is the most common baseline model, and many strong models are improvements built on top of it. The theory behind decision trees is explained in plenty of places online, so it is not repeated here.

Download the Internet Advertisements dataset from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements. Each record describes an image found on a web page, and the task is to determine whether the image is an advertisement. The first three features are the image's height, width, and aspect ratio; the last column is the class label, where 1 means the image is an ad and 0 means it is not. The earlier post "python核密度估计(KernelDensity)" (on kernel density estimation) already analyzed the distribution of this dataset, so the preprocessing code is simply repeated below.
# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd

filepath = os.path.join("../dataset/Internet Advertisements/Data Folder", "ad.data")

# Column converter: parse a cell as a float; unparseable entries
# (the dataset marks missing values with "?") become NaN.
def Converter_number(x):
    try:
        return np.float64(x)
    except ValueError:
        return np.nan

# Dict comprehension: apply the numeric converter to the 1558 feature
# columns, and map the label column (index 1558) to 1 for "ad." and 0 otherwise.
converters = {key: Converter_number for key in range(1558)}
converters[1558] = lambda x: 1 if x.strip() == 'ad.' else 0

ads = pd.read_csv(filepath, header=None, converters=converters)
print(ads[:5])
#        0      1       2    3    4    5    6    7    8    9  ...   1549  \
# 0  125.0  125.0  1.0000  1.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0
# 1   57.0  468.0  8.2105  1.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0
# 2   33.0  230.0  6.9696  1.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0
# 3   60.0  468.0  7.8000  1.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0
# 4   60.0  468.0  7.8000  1.0  0.0  0.0  0.0  0.0  0.0  0.0  ...    0.0
#
#    1550  1551  1552  1553  1554  1555  1556  1557  1558
# 0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1
# 1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1
# 2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1
# 3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1
# 4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1
#
# [5 rows x 1559 columns]

# Drop the rows that contain missing values, then look at the summary statistics.
ads = ads.dropna(axis=0)
print(ads.describe())
# The quantiles show that most features are extremely sparse binary indicators:
# for the great majority of columns even the 75th percentile is 0.
#                  0            1            2            3            4  \
# count  2359.000000  2359.000000  2359.000000  2359.000000  2359.000000
# mean     63.912251   155.631624     3.912982     0.759644     0.002120
# std      54.881130   130.237867     6.047220     0.427390     0.045999
# min       1.000000     1.000000     0.001500     0.000000     0.000000
# 25%      25.000000    80.500000     1.033450     1.000000     0.000000
# 50%      51.000000   110.000000     2.111100     1.000000     0.000000
# 75%      84.000000   184.000000     5.333300     1.000000     0.000000
# max     640.000000   640.000000    60.000000     1.000000     1.000000
#
#             5            6            7            8            9  \
# count  2359.0  2359.000000  2359.000000  2359.000000  2359.000000
# mean      0.0     0.006359     0.004663     0.004663     0.014837
# std       0.0     0.079504     0.068141     0.068141     0.120925
# min       0.0     0.000000     0.000000     0.000000     0.000000
# 25%       0.0     0.000000     0.000000     0.000000     0.000000
# 50%       0.0     0.000000     0.000000     0.000000     0.000000
# 75%       0.0     0.000000     0.000000     0.000000     0.000000
# max       0.0     1.000000     1.000000     1.000000     1.000000
#
#        ...         1549         1550         1551         1552  \
# count  ...  2359.000000  2359.000000  2359.000000  2359.000000
# mean   ...     0.003815     0.001272     0.002120     0.002543
# std    ...     0.061662     0.035646     0.045999     0.050379
# min    ...     0.000000     0.000000     0.000000     0.000000
# 25%    ...     0.000000     0.000000     0.000000     0.000000
# 50%    ...     0.000000     0.000000     0.000000     0.000000
# 75%    ...     0.000000     0.000000     0.000000     0.000000
# max    ...     1.000000     1.000000     1.000000     1.000000
#
#               1553         1554         1555        1556         1557  \
# count  2359.000000  2359.000000  2359.000000  2359.00000  2359.000000
# mean      0.008478     0.013989     0.014837     0.00975     0.000848
# std       0.091705     0.117470     0.120925     0.09828     0.029111
# min       0.000000     0.000000     0.000000     0.00000     0.000000
# 25%       0.000000     0.000000     0.000000     0.00000     0.000000
# 50%       0.000000     0.000000     0.000000     0.00000     0.000000
# 75%       0.000000     0.000000     0.000000     0.00000     0.000000
# max       1.000000     1.000000     1.000000     1.00000     1.000000
#
#               1558
# count  2359.000000
# mean      0.161509
# std       0.368078
# min       0.000000
# 25%       0.000000
# 50%       0.000000
# 75%       0.000000
# max       1.000000
#
# [8 rows x 1559 columns]

# Split the matrix into features and label.
df_all = ads.values
X = df_all[:, :-1]
y = df_all[:, -1]

from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(random_state=14)
# Cross-validated accuracy
scores = cross_val_score(clf, X, y, scoring='accuracy')
print("Accuracy:{0:.1f}%".format(np.mean(scores) * 100))
# Accuracy:93.3%

if __name__ == "__main__":
    print("OK")
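One caveat about the number above: cross_val_score uses a version-dependent default for the number of folds (3 in older scikit-learn releases, 5 since 0.22). The sketch below is my own illustration rather than code from the original post; it pins the splitting strategy with StratifiedKFold so the result is reproducible, and also reports the per-fold spread. It reuses X and y from above; the fold count and random_state are arbitrary choices.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = DecisionTreeClassifier(random_state=14)
# Fix the splits explicitly; stratification keeps the ~16% ad ratio
# (mean of column 1558 in describe()) the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=14)
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=cv)
print("Per-fold accuracy:", np.round(scores, 4))
print("Accuracy: {0:.1f}% (+/- {1:.1f}%)".format(scores.mean() * 100, scores.std() * 100))

Reporting the standard deviation alongside the mean makes it easier to tell later whether an "improvement" is real or just fold-to-fold noise.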
That is all it takes: with scikit-learn's off-the-shelf decision tree we already reach 93.3% accuracy. Later posts will try several ways to push this higher, including feature selection and switching to other models.
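As a small preview of the feature-selection idea, here is a minimal sketch (my own illustration under assumed choices, not necessarily the method a later post will use): keep the k features with the highest chi-squared scores, then train the same decision tree inside a pipeline so the selection is refit on each training fold. The value k=50 is arbitrary, chosen only for demonstration; chi2 is applicable here because every feature in this dataset is non-negative.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    # Univariate selection: keep the 50 features most associated with the label.
    ("select", SelectKBest(score_func=chi2, k=50)),
    ("tree", DecisionTreeClassifier(random_state=14)),
])
scores = cross_val_score(pipeline, X, y, scoring='accuracy')
print("Accuracy with feature selection: {0:.1f}%".format(np.mean(scores) * 100))

Putting the selector inside the pipeline matters: selecting features on the full dataset before cross-validating would leak information from the test folds into training.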